
Does JDEC support high resolution JPEGs? #7

Open
Arcitec opened this issue Oct 29, 2024 · 10 comments

Comments

@Arcitec
Contributor

Arcitec commented Oct 29, 2024

So far I have been experimenting with test.py.

I see that it sets size = 112*10 and converts the input image to 1120x1120 (with some mirroring to make the image square if it's not square).

But that might just be a test script decision?

Is it possible to process higher-resolution JPEGs with this network? Like a 6000x6000 .jpg file?

Or would it require some rewrite to use Tiled Processing, with 1120x1120 tiles?

Edit: Actually, I don't even know how tiled processing would work here, since this network doesn't operate on pixels. :D I hope you have some advice on how to use high-resolution JPEG input, if it's even possible.

@Arcitec
Contributor Author

Arcitec commented Oct 29, 2024

Well, I tried calling input_ = dm.read_coefficients(jpg_file) directly on a large JPG:

file: ./test_images/hq_highres_jpg/cowboy.jpg (height: 6480, width: 4320)

And this was the result:

  0%|                                                            | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "test_jpg.py", line 141, in <module>
    dqt_swin.unsqueeze(0).cuda())
  File "/home/johnny/.local/share/miniconda/envs/jdec/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "JDEC/models/JDEC.py", line 59, in forward
    pred = self.de_quantization(dctt,qmapp,cbcr)
  File "JDEC/models/JDEC.py", line 38, in de_quantization
    self.feat = self.encoder_dct(dct,cbcr)  
  File "/home/johnny/.local/share/miniconda/envs/jdec/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "JDEC/models/swinirv2.py", line 892, in forward
    x = self.pre_forward_features(y,cbcr)
  File "JDEC/models/swinirv2.py", line 872, in pre_forward_features
    x = layer(x,self.x_size)
  File "/home/johnny/.local/share/miniconda/envs/jdec/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "JDEC/models/swinirv2.py", line 456, in forward
    x = blk(x,x_size)
  File "/home/johnny/.local/share/miniconda/envs/jdec/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "JDEC/models/swinirv2.py", line 308, in forward
    x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
  File "JDEC/models/swinirv2.py", line 48, in window_partition
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
RuntimeError: shape '[1, 231, 7, 154, 7, 256]' is invalid for input of size 447897600

Does this mean the network cannot process JPEG files bigger than 1120x1120? Or do I need to make more code adjustments? Any ideas?

It's starting to seem like every input JPG must be exactly 1120x1120, which would severely restrict the usefulness of this network. I hope that's not true. If so, it's more interesting as a research concept than actually useful for removing artifacts in real JPG files.

@Arcitec
Contributor Author

Arcitec commented Oct 29, 2024

Well, I see that the paper says:

Implementation Detail

We use 112 × 112 patches as inputs to our network. This size is chosen because it is the least common multiple of the minimum unit size of color JPEG (16 × 16) and the window size of our Swin architecture [27] (7 × 7). The network is trained for 1000 epochs with batch size 16. We optimize our network by Adam [20]. The learning rate is initialized as 1e-4 and decayed by factor 0.5 at [200, 400, 600, 800].

So it seems like it can use different input sizes, but the JPEG file's dimensions must be multiples of 112?

I hope you can clarify whether this network can:

  1. Use arbitrary input dimensions which are not multiples of 112, such as 707 x 1231? In other words, real-world JPEGs.
  2. Use large JPG input dimensions like 6k x 6k?

If not, would it be possible to adapt the network so it automatically pads the input JPGs to multiples of 112, letting JDEC support all dimensions?

Or maybe automatically crop it, if that's the only solution?

  • If the width is 707: 707/112 = 6.3125, so round down to 6 x 112 = 672 (discard 707 - 672 = 35 pixels at the right edge).
  • If the height is 1231: 1231/112 ≈ 10.99, so round down to 10 x 112 = 1120 (discard 111 pixels at the bottom edge).

At least then it would process all the full 112x112 input slices, if that's what the network needs. It's better than not working at all for arbitrary input sizes.
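A minimal sketch of that cropping arithmetic (plain Python; the function name is my own, nothing JDEC-specific assumed):

```python
def crop_to_multiple(width, height, block=112):
    # Round each dimension down to the nearest multiple of `block`.
    return (width // block) * block, (height // block) * block

print(crop_to_multiple(707, 1231))  # (672, 1120): drops 35 and 111 pixels
```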

@WooKyoungHan
Owner

In conclusion, it's possible, but not in the way you might expect. As I mentioned, we need to make the input a multiple of the lcm (112) of the Swin window and the JDEC block. If you do not keep that size, you'll often see unexpected artifacts.

  1. For example, if the input is 707x1231, it becomes 784x1232. I remember producing and processing images through padding.

  2. The same goes for 6k x 6k. I cropped it after padding and processing.

Handling arbitrary (ARB) inputs will be quite interesting.

As far as I know, JPEG itself already requires padding. So even if the actual size is arbitrary, if you import the spectra through a module like dct-manip, they will already come padded.

The problem is how to make the process possible without the Swin window. In a way, it could be solved simply by replacing it with a module like a resblock, but performance can't be guaranteed without a transformer. Also, networks that take spectra as input are very sensitive (probably because each point carries a different energy). It's a question of what to do with the residual energy for arbitrary inputs.

@Arcitec
Contributor Author

Arcitec commented Oct 29, 2024

@WooKyoungHan Thank you for explaining the details! I see that inputs need to be multiples of 112 since that's the least common multiple of the JDEC block (16x16) and the Swin window (7x7): lcm(16, 7) = 16 x 7 = 112, because 16 and 7 are coprime.

The goal for me is to load real-world JPG files into JDEC to clean them up, which is why I am trying to load images with arbitrary size.

I would be happy with either of these two solutions:

  1. Cropping: Loading an image and doing dimension = int(dimension / 112) * 112 to only process the full 112x112 blocks (skipping data at the right/bottom edges if it falls outside the block grid). This is not a great solution, but if it's the only one, it's worth doing.
  2. Padding: Loading an image and automatically padding the input to a multiple of 112x112 (see the sketch below). You say that dct-manip can do this automatically? Does it pad the raw DCT data with "blank" blocks to always reach a multiple of 112x112? If this is doable, it's the best solution. I can then add a post-processing step to crop the output image back to the input dimensions.

For solution 2, you mention quality loss. I am guessing that there's some loss of quality in the padded blocks because the "padded" data is not real JPEG data, so JDEC cannot infer what the original pre-compression content was in those blocks; the padding distorts the DCT spectrum?

If that is the case, could JDEC be tweaked so it only gives "attention" to the original pixels in a padded JDEC block? So if a block was 15x80 and is padded to 112x112, JDEC would only attend to the DCT spectrum inside the 15x80 area of the padded 112x112 block. The algorithm could know this by giving JDEC the original JPEG width/height (so it knows what the size was before dct-manip padded it).

Edit: I guess my last idea is impossible, since the neural network only learns that "112x112 is my input and I need to generate a 112x112 output" and has no understanding of padding. :)
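To make solution 2 concrete, here is a rough sketch of what zero-padding the coefficient grid could look like. The tensor layout is an assumption on my part (a per-channel grid of 8x8 luma blocks); dct_manip's actual return format may differ, and the subsampled chroma grids would use 7 blocks per 112-pixel unit instead of 14:

```python
import torch
import torch.nn.functional as F

def pad_coeff_grid(coeffs, target_multiple=112, dct_block=8):
    # Zero-pad a DCT coefficient grid so the decoded image size becomes a
    # multiple of `target_multiple`. Assumes `coeffs` has shape
    # [num_blocks_h, num_blocks_w, 8, 8]; an all-zero block decodes to a
    # flat mid-gray 8x8 patch.
    blocks_per_unit = target_multiple // dct_block  # 112 / 8 = 14
    pad_h = (-coeffs.shape[0]) % blocks_per_unit
    pad_w = (-coeffs.shape[1]) % blocks_per_unit
    # F.pad pads the trailing dims first: (dim -1, dim -2, then W, then H)
    return F.pad(coeffs, (0, 0, 0, 0, 0, pad_w, 0, pad_h))
```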

@Arcitec
Contributor Author

Arcitec commented Oct 29, 2024

Okay, I tried looking at https://github.com/JeongsooP/RGB-no-more/blob/main/dct_manip/dct_manip.cpp to find out how to pad the data being read by input_ = dm.read_coefficients(jpg_file), but I did not find it.

Here's my ideal solution:

  • Read any JPG (with any arbitrary resolution) via input_ = dm.read_coefficients(jpg_file).
  • Automatically pad it to a multiple of the 112x112 block size so the input JPG data works in JDEC even when its dimensions aren't multiples of 112.
  • Process with JDEC.
  • Crop after JDEC to remove the "black" padded edges at the right/bottom.
  • Save as PNG.

And I understand that the "fake" 112x112 blocks at the right/bottom edges will not be good quality, but I still think this would be the best solution for now.

PS: I am talking about using JPEG files, not PNG files. This is for processing real-world JPEGs to de-block them; I don't have the original PNGs for them. If my idea here is not possible with JPEG input, then perhaps we can at least process all of the 112x112 parts of the JPEG file and crop/ignore the rest of the image (the blocks smaller than 112x112)? (That was one of my suggestions above.)
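The final crop-back step of that pipeline is straightforward. A sketch with PIL, assuming the original dimensions are read from the input JPEG header before padding, and that JDEC's output was saved to a hypothetical out.png:

```python
from PIL import Image

orig_w, orig_h = Image.open(jpg_file).size  # reads the header only, no full decode

# ... pad coefficients, run JDEC, save its output as out.png ...

out = Image.open("out.png")
out.crop((0, 0, orig_w, orig_h)).save("final.png")  # drop the padded right/bottom edges
```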

@WooKyoungHan
Owner

I agree with a lot of this. There has been a lot of discussion about how to use padding properly. As you said, zero padding is memory-efficient, but mirror/periodic padding can give better quality. Thank you for suggesting a good approach.

JDEC is limited by the traditional JPEG legacy in some ways, and image size is one of them. It is unavoidable at the moment if we want to digest JPEG files. I'm hoping someone (I'd love it if it were me) will overcome this with a fancy idea.

I acknowledge that learning-based JPEG encoding is also being studied these days. It would be nice if JDEC could be used in combination with those approaches.

@Arcitec
Contributor Author

Arcitec commented Oct 30, 2024

mirror/periodic padding can give better quality.

Yeah, in a PNG -> JPG -> JDEC pipeline, mirror padding looks great for quality.

But in a REAL JPG WITH ARBITRARY SIZE -> DCT-MANIP -> JDEC pipeline, I guess we cannot pad the DCT data?

So if I have understood correctly, our only solution for "real JPG input with arbitrary size" is to process all full 112x112 blocks, and we must drop/skip all the smaller ones?

So an input JPG of 3219 x 3287 => JDEC => 3136 x 3248 PNG output (skipping every partial non-112x112 block), i.e. cropping the output to only the full 112x112 squares. That's the only solution, right?

@WooKyoungHan
Owner

Now I understand. Thinking it over, I realize that, as you mentioned, JDEC indeed has some ambiguous aspects when handling arbitrary inputs. Additionally, the 112 padding, while essential, is quite large. I don’t immediately have a better idea than what you proposed. Thank you for bringing this up. I believe a better approach could be developed by properly utilizing the transformation formula. I'll update you on this once it’s more thoroughly organized.

@Arcitec
Contributor Author

Arcitec commented Nov 3, 2024

@WooKyoungHan Thank you so much for answering. I am sorry, I did not see your message until today!

Indeed, the 112x112 size is pretty large, and it's very unfortunate, because JDEC is the best model I have ever seen for JPEG artifact removal (much better than Swin2SR, FBCNN, etc.).

But even though I would lose some pixel chunks (the partial non-112x112 blocks), I would love to use JDEC for personal image restoration. In many situations, an image can be safely cropped at the right/bottom edges without losing anything important.

I wonder if you could improve the JDEC repo's file-input code to automatically skip JPEG blocks that are not full 112x112 blocks? Currently, the repo cannot be used for arbitrary-resolution inputs; it errors out with shape '[1, 231, 7, 154, 7, 256]' is invalid for input of size 447897600.

It would be great if JDEC's code processed all full 112x112 blocks and auto-skipped the smaller ones, automatically producing a cropped output (a multiple of 112). That would be very useful. :)
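As a hedged sketch of what that auto-crop at load time could look like (again assuming a [num_blocks_h, num_blocks_w, 8, 8] luma coefficient grid; the subsampled chroma grids would use 7 blocks per 112-pixel unit instead of 14):

```python
def crop_coeff_grid(coeffs, target_multiple=112, dct_block=8):
    # Truncate a DCT coefficient grid so the decoded image size is a
    # multiple of `target_multiple` (drops partial right/bottom blocks).
    blocks_per_unit = target_multiple // dct_block  # 14 8-pixel blocks per 112 pixels
    h = (coeffs.shape[0] // blocks_per_unit) * blocks_per_unit
    w = (coeffs.shape[1] // blocks_per_unit) * blocks_per_unit
    return coeffs[:h, :w]
```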

I believe a better approach could be developed by properly utilizing the transformation formula.

I am curious about this. Do you mean there may be a solution for processing the smaller (< 112x112) blocks too?

Or that the 112x112 window could be made smaller by a new architecture in a future version?

@Arcitec
Contributor Author

Arcitec commented Nov 3, 2024

I also read your supplementary. JDEC is very impressive.

By the way, since you have looked at the state-of-the-art pixel-based JPEG artifact removers, I wonder if you have any opinions about which pixel-based model is the best right now?

I plan to use JDEC for most of my images, and a pixel-based model only if I need to keep the non-112x112 edges of an image.

I suspect that one of these is the best pixel-based model:

  • SwinIRv2: Apparently the official repo and paper. But I may be looking at the wrong repo (Swin Transformer v2 is not SwinIRv2?). I did see a SwinIR paper for image restoration. Either way, it seems old; maybe I shouldn't look at SwinIR.
  • Swin2SR: Repo. Seems better than FBCNN but very slow. Their model is "dynamic", though, so it works with every JPEG quality factor. It also seems to be an improved model that came after SwinIR/Swin Transformer.
  • GRL IR: New. They claim better results than FBCNN, but their models only work up to Q40, and they have separate models for each quality level. So its processing is probably too destructive for Q80-Q95 real-world images.
  • HyperRes: New. They claim to do JPEG artifact removal, but I see significant detail loss (look at the flower and the butterfly antennae in their example images).
  • FBCNN (with its parameter QF=80 or higher to preserve details): This one always seems to cause a lot of blurring, texture loss, incorrectly reconstructed details, etc. And if QF is increased, many artifacts remain in the image. Doesn't seem good.

From that list of state-of-the-art models I have heard about, it seems like Swin2SR is the best (but slow) and FBCNN is second best. Your paper update compared JDEC against exactly those two models, which also suggests to me that they are the state-of-the-art pixel-based solutions?

PS: To me, JDEC looks better than all of them. I hope one day that it can support arbitrary resolutions. :)

Edit: It seems like Swin2SR is the best of the publicly available pixel-based solutions. I have created and shared a ChaiNNer workflow to use that artifact removal method here for batch processing with realistic results: chaiNNer-org/chaiNNer#3055 (comment)
