zhuzilin/pdf-with-its-own-md5

PDF template with its own MD5

MD5 is a well-known cryptographic hash function whose collision resistance was broken by Xiaoyun Wang's team in 2004. Because MD5 collisions can now be computed, it is possible to create a PDF that shows its own MD5. Computing the required collisions is time-consuming, however, and that is what this project is for.

Example

main.pdf is an example of the template.

Usage

  1. Install pdflatex.

The PDF header varies between pdflatex versions. To use the processed figures directly, install the same pdflatex version as below (from MacTeX in my case):

% pdflatex -v
pdfTeX 3.141592653-2.6-1.40.22 (TeX Live 2021)
kpathsea version 6.3.3
Copyright 2021 Han The Thanh (pdfTeX) et al.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Han The Thanh (pdfTeX) et al.
Compiled with libpng 1.6.37; using libpng 1.6.37
Compiled with zlib 1.2.11; using zlib 1.2.11
Compiled with xpdf version 4.03
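Since the processed figures depend on the exact pdfTeX build, it can help to compare your local banner with the one above before going further. The following is a minimal sketch, not part of this repository: the function names and the `EXPECTED_PDFTEX` constant are illustrative assumptions, and a matching version string is only a necessary condition, not a guarantee that the PDF header is byte-identical.

```python
import re

# Version string copied from the banner quoted in this README (assumed reference build).
EXPECTED_PDFTEX = "3.141592653-2.6-1.40.22"


def parse_pdftex_version(banner: str):
    """Return the pdfTeX version from `pdflatex -v` output, or None if absent."""
    match = re.search(r"^pdfTeX (\S+)", banner, flags=re.MULTILINE)
    return match.group(1) if match else None


def matches_expected(banner: str) -> bool:
    """True when the banner reports the same pdfTeX version as the README."""
    return parse_pdftex_version(banner) == EXPECTED_PDFTEX
```

The banner text can be captured with `subprocess.run(["pdflatex", "-v"], capture_output=True, text=True).stdout` and passed to `matches_expected`.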

  2. Create your own pdf starting from main.tex.

You can change the content of main.tex to whatever you want, as long as you keep the images in the header part. That code inserts the binary data of the figures right after the PDF header, which is the prefix the figures were computed for.

Right now, main.tex is almost identical to the Overleaf starter document.

  3. Compute the MD5 of main.pdf and add it back to main.tex.

After editing main.tex, compile it with:

pdflatex main.tex

This generates main.pdf.

Then compute the MD5 of main.pdf with the command below (md5 is the macOS command; on Linux, md5sum main.pdf is the equivalent):

md5 main.pdf

Finally, change the 32 figures in the header of main.tex so that they show the 32 hex digits of this MD5, and run pdflatex main.tex again. Now you have a PDF that shows its own MD5!
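If you prefer not to copy the digest by hand, the hashing step can be sketched in Python. This is only an illustration, not a script shipped with this repository; the function name `md5_hex_digits` is an assumption, and the digits are uppercased to match the 0-9, A-F naming of the digit images.

```python
import hashlib
from typing import List


def md5_hex_digits(path: str) -> List[str]:
    """Return the MD5 of the file at `path` as a list of 32 uppercase hex digits."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large PDFs do not need to fit in memory at once.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return list(digest.hexdigest().upper())
```

For example, `md5_hex_digits("main.pdf")[0]` is the digit that the first of the 32 figures should display.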

Start from scratch

As you may have guessed, the secret of the generated PDF files lies in the processed figures: they are precomputed MD5 collisions. If you use a different version of pdflatex (or Overleaf), the header (prefix) of the PDF will be different, and the precomputed collisions no longer hold. In that case, or if you simply want to create your own figures, here are the steps:

  1. Install HashClash

To generate the collision for MD5, we need to use the amazing hashclash:

git clone git@github.com:cr-marcstevens/hashclash.git
cd hashclash
./build.sh

  2. Create the raw digits.

You need to create images of the 16 possible hex digits (0-9, A-F), like the ones in the raw_nums directory. All images must have the same dimensions, because the dimensions are recorded in the metadata of the PDF. I will call them raw nums from here on.
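One way to confirm that all digit images share the same dimensions is to read the size straight from each file. The sketch below assumes the images are JPEG files (an assumption based on the .jpg files mentioned later in this README) and uses only the standard library; `jpeg_size` and `all_same_size` are hypothetical helper names, not part of this repository.

```python
import struct


def jpeg_size(data: bytes):
    """Return (width, height) from the first SOF segment of a JPEG byte string."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt JPEG: expected a marker")
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before the real marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15 carry the frame size; 0xC4, 0xC8 and 0xCC are not SOF markers.
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length  # skip the marker and the segment payload
    raise ValueError("no SOF segment found")


def all_same_size(blobs) -> bool:
    """True when every JPEG byte string in `blobs` reports the same size."""
    return len({jpeg_size(blob) for blob in blobs}) <= 1
```

To use it, read each file in raw_nums with `open(path, "rb").read()` and pass the resulting list to `all_same_size`.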

  3. Prepare the header (prefix).

Next, we need to prepare the prefix of the images. When \immediate\pdfximage is used at the very start of a .tex file, the generated PDF is laid out as:

header -> images_0 -> mid_header_1 -> images_1 -> mid_header_2 -> ...

We need to extract header and every mid_header_i because they are part of the prefix of each image and determine the generated collisions.
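To make the layout concrete, the sketch below shows how the bytes preceding the i-th image block follow from the pieces named above. It is an illustration of the layout only, not code from this repository; the function name and the list-based arguments are assumptions, with `mid_headers[k]` standing for mid_header_{k+1}.

```python
from typing import List


def prefix_before_image(header: bytes, mid_headers: List[bytes],
                        images: List[bytes], i: int) -> bytes:
    """Bytes that precede images_i in: header, images_0, mid_header_1, images_1, ..."""
    parts = [header]
    for k in range(i):
        parts.append(images[k])       # image block k comes first
        parts.append(mid_headers[k])  # then mid_header_{k+1} separates it from the next block
    return b"".join(parts)
```

With this layout, the prefix of images_0 is just the header, while the prefix of images_1 also includes images_0 and mid_header_1. That is why each collision has to be computed after the previous ones are fixed.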

To extract the headers, use the raw nums in main.tex, compile it, and then run extract_headers.py:

# main.tex now only contains raw nums.
pdflatex main.tex
python3 extract_headers.py

The extracted headers will be in the headers folder.

  4. Create the initial prefix.

The next step is to create the very first prefix with:

python3 get_prefix_0.py

This generates prefix.txt in the prefix_0 folder.

  5. Run prefix_loop.py.

Enter the hashclash folder and run prefix_loop.py from there. The reason to go to the hashclash folder first is that I found the unicoll script fails when run from other directories.

cd ~/hashclash
mkdir ipc_workdir
cd ipc_workdir
python3 ~/pdf-template-with-its-md5/prefix_loop.py

The script gradually creates prefix_1, prefix_2, ..., prefix_32, and it can be restarted at any time. It sequentially makes 16 collisions for each digit; once the i-th collision of digit j has finished, an i.jpg file appears in prefix_j. These are not the final images, but prefix_loop.py uses them to detect the current progress. You can delete a given image to retry that collision.

The script/poc_no.sh script we use from hashclash sometimes gets stuck. If that happens, kill the process, change the data value in poc_no.sh, and restart prefix_loop.py. Another way to get unstuck is to remove the most recently generated jpg in the folder.
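The restart behaviour described above can be pictured as a scan over the marker files. The following is a minimal sketch under stated assumptions, not the actual logic of prefix_loop.py: the helper names are hypothetical, and it assumes the folders are named prefix_1 to prefix_32 with 16 marker files each, as the text suggests.

```python
from pathlib import Path


def finished_collisions(prefix_dir: Path) -> list:
    """Sorted integer stems of the <i>.jpg marker files found in prefix_dir."""
    stems = [p.stem for p in prefix_dir.glob("*.jpg")]
    return sorted(int(s) for s in stems if s.isdigit())


def first_unfinished_digit(root: Path, digits=range(1, 33), per_digit=16):
    """Return (j, finished markers) for the first prefix_j that is incomplete, or None."""
    for j in digits:
        folder = root / f"prefix_{j}"
        done = finished_collisions(folder) if folder.is_dir() else []
        if len(done) < per_digit:
            return j, done
    return None  # every digit already has all of its markers
```

Calling `first_unfinished_digit(Path("."))` from the work directory shows where a restarted run would be expected to pick up, and which marker to delete if you want to retry a collision.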

  6. Extract the processed nums.

Use extract_image_from_prefix.py to extract the images for each digit. The script extracts only one digit at a time, for example:

python3 extract_image_from_prefix.py --digit=3

extracts the 16 processed figures for digit 3 and saves them to processed/3/.

After extracting all 32 digits, you will have your own version of the processed figures. You can now use them to create a PDF that shows its own MD5.

Acknowledgement

This project builds on the following repositories and articles:

  1. cr-marcstevens/hashclash https://github.com/cr-marcstevens/hashclash
  2. corkami/collisions https://github.com/corkami/collisions
  3. PoC||GTFO issue 14 https://www.alchemistowl.org/pocorgtfo/pocorgtfo14.pdf
  4. "Is it possible to construct a file that contains its own hash, such as its MD5?" (answer on Zhihu) https://www.zhihu.com/question/411191287/answer/1384197672
