Skip to content

w-i-l/huffman-text-compression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Huffman compression

An algorithm to compress data using Huffman encode



About it

Current program is based of Huffman encode which ensure a good data compression. It can be use for both encoding and decoding a file.



How to use it

It has an CLI using an intuitive menu:

  • Read from file - reads the data from a file
  • Encode data - generates the Huffman encoding of the data
  • Write data - writes the input file encoded
  • Decode data - writes the input file decoded
  • Display data - shows how the symbols were encoded
  • Display entropy - shows the Shannon entropy along with the current encode entropy
  • Test the files - compares two given files and shows the content differences

To encode a file you would read from a file, encode the data and then write to another file.

Note that the output file should be a binary one!

To decode a file simply run decode data.

How to run

Compile, here I'm using gcc.

After compiling, lunch the executable, depending on the os:

Windows

gcc test.c util.c code.c encode.c menu.c main.c -o main
main.exe

Linux

gcc test.c util.c code.c encode.c menu.c main.c -o main -lm
./main

How it works

It reads all the symbols from a file and store them in a data struct array. The it adds all the missing nodes from the Huffman tree, and start applying the algorithm to encode.
After completing the tree the program writes the dictionary as key - value pair, with the key being the symbol itself and the value its apparation probability.
From a given file, all the characters are then written encoded in a buffer (for this project I choose a 32 bits array - int) and when the buffer is full it is written to the output file.

When decoding a file, it first reads the dictionary and with it parses all the file and writes the decoded version to an output file.



Tech specs

The entire program has its functions documtented as comments above the their definitions.

A probably confusing thing would be the adding of a char* that represents the encoding of a symbol to a bit array that I will explain below.

All the functionality is implemented via add_to_buffer function that takes 3 parameters:

  1. int buffer - where it stores the bits
  2. int buffer_size - that represents how many bits we have written
  3. char* to_encode - the string that needs to be added to buffer

The function checks how many bits we can write from the to_encode parameter and based on that it returns coresponding.

The add part is done by first shifting the current index to left by 1 so we "allocate" a new bit in which we can write. The second part, the writing, implies bitwise or '|' with the current value of the to_encode parameter.

Assume that for the letter a we have the encode "10011" store as char* and our buffer is 0 and so its index.

Buffer representation: (for simplicity assume that our it has only 8 bits)

+-----------+

| 0000 0000 |

+-----------+

We take the first char of out string (from left to rigth) which is '1' and we shift out number by 1 and the bitwe oring the value with 1.

+-----------+

| 0000 0001 |

+-----------+

Now we continue the same way and the result would be:

+-----------+

| 0001 0011 |

+-----------+



Performance

I have tested the program with a big file - 365 MB - and the compression rate would be ~60%. The time required to all reading the file, encoding and decoding it was, on my machine, ~12 minutes.

About

A compression program based on Huffman encode

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages