Huffman Encoding Tables

Huffman Encoding trees are a loss-less compression schema that work very well with text or other datasets where elements are frequently repeated. A binary tree is constructed where each node and leaf represents a symbol. The address to a node or leaf is the sequential labels of the edges to that node. Because more frequently used symbols are referenced higher on the tree, they have a shorter address. In this way compression is possible.

Because Huffman Encoding is arbitrary, the encoding table must be transmitted with the compressed document making it inefficient for very small documents or documents with a large number of symbols (like images). The key to the encoding scheme is that each node address can be parsed in an unambiguous manner from the data stream.

The program (written in C++ and using templated classes) involves reading a probability table (as a text file) and then constructing a Huffman encoding tree for file compression.

This project contains three files:

  • symbol.h The header file containing the tree definitions
  • symbol.cpp The source code for the methods and implementations used to build the encoding tree
  • huffman.cpp The main procedure to read the data initiate the tree and output the results.

Usage:

Normally the first step in encoding a source file is to take a pass and count all instances of ASCII or Unicode characters--or bytes--in the file and create a frequency table. This frequency table is then converted to a Huffmann tree. This project omits the first step, leaving that relatively trivial activity to a user to implement on his or her own. This code expects to receive an input file with an ASCII character as the first character of a line, followed by a space, then a floating point number representing the probability of occurrence of the character in a file and ending with a newline code.

Language: 
C++
Archive: