Mar 302010

Algorithm Description and Implementation :

The application is built using the C++ Language. The Huffman algorithm is one of the most well known algorithms worldwide and has been used for many years for compressing/encrypting data.

The algorithm’s purpose is to reduce the number of bits representing a byte. Generally, each byte is represented in 8 bits which allows 256 different characters to be represented.

Huffmans algorithm tries to reduce the number of bits required by the most frequently used characters in a certain file or stream. This varies from one file to another.

The implementation of the application has two main programs :

i) encoder.cpp

The program starts by reading the stream byte by byte and calculating the frequency of each byte. The most used characters have a shorter representation.

There may be some characters that the algorithm will give a longer representation to, but these are rare characters.

The corresponding Huffman code is obtained using a priority queue. The STL, standard template libraries for C was referred to in creating this. The file queue contains the priority queue template.

All the bytes are inserted such that the most used comes in the back and the least used goes in the front. The least used 2 bytes are popped from the queue and the sum of the 2 bytes is pushed.

This sum becomes the parent and the 2 bytes that were popped are its children. This process continues until only one node remains. This node becomes head of the Huffman Tree.

As the decoder will decode the file using the same tree then the encoder must save the tree data in the output file for decoding purposes.

This creates a problem regarding the size occupied by the tree data. A well known method for solving this problem is the Canonical Huffman method.

This method involves changing the codes generated for each byte in a systematic way that will give the feature of saving the bytes in the file without saving its corresponding code.

This method will make the header of the output file which holds the tree data to be a maximum of only 265 bytes.

As the original Huffman tree would contain the symbol, number of bits and the code, listing the symbols in sequential order allows the removal of the symbols themselves as they can be determined by the order.

It is also known that later code is higher in value than earlier code. This can be omitted. This leaves the number of bits for each symbol to be transmitted.

For example, the following case would utilise the Canonical Method. A text file that contains only a combination of the character a and b. Let a=97 and b=98.

Using the general Huffman method, it will be stored as the following:

(0,0,0,0,……,length of encoding of a,,0,0,0,0,……,length of encoding code of b)

Here the fields of a and b are filled with many useless zeros. Canonical Huffman stores this as

(length of a,length of b). As can be seen this is saved in only 4 bytes instead of 256 bytes.

The encoder then reads the input stream again byte by byte and inserts the corresponding code in the output file.

Another problem that needs to be solved is the last byte in the stream. As the length of the bytes is changed, the final bit-stream length will not be divisable by 8.

This can be solved by inserting a byte in the header that tells how many bytes will be extracted from the last byte by the decoder.

This saves space in the header and saves the decoder from decoding garbage data.

ii) decoder.cpp

As specified in the encoder section, the decoder will be doing the reverse of the encoder operation.

It starts by building the tree from the saved data. In order to build this tree, the decoder checks the byte value. If the value = 255, the decoder knows is the normal Huffman method.

Else if the byte value of 255 wasnt found, the decoder assumes the canonical Huffman method.

The decoder then reads the encoded file and using the built tree, it decodes it generating the output file.

Each program is a class which its constructor takes two arguments.

The first is the path of the first file (to be compressed file path in case of encoder and previously compressed file in case of decoder) and the second argument takes the path of the file to be stored.

Each of them has only one public method called start ( ) which do all the work. Starting the encoder program will prompt for a file path to be compressed and another one to be saved to.

If no errors happen (the only expected error is failure to open a file), the data will be read from the source file and compressed to the destination file.

The decoder program has the same usage method of the encoder. It will prompt for the path of the previously compressed file and will extract the data to the destination file.

The time used to build Huffman tree for each file and the space occupied (even if it is small) makes it not the best compression method although it works satisfactorily for text files.

The compression ratio for the multimedia file is not significant.

The compression ratios of the different file types is not the same. It depends on the type of file and the number of different characters in the file.

As this number increases, the compressibility may decrease. In fact, many compression programs fail to compress if the size of the file is already extremely small as the output file will even be larger than the original file.

This is due to the tree data and that each character must be represented once. The cost in time is directly proportional to the file size in both compression and decompression operations.

Huffman encoding is practical if the encoded string is large relative to the code table.

Test Procedure :

The following test procedure, for test1, was used to test the application during design. The test procedure was the same for all data. The file names, extensions or locations used are not mandatory.

Compress File:

1. Copy the test file to c:/test1.dat

2. Run the encoder and when prompted for the file to be compressed, enter c:/test1.dat

3. When asked for the file to store, enter c:/test1.huf

4. A file named test1.huf will be created in the c:/ which is the compressed one.

Decompress File :

1. Run the decoder and when prompted for the file to be decompressed, enter c:/test1.huf.

2. When asked for the file to store, enter c:/test2.dat.

3. test1.dat and test2.dat were then compared.

Test Results :

The application was developed in Microsoft Visual Studio 2005 and was tested in DevC++. The application executed successfully ensuring that it does not depend on particular libraries or environments.

It has also been tested on MS Windows XP and MS Windows Vista.

1. test1.dat :

Original Size : 60kb

Compressed Size : 37kb

2. test2.dat :

Original Size : 149kb

Compressed Size : 97kb

3. test3.dat :

Original Size : 23329kb

Compressed Size : 21419kb

Code:

encoder.cpp

```//Encoder.cpp
//Sameer Kumar
//27/04/2007
//DME4
//CA438 Multimedia Technology
//Dr. Stephen Blott
#include <iostream>
#include <fstream>
#include <queue>
#include <vector>
#include <stdlib.h>
class encoder
{   //the node for the tree
class TreeNode{
public:
//frequency of the node
unsigned int Frequency;
//the character for this node
unsigned char c;
//length of the encoding code for this node
int length;
//2 pointers for left and right nodes in the tree
TreeNode *child0,*child1;
TreeNode(){Frequency=0;child0=child1=NULL;length=0;c=255;}
//constructor
TreeNode(unsigned char c,unsigned int Frequency)
{
child0=child1=NULL;
this->c=c;
this->Frequency=Frequency;
length=0;
c=255;
}
//constructor
TreeNode( TreeNode* n0, TreeNode* n1)
{
Frequency=n0->Frequency+n1->Frequency;
child0=n0;
child1=n1;
length=0;
}
//this operator is used for priority queue to determine the location in which to put the inserted node
bool operator <(const TreeNode& a)const{return a.Frequency
};
//the canonical sort used by the sort algorithm
struct CanonicalHuffman{
bool operator()( TreeNode n1,TreeNode n2 ){
if(n1.length
return 1;
else if(n1.length==n2.length)
if(n1.c
return 1;
return 0;
}
};
//used by sort algorithm to sort nodes alphabetically
struct AlphaSort{
bool operator()( TreeNode n1,TreeNode n2 ){
return(n1.c
}
};
std::ifstream in;
std::ofstream out;
//frequency of the 266 bytes
TreeNode SymbolsFrequency;
//number of symbols that exist in the file
int ActiveSymbols;
//the number of bytes that will be in the last byte
char LastByte;
//for bulding huffman tree
std::priority_queue < TreeNode,std::vector
> q;
//traverse huffman tree to get the encode for each byte
void traverse(const TreeNode *Tree,int length){
//go through the whole tree to get the encoding
if(!Tree->child0&&!Tree->child1)
{
SymbolsFrequency[Tree->c].length=length;
return;
}
length++;
if(Tree->child0)
traverse(Tree->child0,length);
if(Tree->child1)
traverse(Tree->child1,length);
}
//calculates the frequency of all the bytes storing them in the SymbolsFrequency array
void calculateFrequency()
{   //calculate all frequencies
while(true)
{
//calculate the frequency of the bytes of the input stream
unsigned char c;
c=in.get();
if(in.eof())break;
if(!SymbolsFrequency[c].Frequency)
{ActiveSymbols++;SymbolsFrequency[c].c=c;}
SymbolsFrequency[c].Frequency++;
}
//building the tree using priority queue
//push all the non-zero frequency nodes to the tree
for(int i=0;i<256;i++){ 			SymbolsFrequency[i].c=i; 			if(SymbolsFrequency[i].Frequency) 				q.push(SymbolsFrequency[i]); 			} 		while(q.size()>1)
{
TreeNode *child0=new TreeNode(q.top());
q.pop();
TreeNode *child1=new TreeNode(q.top());
q.pop();
q.push(TreeNode(child0,child1));
}
//traverse the tree to obtain pre-final codes
traverse(&q.top(),0);
//start canonical method and sort accordingly - changing codes
std::sort(SymbolsFrequency,SymbolsFrequency+256,CanonicalHuffman());
int code=-1;
int ii=0;
while(!SymbolsFrequency[ii].Frequency)
ii++;
int current_length=SymbolsFrequency[ii].length;
//canonical huffman algorithm
for(;ii<256;ii++)
{
code++;
if(SymbolsFrequency[ii].length!=current_length)
{
current_length=SymbolsFrequency[ii].length;
code=code<<1; 			} 			SymbolsFrequency[ii].Frequency=code; 		} 		//end of canonical method 		//re sort alphabetically 		std::sort(SymbolsFrequency,SymbolsFrequency+256,AlphaSort()); 		//check whether it is better to store associative huffman table or canonical table 		//for the associative table...127byte*2 bytes = 254 bytes...use this if it is greater than 127          if(ActiveSymbols>127)
for(int i=0;i<256;i++)
out.put((unsigned char)SymbolsFrequency[i].length);
//use the canonical table because the ActiveSymbols are less than or equal to 127
else
{   //canonical table indication mark
out.put((unsigned char)255);
out.put((unsigned char)ActiveSymbols);
for(int i=0;i<256;i++)
if(SymbolsFrequency[i].length)
{
out.put(SymbolsFrequency[i].c);
out.put((unsigned char)SymbolsFrequency[i].length);
}
}
//restore the input stream to the beginning
in.clear();
in.seekg(0,std::ios::beg);
}
public: 	encoder(char *in,char *out)
{
this->in.open(in,std::ios::binary);
this->out.open(out,std::ios::binary);
ActiveSymbols=LastByte=0;
}
~encoder()
{
in.close();
out.close();
}
void start()
{
if(in.is_open()&&out.is_open())
{
//reserve place for last byte length to be stored
out.put(LastByte);
std::cout<<"Building Huffman Tree..."<
//calculate the frequency and build the tree and save it to the output stream
calculateFrequency();
std::cout<<"Huffman Tree Building finished."<
unsigned char b=0,c,inserter=1<<7;
unsigned int checker;
std::cout<<"Compressing file..."<
while(true)
{
c=in.get();
if(in.eof())break;
checker=1<<(SymbolsFrequency[c].length-1); 				while(checker) 				{ 					if(checker&SymbolsFrequency[c].Frequency)                         //packing bytes in one byte 						b=b|inserter; 					inserter=inserter>>1;
//wait till full 8-bits are packed
if(!inserter)
{
inserter=1<<7; 						out.put(b); 						b=LastByte=0; 					} 					checker=checker>>1;
}
LastByte++;
}
//there may be a bit in the last byte that contains data and garbage. check if any data is present to save it
if(inserter)
out.put(b);
//setting last byte length
//modify the size of the last byte length as it may be invalid
out.seekp(std::ios::beg);
out.put(LastByte);
out.flush();
std::cout<<"Operation Completed."<
}
else
std::cout <<"Error Opening Files."<
}
};
int main(){
char FP,HP;
//prompt user
std::cout<<"Enter Source Path To Be Compressed: ";
std::cin.getline(FP,256);
std::cout<<"Enter Destination Path To Store: ";
std::cin.getline(HP,256);
encoder e(FP,HP);
e.start();
}```

decoder.cpp

```// Decoder.cpp
//Sameer Kumar
//27/04/2007
//DME4
//CA438 Multimedia Technology
//Dr. Stephen Blott
#include <iostream>
#include <fstream>
#include <queue>
#include <vector>
#include <stdlib.h>
class decoder{
//the node for the tree
class TreeNode{
public:
//frequency of the node
unsigned int Frequency;
//the character for this node
unsigned char c;
//length of the encoding code for this node
int length;
//2 pointers for left and right nodes in the tree
TreeNode *child0,*child1;
TreeNode(){Frequency=0;child0=child1=NULL;length=0;c=255;}
//constructor
TreeNode(unsigned char c,unsigned int Frequency)
{
child0=child1=NULL;
this->c=c;
this->Frequency=Frequency;
length=0;
c=255;
}
//constructor
TreeNode( TreeNode* n0, TreeNode* n1)
{
Frequency=n0->Frequency>>1;
length=n0->length-1;
if(n0->Frequency&1)
{child1=n0;child0=n1;}
else
{child1=n1;child0=n0;}
}
//this operator is used for priority queue to determine the location in which to find the inserted node
bool operator <(const TreeNode& a)const
{
if (length==a.length)
return Frequency<a.Frequency;
return length<a.length;
}
};
////the canonical sort used by the sort algorithm
struct CanonicalHuffman{
bool operator()( TreeNode n1,TreeNode n2 ){
if(n1.length<n2.length)
return 1;
else if(n1.length==n2.length)
if(n1.c<n2.c)
return 1;
return 0;
}
};
//used by sort algorithm to sort nodes alphabetically
struct AlphaSort{
bool operator()( TreeNode n1,TreeNode n2 ){
return(n1.c<n2.c);
}
//end of TreeNode class
};
//input stream
std::ifstream in;
//output stream
std::ofstream out;
//array for all the possible 256 characters. it will be converted to the tree from here using the queue
TreeNode SymbolsFrequency;
TreeNode* Tree;
//the number of bytes that will be extracted from the last byte
unsigned char LastByte;
//the queue
std::priority_queue < TreeNode,std::vector<TreeNode> > q;
//builds the huffman tree from the stored table determining whether it is from the general huffman table or canonical table
void build_tree()
{
unsigned char b;
b=in.get();
//canonical mark indicated
if(b==255)
{
unsigned char i=in.get();
while(i)
{
unsigned char symbol,len;
symbol=in.get();
len=in.get();
SymbolsFrequency[symbol].length=len;
SymbolsFrequency[symbol].c=symbol;
i--;
}
}
//huffman method
else
{
SymbolsFrequency.length=b;
SymbolsFrequency.c=0;
int i=1;
do
{
SymbolsFrequency[i].length=in.get();
SymbolsFrequency[i].c=i;
}while(++i<256);
}
//start of canonical method by sorting
std::sort(SymbolsFrequency,SymbolsFrequency+256,CanonicalHuffman());
int ii=0;
int code=-1;
while(!SymbolsFrequency[ii].length)
ii++;
int current_length=SymbolsFrequency[ii].length;
//generate huffman codes
for(int i=ii;i<256;i++)
{
code++;
if(SymbolsFrequency[i].length!=current_length)
{
current_length=SymbolsFrequency[i].length;
code=code<<1;
}
SymbolsFrequency[i].Frequency=code;
}
//canonical codes generated
//build the tree
TreeNode* t=Tree;
//start by filling the queue
for(;ii<256;ii++)
q.push(SymbolsFrequency[ii]);
//loop till only one node remains in the queue.
while(q.size()>1)
{
TreeNode *child0=new TreeNode(q.top());
q.pop();
TreeNode *child1=new TreeNode(q.top());
q.pop();
q.push(TreeNode(child0,child1));
}
//the remaining node is the head of the tree. tree has been constructed successfully
Tree=(TreeNode*)&q.top();
}
public:
//constructor
decoder(char *in,char *out)
{
this->in.open(in,std::ios::binary);
this->out.open(out,std::ios::binary);
Tree=new TreeNode;
}
//start decode
void start(){
if(in.is_open()&&out.is_open())
{
//catching last byte length
LastByte=in.get();
std::cout<<"Building Huffman Tree..."<<std::endl;
//call for building the tree first
build_tree();
std::cout<<"Huffman Tree Building finished."<<std::endl;
TreeNode* t=Tree;
unsigned char b,bb,checker=1<<7;
b=in.get();
bool isLastByte=false;
std::cout<<"Decompressing file..."<<std::endl;
//decompressing data
while(!in.eof()){
bb=in.get();
//indicates if it is decoding the last byte of the code. stops the decoder from decoding any garbage data
if(in.eof())
isLastByte=true;
for(int i=0;i<8;i++)
{
//read each byte and dig down through the tree till we reach the required byte
if(b&checker)
t=t->child1;
else t=t->child0;
//byte found
if(!(t->child0||t->child1)){
//put the byte to the output stream
out.put(t->c);
if(isLastByte){
LastByte--;
if(!LastByte)
break;
}
//reset the tree digger pointer to the top again to make it ready for decoding new bytes
t=Tree;
}
checker=checker>>1;
}
checker=1<<7;
b=bb;
}
//flushing output
out.flush();
std::cout<<"Operation Completed."<<std::endl;
}
else
std::cout <<"error opening files."<<std::endl;
}
};
int main(){
char FP,HP;
//prompt user
std::cout<<"Enter source path to be decompressed: ";
std::cin.getline(FP,256);
std::cout<<"Enter destination path to extract: ";
std::cin.getline(HP,256);
decoder d(FP,HP);
d.start();
}```

References:

01. Algorithms in C, Robert Sedgewick, Addison-Wesley Publishing, ISBN 0-201-51425-751425