söndag 9 december 2012

Implementing Huffman Coding in C# - Tutorial

Wow.. it's been a long time since I last published something here, but - at last - I'm back and this time I thought we'd take a deep dive into the world of Huffman.

For those new to Huffman Coding, here's a bit of background:
Huffman coding is an algorithm devised by David A. Huffman of MIT in 1952  for compressing text data to make a file occupy a smaller number of bytes. Albeit simple, this compression technique is powerful enough to have survived into modern time; variations of it is still in use in computer networks, modems, HDTV, and other areas.

I'm going to ignore various versions of Unicode here and for the sake of simplicity only focus on ASCII. I'm not lazy; I'm informative. :)
Normally text data is stored in a standard format of 8 bits per character - an octet. The idea of Hoffmann is to use different-length binary encodings for different characters. The more frequent the character, the fewer bits used. The trade-off is that some characters may need to use encodings that are longer than 8 bits, but this is reserved for characters that occur infrequently so the extra cost is worth it.

The steps involved in Huffman coding a given text source file into a destination compressed file are the following:

  1. Examine the source file's contents and count the occurrences of each character.
  2. Place each character and its frequency into a sorted priority queue. 
  3. Convert the contents of this priority queue into a binary tree. 
  4. Traverse the tree to discover the binary encoding of each character
  5. Re-examine the source file's contents, and for each character, output the encoded binary version of that character to the destination file. 
In this post, we are going to focus on step 2 through 4. For simplicity, we are going to output the binary encoding of each character in the form of a string instead of writing it to disc as a stream of bits. 

The files we will create are PriorityQueue.cs, HuffmannTree.cs, HuffannNode.cs and Program.cs.
PriorityQueue will be a simple - not thread safe! - implementation of a Priority Queue structure. .Net doesn't come with a priority queue built in so we're gonna build on a SortedDicitionary. More on that later.
HuffmanTree.cs will contain the actual logic for the Huffmann implementation. It exposes a constructor which takes a Dictionary that is the frequency mapping for each character. It also exposes Dictionary CreateEncodings() which creates the binary encoding for each character in the map. 
The HuffmanNode.cs will be a very basic tree node structure that keeps track of parent, childs, the character it represents and the character's frequency.

The text data we will work with are the string "ac bca ba z".
Step 1 would be to read the text data from file and map character and frequency but since we know the text data we also now the frequency of each character. Let's place them in a Dictionary in Program.cs.
IDictionary counts = new Dictionary<char, int>();

// ac bca ba z
counts.Add(' ', 3);
counts.Add('b', 2);
counts.Add('a', 3);
counts.Add('c', 2);
counts.Add('z', 1);
counts.Add('\n', 1);


Step 2 and 3 of the algorithm places these counts into binary tree nodes - HuffmanNode.cs. The nodes are put into a priority queue, which keeps them in sorted order with smaller counts at the front of the queue. Now the algorithm repeatedly removes two nodes from the front of the queue. A new node is created and the two nodes are set as children for the new node; the first node becomes the left child, and the second the right. The frequency variable of the new node is set to the accumulated value of the two nodes. i.e. if the frequency of node n1 was 5 and node n2 was 7 then the frequency of the new node is set to 12. The new node is then reinserted into the queue. This process is repeated until the queue contains only one node. This node is the root of the finished Huffman tree.

The Priority Queue structure is up next. A typical priority queue let's you enqueue an arbitrary set of objects, each if which are associated with a value or priority. In our case we will use integers as priority but you could basically use anything that's sortable. It then let's you dequeue one object at the time and it automatically makes sure that it's the item with the lowest priority that are dequeued.

using System.Collections.Generic;
using System.Linq;

namespace Huffman1
{
    internal class PriorityQueue<T>
    {
        private readonly SortedDictionary<int, Queue<T>> _sortedDictionary = new SortedDictionary<int, Queue<T>>();

        public int Count { get; private set; }

        public void Enqueue(T item, int priority)
        {
            ++Count;
            if (!_sortedDictionary.ContainsKey(priority)) _sortedDictionary[priority] = new Queue<T>();
            _sortedDictionary[priority].Enqueue(item);
        }

        public T Dequeue()
        {
            --Count;
            var item = _sortedDictionary.First();
            if (item.Value.Count == 1) _sortedDictionary.Remove(item.Key);
            return item.Value.Dequeue();
        }
    }
}
To be able to enter Huffman nodes into the priority queue we need the actual node.. 


// HuffmanNode.cs
namespace Huffman1
{
    internal class HuffmanNode
    {
        public HuffmanNode Parent { get; set; }
        public HuffmanNode Left { get; set; }
        public HuffmanNode Right { get; set; }
        public char Value { get; set; }
        public int Count { get; set; }
    }
}

< The code for inserting the character/frequence map into the priority queue and looping the queue to create a binary tree representation of the map is as follows:



private readonly HuffmanNode _root;

public HuffmanTree(IEnumerable<KeyValuePair<char, int>> counts)
{
    var priorityQueue = new PriorityQueue<HuffmanNode>();

    foreach(KeyValuePair<char, int> kvp in counts)
    {
        priorityQueue.Enqueue(new HuffmanNode {Value = kvp.Key, Count = kvp.Value}, kvp.Value);
    }

    while(priorityQueue.Count > 1)
    {
        HuffmanNode n1 = priorityQueue.Dequeue();
        HuffmanNode n2 = priorityQueue.Dequeue();
        var n3 = new HuffmanNode {Left = n1, Right = n2, Count = n1.Count + n2.Count};
        n1.Parent = n3;
        n2.Parent = n3;
        priorityQueue.Enqueue(n3, n3.Count);
    }

    _root = priorityQueue.Dequeue();
}

By now we have a tree and thus we can encode the binary tree Huffman-style. If you could see the tree it would look like this:


            12
           /  \
          /    \
         /      \
        /        \
       5          7
      / \        / \ 
     /   \      /   \
    /     \    /     \
   2      3   3      4
  / \    ' ' 'a'    / \
 1   1             2   2
'z''EOF'          'b' 'c'

EOF is a special character we use to keep track of where the string ended.
Step 4 of the algorithm traverses the tree to discover the binary encoding and by traversing the tree from the root ('12') down we can create a path from root to node. If we wanted to go to 'z' we would have to go "Left" three times where as if we wanted to go to 'a' we would take a right followed by left.

If we assign left and right with '0' and '1' respectively we can now construct a binary representation of the path. For example; our path to 'z' would be represented as '000' since we took left three times. Our path to 'a' would be represented as '10' and the path for 'b' would be '110'. By using the CreateEncodings() method of HoffmanTree.cs we can create a mapping for each character and binary path. This function won't serve more purpose than displaying that the tree has been built correctly and that we can traverse it but that's at least worth something. :) In it's entirety,  HoffmanTree.cs looks like this.


using System.Collections.Generic;

namespace Huffman1
{
    internal class HuffmanTree
    {
        private readonly HuffmanNode _root;

        public HuffmanTree(IEnumerable<KeyValuePair<char, int>> counts)
        {
            var priorityQueue = new PriorityQueue<HuffmanNode>();

            foreach(KeyValuePair<char, int> kvp in counts)
            {
                priorityQueue.Enqueue(new HuffmanNode {Value = kvp.Key, Count = kvp.Value}, kvp.Value);
            }

            while(priorityQueue.Count > 1)
            {
                HuffmanNode n1 = priorityQueue.Dequeue();
                HuffmanNode n2 = priorityQueue.Dequeue();
                var n3 = new HuffmanNode {Left = n1, Right = n2, Count = n1.Count + n2.Count};
                n1.Parent = n3;
                n2.Parent = n3;
                priorityQueue.Enqueue(n3, n3.Count);
            }

            _root = priorityQueue.Dequeue();
        }

        public IDictionary<char, string> CreateEncodings()
        {
            var encodings = new Dictionary<char, string>();
            Encode(_root, "", encodings);
            return encodings;
        }

        private void Encode(HuffmanNode node, string path, IDictionary<char, string> encodings)
        {
            if (node.Left != null)
            {
                Encode(node.Left, path + "0", encodings);
                Encode(node.Right, path + "1", encodings);
            } else
            {
                encodings.Add(node.Value, path);    
            }
        }
    }
}



If we were to write our text data to disc in compressed form we would introduce a new function in HuffmanTree.cs which would take the text data and an output stream for which to write to. For each character in the text data it would retrieve the binary encoding from the tree using the method described above and write the correlating bits to the output stream. To be able to read the compressed data we would need to store the state of the tree, but that is beyond the scope of Huffman and this tutorial. :) 

The complete listing for Program.cs is as follows

using System; using System.Collections.Generic; namespace Huffman1 { class Program { static void Main(string[] args) { IDictionary counts = new Dictionary<char, int>(); // ac bca ba z counts.Add(' ', 3); counts.Add('b', 2); counts.Add('a', 3); counts.Add('c', 2); counts.Add('z', 1); counts.Add('\n', 1); HuffmanTree tree = new HuffmanTree(counts); IDictionary<char, string> encodings = tree.CreateEncodings(); foreach (KeyValuePair<char, string> kvp in encodings) { Console.WriteLine((kvp.Key == '\n' ? "EOF" : kvp.Key.ToString()) + ":\t" + kvp.Value); } Console.ReadLine(); } } }
and that's about it. There are, of course, ways to improve performance but I wen't for readability here so I hope you'll forgive me. I hope you learned something new and, as always, you are free to use my code however you like.

Inga kommentarer:

Skicka en kommentar