UCSC BME 194 Winter 2004

Resource-efficient Programming

Program Assignment 4 (heaps and heapsort)

(Last Update: 10:45 PST 31 January 2004 )
In class on Friday 23 Jan and Friday 30 Jan, we discussed the heap data structure. For this assignment you will implement heaps, use them to do two variants of heapsort, and time the results. The times should be compared with quicksort and insertion sort, as in the previous assignment.

Start by creating code for a heap data structure. Once again, we're going to cheat a bit and have only int-valued objects in the heap---not pointers to more complex data objects as one would normally have.

The data structure for the heap should look something like

struct _heap
{   int * data;
    int NumHeap;	// number of elements in the heap
    int AllocHeap;	// size of data array (NumHeap<=AllocHeap)
}

In the comments for your code, be sure to keep track of where the heap property is valid. The heap property is that data(parent)>=data(child). Since we are starting our numbering at 0, this means data[i]>=data[2*i+1] and data[i]>=data[2*i+2]. There are other important properties of the heap to maintain, as discussed in class (like using data[0]...data[NumHeap-1]).

Implement the following operations (using procedures or macros as appropriate)

h = new_heap(int a)
Allocate a new empty heap of size a and return a pointer to it.
free_heap(struct _heap *h)
free all the memory used in h. Note: it is a good idea to set h to 0 if you can, to avoid having dangling pointers.
push_heap(struct _heap *h, int x)
push x onto the heap, maintaining the heap property.
x = top_heap(const struct _heap *h)
return the largest value in the heap, without changing the heap.
x = pop_heap(struct _heap *h)
return the largest value in the heap, removing it from the heap, and maintaining the heap property.
h = heapify(int *A, int num)
create a heap object in which data=A, AllocHeap=NumHeap=num, and the array A has been reorganized to have the heap property. Note that I want data to be a pointer to A, not to a newly allocated copy of A. This means that h doesn't really own data, and so should not be used with free_heap(h). Discuss the advantages and disadvantages of defining heapify in this way.

I actually want you to do two implementations of heapify, one of which builds the heap from zero up, the other from the middle down.

The upward building heap has a loop in which NumHeap grows from 1 to num, with the heap property always holding for 0..NumHeap. This can be implemented as a very simple loop:

    h->NumHeap=1;
    while(h->NumHeapNumHeap]);

The downward building heap has a loop that counts downward, and maintains a loop invariant that has the heap property holding for [i..num]. The loop here is a bit trickier to write, but still fairly simple.

Build scaffolding to test out your heap manipulation routines. Use assertions as appropriate in your code to make sure that you have robust primitive operations.

Once your heaps are working correctly, write a heapsort routine that heapifies an array, then does repeated pop_heaps to fill the array from the end. You should have two versions of heapsort, corresponding to your two heapify implementations.

Time the heapsorts and compare them to your quicksort and insertion sort implementations. Put the results in a README file.

BONUS POINTS

If heap sort is too easy for you, try doing Huffman coding as well. Rewrite your heap code to allow heaps with pointers to real data. Use the heaps to build a Huffman tree, and output a Huffman code for a given probability vector. You may want your heaps to contain tree nodes as described in class:
struct tree_node
{   struct tree_node *left, *right;
    float weight;
    char name;		// only the leaves have names
}

Input format: a file of entries, one per line. There may be any number of entries, terminated by the end of the file. Each entry consists of a single character in the first column of the line, followed by white space, followed by the probability of that character, for example

A  0.1
B  0.6
C  0.3

Output format: First report the entropy of the input (- sum P(i) log_2 P(i)) which is a lower bound on the encoding cost for the alphabet, then report the actual average encoding cost (sum P(i) bits(code(i))). Finally, report the Huffman code, one character per line. Each entry consists of a single character (as in the input), followed by a tab, followed by the probability of the character followed by a tab, followed by the Huffman code for the character. For example,

Entropy = 1.2955 bits
Encoding cost= 1.4 bits
A	0.1	00
B	0.6	1
C	0.3	01

Demonstrate that the program is working by doing good scaffolding to test it, particularly in boundary cases. Time it. Profile it with gprof.

More bonus points

If you still find that time is weighing heavy on your hands, you can improve the Huffman coding program to handle word-based codes instead of character-based ones. Have one word per line in the input and output instead of one character. For example,
the	0.1
of	0.1
program	0.01
...
You might want to redesign the input and output formats for this task, since some encoding schemes could have arbitrary strings to encode, and having white-spaces in the strings might break the program.

slug icon to go to School of Engineering home page SoE home     UCSC Bioinformatics Home Page     BME 194 home page    

Questions about page content should be directed to

Kevin Karplus
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250