UCSC BME 194 Winter 2004

Resource-efficient Programming

Program Assignment 5 (Hashing)

(Last Update: 09:29 PST 5 March 2004 )
In class on Monday 2 Feb and Friday 6 Feb, we discussed hash tables and hash functions. For this assignment you will implement hash tables containing strings, and use them to count the frequencies of words in English text.

You will need to report on the time and space used by your data structures (both the string storage and the hash tables).

Based on the analysis we did on Friday, you should use coalesced-chaining to handle overflow, with automatic resizing and rehashing whenever adding to a 100% full table. That is, the hash table entries should look something like

struct string_hash_entry
{   unsigned long key;
    char * word;
    struct string_hash_entry *next;
}
Actually, the above structure won't work, as there is nowhere to store the information the program is intended for---the count of occurrences for each word. It is your choice whether to store that information in the hash table, or to change the object that the hash table points to from a string to a more complicated data object.

The "key" field is the intermediate result of the hash function computation that we discussed on Monday. It should be a function of the word, but not of the hash table size. It is useful both for quicker comparison of words (no need to compare strings if the keys don't match), and for faster recomputation of the hash functions when rehashing to a larger table size.

You will need to write two programs for this assignment: one a scaffolding to test the hash table and the other the word-counting program.

Scaffolding

The scaffolding should test the time for adding a new word to the hash table and for finding a word that is already in the hash table. It should add enough words to the hash table to trigger hashtable overflow and reallocation, to test out those features. Note that simply calling "realloc" on the hashtable will not work---you need to create a new hash table and rehash all the existing entries.

In addition to random testing, you should do some tests with two deliberately damaged hash functions: one which sets all keys to zero (hence disabling both the quick key check and making all items hash to bin 0) and one that leaves the key computation alone, but hashes all keys to the same bin. These "damaged hash" tests are likely to uncover errors in the overflow handling, and timing them tells you the possible worst-case behavior for the hash table.

Word count program

Using the hash tables you implemented, write a simple program that reads plain text and converts it into words. For simplicity, you can treat all punctuation and white-space as word breaks, and convert all words to lower-case letters.

The program should accept text from stdin and print to standard out the total number of words, the number of distinct words, and a list of the most frequent 100 words in the input text in decreasing order of frequency, with their frequencies. I've written a crude perl program, count-words to do this task (a better choice than C for this sort of simple task, since perl has built-in strings, regular expressions, and hash tables, and the data we're dealing with is too small to really need the efficiency of carefully tailored code).

If you run that program (or yours, of course) on this .html file, you should see "the" as the most frequent word, and "to" as the second most frequent. The third most frequent ("hash") is not normal for most English text, but is unsurprising given the topic of this assignment. Other common words appear early in the list ("and", "a", "of", "for", ...). There will be some words that have identical frequencies---you can output them in any order. Because the 100-word cutoff may occur in the middle of a list of identical-frequency words, you might get a slightly different list from correctly running programs.

Bonus points

Try implementing both a modular-division hash function and a multiplicative hash function. Experiment with variations on the string-to-key part of the hash function also.

Try implementing other hash table styles (such as simple chaining or linear probing), and compare the speeds of the implementation with similar amounts of memory. If you use malloc to assign the linked-list nodes for simple chaining, remember to include the overhead of the extra pointer(s) that malloc uses to manage its objects.


slug icon to go to School of Engineering home page SoE home     UCSC Bioinformatics Home Page     BME 194 home page    

Questions about page content should be directed to

Kevin Karplus
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250