You will need to report on the time and space used by your data structures (both the string storage and the hash tables).
Based on the analysis we did on Friday, you should use coalesced-chaining to handle overflow, with automatic resizing and rehashing whenever adding to a 100% full table. That is, the hash table entries should look something like
struct string_hash_entry
{ unsigned long key;
char * word;
struct string_hash_entry *next;
}
Actually, the above structure won't work, as there is nowhere to store
the information the program is intended for---the count of occurrences
for each word. It is your choice whether to store that information in
the hash table, or to change the object that the hash table points to
from a string to a more complicated data object.
The "key" field is the intermediate result of the hash function computation that we discussed on Monday. It should be a function of the word, but not of the hash table size. It is useful both for quicker comparison of words (no need to compare strings if the keys don't match), and for faster recomputation of the hash functions when rehashing to a larger table size.
You will need to write two programs for this assignment: one a scaffolding to test the hash table and the other the word-counting program.
In addition to random testing, you should do some tests with two deliberately damaged hash functions: one which sets all keys to zero (hence disabling both the quick key check and making all items hash to bin 0) and one that leaves the key computation alone, but hashes all keys to the same bin. These "damaged hash" tests are likely to uncover errors in the overflow handling, and timing them tells you the possible worst-case behavior for the hash table.
The program should accept text from stdin and print to standard out the total number of words, the number of distinct words, and a list of the most frequent 100 words in the input text in decreasing order of frequency, with their frequencies. I've written a crude perl program, count-words to do this task (a better choice than C for this sort of simple task, since perl has built-in strings, regular expressions, and hash tables, and the data we're dealing with is too small to really need the efficiency of carefully tailored code).
If you run that program (or yours, of course) on this .html file, you should see "the" as the most frequent word, and "to" as the second most frequent. The third most frequent ("hash") is not normal for most English text, but is unsurprising given the topic of this assignment. Other common words appear early in the list ("and", "a", "of", "for", ...). There will be some words that have identical frequencies---you can output them in any order. Because the 100-word cutoff may occur in the middle of a list of identical-frequency words, you might get a slightly different list from correctly running programs.
Try implementing both a modular-division hash function and a multiplicative hash function. Experiment with variations on the string-to-key part of the hash function also.
Try implementing other hash table styles (such as simple chaining or
linear probing), and compare the speeds of the implementation with
similar amounts of memory. If you use malloc to assign the
linked-list nodes for simple chaining, remember to include the
overhead of the extra pointer(s) that malloc uses to manage its
objects.
Questions about page content should be directed to
Kevin Karplus