UCSC BME 194 Winter 2004

Resource-efficient Programming

Program Assignment 6 memory use

(Last Update: 14:15 PST 21 February 2004 )
Our assignments so far have mainly focused on CPU time. For this assignment, we'll look mainly at memory use, though CPU time will also be of some importance.

On Wed 18 Feb, we talked about multiway trees, giving three different implementations: tries, packed tries, and child-sibling pointers. For this assignment you will implement two of these methods (child-sibling pointers and packed tries) and measure how much space each takes up on a given input.

The program is to read in words from a file, treating white-space and punctuation as separators, and build a multiway tree containing the words (each node other than the root contains one letter and a path from the root to a node contains the prefix of all words in the subtree rooted at that node). On input, you should build the tree using the two-pointer/node representation (first-child and next-sibling pointers).

After reading in the words report the number of nodes in the tree (including the root) and the total memory used by the data structure. Be sure to include any overhead incurred by malloc. You may have to do your allocation through a wrapper around malloc, to keep track of how many calls are made to malloc and of what sizes. Appendix 3 of Programming Pearls gives some ideas how to measure the malloc overhead.

You can also use various tools to measure memory use more directly, but these tend to be very operating-system specific. On i686 Linux machines, you can run your program under "valgrind", not free your data structures before exiting your program, and look at the memory leaks reported by valgrind.

Convert the tree to a packed trie and report the memory needed for it. Also report the memory space that would have been needed for an unpacked trie (this can be computed from number of nodes in the tree, but you have to be explicit about your assumptions---how would leaves have been represented, how would the is_word bit have been stored, and so forth).

Read another input file, looking up each word in the tree, reporting how many words were in the first file and how many are new. Report the time this operation takes with the child-sibling pointer representation and with the packed trie. Note: this is almost certain to be dominated by the I/O time for the file. How can you make a correct timing estimate for just the lookups?

Hints

The child-sibling data structure might look something like
 struct node
 {   struct node* child;
     struct node* right_sibling;
     unsigned char letter;
     char is_word;	// really just a bool, but make sure it is one byte
 }
 
with the "right_sibling" links forming a sorted linked list of possible children of the parent of the node.

The packed-trie data structure might look something like

unsigned char *letter_for_branch;
unsigned int *subscript_for_branch;
with subscripts into the array identifying nodes. The child of i for letter c is subscript_for_branch[i+c], if letter_for_branch[i+c}==c, and 0 otherwise. Alternatively, you could use an array Trie of
struct node
{ 	unsigned char letter;
	unsigned int subscript_for_branch : 24;
}
The child of Trie[i] for letter c is Trie[i+c].subscript_for_branch if Trie[i+c].letter==c and is 0 otherwise.

There are several ways to store the "is_word" information. Here is one: If you assume a 7-bit alphabet, you can pack the "is_word" bit into the letter_for_branch character, either with explicit masking operations or by using C's "bit field" definitions. With this representation, no node is needed for leaves---the subscript_for_branch can be 0 (or whatever you use to represent the null pointer). Alternatively, you could allow only 23 bits for the subscript_for_branch, and steal a bit for is-word from there.

One problem you might run into when packing the tries is making sure that no two trie nodes start at the same place in the array. If you force every trie node to have a branch to the same character (either an illegal character like 01 or a common character like 'e'), then any packing that avoids conflicts will also avoid starting trie nodes at the same place. You could make a somewhat smaller packed trie, at the cost of somewhat more temporary storage, by keeping a separate bitset of which starting locations have been used, and discarding the bitset when the packed trie is complete. You could also steal yet another bit from the subscript_for_branch field, and use it for marking which indices are starting points.

Bonus points

For bonus points, modify the lexical scanner for words to handle mid-word apostrophes (like in "can't") as letters, but not get confused by single quotes. There are several different levels of sophistication that ca be applied to this problem---for this assignment, the simplest is probably best, even if occasional mistakes are made.

Another possible bonus: modify the lexical scanner and tree-builder so that case is served if all occurrences of a word have the same cases. That is, "Santa" would always have a capital "S", so would be preserved as mixed-case, but "the", which might occur in upper- or lower-case letters, would only be kept as lower-case. Again, there are several ways to do this---check for the existence of the lower-case word first, store all words and clean up later, store all words in the tree, but remove duplicates that exist with different case patterns when reducing to the packed trie, and so on.

Another bonus: modify the trie data structure to record a count of how often each word occurs, rather than just a 1-bit record of the occurrence of a word.

Another bonus: compare the memory space for the packed trie to the memory space for a hash table containing the same set of words.


slug icon to go to School of Engineering home page SoE home     UCSC Bioinformatics Home Page     BME 194 home page    

Questions about page content should be directed to

Kevin Karplus
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250