Pragram 5: Mini Search Engine

For this project you will build the beginnings of a small web search engine. You will do this by creating an index of all of the words found within a specific web site. The limits of the "web site" will be the same as in project 4. In project 4 you visited each page, looked for links, and checked that the link was valid.

In this project, in addition, for all pages visited within the specified web site, you will create an index of words found in those pages. You will then provide a simple command line interface that allows the user to enter a search word and your program will then print the urls for all pages within the site that contain that word.

For the minimal requirments of this assignment a "word" is any white space delimited string of characters. The output format is as shown in the sample execution below.

The first thing the program should print is the number of pages found in the site. Then a "Ready." prompt (just the first time). After that just print the urls for the pages that contain the selected word or if no pages contain the word, then print the message: "Word not found: xxx" where xxx is the word that was not found.

Sample Execution and Solution

You can find a sample solution in /afs/cats.ucsc.edu/courses/cmps012a-cm/prog5. (At last check it was still not working because the new ucsc.jar file was not installed. 11-20-06)

The sample solution also accepts a -d flag as the second command line arugment which causes it to dump out the entire index. Your program is not required to do this but you might want to do it for debugging.

os-prompt% java WebIndex http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
There are 9 pages in this site.
Ready.
unique
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
t3-html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
some
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t2.html
"word"
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
word
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
charlie
Word not found: charlie
the
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t2c.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t5.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/

Getting Started

The first thing you should do is modify your program one to simply print out the list of sites that were successfully visited instead of printing out the bad urls. Create a method that takes in the initial url as a String and returns an ArrayList of Strings containing the URLs of the sites visited. Then just print this list.

When that is working you will want to modify your program at the point where it is visiting a single page. Before looking for valid a-tags in a line from a page, first update your index (see below) for each word on that line, recording the fact that the current page contains each of those words. Then just process the line as you did before, looking for a-tags.

A map (as used here) is a table that can be used to convert (map) one value (the key) into another value (the value). For example you might have a map the maps a city name into its longitude and latitude, or a map that maps a persons name into their phone number. For this project we need a map that maps a word into a list of pages that contain that word. For this we can use the standard Java class java.util.HashMap where Key and Value will be the types of the key and the value. In this project the Key type will be String (remember the keys are the words on the page - i.e. the things we want to look up in the index) and the Value type will be a java.util.HashSet. I recommend you use a HashSet instead of an ArrayList because the HashSet will automatically take care of duplicates. That is, if you add the same page (url string) to a particular set multiple times, the set will only store it once. We don't care how many times a word appears on a page only whether or not it appears at all.

This may appear a bit daunting at first but hopefully with a bit of effort you can see what this is doing. The type of your index will be:

HashMap<String,HashSet<String>>
and the line to declare and intialize the index is:
HashMap<String,HashSet<String>> index = new HashMap<String,HashSet<String>>();
Here is a program, IndexStarter.java, that computes an index similar to what you need to be doing. In this simple program the index is a map from words (like your program) to sets of line numbers (instead sets of urls as for your program). Once you understand this program, you should be able to make the changes necessaary to incorporate parts of this sample program into your project.

I have also added one new method to ucsc.WebPage. The method is isValidHtml(). This method should be used instead of isValid(). It includes an additional check to try an ascertain if the page appears to contain html text. This is only an approximation but is sufficient for this project. It has the advantage that you will not be attempting to add the "words" from non-html pages to your index. If you are working at home be sure and update your copy of ucsc.jar.

Correctness Points