In this project you will, in addition, create an index of the words found in all pages visited within the specified web site. You will then provide a simple command line interface that allows the user to enter a search word; your program will then print the urls of all pages within the site that contain that word.
For the minimal requirements of this assignment, a "word" is any white space delimited string of characters. The output format is as shown in the sample execution below.
The first thing the program should print is the number of pages found in the site, followed by a "Ready." prompt (printed just the first time). After that, for each word the user enters, print the urls of the pages that contain that word or, if no pages contain the word, print the message: "Word not found: xxx" where xxx is the word that was not found.
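The query behavior described above can be sketched roughly as follows. This is only an illustration, not the sample solution: the class name QueryLoopSketch, the helper answer(), the example.com urls, and the hard-coded page count are all hypothetical, and the index is assumed to have the HashMap<String,HashSet<String>> type described later in this handout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Scanner;

public class QueryLoopSketch {
    // Build the text to print for one search word: either the urls of the
    // pages containing it (one per line) or the "Word not found" message.
    static String answer(HashMap<String, HashSet<String>> index, String word) {
        HashSet<String> urls = index.get(word);
        if (urls == null || urls.isEmpty()) {
            return "Word not found: " + word;
        }
        StringBuilder out = new StringBuilder();
        for (String url : urls) {
            if (out.length() > 0) {
                out.append("\n");
            }
            out.append(url);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the index your crawler would build.
        HashMap<String, HashSet<String>> index =
            new HashMap<String, HashSet<String>>();
        HashSet<String> pages = new HashSet<String>();
        pages.add("http://example.com/a.html");
        pages.add("http://example.com/b.html");
        index.put("some", pages);
        int pageCount = 2; // assumed; your program counts the pages it visits

        System.out.println("There are " + pageCount + " pages in this site.");
        System.out.println("Ready."); // printed only once
        Scanner in = new Scanner(System.in);
        while (in.hasNext()) { // next() returns one white space delimited word
            System.out.println(answer(index, in.next()));
        }
    }
}
```

Because a HashSet has no defined order, the urls for a word may print in any order.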
The sample solution also accepts a -d flag as the second command line argument, which causes it to dump out the entire index. Your program is not required to do this, but you might want to do it for debugging.
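If you do add a debugging dump, one possible shape is sketched below. The output format of the sample solution's dump is not shown in this handout, so the "word -> urls" format here is purely an assumption, as are the names DumpSketch, dumpRequested(), and dump(). The index is assumed to have the HashMap<String,HashSet<String>> type described later in this handout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.TreeSet;

public class DumpSketch {
    // True when the second command line argument is "-d".
    static boolean dumpRequested(String[] args) {
        return args.length > 1 && args[1].equals("-d");
    }

    // Render the index one word per line. Words and urls are sorted only so
    // that the debugging output is stable from run to run.
    static String dump(HashMap<String, HashSet<String>> index) {
        StringBuilder out = new StringBuilder();
        for (String word : new TreeSet<String>(index.keySet())) {
            out.append(word).append(" ->");
            for (String url : new TreeSet<String>(index.get(word))) {
                out.append(" ").append(url);
            }
            out.append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical one-entry index standing in for the real one.
        HashMap<String, HashSet<String>> index =
            new HashMap<String, HashSet<String>>();
        HashSet<String> pages = new HashSet<String>();
        pages.add("http://example.com/a.html");
        index.put("unique", pages);

        if (dumpRequested(args)) {
            System.out.print(dump(index));
        }
    }
}
```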
os-prompt% java WebIndex http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
There are 9 pages in this site.
Ready.
unique
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
t3-html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
some
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t2.html
"word"
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
word
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t1.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
charlie
Word not found: charlie
the
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t2c.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t5.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/t3.html
http://www.soe.ucsc.edu/~charlie/misc/f06prog4/
When that is working you will want to modify your program at the point where it is visiting a single page. Before looking for valid a-tags in a line from a page, first update your index (see below) for each word on that line, recording the fact that the current page contains each of those words. Then just process the line as you did before, looking for a-tags.
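As a rough illustration of the "each word on that line" step, here is one way a line might be broken into white space delimited words. The class name WordSplitSketch and the helper wordsIn() are hypothetical; java.util.Scanner or java.util.StringTokenizer would work just as well.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitSketch {
    // Split one line of page text into white space delimited "words".
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<String>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) { // a blank line yields one empty token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Punctuation stays attached, so "word" (with quotes) and word
        // are different entries, as in the sample execution.
        for (String w : wordsIn("  some \"word\" or word in t3-html ")) {
            System.out.println(w);
        }
    }
}
```

Each word returned this way would then be recorded in the index against the url of the page currently being visited.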
A map (as used here) is a table that can be used to convert (map) one value (the key)
into another value (the value). For example, you might have a map that maps a
city name into its longitude and latitude, or a map that maps a person's name into their
phone number. For this project we need a map that maps a word into a list of pages that
contain that word. For this we can use
the standard Java class java.util.HashMap.
This may appear a bit daunting at first, but hopefully with a bit of effort you can
see what this is doing. The type of your index will be:
HashMap<String,HashSet<String>>
and the line to declare and initialize the index is:
HashMap<String,HashSet<String>> index = new HashMap<String,HashSet<String>>();
I have also added one new method to ucsc.WebPage. The method is isValidHtml().
This method should be used instead of isValid(). It includes an additional check
to try to ascertain whether the page appears to contain html text. This is only an
approximation but is sufficient for this project. It has the advantage that you will
not be attempting to add the "words" from non-html pages to your index.
If you are working at home, be sure to update your copy of
ucsc.jar.
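To make the HashMap<String,HashSet<String>> index more concrete, here is a minimal sketch of recording that a page contains a word. The class name IndexUpdateSketch, the helper record(), and the example.com urls are hypothetical and are not part of ucsc.jar or the sample solution.

```java
import java.util.HashMap;
import java.util.HashSet;

public class IndexUpdateSketch {
    // Record that the page at url contains word, creating the set of pages
    // for that word the first time the word is seen.
    static void record(HashMap<String, HashSet<String>> index,
                       String word, String url) {
        HashSet<String> pages = index.get(word);
        if (pages == null) {
            pages = new HashSet<String>();
            index.put(word, pages);
        }
        pages.add(url); // a set ignores repeats of the same url
    }

    public static void main(String[] args) {
        HashMap<String, HashSet<String>> index =
            new HashMap<String, HashSet<String>>();
        record(index, "unique", "http://example.com/t1.html");
        record(index, "unique", "http://example.com/t3.html");
        record(index, "unique", "http://example.com/t1.html"); // repeat
        System.out.println(index.get("unique").size());
    }
}
```

Using a HashSet for the pages means a word that appears many times on one page still lists that page only once.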
Here is a program,
IndexStarter.java,
that computes an index similar to what you need to be doing.
In this simple program the index is a map from words (like your program) to
sets of line numbers (instead of sets of urls, as in your program).
Once you understand this program, you should be able to make the changes necessary
to incorporate parts of this sample program into your project.
Correctness Points