Program 3: Extracting Anchors (Links)

Conceptually, this program is similar to program one. In program one you looked for URLs, identified by the fact that they started with "http:". For this assignment you will look for actual HTML links (called anchors), indicated by the presence of "A tags". The URL of the web page to search will be specified on the command line.

Specifically your program should recognize strings of the form: "<a href=url>". There can be extra spaces before "href", around the "=" and before the closing ">". There can also be double quotation marks around the url.

All tag designators and URLs are INSENSITIVE to case, so "<A hREf=http://wWw.soe.UcSc.edu>" works just the same as "<a href=http://www.soe.ucsc.edu>". A good method to know about is String.toLowerCase().
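
One way to ignore case when searching, without losing the original capitalization of the URL you eventually print, is to search a lowercased copy of the page and then use the index you find in the original string. The following is only a sketch: the class and method names are made up for illustration, and it assumes the page is plain ASCII so that toLowerCase() does not change the length of the string.

```java
final class CaseInsensitiveFind {
    /**
     * Returns the index of the first "<a" that is followed by whitespace,
     * searching at or after 'from', or -1 if there is none.
     * The index is valid in the ORIGINAL page string as well.
     */
    static int findAnchorStart(String page, int from) {
        String lower = page.toLowerCase();   // search copy only
        int i = lower.indexOf("<a", from);
        while (i >= 0) {
            int next = i + 2;
            // Require whitespace after "<a" so that "<abbr" is not matched.
            if (next < lower.length()
                    && Character.isWhitespace(lower.charAt(next))) {
                return i;
            }
            i = lower.indexOf("<a", i + 1);
        }
        return -1;
    }
}
```

Because the index comes from a copy of the same length, page.substring(...) at that index still yields the text exactly as it appeared on the page.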

A url may or may not begin with a protocol specifier such as "http:" or "ftp:". The protocol specifier will then be followed by a path. You will need to recognize three different forms for the path.

  1. It may be a fully specified path of the form "//hostname/theRest", e.g. http://soe.ucsc.edu/classes/cmps012a, where hostname is the name of the server and theRest is the path to the page on that server. The hostname may be empty, in which case the hostname defaults to localhost (that is literally the string "localhost").
  2. It may be relative to a base, indicated by beginning with something other than a slash "/", e.g. hw/prog1.html.
  3. It may be an absolute path but from the same web server, indicated by an initial slash "/" followed by something other than a "/", e.g. /classes/cmps012a.

If a protocol is specified (indicated by the presence of a colon) and it is anything other than "http:", simply ignore the anchor and print nothing. If there is no protocol, assume the protocol is "http:" and prepend it to the url before printing.
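
The protocol rule can be sketched as a small method that either hands back the path part of the url or signals that the anchor should be skipped. This is an illustration only: the names are hypothetical, and it takes the rule above literally, treating any colon in the url as the end of a protocol specifier.

```java
final class ProtocolSketch {
    /**
     * Returns the path part of the url (everything after "http:"),
     * the whole url if it has no protocol, or null if the url uses
     * a protocol other than http and should be ignored.
     */
    static String pathIfHttp(String url) {
        int colon = url.indexOf(':');
        if (colon < 0) {
            return url;                       // no protocol: all of it is path
        }
        String protocol = url.substring(0, colon + 1);
        if (!protocol.equalsIgnoreCase("http:")) {
            return null;                      // e.g. "ftp:" or "mailto:"
        }
        return url.substring(colon + 1);      // drop "http:", keep the path
    }
}
```

For example, pathIfHttp("http://soe.ucsc.edu/classes") gives "//soe.ucsc.edu/classes", while pathIfHttp("ftp://soe.ucsc.edu/pub") gives null.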

Case one is the easiest: if the path begins with two slashes followed by a non-slash, just print the url (prepending "http:" if necessary).

For case two you must prepend the base url onto the path value to get the complete url for printing. Determining the base url is discussed below.

For case three you must prepend "//hostname" onto the path value to get the complete url for printing. Determining the hostname is discussed below.
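
Putting the three cases side by side, a rough sketch might look like the following. The names are hypothetical. It assumes the path has already had any "http:" removed, that base is a complete url (including "http:") ending at the point where a relative path should be attached, and that hostname has already been determined. It also folds in the empty-hostname rule from case one.

```java
final class ResolveSketch {
    /** Builds the complete url to print for one of the three path forms. */
    static String resolve(String path, String base, String hostname) {
        if (path.startsWith("//")) {
            // Case one: fully specified path.
            String rest = path.substring(2);
            if (rest.isEmpty() || rest.startsWith("/")) {
                return "http://localhost" + rest;   // empty hostname
            }
            return "http:" + path;
        } else if (path.startsWith("/")) {
            // Case three: absolute path on the same server as the base.
            return "http://" + hostname + path;
        } else {
            // Case two: relative to the base, prepended as-is.
            return base + path;
        }
    }
}
```

Checking for "//" before "/" matters, since a path that begins with two slashes also begins with one.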

Base for Relative URLs

Initially the base for relative URLs is the URL for the page being read. This can be changed by the presence of one or more base tags of the following form: "<base href=url>". When such a tag is encountered, the base is changed to the specified URL. These tags may contain extra spaces just like the anchor tags discussed above. Notice that the "<a>" tags and the "<base>" tags have the exact same form except for the tag name. Your program should take advantage of this and include a method to capture this repeated functionality (i.e. the recognition and extraction of the URL following the "href" attribute name).
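
To illustrate the kind of shared method meant here, the sketch below takes the text of a single tag (from "<" through ">") plus a tag name, and returns the href value or null. It is one possible shape, not the sample solution: the names hrefOf and skipSpaces are invented, and it assumes href is the first and only attribute in the tag, as in the forms described above.

```java
final class HrefSketch {
    /**
     * Returns the url in a tag of the form <tagName href=url>, allowing
     * extra spaces and optional double quotes, or null if the tag does
     * not have that form. Matching is case-insensitive; the url itself
     * is returned with its original capitalization.
     */
    static String hrefOf(String tag, String tagName) {
        String lower = tag.toLowerCase();
        String open = "<" + tagName.toLowerCase();
        if (!lower.startsWith(open)) return null;
        int i = skipSpaces(lower, open.length());
        if (i == open.length()) return null;       // no space after tag name
        if (!lower.startsWith("href", i)) return null;
        i = skipSpaces(lower, i + 4);
        if (i >= lower.length() || lower.charAt(i) != '=') return null;
        i = skipSpaces(lower, i + 1);
        int end = tag.lastIndexOf('>');
        if (end < i) return null;                  // no closing ">"
        String value = tag.substring(i, end).trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);  // strip quotes
        }
        return value;
    }

    /** Returns the index of the first non-whitespace character at or after i. */
    static int skipSpaces(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
        return i;
    }
}
```

The same method then serves both tags: hrefOf(tag, "a") for anchors and hrefOf(tag, "base") for base tags.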

Hostname for local absolute paths

The hostname in an http url path is the string between the "//" and the next slash "/". If the string is empty (e.g. http:///) then the hostname is "localhost". If there is no third slash then the entire string after the "//" is the hostname. If the path begins with a single slash followed by a non-slash, this is an absolute url and the hostname is the same as the hostname in the base (see the previous section).
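
The hostname rules can be sketched in a few lines. As before, the names are hypothetical; returning null when the url contains no "//" at all is an assumption of this sketch, since that situation is not covered by the rules above.

```java
final class HostnameSketch {
    /**
     * Returns the hostname of a url such as http://host/path:
     * the text between "//" and the next "/", or everything after
     * "//" if there is no further slash. An empty hostname becomes
     * "localhost". Returns null if the url contains no "//".
     */
    static String hostnameOf(String url) {
        int start = url.indexOf("//");
        if (start < 0) return null;           // assumption: nothing to extract
        start += 2;                            // skip past the "//"
        int end = url.indexOf('/', start);     // the "third slash", if any
        String host = (end < 0) ? url.substring(start)
                                : url.substring(start, end);
        return host.isEmpty() ? "localhost" : host;
    }
}
```

Applied to the current base, this gives the hostname to use for paths that begin with a single slash.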

Sample Execution

You can find a sample solution in /afs/cats/courses/cmps012a-cm/prog2. The program is the file PrintAnchors.class. To run the sample solution type:
java PrintAnchors startingURL
Here is a test page with a variety of tags (the output for this page is shown below). Note that the "localhost" links are "broken" (by design), in that they would only be valid if you happen to be running a web browser while logged into the soe web server.

Here are some sample executions. The third one shows that you can actually use the "file:" protocol for testing on your local machine without a web server. You will of course have to change the path to make sense for your machine.

os-prompt% java PrintAnchors http://www.soe.ucsc.edu/classes/cmps012a/Fall06/hw/Hw3TestPage.html
http://www.soe.ucsc.edu/classes
http://www.soe.ucsc.edu/classes
http://www.soe.ucsc.edu/classes
http://www.soe.ucsc.edu/classes/cmps012a/Fall06/hw/prog1.html
http://www.soe.ucsc.edu/classes/cmps012a/Fall06/hw/prog1.html
http://www.soe.ucsc.edu/classes/cmps012a/Fall06/labinfo/index.html
http://www.soe.ucsc.edu/classes/cmps012a/Fall06/labinfo/index.html
http://localhost/classes/cmps012a/Fall06/hw/prog2.html
http://localhost/classes/cmps012a/Fall06/hw/prog2.html
os-prompt%
os-prompt% java PrintAnchors http://www.soe.ucsc.edu/~charlie
http://www.soe.ucsc.edu/~charlie/official.html
http://www.soe.ucsc.edu/~charlie/personal.html
http://www.soe.ucsc.edu/~charlie/projects/index.html
http://www.soe.ucsc.edu/~charlie/classes
http://www.soe.ucsc.edu/~charlie/jarel
http://www.mtsu.edu/~untch/karel/
http://www.lulu.com/JavaByDissection
http://www.cse.ucsc.edu/~pohl/java.html
http://www.soe.ucsc.edu/~charlie/research.html
http://www.soe.ucsc.edu/~charliedirections.html
os-prompt%
os-prompt% java PrintAnchors file://localhost/Users/charlie/class/12a/webpage/index.html
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/labinfo
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/faq
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/supplements/
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/hw
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/stars
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/exams
http://ic.ucsc.edu/docs/webct/create-account.php
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/notes
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/labinfo
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/labinfo
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/supplements/supplements.html
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/faq/faq.html
http://www.lulu.com/JavaByDissection
http://www.abebooks.com/servlet/SearchResults?&isbn=0201725991&nsa=1
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/hw/pairProgramming.html
http://www.cs.berkeley.edu/~aiken/moss.html
http://oasas.ucsc.edu/avcue/integrity/
http://ic.ucsc.edu:8000/webct/public/home.pl
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/hw
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/hw
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/labinfo
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/faq
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/supplements/
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/hw
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/stars
http://www.cse.ucsc.edu/classes/cmps012a/Fall06/exams

Correctness Points

10 points - makes appropriate use of methods. For this assignment a STRICT limit of 30 lines per method will be enforced. If any method exceeds 30 lines you will lose these 10 points of correctness. On future assignments this is relegated to style. If you sacrifice style to get a method under 30 lines (e.g. by putting multiple statements on one line), you will still be penalized. The opening block comment for a method does NOT count towards the 30 lines.

5 points - properly handles tags without extra spaces or quotation marks

5 points - properly handles a-tags with extra spaces

5 points - properly handles a-tags with quotation marks

5 points - properly adjusts the base for base-tags

5 points - properly abstracts href attribute parsing with a method

5 points - properly handles fully specified url paths

5 points - properly handles url paths relative to a base

5 points - properly handles absolute url paths without an explicit server name