Week 7 (Applications of Recursion and Sets)

  1. Consider the following recursive crawler: week4/simpleRecursiveCrawler.java. Can this visit the same URL more than once? Point it at http://sebastian.doc.gold.ac.uk/a/one.html to see what happens. (Watch video (Recursive Web Crawler))
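
    A minimal sketch of a crawler along these lines, using the standard javax.swing HTML parser (the class name and structure here are illustrative assumptions, not the course's actual week4 code). Note that nothing records which pages have been seen, so a URL linked from several pages is crawled once per link, and a cyclic link structure makes the crawler loop forever:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.InputStreamReader;
      import java.net.URL;

      public class SimpleRecursiveCrawler {
          // Crawl a page: print its URL, then recurse on every link found on it.
          static void crawl(final String url) {
              System.out.println("visiting " + url);
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      // resolve relative links against the current page
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { /* skip malformed links */ }
                              }
                          }
                      }, true);
              } catch (Exception e) { /* skip pages that cannot be fetched or parsed */ }
          }

          public static void main(String[] args) {
              crawl(args[0]);
          }
      }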

  2. Adapt week4/simpleRecursiveCrawler.java so that it takes a single command line argument (a URL) and prints out the number of links it visits. What is the output when you point it at http://sebastian.doc.gold.ac.uk/a/one.html? (See the sketch below.)

    (Watch video (Better Recursive Web Crawler))
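
    The exercise text is garbled at this point, so the sketch below assumes the intended behaviour is to count how many links the crawler follows and print the total at the end; the class name is also an assumption:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.InputStreamReader;
      import java.net.URL;

      public class CountingCrawler {
          static int count = 0;   // how many links the crawler has followed so far

          static void crawl(final String url) {
              count++;
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { }
                              }
                          }
                      }, true);
              } catch (Exception e) { }
          }

          public static void main(String[] args) {
              crawl(args[0]);   // the URL is the single command line argument
              System.out.println(count + " links visited");
          }
      }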

  3. Make a better recursive crawler which visits each URL only once. Do this by keeping track of all visited URLs in a set and checking whether the next URL is in the set before visiting. solution
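
    A minimal sketch of the visited-set idea (class name illustrative): a HashSet<String> records every URL crawled so far, and crawl returns immediately when it meets a URL already in the set, so each page is fetched at most once:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.InputStreamReader;
      import java.net.URL;
      import java.util.HashSet;
      import java.util.Set;

      public class BetterRecursiveCrawler {
          static Set<String> visited = new HashSet<String>();   // every URL crawled so far

          static void crawl(final String url) {
              if (visited.contains(url)) return;   // seen before: do not crawl it again
              visited.add(url);
              System.out.println("visiting " + url);
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { }
                              }
                          }
                      }, true);
              } catch (Exception e) { }
          }

          public static void main(String[] args) {
              crawl(args[0]);
          }
      }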

  4. (Easy Assignment) Write a program that, given a URL u, prints out the set of all URLs which are reachable from u and are in the same domain. The domain is given as a second command line argument. We assume a link is in the domain if its URL contains the domain as a substring. e.g. java sameSite http://gold.ac.uk gold.ac.uk (two command line arguments) should print out all the URLs reachable from http://gold.ac.uk which are in the gold.ac.uk domain. Do not visit links outside the domain (i.e. links which do not contain the second command line argument).

    Hint: Keep a set of already visited links. Before visiting a link, check whether it is in this set. If a link contains the second command line argument, visit it, then add it to the set. What is the output when you point it at http://sebastian.doc.gold.ac.uk/a/one.html? (A sketch appears after this exercise.)

    (Watch video (Help with week 7 easy assignment))

    solution
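
    A sketch of one way to do the assignment, following the hint. The class is named sameSite to match the example command, but everything else here is an assumption rather than the model solution; the visited-set check and the domain test are both applied before a URL is crawled, and the whole set is printed at the end:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.InputStreamReader;
      import java.net.URL;
      import java.util.HashSet;
      import java.util.Set;

      public class sameSite {
          static Set<String> visited = new HashSet<String>();
          static String domain;   // second command line argument

          static void crawl(final String url) {
              // visit each URL at most once, and only if it contains the domain string
              if (visited.contains(url) || !url.contains(domain)) return;
              visited.add(url);
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { }
                              }
                          }
                      }, true);
              } catch (Exception e) { }
          }

          public static void main(String[] args) {
              domain = args[1];                // e.g. java sameSite http://gold.ac.uk gold.ac.uk
              crawl(args[0]);
              System.out.println(visited);     // the reachable in-domain URLs
          }
      }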

  5. Generalise the above so that it takes a whole set of possible prefixes on the command line. Note also that the parser crashes on certain files, for example .tar.gz files: keep a HashSet of bad file suffixes and stop the parser from visiting URLs with those suffixes, to avoid the crashes. solution
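
    A sketch of the generalisation, building on the exercise-4 sketch above. The particular entries in badSuffixes are illustrative (extend the set with whatever crashes the parser for you), and the prefixes are matched with the same "contains" test as exercise 4:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.InputStreamReader;
      import java.net.URL;
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;

      public class sameSites {
          static Set<String> visited = new HashSet<String>();
          static Set<String> prefixes = new HashSet<String>();   // allowed sites, from the command line
          // file suffixes the parser chokes on -- an illustrative list
          static Set<String> badSuffixes =
                  new HashSet<String>(Arrays.asList(".tar.gz", ".zip", ".pdf", ".jpg"));

          static boolean shouldVisit(String url) {
              if (visited.contains(url)) return false;       // visit each URL at most once
              for (String s : badSuffixes)
                  if (url.endsWith(s)) return false;         // would crash the parser
              for (String p : prefixes)
                  if (url.contains(p)) return true;          // inside one of the allowed sites
              return false;
          }

          static void crawl(final String url) {
              if (!shouldVisit(url)) return;
              visited.add(url);
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { }
                              }
                          }
                      }, true);
              } catch (Exception e) { }
          }

          public static void main(String[] args) {
              for (int i = 1; i < args.length; i++)
                  prefixes.add(args[i]);       // java sameSites <url> <prefix> <prefix> ...
              crawl(args[0]);
              System.out.println(visited);
          }
      }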

  6. (Hard Assignment) Write a program that crawls through a site, in the manner of sameSite1.java, and puts all the broken links in a HashSet which it prints out at the end. One way to do this is to catch the Exception thrown by the Parser and then try to open a stream for the URL (see getURL.java). If this throws a FileNotFoundException, the link does not exist. (A sketch follows the solution link.)

    solution
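
    A sketch of the approach the exercise describes, assuming a sameSite-style crawler (names illustrative, and like sameSite it only follows links inside the domain). When the Parser throws, openStream is tried directly on the same URL; a FileNotFoundException there means the link is broken, so the URL goes into the broken set, which is printed at the end:

      import javax.swing.text.MutableAttributeSet;
      import javax.swing.text.html.HTML;
      import javax.swing.text.html.HTMLEditorKit;
      import javax.swing.text.html.parser.ParserDelegator;
      import java.io.FileNotFoundException;
      import java.io.InputStreamReader;
      import java.net.URL;
      import java.util.HashSet;
      import java.util.Set;

      public class brokenLinks {
          static Set<String> visited = new HashSet<String>();
          static Set<String> broken = new HashSet<String>();
          static String domain;

          static void crawl(final String url) {
              if (visited.contains(url) || !url.contains(domain)) return;
              visited.add(url);
              try {
                  new ParserDelegator().parse(
                      new InputStreamReader(new URL(url).openStream()),
                      new HTMLEditorKit.ParserCallback() {
                          public void handleStartTag(HTML.Tag t,
                                  MutableAttributeSet a, int pos) {
                              Object href = a.getAttribute(HTML.Attribute.HREF);
                              if (t == HTML.Tag.A && href != null) {
                                  try {
                                      crawl(new URL(new URL(url), href.toString()).toString());
                                  } catch (Exception e) { }
                              }
                          }
                      }, true);
              } catch (Exception e) {
                  // the parser failed: try opening the URL directly to see whether it exists
                  try {
                      new URL(url).openStream().close();
                  } catch (FileNotFoundException notFound) {
                      broken.add(url);             // 404: the link is broken
                  } catch (Exception other) { /* unreadable for some other reason */ }
              }
          }

          public static void main(String[] args) {
              domain = args[1];                    // java brokenLinks <url> <domain>
              crawl(args[0]);
              System.out.println("broken links: " + broken);
          }
      }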

s.danicic@gold.ac.uk
Sebastian Danicic BSc MSc PhD (Reader in Computer Science)
Dept of Computing, Goldsmiths, University of London, London SE14 6NW
Last updated 2015-09-04