- Understand the design of a Web indexer and fill in missing methods.
- Choose appropriate data structures from the Java Collections Framework (JCF).
- Use Java's
Map
interface and theHashMap
implementation. - Use the
Set
interface and theHashSet
implementation.
At this point we have built a basic Web crawler; the next piece we will work on is the index. In the context of Web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.
For example, if a user submits the search terms "Java" and "programming", we would look up both search terms and get two sets of pages. Pages with the word "Java" would include pages about the island of Java, the nickname for coffee, and the programming language. Pages with the word "programming" would include pages about different programming languages, as well as other uses of the word. By selecting pages with both terms, we hope to eliminate irrelevant pages and find the ones about Java programming.
Now that we understand what the index is and what operations it performs, we can design a data structure to represent it.
The fundamental operation of the index is a lookup; specifically, we need the ability to look up a term and find all pages that contain it. The simplest implementation would be a collection of pages. Given a search term, we could iterate through the contents of the pages and select the ones that contain the search term. But the runtime would be directly proportional to the total number of words on all the pages, which would be way too slow.
A better alternative is a map, which is a data structure that represents a collection of key-value pairs and provides a fast way to look up a key and find the corresponding value. For example, the first map we'll construct is a TermCounter
, which maps from each search term to the number of times it appears in a page. The keys are the search terms and the values are the counts (also called "frequencies").
Java provides an interface called Map
that specifies the methods a map should provide; the most important are:
-
get(key)
: This method looks up a key and returns the corresponding value. -
put(key, value)
: This method adds a new key-value pair to theMap
; or if the key is already in the map, it replaces the value associated withkey
.
Java provides several implementations of Map
, including the two we will focus on, HashMap
and TreeMap
. In upcoming labs, we'll look at these implementations and analyze their performance.
In addition to the TermCounter
, which maps from search terms to counts, we will define a class called Index
, which maps from a search term to a collection of pages where it appears. And that raises the next question, which is how to represent a collection of pages. Again, if we think about the operations we want to perform, that guides our decision.
In this case, we'll need to combine two or more collections and find the pages that appear in all of them. You might recognize this operation as set intersection: the intersection of two sets is the set of elements that appear in both.
As you might expect by now, Java provides a Set
interface that defines the operations a set should perform. It doesn't actually provide set intersection, but it provides methods that make it possible to implement intersection and other set operations efficiently. The core Set
methods are:
-
add(element)
: This method adds an element to a set; if the element is already in the set, it has no effect. -
contains(element)
: This method checks whether the given element is in the set.
Again, Java provides several implementation of Set
; the ones we will focus on are HashSet
and TreeSet
. We'll come back to them later.
Now that we've designed our data structures from the top down, we'll implement them from the inside out, starting with TermCounter
.
TermCounter
is a class that represents a mapping from search terms to the number of times they appear in a page. Here is the first part of the class definition:
public class TermCounter {
private Map<String, Integer> map;
private String label;
public TermCounter(String label) {
this.label = label;
this.map = new HashMap<String, Integer>();
}
}
The instance variables are map
, which contains the mapping from terms to counts, and label
, which identifies the document the terms came from; we'll use it to store URLs.
To implement the mapping, I chose HashMap
, which is the most commonly-used Map
. Coming up in a few lessons, you will see how it works and why it is a common choice.
TermCounter
provides put
and get
, which are defined like this:
public void put(String term, int count) {
map.put(term, count);
}
public Integer get(String term) {
Integer count = map.get(term);
return count == null ? 0 : count;
}
put
is just a wrapper method; when you call put
on a TermCounter
, it calls put
on the embedded map.
On the other hand, get
actually does some work. When you call get
on a TermCounter
, it calls get
on the map, and then it checks the result. If the term does not appear in the map, TermCount.get
returns 0. Defining get
this way makes it easier to write incrementTermCount
, which takes a term and increases by one the counter associated with that term.
public void incrementTermCount(String term) {
put(term, get(term) + 1);
}
If the term has not been seen before, get
returns 0; we add 1, then use put
to add a new key-value pair to the map. If the term is already in the map, we get the old count, add 1, and then store the new count, which replaces the old value.
In addition, TermCounter
provides these other methods to help with indexing Web pages:
public void processElements(Elements paragraphs) {
for (Node node: paragraphs) {
processTree(node);
}
}
public void processTree(Node root) {
for (Node node: new WikiNodeIterable(root)) {
if (node instanceof TextNode) {
processText(((TextNode) node).text());
}
}
}
public void processText(String text) {
String[] array = text.replaceAll("\\pP", " ").toLowerCase().split("\\s+");
for (int i=0; i<array.length; i++) {
String term = array[i];
incrementTermCount(term);
}
}
-
processElements
takes anElements
object, which is a collection of jsoupElement
objects. It iterates through the collection and callsprocessTree
on each. -
processTree
takes a jsoupNode
that represents the root of a DOM tree. It iterates through the tree to find the nodes that contain text; then it extracts the text and passes it toprocessText
. -
processText
takes a String that contains words, spaces, punctuation, etc. It removes punctuation characters by replacing them with spaces, converts the remaining letters to lowercase, then splits the text into words. Then it loops through the words it found and callsincrementTermCount
on each.
Finally, here's an example that demonstrates how TermCounter
is used:
String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
WikiFetcher wf = new WikiFetcher();
Elements paragraphs = wf.fetchWikipedia(url);
TermCounter counter = new TermCounter(url);
counter.processElements(paragraphs);
counter.printCounts();
This example uses a WikiFetcher
to download a page from Wikipedia and parse the main text. Then it creates a TermCounter
and uses it to count the words in the page.
In the next section, you'll have a chance to run this code and test your understanding by filling in a missing method.
When you check out the repository for this lab, you should find a file structure similar to what you saw in previous labs. The top level directory contains CONTRIBUTING.md
, LICENSE.md
, README.md
, and the directory with the code for this lab, javacs-lab06
.
In the subdirectory javacs-lab06/src/com/flatironschool/javacs
you'll find the source files for this lab:
* `TermCounter.java` contains the code from the previous section.
* `Index.java` contains the class definition for the next part of this lab.
* `WikiFetcher.java` contains the class we used in the previous lab to download and parse Web pages.
* `WikiNodeIterable.java` contains the class we used to traverse the nodes in a DOM tree.
Also, in javacs-lab06
, you'll find the Ant build file build.xml
.
-
In
javacs-lab06
, runant build
to compile the source files. Then runant TermCounter
, it should run the code from the previous section and print a list of terms and their counts. The output should look something like thisgenericservlet, 2 configurations, 1 claimed, 1 servletresponse, 2 occur, 2 Total of all counts = -1
When you run it, the order of the terms might be different.
- The last line is supposed to print the total of the term counts, but it returns
-1
because the methodsize
is incomplete. Fill in this method and runant TermCounter
again. The result should be4798
.
Run ant test1
to confirm that this part of the lab is complete and correct.
For the second part of the lab, we'll present our implementation of an Index
object and you will fill in a missing method. Here's the beginning of the class definition:
public class Index {
private Map<String, Set<TermCounter>> index = new HashMap<String, Set<TermCounter>>();
public void add(String term, TermCounter tc) {
Set<TermCounter> set = get(term);
// if we're seeing a term for the first time, make a new Set
if (set == null) {
set = new HashSet<TermCounter>();
index.put(term, set);
}
// otherwise we can modify an existing Set
set.add(tc);
}
public Set<TermCounter> get(String term) {
return index.get(term);
}
The instance variable, index
, is a map from each search term to a set of TermCounter
objects. Each TermCounter
represents a page where the search term appears.
The add
method adds a new TermCounter
to the set associated with a term. When we index a term that has not appeared before, we have to create a new set. Otherwise we can just add a new element to an existing set. In that case, set.add
modifies a set that lives inside index
, but doesn't modify index
itself. The only time we modify index
is when we add a new term.
Finally, the get
method takes a search term and returns the corresponding set of TermCounter
objects.
This data structure is moderately complicated. To review, an Index
object contains a map from each search term to a set of TermCounter
objects, and each TermCounter
is a map from search terms to counts. The method printIndex
shows how to unpack this data structure:
public void printIndex() {
// loop through the search terms
for (String term: keySet()) {
System.out.println(term);
// for each term, print the pages where it appears and the count
Set<TermCounter> tcs = get(term);
for (TermCounter tc: tcs) {
Integer count = tc.get(term);
System.out.println(" " + tc.getLabel() + " " + count);
}
}
}
The outer loop iterates the search terms. The inner loop iterates the TermCounter
objects that represent the pages where the search term appears.
- Run
ant build
to make sure your source code is compiled, and then runant Index
. It downloads two Wikipedia pages, indexes them, and prints the results; but when you run it you won't see any output because we've left one of the methods empty.
Your job is to fill in indexPage
, which takes a URL (as a String) and an Elements
object, and updates the index. The comments below sketch what it should do:
public void indexPage(String url, Elements paragraphs) {
// make a TermCounter and count the terms in the paragraphs
// for each term in the TermCounter, add the TermCounter to the index
}
-
When it's working, run
ant Index
again, and you should see output like this:... configurations https://en.wikipedia.org/wiki/Programming_language 1 https://en.wikipedia.org/wiki/Java_(programming_language) 1 claimed https://en.wikipedia.org/wiki/Java_(programming_language) 1 servletresponse https://en.wikipedia.org/wiki/Java_(programming_language) 2 occur https://en.wikipedia.org/wiki/Java_(programming_language) 2
The order of the search terms might be different when you run it.
Also, run ant test2
to confirm that this part of the lab is complete.
This lab involved more reading that some of the others, and not as much programming. But we hope you learned a lot. Congratulations on getting to the end!