This tool crawls a website and generates a graph of pages and links. Additionally, a node is added for each word, and edges are added from each word to the pages it appears on.
Only Leiningen is required for building. Install Leiningen, then execute the following from the webGraph directory:
webGraph$ lein deps
webGraph$ lein compile
To actually do anything useful, you'll need access to a running Neo4J server. You can install it on your local machine and start an instance:
neo4j-community-1.6$ ./bin/neo4j start
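To check that the server is reachable before crawling, the REST API root at the default host and port should respond with a JSON service description:
$ curl http://localhost:7474/db/data/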
Use the --help command-line option to see the possible arguments:
webGraph$ lein run --help
Usage:
 Switches                 Default    Desc
 --------                 -------    ----
 -h, --no-help, --help    false      Show this help message
 --host                   localhost  Neo4J server host
 --port                   7474       Neo4J server port
 -s, --seed                          Fully qualified URL for the crawl
 -p, --max-pages          10         Maximum number of pages to crawl
 -d, --max-depth          5          Maximum crawl depth
 -n, --politeness         1000       Time to wait between page requests (ms)
 -j, --threads            1          Number of crawler threads
To crawl a domain, specify a fully qualified URL as the seed:
webGraph$ lein run --seed http://www.milk.com/
The crawler will not follow links outside of the given domain.
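The other switches can be combined with the seed. For example, to crawl up to 50 pages with two crawler threads and a 500 ms politeness delay:
webGraph$ lein run --seed http://www.milk.com/ --max-pages 50 --threads 2 --politeness 500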
- There needs to be a way to query the database outside of Neo4J's web interface: either a command-line query tool or a custom web interface.
- The web crawler library that is used (crawler4j) does a poor job of extracting plain text from the HTML. Some words get squished together. This can be fixed by taking the raw HTML from the crawler and running it through something like jsoup's Jsoup.parse(html).text(); a sketch appears after this list.
- Currently, each distinct word from a crawled page requires either two or three calls to the server. This is very slow, but it can be sped up by using the batch API (see the second sketch after this list).
- To run PageRank on the graph, the server must be taken off-line so a custom Gremlin distribution can access it directly. Maybe there's a better way?
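A minimal sketch of the jsoup fix mentioned above, assuming jsoup is added as a project dependency (the function name is illustrative, not part of the current code):

(defn html->text
  "Extract plain text from raw HTML with jsoup, preserving spaces between words."
  [html]
  (.text (org.jsoup.Jsoup/parse html)))

;; e.g. (html->text "<p>Hello</p><p>world</p>") should keep the two words separated.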
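And a rough sketch of what a batched insert could look like against Neo4J's REST batch endpoint, assuming clj-http and clojure.data.json are available; the function and the node properties are illustrative, not the current implementation:

(require '[clj-http.client :as http]
         '[clojure.data.json :as json])

(defn batch-create-nodes
  "Create several nodes in a single round trip via the /db/data/batch endpoint."
  [host port props-seq]
  (http/post (str "http://" host ":" port "/db/data/batch")
             {:content-type :json
              :body (json/write-str
                     (map-indexed (fn [i props]
                                    {:method "POST" :to "/node" :body props :id i})
                                  props-seq))}))

;; e.g. (batch-create-nodes "localhost" 7474 [{:word "milk"} {:word "cow"}])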
Copyright (C) 2012 James Marble and John Russell
Distributed under the Eclipse Public License, the same as Clojure.