A customizable, distributed web crawler tailored for researching online social networks and social media content.
Basically, I use this package for my data-collection needs, so it is ever-evolving. I also use it to keep up with programming techniques such as new APIs and distributed patterns, which means I may perform major refactorings on the package from time to time.
This package contains the common code for:
- Distributed crawling
- Event-based communication
- Failure detection
- Journal Recovery
  - The journal itself is implemented; recovery code is still needed
- Different Schedulers
  - BFS (Snowball)
  - Greedy attempt to crawl everything
  - Random Walks
    - Limited by a maximum number of steps
    - Damping factor (stop probability)
    - Doing random jumps is hard on online social networks, since some form of id generation is necessary; the crawler is easy to extend for this
  - Ego-network crawls
    - (neighbors, neighbors of neighbors, neighbors of neighbors of neighbors, etc.)
    - Basically, a pruned BFS
- Caching
  - Some walkers revisit nodes (the random walk, for example); caching reduces requests to the servers in these cases (see the sketch after this list)
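As a rough illustration of how the random-walk scheduler and the neighbor cache fit together, here is a minimal, hypothetical sketch in Java. The interface and class names below are placeholders for illustration, not this package's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Placeholder for whatever fetches a node's neighbors from a social network API.
interface NeighborFetcher {
    List<String> neighborsOf(String nodeId);
}

// Caching decorator: avoids repeated server requests when a walker revisits a node.
class CachingFetcher implements NeighborFetcher {
    private final NeighborFetcher delegate;
    private final Map<String, List<String>> cache = new HashMap<String, List<String>>();

    CachingFetcher(NeighborFetcher delegate) {
        this.delegate = delegate;
    }

    public List<String> neighborsOf(String nodeId) {
        List<String> neighbors = cache.get(nodeId);
        if (neighbors == null) {
            neighbors = delegate.neighborsOf(nodeId);
            cache.put(nodeId, neighbors);
        }
        return neighbors;
    }
}

// Random walk bounded by a maximum number of steps and a per-step stop
// probability (the damping factor mentioned above).
class RandomWalk {
    private final NeighborFetcher fetcher;
    private final int maxSteps;
    private final double stopProbability;
    private final Random random = new Random();

    RandomWalk(NeighborFetcher fetcher, int maxSteps, double stopProbability) {
        this.fetcher = fetcher;
        this.maxSteps = maxSteps;
        this.stopProbability = stopProbability;
    }

    void walkFrom(String seed) {
        String current = seed;
        for (int step = 0; step < maxSteps; step++) {
            System.out.println("visiting " + current);
            if (random.nextDouble() < stopProbability) {
                break;                                   // stop with the damping probability
            }
            List<String> neighbors = fetcher.neighborsOf(current);
            if (neighbors.isEmpty()) {
                break;                                   // dead end: no neighbors to move to
            }
            current = neighbors.get(random.nextInt(neighbors.size()));
        }
    }
}
```

A BFS (snowball) or ego-network scheduler would follow the same shape, expanding a frontier (optionally pruned at a fixed depth) instead of picking a single random neighbor at each step.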
To actually crawl content, individual Jobs should be implemented; a hypothetical sketch is shown below.
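For example (again, a made-up sketch: the real Job contract in this package may differ, and SocialApiClient and ContentStore are placeholder types):

```java
import java.util.List;

// Placeholder for a social-network API wrapper (not part of this package).
interface SocialApiClient {
    List<String> fetchPosts(String userId);
}

// Placeholder for wherever crawled content is persisted.
interface ContentStore {
    void save(String userId, List<String> posts);
}

// Illustrative Job contract: schedulers decide which nodes to visit,
// Jobs decide what to do at each visited node.
interface Job {
    void execute(String nodeId) throws Exception;
}

// A content-crawling Job that downloads a user's posts and stores them.
class FetchUserPostsJob implements Job {
    private final SocialApiClient client;
    private final ContentStore store;

    FetchUserPostsJob(SocialApiClient client, ContentStore store) {
        this.client = client;
        this.store = store;
    }

    public void execute(String userId) throws Exception {
        store.save(userId, client.fetchPosts(userId));
    }
}
```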
- guava (Google's common Java library) 13.0+
- protocol-buffers (Google's protocol buffer library and compiler) 2.4+
- commons-cli (Apache commons command line interface library) 1.2.x
- commons-configuration (Apache commons configuration library) 1.6
- commons-lang (Apache commons lang library) 2.6
- commons-logging (Apache commons logging library) 1.1.1
- log4j (Apache's logging library) 1.2.x
- junit4 (For unit testing) 4.0+
If you have ant (ant-optional on Ubuntu) and ivy (ivy on Ubuntu), the dependencies will be downloaded automatically. Just run `ant deps`.