Concurrent Wikipedia Web Crawlers

Tufts University - Fall 2019

Benjamin Auerbach, Andrew Gross, Trung Truong

Description

A Python tool that crawls and indexes Wikipedia pages concurrently.

Installing

Clone this repository. Installing the dependencies and running the program requires pip3 and python3. Navigate to the repository's root folder and run the following command to install the dependencies:

pip3 install -r requirements.txt

To run the concurrent crawler (-t defaults to 20 threads, -s defaults to 1000 sites):

python3 src/concurrent-spider.py <starting wikipedia url> -t <number crawler threads> -s <number sites to crawl>
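
The general pattern behind a multi-threaded crawler with a thread count and a site limit can be sketched as follows. This is a minimal illustration, not the code in src/concurrent-spider.py: the function name crawl, its parameters, and the injected fetch_links callable (which stands in for downloading a page and extracting its links) are all assumptions made for the example.

```python
import threading
from queue import Queue


def crawl(start_url, fetch_links, num_threads=20, max_sites=1000):
    """Crawl outward from start_url; return {url: [linked urls]}.

    fetch_links is a hypothetical callable: url -> iterable of linked urls.
    """
    graph = {}               # crawled pages mapped to their outgoing links
    seen = {start_url}       # URLs already queued; each is fetched at most once
    lock = threading.Lock()  # guards graph and seen
    frontier = Queue()
    frontier.put(start_url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:              # sentinel: crawl is finished
                frontier.task_done()
                return
            try:
                links = list(fetch_links(url))
            except Exception:
                links = []               # treat a failed fetch as a dead end
            with lock:
                graph[url] = links
                for link in links:
                    # Queue a link only while the site budget allows it.
                    if link not in seen and len(seen) < max_sites:
                        seen.add(link)
                        frontier.put(link)
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    frontier.join()                      # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    return graph


if __name__ == "__main__":
    # Stub link table so the sketch runs without network access.
    fake_links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": []}
    print(crawl("A", lambda url: fake_links.get(url, []), num_threads=4))
```

New links are queued before task_done() is called for the page that produced them, so frontier.join() cannot return while work is still pending.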

To run the breadth-first search algorithm on a graph:

python3 src/bfs.py graph.json <start> <end>

For example:

python3 src/bfs.py graph.json Businessperson California
================================================================================
Path between Businessperson and California
================================================================================
> https://en.wikipedia.org/wiki/Businessperson
> https://en.wikipedia.org/wiki/National_capitalism
> https://en.wikipedia.org/wiki/File:A_coloured_voting_box.svg
> https://en.wikipedia.org/wiki/Abraham_Lincoln
> https://en.wikipedia.org/wiki/Democratic_Party_(United_States)
> https://en.wikipedia.org/wiki/California

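A breadth-first search for the shortest link path between two pages can be sketched as below. This is an illustration rather than the code in src/bfs.py, and it assumes the graph is a JSON object mapping each page to a list of the pages it links to; the actual layout of graph.json may differ. The function name shortest_path is hypothetical.

```python
import json
from collections import deque


def shortest_path(graph, start, end):
    """Return the nodes on a shortest path from start to end, or None."""
    if start == end:
        return [start]
    parent = {start: None}   # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == end:
                # Walk the parent links back to start, then reverse.
                path = [end]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Assumed adjacency-list layout, inlined here instead of read from a file.
    graph = json.loads('{"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}')
    for page in shortest_path(graph, "A", "E") or []:
        print(">", page)
```

Because breadth-first search visits pages in order of distance from the start, the first time the end page is reached the recorded path uses the fewest possible links.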
To run the visualization program:

python3 src/viz.py graph_small.json

Sample graphs

Authors

License