/wikipedia-challenge

Wikipedia Challenge - Find the shortest path between two Wikipedia pages.

Primary LanguageJavaScriptMIT LicenseMIT

#Wikipedia Challenge - Find the shortest path between two Wikipedia pages

Dependencies devDependencies JavaScript Style Guide

This application is inspired by Ran Levi's Wikipedia challenge in his famous podcast, Curious Minds, in which Ran asks the listeners to find the shortest path between two Wikipedia pages.

Prerequisites

Download and install the following tools:

  1. Node JS 6.x (or higher).
  2. PostgreSQL DB v9.2 (or higher).
  3. Rabbit Message Queue (RMQ) v3.6.x (or higher).
  • After successful installation, make sure Postgres and RMQ are running.

Once you have completed the previous step, you need to run the following commands from the source's root directory:

  1. npm install - installs all NPM packages specified in package.json.
  2. npm run migrate up - creates the database tables.

Environment Variables (configuration):

  • DATABASE_URL - the URL to PostgreSQL DB. Default value: postgres://postgres:@localhost/postgres
  • AMQP_URL - the URL to RMQ. Default value: amqp://guest:guest@localhost:5672
  • LOG_LEVEL - log level. Values: error/warn/info/verbose/debug/silly. Default value: info

The application is combined of three parts

Downloader

Web crawler that downloads all Wikipedia pages and stores them on the disk for future processing.

Running instructions

From the source root directory, run:
npm run downloader
You may run several downloaders at the same time to speed up the download process.

Processor

Tool that processes the downloaded pages. The downloaded pages are stored on the local drive while a process job is pushed into the RMQ. The processor takes an item from RMQ and extracts all links from the page, then it will push more download jobs into RMQ to be processed by the downloaders.

Running instructions

From the source root directory, run:
npm run processor
You should NOT run multiple processors at the same time.

Finder

Once the downloaders and the processor are done (may take a few hours), the finder will use the indexed pages to find the shortest path from any given two Wikipedia pages using Breadth-first search (BFS) algorithm.

Running instructions

From the source root directory, run:
npm run finder <LINK-START> <LINK-END>

Where:

  • <LINK-START> is the URL to the "start" Wikipedia page.
  • <LINK-END> is the URL to the "finish" Wikipedia page.

Example:
npm run finder https://he.wikipedia.org/wiki/%D7%95/%D7%90%D7%95 https://he.wikipedia.org/wiki/Can%27t_Buy_Me_Love

License

MIT