The goal is to take any Wikipedia page and compute the number of link hops (degrees of separation) from that page to the "Kevin Bacon" entry (https://en.wikipedia.org/wiki/Kevin_Bacon).
For example, there is a one-link separation between Footloose (https://en.wikipedia.org/wiki/Footloose_(1984_film)) and Kevin Bacon. There is a two-link separation between Tom Cruise and Kevin Bacon (via https://en.wikipedia.org/wiki/A_Few_Good_Men). Also consider the links of separation between https://en.wikipedia.org/wiki/Philosophy or https://en.wikipedia.org/wiki/Taj_Mahal and Kevin Bacon.
The input to the program is a Wikipedia URL and the output is an integer.
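Put differently, the program implements a mapping from a page URL to a hop count. A throwaway sketch of that contract (the function name here is hypothetical; the real entry points are the HTTP routes described below):

```ts
// Hypothetical signature illustrating the input/output contract; not part of the codebase.
declare function degreesOfSeparation(wikipediaUrl: string): Promise<number>;

// Expected values for the examples above:
//   degreesOfSeparation("https://en.wikipedia.org/wiki/Footloose_(1984_film)") -> 1
//   degreesOfSeparation("https://en.wikipedia.org/wiki/Tom_Cruise")            -> 2
```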
- From a terminal run: `npm run start`; you should see `Server running at: http://0.0.0.0:3000` in stdout
- Use the Swagger docs to get started: http://localhost:3000/documentation
- Start the API by running the following command in a terminal of choice: `npm run start`
- Run this command to execute the tests (use a new terminal): `npm run e2e`
- Edit `batch.js` with the starting pages and the depth to which you want to scrape wiki pages
- From a terminal run: `node batch`
- Uses Depth-First Search to scrape page links to a specified depth
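A rough sketch of what that batch scrape does, assuming `fetch` (Node 18+) against the MediaWiki parse API; the configuration values, depth limit, and helper names here are assumptions rather than the actual contents of `batch.js`:

```ts
// Sketch only: depth-limited, depth-first scrape of Wikipedia links,
// roughly what `node batch` does. Config values and helpers are assumed.
const startPages = ["Kevin_Bacon", "Footloose_(1984_film)"]; // assumed starting pages
const maxDepth = 2;                                          // assumed depth limit
const visited = new Set<string>();

type ParsedLink = { ns: number; exists?: string; "*": string };

async function fetchLinks(page: string): Promise<string[]> {
  const url =
    "https://en.wikipedia.org/w/api.php?format=json&prop=links&action=parse&page=" +
    encodeURIComponent(page);
  const body: any = await (await fetch(url)).json();
  // Keep only main-namespace (ns === 0) article links.
  return (body?.parse?.links ?? [])
    .filter((l: ParsedLink) => l.ns === 0)
    .map((l: ParsedLink) => l["*"]);
}

async function scrape(page: string, depth: number): Promise<void> {
  if (depth > maxDepth || visited.has(page)) return;
  visited.add(page);
  const links = await fetchLinks(page);
  // Persist the `page -> links` edges here (file directory or in-memory graph).
  for (const link of links) {
    await scrape(link, depth + 1); // depth-first: recurse before moving to siblings
  }
}

(async () => {
  for (const page of startPages) await scrape(page, 0);
})();
```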
- Can use the `POST /graph` route to populate the in-memory graph using the web scraper
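For example, once the server is running the route can be hit over HTTP. The request body below (`page` and `depth` fields) is a guess for illustration only; the actual schema is in the Swagger docs at `/documentation`:

```ts
// Illustration only: the request body fields are assumed, not the route's real schema.
async function populateGraph() {
  const res = await fetch("http://localhost:3000/graph", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ page: "Kevin_Bacon", depth: 2 }), // assumed fields
  });
  console.log(res.status, await res.json());
}
populateGraph();
```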
- Uses a Breadth-First Search approach
- The `/separation` route can be edited to use either the directory of pre-scraped data or the in-memory graph as its graph source
- I also tried to "live" scrape data during the BFS, but the performance was so poor that I didn't invest any more time in it. It is the same approach with a different data source.
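The search itself is a standard breadth-first expansion: treat pages as nodes and links as edges, and expand level by level from the start page until Kevin_Bacon is found. A minimal sketch over an adjacency-list graph (the `Graph` shape is an assumption; the real code reads either the scraped-file directory or the in-memory graph):

```ts
// Sketch of the BFS used to compute degrees of separation.
// `graph` maps a page title to the titles it links to; the exact shape is assumed.
type Graph = Map<string, Set<string>>;

function computeSeparation(graph: Graph, start: string, target = "Kevin_Bacon"): number {
  if (start === target) return 0;
  const visited = new Set<string>([start]);
  let frontier = [start];

  // Expand one level of links per iteration (breadth-first).
  for (let depth = 1; frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const link of graph.get(page) ?? []) {
        if (link === target) return depth;
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return -1; // no path found in the scraped data
}
```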
- Future: use a graph database to store data for better performance
- As nodes are added to the graph via scraping, store the depth at which they were scraped. This only works because Kevin_Bacon is always the target: scraping starts from Kevin_Bacon, so each node's stored depth is already its separation, and answering a query becomes an O(1) lookup.
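A sketch of that idea, assuming a hypothetical `getIncomingLinks` helper (e.g. something backed by MediaWiki's backlinks list, since separation needs links pointing toward Kevin_Bacon): record each page's depth the first time it is seen while expanding outward from Kevin_Bacon, and a query is then a single map lookup.

```ts
// Sketch only: store each page's scrape depth while expanding outward from Kevin_Bacon.
// `getIncomingLinks` is a hypothetical helper; once the map is built, a query is O(1).
const depthFromKevinBacon = new Map<string, number>([["Kevin_Bacon", 0]]);

async function buildDepthMap(
  maxDepth: number,
  getIncomingLinks: (page: string) => Promise<string[]>
): Promise<void> {
  let frontier = ["Kevin_Bacon"];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const linkingPage of await getIncomingLinks(page)) {
        if (!depthFromKevinBacon.has(linkingPage)) {
          depthFromKevinBacon.set(linkingPage, depth); // first visit = shortest distance
          next.push(linkingPage);
        }
      }
    }
    frontier = next;
  }
}

// O(1) lookup once the map is populated:
const separation = (page: string) => depthFromKevinBacon.get(page);
```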
- Option 1: call the [API](https://www.mediawiki.org/wiki/API:Parsing_wikitext), traverse the pages yourself (`webscraper.js`), and store the results in a database
  - Not very cost-effective if the application is hosted in the cloud
  - The time needed to populate the data could be significant, since network latency is unknown
- Option 2: find a database download with initial data to seed the database
  - Must also consider how to keep the database up to date; Wikipedia pages are edited and new ones are created often
  - Is there a way to know which pages are updated or created? Pub/Sub? Polling?
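On the polling question: the MediaWiki API exposes a `recentchanges` list that could be polled periodically to find pages that were edited or created since the last check. A rough sketch (the polling interval and how changes are applied to the stored graph are assumptions):

```ts
// Sketch only: poll the MediaWiki recentchanges list to learn which article pages changed.
async function fetchRecentChanges(): Promise<string[]> {
  const url =
    "https://en.wikipedia.org/w/api.php?action=query&list=recentchanges" +
    "&rcnamespace=0&rclimit=50&format=json";
  const body: any = await (await fetch(url)).json();
  return (body?.query?.recentchanges ?? []).map((rc: { title: string }) => rc.title);
}

// Assumed wiring: every 10 minutes, re-scrape the changed pages and refresh their edges.
setInterval(async () => {
  for (const title of await fetchRecentChanges()) {
    // re-scrape `title` and update its outgoing links in the graph
  }
}, 10 * 60 * 1000);
```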
- Collecting the data
  - Had to clean out links to generic wiki pages (e.g. Help, Portal) and remove special characters so the pages could be stored in a directory
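A sketch of that cleanup, with an assumed list of namespace prefixes and an assumed filename rule (the scraper's actual rules may differ):

```ts
// Sketch only: drop non-article namespace links and sanitize titles for use as file names.
const ignoredPrefixes = ["Help:", "Portal:", "Wikipedia:", "Category:", "File:", "Template:", "Special:", "Talk:"];

const isArticleLink = (title: string): boolean =>
  !ignoredPrefixes.some((prefix) => title.startsWith(prefix));

const toFileName = (title: string): string =>
  // Replace characters that are unsafe in file names; the exact rule is assumed.
  title.replace(/[^a-zA-Z0-9()_-]/g, "_");
```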
- New data sources can be added by implementing the following interface:
```ts
interface DataSource {
  computeDegreesOfSeparation(/* ... */): number;
  insertNode(/* ... */): void;
  addEdge(/* ... */): void;
}
```
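As an illustration, an in-memory adjacency-list source shaped after this interface could look like the sketch below; the method parameters are assumptions, since the interface above does not pin them down.

```ts
// Hypothetical in-memory data source shaped after the DataSource interface above.
// Parameter names and types are assumptions for illustration.
class InMemoryDataSource {
  private readonly edges = new Map<string, Set<string>>();

  insertNode(page: string): void {
    if (!this.edges.has(page)) this.edges.set(page, new Set());
  }

  addEdge(from: string, to: string): void {
    this.insertNode(from);
    this.insertNode(to);
    this.edges.get(from)!.add(to);
  }

  computeDegreesOfSeparation(start: string, target = "Kevin_Bacon"): number {
    // Same breadth-first search as sketched earlier, over this.edges.
    if (start === target) return 0;
    const visited = new Set<string>([start]);
    let frontier = [start];
    for (let depth = 1; frontier.length > 0; depth++) {
      const next: string[] = [];
      for (const page of frontier) {
        for (const link of this.edges.get(page) ?? []) {
          if (link === target) return depth;
          if (!visited.has(link)) {
            visited.add(link);
            next.push(link);
          }
        }
      }
      frontier = next;
    }
    return -1; // not reachable in the stored data
  }
}
```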
Example MediaWiki API request for retrieving a page's links: https://en.wikipedia.org/w/api.php?format=json&prop=links&action=parse&page=Cleveland
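The response to this request is JSON whose `parse.links` array lists the links found on the page, approximately in the following shape (only the fields relevant here; other fields omitted):

```ts
// Approximate shape of the relevant part of the parse response (other fields omitted).
type ParseLinksResponse = {
  parse: {
    title: string;
    links: Array<{
      ns: number;      // namespace: 0 means a main/article-namespace link
      exists?: string; // present when the linked page exists
      "*": string;     // the linked page's title
    }>;
  };
};
```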