nodeCrawler

Crawling a webpage recursively using Node.js.

Web-Crawler

Requirements

Recursively crawl the popular blogging website https://medium.com using Node.js, harvest all hyperlinks that belong to medium.com, and store them in a database.

What needs to be stored?

  1. Every unique URL encountered.
  2. The total reference count of every URL.
  3. A complete list of unique parameters associated with each URL (a possible document shape is sketched below).
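
One way to hold this in MongoDB is a single document per URL. The shape below is only a sketch for illustration; the field names are assumptions, not the project's actual schema.

```javascript
// Hypothetical shape of one stored document; field names are illustrative,
// not taken from the repository's actual schema.
const exampleDoc = {
  url: "https://medium.com/some-post",  // every unique URL encountered
  referenceCount: 42,                   // total number of times this URL was referenced
  params: ["source", "ref"],            // unique list of parameters seen with this URL
};
```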

Assumptions

  1. Let’s say the first request to medium.com gives 100 links.
  2. We’ll fire the first 5 requests using the first 5 links (a concurrency of 5 requests).
  3. Now one of them finishes (concurrency = 4), but because we need 5 concurrent requests at all times, we’ll fire one more request with the 6th URL in the list, making the concurrency 5 again.
  4. This goes on until all the links in the list are exhausted (a sketch of this pattern follows the list).
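
A minimal sketch of this fixed-size request pool, assuming a hypothetical crawl(url) helper that fetches one link and stores its results (not the repository's actual code):

```javascript
// Process a list of links with at most `limit` requests in flight.
// As soon as one request finishes, the next link in the list is fired,
// so the concurrency stays at `limit` until the list is exhausted.
// `crawl` is a hypothetical helper that fetches one URL and stores its results.
async function crawlList(links, crawl, limit = 5) {
  let next = 0;

  async function worker() {
    while (next < links.length) {
      const url = links[next++]; // take the next unprocessed link
      await crawl(url);
    }
  }

  // Start `limit` workers; together they keep `limit` requests in flight.
  await Promise.all(Array.from({ length: limit }, () => worker()));
}
```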

Instructions to install on a local machine

1. Clone the repository: git clone https://github.com/vijaypatneedi/nodeCrawler.git
2. Install dependencies: npm install
3. Start the server: node app.js

Environment Variables

  1. PORT -- Port for running the server (default: 3000)
  2. MONGO_URL -- URL for the Mongo connection (default: 'mongodb://localhost:27017')
  3. DB -- Name of the database (default: "crawler")
  4. MONGO_RECONN_TRIES -- Number of Mongo reconnection tries (default: 0)
  5. MONGO_RECONN_TIME -- Mongo reconnection time (default: 0)
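
These variables can be read with their defaults along these lines (a sketch only, not the repository's actual configuration code):

```javascript
// Read configuration from the environment, falling back to the documented defaults.
const config = {
  port: process.env.PORT || 3000,
  mongoUrl: process.env.MONGO_URL || 'mongodb://localhost:27017',
  db: process.env.DB || 'crawler',
  mongoReconnTries: Number(process.env.MONGO_RECONN_TRIES) || 0,
  mongoReconnTime: Number(process.env.MONGO_RECONN_TIME) || 0,
};
```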

To retrieve the already parsed URLs:

Hit the URL in your browser: localhost:${port}/getData

To start the scraping process afresh:

Hit the URL in your browser: localhost:${port}/crawl

After 1-2 minutes, hit localhost:${port}/getData again to see the parsed URLs.
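
The same two endpoints can also be hit programmatically. The snippet below is a sketch that assumes Node.js 18+ (for the global fetch), the default port 3000, and that /getData returns JSON:

```javascript
// Start a fresh crawl, wait a couple of minutes, then read back the parsed URLs.
const base = 'http://localhost:3000'; // or whatever PORT the server is running on

(async () => {
  await fetch(`${base}/crawl`); // kick off the crawl

  // Give the crawler time to work (1-2 minutes, as suggested above).
  await new Promise((resolve) => setTimeout(resolve, 2 * 60 * 1000));

  const parsed = await (await fetch(`${base}/getData`)).json();
  console.log(parsed);
})();
```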