Recursively crawl the popular blogging website https://medium.com using Node.js, harvest every hyperlink that belongs to medium.com, and store the results in a database.
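To illustrate the harvesting step, here is a minimal sketch that fetches a page, extracts its anchor hrefs, and keeps only the links that belong to medium.com. It assumes axios and cheerio; the repo itself may use different libraries.

```js
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch one page and return every unique medium.com link found on it.
async function harvestLinks(pageUrl) {
  const { data: html } = await axios.get(pageUrl);
  const $ = cheerio.load(html);
  const links = new Set();

  $('a[href]').each((_, el) => {
    try {
      // Resolve relative hrefs against the page URL.
      const resolved = new URL($(el).attr('href'), pageUrl);
      // Keep only links that belong to medium.com (including subdomains).
      if (resolved.hostname === 'medium.com' || resolved.hostname.endsWith('.medium.com')) {
        links.add(resolved.href);
      }
    } catch (err) {
      // Skip hrefs that cannot be parsed as URLs.
    }
  });

  return [...links];
}
```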
What needs to be stored?
- Every unique URL encountered.
- The total reference count of each URL (how many times it was referenced).
- The complete, de-duplicated list of query parameters associated with each URL (see the storage sketch after this list).
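A minimal sketch of how each harvested URL could be recorded, assuming the official MongoDB Node.js driver; the field names url, referenceCount, and parameters are illustrative and may not match the repo's actual schema.

```js
// Upsert one harvested URL: count every reference and collect unique query parameters.
// `collection` is a Collection from the official MongoDB driver.
async function saveUrl(collection, rawUrl) {
  const parsed = new URL(rawUrl);
  const params = [...parsed.searchParams.keys()];  // query parameter names on this occurrence
  const bareUrl = parsed.origin + parsed.pathname; // URL without its query string

  await collection.updateOne(
    { url: bareUrl },
    {
      $inc: { referenceCount: 1 },                 // total number of times this URL was seen
      $addToSet: { parameters: { $each: params } } // keeps the parameter list unique
    },
    { upsert: true }                               // insert the document on first encounter
  );
}
```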
How the crawl proceeds:
- Suppose the first request to medium.com yields 100 links.
- We fire requests for the first 5 links (a concurrency of 5 requests).
- When one of them finishes (concurrency = 4), we need 5 concurrent requests at all times, so we fire one more request with the 6th URL in the list, bringing the concurrency back to 5.
- This goes on until all the links in the list are exhausted (see the sketch after this list).
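A minimal sketch of that refill-to-5 scheme. crawlOne is a hypothetical helper that fetches one page and returns the new links found on it; de-duplication of already-visited URLs is omitted for brevity.

```js
// Keep exactly `limit` requests in flight until the queue is exhausted.
async function crawlWithConcurrency(startLinks, limit = 5) {
  const queue = [...startLinks]; // links waiting to be crawled
  let active = 0;                // requests currently in flight

  return new Promise((resolve) => {
    const next = () => {
      // Finished: nothing queued and nothing in flight.
      if (queue.length === 0 && active === 0) return resolve();

      // Top up until we are back at the concurrency limit.
      while (active < limit && queue.length > 0) {
        const url = queue.shift();
        active++;
        crawlOne(url)                                  // hypothetical: fetch the page, return its links
          .then((newLinks) => queue.push(...newLinks)) // newly found links join the queue
          .catch(() => {})                             // ignore failed requests in this sketch
          .finally(() => {
            active--;
            next();                                    // a slot freed up, refill to `limit`
          });
      }
    };
    next(); // fire the initial batch of `limit` requests
  });
}
```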
1. Clone the repository: git clone https://github.com/vijaypatneedi/nodeCrawler.git
2. Install dependencies: npm install
3. Start the server: node app.js
The following environment variables can be set (defaults shown):
- PORT -- port the server runs on (default: 3000)
- MONGO_URL -- URL for the MongoDB connection (default: 'mongodb://localhost:27017')
- DB -- name of the database (default: "crawler")
- MONGO_RECONN_TRIES -- number of MongoDB reconnection attempts (default: 0)
- MONGO_RECONN_TIME -- MongoDB reconnection time (default: 0)
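These defaults suggest that app.js reads its configuration from process.env roughly as follows (a sketch; the actual code may differ). The variables can be overridden at launch, for example PORT=4000 node app.js.

```js
// Sketch of how the configuration above is likely consumed; names match the list, actual code may differ.
const PORT = process.env.PORT || 3000;
const MONGO_URL = process.env.MONGO_URL || 'mongodb://localhost:27017';
const DB = process.env.DB || 'crawler';
const MONGO_RECONN_TRIES = parseInt(process.env.MONGO_RECONN_TRIES, 10) || 0;
const MONGO_RECONN_TIME = parseInt(process.env.MONGO_RECONN_TIME, 10) || 0;
```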
Hit the URL localhost:${port}/getData in your browser.
Hit the URL localhost:${port}/crawl to start crawling.
After 1-2 minutes, hit localhost:${port}/getData again to see the parsed URLs.
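The same flow can be driven without a browser. The sketch below assumes Node 18+ (for the global fetch), that /crawl returns as soon as the crawl has been kicked off, and that /getData responds with JSON.

```js
const base = `http://localhost:${process.env.PORT || 3000}`;

(async () => {
  await fetch(`${base}/crawl`);                    // kick off the crawl
  await new Promise((r) => setTimeout(r, 120000)); // wait ~2 minutes for links to accumulate
  const res = await fetch(`${base}/getData`);
  console.log(await res.json());                   // the parsed URLs stored so far
})();
```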