Recursively crawl https://stackoverflow.com/questions using a Node.js-based crawler, harvest all questions on Stack Overflow, and store them in a database of your choice.
What do you need to store?
- Every unique URL (Stack Overflow question) you encountered.
- The total reference count for every URL (how many times this URL was encountered).
- Total # of upvotes and total # of answers for every question.
- Dump the data in a CSV file when the user kills the script.
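The storage requirements above can be sketched with a small in-memory structure. This is only an illustration, not the repository's actual schema: the `questions` map, the `normalizeUrl` helper, and the field names are all assumptions, and a real solution would write the same fields to the chosen database.

```javascript
// Hypothetical in-memory store keyed by question URL (a stand-in for the database).
const questions = new Map();

// Drop the query string and fragment so one question is not counted under several URLs.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl, 'https://stackoverflow.com');
  return url.origin + url.pathname;
}

// Record one encounter of a question link.
// Returns true the first time a URL is seen, false on repeat encounters.
function recordEncounter(rawUrl) {
  const key = normalizeUrl(rawUrl);
  const existing = questions.get(key);
  if (existing) {
    existing.referenceCount += 1;
    return false;
  }
  questions.set(key, { url: key, referenceCount: 1, upvotes: null, answers: null });
  return true;
}

// Fill in the vote and answer counts once the question page has been parsed.
function recordStats(rawUrl, upvotes, answers) {
  const entry = questions.get(normalizeUrl(rawUrl));
  if (entry) {
    entry.upvotes = upvotes;
    entry.answers = answers;
  }
}
```

The return value of `recordEncounter` is a convenient signal for the crawler: only URLs seen for the first time need to be queued for fetching, while repeats just bump the reference count.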
Things you should keep in mind:
- Maintain a concurrency of 5 requests at all times. Refrain from using the throttled-request package to limit concurrency.
- Your solution needs to be asynchronous in nature.
- If you are using request.js, do not use its connection pool to throttle the number of requests.
- You can use cheerio or a similar library for HTML parsing.
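One way to meet the concurrency constraint without a throttling package is a small promise-based pool. The sketch below is an assumption about how this could be done, not the code in this repository: `runPool` and its `worker(item, queue)` callback are hypothetical names, and the worker is expected to fetch one URL and push any newly discovered URLs back onto the queue.

```javascript
// Process a (possibly growing) queue with at most `limit` tasks in flight.
// Resolves once the queue is empty and every started task has settled.
function runPool(queue, worker, limit = 5) {
  return new Promise((resolve) => {
    let active = 0;

    function next() {
      // Start tasks until the limit is reached or the queue runs dry.
      while (active < limit && queue.length > 0) {
        const item = queue.shift();
        active += 1;
        Promise.resolve()
          .then(() => worker(item, queue))
          .catch((err) => console.error('Task failed for', item, err))
          .finally(() => {
            // A slot has freed up: refill it immediately.
            active -= 1;
            next();
          });
      }
      if (active === 0 && queue.length === 0) resolve();
    }

    next();
  });
}
```

Because a new task starts as soon as any running task settles, the pool stays at five in-flight requests whenever the queue has work, and a failed request is logged without stopping the crawl.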
I solved this problem using the following steps:
- I pushed all the questions on the first page into an array.
- Then I worked through the array recursively, popping each URL and saving the question to the database along with its URL, number of upvotes, total answers, and the name of the question.
- When the script is terminated, all the questions stored in the database are saved to a CSV file.
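The last step, dumping to CSV on termination, could look roughly like the following. It is a sketch under assumptions: `loadRows` is a hypothetical function that returns all stored questions from the database, the column names are illustrative, and `questions.csv` is an arbitrary output path.

```javascript
const fs = require('node:fs');

// Quote a field if it contains a comma, double quote, or line break.
function csvField(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of question records into CSV text with a header row.
function toCsv(rows) {
  const header = ['url', 'title', 'referenceCount', 'upvotes', 'answers'];
  const lines = rows.map((row) => header.map((key) => csvField(row[key])).join(','));
  return [header.join(','), ...lines].join('\n') + '\n';
}

// Register a one-time Ctrl+C handler that writes the CSV and then exits.
// `loadRows` is a hypothetical async function that reads every row from the database.
function dumpOnExit(loadRows, filePath = 'questions.csv') {
  process.once('SIGINT', async () => {
    try {
      const rows = await loadRows();
      fs.writeFileSync(filePath, toCsv(rows));
    } catch (err) {
      console.error('CSV dump failed:', err);
      process.exitCode = 1;
    } finally {
      process.exit();
    }
  });
}
```

Question titles often contain commas and quotes, which is why each field goes through `csvField` rather than being joined directly.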
Fork the repository
Open a terminal and run git clone https://github.com//Assignment-Airtribe.git
cd Assignment-Airtribe
npm install
Create a .env file and copy the contents of config.env into it.
npm start