web_crawler

Making a web_crawler

.nvmrc pins the Node.js version used for this project.

npm init generates package.json. Installed packages go into node_modules, which is listed in .gitignore (created with touch .gitignore) so they are never committed. package.json records the dependencies, so running npm install restores them all.

npm install --save-dev jest installs Jest as a development-only dependency.

npm install jsdom installs jsdom, which is used to get the URLs out of the crawled pages.
~ Read the JSDOM documentation

Modified the scripts in package.json so that npm start runs the crawler; main.js is the entry point.

"scripts": {
    "start": "node main.js",
    "test": "echo \"Error: no test specified\" && exit 1"
},

Then changed the test script from "echo \"Error: no test specified\" && exit 1" to "jest", so the tests run with:

npm test

Why normalise URLs: see the Cloudflare Docs (background for the crawl.js file)

crawl.js was modified to keep only the hostname and path of each URL, stripping the protocol.

Note on capital letters in URLs: the URL constructor used in crawl.js already takes care of this, because it lowercases the hostname while parsing (the path keeps its case).
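A minimal sketch of that normalisation, assuming a function named normalizeURL (the actual name and details in crawl.js may differ). Dropping a trailing slash is an extra assumption added here so that "/path/" and "/path" count as the same page:

```javascript
// Sketch only: normalizeURL is an assumed name, not necessarily the one in crawl.js.
function normalizeURL(urlString) {
  // The URL constructor lowercases the scheme and hostname while parsing.
  const url = new URL(urlString);
  // Keep hostname + path only, so http and https variants collapse to one key.
  let normalized = `${url.hostname}${url.pathname}`;
  // Assumption: drop one trailing slash so "/path/" and "/path" compare equal.
  if (normalized.endsWith('/')) {
    normalized = normalized.slice(0, -1);
  }
  return normalized;
}

console.log(normalizeURL('https://Example.COM/Path/')); // example.com/Path
console.log(normalizeURL('http://example.com/Path'));   // example.com/Path
```

Both calls give the same string, so the crawler can use it as the key for "already visited" checks.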

TIMELINE TILL NOW

~gitlens

Note: all the tests are written with Jest, so run them with npm test

Demo: npm run start https://stephendavidwilliams.com/ai-in-data-engineering-part-2
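For reference, a rough sketch of how an entry point like main.js could read the URL from the command line. getBaseURL is a hypothetical helper used only for illustration; the real main.js may handle its arguments differently:

```javascript
// Sketch only: getBaseURL is a hypothetical helper, not taken from main.js.
function getBaseURL(argv) {
  // argv[0] is the node binary and argv[1] is the script path,
  // so user-supplied arguments start at index 2.
  const args = argv.slice(2);
  if (args.length !== 1) {
    throw new Error('usage: npm run start <url>');
  }
  // new URL() throws a TypeError on malformed input, which rejects bad URLs early.
  return new URL(args[0]).href;
}

console.log(getBaseURL(['node', 'main.js', 'https://example.com/blog']));
```

In main.js this would be called as getBaseURL(process.argv).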

Note: Although this crawler works fine with most websites, crawling Medium makes it loop endlessly. To handle this, instead of incrementing the count of already visited pages, we can

Stop at a page whose links are all repeats of already visited URLs
Stop at the exact moment a link is repeated <-- the logic seems faulty

I am thinking of implementing these two later
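A rough sketch of the first idea, not the current implementation. It uses a hypothetical in-memory site object in place of real fetching and a visited set so each page is crawled once. The maxPages cap is an extra assumption, added as a safety net for sites that keep producing new URLs:

```javascript
// Hypothetical in-memory "site": each page maps to the links found on it.
// In the real crawler these links would come from fetching and parsing HTML.
const site = {
  'example.com': ['example.com/a', 'example.com/b'],
  'example.com/a': ['example.com', 'example.com/b'],
  'example.com/b': ['example.com/a'],
};

// Visit each page at most once; maxPages is an assumed safety limit.
function crawl(start, getLinks, maxPages = 100) {
  const visited = new Set();
  const queue = [start];
  while (queue.length > 0 && visited.size < maxPages) {
    const page = queue.shift();
    if (visited.has(page)) continue; // repeated link: skip it instead of re-crawling
    visited.add(page);
    // Only follow links that are new; a page whose links are all known adds nothing.
    const fresh = getLinks(page).filter((link) => !visited.has(link));
    queue.push(...fresh);
  }
  return [...visited];
}

const pages = crawl('example.com', (page) => site[page] ?? []);
console.log(pages); // [ 'example.com', 'example.com/a', 'example.com/b' ]
```

The crawl ends on its own once every remaining link points to a visited page, which is the first stopping rule above. Skipping a repeated link, rather than stopping the whole crawl on it, avoids the problem with the second rule.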

Recently started using GitLens, therefore putting out the timeline once again

yeskaydee/web_crawler
