Store-Information-Crawler

A Python Scrapy project that crawls the websites of several organizations for store information, company information, or club information.

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Store Info Web Crawler

This crawler fetches data from the websites of various organizations (e.g. clubs, companies) to collect information about their store locations, clubs, or other company details. The collected fields include store name, location, coordinates, phone number, operating hours, etc. See the results folder for the crawler output.

General Notes

  • Either crawler 1 or crawler 2 was not working because the site's robots.txt was being misread. The robots.txt allowed crawlers to access the specific URL, but Scrapy did not interpret it that way.
    • Workaround: set ROBOTSTXT_OBEY to False in settings.py
    • Further investigation needed.
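As a starting point for that investigation, the sketch below checks a robots.txt body against a URL using Python's standard-library parser. It is only an illustration: the sample rules, the example.com URLs, and the helper name is_allowed are hypothetical, and the standard-library parser may not behave the same way as the parser Scrapy is configured to use, so a differing answer is a hint rather than proof.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, standing in for the real site's file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /stores/
"""

if __name__ == "__main__":
    for candidate in (
        "https://example.com/stores/list",
        "https://example.com/private/x",
    ):
        print(candidate, is_allowed(SAMPLE_ROBOTS, candidate))
```

Pasting the affected site's actual robots.txt and the blocked URL into this helper would show whether an independent parser agrees that the URL is allowed.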

Running the crawlers

Use the following commands to run the crawlers.

Output as JSON file:

scrapy crawl <name> -o results/<name>.json

Output as CSV file:

scrapy crawl <name> -o results/<name>.csv -t csv
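Once a crawl has written a JSON file, a quick sanity check of the output can be sketched as follows. This is a minimal example, assuming the file contains a single JSON array of items, which is what the JSON export produces for a fresh file; the helper names load_results and summarize are hypothetical and not part of this project.

```python
import json
from pathlib import Path


def load_results(path):
    """Load a crawler output file and return its items as a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        items = json.load(handle)
    if not isinstance(items, list):
        raise ValueError(f"{path} does not contain a JSON array of items")
    return items


def summarize(items):
    """Count, per field name, how many items carry a non-empty value."""
    counts = {}
    for item in items:
        for key, value in item.items():
            if value not in (None, "", [], {}):
                counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python check_results.py results/<name>.json
    results = load_results(sys.argv[1])
    print(len(results), "items")
    for field, count in sorted(summarize(results).items()):
        print(field, count)
```

If the output file already existed before the run, -o appends to it and the result may no longer be valid JSON; in that case json.load raises an error, and deleting the old file before crawling again is the simplest fix.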

Crawlers

The crawlers need to be tested and updated on a regular basis to make sure they still work.

Name                      Last Run
towncaredental            2020-07-15
rickysalldaygrillcanada   2020-07-15
jockey                    2020-07-15
rentking                  2020-07-15
uae_free                  2020-07-18
marketwatch_ipo           2020-07-15
maac                      2020-07-15
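One way to keep track of which crawlers are due for a re-test is to compare the dates above against a maximum age. The sketch below copies the dates from the table; the 90-day threshold and the helper name stale_crawlers are assumptions chosen for illustration, not something the project defines.

```python
from datetime import date

# Last-run dates copied from the table above.
LAST_RUN = {
    "towncaredental": date(2020, 7, 15),
    "rickysalldaygrillcanada": date(2020, 7, 15),
    "jockey": date(2020, 7, 15),
    "rentking": date(2020, 7, 15),
    "uae_free": date(2020, 7, 18),
    "marketwatch_ipo": date(2020, 7, 15),
    "maac": date(2020, 7, 15),
}


def stale_crawlers(last_run, today, max_age_days=90):
    """Return the sorted names of crawlers last run more than max_age_days ago."""
    return sorted(
        name
        for name, ran in last_run.items()
        if (today - ran).days > max_age_days
    )


if __name__ == "__main__":
    # max_age_days=90 is an arbitrary, hypothetical threshold.
    for name in stale_crawlers(LAST_RUN, date.today()):
        print("re-test:", name)
```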

Pipelines

  • XlsxWriterPipeline takes the items from a spider and writes them to an Excel spreadsheet. If the spider yields multiple items, they are placed in separate sheets of the Excel file.
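The general shape of such a pipeline can be sketched as below. This is not the project's XlsxWriterPipeline; the class name XlsxGroupingPipelineSketch is hypothetical, and the sketch assumes that "separate sheets" means one sheet per item class, that items are dict-like with simple values (strings, numbers), and that the workbook goes to results/<spider name>.xlsx. It relies on the third-party xlsxwriter package, imported only when the spider closes.

```python
from collections import defaultdict


class XlsxGroupingPipelineSketch:
    """Hypothetical stand-in: buffers items per item class, then writes one sheet each."""

    def open_spider(self, spider):
        # Maps sheet name -> list of row dicts collected during the crawl.
        self.rows_by_sheet = defaultdict(list)

    def process_item(self, item, spider):
        # Excel limits sheet names to 31 characters.
        sheet = type(item).__name__[:31]
        self.rows_by_sheet[sheet].append(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        import xlsxwriter  # third-party dependency, needed only for writing

        # Assumed output location; the results directory must already exist.
        workbook = xlsxwriter.Workbook(f"results/{spider.name}.xlsx")
        for sheet, rows in self.rows_by_sheet.items():
            worksheet = workbook.add_worksheet(sheet)
            # Column order: first appearance of each field across all rows.
            headers = list(dict.fromkeys(key for row in rows for key in row))
            worksheet.write_row(0, 0, headers)
            for row_index, row in enumerate(rows, start=1):
                worksheet.write_row(row_index, 0, [row.get(h) for h in headers])
        workbook.close()
```

Buffering rows until close_spider keeps the header row consistent even when later items carry fields that earlier ones lacked, at the cost of holding all items in memory for the duration of the crawl.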

Notes

Crawler 5 "uae_free"

Resources

  1. Scrapy module for Python: https://docs.scrapy.org/en/latest/ (quick start-to-finish example: https://www.codementor.io/andy995/writing-a-simple-web-scraper-using-scrapy-myb7vrmgx)
  2. XPath syntax: https://devhints.io/xpath (use the Google Chrome Inspector in DevTools to test XPath expressions against a website's HTML nodes; example: https://yizeng.me/2014/03/23/evaluate-and-validate-xpath-css-selectors-in-chrome-developer-tools/)
  3. Network Log details/demo: https://developers.google.com/web/tools/chrome-devtools/network/