Undergraduate-Project-2020

Bachelor of Business Information Technology

Primary language: Python. License: MIT.

Project structure (dir)

Undergraduate-Project-2020/
├── backend
│   ├── analysis.py
│   ├── api
│   │   └── api.py
│   ├── crawler.py
│   ├── database.py
│   ├── db_check.py
│   ├── db_data
│   │   ├── analyzed_data.json
│   │   ├── products.db
│   │   └── sample_data.json
│   ├── driver
│   │   └── chromedriver
│   └── __pycache__
│       └── database.cpython-37.pyc
├── frontend
│   └── FrontEnd here
├── LICENSE
├── README.md
└── try-out
    ├── ...

Database Structure

id   product_name   demand
0    product name   5
1    product name   8

  • id: primary key
  • product_name: TEXT
  • demand: INTEGER
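
The structure above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the project's actual database.py: the table name `products` and the helper names are assumptions, since the source only names the file products.db and the three columns.

```python
import sqlite3

# Hypothetical table name; the README only lists the three columns.
TABLE_SQL = """
CREATE TABLE IF NOT EXISTS products (
    id           INTEGER PRIMARY KEY,
    product_name TEXT,
    demand       INTEGER
)
"""


def create_products_table(connection):
    """Create the table if it does not exist yet."""
    connection.execute(TABLE_SQL)
    connection.commit()


def add_product(connection, product_name, demand):
    """Insert one row and return the id SQLite assigned to it."""
    cursor = connection.execute(
        "INSERT INTO products (product_name, demand) VALUES (?, ?)",
        (product_name, demand),
    )
    connection.commit()
    return cursor.lastrowid


def products_by_demand(connection):
    """Return (id, product_name, demand) rows, highest demand first."""
    return connection.execute(
        "SELECT id, product_name, demand FROM products ORDER BY demand DESC, id"
    ).fetchall()


# An in-memory database keeps the demonstration free of side effects.
conn = sqlite3.connect(":memory:")
create_products_table(conn)
add_product(conn, "product name", 5)
add_product(conn, "product name", 8)
rows = products_by_demand(conn)
```

SQLite assigns ids starting at 1 when none is given, so the ids here differ from the 0-based sample rows above. Pass an explicit id in the INSERT if the 0-based numbering matters.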

Crawler Website

Dynamic Pages or Client-Side Rendering

Websites are becoming increasingly interactive and user-friendly, but this makes them harder for web crawlers to process.

  • Modern websites use many dynamic techniques that are not crawler friendly, such as lazy image loading, infinite scrolling, and elements loaded via AJAX calls. These make pages difficult to crawl, even for Googlebot.
  • Modern websites rely heavily on JavaScript to load dynamic elements.

How to Know if It Is a Dynamic Page or a Static Page?

You can detect whether a web page uses asynchronous loading, or is otherwise dynamic, by viewing the page source (right-click on the page and choose View Page Source). If you search the source for the content you are looking for and cannot find it, JavaScript probably renders that content.

  • Many modern websites are JavaScript-rendered pages, which makes them difficult for web scrapers.
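
The manual check can also be scripted. The sketch below downloads the HTML exactly as the server sends it (no JavaScript is executed) and looks for a piece of text you can see in the browser. The function names and the two sample pages are made up for illustration, and the check is only a heuristic: text split across tags or entity-encoded in the source will not match.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url, timeout=10.0):
    """Download the HTML as served, without running any JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_source(raw_html, expected_text):
    """Case-insensitive check that expected_text occurs in the raw HTML."""
    return expected_text.casefold() in raw_html.casefold()


# Made-up pages: the first ships its content in the HTML, the second
# ships an empty container that a script would fill in later.
static_page = "<html><body><h1>Fresh Mangoes</h1></body></html>"
dynamic_page = (
    '<html><body><div id="app"></div>'
    '<script src="app.js"></script></body></html>'
)

print(appears_in_source(static_page, "fresh mangoes"))
print(appears_in_source(dynamic_page, "fresh mangoes"))
```

For a live page, `appears_in_source(fetch_raw_html(url), "text seen in the browser")` returning False suggests the content is rendered client-side.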

How Does the WebDriver Handle Dynamic Pages?

Selenium WebDriver is one of the most popular tools for web UI automation. It automates the actions a user performs in a browser window, such as navigating to a website, filling out forms (including text boxes, radio buttons, and drop-downs), submitting forms, browsing through pages, and handling pop-ups. Because it drives a real browser, the page's JavaScript runs as usual, so the crawler can read the content after it has been rendered.
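
As a rough sketch of how this can look in practice, the helper below opens a page, scrolls to the bottom until the page height stops growing, and returns the rendered HTML. It is not taken from the project's crawler.py: the function names, the pause length, and the scroll limit are assumptions. It uses only the WebDriver methods `get`, `execute_script`, and `page_source`, and needs the third-party selenium package plus a Chrome installation to run against a real site.

```python
import time


def make_chrome_driver():
    """Start headless Chrome (requires the selenium package and Chrome)."""
    # Imported lazily so the scrolling helper below can be read and
    # exercised without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def load_fully_scrolled(driver, url, pause=1.0, max_scrolls=20):
    """Open url, scroll until the page stops growing, return the HTML."""
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so the page is fully loaded
        last_height = new_height
    return driver.page_source


# Usage against a real site (the URL is a placeholder):
#   driver = make_chrome_driver()
#   try:
#       html = load_fully_scrolled(driver, "https://example.com/products")
#   finally:
#       driver.quit()
```

A fixed pause is the simplest option, but it is slow and can miss content on a slow connection. Selenium's explicit waits (WebDriverWait) are usually more reliable when a specific element is expected.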

Handling AJAX Loading and Infinite Loading

Sometimes fetching content from dynamic sites is actually straightforward, because such sites depend heavily on API calls. With asynchronous loading, the data is usually loaded through GET and POST requests, and you can watch these API calls in the Network tab of the browser's Developer Tools.
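
Once such a call has been spotted in the Network tab, it can often be replayed directly, without a browser. The sketch below assumes a hypothetical endpoint that accepts a `page` query parameter and returns a JSON list that is empty once the data runs out. Real sites differ, so the URL, the parameter name, and the stop condition all need adapting.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10.0, opener=urlopen):
    """GET url and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def make_page_fetcher(base_url, opener=urlopen):
    """Build fetch_page(n) for a hypothetical endpoint taking ?page=n."""
    def fetch_page(page):
        query = urlencode({"page": page})
        return fetch_json(f"{base_url}?{query}", opener=opener)
    return fetch_page


def collect_all_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # an empty page means there is nothing left to load
        items.extend(batch)
    return items


# Usage (placeholder URL):
#   fetch_page = make_page_fetcher("https://example.com/api/items")
#   all_items = collect_all_pages(fetch_page)
```

The `opener` parameter exists only so that the network call can be swapped out, for example in tests. The `max_pages` limit guards against an endpoint that never returns an empty page.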

Project API Endpoints

https://rocqjones.pythonanywhere.com/api/products/all
https://rocqjones.pythonanywhere.com/api/products=fruits
https://rocqjones.pythonanywhere.com/api/products=cerials
https://rocqjones.pythonanywhere.com/api/products=vegetables
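
A small client for these endpoints could look like the sketch below. The helper names are hypothetical, and the code assumes the endpoints return JSON, which this README does not state. Category names are spelled exactly as they appear in the URLs above.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://rocqjones.pythonanywhere.com/api"
# Spelled exactly as in the endpoint list.
CATEGORIES = ("fruits", "cerials", "vegetables")


def product_endpoint(category=None):
    """Return the URL for one category, or for all products if None."""
    if category is None:
        return f"{BASE_URL}/products/all"
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"{BASE_URL}/products={category}"


def fetch_products(category=None, timeout=10.0):
    """Fetch one endpoint; assumes the response body is JSON."""
    with urlopen(product_endpoint(category), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(product_endpoint())
print(product_endpoint("fruits"))
```

Calling `fetch_products("vegetables")` performs the actual request. The shape of the returned data depends on the server and is not documented here.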