economist-web-scraper

A project for scraping the latest news section of The Economist.

Web Scraper

This tool scrapes the content of the latest news articles from The Economist website.

Challenges, Difficulties and Roadmap

  1. First, I examined the website's structure to understand its layout. Scraping via HTML tags looked time-consuming and fragile, since the markup can change over time, so I searched for an API that would let me fetch and parse the data instead. It turned out The Economist provides an RSS feed, so I focused on parsing that feed to extract the latest news section, which proved more reliable and efficient.

  2. Second, I selected the "The World This Week" section as the source for the latest news.

  3. Next, I adopted pydantic-xml, an XML parser that supports the lxml backend (faster, since lxml is written in C) and deserializes XML data directly into Python objects.

  4. Another challenge was handling empty XML fields such as description: I had to type them as a Union (i.e., Optional), a case the pydantic-xml documentation does not cover at all.

  5. To prevent duplicates in my data structure, I used a set, which required implementing the __eq__ and __hash__ methods on the Item schema class so that instances compare and hash correctly; the first sketch after this list shows how such a model could look.

  6. Finally, I added multithreading to speed up scraping. Scraping is an I/O-bound task, so a single thread spends most of its time waiting on the network; with multiple threads the improvement is clearly visible in the Results section (see the second sketch below).
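
Below is a minimal sketch of how steps 1 and 3-5 could fit together: the feed fetched with requests and deserialized by pydantic-xml, an Optional (Union) description field, and __eq__/__hash__ so items deduplicate in a set. The feed URL, field set, and model names are illustrative assumptions, not necessarily the repository's actual code.

    from typing import List, Optional

    import requests
    from pydantic_xml import BaseXmlModel, element


    class Item(BaseXmlModel, tag="item", search_mode="unordered"):
        title: str = element()
        link: str = element()
        # description is sometimes empty in the feed, hence Optional (Union[str, None])
        description: Optional[str] = element(default=None)

        # __eq__/__hash__ let Item instances live in a set, deduplicating by link
        def __eq__(self, other: object) -> bool:
            return isinstance(other, Item) and self.link == other.link

        def __hash__(self) -> int:
            return hash(self.link)


    class Channel(BaseXmlModel, tag="channel", search_mode="unordered"):
        items: List[Item]


    class Rss(BaseXmlModel, tag="rss", search_mode="unordered"):
        channel: Channel


    # Hypothetical feed URL for "The World This Week"
    FEED_URL = "https://www.economist.com/the-world-this-week/rss.xml"
    feed = Rss.from_xml(requests.get(FEED_URL, timeout=10).content)
    unique_items = set(feed.channel.items)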
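
And a sketch of the multithreaded fetch from step 6, reusing Item, requests, and unique_items from the sketch above; fetch_article is a hypothetical helper and the worker count is arbitrary:

    from concurrent.futures import ThreadPoolExecutor


    def fetch_article(item: Item) -> str:
        # each call blocks on network I/O, so threads overlap the waiting
        response = requests.get(item.link, timeout=10)
        response.raise_for_status()
        return response.text


    # scraping is I/O bound: a single thread mostly waits on the network,
    # so running several fetches at once yields a large speed-up
    with ThreadPoolExecutor(max_workers=8) as pool:
        bodies = list(pool.map(fetch_article, unique_items))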

Installation

  • Install requirements:
    1. Install "python >= 3.8"
    2. Install "virtualenv"
      pip install virtualenv
    3. Create the virtual environment using the following command:
      virtualenv .env
    4. Activate the virtualenv:
      • For Linux:
        source .env/bin/activate
      • For Windows:
        .\.env\Scripts\activate
    5. Now you can install the libraries and dependencies listed in the requirements file:
      pip install -r ./requirements.txt
    6. You can exit the virtual environment using the following command:
      deactivate

Run

To run the application, just run the following command:
python app.py

Results

As mentioned previously, multithreading increased performance significantly.

Multi Thread

(screenshot: run time with multiple threads)

Single Thread

(screenshot: run time with a single thread)

Improvement

Airflow could be used to automate the ETL pipeline, scheduling the extraction, transformation, and loading of the data. In addition, a synchronized data store would avoid repeatedly fetching duplicate articles, which both improves performance and prevents unnecessary data redundancy.
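
A minimal sketch of what such a DAG might look like; the task split, schedule, and callables are assumptions, and in practice the callables would wrap this project's existing scraping code:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_feed():
        # would fetch and parse the RSS feed (hypothetical placeholder)
        ...


    def load_items():
        # would write new items to a persistent store, skipping already-seen links
        ...


    with DAG(
        dag_id="economist_scraper",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_feed", python_callable=extract_feed)
        load = PythonOperator(task_id="load_items", python_callable=load_items)
        extract >> load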