This project collects Python scripts for web scraping and data transformation: extracting data from websites, then cleaning and manipulating it with Excel and Pandas. The aim is to support data collection and preparation for analysis.

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Web-Scraping-And-Data-Transformation

Overview

This repository is a collection of Python scripts that demonstrate web scraping techniques for extracting data from websites, along with data transformation methods using Excel and Python Pandas. The examples include scraping weather forecasts from BBC Weather, extracting IMDb's top movie listings, and geocoding addresses, among others. The data cleaning and transformation topics include removing outliers, splitting text, and creating Pivot Tables for analysis.

Contents

  1. Scraping Web Pages

    • BBC Weather: Scrape weather data for a specified city using BeautifulSoup and requests.
    • IMDb Top Movies: Extract IMDb's top movie listings and their ratings from the IMDb website.
    • Wikipedia Data Extraction: Extract information from Wikipedia pages using the Wikipedia library.
    • Geocoding: Obtain latitude and longitude coordinates for addresses and calculate distances between cities.
    • PDF Scraping: Extract tables from PDF documents using the Tabula library.
  2. Data Cleaning using Excel

    • Find and Replace: Perform find and replace operations to clean data.
    • Changing Data Formats: Change data formats to standardize data representation.
    • Removing Extra Spaces: Remove extra spaces in data using Excel's Trim function.
    • Removing Blanks and Duplicates: Remove blank rows and duplicate entries to clean data.
  3. Data Transformation in Excel

    • Removing Outliers: Identify and remove outliers from data using Pivot Tables and charts.
    • Splitting Text: Split text into multiple columns using Excel's Text to Columns feature.
    • Extracting Date Components: Extract month, year, and week number from date columns.
  4. Data Aggregation

    • Column and Row Removal: Remove empty or unnecessary columns and rows from datasets.
    • Pivot Tables: Create Pivot Tables for aggregating and summarizing data.
    • Conditional Formatting: Apply conditional formatting and sparklines for visual data analysis.
  5. Profiling with Pandas

    • HTML Reports: Generate HTML reports that detail outliers, correlations, and other dataset statistics using Pandas profiling.
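
The Geocoding item above mentions calculating distances between cities. Once latitude and longitude are known, one common way to do this is the haversine formula. The following is a minimal sketch using only the standard library; it is not necessarily the method the project's scripts use, and the function name, the Earth radius constant, and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, chosen for illustration only.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(f"London to Paris: {haversine_km(*london, *paris):.0f} km")
```

The result treats the Earth as a sphere, so it can differ slightly from distances reported by geocoding libraries that use an ellipsoidal model.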

Usage

Each section contains Python scripts and instructions on how to use them. Additionally, Excel files and PDF documents are provided for practicing data cleaning and transformation techniques.
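
For readers who want to try the Excel cleaning steps (trimming extra spaces, removing blanks and duplicates) in Pandas instead, a rough equivalent might look like the sketch below. The sample data, column names, and the `clean` helper are hypothetical and do not come from the provided practice files.

```python
import pandas as pd

# Hypothetical sample data standing in for a practice workbook.
raw = pd.DataFrame(
    {
        "city": ["  London ", "Paris", "Paris", None, "New   York"],
        "sales": [100, 200, 200, None, 150],
    }
)


def clean(df):
    """Trim whitespace in 'city', then drop blank rows and duplicate rows."""
    out = df.copy()
    # Roughly mimic Excel's TRIM: strip both ends and collapse inner whitespace.
    out["city"] = out["city"].str.strip().str.replace(r"\s+", " ", regex=True)
    out = out.dropna(how="all")   # remove rows where every cell is empty
    out = out.drop_duplicates()   # remove exact duplicate rows
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Trimming before de-duplicating matters here: rows that differ only in stray spaces would otherwise survive `drop_duplicates`.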

Dependencies

  • Python 3.x
  • Required Python packages (specified in each script)
  • Excel (for Excel-related tasks)
  • Tabula (for PDF scraping)
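
Because the required packages are listed per script, it can help to check what is already installed before running anything. The sketch below uses only the standard library; the module names in `ASSUMED_MODULES` are guesses based on the tools named in this README, and the scripts themselves remain the authoritative list.

```python
import importlib.util

# Assumed import names for tools mentioned in this README (not confirmed
# against the scripts): requests, BeautifulSoup, Wikipedia, Tabula, Pandas.
ASSUMED_MODULES = ["requests", "bs4", "wikipedia", "tabula", "pandas"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(ASSUMED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All assumed modules are available.")
```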