HTML Text Analysis and Visualization

This repository contains Python code that extracts text from the HTML page at a given URL, preprocesses it, and then analyzes and visualizes the result: word cloud generation, word frequency analysis, and a scatter plot of word frequencies. These outputs help identify the key themes in the text.

Introduction

Vast amounts of information are published online as HTML pages, and extracting insights from this unstructured text can be challenging. This script simplifies the process by providing tools to extract, preprocess, analyze, and visualize text obtained from HTML pages.

Usage

  1. Run the script main.py.
  2. Enter the URL of the HTML page you want to analyze.
  3. The script extracts the text from the HTML, preprocesses it, and runs the frequency analysis.
  4. The script displays the visualizations: a word cloud, a bar plot of the most frequent words, and a scatter plot of word frequencies.
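The extraction step can be sketched with the standard library alone. This is an illustration, not the code in main.py: the names `VisibleTextParser`, `html_to_text`, and `fetch_html` are hypothetical, and the actual script may rely on a dedicated scraping library instead.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url, timeout=10):
    """Download a page and decode it (hypothetical helper; needs network access)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = "<html><body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
    print(html_to_text(sample))
```

To analyze a live page, pass the result of `fetch_html(url)` to `html_to_text`. Keeping the download separate from the parsing makes the parsing step easy to try on a local HTML string.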

Features

  • Text Extraction: Extract text from HTML pages using web scraping techniques.
  • Data Preprocessing: Clean and preprocess the text data by removing punctuation, normalizing text, tokenizing, lemmatizing, and removing stop words.
  • Word Cloud Generation: Generate visually appealing word clouds to visualize the most frequent words in the text data.
  • Frequency Analysis: Analyze the frequency of words and visualize the top N most frequent words using bar plots.
  • Scatter Plot Visualization: Plot word frequency against rank as a color-coded scatter plot to show how word frequencies are distributed.
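The preprocessing and frequency features listed above can be approximated in a few lines. This is a simplified sketch under stated assumptions: `STOP_WORDS` is a tiny illustrative list, `preprocess` and `top_words` are hypothetical names, and lemmatization is left out because it needs an NLP library that this example does not assume.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a
# much larger list, typically taken from an NLP library.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "in", "is", "it",
    "of", "on", "that", "the", "this", "to", "with",
}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation and digits, and remove stop words.

    Lemmatization is intentionally omitted in this sketch.
    """
    normalized = text.lower()
    # Keeping only runs of ASCII letters removes punctuation and digits;
    # it also drops accented characters, which is a simplification.
    tokens = re.findall(r"[a-z]+", normalized)
    return [t for t in tokens if t not in stop_words and len(t) > 1]


def top_words(tokens, n=20):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


if __name__ == "__main__":
    tokens = preprocess("The cat and the hat. The CAT sat!")
    print(top_words(tokens, n=3))
```

The `(word, count)` pairs returned by `top_words` are the kind of data a bar plot of the top N words would be drawn from.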

Examples

Below are some examples of visualizations generated by this script:

  • Word Cloud
  • Top 20 Most Frequent Words
  • Word Frequency vs Rank (Colorful Scatter)
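As a rough illustration of how a frequency versus rank scatter plot like the last example could be produced, the sketch below ranks words by count and plots them with matplotlib. The function names, the log-scaled axes, the `viridis` colormap, and saving to a file instead of displaying a window are all assumptions and may differ from what main.py does.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(counts, start=1)]


def plot_rank_frequency(triples, path="rank_frequency.png"):
    """Save a color-coded scatter plot of frequency against rank.

    Requires matplotlib, imported here so that rank_frequency() can be
    used without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    ranks = [rank for rank, _, _ in triples]
    freqs = [freq for _, _, freq in triples]

    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(ranks, freqs, c=ranks, cmap="viridis")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Rank")
    ax.set_ylabel("Frequency")
    ax.set_title("Word Frequency vs Rank")
    fig.colorbar(points, ax=ax, label="Rank")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


if __name__ == "__main__":
    sample_tokens = ["data", "text", "data", "plot", "data", "text"]
    plot_rank_frequency(rank_frequency(sample_tokens))
```

Log-scaled axes are a common choice for this kind of plot because word frequencies usually fall off steeply with rank.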