Developed by Jiechao Li, this Python application processes a dataset of news articles, with a focus on analyzing headlines. It groups the headlines by news category, tokenizes them, and counts word occurrences, identifying the 10 most frequent words in each category. The application uses multi-threading to process the news categories concurrently and writes the results to an HTML file.
Before running the application, ensure you have:
- Python 3.x installed on your system.
- NLTK data (tokenizers and stopwords) available.
To install the application:
- Clone the repository to your local machine.
- Navigate to the project directory.
- Install the required Python packages:
pip install -r requirements.txt
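The NLTK data listed in the prerequisites can be fetched from Python once the packages are installed. The following is a minimal sketch, not the project's own setup code: the helper name `ensure_nltk_data` and the resource names `punkt` and `stopwords` are assumptions based on the "tokenizers and stopwords" requirement, and newer NLTK releases may expect `punkt_tab` in place of `punkt`.

```python
# Assumed resource names and their lookup paths inside the NLTK data directory.
# Newer NLTK releases may need "punkt_tab" instead of "punkt".
REQUIRED = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
}


def ensure_nltk_data(required=REQUIRED):
    """Download any listed NLTK resource that is not installed locally.

    Returns the names of the resources that had to be downloaded.
    """
    import nltk  # imported lazily so this file loads before NLTK is installed

    fetched = []
    for name, path in required.items():
        try:
            nltk.data.find(path)  # raises LookupError when the resource is missing
        except LookupError:
            nltk.download(name, quiet=True)
            fetched.append(name)
    return fetched


# Example (requires NLTK and network access):
#     ensure_nltk_data()
```

Calling `ensure_nltk_data()` once before the first run should be enough; later calls find the data locally and download nothing.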
To run the application, use the following command in your terminal:
python main.py data/News_Category_Dataset_v3.json
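As an illustration of how the dataset path passed on the command line might be consumed, here is a hedged sketch. It assumes the file is in JSON Lines form (one JSON object per line) with `category` and `headline` fields; the function names `load_headlines` and `run` are hypothetical and are not taken from main.py.

```python
import json
import sys
from collections import defaultdict


def load_headlines(path):
    """Group headlines by category, assuming one JSON object per line."""
    by_category = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # "category" and "headline" are assumed field names.
            by_category[record["category"]].append(record["headline"])
    return dict(by_category)


def run(argv):
    """Hypothetical entry point: expects the dataset path as the only argument."""
    if len(argv) != 2:
        print("usage: python main.py <dataset.json>", file=sys.stderr)
        return 2
    groups = load_headlines(argv[1])
    print(f"{len(groups)} categories loaded")
    return 0
```

A script built this way would end with `sys.exit(run(sys.argv))` under an `if __name__ == "__main__":` guard.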
- main.py: The main script that manages the dataset processing. It creates threads for each news category and generates the HTML report.
- htmls.py: A module for creating simple HTML files, supporting h1 headings and p paragraphs.
- data/: Contains the News_Category_Dataset_v3.json file, the dataset used by the application.
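The per-category processing described for main.py can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's actual code: it starts one thread per category, and it swaps in a regular expression and a tiny hard-coded stopword set for NLTK's tokenizer and stopword list so that it runs without extra data. The names `top_words` and `process_all` are hypothetical.

```python
import re
import threading
from collections import Counter

# Tiny stand-in stopword list; the application itself relies on NLTK stopwords.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on", "is"}

# Stand-in tokenizer: lowercase words with an optional apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def top_words(headlines, n=10):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter()
    for headline in headlines:
        tokens = TOKEN_RE.findall(headline.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


def process_all(headlines_by_category, n=10):
    """Count words for every category, one thread per category (an assumption)."""
    results = {}
    lock = threading.Lock()

    def worker(category, headlines):
        top = top_words(headlines, n)
        with lock:  # guard the shared results dict
            results[category] = top

    threads = [
        threading.Thread(target=worker, args=(category, headlines), name=category)
        for category, headlines in headlines_by_category.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every category before returning
    return results
```

The result maps each category name to its list of `(word, count)` pairs, which is a convenient shape to hand to a report writer.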
The application generates an HTML file (output.html) in the project's root directory, listing the top 10 words for each news category, with each category as an h1 heading and the words in p tags.
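A report with that shape could be produced along the following lines. This is a sketch only: it does not reproduce the interface of htmls.py, the function names `render_report` and `write_report` are hypothetical, and placing one word per p tag together with its count is an assumption about the layout.

```python
from html import escape
from pathlib import Path


def render_report(top_words_by_category):
    """Build the HTML text: one h1 per category, one p per (word, count) pair."""
    parts = ["<!DOCTYPE html>", "<html>", "<body>"]
    for category, words in top_words_by_category.items():
        parts.append(f"<h1>{escape(category)}</h1>")
        for word, count in words:
            parts.append(f"<p>{escape(word)}: {count}</p>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def write_report(top_words_by_category, path="output.html"):
    """Write the rendered report to disk and return the path used."""
    target = Path(path)
    target.write_text(render_report(top_words_by_category), encoding="utf-8")
    return target
```

Escaping category names and words keeps characters such as `&` or `<` in headlines from breaking the markup.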
Operational logs of the application are recorded in logs/main.log, including progress updates and error tracking for each news category's processing.
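One way to set up such a log file with Python's standard logging module is sketched below. The logger name `main`, the helper name `build_logger`, and the message format are assumptions; the project's real configuration may differ.

```python
import logging
from pathlib import Path


def build_logger(log_path="logs/main.log"):
    """Return a logger that appends to log_path, creating the directory if needed."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("main")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        # threadName helps tell the per-category worker threads apart.
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s [%(threadName)s] %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Worker threads can then call `logger.info(...)` for progress updates and `logger.exception(...)` inside an `except` block to record errors with their traceback.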
Due to a naming conflict encountered with the nltk library, the file originally named html.py has been renamed to htmls.py. A local file named html.py can shadow Python's standard-library html module, so the rename avoids conflicts with any modules that nltk imports internally. The application should function as intended with this change.
Jiechao Li