Python scripts to analyze articles from exxpress with basic stylometric features, possibly identifying stylistic changes over time (e.g., from generative AI use).
- Retrieve all articles from exxpress.at via the WordPress REST API.
- Count articles per year.
- Extract and analyze all "Native Ad" articles.
- Perform stylometric analysis per category and month (e.g., average word count, sentence count, sentence length, and lexical diversity).
You'll need Python (version 3.7 or higher recommended). Install dependencies within a virtual environment (venv) for easy management.
First, download or clone the repository to your computer.
git clone https://github.com/yourusername/express.git
cd express
Set up a Python virtual environment (venv) to manage dependencies.
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
With the virtual environment activated, install the necessary packages.
pip install -r requirements.txt
Or manually install the packages:
pip install requests nltk pandas matplotlib
Some NLTK resources are required for tokenization and stopword filtering. Run the command below to ensure all resources are downloaded.
python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('punkt_tab')"
Run the crawl.py script to download all articles up to the current date from the exxpress API. This saves a JSON file named express.json with the downloaded articles.
python3 crawl.py
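For orientation, paginated retrieval from a WordPress REST API can be sketched as follows. This is not the code in crawl.py; it is a minimal standard-library sketch that assumes the site exposes the default `/wp-json/wp/v2/posts` route and reports the page count in the `X-WP-TotalPages` response header. The names `API_URL`, `fetch_page`, and `fetch_all` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the default WordPress posts route is available on the site.
API_URL = "https://exxpress.at/wp-json/wp/v2/posts"


def fetch_page(page, per_page=100):
    """Fetch one page of posts; return (posts, total_pages)."""
    query = urlencode({"page": page, "per_page": per_page})
    request = Request(f"{API_URL}?{query}", headers={"User-Agent": "stylometry-sketch"})
    with urlopen(request, timeout=30) as response:
        # WordPress reports the number of available pages in this header.
        total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
        return json.load(response), total_pages


def fetch_all(fetch=fetch_page):
    """Walk through all pages and collect every post into one list."""
    posts, page, total_pages = [], 1, 1
    while page <= total_pages:
        batch, total_pages = fetch(page)
        posts.extend(batch)
        page += 1
    return posts
```

Calling `fetch_all()` and writing the result with `json.dump` would produce a file comparable to express.json. The `fetch` parameter exists only so the paging loop can be exercised without network access.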
Run the count.py script to output the number of articles published each year.
python3 count.py
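The counting step can be sketched roughly as below. It assumes express.json holds a list of post objects, each with an ISO 8601 `date` string as returned by the WordPress REST API; the function names are hypothetical and count.py may work differently.

```python
import json
from collections import Counter


def count_by_year(posts):
    """Count posts per year, using the first four characters of the ISO date."""
    return Counter(post["date"][:4] for post in posts if post.get("date"))


def count_file(path="express.json"):
    """Load the crawled posts from disk and count them per year."""
    with open(path, encoding="utf-8") as handle:
        return count_by_year(json.load(handle))
```

Printing `sorted(count_file().items())` would list the years in chronological order.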
To extract all articles tagged as "Native Ad" into a separate JSON file, run nativead.py. This will create a file called native-ad.json with only Native Ad articles.
python3 nativead.py
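How nativead.py identifies Native Ad articles is not described here. One plausible approach, shown as a sketch only, relies on the fact that WordPress posts carry `tags` and `categories` as lists of integer term IDs. The function name `filter_by_term` and the idea that "Native Ad" maps to a single term ID are assumptions.

```python
import json


def filter_by_term(posts, term_id, field="tags"):
    """Keep posts whose term list (tags or categories) contains term_id."""
    return [post for post in posts if term_id in post.get(field, [])]


def write_subset(posts, path="native-ad.json"):
    """Save the filtered posts as a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(posts, handle, ensure_ascii=False, indent=2)
```

The actual term ID would have to be looked up first, for example through the site's `/wp-json/wp/v2/tags` or `/wp-json/wp/v2/categories` endpoint.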
Run the analyze.py script to analyze each article's stylometric features per category per month. This outputs:
- category_monthly_stats.xlsx: An Excel file with monthly statistics for each category.
- Line charts for each stylometric feature saved as .png files.
python3 analyze.py
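To make the four features concrete, here is a dependency-free sketch of how they could be computed for a single text. It deliberately swaps in simple regular expressions for NLTK's tokenizers, so its counts will differ slightly from those produced by analyze.py, and it assumes the input is plain text with any HTML already removed. Lexical diversity is taken here to be the type-token ratio, which is an assumption about the script's definition.

```python
import re


def stylometric_features(text):
    """Return word count, sentence count, mean sentence length, and type-token ratio."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercase so that "The" and "the" count as the same word type.
    words = re.findall(r"\w+", text.lower())
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_sentence_length": word_count / sentence_count if sentence_count else 0.0,
        "lexical_diversity": len(set(words)) / word_count if word_count else 0.0,
    }
```

For the text "The cat sat. The dog ran fast!" this yields 7 words, 2 sentences, an average sentence length of 3.5 words, and a lexical diversity of 6/7.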
- category_monthly_stats.xlsx: Monthly statistics for each category, with features like average word count, sentence count, sentence length, and lexical diversity.
- Feature Plots: PNG images, each plotting a stylometric feature over time for different categories.
- native-ad.json: A JSON file containing only Native Ad articles.
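The monthly statistics amount to grouping per-article feature values by category and month and averaging each group. A minimal sketch of that aggregation follows, using only the standard library rather than pandas. The row layout (dicts with `category`, `date`, and a numeric feature key) and the function name `monthly_mean` are hypothetical.

```python
from collections import defaultdict


def monthly_mean(rows, feature):
    """Average one feature per (category, YYYY-MM) group."""
    buckets = defaultdict(list)
    for row in rows:
        month = row["date"][:7]  # "2023-01-05T10:00:00" -> "2023-01"
        buckets[(row["category"], month)].append(row[feature])
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```

Each key of the result corresponds to one category and month, and plotting the values over time per category gives the kind of line chart described above.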
This project is licensed under the MIT License.