Collecting and Exploring Datasets

This repo contains the code and results of my explorations of text data from different sources. The aim is to create a dataset of related long and short texts from the same source or platform. Long texts are full-texts such as blog posts or news articles; short texts, such as abstracts or taglines, should be similar in length to social media posts.

First idea: The New York Times API
The NYT offers a set of APIs, including one for its article archive. The Archive API returns partial article data (e.g. headline, abstract) and metadata such as the section an article belongs to (e.g. Arts, News) and its length (word count).
Unfortunately, unlimited access to the full-texts in the article archive is only available to subscribers.

Although I won't be using their data for my research project, the available data is useful for finding out some characteristics of news articles, e.g. their average length or how long their abstracts and titles are. This information could help in finding similar open-access data.
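A minimal sketch of how Archive API metadata could be reduced to the length statistics used here. The endpoint pattern and the field names (`headline.main`, `abstract`, `section_name`, `word_count`) are assumptions based on the public API documentation; `archive_request`, `extract_fields` and `sample_doc` are hypothetical names and not code from this repo.

```python
# Assumed endpoint pattern of the NYT Archive API; check the developer docs before use.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def archive_request(year, month, api_key):
    """Return the URL and query parameters for one month of archive metadata."""
    url = ARCHIVE_URL.format(year=year, month=month)
    return url, {"api-key": api_key}


def extract_fields(doc):
    """Keep only the parts of an archive item needed for length statistics."""
    headline = (doc.get("headline") or {}).get("main") or ""
    abstract = doc.get("abstract") or ""
    return {
        "section": doc.get("section_name"),
        "word_count": doc.get("word_count", 0),
        "word_count_headline": len(headline.split()),
        "word_count_abstract": len(abstract.split()),
    }


# Hypothetical item, shaped like one entry of the list of documents in a response.
sample_doc = {
    "headline": {"main": "An Example Headline With Six Words"},
    "abstract": "A short abstract that summarises the article in one sentence.",
    "section_name": "Arts",
    "word_count": 950,
}

url, params = archive_request(2022, 9, "YOUR_API_KEY")
row = extract_fields(sample_doc)
```

The URL and parameters can be passed to any HTTP client. The word counts here are plain whitespace splits, which may differ slightly from the counts in the interim CSV files.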

Repo Structure

.
├── README.md
├── config.ini
├── data
│   ├── raw
│   └── interim
├── notebooks
│   ├── 01_nyt_apis.ipynb
│   ├── 02_guardian_api.ipynb
│   ├── 03_twitter_api.ipynb
│   └── 04_explore_datasets.ipynb
├── requirements.in
├── requirements.txt
└── src
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   ├── clean_text.py
    │   └── make_dataset.py
    └── main.py

Datasets

NYT Dataset:
The dataset contains partial article data from the NYT Archive API from 10/2021 to 09/2022, approx. 34,700 items.

GU Dataset:
The dataset contains article data from the Search API of The Guardian (GU) from 27/09/2021 to 27/09/2022, approx. 75,000 items.

GU Tweets Dataset:
The dataset contains tweets from The Guardian main account (@guardian), with 3,200 items (maximum allowed number of tweets from a user as per Twitter API limitations).
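As a rough illustration of how the GU data could be requested, the sketch below builds the query parameters for one page of search results. The parameter names follow the public Guardian Content API documentation as I understand it; the selection of fields, the page size and the helper name `gu_search_params` are assumptions, not the settings used in `make_dataset.py`.

```python
from datetime import date

# Search endpoint of the Guardian Content API.
GU_SEARCH_URL = "https://content.guardianapis.com/search"


def gu_search_params(from_date, to_date, page=1, api_key="YOUR_API_KEY"):
    """Build query parameters for one page of search results (hypothetical helper)."""
    return {
        "from-date": from_date.isoformat(),
        "to-date": to_date.isoformat(),
        # Assumed field selection; it mirrors columns that appear in the GU dataset.
        "show-fields": "headline,trailText,body,bodyText,wordcount,charCount,byline",
        "show-tags": "all",
        "page-size": 50,
        "page": page,
        "api-key": api_key,
    }


params = gu_search_params(date(2021, 9, 27), date(2022, 9, 27))
```

Collecting a full year would mean repeating the request with an increasing `page` value until the API reports no further pages.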

import os
import glob
import numpy as np
import pandas as pd
import re

from itertools import chain

import matplotlib.pyplot as plt
import seaborn as sns

# Notebook settings
import warnings
warnings.filterwarnings('ignore')
# Set data directory paths
nyt_path = 'data/interim/nyt_data'
gu_path = 'data/interim/gu_data'
gu_twitter_path = 'data/interim/gu_twitter_data/gu_tweets.csv'
# Load NYT data files and combine them
nyt_files = glob.glob(os.path.join(nyt_path, '*.csv'))
nyt_lst = []

for file in nyt_files:
    nyt_single_df = pd.read_csv(file)
    nyt_lst.append(nyt_single_df)

nyt_df = pd.concat(nyt_lst)
# Load GU data files and combine them
gu_files = glob.glob(os.path.join(gu_path, '*.csv'))
gu_lst = []

for file in gu_files:
    gu_single_df = pd.read_csv(file)
    gu_lst.append(gu_single_df)

gu_df = pd.concat(gu_lst)
# Load GU Twitter data
twitter_df = pd.read_csv(gu_twitter_path)

The New York Times

nyt_df.describe()
         word_count  word_count_headline  word_count_abstract
count  34752.000000         34752.000000         34752.000000
mean     991.256417             9.483943            22.524660
std      698.379953             2.946029             7.412678
min        0.000000             1.000000             1.000000
25%      525.000000             8.000000            18.000000
50%      934.000000            10.000000            23.000000
75%     1315.000000            11.000000            27.000000
max    20573.000000            22.000000           103.000000
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(12,5))
fig.suptitle('Word counts for NYT articles')
fig.supylabel('Frequency')

ax1.hist(nyt_df['word_count_headline'], bins=20)
ax1.set_title('Headline')
ax2.hist(nyt_df['word_count_abstract'], bins=100)
ax2.set_title('Abstract')
ax3.hist(nyt_df['word_count'], bins=100)
ax3.set_title('Full-text')

for ax in (ax1, ax2, ax3):
    ax.set(xlabel='Word count')

plt.show()

[Figure: histograms of word counts for NYT headlines, abstracts, and full-texts]

The Guardian

gu_df.describe()
          wordcount     charCount  word_count_headline  word_count_trailText
count  74956.000000  74956.000000         74956.000000          74956.000000
mean     780.012301   4644.333662            11.643658             19.906385
std      464.920930   2726.342686             2.943087              5.498631
min        0.000000      0.000000             2.000000              1.000000
25%      498.000000   2980.000000            10.000000             16.000000
50%      711.000000   4239.000000            12.000000             19.000000
75%      947.000000   5654.000000            13.000000             23.000000
max     9633.000000  54826.000000            28.000000             77.000000
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(12,5))
fig.suptitle('Word counts for GUARDIAN articles')
fig.supylabel('Frequency')

ax1.hist(gu_df['word_count_headline'], bins=30)
ax1.set_title('Headline')
ax2.hist(gu_df['word_count_trailText'], bins=100)
ax2.set_title('Abstract')
ax3.hist(gu_df['wordcount'], bins=100)
ax3.set_title('Full-text')

for ax in (ax1, ax2, ax3):
    ax.set(xlabel='Word count')

plt.show()

[Figure: histograms of word counts for GUARDIAN headlines, abstracts (trail texts), and full-texts]

Tweets from The Guardian

twitter_df.head()
created_at id text clean_text word_count
0 Wed Sep 28 21:17:47 +0000 2022 1575233331943309339 Morning mail: hurricane with 240km/h winds hit... Morning mail hurricane with kmh winds hits Flo... 14
1 Wed Sep 28 21:17:46 +0000 2022 1575233327329574924 R Kelly ordered to pay restitution of $300,000... R Kelly ordered to pay restitution of to his v... 10
2 Wed Sep 28 21:17:44 +0000 2022 1575233321063292928 Two aircraft involved in ‘minor collision’ on ... Two aircraft involved in ‘minor collision’ on ... 10
3 Wed Sep 28 21:10:00 +0000 2022 1575231372490309658 We’re keen to hear from people who have recent... We’re keen to hear from people who have recent... 25
4 Wed Sep 28 21:03:05 +0000 2022 1575229631912869894 Guardian front page, Thursday 29 September 202... Guardian front page Thursday September Banks £... 12
twitter_df.describe()
                 id   word_count
count  3.200000e+03  3200.000000
mean   1.572793e+18    12.177188
std    1.462916e+15     3.107159
min    1.570134e+18     1.000000
25%    1.571554e+18    10.000000
50%    1.572913e+18    12.000000
75%    1.574037e+18    14.000000
max    1.575233e+18    25.000000
plt.hist(twitter_df['word_count'], bins=25)
plt.xlabel('Word count')
plt.ylabel('Frequency')
plt.title('Histogram of Word Counts in GUARDIAN Tweets')
plt.show()

[Figure: histogram of word counts in GUARDIAN tweets]
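The `clean_text` column above looks as if digits and ASCII punctuation were stripped before counting words (e.g. "240km/h" becomes "kmh"). The sketch below is one way to get that behaviour. It is an assumption inferred from the sample rows, not the actual contents of `src/data/clean_text.py`, and the URL removal step is my own addition.

```python
import re
import string

# Assumed cleaning steps: drop URLs, digits and ASCII punctuation, then collapse whitespace.
URL_RE = re.compile(r"https?://\S+")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Return the text without URLs, digits and ASCII punctuation."""
    text = URL_RE.sub(" ", text)
    text = DIGIT_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    # split/join collapses runs of whitespace left behind by the removals
    return " ".join(text.split())


def word_count(text):
    """Count whitespace-separated tokens in the cleaned text."""
    return len(clean_text(text).split())


example = "Morning mail: hurricane with 240km/h winds hits Florida https://t.co/abc123"
cleaned = clean_text(example)
n_words = word_count(example)
```

Typographic characters such as curly quotes or the pound sign are not part of `string.punctuation`, so they survive this cleaning, which matches what the sample rows show.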


The Guardian Dataset Features

gu_df.head()
id sectionId sectionName webPublicationDate webUrl apiUrl pillarId pillarName byline body ... bylineHtml fields.contributorBio scheduledPublicationDate tag_ids tag_webTitles tag_webUrls cl_headline cl_trailText word_count_headline word_count_trailText
0 sport/blog/2022/jun/27/imperious-nsw-seize-adv... sport Sport 2022-06-26T23:46:44Z https://www.theguardian.com/sport/blog/2022/ju... https://content.guardianapis.com/sport/blog/20... pillar/sport Sport Nick Tedeschi <p>A star was born in debutant Matt Burton. An... ... <a href="profile/nick-tedeschi">Nick Tedeschi</a> NaN NaN ['sport/state-of-origin', 'sport/rugbyleague',... ['State of Origin', 'Rugby league', 'NRL', 'Au... ['https://www.theguardian.com/sport/state-of-o... Imperious NSW seize advantage after Queensland... Poor tackling basic handling errors and a lack... 10 23
1 music/2022/jun/27/kendrick-lamar-at-glastonbur... music Music 2022-06-26T23:32:10Z https://www.theguardian.com/music/2022/jun/27/... https://content.guardianapis.com/music/2022/ju... pillar/arts Arts Alexis Petridis <p>As Glastonbury 2022 draws to a close, a var... ... <a href="profile/alexispetridis">Alexis Petrid... NaN NaN ['music/glastonbury-2022', 'music/kendrick-lam... ['Glastonbury 2022', 'Kendrick Lamar', 'Music'... ['https://www.theguardian.com/music/glastonbur... Kendrick Lamar at Glastonbury review – faith f... Sporting a bejewelled crown of thorns and with... 11 25
2 world/2022/jun/27/garbage-island-no-more-how-o... world World news 2022-06-26T23:06:58Z https://www.theguardian.com/world/2022/jun/27/... https://content.guardianapis.com/world/2022/ju... pillar/news News Justin McCurry on Teshima island <p>Toru Ishii remembers when the shredded car ... ... <a href="profile/justinmccurry">Justin McCurry... NaN NaN ['world/japan', 'world/asia-pacific', 'world/w... ['Japan', 'Asia Pacific', 'World news', 'Envir... ['https://www.theguardian.com/world/japan', 'h... Garbage island no more how one Japanese commun... Teshima – site of Japan’s worst case of illega... 14 27
3 media/2022/jun/27/young-people-must-report-har... media Media 2022-06-26T23:01:00Z https://www.theguardian.com/media/2022/jun/27/... https://content.guardianapis.com/media/2022/ju... pillar/news News Dan Milmo <p>Young people should report harmful online c... ... <a href="profile/danmilmo">Dan Milmo</a> NaN NaN ['media/ofcom', 'media/social-media', 'society... ['Ofcom', 'Social media', 'Online abuse', 'You... ['https://www.theguardian.com/media/ofcom', 'h... Young people must report harmful online conten... Ofcom says of to yearolds have seen harmful co... 10 16
4 stage/2022/jun/27/mad-house-review-david-harbo... stage Stage 2022-06-26T23:01:00Z https://www.theguardian.com/stage/2022/jun/27/... https://content.guardianapis.com/stage/2022/ju... pillar/arts Arts Arifa Akbar <p>Theresa Rebeck’s play opens as a dysfunctio... ... <a href="profile/arifa-akbar">Arifa Akbar</a> NaN NaN ['stage/theatre', 'stage/stage', 'culture/cult... ['Theatre', 'Stage', 'Culture', 'Article', 'Re... ['https://www.theguardian.com/stage/theatre', ... Mad House review – David Harbour and Bill Pull... Theresa Rebeck’s play follows the relationship... 14 16

5 rows × 25 columns

gu_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74956 entries, 0 to 6145
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   id                        74956 non-null  object
 1   sectionId                 74956 non-null  object
 2   sectionName               74956 non-null  object
 3   webPublicationDate        74956 non-null  object
 4   webUrl                    74956 non-null  object
 5   apiUrl                    74956 non-null  object
 6   pillarId                  74453 non-null  object
 7   pillarName                74453 non-null  object
 8   byline                    73447 non-null  object
 9   body                      74956 non-null  object
 10  wordcount                 74956 non-null  int64 
 11  publication               74956 non-null  object
 12  lang                      74956 non-null  object
 13  bodyText                  74757 non-null  object
 14  charCount                 74956 non-null  int64 
 15  bylineHtml                73447 non-null  object
 16  fields.contributorBio     15 non-null     object
 17  scheduledPublicationDate  3 non-null      object
 18  tag_ids                   74956 non-null  object
 19  tag_webTitles             74956 non-null  object
 20  tag_webUrls               74956 non-null  object
 21  cl_headline               74956 non-null  object
 22  cl_trailText              74904 non-null  object
 23  word_count_headline       74956 non-null  int64 
 24  word_count_trailText      74956 non-null  int64 
dtypes: int64(4), object(21)
memory usage: 14.9+ MB

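NYT articles tend to be longer than GU articles (median 934 vs. 711 words). Headlines show the opposite pattern: GU headlines are slightly longer (median 12 vs. 10 words). Abstracts are similar in length in both sources (median 23 words for NYT abstracts, 19 for GU trail texts).
With a median of 12 words per tweet, the article part closest in length to the tweets is the headline.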
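To put this comparison into a single table, something like the following could be used. It is only a sketch: `length_summary` is a hypothetical helper, and the toy frames stand in for `nyt_df`, `gu_df` and `twitter_df`.

```python
import pandas as pd


def length_summary(frames):
    """Summarise word counts; frames maps a label to (DataFrame, column name)."""
    rows = {
        label: df[column].agg(["median", "mean", "max"])
        for label, (df, column) in frames.items()
    }
    # One row per text type, one column per statistic
    return pd.DataFrame(rows).T


# Toy data for illustration only, not taken from the real datasets.
toy_tweets = pd.DataFrame({"word_count": [10, 12, 14]})
toy_headlines = pd.DataFrame({"word_count_headline": [8, 10, 11]})

summary = length_summary({
    "tweets": (toy_tweets, "word_count"),
    "headlines": (toy_headlines, "word_count_headline"),
})
```

With the notebook's data, the mapping would contain entries such as `"GU headline": (gu_df, "word_count_headline")` and `"GU tweets": (twitter_df, "word_count")`.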

# Check for duplicates
gu_df[gu_df.duplicated(['id','webUrl'], keep=False)]
id sectionId sectionName webPublicationDate webUrl apiUrl pillarId pillarName byline body ... bylineHtml fields.contributorBio scheduledPublicationDate tag_ids tag_webTitles tag_webUrls cl_headline cl_trailText word_count_headline word_count_trailText
4912 lifeandstyle/2022/sep/05/work-therapy-can-a-so... lifeandstyle Life and style 2022-09-04T17:30:03Z https://www.theguardian.com/lifeandstyle/2022/... https://content.guardianapis.com/lifeandstyle/... pillar/lifestyle Lifestyle Jenny Valentish <p>Whether you’re a quiet quitter or a 24/7 si... ... <a href="profile/jenny-valentish">Jenny Valent... NaN NaN ['lifeandstyle/australian-lifestyle', 'music/a... [Australian lifestyle, Australian music, Artic... ['https://www.theguardian.com/lifeandstyle/aus... Work therapy can a social media coach talk a s... Musician Dave McCormack hates social media but... 15 27
4913 lifeandstyle/2022/sep/05/work-therapy-can-a-so... lifeandstyle Life and style 2022-09-04T17:30:03Z https://www.theguardian.com/lifeandstyle/2022/... https://content.guardianapis.com/lifeandstyle/... pillar/lifestyle Lifestyle Jenny Valentish <p>Whether you’re a quiet quitter or a 24/7 si... ... <a href="profile/jenny-valentish">Jenny Valent... NaN NaN ['lifeandstyle/australian-lifestyle', 'music/a... [Australian lifestyle, Australian music, Artic... ['https://www.theguardian.com/lifeandstyle/aus... Work therapy can a social media coach talk a s... Musician Dave McCormack hates social media but... 15 27
1247 film/2022/aug/21/anais-in-love-review-anais-de... film Film 2022-08-21T07:00:02Z https://www.theguardian.com/film/2022/aug/21/a... https://content.guardianapis.com/film/2022/aug... pillar/arts Arts Wendy Ide <p>She’s a familiar character. Skittish, self-... ... <a href="profile/wendy-ide">Wendy Ide</a> NaN NaN ['film/drama', 'film/film', 'culture/culture',... [Drama films, Film, Culture, World cinema, Art... ['https://www.theguardian.com/film/drama', 'ht... Anaïs in Love review – Anaïs Demoustier intoxi... Cinema’s latest irresistible chaotic femme Dem... 13 16
1248 lifeandstyle/2022/aug/21/emma-beddington-my-ki... lifeandstyle Life and style 2022-08-21T07:00:02Z https://www.theguardian.com/lifeandstyle/2022/... https://content.guardianapis.com/lifeandstyle/... pillar/lifestyle Lifestyle Emma Beddington <p>I’m nearly an empty nester. That conjures u... ... <a href="profile/emma-beddington">Emma Bedding... NaN NaN ['lifeandstyle/parents-and-parenting', 'lifean... [Parents and parenting, Family, Life and style... ['https://www.theguardian.com/lifeandstyle/par... My kids have moved out but please don’t call i... I want to take issue with this nests metaphor ... 13 19
1249 film/2022/aug/21/anais-in-love-review-anais-de... film Film 2022-08-21T07:00:02Z https://www.theguardian.com/film/2022/aug/21/a... https://content.guardianapis.com/film/2022/aug... pillar/arts Arts Wendy Ide <p>She’s a familiar character. Skittish, self-... ... <a href="profile/wendy-ide">Wendy Ide</a> NaN NaN ['film/drama', 'film/film', 'culture/culture',... [Drama films, Film, Culture, World cinema, Art... ['https://www.theguardian.com/film/drama', 'ht... Anaïs in Love review – Anaïs Demoustier intoxi... Cinema’s latest irresistible chaotic femme Dem... 13 16
1250 lifeandstyle/2022/aug/21/emma-beddington-my-ki... lifeandstyle Life and style 2022-08-21T07:00:02Z https://www.theguardian.com/lifeandstyle/2022/... https://content.guardianapis.com/lifeandstyle/... pillar/lifestyle Lifestyle Emma Beddington <p>I’m nearly an empty nester. That conjures u... ... <a href="profile/emma-beddington">Emma Bedding... NaN NaN ['lifeandstyle/parents-and-parenting', 'lifean... [Parents and parenting, Family, Life and style... ['https://www.theguardian.com/lifeandstyle/par... My kids have moved out but please don’t call i... I want to take issue with this nests metaphor ... 13 19
1390 world/2022/aug/20/estonia-europe-inflation-hot... world World news 2022-08-20T07:00:33Z https://www.theguardian.com/world/2022/aug/20/... https://content.guardianapis.com/world/2022/au... pillar/news News Daniel Boffey in Tallinn <p>Like his cappuccinos, Taniel Vaaderpass, 33... ... <a href="profile/daniel-boffey">Daniel Boffey<... NaN NaN ['world/estonia', 'business/inflation', 'busin... [Estonia, Inflation, Eurozone, Business, Econo... ['https://www.theguardian.com/world/estonia', ... ‘I am not blaming anyone’ Estonians shrug off ... Those in Europe’s inflation hotspot remain cal... 9 23
1392 world/2022/aug/20/estonia-europe-inflation-hot... world World news 2022-08-20T07:00:33Z https://www.theguardian.com/world/2022/aug/20/... https://content.guardianapis.com/world/2022/au... pillar/news News Daniel Boffey in Tallinn <p>Like his cappuccinos, Taniel Vaaderpass, 33... ... <a href="profile/daniel-boffey">Daniel Boffey<... NaN NaN ['world/estonia', 'business/inflation', 'busin... [Estonia, Inflation, Eurozone, Business, Econo... ['https://www.theguardian.com/world/estonia', ... ‘I am not blaming anyone’ Estonians shrug off ... Those in Europe’s inflation hotspot remain cal... 9 23
2580 politics/2022/may/15/labour-keir-starmer-tory-... politics Politics 2022-05-15T16:00:34Z https://www.theguardian.com/politics/2022/may/... https://content.guardianapis.com/politics/2022... pillar/news News Jessica Elgot Chief political correspondent <p>Labour activists in Durham have called on a... ... <a href="profile/jessica-elgot">Jessica Elgot<... NaN NaN ['politics/labour', 'politics/conservatives', ... [Labour, Conservatives, Keir Starmer, Politics... ['https://www.theguardian.com/politics/labour'... Labour activists call on Tory MP to withdraw B... Local chair accuses Richard Holden of ‘wasting... 10 18
2582 politics/2022/may/15/labour-keir-starmer-tory-... politics Politics 2022-05-15T16:00:34Z https://www.theguardian.com/politics/2022/may/... https://content.guardianapis.com/politics/2022... pillar/news News Jessica Elgot Chief political correspondent <p>Labour activists in Durham have called on a... ... <a href="profile/jessica-elgot">Jessica Elgot<... NaN NaN ['politics/labour', 'politics/conservatives', ... [Labour, Conservatives, Keir Starmer, Politics... ['https://www.theguardian.com/politics/labour'... Labour activists call on Tory MP to withdraw B... Local chair accuses Richard Holden of ‘wasting... 10 18

10 rows × 25 columns

# Drop duplicates
gu_df.drop_duplicates(['id', 'webUrl'], keep='first', inplace=True, ignore_index=True)

Articles per Category/Section in The Guardian

Number of articles per section

gu_df['sectionName'].value_counts().nlargest(25).sort_values(ascending=True).plot(kind='barh')
plt.xlabel("Number of Articles", labelpad=14)
plt.ylabel("Section", labelpad=14)
plt.title("Number of Articles per Section in GUARDIAN (top 25 Sections)")
plt.show()

[Figure: number of articles per section in GUARDIAN (top 25 sections)]

#gu_df.groupby('sectionName').size().sort_values(ascending=False)
with pd.option_context('display.max_rows', None):
    display(gu_df.groupby('sectionName').agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'}) \
        .sort_values([('sectionName', 'count')], ascending=False))
Columns: sectionName, then wordcount (min, max, mean, median), then number of articles (count)
World news 44 6734 794.656148 706.0 8873
Australia news 0 3674 839.524584 781.0 5593
Opinion 0 3652 859.196750 892.0 5169
Football 0 6284 747.456158 752.0 5075
Sport 0 6247 786.508210 763.0 4811
UK news 41 7842 650.519959 587.0 3933
US news 0 5215 862.981171 785.0 3877
Business 0 6361 663.485000 614.0 3800
Politics 0 5816 722.032179 657.0 3263
Environment 0 6990 762.019539 682.0 3122
Life and style 0 8000 782.904100 699.0 2951
Film 76 3949 730.911191 598.0 2511
Music 50 5751 771.810295 590.0 2409
Television & radio 0 5612 879.591774 766.0 2261
Books 0 6401 890.570594 814.5 2224
Society 51 6620 758.870447 635.0 2169
Stage 38 4131 666.072258 513.0 1550
Art and design 80 4959 914.185529 834.0 1078
Food 1 9633 823.816365 643.0 1051
Culture 0 5711 894.080000 753.5 1050
Technology 79 4100 753.202369 634.0 1013
Money 107 4231 714.265979 571.0 970
Global development 117 6062 884.699408 811.0 845
Media 60 6037 711.642857 582.0 798
News 61 7824 503.050360 227.0 695
Education 67 5699 648.908148 569.0 675
Science 38 6769 702.636228 584.5 668
Travel 0 5362 1068.965839 1079.5 644
Fashion 68 6798 666.250554 544.0 451
Law 94 2719 730.716846 650.0 279
Games 167 3354 912.715953 822.0 257
Crosswords 0 1164 318.409574 153.0 188
From the Observer 269 1605 682.702703 387.0 111
Guardian Masterclasses 30 1654 717.337209 685.5 86
Global 0 2710 816.560976 657.0 41
GNM press office 146 1118 512.696970 476.0 33
Inequality 152 2000 671.269231 614.0 26
Info 179 7066 1649.850000 1060.0 20
The invested generation 763 1376 983.937500 933.5 16
Cities 140 1094 656.307692 669.0 13
Animals farmed 882 1112 995.272727 989.0 11
Seek: The new world of work 640 1226 947.100000 928.0 10
Bonjour Provence and Côte d’Azur 34 1333 1008.111111 1065.0 9
Community of solvers 741 1271 976.888889 882.0 9
Membership 464 2561 1171.500000 1069.0 8
Spotify: Morning moods 576 1364 932.375000 906.0 8
Weather 189 815 391.875000 312.0 8
SBS: A world of difference 615 1141 954.285714 916.0 7
A new career with The University of Law 23 1119 770.428571 822.0 7
Macquarie: Home of electric vehicles 791 1407 1129.166667 1151.0 6
On my terms 454 999 761.166667 790.5 6
Connected thinking 987 1520 1208.666667 1150.5 6
The Guardian clearing hub 128 1048 803.333333 932.0 6
From the Guardian 419 2238 825.400000 488.0 5
A vision for better food 834 1211 1056.800000 1083.0 5
Google: Helpful by nature 748 1205 1002.800000 1077.0 5
Pioneering innovation for a purposeful future 878 1101 996.000000 985.0 5
Rise with London South Bank University 740 939 839.000000 838.5 4
Conservation in action 48 1043 709.500000 873.5 4
Quest Apartment Hotels: As local as you like it 811 877 847.000000 850.0 4
Rediscover tequila 798 1168 941.500000 900.0 4
Retail careers that mean more 769 833 790.500000 780.0 4
SBS On Demand: New Gold Mountain 756 995 860.750000 846.0 4
SAP: Smart business 677 965 866.250000 911.5 4
Made with love 630 1012 808.500000 796.0 4
Scotland's stories 1148 1384 1243.250000 1220.5 4
Send smarter 14 1299 894.500000 1132.5 4
The whole picture 713 1393 989.250000 925.5 4
Toyota Australia: Journey to electric 829 1181 969.250000 933.5 4
AMC+: Only the good stuff 182 859 501.000000 481.5 4
Meta: Buy Blak 783 1035 925.500000 942.0 4
Colonial First State: Unleash your second half 654 916 777.000000 769.0 4
Help 162 473 385.000000 452.5 4
Helga's: Capturing kindness 838 1027 925.000000 917.5 4
Forefront of fintech 779 871 824.750000 824.5 4
For the love of numbers 847 874 864.750000 869.0 4
Growing for good 473 806 654.500000 669.5 4
BMW: Sustainable mobility 843 1304 1045.000000 988.0 3
HBF: Never Settle 945 959 950.666667 948.0 3
Lexus Australia: New luxury 755 1171 953.333333 934.0 3
Snooze: Investing In Sleep 767 1010 879.666667 862.0 3
Specsavers: Experts in eye care 849 1188 989.666667 932.0 3
Business Victoria: Making Headway 714 1375 975.333333 837.0 3
Spotify: Find the one 1459 1845 1592.666667 1474.0 3
Letters to Tomorrow 845 1031 918.000000 878.0 3
The Fred Hollows Foundation: 30 years of restoring sight 788 1012 897.333333 892.0 3
From the inside out 665 850 765.000000 780.0 3
The Life You Can Save: Effective giving 801 1258 959.000000 818.0 3
The need for speed 36 1014 650.666667 902.0 3
Griffith University: Make it matter 501 925 773.333333 894.0 3
Volvo Car Australia: Pure Electric 92 927 637.333333 893.0 3
Westpac Foundation: Investing in social enterprise 894 1182 1039.000000 1041.0 3
Focus 1061 4925 2428.000000 1298.0 3
City of Melbourne: FOMO 612 898 767.000000 791.0 3
Curtin University: Why study law 720 853 787.666667 790.0 3
MG Motor: Switch to electric 711 1306 955.000000 848.0 3
A time for Japan 629 919 764.666667 746.0 3
Dine: Hope Grows 560 873 725.333333 743.0 3
MINI: Serious fun 178 1062 717.000000 911.0 3
Dairy Australia: Healthy sustainable diets 51 992 629.000000 844.0 3
Monash University: The Endangered Generation 732 875 823.333333 863.0 3
Mirvac: Voyager 760 885 832.333333 852.0 3
Moccona: Make a little difference 778 803 790.500000 790.5 2
Cancer Council Victoria: Giving in Will 1083 1124 1103.500000 1103.5 2
City of Melbourne: Shop the City 817 859 838.000000 838.0 2
The University of Notre Dame: Ethical education 940 957 948.500000 948.5 2
Specsavers: An Eye for Art 911 1181 1046.000000 1046.0 2
Boutique Homes: My home my way 835 843 839.000000 839.0 2
Travel Associates: Get out there in the Red Centre 851 944 897.500000 897.5 2
Bank Australia: Code red 780 934 857.000000 857.0 2
Swinburne Edge: A new work era 126 1030 578.000000 578.0 2
Michelin: Built to keep you moving 752 913 832.500000 832.5 2
Guardian US press office 473 544 508.500000 508.5 2
SAP: Transformation mindset 796 1059 927.500000 927.5 2
Specsavers: Liberty London 642 883 762.500000 762.5 2
Honda CR-V: Joy in the detail 667 936 801.500000 801.5 2
Volvo Car Australia: Family bond 710 828 769.000000 769.0 2
Hurtigruten: Discover Norway 717 842 779.500000 779.5 2
IFAW: Help animals thrive 770 1121 945.500000 945.5 2
Plico: Renewable energy 812 883 847.500000 847.5 2
Fed Square: Sustainable September 794 971 882.500000 882.5 2
Kyndryl:The Heart of Progress 1121 1248 1184.500000 1184.5 2
Dairy Australia: fracture research 922 972 947.000000 947.0 2
OPSM: a vision for safer roads 758 980 869.000000 869.0 2
Archie Rose: Made in good spirits 832 1117 974.500000 974.5 2
Kyco: Full transparency 800 800 800.000000 800.0 1
The last taboo 737 737 737.000000 737.0 1
Mazda: Sustainable style 811 811 811.000000 811.0 1
Marine Stewardship Council: Saltwater Schools 874 874 874.000000 874.0 1
All Saints' College: The Education Revolution is Here 830 830 830.000000 830.0 1
Daikin: Pure Air 946 946 946.000000 946.0 1
Adult social care careers in Essex 934 934 934.000000 934.0 1
WeAre8: Built to do good 810 810 810.000000 810.0 1
West Australian Opera: Discover season 2022 900 900 900.000000 900.0 1
Melbourne Museum: Unearthing an icon 736 736 736.000000 736.0 1
Specsavers: Focus on health 928 928 928.000000 928.0 1
Australian World Orchestra: Zubin Mehta 812 812 812.000000 812.0 1
Monash University: Ask a lawyer 876 876 876.000000 876.0 1
Kathmandu: Sustainable future 892 892 892.000000 892.0 1
Disney: See How They Run 715 715 715.000000 715.0 1
Plan International Australia: Global Hunger Crisis Appeal 1273 1273 1273.000000 1273.0 1
NITV: Always Was, Always Will Be 869 869 869.000000 869.0 1
RSPCA Australia: RSPCA Approved Farming Scheme 1000 1000 1000.000000 1000.0 1
Releaseit: Ready to rent 672 672 672.000000 672.0 1
Curtin University: Humanities 833 833 833.000000 833.0 1
Searchlight Pictures: Nightmare Alley 761 761 761.000000 761.0 1
Sydney Opera House: Antidote festival 884 884 884.000000 884.0 1
Searchlight Pictures: The French Dispatch 832 832 832.000000 832.0 1
Canna: Small space gardening 1002 1002 1002.000000 1002.0 1
Guardian Sustainable Business 810 810 810.000000 810.0 1
Southern Cross University: better energy 1124 1124 1124.000000 1124.0 1
OPSM: Optimal health 1119 1119 1119.000000 1119.0 1
Specsavers: Wearable art 1054 1054 1054.000000 1054.0 1
Michelin: Driving the Future 737 737 737.000000 737.0 1
GNM archive 351 351 351.000000 351.0 1
# List of sections with more than 10 articles
section_counts = gu_df['sectionName'].value_counts()
greater_10_lst = section_counts[section_counts > 10].index.tolist()
gu_df[gu_df['sectionName'].isin(greater_10_lst)].groupby('sectionName')['sectionName'].agg(['count']).sort_values('count', ascending=False)
count
sectionName
World news 8873
Australia news 5593
Opinion 5169
Football 5075
Sport 4811
UK news 3933
US news 3877
Business 3800
Politics 3263
Environment 3122
Life and style 2951
Film 2511
Music 2409
Television & radio 2261
Books 2224
Society 2169
Stage 1550
Art and design 1078
Food 1051
Culture 1050
Technology 1013
Money 970
Global development 845
Media 798
News 695
Education 675
Science 668
Travel 644
Fashion 451
Law 279
Games 257
Crosswords 188
From the Observer 111
Guardian Masterclasses 86
Global 41
GNM press office 33
Inequality 26
Info 20
The invested generation 16
Cities 13
Animals farmed 11

Many sections seem to be concerned with only one very specific topic. These also contain very few articles ($\leq$ 10).
'Animals farmed' is a special series inside the Environment section. It contains articles about food production and climate issues, so it can also be considered news.
'Inequality' has articles about policies and current political and societal issues, so it also counts as news.

# Drop sections not related to 'real' news
sections_to_drop = ['Football', 'Sport', 'Life and style', 'Film', 'Music', 'Television & radio', 'Books', 'Society',
'Stage', 'Art and design', 'Food', 'Culture', 'Media', 'Travel', 'Fashion', 'Games', 'Crosswords',
'Guardian Masterclasses', 'GNM press office', 'Info', 'The invested generation', 'Cities'] 

Select only sections from news-related categories that contain more than 10 items.

gu_news_df = gu_df[~gu_df['sectionName'].isin(sections_to_drop) & gu_df['sectionName'].isin(greater_10_lst)]
with pd.option_context('display.max_rows', None):
    display(gu_news_df.groupby('sectionName').agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'}) \
        .sort_values([('sectionName', 'count')], ascending=False))
Columns: sectionName, then wordcount (min, max, mean, median), then number of articles (count)
World news 44 6734 794.656148 706.0 8873
Australia news 0 3674 839.524584 781.0 5593
Opinion 0 3652 859.196750 892.0 5169
UK news 41 7842 650.519959 587.0 3933
US news 0 5215 862.981171 785.0 3877
Business 0 6361 663.485000 614.0 3800
Politics 0 5816 722.032179 657.0 3263
Environment 0 6990 762.019539 682.0 3122
Technology 79 4100 753.202369 634.0 1013
Money 107 4231 714.265979 571.0 970
Global development 117 6062 884.699408 811.0 845
News 61 7824 503.050360 227.0 695
Education 67 5699 648.908148 569.0 675
Science 38 6769 702.636228 584.5 668
Law 94 2719 730.716846 650.0 279
From the Observer 269 1605 682.702703 387.0 111
Global 0 2710 816.560976 657.0 41
Inequality 152 2000 671.269231 614.0 26
Animals farmed 882 1112 995.272727 989.0 11
gu_news_df[gu_news_df['wordcount'] == 0][['body', 'bodyText']]
body bodyText
379 <figure class="element element-atom"> \n <gu-a... NaN
556 <figure class="element element-interactive int... NaN
1001 <figure class="element element-interactive int... NaN
1428 <figure class="element element-interactive int... NaN
2024 <figure class="element element-interactive int... NaN
... ... ...
72608 <figure class="element element-interactive int... NaN
73475 <figure class="element element-interactive int... NaN
74057 <figure class="element element-interactive int... NaN
74500 <figure class="element element-image element--... NaN
74902 <figure class="element element-interactive int... NaN

138 rows × 2 columns

Items with wordcount = 0 contain only non-text HTML content, such as quizzes, interactives or images, and their bodyText is missing (NaN).
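One way to see what kind of content these items hold is to read the type from the class of the leading `<figure>` element. The sketch below uses toy rows modelled on the bodies shown above; the regular expression and the helper `element_type` are assumptions for illustration, not part of the repo.

```python
import re

import pandas as pd

# Matches the second class of the leading figure, e.g. "atom", "interactive", "image".
ELEMENT_RE = re.compile(r'<figure class="element element-([a-z]+)')


def element_type(body):
    """Return the element type of an HTML body, or 'other' if none is found."""
    match = ELEMENT_RE.search(body)
    return match.group(1) if match else "other"


# Toy rows for illustration only.
toy_df = pd.DataFrame({
    "wordcount": [0, 0, 0, 120],
    "body": [
        '<figure class="element element-atom"> <gu-atom data-atom-type="quiz">',
        '<figure class="element element-interactive interactive">',
        '<figure class="element element-image element--showcase">',
        "<p>Regular article text.</p>",
    ],
})

empty_items = toy_df[toy_df["wordcount"] == 0]
type_counts = empty_items["body"].map(element_type).value_counts()

# Keep only items that actually contain text
with_text_df = toy_df[toy_df["wordcount"] > 0]
```

Applied to `gu_news_df`, the same two filters would show how the 138 empty items split by type and leave a frame that only contains articles with text.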

print(gu_news_df[gu_news_df['wordcount'] == 0].iloc[0]['body'])
print(gu_news_df[gu_news_df['wordcount'] == 0].iloc[120]['body'])
<figure class="element element-atom"> 
 <gu-atom data-atom-id="7c5c6f4d-068a-455b-88d9-3d0274c6c70d" data-atom-type="quiz"> 
  <div>
   <div class="quiz" data-questions-length="8" data-title="The Almost Great Electricity Crisis Quiz">
    <ol class="quiz__questions">
     <li class="quiz__question question" data-text="What is load shedding?"><p class="question__text">What is load shedding?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="A yoga term describing the relief when moving from the downward dog to a low lunge." data-correct="false"><p class="answer__text">A yoga term describing the relief when moving from the downward dog to a low lunge.</p></li>
       <li class="question__answer answer" data-text=" A fancy name for a blackout." data-correct="false"><p class="answer__text"> A fancy name for a blackout.</p></li>
       <li class="question__answer answer" data-text="A last resort when electricity market bosses have tried everything else to balance out demand with supply. " data-correct="true"><p class="answer__text">A last resort when electricity market bosses have tried everything else to balance out demand with supply. </p></li>
       <li class="question__answer answer" data-text=" Any time a power generator has to carry out planned repairs." data-correct="false"><p class="answer__text"> Any time a power generator has to carry out planned repairs.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What is an example of dispatchability?"><p class="question__text">What is an example of dispatchability?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="An off-the-shelf power source that can be put in place quickly, such as a battery or a solar panel." data-correct="false"><p class="answer__text">An off-the-shelf power source that can be put in place quickly, such as a battery or a solar panel.</p></li>
       <li class="question__answer answer" data-text="Foreign minister Penny Wong being sent to the Pacific straight after an election to rebuild Australia’s reputation on climate change." data-correct="false"><p class="answer__text">Foreign minister Penny Wong being sent to the Pacific straight after an election to rebuild Australia’s reputation on climate change.</p></li>
       <li class="question__answer answer" data-text="A shop promising immediate delivery of a power bank to keep your mobile phone going when it runs out of juice." data-correct="false"><p class="answer__text">A shop promising immediate delivery of a power bank to keep your mobile phone going when it runs out of juice.</p></li>
       <li class="question__answer answer" data-text="A source of electricity that can be controlled to keep supply and demand balanced in the system." data-correct="true"><p class="answer__text">A source of electricity that can be controlled to keep supply and demand balanced in the system.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What is a “default market offer” in the electricity sector?"><p class="question__text">What is a “default market offer” in the electricity sector?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="An electricity company can’t pay back its loans, but someone down the market is offering to sell you that company in exchange for all those power banks you keep buying. " data-correct="false"><p class="answer__text">An electricity company can’t pay back its loans, but someone down the market is offering to sell you that company in exchange for all those power banks you keep buying. </p></li>
       <li class="question__answer answer" data-text="A price set by an energy regulator that influences how much an electricity retailer can charge you." data-correct="true"><p class="answer__text">A price set by an energy regulator that influences how much an electricity retailer can charge you.</p></li>
       <li class="question__answer answer" data-text="The standard price for borrowing a market stallholder’s extension cable which can rise or fall in line with the cost of lettuce." data-correct="false"><p class="answer__text">The standard price for borrowing a market stallholder’s extension cable which can rise or fall in line with the cost of lettuce.</p></li>
       <li class="question__answer answer" data-text="The minimum cost that power generators such as a wind farm or a coal plant say they can provide electricity for." data-correct="false"><p class="answer__text">The minimum cost that power generators such as a wind farm or a coal plant say they can provide electricity for.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What is the wholesale electricity market?"><p class="question__text">What is the wholesale electricity market?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="The price a fruit and vegetable wholesaler pays for air-conditioning so a $12 iceberg lettuce doesn’t go limp." data-correct="false"><p class="answer__text">The price a fruit and vegetable wholesaler pays for air-conditioning so a $12 iceberg lettuce doesn’t go limp.</p></li>
       <li class="question__answer answer" data-text="A place where electrons go to buy confectionery in bulk." data-correct="false"><p class="answer__text">A place where electrons go to buy confectionery in bulk.</p></li>
       <li class="question__answer answer" data-text="A virtual marketplace in which retailers buy electricity from companies that generate it." data-correct="true"><p class="answer__text">A virtual marketplace in which retailers buy electricity from companies that generate it.</p></li>
       <li class="question__answer answer" data-text="Any participant in the electricity market – from a coal plant to a battery owner – that can theoretically deliver at least 1,000 megawatts of electricity." data-correct="false"><p class="answer__text">Any participant in the electricity market – from a coal plant to a battery owner – that can theoretically deliver at least 1,000 megawatts of electricity.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What is a Lack of Reserve Notice?"><p class="question__text">What is a Lack of Reserve Notice?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="A notice from your boss at the cafe to buy more avocados (but not too many) because not enough are available to meet smashed-avo demand." data-correct="false"><p class="answer__text">A notice from your boss at the cafe to buy more avocados (but not too many) because not enough are available to meet smashed-avo demand.</p></li>
       <li class="question__answer answer" data-text="A notice issued by the electricity market operator to all market participants, such as coal plant owners and large battery operators, about a potential or actual shortfall in energy supply." data-correct="true"><p class="answer__text">A notice issued by the electricity market operator to all market participants, such as coal plant owners and large battery operators, about a potential or actual shortfall in energy supply.</p></li>
       <li class="question__answer answer" data-text="An email from the coach to say the squad is threadbare this week and does anyone have a mate that can play wing defence?" data-correct="false"><p class="answer__text">An email from the coach to say the squad is threadbare this week and does anyone have a mate that can play wing defence?</p></li>
       <li class="question__answer answer" data-text="A notice issued by electricity generators to the market that they can no longer provide as much power as usual." data-correct="false"><p class="answer__text">A notice issued by electricity generators to the market that they can no longer provide as much power as usual.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What does FCAS stand for?"><p class="question__text">What does FCAS stand for?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="Flexible Contingency Auxiliary System. Also known as a parasitic load, it refers to energy used by generators themselves." data-correct="false"><p class="answer__text">Flexible Contingency Auxiliary System. Also known as a parasitic load, it refers to energy used by generators themselves.</p></li>
       <li class="question__answer answer" data-text="Forecast Close Asset Strategy. A plan agreed between a regulator and a power provider for staged close-down of a power plant." data-correct="false"><p class="answer__text">Forecast Close Asset Strategy. A plan agreed between a regulator and a power provider for staged close-down of a power plant.</p></li>
       <li class="question__answer answer" data-text="Frequency Control Ancillary Services. A market in which generators can provide services that keep the electricity system balanced." data-correct="true"><p class="answer__text">Frequency Control Ancillary Services. A market in which generators can provide services that keep the electricity system balanced.</p></li>
       <li class="question__answer answer" data-text="Fully Cooked And Solared. Electricity market slang for when cheaper renewables push fossil fuels out of the market." data-correct="false"><p class="answer__text">Fully Cooked And Solared. Electricity market slang for when cheaper renewables push fossil fuels out of the market.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="What is the integrated system plan?"><p class="question__text">What is the integrated system plan?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="A detailed plan produced every two years by the Australian Energy Market Operator on the optimal future design of the National Electricity Market." data-correct="true"><p class="answer__text">A detailed plan produced every two years by the Australian Energy Market Operator on the optimal future design of the National Electricity Market.</p></li>
       <li class="question__answer answer" data-text="A plan produced every two years by the energy department to link the National Electricity Market with the Northern Territory’s electricity networks and Western Australia’s South West Interconnected System." data-correct="false"><p class="answer__text">A plan produced every two years by the energy department to link the National Electricity Market with the Northern Territory’s electricity networks and Western Australia’s South West Interconnected System.</p></li>
       <li class="question__answer answer" data-text="A plan that owners of coal-power plants must submit each year detailing how quickly they are working to close down." data-correct="false"><p class="answer__text">A plan that owners of coal-power plants must submit each year detailing how quickly they are working to close down.</p></li>
       <li class="question__answer answer" data-text="A plan to integrate all the systems in an integrated and systematic way that both plans and integrates all the things." data-correct="false"><p class="answer__text">A plan to integrate all the systems in an integrated and systematic way that both plans and integrates all the things.</p></li>
      </ol></li>
     <li class="quiz__question question" data-text="If you hear energy wonks and ministers talking about a “capacity mechanism”, what might they be referring to?"><p class="question__text">If you hear energy wonks and ministers talking about a “capacity mechanism”, what might they be referring to?</p>
      <ol class="question__answers" type="A" data-answers-length="4">
       <li class="question__answer answer" data-text="A proposal to make sure that the electricity system always has access to enough power generation." data-correct="true"><p class="answer__text">A proposal to make sure that the electricity system always has access to enough power generation.</p></li>
       <li class="question__answer answer" data-text="The network of poles and wires that delivers electricity to consumers." data-correct="false"><p class="answer__text">The network of poles and wires that delivers electricity to consumers.</p></li>
       <li class="question__answer answer" data-text="Energy worker jargon for a meal break. “I’m off for my capacity mechanism, boss.”" data-correct="false"><p class="answer__text">Energy worker jargon for a meal break. “I’m off for my capacity mechanism, boss.”</p></li>
       <li class="question__answer answer" data-text="Google it, mate." data-correct="false"><p class="answer__text">Google it, mate.</p></li>
      </ol></li>
    </ol>
    <h2 class="quiz__correct-answers-title">Solutions</h2>
    <p class="quiz__correct-answers">1:C - When electricity market bosses have tried everything else to balance out demand with supply, they can resort to load shedding – deliberately turning off power to some places to reduce demand and stop a large-scale collapse. It’s different to a blackout, which refers to an unplanned outage. , 2:D - Technically, coal, gas, large-scale solar and wind are all dispatchable forms of electricity. Rooftop solar on its own isn’t as it can’t be controlled in the same way. “Dispatchable” is sometimes incorrectly used interchangeably with “firm”, which in market jargon relates to how dependable and predictable a source of power is, and “flexibility”, which refers to how quickly it can be turned up, down, on or off., 3:B - Households and small businesses normally buy electricity on either a promotional deal (a “market offer”) or a default “standing offer”. In South Australia, New South Wales and south-east Queensland, the Australian Energy Regulator sets a maximum price a retailer can charge – called the “default market offer” – and this acts like a soft cap on prices. In Victoria, the state’s essential services commission sets this, and calls it the Victorian Default Offer., 4:C - Retailers buy electricity either at a “spot price”, which changes every five minutes, or a contracted price agreed between retailers and generators over a set period. The wholesale price makes up about a third of a consumer’s electricity bill., 5:B - These notices come from the Australian Energy Market Operator. The most serious is an “actual” level 3 notice, which means power supply is probably being turned off somewhere to keep the system balanced., 6:C - Electricity market operators call on these services if there are sudden changes, such as a fault in a power plant or a big consumer stops needing power., 7:A - The plan looks at how the market could be developed over the coming decades that would be low cost, reliable and in line with climate targets. 
The National Electricity Market covers NSW, the ACT, Queensland, South Australia, Victoria and Tasmania. The next plan is due at the end of June., 8:A - This is an idea being explored by energy ministers and market regulators to pay electricity providers to have guaranteed power available at certain times when demand is high. </p>
    <h3 class="quiz__scores-title">Scores</h3>
    <ol class="quiz__scores" data-result-groups-length="9">
     <li class="quiz__score score" data-title="Rating: Terrawatt. You know transitioning Australia’s electricity grid away from fossil fuels is a crucial part of getting to net zero, and so you want to know the detail. Either that or all your friends work at Aoemo." data-share="I got _/_ in <quiz title>" data-min-score="8"><p class="score__min-score">8 and above.</p><p class="score__title">Rating: Terrawatt. You know transitioning Australia’s electricity grid away from fossil fuels is a crucial part of getting to net zero, and so you want to know the detail. Either that or all your friends work at Aoemo.</p></li>
     <li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="7"><p class="score__min-score">7 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
     <li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="6"><p class="score__min-score">6 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
     <li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="5"><p class="score__min-score">5 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
     <li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="4"><p class="score__min-score">4 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
     <li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="3"><p class="score__min-score">3 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
     <li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="2"><p class="score__min-score">2 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
     <li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="1"><p class="score__min-score">1 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
     <li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="0"><p class="score__min-score">0 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
    </ol>
   </div> 
  </div>
 </gu-atom> 
</figure>
<figure class="element element-interactive interactive element--showcase" data-interactive="https://interactive.guim.co.uk/embed/iframe-wrapper/0.1/boot.js" data-canonical-url="https://interactive.guim.co.uk/2016/03/comics-master-2016/embed/embed.html?srcs-mobile=https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_0_1874_5208/720.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_5252_1874_4207/445.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_9481_1874_3012/622.jpg&amp;ratios-mobile=277.77777777777777+224.21524663677127+160.7717041800643&amp;srcs-desktop=https://media.guim.co.uk/b671fe529dcbd3f70c31532b3bdf7673a880b557/0_0_3508_6130/3508.jpg&amp;ratios-desktop=174.82517482517483&amp;credit=Cartoon%20by%20First%20Dog%20on%20the%20Moon&amp;background=336699&amp;vpadding=5" data-alt="First Dog on ... poisonous Australian animals!"> <a href="https://interactive.guim.co.uk/2016/03/comics-master-2016/embed/embed.html?srcs-mobile=https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_0_1874_5208/720.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_5252_1874_4207/445.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_9481_1874_3012/622.jpg&ratios-mobile=277.77777777777777+224.21524663677127+160.7717041800643&srcs-desktop=https://media.guim.co.uk/b671fe529dcbd3f70c31532b3bdf7673a880b557/0_0_3508_6130/3508.jpg&ratios-desktop=174.82517482517483&credit=Cartoon%20by%20First%20Dog%20on%20the%20Moon&background=336699&vpadding=5">First Dog on ... poisonous Australian animals!</a> </figure>

Items with a word count of 0 contain only interactive content, such as quizzes and multimedia.
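A minimal sketch of how such items could be detected from the body itself rather than from the word count. It assumes that interactive content is always wrapped in `<figure>` elements, as in the two bodies printed above; the class `TextOutsideFigures` and the function `is_interactive_only` are hypothetical names, not part of this repo.

```python
from html.parser import HTMLParser


class TextOutsideFigures(HTMLParser):
    """Collect the words that are not nested inside a <figure> element."""

    def __init__(self):
        super().__init__()
        self.figure_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'figure':
            self.figure_depth += 1

    def handle_endtag(self, tag):
        if tag == 'figure' and self.figure_depth > 0:
            self.figure_depth -= 1

    def handle_data(self, data):
        # Only keep text that sits outside every <figure>
        if self.figure_depth == 0:
            self.words.extend(data.split())


def is_interactive_only(body_html):
    """Return True if the body has no text outside <figure> elements."""
    parser = TextOutsideFigures()
    parser.feed(body_html)
    parser.close()
    return len(parser.words) == 0


quiz_body = '<figure class="element element-atom"><div>Quiz</div></figure>'
article_body = '<p>Some text</p><figure>An embedded chart</figure>'
print(is_interactive_only(quiz_body), is_interactive_only(article_body))
```

Applied to the dataset, this could be used as `gu_news_df['body'].apply(is_interactive_only)` to cross-check the rows dropped by the word count filter.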

# remove rows where wordcount is 0
gu_news_df = gu_news_df[gu_news_df['wordcount'] != 0]
with pd.option_context('display.max_colwidth', None):
    print(gu_news_df[gu_news_df['wordcount'] == 1]['body'].to_string()[:300])
12230    <figure class="element element-atom"> \n <gu-atom data-atom-id="59f05449-081a-4cbe-a869-2341aa1c4369" data-atom-type="quiz"> \n  <div>\n   <div class="quiz" data-questions-length="25" data-title="The bumper climate quiz">\n    <ol class="quiz__questions">\n     <li class="quiz__question que
# remove rows where wordcount is 1
gu_news_df = gu_news_df[gu_news_df['wordcount'] != 1]
with pd.option_context('display.max_rows', None):
    display(gu_news_df.groupby('sectionName')
        .agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'})
        .sort_values([('sectionName', 'count')], ascending=False))
                    wordcount                      sectionName
                    min   max        mean  median        count
sectionName
World news           44  6734  794.656148   706.0         8873
Australia news       80  3674  840.125425   781.0         5589
Opinion              13  3652  881.013291   900.0         5041
UK news              41  7842  650.519959   587.0         3933
US news              99  5215  863.203818   785.5         3876
Business            109  6361  663.659647   614.0         3799
Politics             32  5816  722.253525   657.5         3262
Environment          40  6990  762.507692   682.0         3120
Technology           79  4100  753.202369   634.0         1013
Money               107  4231  714.265979   571.0          970
Global development  117  6062  884.699408   811.0          845
News                 61  7824  503.050360   227.0          695
Education            67  5699  648.908148   569.0          675
Science              38  6769  702.636228   584.5          668
Law                  94  2719  730.716846   650.0          279
From the Observer   269  1605  682.702703   387.0          111
Global               73  2710  858.435897   659.0           39
Inequality          152  2000  671.269231   614.0           26
Animals farmed      882  1112  995.272727   989.0           11
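The per-section statistics can also be written with named aggregation, which gives flat column names instead of a two-level header. This is only a sketch on a small made-up frame; `toy_df` and the output column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for gu_news_df with only the two columns that matter here
toy_df = pd.DataFrame({
    'sectionName': ['News', 'News', 'Opinion', 'Opinion', 'Opinion'],
    'wordcount': [100, 300, 500, 700, 900],
})

section_stats = (
    toy_df.groupby('sectionName')['wordcount']
    .agg(min_words='min', max_words='max', mean_words='mean',
         median_words='median', n_articles='count')
    .sort_values('n_articles', ascending=False)
)
print(section_stats)
```

With flat column names, sorting by the article count no longer needs the tuple key `('sectionName', 'count')`.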

Tags in all articles

# convert tags webTitles string into list
gu_news_df['tag_webTitles'] = gu_news_df['tag_webTitles'].apply(lambda x: x.strip('[]').replace("'", '').split(', '))
# create set of tags
tags_lst = []

tags_lst.extend(gu_news_df['tag_webTitles'])
print("Number of unique tags: ", len(set(list(chain.from_iterable(tags_lst)))))
Number of unique tags:  7643
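The strip/replace/split conversion above works as long as no tag contains an apostrophe or a comma followed by a space. A more defensive sketch, assuming the column stores the string representation of a Python list of strings, is to parse it with `ast.literal_eval`. The helper name `parse_tag_list` and the sample strings are hypothetical.

```python
import ast


def parse_tag_list(raw):
    """Parse a stringified list such as "['UK news', 'Wales']" into a list."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Fall back to the simple split when the string is not a valid literal
        return [t for t in raw.strip('[]').replace("'", '').split(', ') if t]
    return [str(t) for t in parsed]


sample = "['UK news', \"Northern Territory's networks\", 'Food, drink']"
print(parse_tag_list(sample))
```

It could be applied as `gu_news_df['tag_webTitles'].apply(parse_tag_list)`, provided the column contains no missing values.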
tags_set = set(list(chain.from_iterable(tags_lst)))
list(tags_set)[:10]
['William Morris',
 'GDPR',
 'Khalil El Halabi',
 'Nikkei',
 'Muska Najibullah',
 'Wales',
 'Australian Open',
 'Paris climate agreement',
 'The upside',
 'Chris Riddell']
tag_count = pd.Series(list(chain.from_iterable(tags_lst))).value_counts()
tag_count
Article                                        42825
News                                           26382
UK news                                        22981
The Guardian                                   18397
Main section                                   16733
                                               ...  
Steven Borowiec                                    1
MIT - Massachusetts Institute of Technology        1
Royal Bank of Scotland                             1
Contempt of court                                  1
Jarvis Cocker                                      1
Length: 7643, dtype: int64
tag_count.nlargest(25)
Article           42825
News              26382
UK news           22981
The Guardian      18397
Main section      16733
World news        11909
UK Home News       8171
Opinion            7809
Politics           7691
Australia news     6788
Features           6733
Business           6648
US news            6337
Comment            6167
Environment        5496
Europe             5325
Australia News     4808
Coronavirus        4304
UK Business        3821
US News            3637
Journal            3517
UK Foreign         3493
Ukraine            3354
Russia             3291
Conservatives      3215
dtype: int64
tag_count.nsmallest(25)
Rugby sevens                    1
Mary Beard                      1
Manuel Cortes                   1
Brenna Hassett                  1
James Cooray Smith              1
Alex Blasdel                    1
El Niño southern oscillation    1
Michael Hogan                   1
Dzhokhar Tsarnaev               1
Josie Dale-Jones                1
Aaliyah                         1
Peter Bengtsen                  1
Shaun Peter Qureshi             1
Chris Curtis                    1
Boston Marathon bombing         1
Neal Katyal                     1
Neelie Kroes                    1
Roderick Beaton                 1
Mark Bennister                  1
Muska Najibullah                1
Hannah Brady                    1
Paulina Velasco                 1
Hulu                            1
Spanish food and drink          1
Eve Fairbanks                   1
dtype: int64
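The same tag frequencies can be computed without leaving pandas by exploding the list column first. A small sketch on made-up data; `toy_df` and `tag_count_alt` are hypothetical names.

```python
import pandas as pd

# Each row holds the list of tags of one article
toy_df = pd.DataFrame({
    'tag_webTitles': [['Article', 'UK news'], ['Article', 'Wales'], ['Article']],
})

# explode() turns every list element into its own row, value_counts() tallies them
tag_count_alt = toy_df['tag_webTitles'].explode().value_counts()
print(tag_count_alt)
```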
import matplotlib.pyplot as plt
from wordcloud import WordCloud
wordcloud = WordCloud().generate(' '.join(list(chain.from_iterable(tags_lst))))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

[Figure: word cloud of the tags across all articles]

Most of the frequent tags refer to sections.

sections_lst = list(gu_df['sectionName'].unique())
pillars_lst = list(gu_df['pillarName'].unique())
pillars_lst
['Sport', 'Arts', 'News', 'Opinion', 'Lifestyle', nan]
tag_count[~tag_count.index.isin(sections_lst)]
Article                                         74956
The Guardian                                    33307
Main section                                    20491
Features                                        19647
UK Home News                                    10575
                                                ...  
Commonwealth Games 2002                             1
Bill Paxton                                         1
Sara Paretsky                                       1
COP 21: Paris climate change conference 2015        1
Midnight Special                                    1
Length: 13894, dtype: int64
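The `isin` filter above is case-sensitive, so tags such as 'Australia News' or 'US News' from the top-25 list do not match the section names 'Australia news' and 'US news'. A sketch of a case-insensitive variant on toy data follows; all names prefixed with `toy_` are made up for illustration.

```python
import pandas as pd

toy_tag_count = pd.Series({'Article': 5, 'US news': 4, 'US News': 3, 'Wales': 1})
toy_sections = ['US news', 'Opinion', float('nan')]

# Lower-case the section names, skipping missing values such as nan
sections_lower = {s.lower() for s in toy_sections if isinstance(s, str)}

# Compare the lower-cased tag names against the lower-cased section names
is_section = toy_tag_count.index.str.lower().isin(sections_lower)
non_section_tags = toy_tag_count[~is_section]
print(non_section_tags)
```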

Tags in sections

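# assumption: the unfinished lambda was meant to list the ten most frequent tags within each section
gu_top25_df.groupby('sectionName')['tag_webTitles'].apply(lambda x: pd.Series(list(chain.from_iterable(x))).value_counts().head(10))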