This repo contains the code and results of my exploration of text data from different sources. The aim is to create a dataset of related long and short texts from the same source or platform. Long texts are full-texts such as blog posts or news articles; short texts, such as abstracts or taglines, should be similar in length to social media posts.
First idea: The New York Times API
The NYT offers a set of APIs, including one for its archive. The Archive API returns partial article data (e.g. headline, abstract) and metadata such as the section an article belongs to (e.g. Arts, News) and its length (word count).
Unfortunately, unlimited access to the full-texts in the article archive is only offered to subscribers.
Although I won't be using their data for my research project, the available data is useful for finding out some characteristics of news articles, e.g. their average length or how long their abstracts and titles are. This information could help in finding similar open-access data.
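To illustrate which parts of an Archive API record are relevant here, the following is a minimal sketch, not the code used in this repo. It assumes the monthly endpoint `https://api.nytimes.com/svc/archive/v1/{year}/{month}.json` and the response fields `headline.main`, `abstract`, `section_name` and `word_count`; check these against the NYT developer documentation before relying on them. The function names and the sample document are made up for illustration, and no request is sent.

```python
from typing import Any

# Assumed URL pattern of the Archive API (the API key is passed as a query parameter)
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def archive_url(year: int, month: int) -> str:
    """Build the Archive API URL for one month."""
    return ARCHIVE_URL.format(year=year, month=month)


def flatten_doc(doc: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields needed for length statistics (hypothetical helper)."""
    headline = (doc.get("headline") or {}).get("main") or ""
    abstract = doc.get("abstract") or ""
    return {
        "headline": headline,
        "abstract": abstract,
        "section_name": doc.get("section_name"),
        "word_count": doc.get("word_count", 0),
        "word_count_headline": len(headline.split()),
        "word_count_abstract": len(abstract.split()),
    }


# Made-up document for illustration only, not real NYT data
sample_doc = {
    "headline": {"main": "An Example Headline With Six Words"},
    "abstract": "A short made-up abstract.",
    "section_name": "Arts",
    "word_count": 850,
}
row = flatten_doc(sample_doc)
print(archive_url(2022, 9))
print(row)
```

Flattening each record into one row like this makes it straightforward to collect a month of records into a table and compute length statistics per column.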
.
├── README.md
├── config.ini
├── data
│   ├── raw
│   └── interim
├── notebooks
│   ├── 01_nyt_apis.ipynb
│   ├── 02_guardian_api.ipynb
│   ├── 03_twitter_api.ipynb
│   └── 04_explore_datasets.ipynb
├── requirements.in
├── requirements.txt
└── src
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   ├── clean_text.py
    │   └── make_dataset.py
    └── main.py
NYT Dataset:
The dataset contains partial article data from the NYT Archive API from 10/2021 to 09/2022, approx. 34,700 items.
GU Dataset:
The dataset contains article data from the GU Search API from 27/09/2021 to 27/09/2022, approx. 75,000 items.
GU Tweets Dataset:
The dataset contains tweets from The Guardian main account (@guardian), with 3,200 items (maximum allowed number of tweets from a user as per Twitter API limitations).
import os
import glob
import numpy as np
import pandas as pd
import re
from itertools import chain
import matplotlib.pyplot as plt
import seaborn as sns
# Notebook settings
import warnings
warnings.filterwarnings('ignore')
# Set data directory paths
nyt_path = 'data/interim/nyt_data'
gu_path = 'data/interim/gu_data'
gu_twitter_path = 'data/interim/gu_twitter_data/gu_tweets.csv'
# Load NYT data files and combine them
nyt_files = glob.glob(os.path.join(nyt_path, '*.csv'))
nyt_lst = []
for file in nyt_files:
    nyt_single_df = pd.read_csv(file)
    nyt_lst.append(nyt_single_df)
nyt_df = pd.concat(nyt_lst)
# Load GU data files and combine them
gu_files = glob.glob(os.path.join(gu_path, '*.csv'))
gu_lst = []
for file in gu_files:
    gu_single_df = pd.read_csv(file)
    gu_lst.append(gu_single_df)
gu_df = pd.concat(gu_lst)
# Load GU Twitter data
twitter_df = pd.read_csv(gu_twitter_path)
nyt_df.describe()
| | word_count | word_count_headline | word_count_abstract |
---|---|---|---|
count | 34752.000000 | 34752.000000 | 34752.000000 |
mean | 991.256417 | 9.483943 | 22.524660 |
std | 698.379953 | 2.946029 | 7.412678 |
min | 0.000000 | 1.000000 | 1.000000 |
25% | 525.000000 | 8.000000 | 18.000000 |
50% | 934.000000 | 10.000000 | 23.000000 |
75% | 1315.000000 | 11.000000 | 27.000000 |
max | 20573.000000 | 22.000000 | 103.000000 |
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(12,5))
fig.suptitle('Word counts for NYT articles')
fig.supylabel('Frequency')
ax1.hist(nyt_df['word_count_headline'], bins=20)
ax1.set_title('Headline')
ax2.hist(nyt_df['word_count_abstract'], bins=100)
ax2.set_title('Abstract')
ax3.hist(nyt_df['word_count'], bins=100)
ax3.set_title('Full-text')
for ax in (ax1, ax2, ax3):
    ax.set(xlabel='Word count')
plt.show()
gu_df.describe()
| | wordcount | charCount | word_count_headline | word_count_trailText |
---|---|---|---|---|
count | 74956.000000 | 74956.000000 | 74956.000000 | 74956.000000 |
mean | 780.012301 | 4644.333662 | 11.643658 | 19.906385 |
std | 464.920930 | 2726.342686 | 2.943087 | 5.498631 |
min | 0.000000 | 0.000000 | 2.000000 | 1.000000 |
25% | 498.000000 | 2980.000000 | 10.000000 | 16.000000 |
50% | 711.000000 | 4239.000000 | 12.000000 | 19.000000 |
75% | 947.000000 | 5654.000000 | 13.000000 | 23.000000 |
max | 9633.000000 | 54826.000000 | 28.000000 | 77.000000 |
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(12,5))
fig.suptitle('Word counts for GUARDIAN articles')
fig.supylabel('Frequency')
ax1.hist(gu_df['word_count_headline'], bins=30)
ax1.set_title('Headline')
ax2.hist(gu_df['word_count_trailText'], bins=100)
ax2.set_title('Abstract')
ax3.hist(gu_df['wordcount'], bins=100)
ax3.set_title('Full-text')
for ax in (ax1, ax2, ax3):
    ax.set(xlabel='Word count')
plt.show()
twitter_df.head()
| | created_at | id | text | clean_text | word_count |
---|---|---|---|---|---|
0 | Wed Sep 28 21:17:47 +0000 2022 | 1575233331943309339 | Morning mail: hurricane with 240km/h winds hit... | Morning mail hurricane with kmh winds hits Flo... | 14 |
1 | Wed Sep 28 21:17:46 +0000 2022 | 1575233327329574924 | R Kelly ordered to pay restitution of $300,000... | R Kelly ordered to pay restitution of to his v... | 10 |
2 | Wed Sep 28 21:17:44 +0000 2022 | 1575233321063292928 | Two aircraft involved in ‘minor collision’ on ... | Two aircraft involved in ‘minor collision’ on ... | 10 |
3 | Wed Sep 28 21:10:00 +0000 2022 | 1575231372490309658 | We’re keen to hear from people who have recent... | We’re keen to hear from people who have recent... | 25 |
4 | Wed Sep 28 21:03:05 +0000 2022 | 1575229631912869894 | Guardian front page, Thursday 29 September 202... | Guardian front page Thursday September Banks £... | 12 |
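The `clean_text` and `word_count` columns above suggest that ASCII punctuation and digits are removed before counting words, while typographic quotes and currency symbols such as `£` are kept. The following is a minimal sketch under that assumption; it is not the code from `src/data/clean_text.py`, and the function names are hypothetical.

```python
import re
import string

# Assumption: only ASCII punctuation and digits are removed;
# typographic quotes (‘ ’) and symbols like £ stay in the text.
_REMOVE = re.compile(f"[{re.escape(string.punctuation)}0-9]")


def clean_text(text: str) -> str:
    """Remove ASCII punctuation and digits, then collapse repeated whitespace."""
    stripped = _REMOVE.sub("", text)
    return " ".join(stripped.split())


def count_words(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


example = "Morning mail: hurricane with 240km/h winds"
cleaned = clean_text(example)
print(cleaned)
print(count_words(cleaned))
```

Collapsing whitespace after the removal matters: a token such as `$300,000` disappears completely and would otherwise leave a double space behind.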
twitter_df.describe()
| | id | word_count |
---|---|---|
count | 3.200000e+03 | 3200.000000 |
mean | 1.572793e+18 | 12.177188 |
std | 1.462916e+15 | 3.107159 |
min | 1.570134e+18 | 1.000000 |
25% | 1.571554e+18 | 10.000000 |
50% | 1.572913e+18 | 12.000000 |
75% | 1.574037e+18 | 14.000000 |
max | 1.575233e+18 | 25.000000 |
plt.hist(twitter_df['word_count'], bins=25)
plt.xlabel('Word count')
plt.ylabel('Frequency')
plt.title('Histogram of Word Counts in GUARDIAN Tweets')
plt.show()
gu_df.head()
| | id | sectionId | sectionName | webPublicationDate | webUrl | apiUrl | pillarId | pillarName | byline | body | ... | bylineHtml | fields.contributorBio | scheduledPublicationDate | tag_ids | tag_webTitles | tag_webUrls | cl_headline | cl_trailText | word_count_headline | word_count_trailText |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | sport/blog/2022/jun/27/imperious-nsw-seize-adv... | sport | Sport | 2022-06-26T23:46:44Z | https://www.theguardian.com/sport/blog/2022/ju... | https://content.guardianapis.com/sport/blog/20... | pillar/sport | Sport | Nick Tedeschi | <p>A star was born in debutant Matt Burton. An... | ... | <a href="profile/nick-tedeschi">Nick Tedeschi</a> | NaN | NaN | ['sport/state-of-origin', 'sport/rugbyleague',... | ['State of Origin', 'Rugby league', 'NRL', 'Au... | ['https://www.theguardian.com/sport/state-of-o... | Imperious NSW seize advantage after Queensland... | Poor tackling basic handling errors and a lack... | 10 | 23 |
1 | music/2022/jun/27/kendrick-lamar-at-glastonbur... | music | Music | 2022-06-26T23:32:10Z | https://www.theguardian.com/music/2022/jun/27/... | https://content.guardianapis.com/music/2022/ju... | pillar/arts | Arts | Alexis Petridis | <p>As Glastonbury 2022 draws to a close, a var... | ... | <a href="profile/alexispetridis">Alexis Petrid... | NaN | NaN | ['music/glastonbury-2022', 'music/kendrick-lam... | ['Glastonbury 2022', 'Kendrick Lamar', 'Music'... | ['https://www.theguardian.com/music/glastonbur... | Kendrick Lamar at Glastonbury review – faith f... | Sporting a bejewelled crown of thorns and with... | 11 | 25 |
2 | world/2022/jun/27/garbage-island-no-more-how-o... | world | World news | 2022-06-26T23:06:58Z | https://www.theguardian.com/world/2022/jun/27/... | https://content.guardianapis.com/world/2022/ju... | pillar/news | News | Justin McCurry on Teshima island | <p>Toru Ishii remembers when the shredded car ... | ... | <a href="profile/justinmccurry">Justin McCurry... | NaN | NaN | ['world/japan', 'world/asia-pacific', 'world/w... | ['Japan', 'Asia Pacific', 'World news', 'Envir... | ['https://www.theguardian.com/world/japan', 'h... | Garbage island no more how one Japanese commun... | Teshima – site of Japan’s worst case of illega... | 14 | 27 |
3 | media/2022/jun/27/young-people-must-report-har... | media | Media | 2022-06-26T23:01:00Z | https://www.theguardian.com/media/2022/jun/27/... | https://content.guardianapis.com/media/2022/ju... | pillar/news | News | Dan Milmo | <p>Young people should report harmful online c... | ... | <a href="profile/danmilmo">Dan Milmo</a> | NaN | NaN | ['media/ofcom', 'media/social-media', 'society... | ['Ofcom', 'Social media', 'Online abuse', 'You... | ['https://www.theguardian.com/media/ofcom', 'h... | Young people must report harmful online conten... | Ofcom says of to yearolds have seen harmful co... | 10 | 16 |
4 | stage/2022/jun/27/mad-house-review-david-harbo... | stage | Stage | 2022-06-26T23:01:00Z | https://www.theguardian.com/stage/2022/jun/27/... | https://content.guardianapis.com/stage/2022/ju... | pillar/arts | Arts | Arifa Akbar | <p>Theresa Rebeck’s play opens as a dysfunctio... | ... | <a href="profile/arifa-akbar">Arifa Akbar</a> | NaN | NaN | ['stage/theatre', 'stage/stage', 'culture/cult... | ['Theatre', 'Stage', 'Culture', 'Article', 'Re... | ['https://www.theguardian.com/stage/theatre', ... | Mad House review – David Harbour and Bill Pull... | Theresa Rebeck’s play follows the relationship... | 14 | 16 |
5 rows × 25 columns
gu_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74956 entries, 0 to 6145
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 74956 non-null object
1 sectionId 74956 non-null object
2 sectionName 74956 non-null object
3 webPublicationDate 74956 non-null object
4 webUrl 74956 non-null object
5 apiUrl 74956 non-null object
6 pillarId 74453 non-null object
7 pillarName 74453 non-null object
8 byline 73447 non-null object
9 body 74956 non-null object
10 wordcount 74956 non-null int64
11 publication 74956 non-null object
12 lang 74956 non-null object
13 bodyText 74757 non-null object
14 charCount 74956 non-null int64
15 bylineHtml 73447 non-null object
16 fields.contributorBio 15 non-null object
17 scheduledPublicationDate 3 non-null object
18 tag_ids 74956 non-null object
19 tag_webTitles 74956 non-null object
20 tag_webUrls 74956 non-null object
21 cl_headline 74956 non-null object
22 cl_trailText 74904 non-null object
23 word_count_headline 74956 non-null int64
24 word_count_trailText 74956 non-null int64
dtypes: int64(4), object(21)
memory usage: 14.9+ MB
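NYT articles tend to be longer than GU articles (mean word count of about 991 vs. 780; median 934 vs. 711). Headlines show the opposite pattern: GU headlines are slightly longer (mean of about 11.6 vs. 9.5 words). The abstracts are similar in length in both (about 22.5 words for NYT abstracts vs. 19.9 for GU trail texts).
Compared to the length of tweets (mean of about 12.2 words), the article part most similar in length is the headline.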
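The comparison with tweets can be made explicit by ranking the article parts by their distance to the mean tweet length. This is a small sketch that uses the rounded means from the `describe()` outputs above; the variable names are my own, and comparing means only is a simplification, since it ignores the spread of each distribution.

```python
# Mean word counts, rounded from the describe() outputs above
mean_words = {
    ("NYT", "headline"): 9.48,
    ("NYT", "abstract"): 22.52,
    ("NYT", "full-text"): 991.26,
    ("GU", "headline"): 11.64,
    ("GU", "abstract"): 19.91,
    ("GU", "full-text"): 780.01,
}
tweet_mean = 12.18  # mean word count of the GU tweets

# Rank article parts by absolute distance to the mean tweet length
ranked = sorted(mean_words.items(), key=lambda item: abs(item[1] - tweet_mean))
for (source, part), mean in ranked:
    print(f"{source:<3} {part:<9} mean={mean:>7.2f}  diff={abs(mean - tweet_mean):.2f}")

closest = ranked[0][0]
```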
# Check for duplicates
gu_df[gu_df.duplicated(['id','webUrl'], keep=False)]
| | id | sectionId | sectionName | webPublicationDate | webUrl | apiUrl | pillarId | pillarName | byline | body | ... | bylineHtml | fields.contributorBio | scheduledPublicationDate | tag_ids | tag_webTitles | tag_webUrls | cl_headline | cl_trailText | word_count_headline | word_count_trailText |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4912 | lifeandstyle/2022/sep/05/work-therapy-can-a-so... | lifeandstyle | Life and style | 2022-09-04T17:30:03Z | https://www.theguardian.com/lifeandstyle/2022/... | https://content.guardianapis.com/lifeandstyle/... | pillar/lifestyle | Lifestyle | Jenny Valentish | <p>Whether you’re a quiet quitter or a 24/7 si... | ... | <a href="profile/jenny-valentish">Jenny Valent... | NaN | NaN | ['lifeandstyle/australian-lifestyle', 'music/a... | [Australian lifestyle, Australian music, Artic... | ['https://www.theguardian.com/lifeandstyle/aus... | Work therapy can a social media coach talk a s... | Musician Dave McCormack hates social media but... | 15 | 27 |
4913 | lifeandstyle/2022/sep/05/work-therapy-can-a-so... | lifeandstyle | Life and style | 2022-09-04T17:30:03Z | https://www.theguardian.com/lifeandstyle/2022/... | https://content.guardianapis.com/lifeandstyle/... | pillar/lifestyle | Lifestyle | Jenny Valentish | <p>Whether you’re a quiet quitter or a 24/7 si... | ... | <a href="profile/jenny-valentish">Jenny Valent... | NaN | NaN | ['lifeandstyle/australian-lifestyle', 'music/a... | [Australian lifestyle, Australian music, Artic... | ['https://www.theguardian.com/lifeandstyle/aus... | Work therapy can a social media coach talk a s... | Musician Dave McCormack hates social media but... | 15 | 27 |
1247 | film/2022/aug/21/anais-in-love-review-anais-de... | film | Film | 2022-08-21T07:00:02Z | https://www.theguardian.com/film/2022/aug/21/a... | https://content.guardianapis.com/film/2022/aug... | pillar/arts | Arts | Wendy Ide | <p>She’s a familiar character. Skittish, self-... | ... | <a href="profile/wendy-ide">Wendy Ide</a> | NaN | NaN | ['film/drama', 'film/film', 'culture/culture',... | [Drama films, Film, Culture, World cinema, Art... | ['https://www.theguardian.com/film/drama', 'ht... | Anaïs in Love review – Anaïs Demoustier intoxi... | Cinema’s latest irresistible chaotic femme Dem... | 13 | 16 |
1248 | lifeandstyle/2022/aug/21/emma-beddington-my-ki... | lifeandstyle | Life and style | 2022-08-21T07:00:02Z | https://www.theguardian.com/lifeandstyle/2022/... | https://content.guardianapis.com/lifeandstyle/... | pillar/lifestyle | Lifestyle | Emma Beddington | <p>I’m nearly an empty nester. That conjures u... | ... | <a href="profile/emma-beddington">Emma Bedding... | NaN | NaN | ['lifeandstyle/parents-and-parenting', 'lifean... | [Parents and parenting, Family, Life and style... | ['https://www.theguardian.com/lifeandstyle/par... | My kids have moved out but please don’t call i... | I want to take issue with this nests metaphor ... | 13 | 19 |
1249 | film/2022/aug/21/anais-in-love-review-anais-de... | film | Film | 2022-08-21T07:00:02Z | https://www.theguardian.com/film/2022/aug/21/a... | https://content.guardianapis.com/film/2022/aug... | pillar/arts | Arts | Wendy Ide | <p>She’s a familiar character. Skittish, self-... | ... | <a href="profile/wendy-ide">Wendy Ide</a> | NaN | NaN | ['film/drama', 'film/film', 'culture/culture',... | [Drama films, Film, Culture, World cinema, Art... | ['https://www.theguardian.com/film/drama', 'ht... | Anaïs in Love review – Anaïs Demoustier intoxi... | Cinema’s latest irresistible chaotic femme Dem... | 13 | 16 |
1250 | lifeandstyle/2022/aug/21/emma-beddington-my-ki... | lifeandstyle | Life and style | 2022-08-21T07:00:02Z | https://www.theguardian.com/lifeandstyle/2022/... | https://content.guardianapis.com/lifeandstyle/... | pillar/lifestyle | Lifestyle | Emma Beddington | <p>I’m nearly an empty nester. That conjures u... | ... | <a href="profile/emma-beddington">Emma Bedding... | NaN | NaN | ['lifeandstyle/parents-and-parenting', 'lifean... | [Parents and parenting, Family, Life and style... | ['https://www.theguardian.com/lifeandstyle/par... | My kids have moved out but please don’t call i... | I want to take issue with this nests metaphor ... | 13 | 19 |
1390 | world/2022/aug/20/estonia-europe-inflation-hot... | world | World news | 2022-08-20T07:00:33Z | https://www.theguardian.com/world/2022/aug/20/... | https://content.guardianapis.com/world/2022/au... | pillar/news | News | Daniel Boffey in Tallinn | <p>Like his cappuccinos, Taniel Vaaderpass, 33... | ... | <a href="profile/daniel-boffey">Daniel Boffey<... | NaN | NaN | ['world/estonia', 'business/inflation', 'busin... | [Estonia, Inflation, Eurozone, Business, Econo... | ['https://www.theguardian.com/world/estonia', ... | ‘I am not blaming anyone’ Estonians shrug off ... | Those in Europe’s inflation hotspot remain cal... | 9 | 23 |
1392 | world/2022/aug/20/estonia-europe-inflation-hot... | world | World news | 2022-08-20T07:00:33Z | https://www.theguardian.com/world/2022/aug/20/... | https://content.guardianapis.com/world/2022/au... | pillar/news | News | Daniel Boffey in Tallinn | <p>Like his cappuccinos, Taniel Vaaderpass, 33... | ... | <a href="profile/daniel-boffey">Daniel Boffey<... | NaN | NaN | ['world/estonia', 'business/inflation', 'busin... | [Estonia, Inflation, Eurozone, Business, Econo... | ['https://www.theguardian.com/world/estonia', ... | ‘I am not blaming anyone’ Estonians shrug off ... | Those in Europe’s inflation hotspot remain cal... | 9 | 23 |
2580 | politics/2022/may/15/labour-keir-starmer-tory-... | politics | Politics | 2022-05-15T16:00:34Z | https://www.theguardian.com/politics/2022/may/... | https://content.guardianapis.com/politics/2022... | pillar/news | News | Jessica Elgot Chief political correspondent | <p>Labour activists in Durham have called on a... | ... | <a href="profile/jessica-elgot">Jessica Elgot<... | NaN | NaN | ['politics/labour', 'politics/conservatives', ... | [Labour, Conservatives, Keir Starmer, Politics... | ['https://www.theguardian.com/politics/labour'... | Labour activists call on Tory MP to withdraw B... | Local chair accuses Richard Holden of ‘wasting... | 10 | 18 |
2582 | politics/2022/may/15/labour-keir-starmer-tory-... | politics | Politics | 2022-05-15T16:00:34Z | https://www.theguardian.com/politics/2022/may/... | https://content.guardianapis.com/politics/2022... | pillar/news | News | Jessica Elgot Chief political correspondent | <p>Labour activists in Durham have called on a... | ... | <a href="profile/jessica-elgot">Jessica Elgot<... | NaN | NaN | ['politics/labour', 'politics/conservatives', ... | [Labour, Conservatives, Keir Starmer, Politics... | ['https://www.theguardian.com/politics/labour'... | Labour activists call on Tory MP to withdraw B... | Local chair accuses Richard Holden of ‘wasting... | 10 | 18 |
10 rows × 25 columns
# Drop duplicates
gu_df.drop_duplicates(['id', 'webUrl'], keep='first', inplace=True, ignore_index=True)
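The check and the drop above use different `keep` settings, which is easy to overlook. The following sketch shows the difference on a toy frame with made-up ids and URLs: `keep=False` flags every copy for inspection, while `keep='first'` retains one row per duplicate group.

```python
import pandas as pd

# Toy frame with made-up ids and URLs; one article appears twice
toy_df = pd.DataFrame({
    "id": ["a/1", "b/2", "a/1", "c/3"],
    "webUrl": [
        "https://example.org/a/1",
        "https://example.org/b/2",
        "https://example.org/a/1",
        "https://example.org/c/3",
    ],
})

# keep=False marks every member of a duplicate group, useful for inspection
all_copies = toy_df[toy_df.duplicated(["id", "webUrl"], keep=False)]

# keep='first' retains the first row of each group and drops later copies;
# ignore_index=True renumbers the remaining rows from 0
deduped = toy_df.drop_duplicates(["id", "webUrl"], keep="first", ignore_index=True)

print(len(all_copies))
print(len(deduped))
```

Resetting the index is also helpful here because `gu_df` was concatenated from several files, so its original index values repeat.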
Number of articles per section
gu_df['sectionName'].value_counts().nlargest(25).sort_values(ascending=True).plot(kind='barh')
plt.xlabel("Number of Articles", labelpad=14)
plt.ylabel("Section", labelpad=14)
plt.title("Number of Articles per Section in GUARDIAN (top 25 Sections)")
plt.show()
#gu_df.groupby('sectionName').size().sort_values(ascending=False)
with pd.option_context('display.max_rows', None):
    # Aggregation names are passed as strings, which avoids deprecation warnings in newer pandas versions
    display(gu_df.groupby('sectionName').agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'})
            .sort_values([('sectionName', 'count')], ascending=False))
sectionName | wordcount min | wordcount max | wordcount mean | wordcount median | sectionName count |
---|---|---|---|---|---|
World news | 44 | 6734 | 794.656148 | 706.0 | 8873 |
Australia news | 0 | 3674 | 839.524584 | 781.0 | 5593 |
Opinion | 0 | 3652 | 859.196750 | 892.0 | 5169 |
Football | 0 | 6284 | 747.456158 | 752.0 | 5075 |
Sport | 0 | 6247 | 786.508210 | 763.0 | 4811 |
UK news | 41 | 7842 | 650.519959 | 587.0 | 3933 |
US news | 0 | 5215 | 862.981171 | 785.0 | 3877 |
Business | 0 | 6361 | 663.485000 | 614.0 | 3800 |
Politics | 0 | 5816 | 722.032179 | 657.0 | 3263 |
Environment | 0 | 6990 | 762.019539 | 682.0 | 3122 |
Life and style | 0 | 8000 | 782.904100 | 699.0 | 2951 |
Film | 76 | 3949 | 730.911191 | 598.0 | 2511 |
Music | 50 | 5751 | 771.810295 | 590.0 | 2409 |
Television & radio | 0 | 5612 | 879.591774 | 766.0 | 2261 |
Books | 0 | 6401 | 890.570594 | 814.5 | 2224 |
Society | 51 | 6620 | 758.870447 | 635.0 | 2169 |
Stage | 38 | 4131 | 666.072258 | 513.0 | 1550 |
Art and design | 80 | 4959 | 914.185529 | 834.0 | 1078 |
Food | 1 | 9633 | 823.816365 | 643.0 | 1051 |
Culture | 0 | 5711 | 894.080000 | 753.5 | 1050 |
Technology | 79 | 4100 | 753.202369 | 634.0 | 1013 |
Money | 107 | 4231 | 714.265979 | 571.0 | 970 |
Global development | 117 | 6062 | 884.699408 | 811.0 | 845 |
Media | 60 | 6037 | 711.642857 | 582.0 | 798 |
News | 61 | 7824 | 503.050360 | 227.0 | 695 |
Education | 67 | 5699 | 648.908148 | 569.0 | 675 |
Science | 38 | 6769 | 702.636228 | 584.5 | 668 |
Travel | 0 | 5362 | 1068.965839 | 1079.5 | 644 |
Fashion | 68 | 6798 | 666.250554 | 544.0 | 451 |
Law | 94 | 2719 | 730.716846 | 650.0 | 279 |
Games | 167 | 3354 | 912.715953 | 822.0 | 257 |
Crosswords | 0 | 1164 | 318.409574 | 153.0 | 188 |
From the Observer | 269 | 1605 | 682.702703 | 387.0 | 111 |
Guardian Masterclasses | 30 | 1654 | 717.337209 | 685.5 | 86 |
Global | 0 | 2710 | 816.560976 | 657.0 | 41 |
GNM press office | 146 | 1118 | 512.696970 | 476.0 | 33 |
Inequality | 152 | 2000 | 671.269231 | 614.0 | 26 |
Info | 179 | 7066 | 1649.850000 | 1060.0 | 20 |
The invested generation | 763 | 1376 | 983.937500 | 933.5 | 16 |
Cities | 140 | 1094 | 656.307692 | 669.0 | 13 |
Animals farmed | 882 | 1112 | 995.272727 | 989.0 | 11 |
Seek: The new world of work | 640 | 1226 | 947.100000 | 928.0 | 10 |
Bonjour Provence and Côte d’Azur | 34 | 1333 | 1008.111111 | 1065.0 | 9 |
Community of solvers | 741 | 1271 | 976.888889 | 882.0 | 9 |
Membership | 464 | 2561 | 1171.500000 | 1069.0 | 8 |
Spotify: Morning moods | 576 | 1364 | 932.375000 | 906.0 | 8 |
Weather | 189 | 815 | 391.875000 | 312.0 | 8 |
SBS: A world of difference | 615 | 1141 | 954.285714 | 916.0 | 7 |
A new career with The University of Law | 23 | 1119 | 770.428571 | 822.0 | 7 |
Macquarie: Home of electric vehicles | 791 | 1407 | 1129.166667 | 1151.0 | 6 |
On my terms | 454 | 999 | 761.166667 | 790.5 | 6 |
Connected thinking | 987 | 1520 | 1208.666667 | 1150.5 | 6 |
The Guardian clearing hub | 128 | 1048 | 803.333333 | 932.0 | 6 |
From the Guardian | 419 | 2238 | 825.400000 | 488.0 | 5 |
A vision for better food | 834 | 1211 | 1056.800000 | 1083.0 | 5 |
Google: Helpful by nature | 748 | 1205 | 1002.800000 | 1077.0 | 5 |
Pioneering innovation for a purposeful future | 878 | 1101 | 996.000000 | 985.0 | 5 |
Rise with London South Bank University | 740 | 939 | 839.000000 | 838.5 | 4 |
Conservation in action | 48 | 1043 | 709.500000 | 873.5 | 4 |
Quest Apartment Hotels: As local as you like it | 811 | 877 | 847.000000 | 850.0 | 4 |
Rediscover tequila | 798 | 1168 | 941.500000 | 900.0 | 4 |
Retail careers that mean more | 769 | 833 | 790.500000 | 780.0 | 4 |
SBS On Demand: New Gold Mountain | 756 | 995 | 860.750000 | 846.0 | 4 |
SAP: Smart business | 677 | 965 | 866.250000 | 911.5 | 4 |
Made with love | 630 | 1012 | 808.500000 | 796.0 | 4 |
Scotland's stories | 1148 | 1384 | 1243.250000 | 1220.5 | 4 |
Send smarter | 14 | 1299 | 894.500000 | 1132.5 | 4 |
The whole picture | 713 | 1393 | 989.250000 | 925.5 | 4 |
Toyota Australia: Journey to electric | 829 | 1181 | 969.250000 | 933.5 | 4 |
AMC+: Only the good stuff | 182 | 859 | 501.000000 | 481.5 | 4 |
Meta: Buy Blak | 783 | 1035 | 925.500000 | 942.0 | 4 |
Colonial First State: Unleash your second half | 654 | 916 | 777.000000 | 769.0 | 4 |
Help | 162 | 473 | 385.000000 | 452.5 | 4 |
Helga's: Capturing kindness | 838 | 1027 | 925.000000 | 917.5 | 4 |
Forefront of fintech | 779 | 871 | 824.750000 | 824.5 | 4 |
For the love of numbers | 847 | 874 | 864.750000 | 869.0 | 4 |
Growing for good | 473 | 806 | 654.500000 | 669.5 | 4 |
BMW: Sustainable mobility | 843 | 1304 | 1045.000000 | 988.0 | 3 |
HBF: Never Settle | 945 | 959 | 950.666667 | 948.0 | 3 |
Lexus Australia: New luxury | 755 | 1171 | 953.333333 | 934.0 | 3 |
Snooze: Investing In Sleep | 767 | 1010 | 879.666667 | 862.0 | 3 |
Specsavers: Experts in eye care | 849 | 1188 | 989.666667 | 932.0 | 3 |
Business Victoria: Making Headway | 714 | 1375 | 975.333333 | 837.0 | 3 |
Spotify: Find the one | 1459 | 1845 | 1592.666667 | 1474.0 | 3 |
Letters to Tomorrow | 845 | 1031 | 918.000000 | 878.0 | 3 |
The Fred Hollows Foundation: 30 years of restoring sight | 788 | 1012 | 897.333333 | 892.0 | 3 |
From the inside out | 665 | 850 | 765.000000 | 780.0 | 3 |
The Life You Can Save: Effective giving | 801 | 1258 | 959.000000 | 818.0 | 3 |
The need for speed | 36 | 1014 | 650.666667 | 902.0 | 3 |
Griffith University: Make it matter | 501 | 925 | 773.333333 | 894.0 | 3 |
Volvo Car Australia: Pure Electric | 92 | 927 | 637.333333 | 893.0 | 3 |
Westpac Foundation: Investing in social enterprise | 894 | 1182 | 1039.000000 | 1041.0 | 3 |
Focus | 1061 | 4925 | 2428.000000 | 1298.0 | 3 |
City of Melbourne: FOMO | 612 | 898 | 767.000000 | 791.0 | 3 |
Curtin University: Why study law | 720 | 853 | 787.666667 | 790.0 | 3 |
MG Motor: Switch to electric | 711 | 1306 | 955.000000 | 848.0 | 3 |
A time for Japan | 629 | 919 | 764.666667 | 746.0 | 3 |
Dine: Hope Grows | 560 | 873 | 725.333333 | 743.0 | 3 |
MINI: Serious fun | 178 | 1062 | 717.000000 | 911.0 | 3 |
Dairy Australia: Healthy sustainable diets | 51 | 992 | 629.000000 | 844.0 | 3 |
Monash University: The Endangered Generation | 732 | 875 | 823.333333 | 863.0 | 3 |
Mirvac: Voyager | 760 | 885 | 832.333333 | 852.0 | 3 |
Moccona: Make a little difference | 778 | 803 | 790.500000 | 790.5 | 2 |
Cancer Council Victoria: Giving in Will | 1083 | 1124 | 1103.500000 | 1103.5 | 2 |
City of Melbourne: Shop the City | 817 | 859 | 838.000000 | 838.0 | 2 |
The University of Notre Dame: Ethical education | 940 | 957 | 948.500000 | 948.5 | 2 |
Specsavers: An Eye for Art | 911 | 1181 | 1046.000000 | 1046.0 | 2 |
Boutique Homes: My home my way | 835 | 843 | 839.000000 | 839.0 | 2 |
Travel Associates: Get out there in the Red Centre | 851 | 944 | 897.500000 | 897.5 | 2 |
Bank Australia: Code red | 780 | 934 | 857.000000 | 857.0 | 2 |
Swinburne Edge: A new work era | 126 | 1030 | 578.000000 | 578.0 | 2 |
Michelin: Built to keep you moving | 752 | 913 | 832.500000 | 832.5 | 2 |
Guardian US press office | 473 | 544 | 508.500000 | 508.5 | 2 |
SAP: Transformation mindset | 796 | 1059 | 927.500000 | 927.5 | 2 |
Specsavers: Liberty London | 642 | 883 | 762.500000 | 762.5 | 2 |
Honda CR-V: Joy in the detail | 667 | 936 | 801.500000 | 801.5 | 2 |
Volvo Car Australia: Family bond | 710 | 828 | 769.000000 | 769.0 | 2 |
Hurtigruten: Discover Norway | 717 | 842 | 779.500000 | 779.5 | 2 |
IFAW: Help animals thrive | 770 | 1121 | 945.500000 | 945.5 | 2 |
Plico: Renewable energy | 812 | 883 | 847.500000 | 847.5 | 2 |
Fed Square: Sustainable September | 794 | 971 | 882.500000 | 882.5 | 2 |
Kyndryl:The Heart of Progress | 1121 | 1248 | 1184.500000 | 1184.5 | 2 |
Dairy Australia: fracture research | 922 | 972 | 947.000000 | 947.0 | 2 |
OPSM: a vision for safer roads | 758 | 980 | 869.000000 | 869.0 | 2 |
Archie Rose: Made in good spirits | 832 | 1117 | 974.500000 | 974.5 | 2 |
Kyco: Full transparency | 800 | 800 | 800.000000 | 800.0 | 1 |
The last taboo | 737 | 737 | 737.000000 | 737.0 | 1 |
Mazda: Sustainable style | 811 | 811 | 811.000000 | 811.0 | 1 |
Marine Stewardship Council: Saltwater Schools | 874 | 874 | 874.000000 | 874.0 | 1 |
All Saints' College: The Education Revolution is Here | 830 | 830 | 830.000000 | 830.0 | 1 |
Daikin: Pure Air | 946 | 946 | 946.000000 | 946.0 | 1 |
Adult social care careers in Essex | 934 | 934 | 934.000000 | 934.0 | 1 |
WeAre8: Built to do good | 810 | 810 | 810.000000 | 810.0 | 1 |
West Australian Opera: Discover season 2022 | 900 | 900 | 900.000000 | 900.0 | 1 |
Melbourne Museum: Unearthing an icon | 736 | 736 | 736.000000 | 736.0 | 1 |
Specsavers: Focus on health | 928 | 928 | 928.000000 | 928.0 | 1 |
Australian World Orchestra: Zubin Mehta | 812 | 812 | 812.000000 | 812.0 | 1 |
Monash University: Ask a lawyer | 876 | 876 | 876.000000 | 876.0 | 1 |
Kathmandu: Sustainable future | 892 | 892 | 892.000000 | 892.0 | 1 |
Disney: See How They Run | 715 | 715 | 715.000000 | 715.0 | 1 |
Plan International Australia: Global Hunger Crisis Appeal | 1273 | 1273 | 1273.000000 | 1273.0 | 1 |
NITV: Always Was, Always Will Be | 869 | 869 | 869.000000 | 869.0 | 1 |
RSPCA Australia: RSPCA Approved Farming Scheme | 1000 | 1000 | 1000.000000 | 1000.0 | 1 |
Releaseit: Ready to rent | 672 | 672 | 672.000000 | 672.0 | 1 |
Curtin University: Humanities | 833 | 833 | 833.000000 | 833.0 | 1 |
Searchlight Pictures: Nightmare Alley | 761 | 761 | 761.000000 | 761.0 | 1 |
Sydney Opera House: Antidote festival | 884 | 884 | 884.000000 | 884.0 | 1 |
Searchlight Pictures: The French Dispatch | 832 | 832 | 832.000000 | 832.0 | 1 |
Canna: Small space gardening | 1002 | 1002 | 1002.000000 | 1002.0 | 1 |
Guardian Sustainable Business | 810 | 810 | 810.000000 | 810.0 | 1 |
Southern Cross University: better energy | 1124 | 1124 | 1124.000000 | 1124.0 | 1 |
OPSM: Optimal health | 1119 | 1119 | 1119.000000 | 1119.0 | 1 |
Specsavers: Wearable art | 1054 | 1054 | 1054.000000 | 1054.0 | 1 |
Michelin: Driving the Future | 737 | 737 | 737.000000 | 737.0 | 1 |
GNM archive | 351 | 351 | 351.000000 | 351.0 | 1 |
# List of sections with more than 10 articles
section_counts = gu_df['sectionName'].value_counts()
greater_10_lst = section_counts[section_counts > 10].index.tolist()
gu_df[gu_df['sectionName'].isin(greater_10_lst)].groupby('sectionName')['sectionName'].agg(['count']).sort_values('count', ascending=False)
sectionName | count |
---|---|
World news | 8873 |
Australia news | 5593 |
Opinion | 5169 |
Football | 5075 |
Sport | 4811 |
UK news | 3933 |
US news | 3877 |
Business | 3800 |
Politics | 3263 |
Environment | 3122 |
Life and style | 2951 |
Film | 2511 |
Music | 2409 |
Television & radio | 2261 |
Books | 2224 |
Society | 2169 |
Stage | 1550 |
Art and design | 1078 |
Food | 1051 |
Culture | 1050 |
Technology | 1013 |
Money | 970 |
Global development | 845 |
Media | 798 |
News | 695 |
Education | 675 |
Science | 668 |
Travel | 644 |
Fashion | 451 |
Law | 279 |
Games | 257 |
Crosswords | 188 |
From the Observer | 111 |
Guardian Masterclasses | 86 |
Global | 41 |
GNM press office | 33 |
Inequality | 26 |
Info | 20 |
The invested generation | 16 |
Cities | 13 |
Animals farmed | 11 |
Many sections seem to be concerned with only one very specific topic. These also contain very few articles.
'Animals farmed' is a special series inside the environment section. It contains articles about food production and climate issues, so it can also be considered news.
'Inequality' has articles about policies and current political and societal issues, so it belongs to news.
# Drop sections not related to 'real' news
sections_to_drop = ['Football', 'Sport', 'Life and style', 'Film', 'Music', 'Television & radio', 'Books', 'Society',
'Stage', 'Art and design', 'Food', 'Culture', 'Media', 'Travel', 'Fashion', 'Games', 'Crosswords',
'Guardian Masterclasses', 'GNM press office', 'Info', 'The invested generation', 'Cities']
Select only sections from news-related categories that contain more than 10 items.
gu_news_df = gu_df[~gu_df['sectionName'].isin(sections_to_drop) & gu_df['sectionName'].isin(greater_10_lst)]
with pd.option_context('display.max_rows', None):
    # Aggregation names are passed as strings, which avoids deprecation warnings in newer pandas versions
    display(gu_news_df.groupby('sectionName').agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'})
            .sort_values([('sectionName', 'count')], ascending=False))
sectionName | wordcount min | wordcount max | wordcount mean | wordcount median | sectionName count |
---|---|---|---|---|---|
World news | 44 | 6734 | 794.656148 | 706.0 | 8873 |
Australia news | 0 | 3674 | 839.524584 | 781.0 | 5593 |
Opinion | 0 | 3652 | 859.196750 | 892.0 | 5169 |
UK news | 41 | 7842 | 650.519959 | 587.0 | 3933 |
US news | 0 | 5215 | 862.981171 | 785.0 | 3877 |
Business | 0 | 6361 | 663.485000 | 614.0 | 3800 |
Politics | 0 | 5816 | 722.032179 | 657.0 | 3263 |
Environment | 0 | 6990 | 762.019539 | 682.0 | 3122 |
Technology | 79 | 4100 | 753.202369 | 634.0 | 1013 |
Money | 107 | 4231 | 714.265979 | 571.0 | 970 |
Global development | 117 | 6062 | 884.699408 | 811.0 | 845 |
News | 61 | 7824 | 503.050360 | 227.0 | 695 |
Education | 67 | 5699 | 648.908148 | 569.0 | 675 |
Science | 38 | 6769 | 702.636228 | 584.5 | 668 |
Law | 94 | 2719 | 730.716846 | 650.0 | 279 |
From the Observer | 269 | 1605 | 682.702703 | 387.0 | 111 |
Global | 0 | 2710 | 816.560976 | 657.0 | 41 |
Inequality | 152 | 2000 | 671.269231 | 614.0 | 26 |
Animals farmed | 882 | 1112 | 995.272727 | 989.0 | 11 |
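The two selection criteria used above (an exclusion list and a minimum section size) can be combined in one reusable function. This is a sketch on toy data with made-up counts; `select_sections` and its parameters are hypothetical names, not part of the repo.

```python
import pandas as pd


def select_sections(df: pd.DataFrame, drop: list[str], min_count: int = 10,
                    column: str = "sectionName") -> pd.DataFrame:
    """Keep rows whose section is not in `drop` and has more than `min_count` rows."""
    counts = df[column].value_counts()
    large_enough = counts[counts > min_count].index
    mask = ~df[column].isin(drop) & df[column].isin(large_enough)
    return df[mask]


# Toy data with made-up counts: 12 news rows, 15 sport rows, 3 weather rows
toy_df = pd.DataFrame(
    {"sectionName": ["World news"] * 12 + ["Sport"] * 15 + ["Weather"] * 3}
)
news_df = select_sections(toy_df, drop=["Sport"])
print(news_df["sectionName"].value_counts())
```

In this toy example 'Sport' is removed by the exclusion list and 'Weather' by the size threshold, so only 'World news' remains.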
gu_news_df[gu_news_df['wordcount'] == 0][['body', 'bodyText']]
index | body | bodyText |
---|---|---|
379 | <figure class="element element-atom"> \n <gu-a... | NaN |
556 | <figure class="element element-interactive int... | NaN |
1001 | <figure class="element element-interactive int... | NaN |
1428 | <figure class="element element-interactive int... | NaN |
2024 | <figure class="element element-interactive int... | NaN |
... | ... | ... |
72608 | <figure class="element element-interactive int... | NaN |
73475 | <figure class="element element-interactive int... | NaN |
74057 | <figure class="element element-interactive int... | NaN |
74500 | <figure class="element element-image element--... | NaN |
74902 | <figure class="element element-interactive int... | NaN |
138 rows × 2 columns
Items with wordcount = 0 have some kind of HTML content that is not text.
print(gu_news_df[gu_news_df['wordcount'] == 0].iloc[0]['body'])
print(gu_news_df[gu_news_df['wordcount'] == 0].iloc[120]['body'])
<figure class="element element-atom">
<gu-atom data-atom-id="7c5c6f4d-068a-455b-88d9-3d0274c6c70d" data-atom-type="quiz">
<div>
<div class="quiz" data-questions-length="8" data-title="The Almost Great Electricity Crisis Quiz">
<ol class="quiz__questions">
<li class="quiz__question question" data-text="What is load shedding?"><p class="question__text">What is load shedding?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="A yoga term describing the relief when moving from the downward dog to a low lunge." data-correct="false"><p class="answer__text">A yoga term describing the relief when moving from the downward dog to a low lunge.</p></li>
<li class="question__answer answer" data-text=" A fancy name for a blackout." data-correct="false"><p class="answer__text"> A fancy name for a blackout.</p></li>
<li class="question__answer answer" data-text="A last resort when electricity market bosses have tried everything else to balance out demand with supply. " data-correct="true"><p class="answer__text">A last resort when electricity market bosses have tried everything else to balance out demand with supply. </p></li>
<li class="question__answer answer" data-text=" Any time a power generator has to carry out planned repairs." data-correct="false"><p class="answer__text"> Any time a power generator has to carry out planned repairs.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What is an example of dispatchability?"><p class="question__text">What is an example of dispatchability?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="An off-the-shelf power source that can be put in place quickly, such as a battery or a solar panel." data-correct="false"><p class="answer__text">An off-the-shelf power source that can be put in place quickly, such as a battery or a solar panel.</p></li>
<li class="question__answer answer" data-text="Foreign minister Penny Wong being sent to the Pacific straight after an election to rebuild Australia’s reputation on climate change." data-correct="false"><p class="answer__text">Foreign minister Penny Wong being sent to the Pacific straight after an election to rebuild Australia’s reputation on climate change.</p></li>
<li class="question__answer answer" data-text="A shop promising immediate delivery of a power bank to keep your mobile phone going when it runs out of juice." data-correct="false"><p class="answer__text">A shop promising immediate delivery of a power bank to keep your mobile phone going when it runs out of juice.</p></li>
<li class="question__answer answer" data-text="A source of electricity that can be controlled to keep supply and demand balanced in the system." data-correct="true"><p class="answer__text">A source of electricity that can be controlled to keep supply and demand balanced in the system.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What is a “default market offer” in the electricity sector?"><p class="question__text">What is a “default market offer” in the electricity sector?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="An electricity company can’t pay back its loans, but someone down the market is offering to sell you that company in exchange for all those power banks you keep buying. " data-correct="false"><p class="answer__text">An electricity company can’t pay back its loans, but someone down the market is offering to sell you that company in exchange for all those power banks you keep buying. </p></li>
<li class="question__answer answer" data-text="A price set by an energy regulator that influences how much an electricity retailer can charge you." data-correct="true"><p class="answer__text">A price set by an energy regulator that influences how much an electricity retailer can charge you.</p></li>
<li class="question__answer answer" data-text="The standard price for borrowing a market stallholder’s extension cable which can rise or fall in line with the cost of lettuce." data-correct="false"><p class="answer__text">The standard price for borrowing a market stallholder’s extension cable which can rise or fall in line with the cost of lettuce.</p></li>
<li class="question__answer answer" data-text="The minimum cost that power generators such as a wind farm or a coal plant say they can provide electricity for." data-correct="false"><p class="answer__text">The minimum cost that power generators such as a wind farm or a coal plant say they can provide electricity for.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What is the wholesale electricity market?"><p class="question__text">What is the wholesale electricity market?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="The price a fruit and vegetable wholesaler pays for air-conditioning so a $12 iceberg lettuce doesn’t go limp." data-correct="false"><p class="answer__text">The price a fruit and vegetable wholesaler pays for air-conditioning so a $12 iceberg lettuce doesn’t go limp.</p></li>
<li class="question__answer answer" data-text="A place where electrons go to buy confectionery in bulk." data-correct="false"><p class="answer__text">A place where electrons go to buy confectionery in bulk.</p></li>
<li class="question__answer answer" data-text="A virtual marketplace in which retailers buy electricity from companies that generate it." data-correct="true"><p class="answer__text">A virtual marketplace in which retailers buy electricity from companies that generate it.</p></li>
<li class="question__answer answer" data-text="Any participant in the electricity market – from a coal plant to a battery owner – that can theoretically deliver at least 1,000 megawatts of electricity." data-correct="false"><p class="answer__text">Any participant in the electricity market – from a coal plant to a battery owner – that can theoretically deliver at least 1,000 megawatts of electricity.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What is a Lack of Reserve Notice?"><p class="question__text">What is a Lack of Reserve Notice?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="A notice from your boss at the cafe to buy more avocados (but not too many) because not enough are available to meet smashed-avo demand." data-correct="false"><p class="answer__text">A notice from your boss at the cafe to buy more avocados (but not too many) because not enough are available to meet smashed-avo demand.</p></li>
<li class="question__answer answer" data-text="A notice issued by the electricity market operator to all market participants, such as coal plant owners and large battery operators, about a potential or actual shortfall in energy supply." data-correct="true"><p class="answer__text">A notice issued by the electricity market operator to all market participants, such as coal plant owners and large battery operators, about a potential or actual shortfall in energy supply.</p></li>
<li class="question__answer answer" data-text="An email from the coach to say the squad is threadbare this week and does anyone have a mate that can play wing defence?" data-correct="false"><p class="answer__text">An email from the coach to say the squad is threadbare this week and does anyone have a mate that can play wing defence?</p></li>
<li class="question__answer answer" data-text="A notice issued by electricity generators to the market that they can no longer provide as much power as usual." data-correct="false"><p class="answer__text">A notice issued by electricity generators to the market that they can no longer provide as much power as usual.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What does FCAS stand for?"><p class="question__text">What does FCAS stand for?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="Flexible Contingency Auxiliary System. Also known as a parasitic load, it refers to energy used by generators themselves." data-correct="false"><p class="answer__text">Flexible Contingency Auxiliary System. Also known as a parasitic load, it refers to energy used by generators themselves.</p></li>
<li class="question__answer answer" data-text="Forecast Close Asset Strategy. A plan agreed between a regulator and a power provider for staged close-down of a power plant." data-correct="false"><p class="answer__text">Forecast Close Asset Strategy. A plan agreed between a regulator and a power provider for staged close-down of a power plant.</p></li>
<li class="question__answer answer" data-text="Frequency Control Ancillary Services. A market in which generators can provide services that keep the electricity system balanced." data-correct="true"><p class="answer__text">Frequency Control Ancillary Services. A market in which generators can provide services that keep the electricity system balanced.</p></li>
<li class="question__answer answer" data-text="Fully Cooked And Solared. Electricity market slang for when cheaper renewables push fossil fuels out of the market." data-correct="false"><p class="answer__text">Fully Cooked And Solared. Electricity market slang for when cheaper renewables push fossil fuels out of the market.</p></li>
</ol></li>
<li class="quiz__question question" data-text="What is the integrated system plan?"><p class="question__text">What is the integrated system plan?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="A detailed plan produced every two years by the Australian Energy Market Operator on the optimal future design of the National Electricity Market." data-correct="true"><p class="answer__text">A detailed plan produced every two years by the Australian Energy Market Operator on the optimal future design of the National Electricity Market.</p></li>
<li class="question__answer answer" data-text="A plan produced every two years by the energy department to link the National Electricity Market with the Northern Territory’s electricity networks and Western Australia’s South West Interconnected System." data-correct="false"><p class="answer__text">A plan produced every two years by the energy department to link the National Electricity Market with the Northern Territory’s electricity networks and Western Australia’s South West Interconnected System.</p></li>
<li class="question__answer answer" data-text="A plan that owners of coal-power plants must submit each year detailing how quickly they are working to close down." data-correct="false"><p class="answer__text">A plan that owners of coal-power plants must submit each year detailing how quickly they are working to close down.</p></li>
<li class="question__answer answer" data-text="A plan to integrate all the systems in an integrated and systematic way that both plans and integrates all the things." data-correct="false"><p class="answer__text">A plan to integrate all the systems in an integrated and systematic way that both plans and integrates all the things.</p></li>
</ol></li>
<li class="quiz__question question" data-text="If you hear energy wonks and ministers talking about a “capacity mechanism”, what might they be referring to?"><p class="question__text">If you hear energy wonks and ministers talking about a “capacity mechanism”, what might they be referring to?</p>
<ol class="question__answers" type="A" data-answers-length="4">
<li class="question__answer answer" data-text="A proposal to make sure that the electricity system always has access to enough power generation." data-correct="true"><p class="answer__text">A proposal to make sure that the electricity system always has access to enough power generation.</p></li>
<li class="question__answer answer" data-text="The network of poles and wires that delivers electricity to consumers." data-correct="false"><p class="answer__text">The network of poles and wires that delivers electricity to consumers.</p></li>
<li class="question__answer answer" data-text="Energy worker jargon for a meal break. “I’m off for my capacity mechanism, boss.”" data-correct="false"><p class="answer__text">Energy worker jargon for a meal break. “I’m off for my capacity mechanism, boss.”</p></li>
<li class="question__answer answer" data-text="Google it, mate." data-correct="false"><p class="answer__text">Google it, mate.</p></li>
</ol></li>
</ol>
<h2 class="quiz__correct-answers-title">Solutions</h2>
<p class="quiz__correct-answers">1:C - When electricity market bosses have tried everything else to balance out demand with supply, they can resort to load shedding – deliberately turning off power to some places to reduce demand and stop a large-scale collapse. It’s different to a blackout, which refers to an unplanned outage. , 2:D - Technically, coal, gas, large-scale solar and wind are all dispatchable forms of electricity. Rooftop solar on its own isn’t as it can’t be controlled in the same way. “Dispatchable” is sometimes incorrectly used interchangeably with “firm”, which in market jargon relates to how dependable and predictable a source of power is, and “flexibility”, which refers to how quickly it can be turned up, down, on or off., 3:B - Households and small businesses normally buy electricity on either a promotional deal (a “market offer”) or a default “standing offer”. In South Australia, New South Wales and south-east Queensland, the Australian Energy Regulator sets a maximum price a retailer can charge – called the “default market offer” – and this acts like a soft cap on prices. In Victoria, the state’s essential services commission sets this, and calls it the Victorian Default Offer., 4:C - Retailers buy electricity either at a “spot price”, which changes every five minutes, or a contracted price agreed between retailers and generators over a set period. The wholesale price makes up about a third of a consumer’s electricity bill., 5:B - These notices come from the Australian Energy Market Operator. The most serious is an “actual” level 3 notice, which means power supply is probably being turned off somewhere to keep the system balanced., 6:C - Electricity market operators call on these services if there are sudden changes, such as a fault in a power plant or a big consumer stops needing power., 7:A - The plan looks at how the market could be developed over the coming decades that would be low cost, reliable and in line with climate targets. 
The National Electricity Market covers NSW, the ACT, Queensland, South Australia, Victoria and Tasmania. The next plan is due at the end of June., 8:A - This is an idea being explored by energy ministers and market regulators to pay electricity providers to have guaranteed power available at certain times when demand is high. </p>
<h3 class="quiz__scores-title">Scores</h3>
<ol class="quiz__scores" data-result-groups-length="9">
<li class="quiz__score score" data-title="Rating: Terrawatt. You know transitioning Australia’s electricity grid away from fossil fuels is a crucial part of getting to net zero, and so you want to know the detail. Either that or all your friends work at Aoemo." data-share="I got _/_ in <quiz title>" data-min-score="8"><p class="score__min-score">8 and above.</p><p class="score__title">Rating: Terrawatt. You know transitioning Australia’s electricity grid away from fossil fuels is a crucial part of getting to net zero, and so you want to know the detail. Either that or all your friends work at Aoemo.</p></li>
<li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="7"><p class="score__min-score">7 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
<li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="6"><p class="score__min-score">6 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
<li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="5"><p class="score__min-score">5 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
<li class="quiz__score score" data-title="Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort." data-share="I got _/_ in <quiz title>" data-min-score="4"><p class="score__min-score">4 and above.</p><p class="score__title">Rating: Megawatt. You know the difference between your NEM and your elbow. Great effort.</p></li>
<li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="3"><p class="score__min-score">3 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
<li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="2"><p class="score__min-score">2 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
<li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="1"><p class="score__min-score">1 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
<li class="quiz__score score" data-title="Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it." data-share="I got _/_ in <quiz title>" data-min-score="0"><p class="score__min-score">0 and above.</p><p class="score__title">Rating: Watt? You seem to think the future of Australia’s electricity supply is a joke. But seeing as you’re here. How many birds does it take to change a lightbulb? Toucan do it.</p></li>
</ol>
</div>
</div>
</gu-atom>
</figure>
<figure class="element element-interactive interactive element--showcase" data-interactive="https://interactive.guim.co.uk/embed/iframe-wrapper/0.1/boot.js" data-canonical-url="https://interactive.guim.co.uk/2016/03/comics-master-2016/embed/embed.html?srcs-mobile=https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_0_1874_5208/720.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_5252_1874_4207/445.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_9481_1874_3012/622.jpg&ratios-mobile=277.77777777777777+224.21524663677127+160.7717041800643&srcs-desktop=https://media.guim.co.uk/b671fe529dcbd3f70c31532b3bdf7673a880b557/0_0_3508_6130/3508.jpg&ratios-desktop=174.82517482517483&credit=Cartoon%20by%20First%20Dog%20on%20the%20Moon&background=336699&vpadding=5" data-alt="First Dog on ... poisonous Australian animals!"> <a href="https://interactive.guim.co.uk/2016/03/comics-master-2016/embed/embed.html?srcs-mobile=https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_0_1874_5208/720.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_5252_1874_4207/445.jpg+https://media.guim.co.uk/3ad1f0d19c9695ff3a2a32e975dcd4c0d0876d28/0_9481_1874_3012/622.jpg&ratios-mobile=277.77777777777777+224.21524663677127+160.7717041800643&srcs-desktop=https://media.guim.co.uk/b671fe529dcbd3f70c31532b3bdf7673a880b557/0_0_3508_6130/3508.jpg&ratios-desktop=174.82517482517483&credit=Cartoon%20by%20First%20Dog%20on%20the%20Moon&background=336699&vpadding=5">First Dog on ... poisonous Australian animals!</a> </figure>
Items with 0 words contain interactive content, such as quizzes and multimedia.
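To get an overview of which kinds of embeds hide behind the zero-word items, the `body` HTML could be labelled with a small helper. This is only a sketch based on the two printed examples: it assumes that the embed type can be read from the `data-atom-type` attribute or from the `element-...` class of the first `<figure>` tag. The function name `embed_kind` and the sample strings are hypothetical.

```python
import re
from collections import Counter

def embed_kind(body_html):
    """Return a coarse label for the first embed found in an article body."""
    # Atoms (e.g. quizzes) carry their type in a data attribute
    atom = re.search(r'data-atom-type="([^"]+)"', body_html)
    if atom:
        return 'atom:' + atom.group(1)
    # Otherwise fall back to the first 'element-...' class of the figure tag
    figure = re.search(r'<figure class="([^"]+)"', body_html)
    if figure:
        for name in figure.group(1).split():
            if name.startswith('element-'):
                return name[len('element-'):]
    return 'unknown'

# Shortened sample bodies modelled on the output above
samples = [
    '<figure class="element element-atom"> <gu-atom data-atom-id="x" data-atom-type="quiz">',
    '<figure class="element element-interactive interactive element--showcase" data-alt="y">',
    '<figure class="element element-image element--showcase">',
]

kinds = Counter(embed_kind(s) for s in samples)
print(kinds)
```

On the real data, something like `gu_news_df.loc[gu_news_df['wordcount'] == 0, 'body'].map(embed_kind).value_counts()` might give the distribution, provided it is run before the zero-word rows are removed.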
# remove rows where wordcount is 0
gu_news_df = gu_news_df[gu_news_df['wordcount'] != 0]
with pd.option_context('display.max_colwidth', None):
    # show the beginning of the body of items with a wordcount of 1
    print(gu_news_df[gu_news_df['wordcount'] == 1]['body'].to_string()[:300])
12230 <figure class="element element-atom"> \n <gu-atom data-atom-id="59f05449-081a-4cbe-a869-2341aa1c4369" data-atom-type="quiz"> \n <div>\n <div class="quiz" data-questions-length="25" data-title="The bumper climate quiz">\n <ol class="quiz__questions">\n <li class="quiz__question que
# remove rows where wordcount is 1
gu_news_df = gu_news_df[gu_news_df['wordcount'] != 1]
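Since word counts are non-negative integers, the two removal steps (0 and 1 words) could also be written as a single minimum-length filter. The sketch below uses a toy frame, and the threshold name `min_words` is an assumption for illustration, not a value taken from the notebook.

```python
import pandas as pd

# Toy frame with the two problematic word counts and two regular articles
toy_df = pd.DataFrame({
    'wordcount': [0, 1, 13, 706],
    'body': ['<figure>', '<figure>', '<p>a</p>', '<p>b</p>'],
})

min_words = 2  # hypothetical threshold: drops items with 0 or 1 words
filtered = toy_df[toy_df['wordcount'] >= min_words]

print(filtered['wordcount'].tolist())
```

A higher threshold would also remove very short items, such as the 13-word minimum still visible in the Opinion section, but whether that is desirable depends on the use case.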
with pd.option_context('display.max_rows', None):
    # same statistics as before, now without the items of 0 or 1 words
    display(gu_news_df.groupby('sectionName')
            .agg({'wordcount': ['min', 'max', 'mean', 'median'], 'sectionName': 'count'})
            .sort_values([('sectionName', 'count')], ascending=False))
sectionName | wordcount min | wordcount max | wordcount mean | wordcount median | sectionName count |
---|---|---|---|---|---|
World news | 44 | 6734 | 794.656148 | 706.0 | 8873 |
Australia news | 80 | 3674 | 840.125425 | 781.0 | 5589 |
Opinion | 13 | 3652 | 881.013291 | 900.0 | 5041 |
UK news | 41 | 7842 | 650.519959 | 587.0 | 3933 |
US news | 99 | 5215 | 863.203818 | 785.5 | 3876 |
Business | 109 | 6361 | 663.659647 | 614.0 | 3799 |
Politics | 32 | 5816 | 722.253525 | 657.5 | 3262 |
Environment | 40 | 6990 | 762.507692 | 682.0 | 3120 |
Technology | 79 | 4100 | 753.202369 | 634.0 | 1013 |
Money | 107 | 4231 | 714.265979 | 571.0 | 970 |
Global development | 117 | 6062 | 884.699408 | 811.0 | 845 |
News | 61 | 7824 | 503.050360 | 227.0 | 695 |
Education | 67 | 5699 | 648.908148 | 569.0 | 675 |
Science | 38 | 6769 | 702.636228 | 584.5 | 668 |
Law | 94 | 2719 | 730.716846 | 650.0 | 279 |
From the Observer | 269 | 1605 | 682.702703 | 387.0 | 111 |
Global | 73 | 2710 | 858.435897 | 659.0 | 39 |
Inequality | 152 | 2000 | 671.269231 | 614.0 | 26 |
Animals farmed | 882 | 1112 | 995.272727 | 989.0 | 11 |
Tags in all articles
# convert tags webTitles string into list
gu_news_df['tag_webTitles'] = gu_news_df['tag_webTitles'].apply(lambda x: x.strip('[]').replace("'", '').split(', '))
# create set of tags
tags_lst = []
tags_lst.extend(gu_news_df['tag_webTitles'])
print("Number of unique tags: ", len(set(list(chain.from_iterable(tags_lst)))))
Number of unique tags: 7643
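The conversion of the tag strings relies on `strip`, `replace` and `split`, which can break tags that contain an apostrophe or a comma. If the column really holds the string representation of a Python list (an assumption suggested by the code above), `ast.literal_eval` might be a more robust alternative. The helper name `parse_tag_list` and the example tags are made up for illustration.

```python
import ast

def parse_tag_list(raw):
    """Parse a stringified Python list such as "['UK news', 'Politics']"."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # not a valid Python literal: return an empty tag list
        return []
    return [str(t) for t in parsed] if isinstance(parsed, list) else []

# Made-up example with an apostrophe and a comma inside tags
raw = "['UK news', \"Children's health\", 'Trade, tariffs']"

print(parse_tag_list(raw))

# The strip/replace/split approach, applied to the same string for comparison
naive = raw.strip('[]').replace("'", '').split(', ')
print(naive)
```

In this example the naive approach yields four fragments instead of three tags, so the number of unique tags could be slightly off if such tags occur in the data.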
tags_set = set(list(chain.from_iterable(tags_lst)))
list(tags_set)[:10]
['William Morris',
'GDPR',
'Khalil El Halabi',
'Nikkei',
'Muska Najibullah',
'Wales',
'Australian Open',
'Paris climate agreement',
'The upside',
'Chris Riddell']
# count how often each tag occurs (Series.value_counts replaces the deprecated pd.value_counts)
tag_count = pd.Series(list(chain.from_iterable(tags_lst))).value_counts()
tag_count
Article 42825
News 26382
UK news 22981
The Guardian 18397
Main section 16733
...
Steven Borowiec 1
MIT - Massachusetts Institute of Technology 1
Royal Bank of Scotland 1
Contempt of court 1
Jarvis Cocker 1
Length: 7643, dtype: int64
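The same kind of tag frequency table could also be built without pandas, using `collections.Counter` from the standard library. The sketch below works on a small made-up list of tag lists (`toy_tags`), not on the real data.

```python
from collections import Counter
from itertools import chain

# Made-up tag lists, one list per article
toy_tags = [
    ['Article', 'News', 'UK news'],
    ['Article', 'Opinion'],
    ['Article', 'News'],
]

# Flatten the nested lists and count every tag
tag_counter = Counter(chain.from_iterable(toy_tags))

print(tag_counter.most_common(2))  # [('Article', 3), ('News', 2)]
```

`most_common(n)` plays a similar role to `nlargest(n)` on the pandas Series.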
tag_count.nlargest(25)
Article 42825
News 26382
UK news 22981
The Guardian 18397
Main section 16733
World news 11909
UK Home News 8171
Opinion 7809
Politics 7691
Australia news 6788
Features 6733
Business 6648
US news 6337
Comment 6167
Environment 5496
Europe 5325
Australia News 4808
Coronavirus 4304
UK Business 3821
US News 3637
Journal 3517
UK Foreign 3493
Ukraine 3354
Russia 3291
Conservatives 3215
dtype: int64
tag_count.nsmallest(25)
Rugby sevens 1
Mary Beard 1
Manuel Cortes 1
Brenna Hassett 1
James Cooray Smith 1
Alex Blasdel 1
El Niño southern oscillation 1
Michael Hogan 1
Dzhokhar Tsarnaev 1
Josie Dale-Jones 1
Aaliyah 1
Peter Bengtsen 1
Shaun Peter Qureshi 1
Chris Curtis 1
Boston Marathon bombing 1
Neal Katyal 1
Neelie Kroes 1
Roderick Beaton 1
Mark Bennister 1
Muska Najibullah 1
Hannah Brady 1
Paulina Velasco 1
Hulu 1
Spanish food and drink 1
Eve Fairbanks 1
dtype: int64
from wordcloud import WordCloud
wordcloud = WordCloud().generate(' '.join(list(chain.from_iterable(tags_lst))))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Most tags refer to sections.
sections_lst = list(gu_df['sectionName'].unique())
pillars_lst = list(gu_df['pillarName'].unique())
pillars_lst
['Sport', 'Arts', 'News', 'Opinion', 'Lifestyle', nan]
tag_count[~tag_count.index.isin(sections_lst)]
Article 74956
The Guardian 33307
Main section 20491
Features 19647
UK Home News 10575
...
Commonwealth Games 2002 1
Bill Paxton 1
Sara Paretsky 1
COP 21: Paris climate change conference 2015 1
Midnight Special 1
Length: 13894, dtype: int64
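The list of the 25 most frequent tags contains both 'Australia news' and 'Australia News', as well as 'US news' and 'US News', so an exact `isin` match against the section names leaves the capitalised variants in the result. If these variants should also count as section tags (an assumption), a case-insensitive comparison could be used. The sketch below uses made-up counts and the hypothetical names `toy_tag_count` and `toy_sections`.

```python
from collections import Counter

# Made-up tag counts, including two spellings of the same section
toy_tag_count = Counter({
    'Article': 5,
    'Australia news': 4,
    'Australia News': 3,
    'US News': 2,
    'Ukraine': 1,
})

# Section names may contain missing values (NaN), so non-strings are skipped
toy_sections = ['Australia news', 'US news', float('nan')]
section_keys = {s.casefold() for s in toy_sections if isinstance(s, str)}

# Keep only tags whose case-folded form is not a section name
non_section = {
    tag: n for tag, n in toy_tag_count.items()
    if tag.casefold() not in section_keys
}

print(non_section)  # {'Article': 5, 'Ukraine': 1}
```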
Tags in sections
gu_top25_df.groupby('sectionName').apply(lambda x: )
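The lambda in the cell above has no body, so what should be computed per section is not stated. One possible reading, given the heading, is a count of tags within each section. The sketch below shows this on a toy frame with `explode` and a grouped `value_counts`. The column names follow the notebook, while the frame `toy_df` and its contents are made up.

```python
import pandas as pd

# Toy frame: one row per article, tags already converted to lists
toy_df = pd.DataFrame({
    'sectionName': ['World news', 'World news', 'Politics'],
    'tag_webTitles': [['Ukraine', 'Russia'], ['Ukraine', 'Europe'], ['Conservatives']],
})

# One row per (article, tag) pair, then count tags within each section
tags_per_section = (
    toy_df.explode('tag_webTitles')
          .groupby('sectionName')['tag_webTitles']
          .value_counts()
)

print(tags_per_section)
```

The result is a Series indexed by section and tag, so the tags of a single section can be looked up with `tags_per_section.loc['World news']`.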