fterh/sneakpeek

Today article has title only (no body)

fterh opened this issue · 2 comments

todayonline.com returns a page whose JavaScript fetches the actual article via XHR from an API. This works to our advantage, as the API returns neatly structured JSON with the title, author, publication date and, of course, the body.

$ URL='https://www.todayonline.com/commentary/why-we-are-pushing-divestment-fossil-fuel-fight-against-climate-change'
$ ARTICLEID=`curl -s "${URL}" | grep 'articleid' | grep -oP '(?<=content=")[0-9]+'`
$ ARTICLEBODY=`curl -s https://www.todayonline.com/api/v3/article/"${ARTICLEID}" | jq '.node.body'`
$ echo "${ARTICLEBODY}" | head -n3
"<p>Minister for Trade and Industry Chan Chun Sing recently spoke about how just as Singapore’s past 50 years have been defined by its successful water story, the next 50 will be defined by its ability to manage its energy challenges amidst the threat of climate change.</p>

<p>The <a href=\"http://ipcc.ch/report/sr15/\">special report from the Intergovernmental Panel on Climate Change (IPCC)</a> contained stark warnings to drastically reduce global greenhouse gas emissions. In particular, it called for governments to fully decarbonise their economies as soon as possible, which necessitates transitioning from fossil fuels to cleaner forms of energy.</p>
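The ID lookup in the session above could be wrapped in a small function that fails loudly when a page carries no article ID. This is only a sketch: it assumes the ID sits on a single line containing `articleid` with a numeric `content="..."` attribute, as the grep pipeline implies, and the name `extract_article_id` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an article page's HTML on stdin and print the
# numeric article ID, using the same grep pipeline as the session above.
# Requires GNU grep for -P (Perl-style lookbehind).
extract_article_id() {
  local id
  # Keep only the first match in case the page repeats the tag.
  id=$(grep 'articleid' | grep -oP '(?<=content=")[0-9]+' | head -n1)
  if [ -z "$id" ]; then
    echo "no article ID found in page" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

It would be called as `ARTICLEID=$(curl -s "${URL}" | extract_article_id) || exit 1`, so an empty ID never reaches the API request.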

The keys of the node object in the JSON returned by their API:

$ JSONSCHEMA=`curl -s https://www.todayonline.com/api/v3/article/"${ARTICLEID}" | jq '.node | keys'`
$ echo "${JSONSCHEMA}"
[
  "abstract",
  "author",
  "body",
  "bullet_tags",
  "external_author",
  "hero",
  "next",
  "node_id",
  "node_url",
  "opinion",
  "prev",
  "publication_date",
  "published",
  "quote",
  "quote_name",
  "related",
  "section",
  "short_title",
  "show_abstract",
  "sidebar",
  "sponsor",
  "status",
  "strapline",
  "title",
  "type",
  "updated_date"
]
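Given the keys listed above, the fields mentioned earlier (title, author, publication date and body) could be pulled out with a single jq filter. This is a sketch under assumptions: the issue does not show the types of these values, so the filter passes them through untouched, and the function name `article_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the API's JSON response on stdin and keep only
# the four fields named in the issue text. A key that is absent from the
# response comes out as null rather than causing an error.
article_summary() {
  jq '.node | {title, author, publication_date, body}'
}
```

For example: `curl -s https://www.todayonline.com/api/v3/article/"${ARTICLEID}" | article_summary`.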