alephdata/memorious

Support for media monitoring

pudo opened this issue · 1 comments

pudo commented

Problem: we want to make media reporting, especially the articles and investigations published by OCCRP itself and its member centres, more accessible in Aleph. At the moment, the best option is to crawl a news web site's HTML pages and index all of them. This has the following issues:

  • It's sort of hard to configure memorious to recognise what is an article and what is not. This can only be done by path, but some CMSes don't provide reliable prefixes. As a consequence, we end up indexing a ton of article and category listings, which makes for super noisy search results.
  • HTML pages are then submitted to Aleph, which is kind of bad at showing them (due to security). What we really want is the plain text of the story body.
  • We also want to semantically interpret metadata like publication date, author, etc.

In order to improve this, I've introduced an Article schema in followthemoney 2.2, which describes a piece of news reporting. It's a pretty plain form of document. We should add a module to memorious that:

  • Lets you determine if something is an article both by traditional rules and by whether an xpath exists in the document (probably we just want to add xpath to memorious.helpers.rule:RULES).
  • Knows how to extract the raw article body (and ideally metadata) via newspaper.
  • Possibly: lets you assign ftm properties from xpath queries, sort of as an HTML mapping language.
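
To make the first point concrete, here is a rough sketch of how a nested match spec with an added xpath rule could be evaluated. This is not the actual memorious.helpers.rule code: the `matches` function and its signature are hypothetical, and the standard library's ElementTree (which only supports a subset of XPath and needs well-formed markup) stands in for whatever HTML parser the real module would use.

```python
import re
import xml.etree.ElementTree as ET


def matches(spec, url, doc):
    """Evaluate a nested match spec against a URL and a parsed document.

    Hypothetical helper, not part of memorious. `spec` is a dict with a
    single key such as "or", "and", "not", "pattern" or "xpath".
    """
    if "or" in spec:
        return any(matches(s, url, doc) for s in spec["or"])
    if "and" in spec:
        return all(matches(s, url, doc) for s in spec["and"])
    if "not" in spec:
        return not matches(spec["not"], url, doc)
    if "pattern" in spec:
        # Traditional rule: regular expression against the URL.
        return re.match(spec["pattern"], url) is not None
    if "xpath" in spec:
        # Proposed rule: true if the query selects at least one element.
        return doc.find(spec["xpath"]) is not None
    raise ValueError("Unknown rule: %r" % spec)


# Usage: a page with a published-date div counts as an article even
# though its URL has no recognisable prefix.
page = ET.fromstring(
    '<html><body><div class="published-date">2019-05-01</div></body></html>'
)
spec = {
    "or": [
        {"pattern": ".*stories.*"},
        {"xpath": './/div[@class="published-date"]'},
    ]
}
is_article = matches(spec, "https://example.org/news/1", page)
```

The xpath rule needs the parsed document while the existing rules only look at the response, so the real implementation would have to decide where parsing happens.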

Sketch:

pipeline:
  parse:
    method: article
    params:
      match:
        or:
          - pattern: .*stories.*
          - xpath: .//div[@class="published-date"]
      parse:
        title:
          - .//h1[@class="title"]
        updatedAt:
          - .//div[@class="published-date"]
    handle:
      pass: store
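
For illustration, the `parse` mapping in the sketch above could be applied roughly like this. The `extract_properties` function is a hypothetical name, and taking the first xpath per property that yields any text is an assumption about the intended semantics, not something decided yet. ElementTree from the standard library is used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET


def extract_properties(doc, mapping):
    """Map property names to text values found via xpath queries.

    Hypothetical helper. For each property, queries are tried in order
    and the first one that yields non-empty text wins (an assumption).
    """
    props = {}
    for prop, queries in mapping.items():
        for query in queries:
            values = []
            for el in doc.findall(query):
                # Join nested text nodes, e.g. <h1>Some <em>title</em></h1>.
                text = "".join(el.itertext()).strip()
                if text:
                    values.append(text)
            if values:
                props[prop] = values
                break
    return props


# Usage with the two properties from the sketch config.
doc = ET.fromstring(
    '<html><body>'
    '<h1 class="title"> Hello <em>world</em></h1>'
    '<div class="published-date">2019-05-01</div>'
    '</body></html>'
)
mapping = {
    "title": ['.//h1[@class="title"]'],
    "updatedAt": ['.//div[@class="published-date"]'],
}
props = extract_properties(doc, mapping)
```

The resulting dict could then be assigned to an Article entity, with date parsing and other type cleaning left to followthemoney.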

Since this is necessarily dependent on followthemoney, we need to decide if a) this lives in its own Python module, or b) it's time to make memorious depend on ftm. I could see the latter enabling us to do quite a few good things, and maybe also resolving some weird inverted dependencies (like ftm-store knowing about memorious).

WIP pull request now open here: #167