Datafable/epu-index

Import old articles in database

Closed this issue · 11 comments


See #50 for a link to the tmp script.

Hi @bartaelterman, I'm implementing based on your script but have a few more questions:

  • In your script, url is set to "not provided". Does that still make sense, given that the field allows NULL values?
  • In your script, the fields epu_score and cleaned_text are not set; is this correct? It seems you're processing cleaned_text in the spiders. Maybe we should move this to the model's save method, so that cleaned_text is computed and saved each time the model is saved, regardless of how the article was created? (See the sketch after this comment.)

Thanks!
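
For illustration, here is a minimal sketch of the save() approach mentioned above. The field names and the clean() helper are assumptions for this example, not the project's actual code:

```python
# Hypothetical sketch only: compute cleaned_text in the model's save()
# method so it is filled in regardless of how the article was created
# (scraper or import script).
from django.db import models


def clean(text):
    """Stand-in for the cleaning logic currently living in the spiders."""
    return " ".join(text.split()).lower()


class Article(models.Model):
    title = models.CharField(max_length=255)
    text = models.TextField()
    cleaned_text = models.TextField(blank=True)

    def save(self, *args, **kwargs):
        # Recompute cleaned_text on every save, whatever the creation method.
        self.cleaned_text = clean(self.text)
        super(Article, self).save(*args, **kwargs)
```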

Hi @niconoe, indeed, some things are not correct in the script.

  • We don't have URLs for the old data, hence #63. The field shouldn't be set to "not provided"; it can be left empty.
  • The text in the file is indeed cleaned_text, not just text.
  • The EPU score should be there. I'll send a new file ASAP.

All right, thanks!

Does it make sense to store the text from the CSV in both the text and cleaned_text attributes? Or should I make text optional and keep it empty?

Hmm... it's probably best to save it in both text and cleaned_text.
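
For concreteness, a rough sketch of what the import could look like under that agreement; the column names, date format and model fields are assumptions based on this thread, not the actual script from #50:

```python
# Illustrative sketch only: load old articles from a TSV file, storing the
# text in both the text and cleaned_text fields and leaving url empty.
# Column names and Article field names are assumed for this example.
import csv
from datetime import datetime

from epu_index.models import Article  # assumed app and model names


def import_old_articles(path):
    with open(path) as tsv_file:
        for row in csv.DictReader(tsv_file, delimiter='\t'):
            Article.objects.create(
                journal=row['journal'],
                title=row['title'],
                published_at=datetime.strptime(row['date'], '%a %b %d %H %M %S %Y'),
                text=row['text'],
                cleaned_text=row['text'],  # same content in both fields
                epu_score=float(row['epu_score']),
                url='',  # no URLs for the old data (see #63)
            )
```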

The import script is written, but I'm facing a small issue: we have a constraint that considers an article a duplicate if it has the same journal, publication date and title. Some source articles don't look like duplicates but still fail the test.

This happens with generic titles such as "REACTIES" when the publication time is set to midnight. For example, with the command:

$ cat articles_with_score.tsv | grep "'REACTIES'"

You'll see several different articles titled "REACTIES", all with a publication time of "Mon Jul 13 00 00 00 2009".

Should we remove our constraint? Or change it so that it only fails if the text or epu_score is identical too? That seems a bit weird (especially the text option), but maybe it's a pragmatic solution... What do you think?
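
For reference, the duplicate check being discussed presumably amounts to something like the following (a hypothetical sketch, not the actual model definition):

```python
# Hypothetical sketch of the duplicate check under discussion: an article
# counts as a duplicate when journal, publication date and title all match.
from django.db import models


class Article(models.Model):
    journal = models.CharField(max_length=255)
    published_at = models.DateTimeField()
    title = models.CharField(max_length=255)

    class Meta:
        # Generic titles like "REACTIES" published at midnight collide here
        # even though the underlying articles differ.
        unique_together = ('journal', 'published_at', 'title')
```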

I've been looking at the articles you're pointing to. Indeed, for some of them the constraint is too stringent. I would say, let's add the text to the constraint. However, for the imported articles, the text will be saved in cleaned_text, so we should add that to the constraint too.

Hmmm, there's an additional issue there: text and cleaned_text cannot be added to the constraint, since the constraint implies an index, and the large text fields are too big for Postgres to index...

I had a look at the initial issue (#54) about duplicates, but it seems we're a bit short of options here. I don't know how vital this constraint is... Is it there to prevent the scraper from accidentally adding the same article twice? In that case, if the scraper always returns a URL, maybe we can make the field unique while still allowing NULL values for the old articles.

Any other suggestions?

If making the URL unique does not mean that NULL values are no longer allowed, then that suggestion is OK!

Indeed, the SQL standard treats NULL as meaning "unknown", so duplicate NULLs are still allowed under a unique constraint.
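
In Django terms, that could look roughly like this (a sketch under the assumptions above, not the final migration):

```python
# Sketch: a unique URL field that still allows NULL for old, imported
# articles. Under Postgres, multiple rows with url = NULL can coexist,
# while non-NULL URLs must be unique.
from django.db import models


class Article(models.Model):
    url = models.URLField(max_length=255, unique=True, null=True, blank=True)
```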

Script finally works!