LIAAD/KeywordExtractor-Datasets

Add new Datasets

AgaMiko opened this issue · 3 comments

Hello,

First of all: amazing repository! You are doing great work there. I would like to recommend a huge keyword dataset to add:

  • The Directory of Open Access Journals -- 80 languages -- 7,005,466 article records https://doaj.org/. Most of the papers here have both abstracts and keywords.

Thank you,
Agnieszka

Dear Agnieszka,
Indeed, it seems to be a good resource. We would certainly consider including it here if you can process the articles into the format used in this repository.
Best
Ricardo

Hello @rncampos and @arianpasquali,

I think it would be perfect if you simply listed it as a resource and provided a link to the database. I have downloaded the data dump and it is of pretty good quality. The format is nice and clean (JSON), although some fields are sometimes missing. One disadvantage is that the language label does not always apply to the title, abstract, and keywords alike, as sometimes only part of the text is really in that language. We have tested it at voicelab.ai, using a language detection model to check how many languages are really there. We found 48 unique languages for which the abstract, title, and keywords are all in the same language. However, many of those had only one or a few records. The most common languages are pretty well resourced here, though.
Nevertheless, I think it is a good resource for keyword extraction. You could add samples, e.g. from some selected languages, and redirect users to the DOAJ website: https://doaj.org/docs/public-data-dump/
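The same-language check described above could be sketched roughly as follows. This is only an illustration under assumptions: the `bibjson`, `title`, `abstract`, and `keywords` field names are assumed for the dump's article records, and `same_language_records` and the `detect` callable are hypothetical names, not part of DOAJ or this repository. The language detection model used in the check above is not specified, so any function that maps a string to a language code (or `None`) can be plugged in.

```python
from typing import Callable, Iterable, Iterator, Optional


def same_language_records(
    records: Iterable[dict],
    detect: Callable[[str], Optional[str]],
) -> Iterator[dict]:
    """Yield records whose title, abstract, and keywords share one detected language.

    `records` are parsed JSON objects; the "bibjson" layout is an assumption.
    `detect` is any language-identification function returning a code or None.
    """
    for record in records:
        bib = record.get("bibjson", {})  # assumed field name
        title = bib.get("title")
        abstract = bib.get("abstract")
        keywords = bib.get("keywords") or []

        # Some fields are missing in the dump; skip incomplete records.
        if not (title and abstract and keywords):
            continue

        # Detect the language of each part independently instead of
        # trusting the record's own language label.
        langs = {detect(title), detect(abstract), detect(" ".join(keywords))}

        # Keep the record only if all three parts agree on a known language.
        if len(langs) == 1 and None not in langs:
            yield {
                "language": langs.pop(),
                "title": title,
                "abstract": abstract,
                "keywords": keywords,
            }
```

Counting the `language` values of the yielded records (for example with `collections.Counter`) would then show how many languages are really covered and how many records each one has.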
What do you think about that? :)