Add new Datasets

Question

AgaMiko opened this issue 3 years ago · 3 comments

Hello,

First of all: amazing repository! You are doing great work there. I have a recommendation of a huge keyword dataset to add:

The Directory of Open Access Journals -- 80 languages -- 7,005,466 article records https://doaj.org/. Most of the papers here have both abstracts and keywords.

Thank you,
Agnieszka

Answer 1 · 2022-01-11T19:18:32.000Z

Dear Agnieszka,
Indeed, it seems to be a good resource. We would certainly consider including it here if you can process the articles into the format used in this repository.
Best
Ricardo

Answer 2 · 2022-01-11T22:06:58.000Z

It would be perfect to have a small sample for each of those 80 languages :) Do you think that is feasible?


Answer 3 · 2022-01-12T08:03:05.000Z

Hello @rncampos and @arianpasquali,

I think it would be perfect if you simply listed it as a resource and provided a link to the database. I have downloaded the data dump and it is of pretty good quality. The format is nice and clean (JSON), although some fields are sometimes missing. One disadvantage is that the language label does not always apply to the title, abstract, and keywords together, as sometimes only part of the text is really in that language. We tested it at voicelab.ai and used a language detection model to check how many languages are really there. We found 48 unique languages that have records with the abstract, title, and keywords all in the same language. However, many of those had only one or a few records. The most common languages are pretty well resourced here, though.
Nevertheless, I think it is a good resource for keyword extraction. You could add samples, e.g. from some selected languages, and redirect users to the DOAJ website: https://doaj.org/docs/public-data-dump/
What do you think about that? :)
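
For illustration, the consistency check described above could look roughly like the sketch below. It is not the code we actually ran. The field names (`bibjson`, `title`, `abstract`, `keywords`) are my assumption about the layout of the dump records and should be checked against the actual files, and `detect` is a placeholder for whatever language detection model you prefer (any function that maps a string to a language code).

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple


def consistent_records(
    records: Iterable[dict],
    detect: Callable[[str], str],
) -> Iterator[Tuple[str, dict]]:
    """Yield (language, record) for records whose title, abstract and
    keywords are all detected as the same language."""
    for record in records:
        # Assumed layout: article metadata lives under a "bibjson" key.
        bib = record.get("bibjson", {})
        title = bib.get("title")
        abstract = bib.get("abstract")
        keywords = bib.get("keywords")  # assumed to be a list of strings

        # Skip records where any of the three fields is missing or empty.
        if not (title and abstract and keywords):
            continue

        # Detect the language of each field separately.
        langs = {detect(title), detect(abstract), detect(" ".join(keywords))}

        # Keep the record only if all three fields agree.
        if len(langs) == 1:
            yield langs.pop(), record


def language_counts(
    records: Iterable[dict],
    detect: Callable[[str], str],
) -> Counter:
    """Count consistent records per detected language."""
    return Counter(lang for lang, _ in consistent_records(records, detect))
```

Calling `language_counts(records, detect)` on the parsed dump gives the number of usable records per language, which makes it easy to see which languages have only one or a few records before picking samples.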