Scrape sentences from Wikidata
stefangrotz opened this issue · 4 comments
Wikidata is completely under CC0, which makes it very attractive for the project. It contains both sentences and sometimes audio, but for this issue I want to focus on sentences.
This issue is a work in progress. I want to collect possible sources for sentences in Wikidata:
- P5831 usage example: an example sentence for a word, often with a language added in brackets.
- A "Description" in many languages exists for many Wikidata items, but it isn't always a complete sentence.
The next step would be to write a script to scrape these sentences.
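As a rough starting point, such a script could query the public Wikidata Query Service for P5831 values and flatten the JSON result into (language, sentence) pairs. This is only a sketch: the function names, the `limit` default, and the User-Agent string are made up for illustration, and it assumes usage examples are reachable as truthy `wdt:P5831` statements on lexemes.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_usage_example_query(language=None, limit=100):
    """Return a SPARQL query selecting P5831 (usage example) values.

    Assumption: examples are exposed as truthy statements (wdt:P5831).
    """
    lang_filter = f'  FILTER(LANG(?example) = "{language}")\n' if language else ""
    return (
        "SELECT ?lexeme ?example WHERE {\n"
        "  ?lexeme wdt:P5831 ?example .\n"
        f"{lang_filter}"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def parse_bindings(result):
    """Turn a SPARQL JSON result into a list of (language, sentence) pairs."""
    pairs = []
    for row in result["results"]["bindings"]:
        example = row["example"]
        # Monolingual text literals carry their language tag in "xml:lang".
        pairs.append((example.get("xml:lang", ""), example["value"].strip()))
    return pairs


def fetch_usage_examples(language=None, limit=100):
    """Run the query against the endpoint (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_usage_example_query(language, limit), "format": "json"}
    )
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "sentence-scrape-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_usage_examples("de", limit=20)` would then return up to 20 German pairs, which could be written one sentence per line to get them into a parseable state.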
Looks like these are indeed CC0. I don't think we need to ask legal for this. @nukeador do you agree?
Would love to see a selection of these sentences. Also, I assume you are aware of the scraper capabilities for other resources? As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything. More details in the last part of the README. Also happy to explain further if needed.
> As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything.
This was exactly what I was thinking. Right now the example sentences for a datatype called "lexemes" are relatively new. They have existed since 2018. But they are planning to move all Wiktionary data into Wikidata, so we will likely have more sentences in the future.
Wikidata is huge, I am sure that there are more data types that contain sentences.
> Would love to see a selection of these sentences.
I always wanted to learn Wikidata queries, and this is a nice little project to finally do it. I will post some examples tomorrow or so.
Note that only these 4 namespaces are CC0.
> All structured data from the main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License;
Do we have data on how many sentences we have for each language?
I've already suggested using P5831 earlier in the sentence-collector project (common-voice/sentence-collector#260), but, as per this query, there are currently only about 4000 sentences in P5831, some of which are probably repetitions. (After uncommenting the first line of the query, you should be able to filter sentences by language using the query helper, accessible by clicking the (i) on the left sidebar.)
All of those should be in the Lexeme namespace, so license-wise there should be no issue.
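To get per-language numbers that are not inflated by repetitions, one option is to deduplicate before counting. The sketch below assumes the sentences are already available as (language, sentence) pairs, however they were obtained. The function name is hypothetical, and treating sentences as duplicates when they differ only in whitespace is my own simplification.

```python
from collections import Counter


def count_unique_sentences(pairs):
    """Count distinct sentences per language from (language, sentence) pairs."""
    # Collapse runs of whitespace so trivially different copies are merged.
    unique = {(lang, " ".join(text.split())) for lang, text in pairs}
    return Counter(lang for lang, _ in unique)
```

For example, passing two German pairs that differ only by a double space plus one English pair would count one German and one English sentence.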