dkpro/dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
Java · Apache-2.0
Stargazers
- Abromeit (KOCH ESSEN Kommunikation + Design GmbH)
- alexeygrigorev (@DataTalksClub)
- ammsa (Canada)
- andrekaa
- andrewwxy (Hong Kong)
- arjenpdevries (Radboud University)
- berezovskyi (127.0.0.1)
- bzz (@jetbrains, @apache)
- CraKeyBoy (@NewBanker)
- cyberlabe
- dav009 (Optimizely)
- Deseaus (Amazon)
- fbullini2
- Fuehnix (Other World Computing)
- hpzorn (inovex GmbH)
- jaypat87
- karanjeets (Apple Inc.)
- luto65
- mauroveron (Hyperfocal)
- maxxkia (Germany)
- mdbishop (New York, NY)
- mlinksva (☃)
- mollerhoj (deepdivr)
- monologg (@bhsn-ai)
- nyimbi (Datacraft)
- oroszgy (@ec-doris)
- ruslanrf (University of Oxford, Vienna UT)
- sachinjsk (Vancouver, Canada)
- songys (Sionic AI Inc.)
- tensortalk (You're on TensorTalk.com!)
- tfmorris (Boston, USA)
- tokestermw (Cresta)
- virgulvirgul
- wanghy6503
- wangyangtot (University)
- zsg1990ok