Issues
dedupe.documents.attribute_name does not work
#166 opened by mathCrazyy - 10
make_wikipedia.py fails on linux
#58 opened by peterbjorgensen - 2
Running paragraph level deduplication on c4
#150 opened by andrewhojel - 1
Need help for installing dolma
#158 opened by mihara-bot - 0
Duplicate ids in Dolma v1.7
#157 opened by Vedaad-Shakib - 1
dtype option is not working as expected
#152 opened by tokenizer-decode - 1
Inquiry about Web Pipeline Availability
#151 opened by codefly13 - 4
Simplify how rules in the mixer are provided
#50 opened by soldni - 2
S3 mixer doesn't start
#143 opened by marcopasqua - 5
deduplication examples does not work
#96 opened by TTTTao725 - 3
make_wikipedia in getting_started.md
#125 opened by leeparkuky - 1
Some race condition in url taggers
#138 opened by peterbjorgensen - 1
Possible bug in `local_shuffle`?
#139 opened by hwijeen - 1
A Question about the meaning of dolma_v1.6_cc_en
#134 opened by aleien95 - 3
make_wikipedia.py: long running time
#121 opened by chschroeder - 0
Support providing streams into mixer via CLI
#130 opened by soldni - 1
How does Exact paragraph deduplication performed?
#111 opened by silverriver - 1
not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.
#123 opened by peterbjorgensen - 1
Provenance license?
#108 opened by boxabirds - 3
Tokenizer name or path must be found error
#110 opened by RohitRathore1 - 1
Data sheet link in README is broken
#106 opened by simonw - 0
Only the attributes written by the last tagger in the tagger list gets written in version 1.0.0
#113 opened by peterbjorgensen - 0
Progress Bar may use more resources than necessary
#77 opened by soldni - 3
Latest version is not on PyPi
#78 opened by KennethEnevoldsen - 0
The Law School Admission Council | LSAC
#89 opened by hannahzacharski55 - 0
AllenAI
#87 opened by hannahzacharski55 - 0
Hells Angels infinite loop
#85 opened by hannahzacharski55 - 0
New Albany Business, Family Law and Criminal Defense Lawyer Aaron Johnson
#83 opened by hannahzacharski55 - 0
$open create sudo port forward import for https://www.mattoxandwilson.com>>all
#82 opened by hannahzacharski55 - 0
$open Allen Wolf (infinite_loop)
#80 opened by hannahzacharski55 - 0
911
#81 opened by hannahzacharski55 - 0
Git - gitk Documentation
#79 opened by hannahzacharski55 - 1
Terminal20141030.zip - Google Drive
#75 opened by hannahzacharski55 - 1
make_wikipedia.py hardcoded to simple
#57 opened by peterbjorgensen - 1
Jessie ©
#48 opened by hannahzacharski55 - 1
Titles
#47 opened by hannahzacharski55 - 1
Ruby
#46 opened by hannahzacharski55 - 1
Adam Burden, I love you!
#45 opened by hannahzacharski55 - 1
Adam Burden
#44 opened by hannahzacharski55