rom1504/cc2dataset
Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...
Python · MIT
Issues
support video platform
#27 opened by rom1504 - 1
consider expanding to WARC
#34 opened by rom1504 - 0
Extract robots metatags
#39 opened by sebastian-nagel - 0
Add diagram of the advised processing pipeline
#37 opened by rom1504 - 2
check structured CC extraction
#31 opened by rom1504 - 5
some numbers
#5 opened by rom1504 - 13
Partition the final merge + shuffle
#18 opened by rom1504 - 0
Add more text document type
#33 opened by rom1504 - 0
advise on stage 2
#35 opened by rom1504 - 0
Add thanks section
#30 opened by rom1504 - 0
support audio platform
#29 opened by rom1504 - 1
support text document_type
#25 opened by rom1504 - 1
support video
#26 opened by rom1504 - 3
Rename to cc2dataset?
#19 opened by rom1504 - 0
more references
#13 opened by rom1504 - 4
Implement restarting the spark app every part
#23 opened by rom1504 - 10
add date to output folder
#15 opened by rom1504 - 1
save input wat list at beginning
#12 opened by rom1504 - 0
faster write/read to s3 for dedup
#6 opened by rom1504 - 1
pad file names with 0s
#4 opened by rom1504 - 2
pandas udf and dedup
#2 opened by rom1504