rom1504/cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...

Python · MIT

Issues

support video platform
#27 opened 2 years ago by rom1504
10
Investigate implementation of url / metadata predictors
#43 opened a year ago by rom1504
1
investigate if computing count instead of drop duplicates would be fast
#47 opened a year ago by rom1504
0
adapt number of output files based on document type
#46 opened a year ago by rom1504
1
consider expanding to WARC
#34 opened 2 years ago by rom1504
2
Extract robots metatags
#39 opened 2 years ago by sebastian-nagel
0
Add diagram of the advised processing pipeline
#37 opened 2 years ago by rom1504
0
fix test
#32 opened 2 years ago by rom1504
2
check structured CC extraction
#31 opened 2 years ago by rom1504
2
some numbers
#5 opened 2 years ago by rom1504
5
Partition the final merge + shuffle
#18 opened 2 years ago by rom1504
13
Add more text document type
#33 opened 2 years ago by rom1504
0
advise on stage 2
#35 opened 2 years ago by rom1504
0
Add thanks section
#30 opened 2 years ago by rom1504
0
support audio platform
#29 opened 2 years ago by rom1504
0
support text document_type
#25 opened 2 years ago by rom1504
1
support video
#26 opened 2 years ago by rom1504
1
Rename to cc2dataset?
#19 opened 2 years ago by rom1504
3
get rid of useless spark warnings / improve speed logging
#28 opened 2 years ago by rom1504
0
more references
#13 opened 2 years ago by rom1504
0
add some options to make it possible to get other stuff than images
#22 opened 2 years ago by rom1504
4
Implement restarting the spark app every part
#23 opened 2 years ago by rom1504
3
Investigate using parquet bloom filter to reduce size on disk
#14 opened 2 years ago by rom1504
10
Consider optionally moving dedup and shuffle to a second step
#20 opened 2 years ago by rom1504
2
consider making dedup optional if local disk limited but remote is not
#17 opened 2 years ago by rom1504
1
would doing some parallelism when retrieving shards make things faster?
#21 opened 2 years ago by rom1504
3
shuffle
#16 opened 2 years ago by rom1504
2
add date to output folder
#15 opened 2 years ago by rom1504
1
save input wat list at beginning
#12 opened 2 years ago by rom1504
1
implement multi steps processing to limit disk space need
#10 opened 2 years ago by rom1504
0
faster write/read to s3 for dedup
#6 opened 2 years ago by rom1504
12
pad file names with 0
#4 opened 2 years ago by rom1504
1
pandas udf and dedup
#2 opened 2 years ago by rom1504
2