huggingface/OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

Python · Apache-2.0

Issues

LDA
#11 opened 5 months ago by jrryzh
0
How to use LDA for topic modeling
#12 opened 6 months ago by jrryzh
1
nsfw filtered texts only file missing at step 08_01
#10 opened 6 months ago by shaharukhkhan4350
2
Is the tot_counter saved twice in this code snippet?
#9 opened 7 months ago by haiqiang2017
4
Releasing trained topic models?
#8 opened 7 months ago by vishaal27
1
Missing TextMediaPairsExtractor from the repo
#7 opened 7 months ago by kckishan
1
Search engine over the training data
#5 opened a year ago by aleSuglia
1
common_words.json download issue
#6 opened 9 months ago by jrryzh
11
Training Details
#1 opened a year ago by vateye
1
Metadata process
#4 opened a year ago by ellenxtan
4
Which folder to use?
#2 opened a year ago by mckinziebrandon
2
When will the trained model be released?
#3 opened a year ago by chenxshuo
3