Create Dataset with metadata
SaulLu opened this issue · 0 comments
Steps:
- pseudo-crawl ~10% of the C4 web pages from Common Crawl @tianjianjiang
- import the pseudo-crawled dataset on JZ @SaulLu
- run 1st step of extraction:
- run 2nd step of extraction:
  - Extract website descriptions @shanyas10 @SaulLu
- run 3rd step of extraction:
- run 4th step of extraction:
- (optional) clean final dataset:
  - Remove empty lines @SaulLu
  - Remove "errors" columns @SaulLu
- (optional) Gather all metadata into the same column @cccntu @timoschick @SaulLu
- push dataset to Hub @SaulLu
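
A minimal sketch of what the optional cleaning steps could look like, written in plain Python over a list of row dicts so it runs without any dependency. Everything named here is an assumption made for illustration: the `text` column, the `html_error` column, the `metadata` target column, and the helper names are hypothetical, and "empty lines" is read as "rows whose text is empty". With the `datasets` library the same steps would likely map onto `Dataset.filter`, `Dataset.remove_columns`, `Dataset.map`, and `Dataset.push_to_hub`.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def drop_empty_rows(rows: List[Row], text_key: str = "text") -> List[Row]:
    """Keep only rows whose text field is non-empty after stripping whitespace."""
    return [r for r in rows if str(r.get(text_key) or "").strip()]


def drop_error_columns(rows: List[Row], marker: str = "error") -> List[Row]:
    """Remove every column whose name contains `marker` (assumed naming scheme)."""
    return [{k: v for k, v in r.items() if marker not in k.lower()} for r in rows]


def gather_metadata(
    rows: List[Row],
    keep: Sequence[str] = ("text",),
    target: str = "metadata",
) -> List[Row]:
    """Move all columns not listed in `keep` into a single `target` dict column."""
    gathered = []
    for r in rows:
        new_row: Row = {k: r[k] for k in keep if k in r}
        new_row[target] = {k: v for k, v in r.items() if k not in keep}
        gathered.append(new_row)
    return gathered


def clean(rows: List[Row]) -> List[Row]:
    """Apply the three cleaning steps in the order listed in the checklist."""
    rows = drop_empty_rows(rows)
    rows = drop_error_columns(rows)
    return gather_metadata(rows)


if __name__ == "__main__":
    # Hypothetical sample rows; real column names depend on the extraction steps.
    sample = [
        {"text": "Hello", "url": "https://example.com", "title": "Hi", "html_error": None},
        {"text": "   ", "url": "https://example.org", "title": "", "html_error": "timeout"},
    ]
    print(clean(sample))
```

Dropping the error columns before gathering keeps them out of the merged metadata column; if the errors are still needed for debugging, the order of the last two steps could be swapped.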