bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Python · Apache-2.0
Issues
feat: resume training from a checkpoint
#37 opened by SaulLu - 1
feat: add title
#148 opened by tianjianjiang - 0
feat: clean up website desc.
#151 opened by tianjianjiang - 0
feat: refine dependency imports for preprocessing
#71 opened by SaulLu - 0
feat: add paragraph-entity metadata
#149 opened by tianjianjiang - 2
wip: feat: add `exit-duration-in-mins` arguments
#40 opened by SaulLu - 0
fix: clean up OpenWebText URL fragments, duplicates, variants, and malformed ones
#61 opened by tianjianjiang - 1
feat: add URL tokenizer(s) for data source, time stamp, and website description
#54 opened by tianjianjiang - 1
feat: support HF dataset loader and streaming mode for bs-modeling-metadata/openwebtext-html-cc
#62 opened by tianjianjiang - 2
question: Error raised by entities extractor
#79 opened by SaulLu - 1
Improve: Entity extraction speed
#78 opened by SaulLu - 1
Which HTML tags should be used during training?
#110 opened by norakassner - 0
Create Dataset with metadata
#124 opened by SaulLu - 1
entity tagging speedup
#95 opened by norakassner - 0
feat: HTML scanner for text content & content sectioning elements → segment paragraphs
#125 opened by tianjianjiang - 0
Common simple eval function to calculate ppl
#138 opened by shanyas10 - 1
Evaluation bias
#108 opened by norakassner - 1
multi gpu training
#129 opened by norakassner - 1
How do we define a paragraph?
#114 opened by norakassner - 0
slurm script test multi-gpu training
#109 opened by norakassner - 0
feat: find a solution to load the dataset
#122 opened by SaulLu - 0
manage PR merge strategies
#112 opened by norakassner - 1
Remove the vendor folder
#105 opened by SaulLu - 0
Which HTML tags should be used during training?
#113 opened by norakassner - 0
simple zero-shot eval function: datasource
#87 opened by norakassner - 0
Discuss style evaluation for website description and data source with Anna
#100 opened by norakassner - 0
Error `IndexError: list index out of range` while testing the entities extraction
#101 opened by SaulLu - 0
Start joint training
#97 opened by norakassner - 0
eval hyperparameters: amount of metadata
#93 opened by norakassner - 0
eval hyperparameters: occupied tokens
#96 opened by norakassner - 0
simple zero-shot eval function: entity tags
#86 opened by norakassner - 0
estimate amount of data
#94 opened by norakassner - 0
simple zero-shot eval function: HTML tags
#85 opened by norakassner - 0
simple zero-shot eval function: time stamps
#89 opened by norakassner - 0
method to sample local metadata
#91 opened by norakassner - 0
method to sample global metadata
#92 opened by norakassner - 0
explore hyperparameters:
#90 opened by norakassner - 2
Conflict in torch version
#66 opened by manandey - 0
Include entity description
#68 opened by manandey - 4
Rename dateutil submodule with another name
#63 opened by cccntu - 1