spyysalo/wiki-bert-pipeline

why filtering?

Closed this issue · 2 comments

Quite a quality piece of code. 👍

Please help me understand the purpose and details of the filtering.

  1. Why do you use filtering at all?
  2. Do you risk degrading the quality of the pretrained model by disabling filtering? Do you have any numbers showing how much?
  3. Does filtering work per document? I.e., if some measure is not OK, is the whole document thrown out?
  4. How did you arrive at the specific numbers you use in the config?
  5. What is --avg-len? I do not quite get the idea.
  6. In our case, using your filtering script with your config, we lose quite a large part of our corpus, so we plan to adjust some config settings. In your experience, which config values should remain unchanged for theoretical reasons, and which can be freely adjusted?
  7. Is this whole filtering idea yours, or is it taken from someone else's publication? In either case, please point me to the relevant literature.

Thanks.
A participant of FinTAL2006. :)

Please provide some info on the above, if possible.

Thanks, and apologies for the late response! (The paper was just presented.) Filtering is probably not strictly required for clean data (such as most Wikipedias), but it allows the pipeline to be adapted to noisier input (such as web data). We only trained one model per language, so unfortunately we don't have experimental numbers on how much filtering helps. Most works pretraining on noisy data include some form of filtering; these specific filters are ours, based on common ideas in previous work.

Please do adjust the filters according to what works for you, or disable them entirely: these settings are just what worked for our data. Finally, --avg-len drops documents with a low average "sentence" length, which for Wikipedia should mostly filter out list pages and similar.
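For intuition, here is a minimal sketch of what a per-document average-sentence-length filter like --avg-len does. This is a hypothetical illustration, not the pipeline's actual code: the function names and the threshold value are assumptions.

```python
# Hypothetical sketch of a per-document average-sentence-length filter.
# Not the actual wiki-bert-pipeline implementation; names and the
# min_avg_len threshold are illustrative assumptions.

def avg_sentence_length(document):
    """Average number of tokens per sentence in a document.

    `document` is a list of sentences, each a list of tokens.
    """
    if not document:
        return 0.0
    return sum(len(sentence) for sentence in document) / len(document)

def keep_document(document, min_avg_len=5.0):
    """Keep a document only if its average sentence length is high enough.

    The check is per document: if the measure fails, the whole document
    is dropped. In Wikipedia this mostly catches list pages, whose short
    one- or two-token "sentences" pull the average down.
    """
    return avg_sentence_length(document) >= min_avg_len

# A list-like page with one-token "sentences" is filtered out,
# while a page with ordinary prose is kept.
list_page = [["Helsinki"], ["Espoo"], ["Tampere"]]
prose_page = [["BERT", "is", "pretrained", "on", "large", "corpora", "."]]
assert not keep_document(list_page)
assert keep_document(prose_page)
```

The same per-document pattern applies to the other filters as well: each measure is computed over the whole document, and a failing document is removed in its entirety rather than trimmed.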