Option to keep blank lines
jbrry opened this issue · 3 comments
Hi there,
I am using OpusFilter on the nlingual-rebase branch to train a monolingual BERT model. In some of my corpora, there are empty lines which denote a document boundary, e.g. an empty line between two Wikipedia articles.
In the BERT README they mention:
"The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines."
So I want to keep these empty lines where possible, so that BERT knows where a document ends for its next-sentence-prediction task (where a randomly sampled document is used as a negative example for its NSP task).
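To illustrate the format described in the BERT README, here is a minimal sketch of how empty lines separate documents. The function name `split_documents` and the sample sentences are made up for illustration; this is not code from BERT or OpusFilter.

```python
from typing import Iterable, Iterator, List


def split_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentences into documents, treating empty lines as boundaries."""
    document: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            document.append(sentence)
        elif document:
            # An empty line closes the current document.
            yield document
            document = []
    if document:
        # The last document may not be followed by an empty line.
        yield document


corpus = [
    "First article, sentence one.",
    "First article, sentence two.",
    "",
    "Second article, sentence one.",
]
documents = list(split_documents(corpus))
```

If the empty line were filtered out, both articles would be read as one document with three sentences.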
I'm just wondering whether it would be possible to add a feature to OpusFilter where the user can specify to keep empty lines. In my current configuration below, empty lines are removed from the example.txt file.
common:
  output_directory: tests/data

steps:
  - type: filter
    parameters:
      inputs: [example.txt]
      outputs: [example-filtered.txt]
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LongWordFilter:
            threshold: 40
        - HtmlTagFilter: {}
        - CharacterScoreFilter:
            scripts: [Latin]
            thresholds: [0.5]
        - LanguageIDFilter:
            name: langid
            id_method: langid
            languages: [ga]
            thresholds: [0.5]
        - LanguageIDFilter:
            name: cld2
            id_method: cld2
            languages: [ga]
            thresholds: [0.5]
Thanks for the suggestion! This does sound like a useful feature.
I considered what would be the best way to implement it. A global pass_empty setting for the filter command sounds best to me, but unfortunately it is difficult to do with the current implementation based on iterators and generators.
The second option is to fix all individual filters, either to pass blank lines through or to add an option for that. I went through the simple filters implemented in the filters module and noticed that LengthRatioFilter and LanguageIDFilter actually didn't return very sensible results on blank lines. I fixed those and also added a pass_empty option for LengthFilter. With these changes, I think your example should work.
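As a rough illustration of the per-filter approach, the sketch below shows a word-count filter with a pass_empty switch. The class `WordLengthFilter` and its `accept`/`filter` methods are hypothetical stand-ins written for this example; they are not the actual OpusFilter LengthFilter implementation, and only the option name pass_empty comes from the change described above.

```python
from typing import Iterable, Iterator


class WordLengthFilter:
    """Hypothetical stand-in for a word-count filter; not the OpusFilter class."""

    def __init__(self, min_length: int = 1, max_length: int = 100,
                 pass_empty: bool = False):
        self.min_length = min_length
        self.max_length = max_length
        self.pass_empty = pass_empty

    def accept(self, line: str) -> bool:
        if self.pass_empty and not line.strip():
            # Keep blank lines (e.g. document boundaries) regardless of the limits.
            return True
        n_words = len(line.split())
        return self.min_length <= n_words <= self.max_length

    def filter(self, lines: Iterable[str]) -> Iterator[str]:
        # Lazily yield only accepted lines, as a generator-based pipeline would.
        return (line for line in lines if self.accept(line))


lines = ["Dia duit", "", "Slán"]
dropped = list(WordLengthFilter(min_length=1).filter(lines))
kept = list(WordLengthFilter(min_length=1, pass_empty=True).filter(lines))
```

With min_length set to 1, the blank line is removed unless pass_empty is enabled, which matches the behavior reported in the original configuration.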
However, this is not completely solved yet. I haven't yet looked at, e.g., the behavior of the language model and alignment filters on empty data.
Hi Sami,
Thank you for your prompt response and changes. You're right, the blank lines are now kept with my example file/config, which is very helpful, thanks!
No worries that the language model and alignment filters do not support this yet. The above changes should be OK for my needs, so there's no rush with this from my end, but I will leave it to you to decide whether to keep the issue open until they are changed.
I added a score_for_empty option for CrossEntropyFilter, CrossEntropyDifferenceFilter, and WordAlignFilter. When it is set to a value lower/higher than the thresholds, empty lines are always passed/rejected. For WordAlignFilter, I noticed that eflomal actually fails for empty input, so I needed to set a default value.
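The idea behind score_for_empty can be sketched as follows. Everything here is an illustrative assumption: `toy_score`, `make_scorer`, and `accept` are hypothetical helpers, the toy scorer merely imitates a backend that fails on empty input, and the sketch assumes that a line is accepted when its score is below the threshold. Only the option name score_for_empty comes from the change described above.

```python
from typing import Callable, Optional


def toy_score(line: str) -> float:
    """Toy scorer (average word length) that fails on empty input."""
    words = line.split()
    return sum(len(word) for word in words) / len(words)


def make_scorer(score_fn: Callable[[str], float],
                score_for_empty: Optional[float] = None) -> Callable[[str], float]:
    """Wrap a scorer so that empty lines get a fixed score instead of being scored."""
    def score(line: str) -> float:
        if score_for_empty is not None and not line.strip():
            # Bypass the underlying scorer, which cannot handle empty input.
            return score_for_empty
        return score_fn(line)
    return score


def accept(score: float, threshold: float) -> bool:
    # Assumption for this sketch: lower scores are better.
    return score < threshold


lines = ["Dia duit", ""]
threshold = 10.0
keep_empty = make_scorer(toy_score, score_for_empty=0.0)      # below the threshold
reject_empty = make_scorer(toy_score, score_for_empty=100.0)  # above the threshold
kept_flags = [accept(keep_empty(line), threshold) for line in lines]
rejected_flags = [accept(reject_empty(line), threshold) for line in lines]
```

Choosing the fixed score on one side of the threshold or the other decides whether empty lines survive, without the underlying scorer ever seeing them.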
As far as I can see, keeping blank lines should now be possible for all the implemented filters.