TypeError when processing ParaCrawl
lefterav opened this issue · 1 comment
lefterav commented
Processing dies with a TypeError, probably related to the HTMLParser used by the HtmlTagFilter.
The log:
Could not load varikn, language model filtering not supported
Please set enviroment variable EFLOMAL_PATH to use word alignment scores
INFO:opusfilter.opusfilter:Running step 1: {'type': 'opus_read', 'parameters': {'corpus_name': 'ParaCrawl', 'source_language': 'de', 'target_language': 'en', 'release': 'v5', 'preprocessing': 'raw', 'src_output': 'paracrawl.de.gz', 'tgt_output': 'paracrawl.en.gz'}}
No alignment file "/projappl/nlpl/data/OPUS/ParaCrawl/v5/xml/de-en.xml.gz" or "data/parallel/ParaCrawl_v5_xml_de-en.xml.gz" found
The following files are available for downloading:
3 GB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/raw/de.zip
13 GB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/raw/en.zip
469 MB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/xml/de-en.xml.gz
16 GB Total size
Downloading 3 file(s) with the total size of 16 GB. Continue? (y/n) y
data/parallel/ParaCrawl_v5_raw_de.zip ... 100% of 3 GB
data/parallel/ParaCrawl_v5_raw_en.zip ... 100% of 13 GB
data/parallel/ParaCrawl_v5_xml_de-en.xml.gz ... 100% of 469 MB
INFO:opusfilter.opusfilter:Running step 2: {'type': 'remove_duplicates', 'parameters': {'inputs': ['paracrawl.de.gz', 'paracrawl.en.gz'], 'outputs': ['paracrawl.dedup.de', 'paracrawl.dedup.en']}}
36936714it [08:24, 73153.97it/s]
INFO:opusfilter.opusfilter:Removed 17513 / 36936714 = 0.05% duplicate lines (duplicate types: 17144)
INFO:opusfilter.opusfilter:Running step 3: {'type': 'filter', 'parameters': {'src_input': 'paracrawl.dedup.de', 'tgt_input': 'paracrawl.dedup.en', 'src_output': 'paracrawl.filtered.de', 'tgt_output': 'paracrawl.filtered.en', 'filters': [{'LengthFilter': {'unit': 'word', 'min_length': 1, 'max_length': 100}}, {'LengthRatioFilter': {'unit': 'word', 'threshold': 3}}, {'LongWordFilter': {'threshold': 40}}, {'HtmlTagFilter': {}}, {'CharacterScoreFilter': {'src_script': 'Latin', 'tgt_script': 'Latin', 'src_threshold': 1, 'tgt_threshold': 1}}, {'TerminalPunctuationFilter': {}}, {'NonZeroNumeralsFilter': {}}]}}
28933972it [3:16:53, 2670.52it/s]/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/builder/_htmlparser.py:102: UserWarning: expected name token at '<![ INCLUDE [ Dieser'
warnings.warn(msg)
Traceback (most recent call last):
File "/local/stripe/elav01/learningcurve/miniconda3/bin/opusfilter", line 27, in <module>
of.execute_steps(overwrite=args.overwrite, last=args.last)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/opusfilter.py", line 109, in execute_steps
self.step_functions[step['type']](step['parameters'], overwrite=overwrite)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/opusfilter.py", line 208, in filter_data
for idx, pair in enumerate(pairs):
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
for sent1, sent2 in pairs:
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
for sent1, sent2 in pairs:
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
for sent1, sent2 in pairs:
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 53, in filter
if self.accept(next(self.score([(sent1, sent2)]))):
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/filters.py", line 102, in score
src_tags = bool(bs(sent1, 'html.parser').find())
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/__init__.py", line 348, in __init__
self._feed()
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/__init__.py", line 434, in _feed
self.builder.feed(self.markup)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/builder/_htmlparser.py", line 377, in feed
parser.feed(markup)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 110, in feed
self.goahead(0)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 178, in goahead
k = self.parse_html_declaration(i)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 263, in parse_html_declaration
return self.parse_marked_section(i)
File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
TypeError: cannot unpack non-iterable NoneType object
28934137it [3:16:53, 2449.19it/s]
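The crash seems reproducible outside OpusFilter with just the fragment quoted in the UserWarning above. A minimal sketch, assuming the same bs4 build and the Python 3.9 html.parser from the traceback:

```python
from bs4 import BeautifulSoup as bs

# Fragment taken from the UserWarning in the log above. html.parser treats
# "<![" as the start of a marked section; _scan_name() does not find a name
# token, returns None, and parse_marked_section() then fails to unpack it.
bs('<![ INCLUDE [ Dieser', 'html.parser').find()
# -> TypeError: cannot unpack non-iterable NoneType object
```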
And this is the configuration file:
common:
  output_directory: data/parallel/

steps:
  - type: opus_read
    parameters:
      corpus_name: ParaCrawl
      source_language: de
      target_language: en
      release: v5
      preprocessing: raw
      src_output: paracrawl.de.gz
      tgt_output: paracrawl.en.gz

  - type: remove_duplicates
    parameters:
      inputs:
        - paracrawl.de.gz
        - paracrawl.en.gz
      outputs:
        - paracrawl.dedup.de
        - paracrawl.dedup.en

  - type: filter
    parameters:
      src_input: paracrawl.dedup.de
      tgt_input: paracrawl.dedup.en
      src_output: paracrawl.filtered.de
      tgt_output: paracrawl.filtered.en
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LengthRatioFilter:
            unit: word
            threshold: 3
        - LongWordFilter:
            threshold: 40
        - HtmlTagFilter: {}
        - CharacterScoreFilter:
            src_script: Latin
            tgt_script: Latin
            src_threshold: 1
            tgt_threshold: 1
        - TerminalPunctuationFilter: {}
        - NonZeroNumeralsFilter: {}
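As a temporary workaround I am considering a guard around the HtmlTagFilter so that segments which crash the parser are rejected instead of killing the whole run. This is only a sketch based on the score() call visible in the traceback (filters.py line 102); the exact score format that accept() expects may differ between OpusFilter versions, so treat the class below as hypothetical:

```python
from bs4 import BeautifulSoup as bs
from opusfilter.filters import HtmlTagFilter

class SafeHtmlTagFilter(HtmlTagFilter):
    """Hypothetical HtmlTagFilter variant used as a workaround.

    If html.parser raises on a malformed marked section (as in the
    traceback above), treat the segment as if it contained HTML tags
    so the pair is filtered out rather than aborting step 3.
    """

    def score(self, pairs):
        for sent1, sent2 in pairs:
            scores = []
            for sent in (sent1, sent2):
                try:
                    scores.append(bool(bs(sent, 'html.parser').find()))
                except TypeError:
                    # Parser bug hit: assume the segment contains markup.
                    scores.append(True)
            yield scores
```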