Helsinki-NLP/OpusFilter

OpusFilter fails to compress data when it is downloaded with moses preprocessing

thfrkielikone opened this issue · 3 comments

Running this:

steps:
  - type: opus_read
    parameters:
      corpus_name: OpenSubtitles
      source_language: fi
      target_language: en
      release: v2018
      preprocessing: moses
      src_output: opensubtitles.fi.gz
      tgt_output: opensubtitles.en.gz
      suppress_prompts: true

Results in files opensubtitles.fi.gz and opensubtitles.en.gz that are in fact plain text.
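A quick way to confirm this without opening the files is to check for the two-byte gzip magic number (0x1f 0x8b) at the start of each output. The following is only a minimal sketch: the helper name `is_gzip_file` is made up for illustration and is not part of OpusFilter or OpusTools, and the file names are the ones from the configuration above.

```python
from pathlib import Path

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip_file(path):
    """Return True if the file starts with the gzip magic number.

    Hypothetical helper for illustration; not an OpusFilter API.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Output names taken from the opus_read step above.
    for name in ("opensubtitles.fi.gz", "opensubtitles.en.gz"):
        if Path(name).exists():
            kind = "gzip" if is_gzip_file(name) else "not gzip (probably plain text)"
            print(f"{name}: {kind}")
        else:
            print(f"{name}: not found")
```

With the behaviour described here, both files would be expected to report as not gzip.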

There also seem to be other issues with the integration with the latest OpusTools when using moses preprocessing; for example, setting output_directory makes the process fail entirely. I'll look into this, but I think the problems are on the OpusTools side (ping @miau1).

I suggest using the raw or xml options for preprocessing until we get this fixed.

Fixed in 3.2.0. It is now recommended to download corpora using the moses preprocessing.