GRAAL-Research/deepparse

Cannot config language for fasttext embedding model

lhlong opened this issue · 5 comments


import os


def download_fasttext_embeddings(saving_dir: str, verbose: bool = True) -> str:
    """
    Simpler version of the download_model function from fastText to download the pre-trained
    common-crawl vectors from fastText's website https://fasttext.cc/docs/en/crawl-vectors.html
    and save them in the saving directory (saving_dir).
    """
    os.makedirs(saving_dir, exist_ok=True)

    file_name = "cc.fr.300.bin"
    gz_file_name = f"{file_name}.gz"

    file_name_path = os.path.join(saving_dir, file_name)
    if os.path.isfile(file_name_path):
        return file_name_path  # return the full path to the fastText embeddings
....

Could you help update this method so it can download the fastText model for other languages? It would be better if the language could be configured here.
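For illustration, here is a minimal sketch of what such a change could look like, assuming fastText's published cc.&lt;lang&gt;.300.bin.gz naming scheme and a plain urllib download instead of deepparse's internal download helpers; the language parameter and the decompression step are hypothetical, not the library's current API:

import gzip
import os
import shutil
import urllib.request


def download_fasttext_embeddings(saving_dir: str, language: str = "fr", verbose: bool = True) -> str:
    """
    Hypothetical variant that takes a language code (e.g. "fr", "de", "vi") and downloads the
    matching pre-trained common-crawl vectors listed at https://fasttext.cc/docs/en/crawl-vectors.html.
    """
    os.makedirs(saving_dir, exist_ok=True)

    file_name = f"cc.{language}.300.bin"
    gz_file_name = f"{file_name}.gz"

    file_name_path = os.path.join(saving_dir, file_name)
    if os.path.isfile(file_name_path):
        return file_name_path  # the embeddings are already downloaded

    gz_file_path = os.path.join(saving_dir, gz_file_name)
    url = f"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/{gz_file_name}"
    if verbose:
        print(f"Downloading {url}")
    urllib.request.urlretrieve(url, gz_file_path)

    # Decompress the .gz archive into the final .bin file, then remove the archive.
    with gzip.open(gz_file_path, "rb") as gz_file, open(file_name_path, "wb") as bin_file:
        shutil.copyfileobj(gz_file, bin_file)
    os.remove(gz_file_path)

    return file_name_path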

Hi @lhlong,

As per our article, French subword embeddings give state-of-the-art (SOTA), or near-SOTA, results in other languages. Therefore, we did not offer the download of different embeddings, since the current ones already give good results. Moreover, with the retraining feature, one can fine-tune on a specific language without changing the embeddings layer.
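For reference, a rough sketch of that retraining path, assuming the AddressParser.retrain API; the dataset path is a placeholder and the exact argument names may differ between deepparse versions:

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# Hypothetical dataset of tagged single-language addresses (path is a placeholder).
dataset_container = PickleDatasetContainer("./my_language_addresses.p")

# Load the pretrained model that relies on the French fastText subword embeddings ...
address_parser = AddressParser(model_type="fasttext", device=0)

# ... and fine-tune it on the target-language data without touching the embeddings layer.
address_parser.retrain(dataset_container, train_ratio=0.8, epochs=5, batch_size=32)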

Finally, do you have reasons to think that changing the embeddings could yield better results, or other reasons to add this feature?

@davebulaval Thanks for your response.
Your default embedding model is suitable for multinational address parsing. However, if we only want to retrain for one specific language, why not use the word embedding model that corresponds to that language?
I haven't fine-tuned with another embedding model yet, and I want to do that. So, in my opinion, this feature would make that step easier.

@lhlong, I understand your point. However, BPEmb uses multilingual byte-pair embeddings, whereas fastText does not. Thus, passing a language argument would be confusing from the API perspective.
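To make the asymmetry concrete, here is a small sketch using the underlying bpemb and fasttext packages directly, outside of deepparse's own wrappers; the model sizes and file name are illustrative:

from bpemb import BPEmb
import fasttext

# A single multilingual BPEmb model covers many languages at once,
# so no per-language choice is needed on the deepparse side.
multilingual_bpemb = BPEmb(lang="multi", vs=100000, dim=300)

# fastText common-crawl vectors are one model per language, so a specific
# language (here the French model used by deepparse) must be picked explicitly.
french_fasttext = fasttext.load_model("cc.fr.300.bin")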

Also, our work was intended to reduce the need to train a model for each language out there, and offering such a feature would stray from that initial objective. Moreover, since our pretrained model uses French embeddings, one would hypothetically always need to retrain the model to use new language embeddings.

I will follow up with our final thoughts (the coauthor's and mine) on the matter and on how this feature could be reconciled with our initial idea.

As a final thought, we think it would mean too much refactoring of our codebase. So instead, we recommend fine-tuning one of the models on single-language addresses.

Thanks @davebulaval, I will try to do that.