artetxem/vecmap

Unicode error at line #31 in embeddings.py

sawan16 opened this issue · 3 comments

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

This obviously looks like an encoding problem, but I would need more details to know where it happens. Please report the full stack trace.

The 'utf-8' codec sometimes fails when encoding or decoding certain bytes or characters. In those cases, you can either ignore such errors by passing errors='ignore' along with the encoding, or try another encoding that matches the file, for example latin-1 (also known as ISO-8859-1). Hope this helps.
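As a rough illustration of the two options above, here is a minimal sketch. It assumes the offending byte is 0xF6 (which would explain '\udcf6' if the data was decoded with errors='surrogateescape'); the sample bytes are made up and not taken from the reporter's file.

```python
# Byte 0xF6 is "ö" in Latin-1 but is not valid UTF-8 on its own.
raw = b"\xf6l"

# With errors="surrogateescape", the undecodable byte becomes the lone
# surrogate U+DCF6 instead of raising an error at decode time.
smuggled = raw.decode("utf-8", errors="surrogateescape")

# Encoding that string back with the default strict handler then fails.
try:
    smuggled.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Option 1: silently drop bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# Option 2: decode with the encoding the data was (presumably) written in.
latin1 = raw.decode("latin-1")

print(repr(smuggled), error_message, repr(ignored), repr(latin1), sep="\n")
```

Note that errors='ignore' discards data, so choosing the right encoding is usually preferable when the file's actual encoding is known.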

The input embedding model is not in the correct format. Use model.save_word2vec_format(filename) to save the fastText or word2vec model.
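For reference, a minimal sketch of what that output looks like, assuming the expected input is the word2vec text format (a header line with the word count and dimension, then one word and its values per line). The helper write_word2vec_text and the toy vectors are hypothetical and are not part of vecmap or any library.

```python
import io


def write_word2vec_text(vectors, file):
    """Write a {word: [floats]} mapping in word2vec text format (hypothetical helper)."""
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    # Header line: number of words, then vector dimension.
    file.write(f"{len(words)} {dim}\n")
    for word in words:
        values = " ".join(f"{x:.6f}" for x in vectors[word])
        file.write(f"{word} {values}\n")


# Toy data, written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_word2vec_text({"hello": [0.1, 0.2], "world": [0.3, 0.4]}, buffer)
text = buffer.getvalue()
print(text)
```

When writing to disk, opening the file with an explicit encoding="utf-8" should help avoid the kind of mismatch reported above.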