Non ascii chars in train file

Question

Opened this issue 10 years ago · 8 comments

Hi,
I was training redshift on input containing some non-ASCII characters and ran into errors.
I got past the errors by replacing those characters, but my goal is to train it on Persian data, where it will certainly hit the same errors.
I have heard about solutions such as transliteration, but I know nothing about them.
I want to know whether that is the best solution, or whether you would suggest something better.
Thanks

Answer 1 · 2014-08-25T15:34:15.000Z

Hi,

I haven't tried this just yet, but first: are you sure you've encoded the text into bytes correctly before you pass it to the parser?

(Skip this if you know it, but: unicode is "serialized" into a byte-stream by the method unicode.encode(), e.g. byte_seq = u'Hello'.encode('utf8'). You can then deserialize the bytes back to a unicode object with unicode_string = byte_seq.decode('utf8'). It's easy to introduce bugs in Python 2 by passing in a unicode object where the function actually expects a byte stream.)
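As a minimal sketch of that round trip (not code from redshift; the sample word and the helper name ascii_decode_fails are just illustrative assumptions), the following runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Illustrative only: shows encode/decode on a short Persian string.

# Persian "salam" (hello), written with escapes so the source file stays ASCII.
text = u'\u0633\u0644\u0627\u0645'

# Serialize the unicode text into a UTF-8 byte sequence.
byte_seq = text.encode('utf8')

# Deserialize the bytes back into unicode text.
round_trip = byte_seq.decode('utf8')


def ascii_decode_fails(data):
    """Return True if the bytes cannot be decoded as ASCII (hypothetical helper)."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False
```

Each of the four Persian letters takes two bytes in UTF-8, so byte_seq is eight bytes long, and trying to decode it as ASCII raises UnicodeDecodeError. That is the kind of error you would typically see when non-ASCII input is treated as ASCII.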

Answer 2 · 2014-08-26T06:05:02.000Z

I made no changes to the input text file.
Do you mean I should encode the input file that contains the training data and then pass it to the script,
or edit your code to encode the file after reading it?
(I hope I understood what you meant.)

Answer 3 · 2014-09-05T19:16:27.000Z

Hi,

Sorry to leave you hanging on such a simple problem.

It turns out that I wasn't decoding the text into bytes properly in my train.py and parse.py scripts, as the files I've been running my experiments on have all been ASCII, and I'm using Python 2.

I've pushed a quick patch to the "develop" branch for the train.py and parse.py scripts, but I still haven't tested this for you unfortunately. I thought I'd get this out now, rather than waiting longer for time to do it properly.
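The patch itself isn't reproduced in this thread. Purely as an illustration of the general idea (the function name read_lines and the temporary sample file are assumptions, not redshift code), reading a training file with an explicit encoding might look roughly like this on Python 2 or 3:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; this is not the actual train.py / parse.py patch.
import io
import os
import tempfile


def read_lines(path, encoding='utf8'):
    """Read a text file and return its non-empty lines as unicode strings."""
    # io.open decodes the bytes on read, so callers get unicode in Python 2 as well.
    with io.open(path, 'r', encoding=encoding) as f:
        return [line.rstrip(u'\n') for line in f if line.strip()]


# Write a small Persian sample ("salam donya") to a temporary UTF-8 file.
sample = u'\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627\n'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf8') as f:
    f.write(sample)

# Read it back with the encoding stated explicitly, then clean up.
lines = read_lines(path)
os.remove(path)
```

Stating the encoding at the file boundary keeps the rest of the script working with unicode objects, and the text only needs to be encoded again at the point where a function expects bytes.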

I'll be returning to development on this project in about a month --- at the moment I'm finishing a tokenizer and lexicon, which will also improve unicode support for the parser and tagger. I'll then clean up the parser and finally write documentation.

So: checkout the branch "develop", try now, and let me know how you go.

Answer 4 · 2014-09-08T13:03:44.000Z

Hi again,
Special thanks for your attention.
I checked out the develop branch (I think) and tried to use it, but I encountered lots of module errors,
first for index.lexicon and then for perception. I'm not sure about my checkout, so it may be my mistake. It would be great if you could check it yourself, or help me with the module problem.
I solved the index.lexicon issue with
$ pip install index
I'm not sure that was the right fix, but the error went away.
Many thanks again

Answer 5 · 2014-09-08T13:05:01.000Z

Did you recompile?

fab clean make

Answer 6 · 2014-09-08T13:07:48.000Z

No.
I cloned again and did everything in the README again.

Answer 7 · 2014-09-08T14:03:55.000Z

Okay well, apart from some OSX-specific installation problems (grr), cloning, and checking out develop works for me. Can you do git log and tell me what SHA hash your develop branch is on? It should say:

commit d4460a048116e79a9e635b47695c2a69d84fb20b
Merge: 1478a44 35e9db5
Author: Matthew Honnibal
Date:   Fri Sep 5 21:10:37 2014 +0200

* Merge

Otherwise, maybe try running "git fetch" and then "git pull origin develop"? Or...something. I'm not sure how git synchs remote branches to you.

Answer 8 · 2014-09-09T04:58:26.000Z

I ran git log and it shows what it should, but
first of all, the line
from redshift.sentence import Input
fails,
and there are other similar problems as well.