SethForsgren/word2vec

Patch for /trunk/demo-train-big-model-v1.sh


Fixed a couple of bugs:
1. Name mismatch with the UMBC-webbase corpus
2. Missing download of the phrases dataset

Original issue reported on code.google.com by roys...@gmail.com on 15 Sep 2014 at 6:42


Thanks, I fixed the second part (the missing download of 
questions-phrases.txt). However, I don't know what the first problem is about - 
this part of the script runs OK for me.

Original comment by tmiko...@gmail.com on 15 Sep 2014 at 9:23

1. Is your shell case-insensitive, and does it implicitly add the .tar.gz 
suffix? The script downloads UMBC-webbase-corpus but then extracts 
umbc_webbase_corpus.tar.gz. 

2. The corpus contains two types of files - plain txt (.txt) and parsed files 
(.possf2). I assume you are only interested in the txt files, so you want to 
iterate over these files only.

Original comment by roys...@gmail.com on 16 Sep 2014 at 8:30
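
For illustration, point 2 could look roughly like the sketch below. It is not the code from the patch or from the script; the function name `list_txt_files` and the directory argument are hypothetical, and it only assumes that the plain-text files end in .txt while the parsed ones end in .possf2.

```shell
# Hypothetical helper: print only the plain-text files found directly in
# the given directory, so the parsed .possf2 files are never visited.
list_txt_files() {
  dir="$1"
  for f in "$dir"/*.txt; do
    # If nothing matches, the glob stays literal; skip that case.
    [ -f "$f" ] || continue
    printf '%s\n' "$f"
  done
}
```

Calling `list_txt_files some_dir` prints one path per line, which can then be fed to whatever preprocessing step follows.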

I just noticed that when downloading 
http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus 
through my browser I also get umbc_webbase_corpus.tar.gz, as in the script. 
However, when I download it using wget, I get UMBC-webbase-corpus. This might 
explain the difference. I also noticed that you handle the txt files only, 
so that's cool. 

Original comment by roys...@gmail.com on 17 Sep 2014 at 8:25

I get umbc_webbase_corpus.tar.gz when using wget, so the issue must lie 
elsewhere. If more people have the same problem, I may have to update the 
script and give the output file an exact name.

Original comment by tmiko...@gmail.com on 17 Sep 2014 at 5:48
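
A minimal sketch of what "an exact name" could mean in practice, assuming wget is the downloader: its `-O` option writes the download to the given file regardless of what name the server or redirect suggests. The function names `fetch_corpus` and `normalize_archive_name` are hypothetical and not taken from the script; the second one is just a fallback for an archive that was already saved under the other name mentioned in this thread.

```shell
CORPUS_URL="http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus"
ARCHIVE="umbc_webbase_corpus.tar.gz"

# Download under an explicit local name so the result does not depend on
# how the redirect is resolved on a particular machine.
fetch_corpus() {
  wget -O "$ARCHIVE" "$CORPUS_URL"
}

# Fallback: if an earlier download was saved as UMBC-webbase-corpus,
# rename it to the name the extraction step expects.
normalize_archive_name() {
  if [ -f "$ARCHIVE" ]; then
    return 0
  fi
  if [ -f "UMBC-webbase-corpus" ]; then
    mv "UMBC-webbase-corpus" "$ARCHIVE"
    return 0
  fi
  return 1   # neither name is present
}
```

With either approach the later `tar` step sees the same file name no matter which name the download originally produced.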