Fail to use it in Google Colab

Question

ajason08 opened this issue 5 years ago · 3 comments

ajason08 commented 5 years ago

Hello,
Thank you for your effort in making this Python version.
I am struggling to run the first example.

My code (just 3 lines of code) can be reproduced with this notebook

Can you please help me to understand what is wrong?

Thank you!

Answer 1 · 2020-04-21T13:38:43.000Z

Hi

L.19 in example.py only works in Python 2. In Python 3, str.translate() takes a single mapping argument.
Try replacing yield line.lower().translate(None, delchars).split(' ')
with yield line.lower().translate({ord(x): None for x in delchars}).split(' ')
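
For reference, here is a minimal Python 3 sketch of that replacement. The `tokenize` helper and the choice of `string.punctuation` for `delchars` are assumptions for illustration; example.py may define `delchars` differently. `str.maketrans('', '', delchars)` builds the same kind of deletion table as the dict comprehension above.

```python
import string

# Assumption: delchars holds the characters to strip; the real example.py
# may use a different set.
delchars = string.punctuation


def tokenize(line, delchars=delchars):
    """Lowercase a line, delete the characters in delchars, split on spaces."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans('', '', delchars)
    return line.lower().translate(table).split(' ')


tokens = tokenize("Hello, World! It's Python 3.")
print(tokens)
```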

You're probably going to run into more issues down the line, though, as this code was written for Python 2 and appears to be no longer maintained.

Cheers

Answer 2 · 2020-04-21T13:41:59.000Z

(The issue further down in your code is because model.fit() expects a list of lists, not a list of strings. Each document should be represented as a list of words.)
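
A small sketch of the shape difference, using made-up sample sentences (the variable names here are hypothetical, not taken from the notebook):

```python
# A list of strings: one string per document. This is the wrong shape.
docs_as_strings = ["the quick brown fox", "jumps over the lazy dog"]

# A list of lists: each document is a list of words. This is the expected shape.
docs_as_tokens = [doc.split() for doc in docs_as_strings]

print(docs_as_tokens)
```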

Answer 3 · 2020-04-21T16:05:19.000Z

Now working as expected!
Thank you

I am pasting my working code here for future readers.

!pip install glove_python
!curl -o my_corpus.txt https://norvig.com/big.txt

from glove import Corpus, Glove

#Creating a corpus object
corpus = Corpus() 

""" corpus.fit() expects a list of lists of strings,
  not one big string or a flat list of strings.
  Each document should be represented as a list of words: [[doc1], [doc2], ...]
  The code below turns a txt file into this format.
  There are probably more efficient alternatives. """

with open("my_corpus.txt",'r') as f:  
    lines = f.read().split()  

num_docs = 10
doc_list = []
last_index = 0
for i in range(num_docs):
  # integer division, so the last few words may be dropped
  upper_index = (len(lines) // num_docs) * (i + 1)
  newdoc = lines[last_index:upper_index]
  doc_list.append(newdoc)
  last_index = upper_index  # start the next doc where this one ended

print("number of docs in doc_list:",len(doc_list))
print("first doc fragment:", doc_list[0][0:11])


#Fitting the corpus to generate the co-occurrence matrix that GloVe is trained on
corpus.fit(doc_list, window=10)
glove = Glove(no_components=5, learning_rate=0.05) 
glove.fit(corpus.matrix, epochs=30, no_threads=1, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')

glove = Glove.load('glove.model')
x = glove.most_similar("Sherlock", number=10)
x
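
The splitting loop in the code above uses integer division, so trailing words can be dropped when the word count is not a multiple of num_docs. One possible alternative, sketched here with a hypothetical helper named `split_into_docs`, spreads the remainder over the first few documents so that every word is kept:

```python
def split_into_docs(words, num_docs):
    """Split a list of words into num_docs contiguous chunks, keeping every word."""
    base, extra = divmod(len(words), num_docs)
    docs, start = [], 0
    for i in range(num_docs):
        # The first `extra` chunks get one additional word each.
        end = start + base + (1 if i < extra else 0)
        docs.append(words[start:end])
        start = end
    return docs


words = "a b c d e f g".split()
docs = split_into_docs(words, 3)
print(docs)
```

The result could be passed to corpus.fit() in place of doc_list.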