maciejkula/glove-python

Fail to use it in Google Colab

ajason08 opened this issue · 3 comments

Hello,
Thank you for your effort on this Python version.
I am struggling to run the first example.

My code (just 3 lines) can be reproduced with this notebook.

Can you please help me to understand what is wrong?

Thank you!

Hi

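Line 19 in example.py only works in Python 2. In Python 3, str.translate takes a single table argument, so the two-argument call raises a TypeError.
Try replacing
    yield line.lower().translate(None, delchars).split(' ')
with
    yield line.lower().translate({ord(x): None for x in delchars}).split(' ')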
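
A minimal sketch of the difference, runnable without glove-python. The `delchars` value here is an assumption (punctuation from the standard `string` module), and the sample sentence is made up; the actual definition in example.py may differ.

```python
import string

# Assumption: stand-in for the delchars string defined in example.py.
delchars = string.punctuation

line = "Hello, World! It's GloVe."

# Python 3: str.translate takes a single table mapping code points to
# replacements; mapping a code point to None deletes that character.
table = {ord(ch): None for ch in delchars}
tokens = line.lower().translate(table).split(' ')

# str.maketrans builds an equivalent table; its third argument lists
# the characters to delete.
table_alt = str.maketrans('', '', delchars)
tokens_alt = line.lower().translate(table_alt).split(' ')

print(tokens)
```

Both forms should give the same tokens; `str.maketrans` is arguably the more readable of the two.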

You're probably going to run into more issues down the line though, as this code was written for Python 2 and appears to be no longer maintained.

Cheers

(The issue further down in your code is because model.fit() expects a list of lists, not a list of strings. Each document should be represented as a list of words.)
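
To illustrate the expected shape, here is a small sketch that turns a list of strings into a list of token lists. The sample sentences and the plain whitespace tokenization are assumptions for illustration only.

```python
# Hypothetical input: a flat list of strings, one string per document.
raw_docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
]

# Split each string into words so every document becomes a list of tokens.
docs = [doc.split() for doc in raw_docs]

print(docs)
```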

Now working as expected!
Thank you

I'm pasting my working code here for future readers.

!pip install glove_python
!curl -o my_corpus.txt https://norvig.com/big.txt

from glove import Corpus, Glove

#Creating a corpus object
corpus = Corpus()

""" corpus.fit() expects a list of lists of strings,
  not one big string or a flat list of strings.
  Each document should be represented as a list of words: [[doc1], [doc2], ...]
  The code below turns a txt file into this format.
  There are probably more efficient alternatives. """

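with open("my_corpus.txt", 'r') as f:
    words = f.read().split()  # one flat list of whitespace-separated tokens

num_docs = 10
doc_size = len(words) // num_docs  # words left over after the last full chunk are dropped
doc_list = []
last_index = 0
for i in range(num_docs):
    upper_index = doc_size * (i + 1)
    newdoc = words[last_index:upper_index]
    doc_list.append(newdoc)
    last_index = upper_index  # advance so each doc covers a distinct slice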

print("number of docs in doc_list:",len(doc_list))
print("first doc fragment:", doc_list[0][0:11])


# Fit the corpus to build the co-occurrence matrix used by GloVe
corpus.fit(doc_list, window=10)
glove = Glove(no_components=5, learning_rate=0.05) 
glove.fit(corpus.matrix, epochs=30, no_threads=1, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')

glove = Glove.load('glove.model')
x = glove.most_similar("Sherlock", number=10)
x
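
On the chunking step: a sketch of a helper that splits a flat word list into a fixed number of documents without dropping the trailing words. `split_into_docs` is a hypothetical name for illustration, not part of glove-python, and the sample input is made up.

```python
def split_into_docs(words, num_docs):
    """Split a flat list of words into num_docs contiguous chunks.

    The first (len(words) % num_docs) chunks get one extra word,
    so no words are lost.
    """
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    base, extra = divmod(len(words), num_docs)
    docs = []
    start = 0
    for i in range(num_docs):
        end = start + base + (1 if i < extra else 0)
        docs.append(words[start:end])
        start = end
    return docs


sample = "a b c d e f g".split()
doc_list = split_into_docs(sample, 3)
print(doc_list)
```

The resulting list of lists can be passed to corpus.fit(doc_list, window=10) in the same way as in the code above.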