curiosity-ai/catalyst

How to get embedding matrix from StarSpace

jcperinan opened this issue · 1 comment

I would appreciate it if you could give an example of the code required to use StarSpace, particularly for mapping a bag of words to a bag of tags, as originally described in:

https://github.com/facebookresearch/StarSpace#tagspace-word--tag-embeddings

I'm having trouble using StarSpace.

Suppose that I have a TXT file where each line contains a set of words that are semantically related to a tag (marked with the prefix "__label__"), as in:

decorate dress garnish adorn beautify embellish __label__decorate
knife cutlery cutter eat silverware butcher carve __label__knife
etc...

Here, one of the first questions is:

  • What should the input file format be? The default one in the original StarSpace is:

word1 word2 word3... [tab] __label__label1

Is this right? In my case, each line contains between 2 and 300 words and exactly one label.
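To make the format question concrete, here is a minimal sketch of how a line in that layout could be split into words and labels. It uses plain .NET only and no Catalyst or StarSpace API; the class name `TagSpaceLineParser`, the method `ParseLine`, and the choice to accept both spaces and tabs as separators are assumptions made for illustration, not something the library is known to do.

```csharp
using System;
using System.Collections.Generic;

public static class TagSpaceLineParser
{
    // Prefix that marks a token as a label, as in the sample lines above.
    private const string LabelPrefix = "__label__";

    // Hypothetical helper: splits one line into its words and its labels.
    // Tokens starting with "__label__" are treated as labels (prefix removed);
    // everything else is treated as a word.
    public static (List<string> Words, List<string> Labels) ParseLine(string line)
    {
        var words = new List<string>();
        var labels = new List<string>();

        // Assumption: spaces and tabs are both valid separators.
        var tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var token in tokens)
        {
            if (token.StartsWith(LabelPrefix, StringComparison.Ordinal))
            {
                labels.Add(token.Substring(LabelPrefix.Length));
            }
            else
            {
                words.Add(token);
            }
        }

        return (words, labels);
    }

    public static void Main()
    {
        var (words, labels) = ParseLine(
            "decorate dress garnish adorn beautify embellish\t__label__decorate");

        // Prints: 6 words, label: decorate
        Console.WriteLine($"{words.Count} words, label: {string.Join(",", labels)}");
    }
}
```

A check like this can also be run over the whole file before training, to confirm that every line yields at least one word and exactly one label.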

With respect to the code, the goal is to get the label-embedding matrix generated from the input file, i.e. we should be able to get the vector corresponding to each label. As we work with unigrams and we expect to have vectors of 100 dimensions, the initial code could be as follows:

        languages.registerLanguage("English");
        Pipeline nlp = await Pipeline.ForAsync(languages.English);

        // GetDocsFromSingleFile converts each line of the file into an IDocument object
        IEnumerable<IDocument> docs = GetDocsFromSingleFile(file);
        IEnumerable<IDocument> parsed = nlp.Process(docs);

        StarSpace ss = new StarSpace(languages.lang, 0, "starspace-model", StarSpace.ModelType.TagSpace);
        ss.Data.TrainWordEmbeddings = true;
        ss.Data.Dimensions = 100;
        ss.Data.WordNGrams = 1;
        ss.Data.InputType = "LabeledDocuments";
        ss.Train(parsed);
        ...

Is this code right? When I run it, an exception is thrown while training the model:

Exception thrown:
System.ArgumentOutOfRangeException: 'Specified argument was out of the range of valid values.'
Call Stack:
Mosaik.Core.dll!Mosaik.Core.ThreadSafeFastRandom.ThrowMaxValueOutOfRange()

By the way, how can I get the label-embedding matrix after training the model?

Thank you, and congratulations on your work.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.