facebookresearch/StarSpace

Return Doc ID instead of text

ArmandGiraud opened this issue · 3 comments

Hello,
I am wondering if it is possible to return a document ID instead of raw text with train mode 2.

I would like to display the text without normalization after query_predict, so I need a way to match the predictions back to the raw documents.

Thank you

Hi,

Did you manage to solve this problem somehow? I am struggling with the ./starspace test command, since it drops a few sentences, although the general output format is the same as for query_predict. Since out-of-vocabulary words are also dropped from the "LHS", matching sentence by sentence is no longer doable.

These are the cues I checked as possible causes of the drop:

  • sentences without a __label__ tag in the RHS
  • sentences without any word
  • sentences containing any Unicode character

However, all sentences to predict contain this tag, at least one word, and at least one Unicode character each (and they are still present in the prediction data).

For instance, there are 11054 lines to use for prediction, but the last Example tag contains the number "10808". Unfortunately, I cannot share the data, since these are tweets and Twitter's policy forbids sharing them.

Hello, I modified the source code of query_predict to print the doc ids instead of the normalized text. The doc id corresponds to the document's number in baseDocs.
Here is my fork: https://github.com/ArmandGiraud/StarSpace/blob/master/src/apps/query_predict.cpp
Now I just need a mapping from baseDocs ids to my original documents. I hope this helps you. Let me know if anything is unclear.
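
For reference, the heart of the change is only the print statement inside the query loop. Roughly, assuming the upstream query_predict.cpp structure where predictOne() fills (similarity, baseDoc index) pairs (the member and argument names below are from memory, so check them against the actual fork):

// Sketch of the modified src/apps/query_predict.cpp
#include "../starspace.h"
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

using namespace starspace;

int main(int argc, char** argv) {
  auto args = std::make_shared<Args>();
  args->model = argv[1];        // path to the trained model
  args->K = atoi(argv[2]);      // number of predictions per query
  args->basedoc = argv[3];      // file whose lines are the candidate documents

  StarSpace sp(args);
  sp.initFromSavedModel(args->model);
  sp.loadBaseDocs();

  std::string input;
  while (std::getline(std::cin, input) && !input.empty()) {
    std::vector<Base> query_vec;
    sp.parseDoc(input, query_vec, " ");

    std::vector<Predictions> predictions;   // pairs of (similarity, baseDoc index)
    sp.predictOne(query_vec, predictions);

    for (size_t i = 0; i < predictions.size(); i++) {
      // The actual change: print the baseDoc index (the document's line number
      // in the basedoc file) instead of printing its normalized text.
      std::cout << i << "[" << predictions[i].first << "]: "
                << predictions[i].second << std::endl;
    }
  }
  return 0;
}

Assuming loadBaseDocs reads the basedoc file line by line (which matches the doc id being the document number in baseDocs), the printed id is just the zero-based position of the document in that file, so keeping the raw, un-normalized documents in the same line order gives the mapping back to the original text.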

Thanks! :) In the meantime, I added an option to the main script, analogous to test or train, so the results don't need to be parsed externally. It simply returns output in the format

id1, label1 label2 label3
id2, label4 label5

and so on.

Command to run it: starspace pred. Other options are the same as for starspace test. The pred command requires input in the following fastText format:
__id__<id> word_1 word_2 ... word_k __label__<labelid>

It supports multiple labels and a single id. I tried it with trainMode = 0 and data in fastText format; I don't know about other formats.
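
To make that concrete with made-up data (the ids, words, and labels below are purely hypothetical), input lines such as

__id__101 cheap flights to rome __label__travel
__id__102 how to boil an egg __label__cooking __label__food

would give pred output along these lines, with whatever labels the model actually predicts for each id:

101, travel
102, cooking food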

Note: I struggle with the labelling. Some labels are treated as if they had \n as part of them, which causes a result to be split across multiple lines. Still, it's easier to parse this than the original format.
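
In case it helps, this is roughly how I glue the split records back together (a sketch only; it assumes a real record always contains ", " after the id and that labels themselves never do):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Merge pred output lines of the form "id, label1 label2 ..." that were
// split because a label was treated as containing a newline.
int main() {
  std::ifstream in("pred_output.txt");   // hypothetical output file name
  std::vector<std::string> records;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty()) continue;
    if (records.empty() || line.find(", ") != std::string::npos) {
      records.push_back(line);           // new "id, labels" record
    } else {
      records.back() += " " + line;      // continuation of the previous record
    }
  }
  for (const auto& r : records) {
    std::cout << r << "\n";
  }
  return 0;
}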

My fork: https://github.com/kacper1095/StarSpace

Btw, I also added a description of pred under the "Predictions in trainMode = 0 (forked)" heading.