Return Doc ID instead of text
ArmandGiraud opened this issue · 3 comments
Hello,
I am wondering if it is possible to return a document ID instead of raw text with trainMode 2.
I would like to display the text without normalization after query_predict, so I need a way to match the results back to the raw documents.
Thank you
Hi,
Did you manage to solve this problem somehow? I struggle with the ./starspace test command, since it drops a few sentences, but the general output format is the same as for query_predict. Since out-of-vocabulary words are also dropped from the "LHS", matching sentence by sentence is no longer doable.
Which possible causes of the drop did I check?
- sentences without a __label__ tag in the RHS
- sentences without any word
- sentences containing any unicode character
However, every sentence to predict contains this tag, at least one word and at least one unicode character, and such sentences still appear in the predictions, so none of these explains the drop.
For instance, there are 11054 lines to use for the prediction, but the last Example tag contains the number "10808". Unfortunately, I cannot share the data, since these are tweets and Twitter's policy forbids sharing them.
Hello, I modified the source code of query_predict to print the doc ids instead of normalized text. The doc id corresponds to the document number found in baseDocs.
Here is my fork: https://github.com/ArmandGiraud/StarSpace/blob/master/src/apps/query_predict.cpp
Now I just need a mapping of baseDocs ids to my original documents. I hope it will help you. Let me know if it is unclear.
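For the mapping step, here is a minimal Python sketch. It rests on assumptions that are not confirmed in this thread: that the printed doc id is the 0-based line position in the baseDocs file, and that the raw (un-normalized) file is line-aligned with it. The helper names are hypothetical. If StarSpace skips some lines while loading baseDocs, the positions will drift and this lookup will be wrong.

```python
from typing import Dict, Iterable, List


def build_doc_index(raw_lines: Iterable[str]) -> Dict[int, str]:
    """Map line position -> raw document text.

    Assumption: doc ids are 0-based line positions in the baseDocs file,
    and the raw file has exactly one document per line in the same order.
    """
    return {i: line.rstrip("\n") for i, line in enumerate(raw_lines)}


def lookup_raw_docs(doc_ids: Iterable[int], index: Dict[int, str]) -> List[str]:
    """Return the raw text for each predicted doc id, skipping unknown ids."""
    return [index[i] for i in doc_ids if i in index]
```

In practice the index would be built from the raw file, for example with `open(path, encoding="utf-8")`, and the doc ids parsed from the modified query_predict output.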
Thanks! :) In the meantime, I added a new mode to the main script, alongside test and train, so the results don't need to be parsed externally. It returns them in the format
id1, label1 label2 label3
id2, label4 label5
and so on.
Command to run it: starspace pred. The other options are the same as for starspace test. pred requires the input in the following fastText format:
__id__<id> word_1 word_2 ... word_k __label__<labelid>
It supports multiple labels and a single id. I tried it with trainMode = 0 and data in fastText format. I don't know how it behaves with other formats.
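To produce input lines in the format described above, something like the following Python sketch could work. The function name is hypothetical, and stripping whitespace from tokens and labels is my own precaution, not something the fork requires.

```python
from typing import Iterable


def to_pred_line(doc_id: object, words: Iterable[str], labels: Iterable[str]) -> str:
    """Build one line: __id__<id> word_1 ... word_k __label__<labelid> ..."""
    parts = [f"__id__{doc_id}"]
    # Drop empty tokens and stray whitespace so each example stays on one line.
    parts.extend(w.strip() for w in words if w.strip())
    parts.extend(f"__label__{l.strip()}" for l in labels if l.strip())
    return " ".join(parts)
```

For example, `to_pred_line(7, ["good", "movie"], ["pos"])` gives `__id__7 good movie __label__pos`.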
Note: I still struggle with the labels. Some labels are treated as if they had \n as part of them, which causes a result to be split across multiple lines. Still, it's easier to parse this than the original format.
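A rough Python sketch for parsing that output, including results that got split across lines. It assumes that ids and labels never contain a comma, so any line without a comma is treated as a continuation of the previous id. The function name is hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def parse_pred_output(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'id, label1 label2' lines into {id: [labels]}.

    Assumption: a line containing a comma starts a new record; a line
    without one continues the labels of the previous record.
    """
    results: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if "," in line:
            doc_id, _, rest = line.partition(",")
            current = doc_id.strip()
            results.setdefault(current, []).extend(rest.split())
        elif current is not None:
            # Continuation line caused by a label that carried a newline.
            results[current].extend(line.split())
    return results
```

If labels can contain commas in your data, this heuristic breaks and the record boundary would have to be detected some other way.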
My fork: https://github.com/kacper1095/StarSpace
Btw, I also added a description of pred under the "Predictions in trainMode = 0 (forked)" heading.