nouhadziri/THRED

train data format

Closed · 3 comments

Hello, I have a simple question about the input training data format. I've downloaded the 5-turn training data; as you say, each line is separated by TABs, but after the 5th sentence there are no topic words. Is the 5th sentence the topic words? I sampled the first line from the training data as follows:
i 've never been this high i 'm going up [ 8 } \t my armed feel weird \t how 's your legs doing ? \t they feel like roots for a tree \t grow young entling grow .
Are the topic words " grow young entling grow ."?
Thanks

ehsk commented

Hi @cyq130,

Thanks for your interest in our work.

It seems you have downloaded the version without the topic words. For the data with topic words included (download from here), each line contains 5 utterances plus their corresponding topic words. The first line should be the following:

Utterance 1: i 've never been this high i 'm going up [ 8 }
Utterance 2: my armed feel weird
Utterance 3: how 's your legs doing ?
Utterance 4: they feel like roots for a tree
Utterance 5: grow young entling grow .
Topic words: guns gun people like weapons firearms re right rifle nra assault amendment need think automatic carry rifles control want use shooting know hunting point ve owners ownership weapon armed arms thing military 2nd things going rights pretty actually firearm time sure way magazine good safety auto ammo defense semi mean buy shoot mass background second shootings ban range ar15 yes talking militia pew hobby owner laws self able defend lobby checks common makes yeah ll better away check reason said having kill magazines concealed designed country anti world probably saying feel training protect look owning hands bear long idea argument
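In other words, the topic words simply occupy a sixth TAB-separated field after the five utterances. Here is a minimal parsing sketch under that assumption; the `train.txt` file name is hypothetical:

```python
# Minimal sketch of reading one line of the topic-augmented data.
# Assumption: 5 TAB-separated utterances, then a sixth TAB-separated
# field holding space-separated topic words.

def parse_line(line: str):
    fields = line.rstrip("\n").split("\t")
    utterances, topic_field = fields[:5], fields[5]
    return utterances, topic_field.split()

with open("train.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        utterances, topic_words = parse_line(line)
        print(len(utterances), "utterances,", len(topic_words), "topic words")
        break
```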

Thanks for answering my question. From your paper, I noticed that you assign a topic T to each conversation. I am wondering how you decide which topic T to select from among so many topics?

ehsk commented

The topics are inferred using a pre-trained LDA model. We trained LDA on Reddit posts and comments. We employed gensim for this purpose.
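For concreteness, here is a minimal sketch of how topic assignment with a pre-trained gensim LDA model typically looks; the dictionary and model file names are hypothetical, and the actual model in the paper was trained on Reddit posts and comments:

```python
# Infer a topic T for a conversation with a pre-trained gensim LDA model.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary.load("reddit.dict")  # hypothetical path
lda = LdaModel.load("reddit_lda.model")      # hypothetical path

conversation = "i 've never been this high ... grow young entling grow ."
bow = dictionary.doc2bow(conversation.split())

# Pick the highest-probability topic for the conversation ...
topic_id, _prob = max(lda.get_document_topics(bow), key=lambda t: t[1])

# ... and read off its top words, analogous to the topic-word list above.
topic_words = [w for w, _ in lda.show_topic(topic_id, topn=100)]
print(topic_words[:10])
```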