Share full or more data
raghadeepa opened this issue · 6 comments
Dear Authors,
I am happy to see your results on the insurance data set and was tempted to reproduce them on my side. However, I could not replicate them on my test data. The reason seems to be that stop-word removal, text normalization, and lemmatization were applied to your text, so my test data and your test data do not match. If possible, could you please share the full test data, or 500 to 1000 test queries similar to the 20 test queries you have already provided?
--Veera.
Please wait a while; I will try to find the data script.
Updated; see gen.py.
Thank you for sharing the updated code.
I believe the corpus was taken from the link below, with the pickle files converted into ASCII files:
https://github.com/codekansas/insurance_qa_python
and that the following mapping was used:
answers --> answers.label.token_idx
train --> question.train.token_idx.label
dev --> question.dev.label.token_idx.pool
test1 --> question.test1.label.token_idx.pool
test2 --> question.test2.label.token_idx.pool
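As a rough sketch of how I understand this mapping: the dictionary name `SPLIT_FILES`, the helper `decode_tokens`, and the `idx_N` vocabulary format are my assumptions for illustration and are not taken from gen.py.

```python
# Hypothetical mapping from split names to the token-index files listed above.
SPLIT_FILES = {
    "answers": "answers.label.token_idx",
    "train": "question.train.token_idx.label",
    "dev": "question.dev.label.token_idx.pool",
    "test1": "question.test1.label.token_idx.pool",
    "test2": "question.test2.label.token_idx.pool",
}


def decode_tokens(idx_tokens, vocab):
    """Map token indices back to words.

    Assumes `vocab` is a dict such as {"idx_1": "life"}; indices that are
    missing from the vocabulary are returned unchanged.
    """
    return [vocab.get(tok, tok) for tok in idx_tokens]


# Toy vocabulary, for illustration only.
toy_vocab = {"idx_1": "life", "idx_2": "insurance"}
decoded = decode_tokens(["idx_1", "idx_2", "idx_9"], toy_vocab)
print(decoded)  # ['life', 'insurance', 'idx_9']
```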
Could you please clarify the following questions?
Which tools were used for lemmatization and text normalization? Which stop words were removed from the text?
--Veera.
Here is the data link: https://github.com/shuzi/insuranceQA
There is no lemmatization, text normalization, or stop-word removal in my code.
Did you remove the "category" name from the data before processing, and did you generate test2 from the validation file in the original source? (I will be happy to copy test1 and test2 if you share links here.)
Yes, I did not see any lemmatization, text normalization, or stop-word removal in the code. However, when I compared the answers in your train file with the original source, the words differ. Below is a comparison between the "original" and "your set". My concern is that if the words are changed, we cannot find them in the word embeddings. As a result, I am seeing a huge difference in the results. Please bear with me.
Original:
Life insurance is exempt from creditor in Connecticut if the beneficiary on the policy is not the estate . If the insured person passes away with debt , and the proceeds of his/her policy goes to the estate -LRB- either because the policy was set up that way , or because the beneficiaries are no longer living -RRB- , then the life insurance money will go through probate , where creditors can go after it . If the beneficiary is a person , they can not take it away from that person
Your train set:
life_insurance_beexempt_from_creditor_in_Connecticut_if_the_beneficiary_on_the_policybenot_the_estate_if_the_insured_person_pass_away_with_debt_and_the_proceeds_ofhis_/_herpolicy_go_to_the_estate_either_because_the_policy_be_set_up_that_way_or_because_thebeneficiary_beno_longerlive_,then_the_life_insurance_money_will_go_through_probate_where_creditor_can_go_after_it_if_the_beneficiarybe_a_person_they_not_can_take_it_away_from_that_person
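To make the mismatch concrete, here is a minimal sketch of the comparison I have in mind. It assumes the original text is split on whitespace and the train-set text on underscores; the helper names are hypothetical, and the two short strings are only prefixes of the texts above.

```python
import re


def original_tokens(text):
    """Lowercase, split on whitespace, and drop bracket markers and pure punctuation."""
    skip = {"-lrb-", "-rrb-"}
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in skip and re.search(r"[a-z0-9]", t)]


def processed_tokens(text):
    """Lowercase and split on underscores, dropping pure-punctuation pieces."""
    return [t.lower() for t in text.split("_") if re.search(r"[A-Za-z0-9]", t)]


def missing_from(reference_vocab, tokens):
    """Return the tokens that are absent from reference_vocab, in order, without repeats."""
    seen = set()
    missing = []
    for tok in tokens:
        if tok not in reference_vocab and tok not in seen:
            seen.add(tok)
            missing.append(tok)
    return missing


original = "Life insurance is exempt from creditor"
processed = "life_insurance_beexempt_from_creditor"
vocab = set(original_tokens(original))
print(missing_from(vocab, processed_tokens(processed)))  # ['beexempt']
```

If `reference_vocab` is replaced by the vocabulary of the word embeddings, the same check lists the train-set tokens that would have no embedding.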
--Veera.
I am sorry to bug you again. If possible, could you please share the full test data, or the stop words, lemmatizer, and text normalizer that you might have used?