Could you please provide the original amazon review dataset instead of bag-of-words version?

Question

hikaru-nara opened this issue 4 years ago · 9 comments

Hi, thank you for the brilliant work!
I'd like to build an application on top of your research and need to verify a few points of interest. Could you please provide the original Amazon review dataset that you derived the BOW version from?
I know the official website of the dataset, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/, but the dataset provided there doesn't match yours on the unlabeled data.
Thanks

Answer 1 · 2020-09-11T01:17:37.000Z

We checked against the processed data from the official website: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz

It does match the unlabeled data in this repo. Can you please check again?

However, the processed BOW data doesn't provide unique review IDs for each document, so I don't think there is an easy way to map it back to the original version. You could contact the creators of the original data to obtain the version that was used to create the BOW version.
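
To illustrate why the mapping is hard, here is a minimal sketch of reading one processed document. It assumes each line has the form `feature:count ... #label#:<label>`, which is how the processed files are commonly described; the sample line and the `parse_bow_line` helper are hypothetical, so check them against your copy of the data. Note that nothing in such a line identifies the source review.

```python
from collections import Counter


def parse_bow_line(line):
    """Parse one 'feature:count ... #label#:<label>' line into (Counter, label)."""
    counts = Counter()
    label = None
    for token in line.split():
        # Split on the last colon so features that contain ':' stay intact.
        feature, _, value = token.rpartition(":")
        if feature == "#label#":
            label = value
        elif feature:
            counts[feature] = int(value)
    return counts, label


# Hypothetical sample line, not taken from the dataset.
sample = "great_book:1 great:2 book:1 #label#:positive"
counts, label = parse_bow_line(sample)
print(label, counts["great"])
```

Only word counts and the label survive, so word order and the review ID cannot be recovered from this representation alone.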

Answer 2 · 2020-09-11T01:59:57.000Z

Thanks for the clarification. I realized that there are two versions of the Amazon review dataset. I was actually looking at the older one, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz. That's why it didn't match.
I believe this is the original data without bag-of-words preprocessing, in case anyone needs it: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz

Answer 3 · 2020-12-22T02:06:30.000Z

Could you please provide the original Amazon review dataset instead of the bag-of-words version?
The link https://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz doesn't have the data without bag-of-words preprocessing. The archive contains all the data, but the bag-of-words preprocessing is part of it.

Answer 4 · 2020-12-22T03:14:52.000Z

You're right. I've contacted the original dataset creators, and they too said there is no easy way to recover the data without BOW preprocessing because the project is ten years old.
In fact, from my literature research, most papers on cross-domain sentiment classification use the older version, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz. For example, check out this latest paper from UC Berkeley and its references, and you'll see (they all specify the dataset link they used). So unless one has to use the newer version, I would recommend the older one because it contains raw text.
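
For anyone working with the raw text, here is a minimal sketch of extracting fields from the review files. It assumes the files are pseudo-XML with one `<review>` block per review and nested fields such as `<unique_id>`, `<rating>` and `<review_text>`; the sample text and the `parse_reviews` helper are hypothetical, so adjust the field names to what your copy actually contains.

```python
import re

# Assumed layout: <review> ... <field> value </field> ... </review>
REVIEW_RE = re.compile(r"<review>(.*?)</review>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_reviews(text):
    """Return one dict per <review> block, mapping field name to stripped text."""
    reviews = []
    for block in REVIEW_RE.findall(text):
        fields = {}
        for name, value in FIELD_RE.findall(block):
            # Keep the first occurrence if a field is repeated.
            fields.setdefault(name, value.strip())
        reviews.append(fields)
    return reviews


# Hypothetical sample, not taken from the dataset.
sample = """<review>
<unique_id>
B0001:example
</unique_id>
<rating>
5.0
</rating>
<review_text>
A wonderful read.
</review_text>
</review>"""

reviews = parse_reviews(sample)
print(reviews[0]["review_text"])
```

Regular expressions are used instead of an XML parser because pseudo-XML of this kind is often not well-formed (for example, unescaped `&` characters), which a strict parser would reject.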

Answer 5 · 2020-12-22T03:21:49.000Z

Thank you for your reply! But the unlabeled data in this version (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz) has no sentiment polarity labels.
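
One possible workaround, sketched under assumptions: if the unlabeled reviews still carry a star rating (not verified here for every domain), a polarity can be derived from it. The threshold of 3 follows the convention commonly reported for this dataset (above 3 is positive, below 3 is negative, exactly 3 is discarded); `polarity_from_rating` is a hypothetical helper name.

```python
def polarity_from_rating(rating, neutral=3.0):
    """Map a star rating to 'positive', 'negative', or None for neutral."""
    value = float(rating)
    if value > neutral:
        return "positive"
    if value < neutral:
        return "negative"
    # Ratings equal to the neutral value are treated as ambiguous.
    return None


labels = [polarity_from_rating(r) for r in ["5.0", "1.0", "3.0"]]
print(labels)
```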

Answer 6 · 2020-12-22T03:28:35.000Z

I'm not aware of the problem you mentioned. Check out this GitHub repo. They provide a copy of the older dataset, which I personally use for research.

Answer 7 · 2020-12-22T03:34:35.000Z

Thank you very much. It's been a great help.

Answer 8 · 2020-12-22T03:36:54.000Z

My pleasure.

Answer 9 · 2021-03-29T01:28:41.000Z

Could you please provide the code for Glove-DANN?