declare-lab/kingdom

Could you please provide the original Amazon review dataset instead of the bag-of-words version?

hikaru-nara opened this issue · 9 comments

Hi, thank you for the brilliant work!
I'd like to build an application on your research and need to verify a few points of interest. Could you please provide the original Amazon review dataset from which you derived the BOW version?
I know the official website of the dataset, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/, but the unlabeled data provided there doesn't match yours.
Thanks

We checked with the processed data from the official website https://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz

It does match the unlabeled data in this repo. Could you please check again?

However, the processed BOW data doesn't provide a unique review ID for each document, so I don't think there is an easy way to map it back to the original version. You could contact the creators of the original data to obtain the version that was used to create the BOW version.
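
To illustrate why mapping back is hard, here is a minimal sketch of reading one line of the processed data. It assumes each line is a space-separated list of `feature:count` pairs with a trailing `#label#:positive` or `#label#:negative` token; please verify that against the actual files. `parse_bow_line` is a hypothetical helper name, not something from this repo.

```python
from collections import Counter
from typing import Optional, Tuple


def parse_bow_line(line: str) -> Tuple[Counter, Optional[str]]:
    """Parse one line of the processed data (format is an assumption)."""
    counts: Counter = Counter()
    label: Optional[str] = None
    for item in line.split():
        # Split on the last colon so features containing ':' stay intact.
        feature, _, value = item.rpartition(":")
        if feature == "#label#":
            label = value
        elif feature and value.isdigit():
            counts[feature] += int(value)
    return counts, label


counts, label = parse_bow_line("great_book:2 great:1 book:3 #label#:positive")
print(counts.most_common(), label)
```

Only feature counts and the label survive in this representation. Word order and any review identifier are gone, so two different reviews can produce the same line.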

Thanks for the clarification. I realized that there are two versions of the Amazon review dataset. I was actually looking at the older one, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz, which is why it doesn't match.
I believe the original data without bag-of-words preprocessing is here, in case anyone needs it: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz

The link https://www.cs.jhu.edu/~mdredze/datasets/sentiment/unprocessed.tar.gz doesn't have the data without bag-of-words preprocessing. The archive contains all the data, but the bag-of-words preprocessed version is part of it.

You're right. I've contacted the original dataset creator, and they also said there is no easy way to recover the data without BOW preprocessing because it was a project from ten years ago.
In fact, from my literature review, most papers on cross-domain sentiment classification use the older version, https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz. For example, check out this latest paper from UC Berkeley and its references, and you'll see (they all specify the dataset link they used). So unless you have to use the newer version, I would recommend the older one because it contains raw text.
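
For anyone who goes with the older archive, here is a rough sketch of pulling the raw text out of its `.review` files. It assumes the files are pseudo-XML in which each review is wrapped in `<review>...</review>` and carries fields such as `<rating>` and `<review_text>`; please check this against the files you download. `field` and `parse_reviews` are hypothetical helper names.

```python
import re
from typing import Dict, List, Optional

# Assumed layout: one <review>...</review> block per review.
REVIEW_RE = re.compile(r"<review>(.*?)</review>", re.DOTALL)


def field(block: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> in block."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_reviews(text: str) -> List[Dict[str, Optional[str]]]:
    """Extract rating and raw text from every review block in text."""
    return [
        {"rating": field(block, "rating"), "text": field(block, "review_text")}
        for block in REVIEW_RE.findall(text)
    ]


sample = """<review>
<rating>
5.0
</rating>
<review_text>
Great book, would read again.
</review_text>
</review>
"""
print(parse_reviews(sample))
```

A regex is used rather than an XML parser because files like this are often not well-formed XML. When reading from disk, opening the file with `errors="replace"` avoids crashes on stray bytes; the right encoding is something to confirm on the real data.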

I'm not aware of the problem you mentioned. Check out this GitHub repo. They provide a copy of the older dataset, which I personally use for research.