soumyac1999/hyperbolic-label-emb-for-hmc

Could you give more information about the dataset?

Closed this issue · 5 comments

Hello,

Thanks for kindly sharing your code. However, I ran into several problems when dealing with the dataset. It seems that the authors of HiLAP have not been active recently...

My questions are,

  1. What is the docs.txt in rcv1? I know it can be found at https://trec.nist.gov/data/reuters/reuters.html but there is more than one file there. Which is the correct one?
  2. The Yelp dataset is different from what the authors used. Would you please provide the version that you used?

Thanks in advance.

Hi

  1. You can download the files mentioned here from the TREC link above.

  2. Yelp keeps changing its dataset and doesn't state a version number. You can search for and download
    yelp_academic_dataset_business.json, containing 209393 lines,
    and yelp_academic_dataset_review.json, containing 8021122 lines,
    with 540 labels.
    We also faced a similar issue when looking for the Yelp dataset to compare with HiLAP. Eventually, we had to run their code on the currently available version of Yelp.
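If it helps, the line counts above can be used as a rough check on whether a downloaded Yelp dump matches the one described here. This is only a minimal sketch: it assumes both JSON files sit in one directory, and the names `count_lines` and `check_yelp_version` are hypothetical helpers, not part of the repository.

```python
import os

# Line counts quoted in this thread for the Yelp dump that was used.
EXPECTED_LINES = {
    "yelp_academic_dataset_business.json": 209393,
    "yelp_academic_dataset_review.json": 8021122,
}


def count_lines(path):
    """Count lines by streaming the file, so large dumps are not loaded into memory."""
    count = 0
    with open(path, "rb") as handle:
        for _ in handle:
            count += 1
    return count


def check_yelp_version(data_dir=".", expected=EXPECTED_LINES):
    """Return {filename: (actual, expected)} for every file that does not match.

    A missing file is reported with an actual count of None.
    An empty result means all line counts agree.
    """
    mismatches = {}
    for name, expected_count in expected.items():
        path = os.path.join(data_dir, name)
        actual = count_lines(path) if os.path.isfile(path) else None
        if actual != expected_count:
            mismatches[name] = (actual, expected_count)
    return mismatches
```

Calling `check_yelp_version("path/to/yelp")` and getting an empty dict back suggests the same dump; matching line counts are a necessary condition, not proof that the contents are identical.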

> You can download the files mentioned here from the TREC link above.

Yes, I have downloaded all the files I could find referenced in that code (only 'docs.txt' is missing).
I guess I'm meant to construct 'docs.txt' myself from the several 'train' and 'test' files in their code?

docs.txt contains the complete RCV1 dataset (non-tokenized), 806791 lines in total. The *.dat files are the tokenized version of RCV1.
Since RCV1 is a proprietary dataset, you need to follow the instructions given on the author's page to download it.

Pasting a few lines of the RCV1 docs.txt below:

2290 USA: Planet Hollywood launches credit card. ....... Orlando, Florida-based Planet Hollywood is part of Planet Hollywood International Inc.
2333 USA: U.S. farm trade surplus $1.741 billion in June.....ell in June to $110 million from $157 million in May.
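Going by the sample lines above, each line of docs.txt appears to be a numeric document id followed by the raw text. A minimal sketch of reading that layout, assuming the id and the text are separated by whitespace (the exact separator is not confirmed here, and `parse_doc_line` and `iter_docs` are hypothetical names):

```python
def parse_doc_line(line):
    """Split one docs.txt line into (doc_id, text).

    Assumes the line starts with an integer id, then whitespace, then the text.
    Raises ValueError if the line is blank or the id is not an integer.
    """
    parts = line.strip().split(None, 1)  # split on the first run of whitespace
    if not parts:
        raise ValueError("blank line")
    doc_id = int(parts[0])
    text = parts[1] if len(parts) > 1 else ""
    return doc_id, text


def iter_docs(lines):
    """Yield (doc_id, text) pairs from an iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_doc_line(line)
```

Since an open file is an iterable of lines, `dict(iter_docs(open("docs.txt", encoding="utf-8")))` would build an id-to-text mapping; the file encoding is also an assumption.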

I understand. Thanks again for your patience.

Hello, I have applied for access to the RCV1 dataset. How do I convert the data to TXT form? I would appreciate it if you could share your converted file!