NExTplusplus/TAT-QA

Original Reports for data

Closed this issue · 8 comments

Hi, do you happen to have the CIK, or the company name and the year, for each sample in the dataset? For example:

{
   uid: ...,
   cik: ...,
   company_name: ...,
   year: ...
}

Thank you very much for any help you can offer.

Hi @nickmagginas, many thanks for your interest in our dataset.

Sorry, we do not have the CIK, the company name, or the year in our dataset. We suggest using the existing information in the dataset to solve this challenge. Thank you.

Hi, thanks for your reply.

I am elaborating because I think my original description of the problem was not very clear. We want to match each sample in the dataset to its original report (e.g. the APPLE annual report for 2019) so that we can extract some structured information from that report and use it as extra context for QA models. Currently there is no way to match a sample to the report from which it was created, e.g. APPLE 2019. Barring text searches over public annual report dumps, e.g. from EDGAR, I see no way of actually matching the data.

Thanks in advance for any help you can offer.

Just to clarify further: in the paper you say you downloaded 500 annual reports, but we have no way of knowing which reports those are. This information is not contained in the dataset. If you could share those reports, then even without the mapping from context to report (i.e. which report each table and text was taken from), the matching would become far easier than searching all publicly available annual reports.

@nickmagginas Hi, I understand your request. We are still working on a research problem and may release the information about the original reports in the future. Please keep an eye on our future work. Hope this helps. Thank you.

Has there been an update on this matter? Knowing which report each hybrid context was extracted from would be beneficial for my research as well. If you could share the filenames, or the company name and year, of the 182 reports included in the dataset, I think this would be a great improvement. I could then match hybrid contexts to the related PDF files for my research and, if you permit, share the results with the community.

Yeah, I have the same concern. I can extract the original file names (PDF files) but don't have an API to batch download them. It would be good to have those files here.

@doviettung96 Yes, we share this info in the extended TAT-DQA dataset: https://nextplusplus.github.io/TAT-DQA/. You can find the company name and the year in the source field in the JSON, and with that you can easily find and download the original PDF file. Hope it helps, thank you.
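For anyone reading along, here is a minimal sketch of how a source value might be split into a company name and a year. It assumes the values look like "bce-inc_2019.pdf" (the one file name quoted in this thread); the actual format across the dataset is not confirmed here, and `ReportRef` and `parse_source` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class ReportRef(NamedTuple):
    """Company slug and fiscal year recovered from a source file name."""
    company: str
    year: int


# Assumed pattern: "<company-slug>_<4-digit-year>.pdf", e.g. "bce-inc_2019.pdf".
_SOURCE_PATTERN = re.compile(r"^(?P<company>.+)_(?P<year>\d{4})\.pdf$", re.IGNORECASE)


def parse_source(source: str) -> Optional[ReportRef]:
    """Return (company, year) if the name fits the assumed pattern, else None."""
    match = _SOURCE_PATTERN.match(source.strip())
    if match is None:
        return None
    return ReportRef(match.group("company"), int(match.group("year")))


if __name__ == "__main__":
    print(parse_source("bce-inc_2019.pdf"))
```

Names that do not fit the assumed pattern return `None`, so they can be collected and checked by hand rather than silently mis-parsed.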

@fengbinzhu Yeah, I knew that. I have a list of files. But when I go to Annual Report and try to download using the PDF file name, the name there is not exactly the file name that you give. For example:
I try to find "bce-inc_2019.pdf". After a while, I get a similar file, "NYSE_BCE_2019". For other files, I can't always find a match. It would be better if we had an API to download the file with the exact name.