Welcome to the TreeSQuAD2.0 dataset, a public resource created by the Dalhousie Natural Language Processing Lab (DNLP). This dataset is the result of my master's thesis research, which focuses on incorporating Structural Embedding of Constituency Trees in the Attention-Based Model for Machine Comprehension. The thesis can be accessed here.
The 'Processed' folder contains the following:
- Parsed Trees:
  - Parsed trees generated using the Stanford CoreNLP parser. (Citation: Manning et al., 2014)
- Simplified Trees:
  - Trees simplified into nodes, leaves, spans, and POS tags using the fairseq library. (Citation: Ott et al., 2019)
- Vocabulary:
  - Vocabulary of tokens.
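
To give a feel for how a parsed tree relates to its simplified form, here is a minimal sketch in plain Python. It assumes the parsed trees are stored as Penn-Treebank-style bracketed strings, which is the usual text output of Stanford CoreNLP; the actual file layout in the 'Processed' folder may differ. The function name `simplify_tree` and the example sentence are hypothetical, and this is not the fairseq-based code used to build the dataset.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (node label, start leaf index, end leaf index, end exclusive)


def simplify_tree(tree: str) -> Tuple[List[str], List[str], List[Span]]:
    """Split a bracketed constituency tree into leaves, POS tags, and node spans.

    Hypothetical helper: assumes Penn-Treebank-style brackets such as
    "(S (NP (DT The) (NN cat)))". Literal parentheses are expected to be
    escaped in the input (for example as -LRB- and -RRB-).
    """
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    leaves: List[str] = []
    pos_tags: List[str] = []
    spans: List[Span] = []
    # Each stack entry is [label, index of first leaf, is_preterminal].
    stack: List[list] = []

    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the node label.
            stack.append([tokens[i + 1], len(leaves), False])
            i += 2
        elif tok == ")":
            label, start, is_preterminal = stack.pop()
            # POS-tag nodes are reported in pos_tags, not as spans.
            if not is_preterminal:
                spans.append((label, start, len(leaves)))
            i += 1
        else:
            # A bare token is a word; its enclosing node is its POS tag.
            leaves.append(tok)
            pos_tags.append(stack[-1][0])
            stack[-1][2] = True
            i += 1
    return leaves, pos_tags, spans


example = "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat))))"
leaves, pos_tags, spans = simplify_tree(example)
print(leaves)    # ['The', 'cat', 'sat']
print(pos_tags)  # ['DT', 'NN', 'VBD']
print(spans)     # [('NP', 0, 2), ('VP', 2, 3), ('S', 0, 3), ('ROOT', 0, 3)]
```

Spans are emitted in the order their nodes close, so children appear before their parents. If you need a different ordering or inclusive end indices, adjust the sketch accordingly.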
Feel free to explore and utilize the dataset for your NLP and machine comprehension projects. If you find this resource helpful, consider citing this work or providing feedback.
I am sincerely grateful to Dr. Vlado Keselj for his invaluable guidance and support throughout this research.
Happy coding!