QA Augmentation for Textbooks


Augmentations for Textbooks


This project presents augmentations that address conceptual gaps in textbooks. The augmentation system has four major modules:

  1. Concept Extraction: Key concepts are extracted from the textbook sections using an entity annotation service (TAGME).

  2. Deficiency Diagnosis: The deficient concepts are diagnosed.

  3. Query Generation: Based on the deficiency, the concepts are merged with their context to form a keyword-based query.

  4. Textbook Augmentation: Using the queries, a set of QA pairs is retrieved from the associated QA archives.


1. Data Collection:

A. Textbook data: 28 textbooks covering 7 subjects (Physics, Chemistry, Biology, Mathematics, Science, Geography and Economics) across grade levels 6-12 were collected from https://ncert.nic.in. The data are stored in the '1_PDF' folder in PDF format. You can download the data from https://drive.google.com/drive/folders/1bfzMKe-GATR-2rucQGPipN1qmFCCeH-S?usp=sharing

B. QA data: Dumps of 6 Stack Exchange sites (Mathematics, Physics, Chemistry, Biology, Earth Science, and Economics) are stored as 7 XML files (Physics, Chemistry, Biology, Mathematics, Science, Geography and Economics) under the 'QA' folder. You can download the dumps from https://archive.org/details/stackexchange

2. Preprocessing:

Textbook contents (PDFs) are converted to TXT format and preprocessed by removing spurious material (appendices, exercises, etc.). The script '1_convert.py' converts and preprocesses the data from the '1_PDF' folder and stores the output in the '2_Text' folder in TXT format.

3. Segmentation:

The textbook contents are segmented into sections. The script '2_section.py' segments the contents of the '2_Text' folder and stores the sections in the '3_Section' folder.
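As a rough illustration of this step, the sketch below splits plain text into sections at numbered headings. The heading pattern (a line such as `1.2 Motion`) and the function name `split_sections` are assumptions made for this example; '2_section.py' may apply different rules.

```python
import re

# Assumed heading format: a line such as "1.2 Motion in a Straight Line".
HEADING = re.compile(r"^(\d+\.\d+)\s+(.+)$")


def split_sections(text):
    """Return a list of (section_number, title, body) tuples."""
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # A new heading closes the section collected so far.
            if current is not None:
                sections.append(current)
            current = [match.group(1), match.group(2), []]
        elif current is not None:
            current[2].append(line)
        # Lines before the first heading are ignored in this sketch.
    if current is not None:
        sections.append(current)
    return [(num, title, "\n".join(body).strip()) for num, title, body in sections]
```

Text that appears before the first heading is dropped here, and a body line that happens to start with a number like `3.5` would be mistaken for a heading, so a real segmenter would need a stricter pattern.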

4. Concept Extraction:

Key concepts are extracted for each textbook section. The script '3_concept.py' extracts the concepts and stores them in the '4_Concept' folder in JSON format. We create 'uConcepts.txt' to store all the unique concepts across all subjects. We also create 7 files, named after the corresponding grade levels (e.g., '6.txt' for grade level 6), to store the concepts used in the textbooks associated with each grade level.
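The bookkeeping behind 'uConcepts.txt' and the per-grade files can be sketched as follows. The record layout (`grade` and `concepts` keys), the lowercasing, and the name `collect_concepts` are assumptions for illustration; the actual JSON schema in '4_Concept' is not documented here.

```python
from collections import defaultdict


def collect_concepts(records):
    """Group concepts into a global unique list and per-grade lists.

    records: iterable of dicts such as {"grade": 6, "concepts": ["Force"]}
    (an assumed schema, not necessarily the one used in '4_Concept').
    """
    unique = set()
    by_grade = defaultdict(set)
    for record in records:
        for concept in record["concepts"]:
            name = concept.strip().lower()  # normalisation is an assumption
            if name:
                unique.add(name)
                by_grade[record["grade"]].add(name)
    return sorted(unique), {grade: sorted(names) for grade, names in by_grade.items()}
```

The first return value corresponds to the contents of 'uConcepts.txt', and each entry of the second to one grade-level file such as '6.txt'.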

4. Aspects:

A. Wikipedia links are extracted for the extracted concepts. The script '4_wiki.py' extracts the inlinks and outlinks, and stores them in the '5_Link' folder in TXT format.

B. The script '4_wiki.py' reads 'uConcepts.txt' and combines the inlinks and outlinks for the concepts it reads. For 100 randomly selected concepts, this combination is written to the 'GS' folder in TXT format as three files: 'Related.txt', 'Prerequisite.txt', and 'Dependent.txt'. An annotator then tags each concept and its aspects as right/wrong in these files. The annotated files are available at https://drive.google.com/drive/folders/1sbZoJzqcABbMce9uj1AojsQbDOqjCQaq?usp=sharing.

5. Deficiency Diagnosis:

A. The concepts from the '4_Concept' folder are shown to annotators, who are asked to tag the corresponding deficiency. This annotation is stored as 'GS_Deficiency.txt' (under the 'GS' folder, available at https://drive.google.com/drive/folders/1sbZoJzqcABbMce9uj1AojsQbDOqjCQaq?usp=sharing).

B. '5_sfeature.py' extracts the section-specific features and stores them in the '6_sfeature' folder.

C. Similarly, '6_cfeature.py' extracts the concept-specific features and stores them in the '7_cfeature' folder.

D. Combining these feature values with the annotation labels from 'GS_Deficiency.txt', the script '7_Deficiency.py' creates two files, 'feature1.csv' and 'feature2.csv'. 'feature1.csv' holds the feature vectors and labels for the baseline deficiency diagnosis module, and 'feature2.csv' holds those for the proposed module.
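To make the combination step concrete, here is a minimal sketch that joins per-concept feature vectors with annotation labels and serialises them as CSV. The input shapes, the column names (`concept`, `f1`, ..., `label`), and the function names are assumptions for this example, not the layout used by '7_Deficiency.py'.

```python
import csv
import io


def build_rows(features, labels):
    """Join features and labels on the concept name.

    features: {concept: [f1, f2, ...]} and labels: {concept: label}
    (both shapes are assumed for illustration).
    """
    rows = []
    for concept in sorted(features):
        if concept in labels:  # keep only concepts that were annotated
            rows.append([concept, *features[concept], labels[concept]])
    return rows


def to_csv(rows, n_features):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    header = ["concept", *[f"f{i}" for i in range(1, n_features + 1)], "label"]
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Concepts without an annotation are skipped, so every row in the resulting file carries a label that a classifier can train on.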

E. '7_Deficiency.py' trains, validates and tests the deficiency diagnosis module. It reports the overall and subject-wise performance of the module in detail.

6. Query Generation:

A. For each subject and grade level, '8_Query.py' generates the queries and stores them in '9_Query' in JSON format.

B. For each subject, we randomly sample 500 queries from these query sets and store the sample in '9_Query' in TXT format.
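The sampling in step B can be sketched with the standard library. The fixed seed and the function name `sample_queries` are assumptions added here so that the sample is reproducible; the repository may sample differently.

```python
import random


def sample_queries(queries, k=500, seed=0):
    """Return up to k queries sampled without replacement.

    A fixed seed (an assumption of this sketch) makes the sample repeatable.
    """
    if len(queries) <= k:
        return list(queries)
    rng = random.Random(seed)
    return rng.sample(queries, k)
```

When a subject has fewer than `k` queries, the whole set is returned unchanged instead of raising an error.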

7. Textbook Augmentation:

A. For each query, '9_Retrieval.py' extracts augmentations (question id, title, body, tags and best accepted answer) from the QAs and stores them as 'QA.txt' under the 'QA' folder.

B. '9_Retrieval.py' retrieves, re-ranks and filters these augmentations for the queries generated in the previous step. The results of the three stages are stored under the '10_Retrieval' folder as 'RT.txt', 'RR.txt' and 'AUG.txt', respectively.
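The dependency list includes rank_bm25, which suggests that BM25 scoring plays a role in retrieval. The sketch below is a self-contained approximation of BM25 ranking with whitespace tokenisation and a non-negative IDF variant; the function name `bm25_rank` and the parameter defaults are assumptions, and this is not the code in '9_Retrieval.py'.

```python
import math
from collections import Counter


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Rank documents against a query; returns document indices, best first."""
    docs = [doc.lower().split() for doc in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = counts[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

For example, `bm25_rank("force motion", qa_titles)` would order a hypothetical list `qa_titles` so that entries mentioning both terms come first. Re-ranking and filtering would then operate on this initial ordering.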

C. '9_Retrieval.py' assesses these retrieved augmentations against 'GS_Augmentation.txt' using 6 metrics: MAP, MRR, RP, P@1, P@5, and P@10. You can download the gold-standard data from https://drive.google.com/drive/folders/1sbZoJzqcABbMce9uj1AojsQbDOqjCQaq?usp=sharing.
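For reference, the listed metrics can be computed as in the sketch below. It assumes that 'RP' stands for R-precision and that each query has a ranked list of retrieved ids plus a set of gold-standard relevant ids; the function names are illustrative and not taken from '9_Retrieval.py'.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def r_precision(ranked, relevant):
    """Precision at R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_metric(metric, runs):
    """Average a per-query metric over runs, a list of (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

MAP and MRR are then `mean_metric(average_precision, runs)` and `mean_metric(reciprocal_rank, runs)`, while P@1, P@5 and P@10 average `precision_at_k` with the matching `k` over all queries.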

8. Data Stats & Annotation Quality:

'10_get_stat.py' (a) reports statistics on the textbook data and QA data, and (b) assesses the annotation quality of the gold standards for deficiency diagnosis and textbook augmentation.

9. Augmentations for Interface:

'11_augmentation.py' generates the augmented textbooks, in which the concepts are linked directly to their augmentations.


The following Python libraries are required. Install them before running the code:

  • Wikipedia-api
  • NLTK
  • Spacy
  • Numpy
  • Sling
  • Tagme
  • word2vec
  • pickle
  • pandas
  • rank_bm25
  • stackapi
  • sklearn
  • scipy
  • libsvm
  • skmultilearn


In case of any queries, you can reach us at kghosh.cs@gmail.com


