An annotated corpus of discussion forum threads from Massive Open Online Courses.

Primary language: Perl. License: MIT.

NUS MOOC Transacts Corpus

This is an annotated corpus of discussion forum threads from Massive Open Online Courses (MOOCs). The annotations are grounded in a pedagogy-based discourse framework that adapts and codifies 'transactivity' as proposed by Berkowitz and Gibbs (1983). The framework is a simplified adaptation of their pedagogy- and psychology-based coding scheme, applied to instructor posts and replies in MOOC discussion forums.

We also propose inter-annotator agreement measures for a piecewise crowdsourcing annotation task to annotate the forum discussions with our modified taxonomy of transactive interventions.

However, due to privacy concerns and copyright claimed by MOOC platforms such as Coursera.org, we have encrypted the data. We also reserve the rights for access, use and distribution of the data.

All rights reserved.

For access and use, please fill out the academic research purpose license form at http://bit.ly/wing-nus-mooc-transacts-corpus-request-form. We are personally liable to NUS and Coursera for the data. We will review your request and get back to you within five (5) business days.

Citation:

This dataset was proposed as part of the following Ph.D. thesis, which can be downloaded here: https://scholarbank.nus.edu.sg/handle/10635/155763

If you use the corpus for your research please cite:

@phdthesis{Chandrasekaranthesis2019,
    author = {MUTHU KUMAR CHANDRASEKARAN},
    school   = {National University of Singapore},
    title = {A DISCOURSE CENTRIC FRAMEWORK FOR FACILITATING INSTRUCTOR INTERVENTION IN MOOC DISCUSSION FORUMS},
    year = {2019},
}

Data

The repository contains serially annotated data for three natural language processing tasks on MOOC discussion threads (Task 1 and the two levels of Task 2 below).
Given a complete thread of posts from a MOOC forum up until an instructor intervenes (writes a post / comment), we ask annotators / an NLP system:

  • Task 1 (Marking Task): to link the instructor post to the earlier student post(s) to which it acts as a reply or as a comment.
  • Task 2 (Categorization Task): to categorize the pair thus identified with the most suitable type(s) from our predefined inventory of discourse types (see the table under Annotation Categories). Task 2 is subdivided into two steps: the post pair is first classified with a top-level category and then with a subcategory beneath the chosen top-level category.

Annotation Categories

Level 1 Category (Top level)  Level 2 Category (Low level)  Transactive?
Requests                      Feedback Request              Yes
Requests                      Justification Request         Yes
Elaborates                    Extension                     Yes
Elaborates                    Juxtaposition                 Yes
Elaborates                    Clarification                 Yes
Elaborates                    Refinement                    Yes
Elaborates                    Reasoning Critique            Yes
Resolves                      Completion                    Yes
Resolves                      Paraphrase                    Yes
Resolves                      Integration & Summing up      Yes
Resolves                      Agreement                     Yes
Resolves                      Disagreement                  Yes
Resolves                      Generic Answer                No
Resolves                      Appreciation                  No
Social                        Other logistics               No
Social                        Social                        No
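
For programmatic use, the table above can be encoded as a lookup from each low-level category to its top-level category and transactive flag. The following is only a sketch: the names TAXONOMY, top_level and is_transactive are hypothetical, rows without a Level 1 entry are assumed to belong to the most recent Level 1 category, and the label spellings may not match the strings stored in the CSV files.

```python
# Hypothetical lookup derived from the category table:
# low-level category -> (top-level category, is_transactive).
# Label spellings are assumptions and may differ from the CSV contents.
TAXONOMY = {
    "Feedback Request": ("Requests", True),
    "Justification Request": ("Requests", True),
    "Extension": ("Elaborates", True),
    "Juxtaposition": ("Elaborates", True),
    "Clarification": ("Elaborates", True),
    "Refinement": ("Elaborates", True),
    "Reasoning Critique": ("Elaborates", True),
    "Completion": ("Resolves", True),
    "Paraphrase": ("Resolves", True),
    "Integration & Summing up": ("Resolves", True),
    "Agreement": ("Resolves", True),
    "Disagreement": ("Resolves", True),
    "Generic Answer": ("Resolves", False),
    "Appreciation": ("Resolves", False),
    "Other logistics": ("Social", False),
    "Social": ("Social", False),
}


def top_level(low_level):
    """Return the top-level category for a low-level category label."""
    return TAXONOMY[low_level][0]


def is_transactive(low_level):
    """Return True if the low-level category is marked transactive."""
    return TAXONOMY[low_level][1]
```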

File Format

Annotated data, grouped by course and by forum within each course, is provided in an encrypted zip file: https://github.com/WING-NUS/NUS-MOOC-Transacts-Corpus/blob/master/data/nus-mooc-transacts-corpus-pswd-protected.zip

For example, the file of annotated threads from the 'Lecture' forum of the course warhol-001 is named warhol-001.lecture.1.csv
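
The naming pattern in the example can be split back into its parts when iterating over the unzipped files. This is a minimal sketch: the function and field names are hypothetical, and treating the trailing number as a part index is an assumption, since its meaning is not documented here.

```python
from pathlib import Path
from typing import NamedTuple


class CorpusFile(NamedTuple):
    course: str  # e.g. "warhol-001"
    forum: str   # e.g. "lecture"
    part: str    # trailing number; its meaning is an assumption


def parse_corpus_filename(path):
    """Split '<course>.<forum>.<number>.csv' into its three components."""
    name = Path(path).name
    pieces = name.rsplit(".", 3)  # split from the right: course, forum, number, ext
    if len(pieces) != 4 or pieces[3] != "csv":
        raise ValueError(f"unexpected corpus file name: {name}")
    return CorpusFile(pieces[0], pieces[1], pieces[2])


info = parse_corpus_filename("warhol-001.lecture.1.csv")
print(info.course, info.forum, info.part)
```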

Directory Structure:


|__ Task1-Marking_Task
|__ Task2-Categorisation_Task_low_lvl
|__ Task2-Categorisation_Task_top_lvl

Each file under 'Task1-Marking_Task' has the following column headers:


"HITId", "HITTypeId", "Title", "Description", "Keywords", "Reward", 
"CreationTime", "MaxAssignments", "RequesterAnnotation", "AssignmentDurationInSeconds",
"AutoApprovalDelayInSeconds", "Expiration", "NumberOfSimilarHITs", "LifetimeInSeconds",
"AssignmentId", "WorkerId", "AssignmentStatus", "AcceptTime", "SubmitTime",
"AutoApprovalTime", "ApprovalTime", "RejectionTime", "RequesterFeedback",
"WorkTimeInSeconds", "LifetimeApprovalRate", "Last30DaysApprovalRate", "Last7DaysApprovalRate",
"Input.threadtype", "Input.threadtitle", "Input.posts", "Input.inst_post",
"Answer.1", "Answer.2", ... "Answer.n" (where n is the total number of posts in the thread.)

Each Answer.x is either Marked or Unmarked by the annotator.
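
Since several workers annotate the same thread, one simple way to combine their Task 1 answers is a majority vote per post. The sketch below is an illustration only, not the aggregation used by the authors: the function name is hypothetical, and it assumes the cells contain the literal string "Marked" and that all rows passed in share the same HITId.

```python
import re
from collections import Counter

# Matches Task 1 answer columns such as "Answer.3" and captures the post number.
ANSWER_COLUMN = re.compile(r"^Answer\.(\d+)$")


def majority_marked_posts(assignments, min_share=0.5):
    """Return post numbers marked by more than `min_share` of the workers.

    `assignments` is a list of dict rows (one per worker) for a single HIT.
    """
    votes = Counter()
    for row in assignments:
        for column, value in row.items():
            match = ANSWER_COLUMN.match(column)
            if match and value == "Marked":  # cell value is an assumption
                votes[int(match.group(1))] += 1
    needed = len(assignments) * min_share
    return sorted(post for post, count in votes.items() if count > needed)


# Three made-up worker rows for one thread with two posts.
rows = [
    {"WorkerId": "w1", "Answer.1": "Marked", "Answer.2": "Unmarked"},
    {"WorkerId": "w2", "Answer.1": "Marked", "Answer.2": "Marked"},
    {"WorkerId": "w3", "Answer.1": "Unmarked", "Answer.2": "Unmarked"},
]
print(majority_marked_posts(rows))
```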

Each file under 'Task2-Categorisation_Task_top_lvl' has the following column headers:


"HITId", "HITTypeId", "Title", "Description", "Keywords", "Reward", 
"CreationTime", "MaxAssignments", "RequesterAnnotation", "AssignmentDurationInSeconds",
"AutoApprovalDelayInSeconds", "Expiration", "NumberOfSimilarHITs", "LifetimeInSeconds",
"AssignmentId", "WorkerId", "AssignmentStatus", "AcceptTime", "SubmitTime",
"AutoApprovalTime", "ApprovalTime", "RejectionTime", "RequesterFeedback",
"WorkTimeInSeconds", "LifetimeApprovalRate", "Last30DaysApprovalRate", "Last7DaysApprovalRate",
"Input.threadtype", "Input.threadtitle", "Input.posts", "Input.inst_post",
"Answer.1_discourse_type", ..., "Answer.X_discourse_type", "Answer.noreply", "Approve", "Reject"

Each Answer.X_discourse_type is a top-level discourse category (see the table above) for each Marked post from the previous task's output.

The file format for Task2-Categorisation_Task_low_lvl is similar, except that the discourse categories are chosen from the low-level discourse categories (see the table above).
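
The Task 2 columns can be aggregated per post in a similar spirit, for instance by taking the most frequent category across workers. This is a sketch under assumptions: the function name is hypothetical, each cell is assumed to hold a single category label, and empty cells are assumed to mean that the worker gave no label for that post.

```python
import re
from collections import Counter, defaultdict

# Matches Task 2 columns such as "Answer.2_discourse_type".
TYPE_COLUMN = re.compile(r"^Answer\.(\d+)_discourse_type$")


def most_common_category(assignments):
    """Map each post number to the label chosen by the most workers.

    `assignments` is a list of dict rows (one per worker) for a single HIT.
    Ties are broken in favour of the label that was seen first.
    """
    labels = defaultdict(Counter)
    for row in assignments:
        for column, value in row.items():
            match = TYPE_COLUMN.match(column)
            if match and value:  # skip empty cells
                labels[int(match.group(1))][value] += 1
    return {post: counts.most_common(1)[0][0]
            for post, counts in sorted(labels.items())}


# Three made-up worker rows labelling post 2 of one thread.
rows = [
    {"WorkerId": "w1", "Answer.2_discourse_type": "Elaborates"},
    {"WorkerId": "w2", "Answer.2_discourse_type": "Elaborates"},
    {"WorkerId": "w3", "Answer.2_discourse_type": "Resolves"},
]
print(most_common_category(rows))
```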

In all three file formats, the columns "Input.posts" and "Input.inst_post" are in HTML format. When processing the files, we strongly recommend dropping these columns so that the data and the annotations are easier to inspect.

The following columns are artefacts of the MTurk system and are unlikely to be of use for model development. We recommend dropping them as well before processing the annotations. The columns are:


"HITTypeId", "Title", "Description", "Keywords", "Reward", 
"CreationTime", "MaxAssignments", "RequesterAnnotation", "AssignmentDurationInSeconds",
"AutoApprovalDelayInSeconds", "Expiration", "NumberOfSimilarHITs", "LifetimeInSeconds",
"AssignmentId", "AssignmentStatus", "AcceptTime", "SubmitTime",
"AutoApprovalTime", "ApprovalTime", "RejectionTime", "RequesterFeedback",
"WorkTimeInSeconds", "LifetimeApprovalRate", "Last30DaysApprovalRate", "Last7DaysApprovalRate",

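The recommended clean-up (dropping the two HTML columns and the MTurk bookkeeping columns listed above) can be sketched with the standard csv module. The helper name drop_columns and the tiny in-memory sample are hypothetical; a real corpus file would be opened from disk after decrypting the zip archive.

```python
import csv
import io

HTML_COLUMNS = ["Input.posts", "Input.inst_post"]
MTURK_COLUMNS = [
    "HITTypeId", "Title", "Description", "Keywords", "Reward",
    "CreationTime", "MaxAssignments", "RequesterAnnotation",
    "AssignmentDurationInSeconds", "AutoApprovalDelayInSeconds",
    "Expiration", "NumberOfSimilarHITs", "LifetimeInSeconds",
    "AssignmentId", "AssignmentStatus", "AcceptTime", "SubmitTime",
    "AutoApprovalTime", "ApprovalTime", "RejectionTime",
    "RequesterFeedback", "WorkTimeInSeconds", "LifetimeApprovalRate",
    "Last30DaysApprovalRate", "Last7DaysApprovalRate",
]


def drop_columns(rows, columns):
    """Return a list of row dicts without the unwanted columns."""
    unwanted = set(columns)
    return [{key: value for key, value in row.items() if key not in unwanted}
            for row in rows]


# Made-up one-row CSV standing in for a corpus file.
sample = io.StringIO(
    "HITId,Title,WorkerId,Input.posts,Answer.1\n"
    "h1,t,w1,<p>x</p>,Marked\n"
)
rows = drop_columns(csv.DictReader(sample), HTML_COLUMNS + MTURK_COLUMNS)
print(rows)
```

Note that "HITId" and "WorkerId" are not in the drop list, so the remaining rows can still be grouped by thread and by annotator.
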
Annotators

All annotators were crowdworkers recruited from the Amazon Mechanical Turk (MTurk) platform. Each thread was annotated by 7 workers. You can aggregate the categories and calculate inter-annotator agreement (IAA) using the scripts https://github.com/cmkumar87/NUS-MOOC-Transacts-Corpus/blob/master/scripts/analysis/fleiss_kappa_per_post.py and https://github.com/cmkumar87/NUS-MOOC-Transacts-Corpus/blob/master/scripts/analysis/fleiss_kappa_per_thread.py.
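
For readers who want to see what such an agreement score involves, the following is a textbook Fleiss' kappa in plain Python. It is a sketch only and is not taken from the scripts linked above, which may compute agreement differently; the function name and input layout (one list of labels per item, with the same number of raters for every item) are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    `ratings` is a list of items; each item is a list of category labels,
    one label per rater. Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * (c - 1) for c in counts.values())
                          / (n_raters * (n_raters - 1)))

    observed = agreement_sum / n_items
    expected = sum((total / (n_items * n_raters)) ** 2
                   for total in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Two made-up items rated by three raters who fully agree on each.
print(fleiss_kappa([["A", "A", "A"], ["B", "B", "B"]]))
```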

If you have questions about or issues with preprocessing and/or IAA, please raise a GitHub issue.

Acknowledgements

The corpus creation was partially funded by the National University of Singapore (NUS) Office of the Provost through the Learning Innovation Fund - Technology (LIF-T) grant # C-252-000-123-001.