dataset_moral.maze

A dataset of dialogue data drawn from the BBC Radio 4 Moral Maze discussion show.

The project web pages are located at: http://siwells.github.io/dataset_moral.maze/

Cleaning Process:

  • Extract the individual episodes from the Word document
  • Check that there are no instances of the '|' character within the text. NB: because the data consists of free text, it is difficult to select a delimiter that is guaranteed not to occur, but the pipe is a safe bet
  • Create a pipe-delimited file of line-oriented data in the following form:
      speaker | utterance
  • Use the csvtojson.py script in /tools to convert each CSV file into a JSON document
  • Create a JSON document for each episode that has the structure shown in schema/datafile.json
  • Calculate an MD5 sum for each episode, e.g. with the md5 command found on macOS and BSD:
      for f in *.json; do md5 "$f" > "$(basename "$f" .json).md5"; done
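
The steps above can be illustrated with a minimal Python sketch. It is not the project's csvtojson.py, and the JSON layout it produces is a hypothetical list of speaker/utterance records rather than the structure defined in schema/datafile.json. The function names `parse_episode` and `md5_of` and the sample lines are placeholders invented for illustration.

```python
import hashlib
import json


def parse_episode(text):
    """Turn 'speaker | utterance' lines into a list of dicts (hypothetical layout)."""
    turns = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        # A clean line holds exactly one pipe: the delimiter itself.
        if line.count("|") != 1:
            raise ValueError(f"line {number}: expected exactly one '|'")
        speaker, utterance = (part.strip() for part in line.split("|"))
        turns.append({"speaker": speaker, "utterance": utterance})
    return turns


def md5_of(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Placeholder input, not taken from the dataset.
sample = (
    "SPEAKER_A | First placeholder utterance.\n"
    "SPEAKER_B | Second placeholder utterance.\n"
)

turns = parse_episode(sample)
document = json.dumps(turns, indent=2)
print(document)
print(md5_of(document.encode("utf-8")))
```

Rejecting any line that does not contain exactly one pipe covers the delimiter check: a stray '|' inside an utterance is reported with its line number instead of silently splitting the text in the wrong place.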