[FEAT] New dataset of parser IF

Question

Opened this issue 5 years ago · 0 comments

🚀 Feature Request

This webpage has gameplay transcripts for a huge variety of parser-based Interactive Fiction (IF) games.
With some processing, this seems like a good source of training data, though some of the edits would probably need to be done manually.

Processing the game transcripts should include replacing commands such as '> x desk' with prose such as 'You look at the desk', removing commands the parser did not recognise along with their responses, and removing the out-of-world "about" sections of the transcripts.
IRC chat is also interspersed with the game transcript on the page, but I assume this is easy to filter out for someone who knows what they're doing. Alternatively, the webmaster may have separate logs.
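
As a rough illustration of the first two processing steps, here is a minimal sketch in bash and awk. It assumes the IRC chat has already been filtered out, that every command line starts with `> `, and that everything up to the next command is the game's response. The function name `normalise_transcript`, the abbreviation table and the parser-error phrasings are all assumptions made for illustration; real transcripts will need a longer list, and the out-of-world "about" sections are not handled here.

```bash
#!/bin/bash
# Hypothetical helper: reads a cleaned transcript on stdin, writes to stdout.
# Assumes commands start with "> " and responses follow until the next command.
normalise_transcript() {
  awk '
    # Print the buffered command/response block unless it was rejected.
    function emit_block() {
      if (have_cmd && !rejected) printf "%s", block
      block = ""; rejected = 0; have_cmd = 0
    }
    /^> / {
      emit_block()
      cmd = substr($0, 3)
      out = $0
      # Expand a few common abbreviations (illustrative, not exhaustive).
      if (cmd ~ /^x /)     out = "You look at the " substr(cmd, 3) "."
      else if (cmd == "l") out = "You look around."
      else if (cmd == "i") out = "You check your inventory."
      block = out "\n"; have_cmd = 1
      next
    }
    # Assumed parser-error phrasings; real games vary.
    /not a verb I recogni[sz]e|You can.t see any such thing/ { rejected = 1 }
    # Buffer response lines; pass through anything before the first command.
    { if (have_cmd) block = block $0 "\n"; else print }
    END { emit_block() }
  '
}
```

It could be chained after the IRC filtering step, for example `normalise_transcript < ./Dumps/game.txt`, where `game.txt` is a placeholder file name. Commands that are not in the abbreviation table are left unchanged, so they can still be edited by hand.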

With this data, I would hope the AI could get better at describing objects, people and places, since IF does a lot of that, while the current data seems to be stronger on conversations and actions.

I started writing a script to download the transcripts and filter out the IRC chat, but I've realised my approach isn't a great one. I'm not very proficient with the command line, so I'll leave this to more capable hands, if there are any takers.

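```bash
#!/bin/bash
# Usage: <script> <transcript-url> <output-name>
# Requires elinks and GNU sed (the "i" flag on s/// is a GNU extension).
mkdir -p ./Dumps
# Dump the page as text, keep only Floyd's output and lines addressed to Floyd,
# then strip the IRC framing so commands appear as "> command".
elinks -dump -dump-width 999 "$1" \
| grep -E " Floyd \||(to floyd)|(to Floyd)" \
| sed -e 's/[[:space:]]*Floyd [|][[:space:]]*//g' \
-e 's/[[:space:]][[:alnum:]]* says (to [Ff]loyd),/>/gi' \
-e 's/^>$//g' \
-e '/>/ { s/"//g}' \
-e 's/^\*.*el.$//g' \
-e 's/^ *> /> /g' \
-e 's/        .*//g' \
> "./Dumps/$2.txt"
```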