/squid

Primary language: Python

An extensible, clean implementation of DocumentQA, and a basis for developing RCQA models

Code style: black

Work plan:

  • Prepare harness for tokenization, batch building and evaluation
  • Build a basic LSTM -> Dense -> spans & no-answer model to get the whole training/testing process running
  • Think about data cleanup, tokenization and all the other shenanigans of working with SQuAD
    • Lowercasing
    • Dealing with abbreviations
    • Dealing with numbers, dates etc
  • Add encoding of character-level info as well as word-level info
  • Add unit testing for core components
  • Make GPU compatible
  • Add option to read in a single answer span per question for training
  • Make a distinction between train and non-train datasets for proper handling of char/word -> idx mappings
  • Write dev validation during training
  • Implement BiDAF on top
  • Implement self attention as described in DocQA
  • Implement memory and runtime profiling
  • Add max context size
  • Do proper dropout
  • Test implementation with self attention
  • Use better-structured config objects instead of passing around the bajillion parameters used now
  • Implement char CNN for char embeddings
  • Reproduce DocQA performance
  • Add the option to output no-answer probabilities with the output
  • Add encoding of sentence-level info
  • Integrate ELMo vectors
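
To illustrate the work-plan item on separating train and non-train datasets for char/word -> idx mappings, here is a minimal sketch of one way this could work. It is not taken from this repository: the `Vocab` class, its method names, and the `<pad>`/`<unk>` tokens are all hypothetical. The idea is that the mapping grows only while reading training data, is then frozen, and any token first seen in dev/test data falls back to a shared unknown index.

```python
from typing import Dict, Iterable, List

# Hypothetical reserved tokens; the actual project may use different ones.
PAD, UNK = "<pad>", "<unk>"


class Vocab:
    """Token -> index mapping that can grow only until it is frozen."""

    def __init__(self) -> None:
        self.token_to_idx: Dict[str, int] = {PAD: 0, UNK: 1}
        self.frozen = False

    def add(self, tokens: Iterable[str]) -> None:
        """Register tokens from the training set, assigning new indices."""
        if self.frozen:
            raise RuntimeError("cannot add tokens to a frozen vocab")
        for tok in tokens:
            # Existing tokens keep their index; new ones get the next free one.
            self.token_to_idx.setdefault(tok, len(self.token_to_idx))

    def freeze(self) -> None:
        """Stop the mapping from growing (call before reading dev/test data)."""
        self.frozen = True

    def encode(self, tokens: Iterable[str]) -> List[int]:
        """Map tokens to indices; unseen tokens map to the UNK index."""
        unk = self.token_to_idx[UNK]
        return [self.token_to_idx.get(tok, unk) for tok in tokens]


# Build the mapping from (toy) training text only, then freeze it.
word_vocab = Vocab()
word_vocab.add("who wrote the book".split())
word_vocab.freeze()

train_ids = word_vocab.encode("who wrote the book".split())   # [2, 3, 4, 5]
dev_ids = word_vocab.encode("who wrote the sequel".split())   # [2, 3, 4, 1]
```

The same class could serve for character-level mappings by passing characters instead of words. Freezing before touching dev data keeps the embedding table size fixed and avoids indices that the model never saw during training.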