kochkinaelena/branchLSTM

Check install-run process

Closed this issue · 4 comments

I'm starting with a clean installation on my computer - this issue is to document any changes that have to be made to the code, file organisation or documentation in order to get the Lasagne version of branchLSTM working.

1bebd4c introduces a .gitignore file and makes a couple of changes to preprocessing.py:

  • The punkt NLTK data is downloaded and stored in the same directory as the code
  • The method for loading the GoogleNews data has been updated (the original method was deprecated in an earlier version of gensim)
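
For anyone following along, here is a minimal sketch of what those two changes can look like. This is not the code from 1bebd4c: the function names and the existence check are assumptions made for illustration. The third-party imports sit inside the functions so the module can be loaded before NLTK and gensim are installed.

```python
import os


def download_punkt(code_dir):
    """Download the punkt tokenizer data into code_dir and let NLTK search there."""
    import nltk  # lazy import: only needed when this function is called

    nltk.download("punkt", download_dir=code_dir)
    if code_dir not in nltk.data.path:
        nltk.data.path.append(code_dir)


def load_googlenews(path):
    """Load binary word2vec vectors via gensim's KeyedVectors loader."""
    # Fail early with a clear message (an assumption of this sketch, not repo behaviour).
    if not os.path.isfile(path):
        raise FileNotFoundError("GoogleNews vectors not found at " + path)
    from gensim.models import KeyedVectors  # lazy import, see above

    # Older gensim releases exposed this loader on Word2Vec; that form was deprecated.
    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Usage would be along the lines of `download_punkt(os.getcwd())` followed by `load_googlenews(<path to the downloaded GoogleNews .bin file>)`; the actual paths depend on where the data was saved.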

One remaining problem: preprocessing.py expects there to be a replies folder for each tweet, but Git does not track empty directories, so an empty replies folder is never uploaded to GitHub. So we have two options:

  • Ask the user to download the data directly from the SemEval-2017 website (which we already do for the GoogleNews data)
  • Add a file to every subfolder in the semeval2017-task8-dataset directory to ensure that otherwise empty folders are still uploaded to GitHub
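
For reference, the second option could be sketched roughly as below. The `.gitkeep` name is only a common convention (Git gives it no special meaning), and `add_placeholders` is a hypothetical helper, not something in the repository.

```python
import os


def add_placeholders(root, name=".gitkeep"):
    """Create an empty placeholder file in every empty directory under root.

    Returns the sorted list of placeholder paths that were created.
    """
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory is "empty" here if it has no sub-directories and no files.
        if not dirnames and not filenames:
            placeholder = os.path.join(dirpath, name)
            open(placeholder, "w").close()
            created.append(placeholder)
    return sorted(created)
```

One drawback of this route is that any code listing the contents of a replies folder would then have to skip the placeholder file.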

I'm inclined to go with the first option, but let me know if you have any opinions.

Yes, first option seems fine to me.

22f80eb introduces a requirements file for easy installation (a couple of follow-up commits got this working when not logged into GitHub).

bcf0ec7 adds the option to perform a test run of outer.py with unrealistic parameter values, and removes the SemEval-2017 datasets as discussed in the previous comments.

5d4ef03 adds instructions for downloading the required datasets via the command line.

I've now got preprocessing.py and outer.py running on my main computer, my laptop and a basic Ubuntu VM on Azure, and the scorerA script successfully analyses the output. Now on to the final task related to this issue: get the GPU version running and document how to do this.

We now have a full set of installation instructions:
Initial version: db7b262 4ef756c 1ddc9e1
Updated version (switch to a general Azure VM and older version of Theano): cb8fbff a281021

I revised the versions in the requirements file to be closer to what Elena used for the SemEval task in 593d5e2.

A couple of small issues which may affect reproducibility have been fixed; any outstanding problems will be dealt with in a separate issue.

  • Order in which tweets are loaded cbd0b97
  • Hardcoded Theano options f5e97bf
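
To illustrate why load order matters for the first point: `os.listdir` returns entries in arbitrary order, so the same dataset can be read in a different sequence on different machines. A minimal sketch of a deterministic listing follows. `list_tweet_dirs` is a hypothetical helper and is not necessarily how cbd0b97 fixes it.

```python
import os


def list_tweet_dirs(folder):
    """Return the names of the sub-directories of folder in sorted order."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isdir(os.path.join(folder, name))
    )
```

Sorting the names gives every machine the same sequence, at the cost of a lexicographic rather than numeric ordering of tweet IDs.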

I added Elena's script for generating Tables 3 and 4 from the paper in 6687689, which means we now have all the necessary scripts to repeat Elena's work. The script needs some changes now that the file organisation has been altered in the commits described earlier in this issue, so I'll open a separate issue to focus on that script.