
2nd project of Language Understanding Systems @ UniTN


Concept Sequence Tagging for a Movie Domain

2nd project of Language Understanding Systems, Fall Semester 2017, University of Trento.

The project consists of building a model for concept sequence tagging in a movie domain. The task is similar to the first project, but this time we use Conditional Random Fields instead of Finite State Transducers. All results are discussed in the report file.

Repository Structure

The repository is organized in two main folders: report and code. The report folder contains the LaTeX source files of the report. The code folder contains the scripts used to train and test the models, plus some utilities used to parse the results and compute statistics.

Model

The code/model folder contains the CRF++ template files. The code/scripts/run.sh script can be used to train a given model, for example:

./code/scripts/run.sh 03_advanced/10

The script will automatically create the code/computations folder. Inside this folder you will find a folder for each model you trained.
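For readers unfamiliar with CRF++ templates, the following is a minimal sketch of how a `%x[row,col]` macro in a template line is resolved against the token at the current position. It is an illustration only, not CRF++'s own code: the `expand` helper, the sample rows, the tag name, and the `_PAD` placeholder for out-of-sentence positions are all assumptions made for this example.

```python
import re

# Matches a CRF++-style macro such as %x[-1,0] (relative row, absolute column).
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, rows, position):
    """Expand every %x[row,col] macro relative to the token at `position`."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        # Simplified placeholder for positions outside the sentence
        # (an assumption of this sketch, not CRF++'s actual marker).
        return "_PAD"
    return MACRO.sub(lookup, template_line)


# Hypothetical training rows: one token per row, columns are (word, tag).
sentence = [("who", "O"), ("directed", "O"), ("alien", "B-movie.name")]

# Hypothetical template lines, expanded for the last token of the sentence.
templates = ("U00:%x[0,0]", "U01:%x[-1,0]/%x[0,0]", "U02:%x[1,0]")
features = [expand(line, sentence, 2) for line in templates]
print(features)  # ['U00:alien', 'U01:directed/alien', 'U02:_PAD']
```

Each expanded string becomes one feature for the current token, so adding or removing template lines changes which context the CRF can see.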

Genetic Algorithm

The code/scripts/genetic.py file implements a genetic algorithm that automatically selects the best features for the CRF model. The script is written in Python 3 and requires the DEAP library.

By default, the script requires an initial population inside the code/scripts/initial_population folder. Please create this folder before running the script and populate it with some models generated by hand. Please make sure to invoke the script from the code/scripts folder.
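The idea of selecting features with a genetic algorithm can be sketched as follows. This is not the code in genetic.py and it deliberately uses only the standard library instead of DEAP. The `evolve` function, the bit-mask encoding (one bit per candidate feature), and the `toy_fitness` function are assumptions made for illustration. In a real run the fitness would presumably come from training a CRF with the selected features and scoring it, and the initial population would come from hand-made models rather than random masks.

```python
import random


def evolve(fitness, n_features, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Return the best feature mask found by a simple elitist genetic algorithm.

    Assumes n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    # Random initial population of 0/1 masks (an assumption of this sketch).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = [ranked[0][:]]          # elitism: keep the best mask unchanged
        parents = ranked[: pop_size // 2]  # truncation selection
        while len(next_gen) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


# Toy stand-in for "train the model and return its score" (hypothetical).
USEFUL = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    """Count how many bits agree with the hypothetical ideal mask."""
    return sum(1 for bit, target in zip(mask, USEFUL) if bit == target)


best = evolve(toy_fitness, n_features=len(USEFUL))
print(best, toy_fitness(best))
```

Because the best individual is copied into every new generation, the best score never decreases from one generation to the next. This matters when each fitness evaluation is as expensive as a full training run.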

Licence

The model source code is licensed under the MIT License. A copy of the license is available in the LICENSE file. The LaTeX sources and the report are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.