The task for this project is to segment a sequence of English characters into the most likely word sequence.
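As a starting point, the segmentation task can be approached with a unigram language model: score every possible split of the input and keep the most probable one. The sketch below is only illustrative, not the required solution; the toy counts, the unknown-word penalty, and the smoothing are all assumptions (the real counts come from `data/count_1w.txt`).

```python
import math
from functools import lru_cache

# Toy unigram counts; in the assignment these come from data/count_1w.txt.
counts = {"the": 23135851162, "table": 9596890,
          "down": 1247347054, "there": 701170205}
total = sum(counts.values())

def pword(word):
    # Unigram probability, with a crude length-based penalty for words
    # absent from the counts (an assumed smoothing scheme, not the
    # required one).
    if word in counts:
        return counts[word] / total
    return 1.0 / (total * 10 ** len(word))

@lru_cache(maxsize=None)
def segment(text):
    # Return the most probable word sequence for `text` under the model.
    if not text:
        return ()
    splits = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(splits, key=lambda words: sum(math.log(pword(w)) for w in words))

print(" ".join(segment("thetabledownthere")))  # → the table down there
```

Memoizing `segment` with `lru_cache` keeps the search polynomial instead of exponential, since each suffix of the input is segmented only once.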
This is mainly to set up your groups and programming environment.
Make sure you set up your virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
You can optionally copy and modify the requirements for when we test your code:
cp requirements.txt answer/requirements.txt
You must create the following files:
answer/ensegment.py
answer/ensegment.ipynb
To create the output.zip file for upload to Coursys do:
python3 zipout.py
For more options:
python3 zipout.py -h
To create the source.zip file for upload to Coursys do:
python3 zipsrc.py
For more options:
python3 zipsrc.py -h
To check your accuracy on the dev set:
python3 check.py
For more options:
python3 check.py -h
In particular, use the log file to inspect how your output was evaluated:
python3 check.py -l log
The accuracy on data/input/test.txt will not be shown. We will evaluate your output on the test input after the submission deadline.
The default solution is provided in default.py. To use the default as your solution:
cp default.py answer/ensegment.py
cp default.ipynb answer/ensegment.ipynb
python3 zipout.py
python3 check.py
Make sure that the command line options are kept as they are in default.py. You can add to them, but you must not delete any command line options that exist in default.py.
Submitting the default solution without modification will get you zero marks.
The data files provided are:

data/count_1w.txt -- counts taken from the Google n-gram corpus with 1TB of tokens
data/input -- input files dev.txt and test.txt
data/reference/dev.out -- the reference output for the dev.txt input file
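A minimal sketch for loading the unigram counts into a dictionary. This assumes count_1w.txt uses tab-separated "word<TAB>count" lines, which is the usual layout of this file; check the actual data to confirm.

```python
def read_counts(path):
    # Parse tab-separated "word<TAB>count" lines into a dict
    # (assumed file format; verify against data/count_1w.txt).
    counts = {}
    with open(path) as f:
        for line in f:
            word, count = line.split("\t")
            counts[word] = int(count)
    return counts
```

The resulting dictionary can then back whatever probability model your segmenter uses.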