This class is meant to teach Data Science students how to work with text data using tools from the Python ecosystem.
List of topics to be covered in this class:
- Dealing with text data: bytes, charsets and unicode symbols
- Scraping data from web pages
- Bag-of-words and supervised text classification
- Indexing full-text documents
- Clustering: grouping similar text documents
- Unsupervised Latent space models for semantic information retrieval
Some possible topics for future extended versions of this class:
- Word embeddings and supervised text classification with continuous bag of words
- Introduction to Knowledge Bases
- Using pre-trained Named entity detection and Semantic role labeling to extract knowledge from text
- Sequence to Sequence for machine translation
- Machine Reading
Install git, clone this repository and change the working directory to the newly created folder:
git clone https://github.com/ogrisel/text-mining-class
cd text-mining-class/
If you forked this repository on github (recommended to be able to do the code review exercises), you should register your own github fork as a remote repository:
git remote add myname https://github.com/myname/text-mining-class
git remote -v
You should see 2 remote repo: one named "origin" and pointing to
ogrisel/text-mining-class.git
and the other with your github name and
pointing to your own github fork.
Install Python 3.6 or later. If you have never installed Python before, you can use the Anaconda distribution.
Open a terminal and check that you can use your newly installed python3 command:
python3 -V
Make sure that this command points to the location you where you installed anaconda and not a system installation of Python.
To check where the python3
command comes from (depending on your PATH
environment variable), use one of the following commands:
Under Windows:
where python3
or under Linux and macOS:
which python3
You can also do the same for conda
and later for the pip
command.
You can alternatively use the following commands:
python3 -c "import sys; print(sys.executable)"
Finally you can check the configuration of conda with:
conda info
Then install the project dependencies in a dedicated conda environment named
tmclass
(see the content of the environment.yml
file for details:
conda env create -f environment.yml
Then activate the tmclass
conda environment:
Under Windows:
activate tmclass
or under Linux / macOS:
source activate tmclass
Alternatively, it is also possible to use a virtualenv and pip instead of conda. See the end of this file for more details.
Install the 'text-mining-class-exercises' python package in "editable" mode:
pip install --editable exercises/
Also install the 'text-mining-class-solutions' python package side-by-side:
pip install --editable solutions/
At this point it should be possible to run the tests of the solutions and they should all pass:
pytest solutions/
On the contrary the tests of the exercises should not pass (they are marked
xfail
):
pytest exercises/
Some tests require test data to run properly and are skipped if the data is missing. You can pre-download those datasets with the following command (a few 10s of MB in total).
python -m tmclass_exercises.data_download
Install Visual Studio Code (vscode) and the official Python extension .
Open this project folder ("text-mining-class") in the editor.
Make sur to select the tmclass
conda environment in the bottom left of
the main window. You can switch with Shitf-Ctrl-P
> Python: Select Interpreter
.
Configure you editor to enable the flake8
linter (code checker)
instead of pylint
and to run the tests with pytest
.You can press
Ctrl-,
to open the user settings editor. Set the following options:
{
"python.unitTest.pyTestEnabled": true,
"python.unitTest.pyTestArgs": ["-vv"],
"python.linting.pylintEnabled": false,
"python.linting.flake8Enabled": true
}
Open the text-mining-class
folder in your editor. The exercises can be found
in the exercises/tmclass_exercises/
folder.
Use the Ctrl-P
command to open the file navigation menu, start to type
"manipulation" and select the open the following file in a new tab:
exercises/tmclass_exercises/text_manipulation.py
Similarly open a new tab the test file named:
exercises/tmclass_exercises/tests/test_text_manipulation.py
The first exercises is to implement the function named code_points
in
text_manipulation.py
.
Read the instructions in the file and run the first test for this function:
pytest -vv -k test_code_points exercises
Alternatively you can use vscode
to launch the tests. Here are some useful
commands to try in the launch palette Shift-Ctrl-P
:
>Python: Discover Unit Tests
>Python: Run Current Unit Test File
>Python: Run Unit Test Method...
>Python: Debug Unit Test Method...
(set a break point in the code first)
You can also use Ctrl-j
to toggle the output view panel.
Notice that the test is marked xfail
. To actually see the error message,
remove the @pytest.mark.xfail
for the first test in text_manipulation.py
and try again.
Then implement the code_points
function in text_manipulation.py
and launch
the first test again and iterate until it passes
Once done, move to the next test. You can run all the tests of given group of exercises as follows:
pytest -vv exercises/tmclass_exercises/tests/test_text_manipulation.py
TODO:
Instead of using a conda env and the conda command, it is also possible to use a virtualenv and install dependencies using pip:
python3 -m venv tmclass
Then under Windows:
tmclass/Scripts/activate
or under Linux and macOS:
source tmclass/bin/activate
Finally, install the dependencies:
python3 -m pip install -r requirements.txt