Want to learn how to use Python for Text Mining / Natural Language Processing (NLP)?
This repository has everything that you need to get started!
Author: Ties de Kok (Personal Page)
These materials accompany a PhD session on NLP for Accounting Research: slides
Quick link to the notebook: open notebook
The goal of this GitHub page is to provide you with everything you need to get started with Python and Text Mining.
The following topics are discussed:
(Note: the neural network part is only a reference to the Stanford course CS224n (Syllabus))
The topics and techniques demonstrated in this repository are primarily oriented towards empirical research projects in fields such as Accounting, Finance, Marketing, Political Science, and other Social Sciences.
However, many of the basics are also perfectly applicable if you are looking to use Python for any other type of Data Science!
This repository is written to facilitate learning by doing.
To that end, all the materials are written up in a Jupyter Notebook. See: NLP_notebook.ipynb
The topics are split up by task description.
It is best to view the notebook locally or on nbviewer using this link: click here
Please check out my "Getting started with Python for Research" repository: click here
To run the provided notebook you need to use the Jupyter Notebook.
Jupyter comes pre-installed with the Anaconda distribution so you should have everything already installed and ready to go.
What is the Jupyter Notebook?
From the Jupyter website:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
In other words, the Jupyter Notebook allows you to program Python code straight from your browser!
How does the Jupyter Notebook work in the background?
The diagram below sums up the basic components of Jupyter:
At the heart there is the Jupyter Server that handles everything, the Jupyter Notebook which is accessed and used through your browser, and the kernel that executes the code. We will be focusing on the natively included Python Kernel but Jupyter is language agnostic so you can also use it with other languages/software such as 'R'.
It is worth noting that in most cases you will be running the Jupyter Server on your own computer and will connect to it locally in your browser (i.e. you don't need to be connected to the internet). However, it is also possible to run the Jupyter Server on a different computer, for example a high performance computation server in the cloud, and connect to it over the internet.
How to start a Jupyter Notebook?
The primary method that I would recommend to start a Jupyter Notebook is to use the command line (terminal) directly:
- Open your command prompt / terminal (on Windows I recommend the Anaconda Prompt)
- cd (i.e. change directory) to the desired starting directory, for example: cd "C:\Files\Work\Project_1"
  Note: if you are changing to a folder on another drive you might also have to switch drives by typing, for example, E:
- Start the Jupyter Notebook server by typing: jupyter notebook
This should automatically open up the corresponding Jupyter Notebook in your default browser.
You can also open the Jupyter Notebook manually by going to localhost:8888 in your browser.
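If the browser does not open automatically, it can help to check whether anything is listening on that address before troubleshooting further. The snippet below is a minimal sketch using only the Python standard library; the helper name is_port_open is hypothetical, and the default port 8888 is taken from the text above. It only tests that a TCP connection succeeds, not that the listener is actually Jupyter.

```python
import socket


def is_port_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection raises an OSError subclass if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Check the default Jupyter address mentioned above
print("Server reachable on localhost:8888:", is_port_open())
```

If this reports False while the server is running, check the terminal output of the jupyter notebook command: it prints the address the server is actually using, which may be a different port when 8888 is already taken.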
How to close a Jupyter Notebook server?
If you want to close down the Jupyter Server: open up the command prompt window that runs the server and press CTRL + C twice.
Make sure that you have saved any open Jupyter Notebooks!
How to use the Jupyter Notebook?
I recommend watching this excellent YouTube video: Awesome Data Science: 1.0 Jupyter Notebook Tour
Some shortcuts are worth mentioning for reference purposes:
command mode --> enable by pressing esc
edit mode --> enable by pressing enter

| command mode | edit mode | both modes |
|---|---|---|
| Y : cell to code | Tab : code completion or indent | Shift-Enter : run cell, select below |
| M : cell to markdown | Shift-Tab : tooltip | Ctrl-Enter : run cell |
| A : insert cell above | Ctrl-A : select all | |
| B : insert cell below | Ctrl-Z : undo | |
| X : cut selected cell | | |
You can essentially "download" the contents of this repository by cloning the repository.
You can do this by clicking the "Clone or download" button and then "Download ZIP".
If you extract the downloaded ZIP to a folder you can start the Jupyter Notebook in that folder and access the notebook.
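If you prefer to do the extraction step from Python rather than with your file manager, the standard library is enough. The snippet below is a rough sketch: the function name extract_repository and the file names in the usage comment are hypothetical, so substitute the actual name of the ZIP you downloaded.

```python
import zipfile
from pathlib import Path


def extract_repository(zip_path, target_dir):
    """Extract a downloaded ZIP into target_dir and return the notebooks it contains."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # List every notebook file found anywhere under the extracted folder
    return sorted(target.rglob("*.ipynb"))


# Hypothetical usage, assuming the ZIP was saved as "repository.zip":
# notebooks = extract_repository("repository.zip", "nlp_materials")
# print(notebooks)
```

The returned paths tell you which folder to cd into before starting the Jupyter Notebook server.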
There are a couple of packages not included with the Anaconda distribution that are used in the notebook:
- NLTK (make sure to install the language data)
- TextBlob (make sure to install the language data)
- spaCy (make sure to install the model data)
- Textacy
- pyLDAvis
- langdetect
- fuzzywuzzy
- textstat
(for the word2vec example I use the gensim package)
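Before opening the notebook, you may want to check which of these packages are already available in your environment. The snippet below is a small sketch using only the standard library; the helper name find_missing is hypothetical, and the import names in the list are assumptions based on the package names above (the import name can differ from the name used to install a package).

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above
REQUIRED_MODULES = [
    "nltk",
    "textblob",
    "spacy",
    "textacy",
    "pyLDAvis",
    "langdetect",
    "fuzzywuzzy",
    "textstat",
    "gensim",
]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Note that this only checks that the packages themselves are installed. It does not verify that the language data for NLTK and TextBlob or the model data for spaCy have been downloaded.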
If you have questions or experience problems please use the issues tab of this repository.
MIT - Ties de Kok - 2018
Thanks to https://github.com/teles/array-mixer for having an awesome readme that I used as a template.