- a dataset to be analysed!
- an idea of what you want to get out of the model
- get tensorflow installed
- get familiar with tensorflow in an example or two
- process the data, split into training and test sets
- initial modelling and plotting of results
- iterative improvements to above
- profit!
Tensorflow is Python based, and I've found the following tools useful when doing data-science things in Python:
- Virtualenv: allows multiple versions of libraries to be installed in parallel, basically alleviating "dependency hell".
- The usual "Scientific Python" toolkit, installed in a fresh virtualenv: `pip install -U numpy scipy pandas matplotlib jupyter`. A Jupyter notebook "allows you to create and share documents that contain live code, equations, visualizations and narrative text". Apparently they even have baked-in support for Tensorflow these days!
- Git and Github: working with code in a source control system makes the world a better place.
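As a quick smoke test that the environment is wired up, something along these lines should run cleanly inside the fresh virtualenv (a minimal sketch, nothing here is specific to the project):

```python
# Smoke test: import the scientific stack and render a trivial plot.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": np.linspace(0, 1, 50)})
df["y"] = np.sin(2 * np.pi * df["x"])
df.plot(x="x", y="y")
plt.savefig("smoke_test.png")
print("pandas", pd.__version__)
```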
Tensorflow provides example models, and these seem like a good place to start. See: https://github.com/tensorflow/models
It's quite a big repository, so I'd suggest a shallow clone (still ~360MB), and then proceed as suggested above:

```
git clone --depth 10 https://github.com/tensorflow/models.git
cd models/samples/core/get_started/
python premade_estimator.py
```
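For flavour, the heart of the "premade estimator" pattern looks roughly like this. This is a sketch assuming the TensorFlow 1.x Estimator API; the feature name, shapes and layer sizes are illustrative rather than taken from the sample itself:

```python
import tensorflow as tf

# Describe the inputs: a single numeric feature with four values per example.
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

# A canned (premade) deep classifier: no model code to write.
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],  # two small hidden layers
    n_classes=3,
)

def input_fn():
    # Tiny random in-memory batch; a real input_fn would stream from disk.
    features = {"x": tf.random_normal([32, 4])}
    labels = tf.random_uniform([32], maxval=3, dtype=tf.int32)
    return features, labels

classifier.train(input_fn=input_fn, steps=100)
```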
The problem we want to solve is automatically classifying a telephone conversation between two NHS health professionals. Because that data is pretty sensitive, I'm going to be working with a similar "open dataset" in this example. I found the following subreddit useful for finding some data to play with:
I'm starting with the Cornell Movie-Dialogs Corpus, mostly because it's conversational data and it includes a set of genres that we can try to predict from the conversations.
As is the case with most data, the people producing it have invented their own weird, wonderful and idiosyncratic file format. It's a delimiter-separated format, like CSV, but with the delimiter `+++$+++`, for some reason. Various fields also contain a JSON-like "array of strings", but of course it's invalid JSON, and it needs to be munged before any standard JSON parser will touch it.
That said, it's pretty easy to coerce pandas into handling these files and end up with a mostly conventional set of DataFrames.
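For instance, something along these lines works. A sketch, assuming the corpus's movie_titles_metadata.txt file and the column names from its README; note that `ast.literal_eval` happens to accept the single-quoted lists that upset JSON parsers:

```python
import ast
import csv
import pandas as pd

# The corpus delimits fields with " +++$+++ "; pandas' Python engine
# accepts a regex separator, so we escape the delimiter and absorb the
# surrounding whitespace.
cols = ["movie_id", "title", "year", "rating", "votes", "genres"]
movies = pd.read_csv(
    "movie_titles_metadata.txt",
    sep=r"\s*\+\+\+\$\+\+\+\s*",
    engine="python",          # the C engine can't handle regex separators
    header=None,
    names=cols,
    quoting=csv.QUOTE_NONE,   # titles can contain stray quote characters
    encoding="iso-8859-1",    # assuming the files aren't UTF-8, per the corpus
)

# The "genres" field is a single-quoted list: invalid JSON, but a valid
# Python literal, so ast.literal_eval parses it without any munging.
movies["genres"] = movies["genres"].apply(ast.literal_eval)

print(movies.head())
```

The other files in the corpus follow the same delimiter convention, so they can be loaded the same way, leaving a set of ordinary DataFrames to join and explore.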