
"Yet another" tutorial on dealing with data on the command line; uses the IMDB dataset as the sample

The Unlicense

Yet Another Tutorial for Data on Command Line

Analyzing IMDB Data on the Command Line

Data is the new oil - Clive Humby

Introduction

Data professionals (engineers, scientists, analysts) regularly receive data from various sources and have to decide quickly whether the data is worth working on. This is akin to doctors who see patients with various symptoms: almost universally, the first thing a doctor does is reach for the stethoscope. Data professionals, by contrast, jump to writing code in Python, R, or some other programming language, or even worse, try to analyse the data in spreadsheets, even though simple command line tools are at their disposal.
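As a rough illustration of what such a "stethoscope" check can look like, here is a minimal sketch. It is not taken from the lessons; the sample file and its contents are made up so that the snippet is self-contained, and the choice of `ls`, `wc`, `head` and `awk` is only an assumption about which tools are handy for a first look.

```shell
#!/bin/sh
# Create a tiny, made-up TSV file so the example is self-contained.
# (Hypothetical data; not the IMDB dataset.)
sample=$(mktemp)
printf 'id\ttitle\tyear\n1\tFirst Film\t1999\n2\tSecond Film\t2005\n' > "$sample"

# How big is the file on disk?
ls -lh "$sample"

# How many lines does it have (header included)?
wc -l < "$sample"

# What do the first two lines look like?
head -n 2 "$sample"

# How many tab-separated columns does the header row have?
head -n 1 "$sample" | awk -F'\t' '{ print NF }'
```

Each command answers one quick question about the file without loading it into a program, which is the kind of first impression this tutorial is about.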

This tutorial aims to introduce data professionals to exploring data on the command line.

Isn't this yet another tutorial on the command line?

Yes. From a distance, that is what it might seem like. But most tutorials and books do one of the following:

  • Introduce a lot of commands; however, without proper data to work on, one will not know which command to use when.
  • Take the stance that everything can be done on the command line; that is the golden hammer at work. The command line is great for initial analysis, but beyond that you will need programming languages.

So, this tutorial takes one dataset and explains how it can be analysed on the command line. The same dataset is used throughout, because that is what you would do in practice when you receive a dataset.

How to navigate this tutorial?

Work through this tutorial in order, starting from Lesson 1; the lessons build on one another, so you will not be able to follow if you skip around.

Lesson 1: Obtaining the Dataset

Lesson 2: Doing basic checks on data files

Lesson 3: How to do a qualitative assessment of the data

Lesson 4: How to do a basic quantitative assessment of the data

Lesson 5: Stream first, download next

Prerequisites for this tutorial

  • This is a practical tutorial that works in a Unix-based terminal; set one up before you begin. If you are working on macOS or Ubuntu, you are already set.
  • You do not have to be conversant with the Unix command line, but you should have the patience to stick with it, as the command line is an acquired taste.
  • On occasion, a tool used in this tutorial may not be pre-installed on your machine; in such cases, install it yourself. All the tools used are widely available, so you should not hit any roadblocks.
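One way to find out up front which tools are missing is sketched below. The helper name `missing_tools` and the list of tools passed to it are assumptions made for illustration; this README does not state which tools the lessons rely on, so adjust the list as you go.

```shell
#!/bin/sh
# Print the name of every given tool that cannot be found on PATH.
# "missing_tools" is a hypothetical helper, not part of the tutorial.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the tool
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example: check a few common tools (an illustrative list, not the official one)
missing_tools head wc cut sort uniq curl
```

If the last command prints nothing, all the listed tools are available; any name it prints can be installed with your system's package manager.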

Inspired by

Jeroen Janssens's popular book Data Science at the Command Line