Data professionals (engineers, scientists, analysts) routinely receive data from various sources and have to decide quickly whether it is worth working on. This is akin to doctors who see patients with various symptoms: almost universally, the first thing a doctor does is reach for the stethoscope. Data professionals, by contrast, tend to jump straight into writing code in Python, R or some other programming language or, even worse, try to analyse the data in spreadsheets, even though simple command line tools are already at their disposal.
This tutorial attempts to introduce data professionals to the world of exploring data on the command line.
At a distance, this might seem like yet another command line tutorial. But most tutorials and books do one of the following:
- Introduce a lot of commands; however, without a proper dataset to work on, one will not know what to use when.
- Take the stance that everything can be done on the command line; that, again, is like wielding a golden hammer. The command line is great for initial analysis, but beyond that you will need to use programming languages.
So, what we do in this tutorial is take one dataset and explain how it can be analysed on the command line. The same dataset is used throughout the tutorial, because that is what you would do in practice when you receive a dataset.
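As a rough taste of what a first look on the command line can be like, here is a minimal sketch. The file `sample.csv` and its columns are made up for illustration; they are not the dataset used in this tutorial, and the lessons may use different commands.

```shell
# Create a tiny, made-up CSV so the commands below have something to read.
# (Hypothetical data; not the dataset used in this tutorial.)
printf 'city,year,value\nPune,2019,10\nDelhi,2019,12\nPune,2020,15\n' > sample.csv

# How many lines does the file have (header included)?
wc -l < sample.csv

# What do the first couple of lines look like?
head -n 2 sample.csv

# How often does each value appear in the first column (skipping the header)?
tail -n +2 sample.csv | cut -d, -f1 | sort | uniq -c
```

Each command answers one small question about the file without opening an editor or writing a program, which is the kind of quick check the rest of the tutorial builds on.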
Work through this tutorial in order, starting from Lesson 1; each lesson builds on the previous one, so you will not be able to follow if you skip ahead.
Lesson 1: Obtaining the Dataset
Lesson 2: Doing basic checks on data files
Lesson 3: How to do a qualitative assessment of the data
Lesson 4: How to do a basic quantitative assessment of the data
Lesson 5: Stream first, download next
- This is a practical tutorial that works on a Unix-based terminal; set one up before you begin. If you are working on Mac or Ubuntu, you are already set.
- You do not have to be conversant with the Unix command line, but you should have the patience to stick with it, as the command line is an acquired taste.
- There will be occasions where the tools used in this tutorial are not pre-installed on your machine; in such cases, figure out a way to install the tools. All the tools used are widely available, so you should not face any roadblocks.
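One way to find out ahead of time whether a tool is installed is to ask the shell where it lives. The sketch below uses the standard `command -v` builtin; the helper name `check_tools` and the list of tools passed to it are illustrative assumptions, not the tutorial's official tool list.

```shell
# check_tools is a hypothetical helper: it reports, for each name given,
# whether the shell can find a command with that name.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# Example tool list (an assumption for illustration only).
check_tools head wc cut sort uniq
```

Anything reported as missing can then be installed with whatever package manager your system provides.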
Jeroen Janssens's popular book Data Science at the Command Line