

INFS2822 Quick Refresher

This is a quick refresher of INFS2822 (Programming for Data Analytics) for UNSW students.

These notes will reacquaint you with concepts that you previously studied in depth, but they will not fully explain those concepts. For that, please refer to your own notes from when you studied INFS2822.

Contents:

(Itemised based on concepts, not teaching weeks.)

 

Authors: Blair Wang (UNSW Business School). See also the GitHub repo.


💻  1. Command Line

Most people are used to graphical user interfaces (GUIs), which are easy to use. However, for precision and power, it is beneficial to use command line interfaces (CLIs). There are two families of CLIs: the DOS/Windows family (DOS, cmd.exe on Windows, PowerShell) and the UNIX-based family (UNIX, macOS, Linux). We use the UNIX-based family in INFS2822 because it's pretty much the standard for science and data work.

At the command line, there are various shells in which you can execute commands. The most common shell for UNIX-based work is Bash. In this course, we use zsh, a separate shell that is largely compatible with Bash and adds extra features. Note that from one shell you can start another interactive environment, e.g. running python3 from zsh starts the Python interpreter.

Getting started at the UNIX command line, there are a lot of useful commands:

  • date (not to be confused with time)
  • ls -lah
  • find
  • cat / head / tail / less / nano / grep

A sequence of commands can be stored in a script. For example, SQL commands can be stored in a .sql file. Likewise, Bash/zsh commands can be stored in a .sh file.

🐍  2. Python Essentials

This section of INFS2822 is based on the Software Carpentry Programming with Python text.

Basics

This section is based on Python Software Carpentry set 1. If you are also taking INFS1609, you may wish to compare with INFS1609 Elementary Programming for the equivalent in Java.

  • Variable assignment, e.g. weight_kg = 60 (integer), weight_kg = 60.0 (float), weight_kg = 'sixty' (string)
  • Printing to the command line with print(..)
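
As a small illustration of the two points above, the sketch below rebinds one variable to values of different types and prints each type. The name weight_kg comes from these notes; weight_lb and the 2.2 conversion factor are assumptions added only for illustration.

```python
# The same name can be rebound to values of different types.
weight_kg = 60
print(weight_kg, type(weight_kg))   # an int

weight_kg = 60.0
print(weight_kg, type(weight_kg))   # a float

weight_kg = 'sixty'
print(weight_kg, type(weight_kg))   # a str (string)

# Variables can be used in calculations; 2.2 lb per kg is an approximation.
weight_kg = 60.0
weight_lb = 2.2 * weight_kg
print('weight in pounds:', weight_lb)
```

Unlike Java, Python does not require a declared type: the type belongs to the value, not to the variable name.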

Working with CSV data

This section is based on Python Software Carpentry set 2.

import numpy
csvdata = numpy.loadtxt(fname='file.csv', delimiter=',')

# -- Data Types --
print(type(csvdata))       # will be <class 'numpy.ndarray'>
print(csvdata.dtype)       # will be float64
print(csvdata.shape)       # will be (rowcount, colcount)

# -- Rows and Columns --
print(csvdata[0,0])        # will be value at row 0, col 0
                           # python numbering starts at 0
print(csvdata[31,41])      # will be value at row 31, col 41
print(csvdata[31, :])      # will be all of row 31
print(csvdata[:, 41])      # will be all of col 41

# -- Quick Stats Examples --
print(numpy.mean(csvdata))         # mean across all rows and cols
print(numpy.max(csvdata[31, :]))   # maximum value across row 31
print(numpy.min(csvdata[:, 41]))   # minimum value across col 41
print(numpy.std(csvdata))          # standard deviation across all rows and cols

# -- Array of Statistics --
print(numpy.mean(csvdata, axis=1))  # array of averages for each row
print(numpy.mean(csvdata, axis=0))  # array of averages for each column
print(numpy.diff(csvdata[31, :]))   # cell minus previous cell for row 31
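
The axis argument is easy to mix up, so here is a minimal sketch on a tiny dataset where the answers can be checked by hand. It assumes NumPy is installed; the 2x3 values and the names csv_text and small are made up for illustration, and the data is read from an in-memory string rather than file.csv.

```python
import io
import numpy

# A tiny 2-row, 3-column dataset held in memory instead of file.csv
# (the values are invented purely for illustration).
csv_text = "1,2,3\n4,5,6\n"
small = numpy.loadtxt(io.StringIO(csv_text), delimiter=',')

print(small.shape)                       # 2 rows, 3 columns

row_means = numpy.mean(small, axis=1)    # one mean per row
col_means = numpy.mean(small, axis=0)    # one mean per column
print(row_means)
print(col_means)

row_diffs = numpy.diff(small[0, :])      # each cell minus the previous cell in row 0
print(row_diffs)
```

One way to remember it: axis=0 collapses the rows (leaving one value per column), while axis=1 collapses the columns (leaving one value per row).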

Generating charts/graphs

This section is based on Python Software Carpentry set 3.

import numpy
csvdata = numpy.loadtxt(fname='file.csv', delimiter=',')

import matplotlib.pyplot as pp

# heat map
pp.imshow(csvdata)
pp.show()

# line graph
pp.plot(numpy.mean(csvdata, axis=0))
pp.xlabel('x axis label goes here')
pp.ylabel('y axis label goes here')
pp.show()

# figure with multiple plots
fig = pp.figure(figsize=(10.0, 3.0))
part1 = fig.add_subplot(1, 3, 1) # }
part2 = fig.add_subplot(1, 3, 2) # } each of these begins with 1,3 because the grid is 1x3
part3 = fig.add_subplot(1, 3, 3) # } then the 3rd argument is the position in the grid (starts at 1 here)

part1.plot(..)
part2.plot(..) # same syntax as above
part3.plot(..)

fig.tight_layout()
pp.savefig('file.png')
pp.show()

Loops and Arrays

This section is based on Python Software Carpentry set 4 and Python Software Carpentry set 5. If you are also taking INFS1609, you may wish to compare with INFS1609 Loops and Arrays for the equivalent in Java.

# incrementing
for number in range(1, 6):
    print(number)
    # will give you 1, 2, 3, 4, 5

# arrays (lists)
foods = ["Fish Fingers", "Custard", "Grapes"]
print(len(foods))  # will give you 3
print(foods[1])    # will give you Custard
print(foods[-1])   # will give you last element = Grapes
print(foods[-2])   # will give you 2nd last element = Custard
for this_food in foods:
    print(this_food)
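
Loops are often used to build up a result rather than just print. The following is a minimal sketch of that pattern using the foods list from above; the names name_lengths and total are assumptions introduced for illustration.

```python
foods = ["Fish Fingers", "Custard", "Grapes"]

# Loop over positions and values together; enumerate counts from 0.
for index, this_food in enumerate(foods):
    print(index, this_food)

# Build up a new list inside a loop: the length of each name.
name_lengths = []
for this_food in foods:
    name_lengths.append(len(this_food))
print(name_lengths)

# range(1, 6) stops before 6, so this adds 1 + 2 + 3 + 4 + 5.
total = 0
for number in range(1, 6):
    total = total + number
print(total)
```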

Combine data from multiple files

This section is based on Python Software Carpentry set 6.

import glob
import numpy

filenames = sorted(glob.glob('dataset-2020-08-*.csv'))
for filename in filenames:
    print(filename)
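
Listing the files is usually only the first step; the loop body then does something with each one. The sketch below is one possible way to do that using only the standard library. It creates two throwaway CSV files in a temporary folder so that it runs on its own; the file contents and the names folder, samples and file_means are assumptions for illustration, not part of the course material.

```python
import csv
import glob
import os
import tempfile

# Create a throwaway folder with two small CSV files so the sketch is
# self-contained; in practice the files would already exist.
folder = tempfile.mkdtemp()
samples = {
    'dataset-2020-08-01.csv': [[1, 2, 3], [4, 5, 6]],
    'dataset-2020-08-02.csv': [[10, 20, 30], [40, 50, 60]],
}
for name, rows in samples.items():
    with open(os.path.join(folder, name), 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)

# sorted() makes the order predictable; glob alone gives no guarantee.
filenames = sorted(glob.glob(os.path.join(folder, 'dataset-2020-08-*.csv')))

# Compute the overall mean of every value in each file.
file_means = {}
for filename in filenames:
    with open(filename, newline='') as handle:
        values = [float(cell) for row in csv.reader(handle) for cell in row]
    file_means[os.path.basename(filename)] = sum(values) / len(values)

for name, mean in file_means.items():
    print(name, mean)
```

With NumPy available, the inner part of the loop could instead call numpy.loadtxt on each filename, as in the CSV section above.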

Selections

This section is based on Python Software Carpentry set 7. If you are also taking INFS1609, you may wish to compare with INFS1609 Selections for the equivalent in Java.

# using else-if (elif)
if x > y:
    print('x is bigger than y')
elif x > z:
    print('x is not bigger than y, but it is bigger than z')
else:
    print('x is not bigger than y or z')

# using logical operator (and, or, etc)
if (x > y) and (x > z):
    print('x is bigger than y and z')
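
The snippets above assume x, y and z already exist. As a runnable sketch, the loop below tries three sets of values and records which branch is taken; the test values and the names cases, message and results are assumptions chosen for illustration.

```python
# Each tuple holds one (x, y, z) combination to try.
cases = [(5, 1, 2), (5, 1, 9), (5, 9, 9)]
results = []

for x, y, z in cases:
    if (x > y) and (x > z):
        message = 'x is bigger than both y and z'
    elif (x > y) or (x > z):
        # Only reached when the first test failed, so exactly one is true.
        message = 'x is bigger than exactly one of y and z'
    else:
        message = 'x is not bigger than y or z'
    results.append(message)
    print(x, y, z, '->', message)
```

Only the first branch whose condition is true runs, so the order of the if/elif tests matters.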

Functions

This section is based on Python Software Carpentry set 8. If you are also taking INFS1609, you may wish to compare with INFS1609 Methods for the equivalent in Java.

def bmi(mass_kg, height_m):
    numerator = mass_kg
    denominator = height_m ** 2
    return numerator / denominator
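
A function does nothing until it is called. The sketch below repeats a condensed version of bmi so that it runs on its own, then calls it with positional and keyword arguments; the input values are made up for illustration.

```python
def bmi(mass_kg, height_m):
    """Return body mass index: mass in kg divided by height in metres squared."""
    return mass_kg / height_m ** 2

# Positional arguments follow the order in the definition.
print(bmi(60, 1.5))

# Keyword arguments can be given in any order.
print(bmi(height_m=2.0, mass_kg=80))
```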

💼  3. Project Management

CRISP-DM methodology:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

🌐  4. Web Publishing

Web documents are written in Hypertext Markup Language (HTML). Most elements are written as an opening tag (e.g. <p>) that must later be closed (e.g. </p>); a few, such as <meta>, have no closing tag. Elements are nested inside one another in a hierarchy, which the web browser represents as a tree known as the Document Object Model (DOM). The visual style of the page when rendered in a web browser can be designed using Cascading Style Sheets (CSS).

HTML example: index.html

<!DOCTYPE html>
<html>
    <head>
        <title>Hello World</title>
        <meta charset="UTF-8">
        <link rel="stylesheet" type="text/css" href="styles.css" />
    </head>
    <body>
        <h1>Hello World!</h1>
        <p>The quick brown fox jumps over the lazy dog.</p>
    </body>
</html>

Corresponding CSS example: styles.css

body {
    max-width: 900px;
    margin: 0 auto;
    font-family: 'Verdana';
}

h1 {
    color: blue;
}
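
To make the idea of nested tags more concrete, the sketch below uses Python's built-in html.parser module to print an indented outline of a shortened copy of index.html. This is only a rough approximation of the hierarchy, not the DOM a browser actually builds; the class name OutlinePrinter and the VOID_TAGS set are hypothetical helpers written for this illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (a partial list).
VOID_TAGS = {'meta', 'link', 'br', 'img', 'hr', 'input'}

class OutlinePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1   # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1   # closing a tag returns to the parent level

page = """<!DOCTYPE html>
<html>
    <head>
        <title>Hello World</title>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>Hello World!</h1>
        <p>The quick brown fox jumps over the lazy dog.</p>
    </body>
</html>"""

parser = OutlinePrinter()
parser.feed(page)
parser.close()

# Print one tag per line, indented by nesting depth.
for depth, tag in parser.outline:
    print('    ' * depth + tag)
```

The indentation of the printed outline mirrors the indentation used in index.html: head and body sit inside html, and h1 and p sit inside body.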

🗺️  5. Geographical Visualisations

🥣  6. Web Scraping

🤖  7. Machine Learning

⚖️  8. Social, Legal and Ethical Issues

💡  9. Tips and Tricks