data-analysis-exercises: A Python repository from leomrocha

#Data Analysis

This project contains some exercises in data analysis with python-pandas and some comparisons with raw python.

The goal is to learn and have some fun.

The files processed to obtain the data will not be published (that is not mine to decide).

##Tasks

###First exercise: count the number of lines in Python for each file

Solution and discussion here
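As a rough illustration of the first exercise, the sketch below counts lines in pure Python by streaming each file, so the whole file never has to fit in memory. The function names `count_lines` and `count_lines_per_file` are made up for this example and are not taken from the repository.

```python
def count_lines(path):
    """Count the lines of one file by iterating over it in binary mode."""
    with open(path, "rb") as handle:
        # Iterating a file object yields one line at a time.
        return sum(1 for _ in handle)


def count_lines_per_file(paths):
    """Return a {path: line count} mapping for several files."""
    return {str(path): count_lines(path) for path in paths}
```

Calling `count_lines_per_file(["bookings.csv", "searches.csv"])` would return one count per file; the two file names here are placeholders.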

###Second exercise:

Arrival airport is the column arr_port. It is the IATA code for the airport

Print the top 10 arrival airports in the standard output, including the number of passengers.
(E) Get the name of the city or airport corresponding to that airport

Solutions and discussion here
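One possible way to rank arrival airports with pandas is sketched below. It assumes the bookings are already loaded in a DataFrame with the columns `arr_port` and `pax`; stripping whitespace from the codes is a precaution added for this example, and the sample rows are invented.

```python
import pandas as pd


def top_arrivals(bookings, n=10):
    """Return the n arrival airports with the most passengers."""
    # Strip padding so that "BCN " and "BCN" count as the same airport
    # (assumption: codes may carry trailing spaces in the raw file).
    ports = bookings["arr_port"].str.strip()
    totals = bookings.groupby(ports)["pax"].sum()
    return totals.nlargest(n)


# Invented sample data, only to show the call.
bookings = pd.DataFrame(
    {"arr_port": ["MAD", "BCN ", "MAD", "AGP", "BCN"], "pax": [2, 1, 3, 1, 2]}
)
print(top_arrivals(bookings, n=2))
```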

(E) Write a web service that wraps the output of the exercise and returns the data in JSON format. The web service should accept a parameter n>0. For the top 10 airports, n is 10. For the X top airports, n is X.

###Third exercise:

Plot the monthly number of searches for flights arriving at Málaga, Madrid or Barcelona.

Solutions and discussion here
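The monthly counts behind such a plot could be computed along these lines. The column names `Date` and `Destination` are assumptions about the searches file, not something this README states; AGP, MAD and BCN are the IATA codes for Málaga, Madrid and Barcelona.

```python
import pandas as pd

AIRPORTS = ["AGP", "MAD", "BCN"]  # Málaga, Madrid, Barcelona


def monthly_searches(searches, date_col="Date", dest_col="Destination"):
    """Count searches per month and destination for the selected airports."""
    selected = searches[searches[dest_col].isin(AIRPORTS)]
    # Reduce each date to its calendar month.
    months = pd.to_datetime(selected[date_col]).dt.to_period("M")
    counts = selected.groupby([months, selected[dest_col]]).size()
    # One row per month, one column per airport.
    return counts.unstack(fill_value=0)


# Invented sample data, only to show the call.
searches = pd.DataFrame(
    {
        "Date": ["2013-01-05", "2013-01-20", "2013-02-01", "2013-02-03"],
        "Destination": ["MAD", "MAD", "BCN", "LHR"],
    }
)
table = monthly_searches(searches)
```

With matplotlib installed, `table.plot()` would draw one line per airport.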

(E) For every search in the searches file, find out whether the search ended up in a booking or not. Write a file with the results.
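A minimal sketch of one way to match searches against bookings is a left merge with an indicator column. Which columns identify "the same trip" in both files is not stated here, so the `keys` list in the example is purely hypothetical.

```python
import pandas as pd


def flag_booked(searches, bookings, keys):
    """Add a boolean 'booked' column telling whether each search has a matching booking."""
    # Deduplicate the booking keys so the merge cannot multiply search rows.
    booked_keys = bookings[keys].drop_duplicates()
    merged = searches.merge(booked_keys, on=keys, how="left", indicator=True)
    merged["booked"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Invented sample data with hypothetical key columns.
keys = ["dep_port", "arr_port"]
searches = pd.DataFrame({"dep_port": ["LHR", "CDG"], "arr_port": ["MAD", "BCN"]})
bookings = pd.DataFrame({"dep_port": ["LHR", "LHR"], "arr_port": ["MAD", "MAD"]})
result = flag_booked(searches, bookings, keys)
```

The result could then be written out with `result.to_csv("searches_with_bookings.csv", index=False)`, where the output file name is again just an example.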

###Bonus Exercise: Ranking Web Service

Wrap the output of the second exercise in a web service that returns the data in JSON format (instead of printing to the standard output). The web service should accept a parameter n>0. For the top 10 airports, n is 10. For the X top airports, n is X.

Notes

This exercise is solved in the files:

ranker.py : contains the ranking algorithm
ws_rank.py : contains the web service (depends on bottle.py)

Dependencies

bottle.py
pandas

System requirements

The program loads the complete dataset for the table bookings.py into memory. The system needs enough memory for pandas to load it.

Launching

To launch the program: $ python ws_rank.py

Usage

The program returns a JSON object containing the airport and the number of passengers (pax).

Getting the n top arrival airports:

localhost:8080/arrivals/first/100

Getting the n last arrival airports:

localhost:8080/arrivals/last/100
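To illustrate how the two routes could behave, here is a small sketch that swaps bottle.py for a plain standard-library WSGI application, so it runs without dependencies. It is not the code in ws_rank.py: the passenger totals are invented, and the exact shape of the JSON payload (a list of airport/pax pairs) is an assumption.

```python
import json

# Invented passenger totals per arrival airport.
PAX_BY_AIRPORT = {"LHR": 120, "MAD": 95, "BCN": 80, "AGP": 40}


def rank(totals, n, last=False):
    """Return the n busiest airports, or the n least busy ones if last=True."""
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=not last)
    return [{"airport": code, "pax": pax} for code, pax in ordered[:n]]


def app(environ, start_response):
    """WSGI app answering /arrivals/first/<n> and /arrivals/last/<n> with n > 0."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    valid = (
        len(parts) == 3
        and parts[0] == "arrivals"
        and parts[1] in ("first", "last")
        and parts[2].isdecimal()
        and int(parts[2]) > 0
    )
    headers = [("Content-Type", "application/json")]
    if not valid:
        start_response("404 Not Found", headers)
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    ranking = rank(PAX_BY_AIRPORT, int(parts[2]), last=(parts[1] == "last"))
    start_response("200 OK", headers)
    return [json.dumps(ranking).encode("utf-8")]


def serve(host="localhost", port=8080):
    """Serve the app with the reference WSGI server from the standard library."""
    from wsgiref.simple_server import make_server

    make_server(host, port, app).serve_forever()
```

Calling `serve()` would make the sketch answer on the same URLs as above, for example `localhost:8080/arrivals/first/2`.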

##Possible problems

###Lacking familiarity with Pandas library

Solutions:

Study
Search documentation, forums, StackOverflow, and tutorials
Compare results with pure python for checking correctness and performance comparison
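The comparison with pure Python mentioned above might look like the following sketch: the same aggregation is computed twice, the results are checked against each other, and both versions are timed. The sample rows and function names are invented for this example.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Invented sample rows: (arrival airport, passengers).
rows = [("MAD", 2), ("BCN", 1), ("MAD", 3)]


def pax_pure_python(rows):
    """Sum passengers per airport with a plain dictionary."""
    totals = defaultdict(int)
    for port, pax in rows:
        totals[port] += pax
    return dict(totals)


def pax_pandas(rows):
    """Sum passengers per airport with a pandas groupby."""
    frame = pd.DataFrame(rows, columns=["arr_port", "pax"])
    return frame.groupby("arr_port")["pax"].sum().to_dict()


# Correctness check: both approaches must agree.
assert pax_pure_python(rows) == pax_pandas(rows)

# Rough timing of each approach; the numbers depend on machine and data size.
print("pure python:", timeit.timeit(lambda: pax_pure_python(rows), number=100))
print("pandas:     ", timeit.timeit(lambda: pax_pandas(rows), number=100))
```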

###Memory

Solutions:

Load file in chunks
Load only needed columns
Use file storage (PyTables)
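The first two ideas can be combined with the `usecols` and `chunksize` parameters of `pandas.read_csv`, roughly as sketched below. The separator and the chunk size are assumptions to be adapted to the real file; the sample data is invented.

```python
import io

import pandas as pd


def pax_per_airport(source, sep=",", chunksize=100_000):
    """Sum passengers per arrival airport while reading the file chunk by chunk."""
    chunks = pd.read_csv(
        source,
        sep=sep,
        usecols=["arr_port", "pax"],  # load only the needed columns
        chunksize=chunksize,          # yields DataFrames of at most this many rows
    )
    # Aggregate each chunk on its own, then combine the partial sums.
    partials = [chunk.groupby("arr_port")["pax"].sum() for chunk in chunks]
    total = pd.concat(partials).groupby(level=0).sum()
    return total.sort_values(ascending=False)


# Invented sample data; a real call would pass a file path instead.
data = "arr_port,pax,other\nMAD,2,x\nBCN,1,y\nMAD,3,z\n"
print(pax_per_airport(io.StringIO(data), chunksize=2))
```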

##Plan (not strict, but as a guide)

Read documentation about pandas.
Do some testing on dummy data
Do a basic analysis of the data (format, separators, columns, number of rows)
Go to more complex tasks
Publish one result a day until finished
Packaging
Installation and usage documentation

##Conclusions

I had some fun and started learning a new library.

###Lessons learned:

I started learning Python Pandas library
I had some fun
I underestimated the effort needed to learn to use the library
I found more problems than expected
Check the formatting and separators of the input files again and again.
Watch out for missing data.
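
A quick sanity check for the last two points could look like this sketch, which uses only the standard library. It reports lines whose field count differs from the header and counts empty fields per column; the `^` delimiter and the sample text are invented for the example.

```python
import csv
import io


def check_rows(lines, delimiter):
    """Return (line numbers with a wrong field count, empty-field count per column)."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    bad_width = []
    empty_counts = {name: 0 for name in header}
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of the source line just read.
            bad_width.append(reader.line_num)
            continue
        for name, value in zip(header, row):
            if not value.strip():
                empty_counts[name] += 1
    return bad_width, empty_counts


# Invented sample: line 3 has an empty pax field, line 4 lacks a column.
sample = "arr_port^pax\nMAD^2\nBCN^\nAGP\n"
print(check_rows(io.StringIO(sample), delimiter="^"))
```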
