/project_kschool_ds_challenge

A challenge set by a well-known company as part of the recruiting process for its data science division. The difficulty lies not in the logic but in the size of the files.

Primary language: Jupyter Notebook

Data Science Challenge

--

Data Science Challenge for the recruiting process of a famous company.

Suggested Approach

  • Get familiar with the data
  • Select columns of interest
  • Decide what to do with NaNs
  • Make processing plan
  • Develop code that works with a sample
  • Adjust the code to work with big data
  • Test big data approach on a sample
  • Run program with big data
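
The first steps above (getting familiar with the data and deciding what to do with NaNs) can be tried on a small sample before touching the full files. The following is a minimal sketch, not part of the original challenge: the helper names `load_sample` and `nan_share` are hypothetical, and the `^` separator is an assumption about the file format that should be checked against the real data.

```python
import pandas as pd


def load_sample(source, nrows=10_000, sep="^", usecols=None):
    """Read only the first `nrows` rows so exploration stays fast.

    `sep="^"` is an assumption; pass the separator your files actually use.
    """
    return pd.read_csv(source, sep=sep, nrows=nrows, usecols=usecols)


def nan_share(frame):
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)
```

Calling `nan_share(load_sample(path))` on each file gives a quick view of which columns of interest contain missing values.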

Instructions

  • You should do all the work in a Python Notebook
  • Export the Notebook to a public notebook viewer
  • Required: Include all your code on GitHub
  • Please promptly update the repository with your ongoing work

Exercises

1. Using Python, count the number of lines in each file
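
One possible way to count lines without loading a large file into memory is to read it in fixed-size binary chunks and count newline bytes. This is only a sketch; the function name `count_lines` and the chunk size are illustrative choices, not part of the challenge.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count the lines in a file by reading it in binary chunks."""
    lines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        lines += 1
    return lines
```

Memory use stays bounded by `chunk_size` regardless of the file size, which is the main constraint in this challenge.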

2. Top 10 arrival airports in the world in 2013 (using the bookings file)

  • Arrival airport is the column arr_port. It is the IATA code for the airport

  • To get the total number of passengers for an airport, sum the column pax, grouping by arr_port. Note that some pax values are negative; these correspond to cancellations. To get the number of passengers who actually booked, include the negative values in the sum, so that canceled bookings are subtracted.

  • Print the top 10 arrival airports in the standard output, including the number of passengers.

  • Bonus point: Get the name of the city or airport corresponding to each airport code programmatically (we suggest having a look at GeoBases on GitHub)

  • Bonus point: Solve this problem using pandas (instead of any other approach)
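
The grouping and summing described above can be sketched with pandas by processing the bookings file in chunks and adding up the partial sums. The column names `arr_port` and `pax` come from the exercise; everything else is an assumption: the function name `top_arrival_airports` is hypothetical, the `^` separator is a guess about the file format, and the sketch assumes the bookings file only contains 2013 data (otherwise a filter on a date column would be needed first).

```python
import pandas as pd


def top_arrival_airports(source, n=10, sep="^", chunksize=100_000):
    """Sum pax per arr_port across chunks and return the n largest totals.

    Assumptions: `sep` matches the file, and every row belongs to 2013.
    """
    totals = None
    reader = pd.read_csv(
        source,
        sep=sep,
        usecols=["arr_port", "pax"],
        dtype={"arr_port": str},
        chunksize=chunksize,
    )
    for chunk in reader:
        # Drop incomplete rows and strip padding around the IATA codes.
        chunk = chunk.dropna(subset=["arr_port", "pax"]).assign(
            arr_port=lambda d: d["arr_port"].str.strip()
        )
        # Negative pax values (cancellations) are included in the sum.
        partial = chunk.groupby("arr_port")["pax"].sum()
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    if totals is None:
        return pd.Series(dtype=int)
    return totals.sort_values(ascending=False).head(n).astype(int)
```

Printing the returned Series writes the airport codes and their passenger totals to standard output.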

3. Plot the monthly number of searches for flights arriving at Málaga, Madrid or Barcelona

  • For the arriving airport, you can use the Destination column in the searches file.

  • Plot a curve for Málaga, another one for Madrid, and another one for Barcelona, in the same figure.

  • Bonus point: Solve this problem using pandas (instead of any other approach)
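
A possible pandas sketch for this exercise counts searches per month and destination in chunks, then plots one curve per airport. The IATA codes AGP, MAD and BCN stand for Málaga, Madrid and Barcelona. The `Destination` column comes from the exercise; the date column name `Date`, the `^` separator, and the function names are assumptions that should be checked against the searches file.

```python
import pandas as pd

AIRPORTS = {"AGP": "Málaga", "MAD": "Madrid", "BCN": "Barcelona"}


def monthly_searches(source, date_col="Date", sep="^", chunksize=100_000):
    """Return a table of search counts: one row per month, one column per airport.

    `date_col` and `sep` are assumptions about the searches file.
    """
    partials = []
    reader = pd.read_csv(
        source,
        sep=sep,
        usecols=[date_col, "Destination"],
        dtype=str,
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk = chunk[chunk["Destination"].isin(list(AIRPORTS))]
        if chunk.empty:
            continue
        # Unparseable dates become NaN and are dropped by groupby.
        months = pd.to_datetime(chunk[date_col], errors="coerce").dt.strftime("%Y-%m")
        partials.append(chunk.groupby([months, "Destination"]).size())
    if not partials:
        return pd.DataFrame(columns=list(AIRPORTS))
    totals = pd.concat(partials).groupby(level=[0, 1]).sum()
    table = totals.unstack(level=1, fill_value=0)
    return table.reindex(columns=list(AIRPORTS), fill_value=0).sort_index()


def plot_monthly_searches(table):
    """Draw one curve per airport in a single figure (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    table.rename(columns=AIRPORTS).plot(ax=ax, marker="o")
    ax.set_xlabel("Month")
    ax.set_ylabel("Number of searches")
    return ax
```

Keeping the counting separate from the plotting means the expensive pass over the file runs once, and the small resulting table can be replotted freely.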

4. Match searches with bookings

  • For every search in the searches file, find out whether the search ended up in a booking or not (using the info in the bookings file). For instance, search and booking origin and destination should match.
  • For the bookings file, origin and destination are the columns dep_port and arr_port, respectively.
  • Generate a CSV file with the search data plus an additional field containing 1 if the search ended up in a booking, and 0 otherwise.
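
One way to approach the matching, shown here only as a simplified sketch, is to collect the set of booked routes from the bookings file and then stream the searches file in chunks, flagging each search whose route appears in that set. The columns `dep_port`, `arr_port` and `Destination` come from the exercise; the searches column name `Origin`, the `^` separator, the output column name `booked`, and the function names are assumptions. Matching on origin and destination alone is the example criterion given above; a stricter match would add further keys.

```python
import pandas as pd


def booked_routes(bookings_source, sep="^", chunksize=100_000):
    """Collect the set of (dep_port, arr_port) pairs present in the bookings file."""
    routes = set()
    reader = pd.read_csv(
        bookings_source,
        sep=sep,
        usecols=["dep_port", "arr_port"],
        dtype=str,
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk = chunk.dropna()
        routes.update(zip(chunk["dep_port"].str.strip(), chunk["arr_port"].str.strip()))
    return routes


def flag_searches(searches_source, routes, output_path,
                  origin_col="Origin", sep="^", chunksize=100_000):
    """Write the searches to a CSV with an extra `booked` column (1 or 0).

    `origin_col` and `sep` are assumptions about the searches file.
    """
    reader = pd.read_csv(searches_source, sep=sep, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        pairs = zip(chunk[origin_col].str.strip(), chunk["Destination"].str.strip())
        chunk["booked"] = [int(pair in routes) for pair in pairs]
        # The first chunk creates the file with a header; later chunks append.
        chunk.to_csv(
            output_path,
            sep=sep,
            index=False,
            mode="w" if i == 0 else "a",
            header=(i == 0),
        )
```

Only the set of distinct routes is held in memory, so the size of the searches file does not affect memory use.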