Data Curation (D2489-1819_T1), Nova School of Business and Economics, Portugal
Instructor: Qiwei Han
TA: Sónia Nascimento Vilares
Course description
This course is the entry point to the data science skill set needed for business analytics in the modern big data era. We will introduce the concepts of data curation and management, together with their applications. Students will explore the characteristics of data and practice data curation hands-on, covering data extraction, data wrangling, data exploration, databases, and data science workflows built on reproducible Extract-Transform-Load (ETL) processes.
Students do not need prior programming experience, but familiarity with a programming language such as R, Matlab, or Java is strongly preferred.
This course consists of 6 modules in which students learn data curation through hands-on programming exercises. It also serves as a crash course in Python, one of the most popular programming languages in the big data era. Most lectures will be presented using Python/SQL examples.
- Week 1: Introduction to Python for Data Curation. We will review the common Python functionality and data structures that data scientists use, and introduce the Anaconda Jupyter notebook used in the lectures.
- Week 2: Python Basics and Built-in Data Structures. We will give a crash course on Python basics for data analytics, covering language semantics, control flow and functions, as well as widely used data structures such as lists, tuples and dictionaries.
- Week 3: Numerical Python and Data Extraction and Cleaning using Pandas. We will introduce techniques for effectively loading, storing, and manipulating in-memory data in Python with two powerful toolkits for data processing: NumPy and Pandas. We will show how to store and manipulate numerical arrays efficiently and how to read data from different sources into DataFrame structures.
- Week 4: Data Cleaning and Preparation using Pandas. We will deepen your understanding of Pandas by learning how to generate tidy structures and how to query these structures to manipulate data, for example by summarizing, aggregating, grouping and merging. We will also show how to deal with real-world scenarios in which the data contain missing or wrong values.
- Week 4: Data Wrangling and Data Exploration Analysis. We will show how to join, combine and reshape data that is spread across a number of files or is not arranged in a form that is easy to analyze. We will also show how to apply aggregations or transformations to each group.
- Week 5: Introduction to Database and SQL. We will introduce modern database concepts and help you learn and apply the Structured Query Language (SQL) to query large-scale data stored in the PostgreSQL relational database.
- Week 6: Data analysis examples and review of all class materials.
ASSESSMENT. The overall evaluation of performance consists of 3 parts:
- Class participation through 5 quizzes (20%)
- 3 bi-weekly assignments (30%)
- Final exam (50%)
Students need to take part in at least 4 of the class quizzes. If a student is present for all 5 quizzes, the 4 quizzes with the highest points will be counted.
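As an illustration of the quiz rule above, here is a minimal Python sketch. The function name `counted_quiz_scores` and the example scores are hypothetical, and the sketch assumes the list holds only the quizzes a student actually took. The syllabus does not say what happens when fewer than 4 quizzes are taken, so that case is left out.

```python
def counted_quiz_scores(scores, keep=4):
    """Return the quiz scores that count toward the grade.

    `scores` holds the points of the quizzes a student took (assumed input).
    When all 5 quizzes were taken, only the 4 highest are kept.
    """
    # Sort from highest to lowest and keep at most `keep` scores.
    return sorted(scores, reverse=True)[:keep]


# Hypothetical scores for a student who took all 5 quizzes.
print(counted_quiz_scores([7, 9, 4, 10, 8]))  # the lowest score (4) is dropped
```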
Assignments are issued every two weeks. Students need to submit each assignment by the due date and will lose 20% of the points for each day it is late. For example, if an assignment is 3 days late, the student can get at most 40% of the points. If an assignment is 6 days late, it will be returned without evaluation. Questions about an assignment should be sent at least 24 hours before the deadline to ensure a timely response. Please CC the TA so that we can all stay coordinated, and include the course code 2489 in the subject line of your emails. Again, don't forget to CC the TA. The TA grades the homework, and I will usually defer to the TA's judgement on matters of scoring.
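The late-submission rule can be checked with a short Python sketch. The function name `max_score_percent` is hypothetical. The sketch assumes lateness is counted in whole days and that an assignment 6 or more days late is returned without evaluation. The syllabus only states the 6-day case explicitly.

```python
from typing import Optional


def max_score_percent(days_late: int) -> Optional[int]:
    """Maximum share of points (in percent) still available for an assignment.

    Returns None when the assignment is returned without evaluation.
    Assumption: lateness is counted in whole days.
    """
    if days_late >= 6:
        # Assumed cutoff: 6 or more days late means no evaluation.
        return None
    # Lose 20 percentage points per late day, never dropping below 0.
    return max(0, 100 - 20 * days_late)


for days in (0, 3, 5, 6):
    print(days, max_score_percent(days))
```

Under these assumptions, 3 days late gives a cap of 40%, which matches the example in the text.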
You are encouraged to discuss general approaches and clarification questions with your fellow students. However, you should do your homework yourself.
- Do not look at (or copy) another student's homework.
- Do not copy from another student's homework.
If you receive any help from another student or from outside the class (such as Stack Overflow or other forums or websites), you must give credit where credit is due and clearly identify where you received help. Your grade is expected to reflect the work that you alone did.
RESOURCES. Plenty of online resources provide additional information for this course. Students may find the following resources useful for self-study and exercises:
- Online Data Science Encyclopedia
- Online Python tutorial:
- Online SQL tutorial:
- Source code of reference books: