/Database-Systems-for-Data-Science

Database-Systems-for-Data-Science

Primary LanguageJupyter NotebookCreative Commons Attribution Share Alike 4.0 InternationalCC-BY-SA-4.0

Database-Systems-for-Data-Science (Fall 2023)

MS Applied Data Science

Instructor: Dimitri Yatsenko, Ph.D.

Syllabus

Organization concepts and terminology of data models and the underlying data architectures needed to support them. Presentation of the relational database management systems including an introduction to SQL programming: relational database design and data queries with integration into application programming languages, with Python used for examples.

I will assume basic Python proficiency: we want to use databases directly from Python.

The course will include practical exercises and will be graded based on several indvidual and group projects using real-world datasets.

Class and attendance

Class participation is required and much of the material will be presented in class only. The class will be held in room Malloy 024 and also broadcast in Zoom. We will also use a Slack channel. If you have not received access to any of these resources, please notify your instructor.

Textbook

The University provided you with a Safari Books subscription account. Readings, examples, and projects will be selected primarily from the following references:

  1. Jan L. Harrington, Relational Database Design and Implementation: 4th edition (2016)
  1. Alan Beaulieu, Learning SQL, 3rd Edition (2020)
  1. Michael Hernandez, Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design; 4th edition (2020). ISBN-13: 978-0136788041 ISBN-10: 0136788041
  1. Josephine Bush, Learn SQL Database Programming (2020)

Additional slides and notes will be provided in class.

Grading

10-12 Assignments/Projects

grade percent
A >=94%
A- >=90%
B+ >=86%
B >=82%
B- >=78%
C+ >=74%
C >=70%
C- >=66%
D+ >=62%
D >=58%
F >=0%

Weeks 1-2: (Aug 23, Aug 30)

Databases in data science. Data models: diverse ways to think about data: hiearchical, network, relational, object, graph, and document data models. History of datatabases and database technologies. Next-generation databases.

Database access. Creating SQL tables and inserting data. Simple queries. sqlite and mysql.

Issue SQL queries from Jupyter using SQL Magic.

Issuing SQL queries from Python code using a database client library, (pymysql).

CREATE SCHEMA, CREATE TABLE, INSERT, DELETE, DROP TABLE, DROP SCHEMA.

Create schemas and tables, insert rows, delete, and drop.

Using the faker module to populate tables.

Using datajoint as a high-level interface.

Datatypes: numerical, character strings, and enum.

Homework 1 (due Sep 1)

  1. Installation of SQL Magic for Jupyter

  2. Connecting to the database from Jupyter

  3. On the MySQL server, create a database named <username>_university and define a table named person. Make sure it has a well chosen primary key.

  • Connect to the database server
  • Create a schema, a table within it, and insert at least one row.

Readings:

  1. Create a university database
  2. Create a table named person with basic attributes: name, date of birth, address, phone,
  3. Insert at least one record into the person table.

Key terms

Database, database system, database server, data model, data integrity, data consistency, ACID, relational data model, SQL, imperative queries, schema.

Key skills

  • Connect to the MySQL database provided for the class, create a simple table, and insert data into it.
  • Issue queries in SQL (using the SQL magic in jupyter for quick SQL scripting)
  • Execute queries with a client library (pymysql). Generate fake data. See notebook Fake-It.ipynb
  • Execute queries using DataJoint. See notebooks DataJoint-config and DataJoint-Intro.

Homework 2 (due Sep 8)

  1. Answer questions in Block 1. Submit as a PDF file as a direct message to the TA and the instructor on Slack.

Weeks 3-5: (Sep 6, Sep 13, Sep 20)

Key concepts

  • Data models: structured (schema) and self-desccribing (schema-less).
  • How a database table works.
  • Schema design. Simple queries. Primary key. Foreign keys. Entity integrity. Referential integrity.
  • Normalization. First normal form. Entity normalization.
  • Diagramming. Entity-Relationship Diagrams. DataJoint diagrams.
  • See the Language notebook used in Lecture 3.
  • Relational algebra: restriction and projection. The structure of the SELECT Statement.
  • SELECT ... FROM ... WHERE ... ORDER BY ... LIMIT ...
  • Fetch. Order by. Offset and Limit.
  • Declarative queries vs. imperative queries
  • Default values. Nullable attributes. Nullable foreign keys.
  • UUID, surrogate keys, secondary unique constraints

Readings:

Assignment 3 -- Due Sep 22 (extended by one week)

Resources and exercises

Weeks 6-8 (Sep 27, Oct 4, Oct 11 (Fall Break), Oct 18)

Surrogate keys / Natural keys. Indexes, secondary unique keys. Data serialization - blobs.

Notebooks

Advanced Queries:

  • Subqueries
  • Inner Joins: cross join, theta join, equijoin, natural join
  • Aggregations

Reading: Harrington Chapters 16-19

Weeks 9, 10 (Oct 25, Nov 1, Nov 8)

Modeling relationships:

  • Sequences and workflows.
  • Specialization / generalization.
  • Hierarchies (ownership, composite keys, partial keys)
  • Groupings
  • Directed Graphs (including trees)
  • Unidrected Graphs.
  • Master-part relationships

Transaction processing.

Reading:

  • Harrington: Chapters 13, 14, 15 (case studies), 22 (concurrency and transactions)

Weeks 11-13 (Nov 15 --- no lecture, work on Final; Nov 22, Nov 29):

Putting it all together. Practical examples

Final: Design due on Nov 27. Complete submission Dec 13.