CS673-Scalable-Databases

Repository for storing code for my MS in Data Science course CS673 Scalable Databases at Pace University.

Course description: After reviewing relational databases and SQL, students will learn the fundamentals of alternative data storage schemas to deal with large amounts of data (structured and unstructured). The course covers big data and the development of the Hadoop file system, the MapReduce programming paradigm, and database management systems such as Cassandra, HBase, and Neo4j. Students will learn about NoSQL, distributed databases, and graph databases. The course emphasizes the differences between traditional database management systems and alternatives with respect to accessibility, cost, transaction speed, and structure. Part of the course is dedicated to accessing, handling, and processing data from different sources and of different types using Python. The course provides hands-on practice.
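As a rough illustration of the MapReduce paradigm mentioned in the course description, here is a minimal single-process word count in plain Python. It is only a sketch of the map, shuffle, and reduce steps, not Hadoop code and not taken from the course materials; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big tables"]))
```

In a real cluster, the map and reduce steps would run in parallel on different machines and the shuffle would move data between them; here everything runs in one process to keep the idea visible.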

Project 1

In this project, I used SQL to analyze a dataset of my choosing, the Data Scientist Salaries 2023 dataset. I created a local database, defined tables, and wrote queries to explore the data.
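The workflow described above (create a database, create a table, query it) can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration, not the project's actual code: the table name, column names, and rows below are made up for the example and are not taken from the dataset, and an in-memory database stands in for the local one.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one hypothetical salaries table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; the real dataset's columns may differ.
    conn.execute(
        "CREATE TABLE salaries ("
        "job_title TEXT, experience_level TEXT, salary_usd INTEGER)"
    )
    # Made-up rows for illustration only, not real dataset values.
    rows = [
        ("Data Scientist", "SE", 150000),
        ("Data Scientist", "MI", 110000),
        ("Data Engineer", "SE", 140000),
        ("Data Engineer", "EN", 80000),
    ]
    conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def average_salary_by_title(conn):
    """Return (job_title, average salary) pairs, highest average first."""
    query = """
        SELECT job_title, AVG(salary_usd) AS avg_salary
        FROM salaries
        GROUP BY job_title
        ORDER BY avg_salary DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for title, avg_salary in average_salary_by_title(conn):
        print(f"{title}: {avg_salary:.0f}")
    conn.close()
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would persist the database on disk.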

Project 2

In this project, I completed a set of basic Python exercises.

Project 3

In this project, I completed a set of tasks using Spark SQL on Apache Spark. You can also find my results here on Databricks Community.

Midterm Project

In this project, my partner and I analyzed the Data Scientist Salaries 2023 dataset. We performed exploratory data analysis (EDA), data cleaning, wrangling, and manipulation to answer targeted queries and extract insights from the dataset. We recorded our presentation on this project here: https://youtu.be/z1-39Pkm-2E