The Data Engineering subteam of Cornell Data Science

Data Engineering

Who we are:

The CDS Data Engineering subteam exists to provide analysis and processing support to CDS project teams and to develop institutional knowledge in high-throughput computing.

Advisor: Professor Immanuel Trummer
Team Leads: Julia Ng (CS, ECE 2021), Haram Kim (A&S CS 2020)

Team objectives:

  • Improve on existing high-throughput computing frameworks
  • Develop solutions for data analysis problems in CDS projects
  • Provide a reservoir of reference information in data engineering
  • Research and publish means of improving existing DE frameworks

Current Projects:

  • Snapbee: A collaborative effort with Munich Re to improve how students and professors interact with course material.

  • TypeScript DiscreetORM: DiscreetORM is a type-safe, minimal-hassle, easy-to-use ORM (object-relational mapper) for TypeScript.

Previous Projects:

  • Restaurant Review Dashboard: A project to gather reviews and produce analytics for local restaurants in Ithaca.

  • Kubernetes Real-Time Face Recognition: An effort to create a real-time face recognition system using CDS compute servers and Kubernetes, in conjunction with a React frontend.

  • Raspberry Pi Distributed Face Detection: An effort to build a face detection system spread across three Raspberry Pi boards.

  • Formally Verified DBMS: We are setting out to build a prototype formally verified database system. Formal verification is a technique for mathematically proving that a program and its algorithms behave as specified. Essentially, preconditions and postconditions are expressed as mathematical theorems and proven with a mechanized proof assistant. The core system would be built in Coq and compiled down to OCaml, with I/O and command parsing written directly in OCaml.

  • Deterministic Query Approximation: Several recent publications have outlined methods for high-speed query approximation with deterministic bounds, but these methods have not yet been applied to a wide range of queries. The objective of this project is to apply several of these techniques to the TPC-H query benchmarks to demonstrate broader applicability.

  • GPU Acceleration: This project addresses distributed deep learning, which is currently well optimized for multiple GPUs on one machine but not necessarily across multiple machines. Our goal is to research and optimize tools currently in development so that they can be adopted by CDS teams deploying large DL models.

  • Spark ML Optimization: Apache Spark's machine learning modules are not as well-studied as those of other platforms. This project seeks to empirically identify optimal settings for Spark's ML modules to best utilize the platform's unique capabilities.

  • SkinnerDB Parallelization: This project's objective is to experiment with parallelism in Professor Trummer's recently developed database engine, SkinnerDB. SkinnerDB uses a machine learning approach to query optimization, in contrast to the heuristic model used by most current database engines, but has not yet been extended to support multi-core execution.
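
As a rough illustration of what "deterministic bounds" means in the Deterministic Query Approximation project above, the sketch below bounds a SUM query after reading only part of a column. It is a generic textbook-style example, not the method of the publications the project refers to. The function name `bounded_sum`, the `prices` column, and the assumption that the column's minimum and maximum are known in advance (for instance from catalog statistics) are all hypothetical.

```python
def bounded_sum(values, scanned, col_min, col_max):
    """Return (lower, upper) bounds on sum(values) after reading only
    the first `scanned` rows.

    Assumption: col_min and col_max are known ahead of time and bound
    every value in the column, so each unread row contributes at least
    col_min and at most col_max to the total.
    """
    partial = sum(values[:scanned])       # exact sum of the rows read so far
    remaining = len(values) - scanned     # number of rows not yet read
    lower = partial + remaining * col_min
    upper = partial + remaining * col_max
    return lower, upper


# Hypothetical column; only the first three rows are scanned.
prices = [4, 9, 2, 7, 5, 8]
lower, upper = bounded_sum(prices, scanned=3, col_min=2, col_max=9)
print(lower, upper)
```

Unlike sampling-based estimates, the true answer is guaranteed to lie inside the returned interval, and the interval shrinks to the exact sum once every row has been read.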

Members (FA2019):