INFO 7390 The Art and Science of Data

Primary language: Jupyter Notebook. License: MIT.

INFO 7390 Advances in Data Sciences and Architecture (The Art and Science of Data)

Garbage In, Garbage Out (GIGO) may be the most widely used maxim in machine learning, but how does one assess the quality of each step in an analysis pipeline? This course teaches students how to understand their data, models, and pipelines using visualization.

Part I - Understanding Data

The first part of the course covers how to understand the statistical properties of a data set visually, how to fix issues in the data, and how to demonstrate graphically that the data was improved. It also covers choosing the right chart for a particular question and the principles of visual design: typography, contrast, balance, emphasis, movement, white space, proportion, hierarchy, repetition, rhythm, pattern, unity, and variety.

1A - Data Preprocessing and Preparation

In this segment of the course, students will be immersed in the crucial initial steps of data science: data preprocessing and preparation. Before any robust analysis can occur, it's essential to ensure the data is clean, relevant, and ready for exploration. We'll begin by introducing the foundational techniques to clean and transform raw data, ensuring its quality and integrity. This involves handling missing data, outliers, and potential errors that can skew results.
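As a rough illustration of the cleaning steps named above, the sketch below fills missing values with the median and flags outliers with the common 1.5 x IQR rule. The function names, the sample readings, and the choice of median imputation and the IQR rule are assumptions made for this example; the course materials may use different techniques.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with two gaps and one suspicious spike.
raw = [12.0, 15.0, None, 14.0, 13.0, 98.0, 16.0, None, 15.0]
filled = impute_median(raw)       # gaps become 15.0, the median of observed values
outliers = iqr_outliers(filled)   # -> [98.0]
print(filled)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that the rule itself cannot make, which is why plotting the data before and after cleaning remains useful.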

Furthermore, students will learn about normalization and standardization processes, enabling them to make disparate data sets comparable. Techniques such as one-hot encoding and binning will be covered, emphasizing the need to make data machine-readable, particularly when preparing for machine learning or statistical modeling.
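The four transformations mentioned above can each be sketched in a few lines of plain Python. This is a minimal illustration rather than the course's own implementation: the function names and sample data are hypothetical, and in practice a library such as pandas or scikit-learn would usually be used instead.

```python
import statistics

def min_max(values):
    """Normalization: rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

def z_score(values):
    """Standardization: shift to mean 0 and scale to unit (population) std. dev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]  # assumes sd > 0

def one_hot(labels):
    """One-hot encoding: one 0/1 column per category, in sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def bin_equal_width(values, n_bins):
    """Binning: map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
colors = ["red", "green", "red", "blue", "green"]

print(min_max(heights_cm))             # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(heights_cm))
print(one_hot(colors))                 # columns are blue, green, red
print(bin_equal_width(heights_cm, 2))  # [0, 0, 1, 1, 1]
```

Min-max scaling preserves the shape of the distribution but is sensitive to extreme values, while z-scores express each value in standard deviations from the mean, which is often what makes two differently scaled data sets comparable.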

As data comes in various forms - from textual and categorical to numerical - we'll delve deep into transforming these different data types to be more suited for analysis. The end goal of this segment is to equip students with the skills needed to turn raw, messy data into a polished and prepared asset, setting a solid foundation for subsequent analysis and visualization.

1B - Data Analysis and Improvement

In this segment of the course, students will be introduced to a visual understanding of a dataset's statistical properties. This includes identifying potential problems within the data and implementing corrective measures. We'll also delve into effective graphical methods to showcase the enhancement and transformation of data. An essential part of this section is selecting the most suitable chart type to address specific questions or insights about the dataset.

1C - Principles of Visual Design in Data Presentation

In this segment, we put emphasis on the art and science of visual design as it pertains to data presentation. Students will explore various fundamental principles, including typography, contrast, and balance. Additionally, we'll discuss advanced design concepts such as emphasis, movement, the strategic use of white space, proportion, hierarchy, and more. We'll also delve into the significance of repetition, rhythm, pattern, unity, and variety, ensuring that the data is not only accurate but also aesthetically compelling and easily comprehensible.

Part II - Generative AI for Data

2A - Understanding Generative AI

Dive deep into the world of generative AI and its impressive capability to produce content. This section introduces its practical uses and constraints. We will differentiate between traditional machine learning models, generative AI, and artificial general intelligence (AGI). Additionally, we'll uncover the primary elements fueling the progress of generative AI.

2B - Building Generative AI Systems

This segment details the crucial procedures involved in crafting generative AI systems. Topics covered include research, design, data gathering, model training, and assessment. Emphasis is placed on the importance of varied datasets and cutting-edge training strategies. We will also explore the different evaluation techniques, highlighting their advantages and drawbacks.

2C - Employing Generative AI for Synthetic Data Creation

In this section, we venture into the practical application of generative AI in fabricating synthetic data. Whether it's text, visuals, videos, or soundscapes, generative AI offers innovative solutions for generating authentic-seeming content. We'll break down the mechanics behind these processes, demonstrating how cutting-edge models can craft content that's nearly indistinguishable from real-world data, and discuss the potential advantages and challenges of using synthetic data across various sectors.

2D - Leveraging Generative AI for Data Verification

In this segment, we explore how generative AI can play a pivotal role in data verification. By harnessing the power of large language models (LLMs) and cross-referencing information across them, we can enhance the accuracy and reliability of our data. We'll discuss the methodologies behind this innovative approach, detailing how multiple LLMs can be employed in tandem to validate the authenticity of a piece of information. Additionally, we'll touch upon the benefits of this process, emphasizing the increased trustworthiness of data and the reduction in misinformation.
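One plausible reading of "cross-referencing information across" several LLMs is a simple agreement vote, sketched below. Everything here is an assumption made for illustration: the `cross_check` function, the verdict labels, the 0.6 agreement threshold, and the three stand-in "models", which are ordinary Python functions rather than calls to any real LLM API.

```python
from collections import Counter

def cross_check(claim, models, min_agreement=0.6):
    """Ask every model for a verdict; accept the majority only if it is strong enough."""
    verdicts = [model(claim) for model in models]
    label, votes = Counter(verdicts).most_common(1)[0]
    agreement = votes / len(verdicts)
    if agreement < min_agreement:
        return "uncertain", agreement
    return label, agreement

# Stand-ins for real LLM clients (hypothetical): each takes a claim and
# returns "supported" or "refuted". A real client would send a prompt
# to a model and parse its answer into one of these labels.
def model_a(claim):
    return "supported"

def model_b(claim):
    return "supported"

def model_c(claim):
    return "refuted"

verdict, agreement = cross_check(
    "Water boils at 100 degrees Celsius at sea level.",
    [model_a, model_b, model_c],
)
print(verdict, round(agreement, 2))  # supported 0.67
```

Agreement between models is evidence, not proof: models trained on overlapping data can share the same mistake, so a vote like this works best as a filter that routes low-agreement claims to human review.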

Part III - Causal Inference

The third part of the course covers visualizing causal relationships in data. The emphasis is on understanding visual techniques for separating causal relationships from correlation.

3A - What is Causal Inference?

In this segment of the course, we will delve into the fundamental concept of causal inference, demystifying the distinction between correlation and causation. Students will be introduced to the idea that while many factors might be correlated, not all are causative. We will explore the principles underlying causality, elucidating why it's a cornerstone in many scientific disciplines, especially in social sciences, medicine, and economics.

Through real-world examples, learners will come to see why identifying causal factors matters in scenarios ranging from public policy decisions to medical treatments. Discussions will cover the challenges of establishing causality, especially when experiments are impractical or unethical.

The segment will lay the foundation for understanding the differences between observational studies and randomized controlled trials (RCTs), highlighting the strengths and pitfalls of each approach. Moreover, students will be introduced to potential outcomes and counterfactual frameworks, which provide a structured way to think about cause and effect. By the end of this section, students will have a solid grasp of why causal inference is vital, its challenges, and the fundamental tools and concepts used to determine causality in data.
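The potential outcomes framework mentioned above can be written compactly. The notation below is the standard textbook form and is an assumption about how the course presents it: $T_i$ is the binary treatment indicator for unit $i$, and $Y_i(1)$ and $Y_i(0)$ are the outcomes that unit would have with and without treatment.

```latex
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{individual effect; only one of the two outcomes is ever observed} \\
Y_i &= T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)
  && \text{observed outcome} \\
\mathrm{ATE} &= \mathbb{E}\,[\,Y(1) - Y(0)\,]
  && \text{average treatment effect} \\
\mathrm{ATE} &= \mathbb{E}\,[\,Y \mid T = 1\,] - \mathbb{E}\,[\,Y \mid T = 0\,]
  && \text{if } \bigl(Y(0), Y(1)\bigr) \perp\!\!\!\perp T
\end{align*}
```

The last line is why randomized controlled trials are so useful: random assignment makes treatment independent of the potential outcomes, so a plain difference in group means estimates the ATE. In observational studies that independence generally fails, and the difference in means mixes the causal effect with selection effects.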

3B - Visual Techniques in Causal Data

In this segment of the course, the primary focus will be on the visualization of causal relationships within datasets. We aim to provide a robust understanding of visual methods that can be employed to distinguish genuine causal connections from mere correlations. This involves diving deep into concepts such as confounding, causal graphs, and the intricate relationship between Directed Acyclic Graphs (DAGs) and probability distributions. The segment will also highlight the significance of paths and associations, along with the idea of conditional independence through d-separation.
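A small simulation can make confounding and conditional independence concrete. The sketch below assumes the simplest confounded DAG, a fork Z -> X and Z -> Y with no arrow between X and Y. In that graph X and Y are d-separated given Z, so they should be correlated overall but roughly uncorrelated within each level of Z. The variable names, coefficients, and sample size are illustrative choices, not taken from the course.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000

# Fork: Z is a common cause of X and Y; X has no effect on Y.
z = [rng.randint(0, 1) for _ in range(n)]
x = [2.0 * zi + rng.gauss(0, 1) for zi in z]
y = [2.0 * zi + rng.gauss(0, 1) for zi in z]

# Marginal association: the path X <- Z -> Y is open.
marginal = pearson(x, y)

# Conditioning on Z (stratifying) blocks that path.
within = {
    k: pearson(
        [xi for xi, zi in zip(x, z) if zi == k],
        [yi for yi, zi in zip(y, z) if zi == k],
    )
    for k in (0, 1)
}

print(f"corr(X, Y)         = {marginal:.3f}")   # theoretical value is 0.5
for k, r in within.items():
    print(f"corr(X, Y | Z={k}) = {r:.3f}")      # theoretical value is 0
```

Plotting X against Y once with all points in one color and once colored by Z shows the same thing visually: an upward trend overall, and two structureless clouds within the strata.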

3C - Advanced Causal Analysis Techniques

Venturing further into the realm of causality, this segment dives into the practical aspects and methodologies used in observational studies. Techniques and concepts such as optimal matching, sensitivity analysis, and Inverse Probability of Treatment Weighting (IPTW) will be detailed. The course will subsequently move into the nuances of marginal structural models, providing insights on IPTW estimation and the meticulous process of causal effect identification and estimation. This part of the course aims to equip students with the advanced tools and knowledge necessary for in-depth causal analyses in complex scenarios.
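To show what IPTW does, the sketch below simulates an observational study with one binary confounder and compares a naive difference in means with a weighted estimate. The data-generating process, the true effect of 1.0, and the use of stratum frequencies as the propensity model are all assumptions chosen to keep the example small; real analyses usually fit a propensity model such as logistic regression over many covariates.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible
n = 20000

# Simulated study: Z confounds treatment T and outcome Y.
rows = []
for _ in range(n):
    z = rng.randint(0, 1)
    t = 1 if rng.random() < (0.8 if z else 0.2) else 0  # Z makes treatment likelier
    y = 1.0 * t + 3.0 * z + rng.gauss(0, 1)             # true treatment effect is 1.0
    rows.append((z, t, y))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive comparison: biased, because treated units more often have Z = 1.
naive = mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)

# Propensity score e(z) = P(T = 1 | Z = z), estimated by stratum frequency.
propensity = {k: mean(t for z, t, _ in rows if z == k) for k in (0, 1)}

def weighted_mean(arm):
    """Weighted outcome mean in one arm, weights 1/e(z) (treated) or 1/(1 - e(z)) (control)."""
    num = den = 0.0
    for z, t, y in rows:
        if t != arm:
            continue
        e = propensity[z]
        w = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        num += w * y
        den += w
    return num / den  # normalized (Hajek-style) weighted mean

iptw = weighted_mean(1) - weighted_mean(0)

print(f"naive difference in means: {naive:.2f}")  # expected to be near 2.8
print(f"IPTW estimate:             {iptw:.2f}")   # expected to be near 1.0
```

The weights create a pseudo-population in which treatment no longer depends on Z, so the weighted comparison recovers the effect that randomization would have revealed. This only works when all confounders are measured and every unit has a nonzero chance of each treatment, which is what sensitivity analysis is meant to probe.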

Learning Objectives

Upon completion of this course, students will be able to:

  • General Understanding:

    • Recognize the importance of data quality in machine learning and analysis pipelines.
    • Utilize visualization techniques to gain insights into data, models, and pipelines.
  • Understanding Data:

    • Visually interpret the statistical properties of datasets.
    • Identify and rectify issues within datasets.
    • Select appropriate chart types to address specific analytical questions.
    • Apply principles of visual design, including but not limited to typography, contrast, balance, and hierarchy, to enhance data presentation.
  • Data Analysis and Improvement:

    • Implement corrective measures to address potential problems within datasets.
    • Use graphical methods to demonstrate the enhancement and transformation of data.
  • Principles of Visual Design in Data Presentation:

    • Comprehend and apply foundational visual design principles to data representation.
    • Understand advanced design principles like emphasis, movement, proportion, hierarchy, repetition, and unity in data visualization.
  • Generative AI for Data:

    • Understand the fundamental differences and relationships between traditional machine learning, generative AI, and AGI.
    • Comprehend the practical applications and constraints of generative AI.
    • Construct generative AI systems, from research and design to data gathering and training.
    • Utilize generative AI for synthetic data creation and data verification.
  • Causal Inference:

    • Visualize and comprehend causal relationships in datasets.
    • Separate genuine causal connections from correlations using advanced visual techniques.
    • Utilize advanced causal analysis techniques in observational studies.
    • Understand and apply concepts like confounding, causal graphs, Directed Acyclic Graphs (DAGs), and the relationship between DAGs and probability distributions.

Weekly Schedule

Week 1

  • Information Visualization: Foundations, Data Abstraction
  • Fundamental Graphs and Data Transformation
  • Graphical Components and Mapping Strategies
  • Basic data statistics and Exploratory Data Analysis (EDA)

Week 2

  • Perception for Information Visualization
  • Effectiveness of Visual Channels
  • Identifying Statistical Properties: A Visual Approach
  • Data Issue Diagnosis and Rectification Strategies

Week 3

  • Demonstrating Data Improvement through Graphics
  • Chart Selection for Specific Analytical Questions
  • Introduction to Principles of Visual Design
  • Typography, Contrast, and Balance in Data Presentation

Week 4

  • Advanced Visual Design: Emphasis, Movement, White Space
  • The Role of Proportion, Hierarchy, Repetition in Visualization
  • Delving Deeper: Rhythm, Pattern, Unity, and Variety in Design

Week 5

  • Introduction to Generative AI: Distinguishing Traditional ML, Generative AI, and AGI
  • Real-world Applications and Limitations of Generative AI

Week 6

  • Crafting Generative AI Systems: Research and Design Phases
  • The Importance of Diverse Data Gathering

Week 7

  • Advanced Training Strategies in Generative AI
  • Evaluation Techniques: Strengths and Limitations

Week 8

  • Using Generative AI for Synthetic Data Creation: Text, Images, Videos
  • Mechanics Behind Generative Processes and Real-world Implications
  • Generating text

Week 9

  • Generating numeric data
  • Generating images

Week 10

  • Generating audio
  • Generating video

Week 11

  • Leveraging Large Language Models for Data Verification
  • Benefits and Challenges of Cross-referencing Information

Week 13

  • Introduction to Causal Inference in Data Visualization
  • Visual Techniques for Understanding Causality: Basics

Week 14

  • Advanced Causal Analysis Techniques: Observational Studies and Beyond
  • Wrapping Up: IPTW Estimation, Causal Effect Identification, and Future Trends

Week 15

  • Review