/genAI-SE

Site for "Generative AI for Science and Engineering" course

Apache License 2.0

Large-scale Generative AI for Science and Engineering

In this graduate-level course, we will investigate large generative AI models for scientific and engineering problems: machine learning models that can generate outputs, such as hypotheses, designs, or simulations, based on patterns learned from scientific and engineering data. In more detail:

  • Generative: The model can produce new content based on patterns and structures learned during training. In the science and engineering context, it could generate predictions about untested physical phenomena, propose new design configurations for a mechanical system, or simulate the performance of a new material, to name a few examples.
  • AI model: The model uses machine learning algorithms to learn from data and generate outputs. These algorithms are designed to identify patterns and make predictions or decisions without being explicitly programmed to perform a specific task.
  • Large: The model has many parameters, i.e., elements that are learned from the data during training. More parameters allow the model to learn more complex patterns, but also require more computational resources to train and use.
  • For Science and Engineering: Training data might include scientific articles, textbooks, lab reports, CAD models, numerical simulation data, experimental data, or any other type of data that is relevant to these fields. Training the model on this type of data equips it to generate outputs that are relevant to scientific and engineering tasks.
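
To make "Large" concrete, the sketch below estimates the parameter count of a decoder-only transformer. It is an illustration rather than course material: it assumes the common rule of thumb of about 12 × d_model² weights per layer (attention plus a 4×-wide feed-forward block) plus a token-embedding matrix, and it ignores biases, layer norms, and position embeddings. The function name and the two configurations are hypothetical.

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Assumption: each layer holds about 12 * d_model^2 weights
    (4 * d_model^2 for attention, 8 * d_model^2 for a 4x-wide MLP).
    Biases, layer norms, and position embeddings are ignored.
    """
    per_layer = 12 * d_model * d_model
    embedding = vocab_size * d_model  # token-embedding matrix
    return n_layers * per_layer + embedding


# Hypothetical configurations, chosen only for illustration.
small = approx_transformer_params(n_layers=12, d_model=768, vocab_size=50257)
large = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)

print(f"small: {small / 1e6:.0f}M parameters")
print(f"large: {large / 1e9:.0f}B parameters")
```

Because the per-layer term grows with the square of d_model, widening a model raises its parameter count, and hence its training and inference cost, much faster than adding layers at fixed width.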

This course is directly relevant to the AuroraGPT project, which is developing a trillion-parameter generative AI model to be trained on Argonne's new 64,000-GPU Aurora supercomputer. It also connects to the work of the Trillion Parameter Consortium, which engages researchers worldwide seeking to apply generative AI to scientific problems.

The course will take place as CMSC 35200-1: Deep Learning Systems in the fall quarter of 2023 at the University of Chicago, on Tuesdays and Thursdays, 3:30-4:50pm. For more information, please contact Profs Ian Foster and Rick Stevens.

Topics

We will study theoretical underpinnings of such models, their training paradigms, and applications. We will explore how these models can generate new data that are statistically similar to their training data, including text, images, and potentially more abstract representations, and how this capacity can be harnessed for scientific discovery and engineering solutions. Key topics include:

  • Fundamentals of machine learning and deep learning
  • Overview of large-scale generative models
  • Deep dive into generative models like GPTs, GANs, VAEs
  • Training and fine-tuning strategies for generative models
  • Use cases of generative AI in various scientific fields like physics, chemistry, and biology, and in engineering disciplines such as materials science and electrical engineering
  • Practical sessions on implementation of these models with popular deep learning frameworks
  • Exploration of the limitations and ethical considerations of using AI in science and engineering

By the end of the course, students will have an understanding of how to implement and use generative AI models, how to apply them to problems in science and engineering, and how to navigate the ethical considerations that arise with the use of AI in these fields.

Practical

We will spend much time reading and discussing key papers in this area. In addition, the course will have a strong practical component, with students working to train models, apply them to science and engineering problems, evaluate their performance, etc. Initial ideas of things to cover:

  • Experiment with use of LLMs for simple scientific problems.
  • LLM definition and training: Define and train a small LLM from scratch.
  • Fine-tuning/specialization for various tasks.
  • Extend AutoGPT or similar for a scientific problem.

Potential topics with relevant readings

(A work in progress!)

Concepts

State of the art in large language models

  • [To be added: Attention paper, GPT-3 and GPT-4 papers, Claude paper, Falcon and Llama 2 papers]

Scientific discovery and AI

Training data considerations

Scaling

Model Evaluation

  • EleutherAI: LM Evaluation Harness
  • HELM paper
  • BIG-bench paper
  • Elo comparison and HF leaderboard
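
As background for the Elo comparison item, here is a minimal sketch of the standard Elo update applied to pairwise model comparisons. The model names, the starting rating of 1000, the K-factor of 32, and the list of outcomes are all assumptions made for illustration; public leaderboards may use different constants or a different fitting procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models and pairwise outcomes, listed as (winner, loser).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
battles = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]

for winner, loser in battles:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

print(ratings)
```

The update is zero-sum, so the two ratings always add up to their starting total; an upset win over a higher-rated model moves the ratings more than an expected win does.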

Reinforcement Learning from Human Feedback

Memorization and copyright protection

Applications in chemistry and materials science

Optimizations

Autonomous LLM-based agents

Robotics

Risks

Tabular data

Limitations

Tools

A four-stage path to creating an LLM for science

  1. Scientific Data Acquisition and Organization: Robust data lies at the heart of any sophisticated model. Thus we first must curate large-scale scientific datasets, designing approximately 20 specialized “bundles” across domains like biology/biochemistry, materials/chemistry, physics/cosmology, and climate/environment. By addressing gaps in existing large language models (LLMs) tailored for intricate scientific challenges, our data collection aims to be highly targeted and efficient, enhancing the overall capabilities of our models.

  2. Model Evaluation Suite Development: With the curated data in place, the second phase centers on constructing expansive model evaluation suites. These suites, tailored to specific dataset collections and subdomains, will validate data and lay the groundwork for model testing. We plan on utilizing current LLMs to shape problems that AuroraGPT can solve, targeting around 1,000 problems for each scientific subdomain. Across the approximately 20 bundles, this yields an infrastructure ready to evaluate models on roughly 20,000 problems.

  3. Model Construction and Performance Analysis: This pillar is about breathing life into our data through model building. We aim to construct models across diverse scales, from 7B to 1000B, leveraging general texts, code, and niche scientific data. Rigorous testing will be conducted on elite supercomputers Polaris and Aurora, ensuring optimal performance. The setup will harness technologies like Megatron and DeepSpeed to determine the best strategies for parallelism and fine-tuning hardware choices.

  4. Model Refinement and Deployment: The final phase ensures that our models do not just exist but also thrive in real-world applications. Refinement processes will utilize post-processing tools such as “instruct,” “RLHF,” and “Chat” and might employ pipelines like DeepSpeed RLHF or Alpaca. Automation will be a focus, especially for post-processing and safety checks. As the finishing touch, we plan on launching a Web and API platform for internal testing of AuroraGPT at Argonne before its broader release.
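
The range of model scales in stage 3 (7B to 1000B parameters) helps explain why parallelism strategies matter. The sketch below is a back-of-the-envelope estimate, not a description of the AuroraGPT setup: it assumes weights stored at 2 bytes per parameter (a 16-bit format) and a hypothetical accelerator with 64 GB of memory. The 70B entry is an illustrative intermediate point, and training needs considerably more memory than the weights alone (gradients, optimizer state, activations).

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_devices_for_weights(n_params: float, device_memory_gb: float,
                            bytes_per_param: int = 2) -> int:
    """Smallest number of devices whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / device_memory_gb)


DEVICE_MEMORY_GB = 64.0  # hypothetical accelerator memory, for illustration only

for n_params in (7e9, 70e9, 1000e9):
    gb = weight_memory_gb(n_params)
    devices = min_devices_for_weights(n_params, DEVICE_MEMORY_GB)
    print(f"{n_params / 1e9:.0f}B params: {gb:.0f} GB of weights, "
          f"at least {devices} device(s)")
```

Under these assumptions a 7B model fits on a single device, while a 1000B model must be split across dozens of devices before any training state is counted, which is the situation that model-parallel frameworks such as Megatron and DeepSpeed are designed to handle.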


This site is accessible at https://tpc-ai.github.io/genAI-SE/.