In this graduate-level course, we will investigate large generative AI models for scientific and engineering problems: machine learning models that can generate outputs, such as hypotheses, designs, or simulations, based on patterns learned from scientific and engineering data. In more detail:
- Generative: The model can produce new content based on patterns and structures learned during training. In the science and engineering context, it could generate predictions about untested physical phenomena, propose new design configurations for a mechanical system, or simulate the performance of a new material, to name a few examples.
- AI model: It uses machine learning algorithms to learn from data and generate outputs. These algorithms identify patterns and make predictions or decisions without being explicitly programmed for a specific task.
- Large: The model has many parameters, i.e., elements that are learned from the data during training. More parameters allow the model to learn more complex patterns, but also require more computational resources to train and use.
- For Science and Engineering: Training data might include scientific articles, textbooks, lab reports, CAD models, numerical simulation data, experimental data, or any other type of data that is relevant to these fields. Training the model on this type of data equips it to generate outputs that are relevant to scientific and engineering tasks.
This course is directly relevant to the AuroraGPT project, which is developing a trillion-parameter generative AI model to be trained on Argonne's new 64,000-GPU Aurora supercomputer. It also connects to the work of the Trillion Parameter Consortium, which engages researchers worldwide seeking to apply generative AI to scientific problems.
The course will take place as CMSC 35200-1: Deep Learning Systems in the fall quarter of 2023 at the University of Chicago, on Tuesdays and Thursdays, 3:30-4:50pm. For more information, please contact Profs Ian Foster and Rick Stevens.
We will study the theoretical underpinnings of such models, their training paradigms, and their applications. We will explore how these models can generate new data that are statistically similar to their training data, including text, images, and potentially more abstract representations, and how this capacity can be harnessed for scientific discovery and engineering solutions. Key topics include:
- Fundamentals of machine learning and deep learning
- Overview of large-scale generative models
- Deep dive into generative models like GPTs, GANs, VAEs
- Training and fine-tuning strategies for generative models
- Use cases of generative AI in various scientific fields like physics, chemistry, and biology, and in engineering disciplines such as materials science and electrical engineering
- Practical sessions on implementation of these models with popular deep learning frameworks
- Exploration of the limitations and ethical considerations of using AI in science and engineering
By the end of the course, students will have an understanding of how to implement and use generative AI models, how to apply them to problems in science and engineering, and how to navigate the ethical considerations that arise with the use of AI in these fields.
We will spend considerable time reading and discussing key papers in this area. In addition, the course will have a strong practical component, with students training models, applying them to science and engineering problems, and evaluating their performance. Initial ideas for activities:
- Experiment with the use of LLMs on simple scientific problems.
- LLM definition and training: define and train a small LLM from scratch (see the sketch after this list).
- Fine-tuning/specialization for various tasks.
- Extend AutoGPT or similar for a scientific problem.
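For the "define and train a small LLM from scratch" exercise, a minimal sketch in PyTorch, in the spirit of the minGPT resource listed below, might look like the following. The corpus path, model dimensions, and training hyperparameters are illustrative placeholders rather than course-specified values.

```python
# Minimal sketch: a character-level transformer language model trained from scratch.
# Corpus path, model sizes, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_head=4, n_layer=2, block_size=64):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.blocks(x, mask=mask)            # causal self-attention
        return self.head(x)                      # logits over next characters

# Toy corpus; in the course this would be replaced by real scientific text.
text = open("corpus.txt").read()                 # placeholder path
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

model = TinyGPT(vocab_size=len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(1000):                         # illustrative training loop
    ix = torch.randint(len(data) - 65, (32,))    # 32 random windows of 64+1 characters
    batch = torch.stack([data[i:i + 65] for i in ix])
    x, y = batch[:, :-1], batch[:, 1:]           # predict each next character
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(step, loss.item())
```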
(A work in progress!)
- A jargon-free explanation of how AI large language models work, Timothy Lee and Sean Trott, 7/31/2023
- On the Opportunities and Risks of Foundation Models, Rishi Bommasani et al., 2021.
- 4 Charts That Show Why AI Progress Is Unlikely to Slow Down, Time, August 2, 2023.
- [To be added: the "Attention Is All You Need" paper, the GPT-3 and GPT-4 papers, the Claude paper, and the Falcon and Llama 2 papers]
- Scientific discovery in the age of artificial intelligence, Hanchen Wang et al., 2023.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao et al., 2020.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo et al., 2023.
- Textbooks Are All You Need, Suriya Gunasekar et al., 2023.
- Scaling TransNormer to 175 Billion Parameters, Zhen Qin et al., 2023.
- TBD
- EleutherAI evaluation harness
- HELM (Holistic Evaluation of Language Models) paper
- BIG-bench paper
- Elo comparison and HF leaderboard
- [LLaMA-Adapter](https://arxiv.org/pdf/2303.16199.pdf)
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, Stephen Casper et al., 2023.
- DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, Zhewei Yao et al., 2023.
- On Provable Copyright Protection for Generative Models, Nikhil Vyas et al., 2023.
- Emergent and Predictable Memorization in Large Language Models, Stella Biderman et al., 2023.
- Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4, Kent K. Chang et al., 2023.
- Generative Models as an Emerging Paradigm in the Chemical Sciences, Dylan M. Anstine et al., 2023.
- GPT-4 Reticular Chemist for MOF Discovery and ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis, Zhiling Zheng et al., 2023.
- ChatMOF: An Autonomous AI System for Predicting and Generating Metal-Organic Frameworks, Yeonghun Kang et al., 2023.
- QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers et al., 2023.
- Auto-GPT: An Autonomous GPT-4 Experiment
- ChemCrow: Augmenting large-language models with chemistry tools, Andres Bran et al., 2023.
- RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation, Konstantinos Bousmalis et al., 2023.
- Can large language models democratize access to dual-use biotechnology?, Emily Soice et al., 2023.
- Harms from Increasingly Agentic Algorithmic Systems, Alan Chan et al., 2023.
- TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning, Yury Gorishniy et al., 2023.
- Reclaiming AI as a theoretical tool for cognitive science, Iris van Rooij et al., 2023.
- minGPT with accompanying video.
- Dive into Deep Learning -- online textbook with notebooks.
- ChatALL: Chat with ALL AI Bots Concurrently, Discover the Best
- Scientific Data Acquisition and Organization: Robust data lies at the heart of any sophisticated model. We must therefore first curate large-scale scientific datasets, designing approximately 20 specialized "bundles" across domains such as biology/biochemistry, materials/chemistry, physics/cosmology, and climate/environment. By addressing gaps in existing large language models (LLMs) on intricate scientific problems, our data collection aims to be highly targeted and efficient, enhancing the overall capabilities of our models.
- Model Evaluation Suite Development: With the curated data in place, the second phase centers on constructing expansive model evaluation suites. These suites, tailored to specific dataset collections and subdomains, will validate data and lay the groundwork for model testing. We plan to use current LLMs to help shape problems for AuroraGPT to solve, targeting around 1,000 problems for each scientific subdomain and yielding an infrastructure ready to evaluate models on roughly 20,000 problems in total (a minimal sketch of such an evaluation driver appears after this list).
- Model Construction and Performance Analysis: This pillar turns the curated data into models. We aim to construct models across diverse scales, from 7B to 1000B parameters, leveraging general text, code, and specialized scientific data. Rigorous testing will be conducted on the Polaris and Aurora supercomputers to ensure optimal performance. The setup will harness technologies such as Megatron and DeepSpeed to determine the best parallelism strategies and hardware choices (see the configuration sketch after this list).
- Model Refinement and Deployment: The final phase ensures that our models do not just exist but thrive in real-world applications. Refinement will apply post-training techniques such as instruction tuning ("instruct"), RLHF, and chat alignment, and might employ pipelines such as DeepSpeed RLHF or Alpaca. Automation will be a focus, especially for post-processing and safety checks. As the finishing touch, we plan to launch a Web and API platform for internal testing of AuroraGPT at Argonne before its broader release.
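As a concrete illustration of the evaluation-suite pillar, the following is a minimal sketch of how one subdomain's problem files might be run against a model. The file layout, the question/answer schema, and the `generate` interface are assumptions made for illustration, not the actual AuroraGPT evaluation design.

```python
# Minimal sketch of driving a subdomain evaluation suite over a model.
# The "question"/"answer" schema and directory layout are illustrative assumptions.
import json
from pathlib import Path

def generate(model, prompt: str) -> str:
    """Placeholder for whatever inference interface the evaluated model exposes."""
    raise NotImplementedError

def evaluate_suite(model, suite_dir: str) -> dict:
    """Run a model over one subdomain's problem files and report accuracy per file."""
    results = {}
    for problem_file in sorted(Path(suite_dir).glob("*.json")):
        problems = json.loads(problem_file.read_text())
        correct = 0
        for p in problems:                      # each problem: {"question": ..., "answer": ...}
            prediction = generate(model, p["question"])
            correct += prediction.strip() == p["answer"].strip()
        results[problem_file.name] = correct / len(problems)
    return results

# Usage (hypothetical paths): evaluate_suite(my_model, "suites/materials_chemistry/")
```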
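For the model-construction pillar, large-scale training runs are typically steered by a parallelism and precision configuration. The sketch below shows what a DeepSpeed-style ZeRO configuration might look like; the specific values are illustrative guesses, not the settings planned for Polaris or Aurora.

```python
# Minimal sketch of a DeepSpeed-style training configuration.
# All values are illustrative, not AuroraGPT settings.
ds_config = {
    "train_batch_size": 1024,                 # global batch size across all data-parallel ranks
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                # mixed precision
    "zero_optimization": {
        "stage": 3,                           # ZeRO-3: partition optimizer state, gradients, parameters
        "offload_optimizer": {"device": "cpu"}
    },
    "gradient_clipping": 1.0,
}

# With a model defined, training would typically be wrapped as:
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(model=model,
#                                                  model_parameters=model.parameters(),
#                                                  config=ds_config)
# Tensor and pipeline parallelism across nodes would come from Megatron-style launchers on top of this.
```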
This site is accessible at https://tpc-ai.github.io/genAI-SE/.