This is the code repository for Modern-Computer-Vision-with-PyTorch, Second Edition, published by Packt.
A practical roadmap from deep learning fundamentals to advanced applications and Generative AI
The authors of this book are Kishore Ayyadevara and Yeshwanth Reddy.
Whether you are a beginner or are looking to progress in your computer vision career, this book guides you through the fundamentals of neural networks (NNs) and PyTorch and how to implement state-of-the-art architectures for real-world tasks.
The second edition of Modern Computer Vision with PyTorch is fully updated to explain and provide practical examples of the latest multimodal models, such as CLIP and Stable Diffusion.
You’ll discover best practices for working with images, tweaking hyperparameters, and moving models into production. As you progress, you'll implement various use cases for facial keypoint recognition, multi-object detection, segmentation, and human pose detection. This book provides a solid foundation in image generation as you explore different GAN architectures. You’ll leverage transformer-based architectures like ViT, TrOCR, BLIP2, and LayoutLM to perform various real-world tasks and build a diffusion model from scratch. Additionally, you’ll utilize foundation models' capabilities to perform zero-shot object detection and image segmentation. Finally, you’ll learn best practices for deploying a model to production.
By the end of this deep learning book, you'll confidently leverage modern NN architectures to solve real-world computer vision problems.
- Get to grips with various transformer-based architectures for computer vision, such as CLIP, Segment Anything, and Stable Diffusion, and test their applications, such as in-painting and pose transfer
- Combine CV with NLP to perform OCR, key-value extraction from document images, visual question-answering, and generative AI tasks
- Implement multi-object detection and segmentation
- Leverage foundation models to perform object detection and segmentation without any training data points
- Learn best practices for moving a model to production
This book provides a hands-on approach to solving over 30 prominent real-world computer vision problems using PyTorch 2.x on actual datasets. Here you’ll learn to build a neural network from scratch and optimize hyperparameters, perform image classification, multi-object detection, segmentation, and more. You'll also explore facial expression manipulation and combining CV with NLP and RL techniques, build generative AI applications, and take your model to production on AWS. By the end of this book, you'll master modern NN architectures and confidently solve real-world CV problems.
In this chapter, we will create a very simple architecture on a simple dataset and mainly focus on how the various building blocks (feedforward, backpropagation, and learning rate) of an ANN help in adjusting the weights so that the network learns to predict the expected outputs from given inputs. We will first learn, mathematically, what a neural network is, and then build one from scratch to have a solid foundation. Then we will learn about each component responsible for training the neural network and code them as well. Overall, we will cover the following topics:
- Comparing AI and traditional machine learning
- Learning about the ANN building blocks
- Implementing feedforward propagation
- Implementing backpropagation
- Putting feedforward propagation and backpropagation together
- Understanding the impact of the learning rate
- Summarizing the training process of a neural network
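As a hedged illustration of the ideas above (not code from the book), here is a minimal feedforward and backpropagation loop for a one-hidden-layer network on a toy dataset, written in NumPy with hypothetical inputs and a hand-picked learning rate:

```python
# Minimal sketch: feedforward + backpropagation for a tiny 1-hidden-layer network.
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # toy inputs
y = np.array([[3.0], [7.0]])             # expected outputs
lr = 0.01                                 # learning rate

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
w2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))

for epoch in range(100):
    # feedforward propagation
    hidden = np.maximum(0, x @ w1 + b1)          # ReLU activation
    pred = hidden @ w2 + b2
    loss = np.mean((pred - y) ** 2)              # mean squared error

    # backpropagation: chain rule applied layer by layer
    grad_pred = 2 * (pred - y) / len(x)
    grad_w2 = hidden.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0, keepdims=True)
    grad_hidden = (grad_pred @ w2.T) * (hidden > 0)
    grad_w1 = x.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0, keepdims=True)

    # weight updates scaled by the learning rate
    w1 -= lr * grad_w1; b1 -= lr * grad_b1
    w2 -= lr * grad_w2; b2 -= lr * grad_b2
```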
In this chapter, we will dive into the foundations of building a neural network using PyTorch, which we will leverage multiple times in subsequent chapters when we learn about various use cases in image analysis. We will start by learning about the core data type that PyTorch works on – tensor objects. We will then dive deep into the various operations that can be performed on tensor objects and how to leverage them when building a neural network model on top of a toy dataset (so that we strengthen our understanding before we gradually look at more realistic datasets, starting with the next chapter). This will allow us to gain an intuition of how to build neural network models using PyTorch to map input and output values. Finally, we will learn about implementing custom loss functions so that we can customize them based on the use case we are solving. Specifically, this chapter will cover the following topics:
- Installing PyTorch
- PyTorch tensors
- Building a neural network using PyTorch
- Using a sequential method to build a neural network
- Saving and loading a PyTorch model
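To give a flavor of what the chapter builds toward, here is a minimal, hypothetical sketch (not the book's code) that combines tensors, a Sequential model, a custom loss function, and model saving/loading on a toy dataset:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # tensor inputs
y = torch.tensor([[3.0], [7.0], [11.0]])                  # tensor targets

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

def my_mse(pred, target):          # custom loss function
    return ((pred - target) ** 2).mean()

opt = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(200):
    opt.zero_grad()
    loss = my_mse(model(x), y)
    loss.backward()                # autograd computes the gradients
    opt.step()

torch.save(model.state_dict(), 'model.pth')          # saving
model.load_state_dict(torch.load('model.pth'))       # loading
```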
In this chapter, we will shift gears and learn how to perform image classification using neural networks. Essentially, we will learn how to represent images and tweak the hyperparameters of a neural network to understand their impact. For the sake of not introducing too much complexity and confusion, we only covered the fundamental aspects of neural networks in the previous chapter. However, there are many more inputs that we tweak in a network while training it. Typically, these inputs are known as hyperparameters. In contrast to the parameters in a neural network (which are learned during training), hyperparameters are provided by the person who builds the network. Changing different aspects of each hyperparameter is likely to affect the accuracy or speed of training a neural network. Furthermore, a few additional techniques such as scaling, batch normalization, and regularization help in improving the performance of a neural network. We will learn about these concepts throughout this chapter.
In this chapter, we will cover the following topics:
- Representing an image
- Why leverage neural networks for image analysis?
- Preparing data for image classification
- Training a neural network
- Scaling a dataset to improve model accuracy
- Understanding the impact of varying the batch size
- Understanding the impact of varying the loss optimizer
- Understanding the impact of varying the learning rate
- Building a deeper neural network
- Understanding the impact of batch normalization
- The concept of overfitting
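As a rough sketch of how these hyperparameters show up in code (with hypothetical values and randomly generated data standing in for a real image dataset), the snippet below scales pixel values, picks a batch size, adds batch normalization, and sets a learning rate for the optimizer:

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

images = torch.randint(0, 256, (1000, 28 * 28)).float()   # stand-in images
labels = torch.randint(0, 10, (1000,))

images = images / 255.0                                    # scaling pixels to [0, 1]
train_dl = DataLoader(TensorDataset(images, labels),
                      batch_size=32, shuffle=True)         # batch size

model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(),
                      nn.BatchNorm1d(128),                 # batch normalization
                      nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)        # optimizer + learning rate

for xb, yb in train_dl:                                    # one training epoch
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```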
In this chapter, we will learn about where traditional deep neural networks fall short. We’ll then learn about the inner workings of convolutional neural networks (CNNs) using a toy example before understanding some of their major hyperparameters, including strides, pooling, and filters. Next, we will leverage CNNs, along with various data augmentation techniques, to address the accuracy limitations of traditional deep neural networks. Following this, we will see what the outcome of the feature learning process in a CNN looks like. Finally, we’ll put our learning together to solve a use case: classifying an image by stating whether it contains a dog or a cat. In doing so, we’ll understand how prediction accuracy varies with the amount of data available for training. By the end of this chapter, you will have a deep understanding of CNNs, which form the backbone of multiple model architectures used for various tasks. The following topics will be covered in this chapter:
- The problem with traditional deep neural networks
- Building blocks of a CNN
- Implementing a CNN
- Classifying images using deep CNNs
- Implementing data augmentation
- Visualizing the outcome of feature learning
- Building a CNN for classifying real-world images
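The following is a minimal, hypothetical CNN definition illustrating the building blocks named above (convolution filters, stride, pooling, and a classification head); the book's own models are more elaborate:

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # 16 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # pooling halves H and W
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, n_classes)       # assumes 224x224 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(4, 3, 224, 224))   # -> shape (4, 2)
```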
In this chapter, we will learn about two families of transfer learning architectures – variants of the Visual Geometry Group (VGG) architecture and variants of the residual network (ResNet) architecture. Along with understanding these architectures, we will apply them to two use cases. The first is age estimation and gender classification, where we will learn to optimize over both cross-entropy and mean absolute error losses at the same time to estimate the age and predict the gender of a person, given an image of that person. The second is facial keypoint detection (detecting keypoints such as the eyes, eyebrows, and chin contour, given an image of a face as input), where we will learn to leverage neural networks to generate multiple (136, instead of 1) continuous outputs in a single prediction. Finally, we will learn about a new library that considerably reduces code complexity across the remaining chapters. In summary, the following topics are covered in this chapter:
- Introducing transfer learning
- Understanding the VGG16 and ResNet architectures
- Implementing facial keypoint detection
- Multi-task learning: Implementing age estimation and gender classification
- Introducing the torch_snippets library
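As a sketch of the transfer learning idea (assuming torchvision's pretrained VGG16; the exact `weights` argument depends on your torchvision version), the snippet below freezes the backbone and replaces the classifier head, here with a hypothetical 136-output layer as in facial keypoint detection:

```python
import torch
from torch import nn
from torchvision import models

model = models.vgg16(weights='IMAGENET1K_V1')    # ImageNet-pretrained VGG16
for param in model.parameters():
    param.requires_grad = False                  # freeze the pretrained backbone

# replace the final classifier layer with a task-specific head
model.classifier[6] = nn.Linear(4096, 136)       # e.g., 136 facial-keypoint outputs

# train only the new head
optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
```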
This chapter will further solidify our understanding of CNNs and the various practical aspects to be considered when leveraging them in real-world applications. We will start by understanding the reasons why CNNs predict the classes that they do by using class activation maps (CAMs). Following this, we will learn about the various data augmentations that can be done to improve the accuracy of a model. Finally, we will learn about the various instances where models could go wrong in the real world and highlight the aspects that should be taken care of in such scenarios to avoid pitfalls. The following topics will be covered in this chapter:
- Generating CAMs
- Understanding the impact of batch normalization and data augmentation
- Practical aspects to take care of during model implementation

Further, you will learn about the preceding topics by implementing models to:
- Predict whether a cell image indicates malaria
- Classify road signals
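As an illustrative sketch of CAM generation (assuming a torchvision pretrained ResNet-18; the book applies the idea to its own trained models), the last convolutional feature maps are weighted by the classifier weights of the predicted class:

```python
import torch
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1').eval()
feature_maps = {}

def hook(module, inp, out):
    feature_maps['conv'] = out                    # capture the last conv output

model.layer4.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224)                 # stand-in for a real image
logits = model(img)
cls = logits.argmax(dim=1).item()                 # predicted class

fmap = feature_maps['conv'][0]                    # shape (512, 7, 7)
weights = model.fc.weight[cls]                    # shape (512,)
cam = (weights[:, None, None] * fmap).sum(0)      # weighted sum over channels
cam = torch.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalize to [0, 1]
```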
In this chapter and the next, we will learn about some of the techniques for performing object detection. We will start by learning the fundamentals – labeling ground-truth bounding boxes using a tool named ybat, extracting region proposals using the selectivesearch method, and measuring the accuracy of bounding box predictions using the intersection over union (IoU) and mean average precision metrics. After this, we will learn about two region proposal-based networks – R-CNN and Fast R-CNN – by first learning about their working details and then implementing them on a dataset that contains images of trucks and buses. The following topics will be covered in this chapter:
- Introducing object detection
- Creating a bounding box ground truth for training
- Understanding region proposals
- Understanding IoU, non-max suppression, and mean average precision
- Training R-CNN-based custom object detectors
- Training Fast R-CNN-based custom object detectors
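For reference, a minimal IoU implementation following the standard definition (a generic sketch, not necessarily the book's exact code) looks like this:

```python
# IoU between two boxes given as [x_min, y_min, x_max, y_max]
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)              # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)        # intersection / union

print(iou([0, 0, 100, 100], [50, 50, 150, 150]))   # ~0.143
```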
In this chapter, we will learn about different modern techniques, such as Faster R-CNN, YOLO, and single-shot detector (SSD), that overcome slow inference time by employing a single model to make predictions for both the class of object and the bounding box in a single shot. We will start by learning about anchor boxes and then proceed to learn how each of the techniques works and how to implement them to detect objects in an image. We will cover the following topics in this chapter:
- Components of modern object detection algorithms
- Training Faster R-CNN on a custom dataset
- Working details of YOLO
- Training YOLO on a custom dataset
- Working details of SSD
- Training SSD on a custom dataset

In addition to the above, as a bonus, we have covered the following in the GitHub repository:
- Training YOLOv8
- Training EfficientDet architecture
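As a quick, hedged example of running a pretrained detector (assuming torchvision's Faster R-CNN; training on a custom dataset additionally requires replacing the prediction head), inference looks roughly like this:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights='DEFAULT').eval()   # COCO-pretrained detector
image = torch.rand(3, 480, 640)                              # stand-in for a real image tensor
with torch.no_grad():
    pred = model([image])[0]                                 # dict of boxes, labels, scores
print(pred['boxes'].shape, pred['labels'].shape, pred['scores'].shape)
```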
In this chapter, we will go one step further by not only drawing a bounding box around an object but also by identifying the exact pixels that contain the object. In addition to that, by the end of this chapter, we will be able to single out instances/objects that belong to the same class. We will also learn about semantic segmentation and instance segmentation by looking at the U-Net and Mask R-CNN architectures. Specifically, we will cover the following topics:
- Exploring the U-Net architecture
- Implementing semantic segmentation using U-Net
- Exploring the Mask R-CNN architecture
- Implementing instance segmentation using Mask R-CNN
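To illustrate the U-Net idea at a glance, here is a deliberately tiny, hypothetical encoder-decoder with a single skip connection; real U-Nets stack several such down/up steps:

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)    # upsample back to input size
        self.head = nn.Conv2d(32, n_classes, 1)              # per-pixel class scores

    def forward(self, x):
        d = self.down(x)                       # encoder features
        b = self.bottleneck(self.pool(d))
        u = self.up(b)
        u = torch.cat([u, d], dim=1)           # skip connection
        return self.head(u)                    # (N, n_classes, H, W)

out = TinyUNet()(torch.randn(1, 3, 64, 64))    # -> (1, 2, 64, 64)
```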
In this chapter, we will take our learning a step further – we will work on more realistic scenarios and learn about frameworks/architectures that are more optimized to solve detection and segmentation problems. We will start by leveraging the Detectron2 framework to train and detect custom objects present in an image. We will also predict the pose of humans present in an image using a pre-trained model. Furthermore, we will learn how to count the number of people in a crowd in an image and then learn about leveraging segmentation techniques to perform image colorization. Next, we will learn about a modified version of YOLO to predict 3D bounding boxes around objects by using point clouds obtained from a LIDAR sensor. Finally, we will learn about recognizing actions from a video. By the end of this chapter, you will have learned about the following:
- Multi-object instance segmentation
- Human pose detection
- Crowd counting
- Image colorization
- 3D object detection with point clouds
- Action recognition from video
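As a small, hedged example for the human pose detection topic (assuming torchvision's pretrained Keypoint R-CNN rather than the book's own pipeline), inference returns 17 keypoints per detected person:

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

model = keypointrcnn_resnet50_fpn(weights='DEFAULT').eval()
image = torch.rand(3, 480, 640)                    # stand-in for a real image
with torch.no_grad():
    pred = model([image])[0]
print(pred['keypoints'].shape)                     # (num_people, 17, 3): x, y, visibility
```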
In this chapter, we will learn about representing an image in a lower dimension using autoencoders and then leveraging the lower-dimensional representation of an image to generate new images by using variational autoencoders. Learning how to represent images in a lower number of dimensions helps us manipulate (modify) the images to a considerable degree. We will also learn about generating novel images that are based on the content and style of two different images. We will then explore how to modify images in such a way that the image is visually unaltered; however, the class corresponding to the image is changed from one to another. Finally, we will learn about generating deepfakes: given a source image of person A, we generate a target image of person B with a similar facial expression as that of person A. Overall, we will go through the following topics in this chapter:
- Understanding and implementing autoencoders
- Understanding convolutional autoencoders
- Understanding variational autoencoders
- Performing an adversarial attack on images
- Performing neural style transfer
- Generating deepfakes
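A minimal, hypothetical autoencoder sketch (not the book's code) that compresses a flattened image into a low-dimensional bottleneck and reconstructs it looks like this:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)                # lower-dimensional representation
        return self.decoder(z), z

model = AutoEncoder()
x = torch.rand(16, 28 * 28)                # stand-in for a batch of flattened images
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)    # reconstruction loss to minimize
```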
In this chapter, we will start by learning about the idea behind what makes GANs work, before building one from scratch. GANs are a vast field that is expanding as we write this book. This chapter will lay the foundation of GANs by covering three variants; we will learn about more advanced GANs and their applications in the next chapter. In this chapter, we will explore the following topics:
- Introducing GANs
- Using GANs to generate handwritten digits
- Using DCGANs to generate face images
- Implementing conditional GANs
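As a hedged sketch of one GAN training step (toy fully connected networks and random stand-in data, not the book's models), the generator and discriminator are updated in alternation:

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 28 * 28)             # stand-in for a batch of real images
noise = torch.randn(32, 64)                # random latent vectors

# discriminator step: push real towards 1 and fake towards 0
fake = G(noise).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator step: fool the discriminator into predicting 1 for fakes
g_loss = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```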
In this chapter, we will learn about leveraging GANs to manipulate images. We will learn about two variations of generating images using GANs – paired and unpaired methods. With the paired method, we provide input and output pair combinations to generate images based on an input image, which we will learn about with the Pix2Pix GAN. With the unpaired method, we specify the inputs and outputs but do not provide one-to-one correspondence between them; instead, we expect the GAN to learn the structure of the two classes and convert an image from one class to another, which we will learn about when we discuss CycleGAN. Another class of unpaired image manipulation involves generating images from a latent space of random vectors and seeing how the images change as the latent vector values change, which we will learn about in the Leveraging StyleGAN on custom images section. Finally, we will learn about leveraging a pre-trained GAN – the Super-Resolution Generative Adversarial Network (SRGAN) – with which we can convert a low-resolution image into a high-resolution one. Specifically, we will learn about the following topics:
- Leveraging the Pix2Pix GAN
- Leveraging CycleGAN
- Leveraging StyleGAN on custom images
- Super-resolution GAN
In this chapter, we will learn how to combine reinforcement learning-based techniques (primarily, deep Q-learning) with computer vision-based techniques. This is especially useful in scenarios where the learning environment is complex and we cannot gather data for all the cases. In such scenarios, we want the model to learn by itself in a simulated environment that resembles reality as closely as possible. Such models come in handy for self-driving cars, robotics, bots in games (real as well as digital), and the field of self-supervised learning in general. We will start by learning about the basics of reinforcement learning and the terminology associated with calculating the value (Q-value) of taking an action in a given state. Then, we will learn about filling a Q-table, which helps to identify the value associated with various actions in a given state. We will also learn about identifying the Q-values of various actions in scenarios where building a Q-table is infeasible due to the high number of possible states; we’ll do this using a Deep Q-Network (DQN). This is where we will understand how to leverage neural networks in combination with reinforcement learning. Then, we will learn about scenarios where the DQN model itself does not work, and address them by using a DQN alongside the fixed targets model. Here, we will play a video game known as Pong by leveraging CNNs in conjunction with reinforcement learning. Finally, we will leverage what we’ve learned to build an agent that can drive a car autonomously in a simulated environment – CARLA. In summary, in this chapter, we will cover the following topics:
- Learning the basics of reinforcement learning
- Implementing Q-learning
- Implementing deep Q-learning
- Implementing deep Q-learning with fixed targets
- Implementing an agent to perform autonomous driving
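As a minimal illustration of tabular Q-learning (a toy environment with hypothetical sizes and hyperparameters), the Q-table update and epsilon-greedy action selection look like this:

```python
import numpy as np

n_states, n_actions = 5, 2
q_table = np.zeros((n_states, n_actions))
gamma, lr, epsilon = 0.9, 0.1, 0.1        # discount factor, learning rate, exploration rate

def update(state, action, reward, next_state):
    # Bellman target: immediate reward plus discounted best next-state value
    target = reward + gamma * q_table[next_state].max()
    q_table[state, action] += lr * (target - q_table[state, action])

def act(state):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(q_table[state].argmax())

update(state=0, action=act(0), reward=1.0, next_state=1)
```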
In this chapter, we will switch gears and learn about how a convolutional neural network (CNN) can be used in conjunction with algorithms in the broad family of transformers, which are heavily used (as of the time of writing this book) in natural language processing (NLP), to develop solutions that leverage both computer vision and NLP. To understand how CNNs and transformers are combined, we will first learn about how vision transformers (ViTs) work and how they help in performing image classification. After that, we will learn about leveraging transformers to transcribe handwritten images using Transformer-based optical character recognition (TrOCR). Next, we will learn about combining transformers and OCR to perform question answering on document images using a technique named LayoutLM. Finally, we will learn about performing visual question answering using a transformer architecture named Bootstrapping Language-Image Pre-training (BLIP2). By the end of this chapter, you will have learned about the following topics:
- Implementing ViT for image classification
- Implementing LayoutLM for document question answering
- Transcribing handwritten images
- Visual question answering using BLIP2
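As a quick, hedged example of ViT-based image classification (assuming a recent Hugging Face transformers version and the `google/vit-base-patch16-224` checkpoint; the book builds on similar models), inference looks roughly like this:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

image = Image.new('RGB', (224, 224))                 # stand-in for a real image
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits                  # one logit per ImageNet class
print(model.config.id2label[logits.argmax(-1).item()])
```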
In this chapter, we will learn about:
- Leveraging image and text embeddings to identify the most relevant image for a given text and vice versa
- Leveraging image and text encodings to perform zero-shot image object detection and segmentation
- Building a diffusion model from scratch, using which we’ll generate images both conditionally (with text prompting) and unconditionally
- Prompt engineering to generate better images

More specifically, we will learn about contrastive language-image pre-training (CLIP), which can identify images relevant to a given text and vice versa, combining an image encoder and prompt (text) encoder to identify regions/segments within an image. We’ll learn how to leverage CNN-based architectures with the Segment Anything Model (SAM) to perform zero-shot segmentation ~50X faster than transformer-based zero-shot segmentation. We will also learn about the fundamentals of Stable Diffusion models and the XL variant.
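As a small, hedged example of the CLIP idea (assuming the Hugging Face transformers CLIP implementation and the `openai/clip-vit-base-patch32` checkpoint), the snippet below scores candidate captions against an image:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.new('RGB', (224, 224))                 # stand-in for a real image
texts = ['a photo of a cat', 'a photo of a dog']
inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # image-text similarity scores
print(logits.softmax(dim=1))                         # probability per caption
```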
In this chapter, we will learn about the model training process and code some of the applications of diffusion that help achieve the above. In particular, we will cover the following topics:
- In-painting
- ControlNet
- DepthNet
- SDXL Turbo
- Text2Video
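As a hedged sketch of the in-painting application (assuming the diffusers library and the `stabilityai/stable-diffusion-2-inpainting` checkpoint; ControlNet, SDXL Turbo, and Text2Video use analogous pipelines), a masked region of an image is regenerated from a text prompt:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    'stabilityai/stable-diffusion-2-inpainting', torch_dtype=torch.float16
).to('cuda')

image = Image.new('RGB', (512, 512))      # stand-in for the original image
mask = Image.new('L', (512, 512), 255)    # white pixels mark the region to repaint
result = pipe(prompt='a red sports car', image=image, mask_image=mask).images[0]
result.save('inpainted.png')
```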
In this chapter, we will deploy a simple application, progressively improve latency while modifying model parameters/architecture, and build a mechanism to identify input drift. The following topics will be covered in this chapter:
- Understanding the basics of an API
- Creating an API and making predictions on a local server
- Quantizing a model to fp16
- Identifying data drift
- Building a vector store using Facebook AI Similarity Search (FAISS)
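As a brief, hedged sketch covering two of the topics above (casting a model to fp16 and building a FAISS vector store; the API and data-drift pieces are omitted), with hypothetical model and embedding sizes:

```python
import numpy as np
import torch
from torch import nn
import faiss

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
model_fp16 = model.half().eval()                 # fp16 weights roughly halve memory use

embeddings = np.random.rand(1000, 128).astype('float32')   # stand-in embeddings
index = faiss.IndexFlatL2(128)                   # exact L2 similarity-search index
index.add(embeddings)
distances, ids = index.search(embeddings[:1], 5)            # 5 nearest neighbors
print(ids)
```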
If you feel this book is for you, get your copy today!
With the following software and hardware list, you can run all of the code files present in the book.
| Chapter | Software/hardware required | OS required |
| --- | --- | --- |
| 1 - 18 | Minimum 8 GB RAM, Intel i5 processor or better | Windows, Mac OS X, and Linux (any) |
| | NVIDIA 8+ GB graphics card – GTX1070 or better | |
| | Minimum 50 Mbps internet speed | |
| | Python 3.6 and above | |
| | PyTorch 2.x | |
| | Google Colab (can run in any browser) | |
Join the community on the Discord server for the latest updates and discussions: Discord
If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost. Simply click on the link to claim your free PDF. Free-Ebook
We also provide a PDF file that has color images of the screenshots/diagrams used in this book at GraphicBundle
V Kishore Ayyadevara leads a team focused on using AI to solve problems in the healthcare space. He has more than 10 years' experience in the field of data science with prominent technology companies. In his current role, he is responsible for developing a variety of cutting-edge analytical solutions that have an impact at scale while building strong technical teams. Kishore has filed 8 patents at the intersection of machine learning, healthcare, and operations. Prior to this book, he authored four books in the fields of machine learning and deep learning. Kishore got his MBA from IIM Calcutta and his engineering degree from Osmania University.
Yeshwanth Reddy is a senior data scientist with a strong focus on the research and implementation of cutting-edge technologies to solve problems in the health and computer vision domains. He has filed four patents in the field of OCR. He also has 2 years of teaching experience, during which he delivered sessions to thousands of students in the fields of statistics, machine learning, AI, and natural language processing. He completed his MTech and BTech at IIT Madras.