
This study evaluates modern deep learning architectures on a fine-grained image classification task: the recognition of dog breeds. Using the Stanford Dogs Dataset, we compare the performance of Vision Transformer (ViT), VGG-16, and ResNet-50 models, aiming to surpass the benchmark set by Hsu (2015) with conventional convolutional neural networks (CNNs). ViT adapts the transformer architecture, originally designed for natural language processing, to image classification by splitting each image into patches and processing them as a sequence of tokens. Our results show substantial accuracy improvements over the baseline established by Hsu (2015): VGG-16 achieved 65% testing accuracy, ResNet-50 achieved 84%, and, surprisingly, ViT outperformed both with 91%. These findings suggest that transformer architectures can handle smaller-scale datasets with fine-grained categories. The study adds to the growing body of research indicating that transformer models are viable for a range of image classification tasks, and it calls for further work to improve their performance as the architecture continues to evolve.
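
To make the idea of treating an image as a sequence of tokens concrete, the following is a minimal NumPy sketch of the patch-splitting step only. The function name `image_to_patch_tokens`, the 224×224 input size, and the 16×16 patch size are illustrative assumptions, not details taken from this study, and the learned linear projection, class token, and position embeddings of a full ViT are omitted.

```python
import numpy as np


def image_to_patch_tokens(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration: returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch token,
    ordered row by row from the top-left patch.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (rows, patch_h, cols, patch_w, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes to the front: (rows, cols, patch_h, patch_w, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into a vector.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Assumed input size and patch size, chosen only for the example.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patch_tokens(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Under these assumed sizes, each image becomes a sequence of 196 tokens of dimension 768, which a transformer encoder can then process much as it would a sequence of word embeddings.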