/Data-Engineer-Books

This repo contains some of the most famous books on data engineering.


Here is a list of some excellent books on data engineering, some of which inevitably cross over into data science, data analysis, and other disciplines.

  1. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

    The book provides a foundational overview of data engineering in a modern Big Data context. It covers many data tools, methods and processes, from collecting and storing data to cleaning and transforming it for use in a number of modern tools and platforms. Key topics include data storage and warehousing, data structures, distributed systems, batch and stream processing, encoding, replication, partitioning and much more.

    A must-read if you want to gain a deep understanding of distributed system design.

    https://amzn.to/39KdAVI

  2. Spark: The Definitive Guide: Big Data Processing Made Simple — Bill Chambers, Matei Zaharia

    Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break Spark topics down into distinct sections, each with its own goals.

    https://amzn.to/3l5VfoK

  3. The Data Warehouse Toolkit — Ralph Kimball, Margy Ross

    This data engineering book offers an overview of established practices and current trends, and includes a clear discussion of newer topics such as big data. It also incorporates new and improved star schema model patterns.

    https://amzn.to/3suwTck

  4. Data Engineering with Python: Work with massive datasets to design data models and automate data pipelines using Python

    This is a genuinely data-engineering-oriented book that focuses on building solid foundations for data science and analysis. It contains plenty of useful how-to information that should be of immediate use to anyone working in data engineering now or in the future.

    https://amzn.to/37Cdtee

  5. Data Pipelines Pocket Reference: Moving and Processing Data for Analytics

    This compact, well-designed book is full of excellent diagrams and contextualised examples. It covers key areas such as ensuring data quality and testing pipelines prior to deployment, along with some other often-overlooked areas. A must-have for any established or budding data engineer.

    https://amzn.to/39P2kaF

  6. 97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts

    This book features essays and interviews with data engineers at top companies including Google, LinkedIn, Twitter and Microsoft. It is full of practical tips and guidance, and will get you up to speed on the latest best practices in data engineering.

    https://amzn.to/3MdNrgr

  7. Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights

    The specific data cleaning techniques it covers include removing duplicate data, handling missing values, monitoring particularly high volumes of data, validating errors, handling outliers and dealing with invalid dates. It also explores how to uncover unexpected values and classification errors using visualisation and exploratory data analysis. This is the perfect book for anyone who works with large volumes of messy or unclean data.

    https://amzn.to/3FEYeh5

  8. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control

    The book covers many areas within both data science and engineering, such as data mining, dimensionality reduction, applied optimisation, machine learning and artificial intelligence. It is probably aimed at higher-level researchers and professionals.

    https://amzn.to/3Pgf3DC

  9. Rebuilding Reliable Data Pipelines Through Modern Tools — Ted Malaska

    This book teaches participants in the data space what the ETL (Extract, Transform, Load) data landscape looks like. It uses a lot of simple but effective metaphors to give a ‘feel’ for what it would be like to work as a data engineer in the areas the book describes.

  10. Big Data: Principles and best practices of scalable realtime data systems — Nathan Marz, James Warren

    Big Data teaches you to build large data systems using an architecture designed specifically for capturing and analyzing web-scale data. This book introduces the Lambda Architecture, an easy-to-understand approach that can be developed and operated by a small team.


  11. Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services

    This practical guide presents a collection of repeatable, generic patterns that help make the development of reliable distributed systems far more approachable and efficient. It demonstrates how you can adapt existing software design patterns for designing and building reliable distributed applications.
