Welcome to my repository of Data Engineering with Python resources!
Throughout my time as a Data Engineer, I've gathered many bookmarks and resources that have really helped me learn and do my job. I organized these bookmarks and put them in this repository so they can help others too, whether you're new to Data Engineering with Python or looking to know more. I hope you find these resources as helpful as they were for me.
This repository is a handpicked collection of resources for Python developers in data engineering, machine learning, and AI. Inside, you'll discover a neatly arranged selection of frameworks, libraries, and tools crucial for machine learning, ETL, ORM, data/schema validation, database migration, and more, all focused on Python.
Each section includes:
- A concise description of the tools within that category.
- A list of the most relevant tools found in that category.
- A guide on selecting the appropriate tool from each category.
The repository covers the following categories:
- ORMs for Python: Popular ORMs such as SQLAlchemy, Django ORM, and Peewee.
- Data/Schema Validation: Libraries such as Pydantic, Marshmallow, and Cerberus.
- Database Migration Tools: Tools such as Alembic, Flyway, and Django's own migration system.
- Data Wrangling Tools: Libraries for cleaning, transforming, and preparing data, such as Pandas and Dask.
- ETL (Extract, Transform, Load) Frameworks: Tools for extracting data from various sources, transforming it, and loading it into a data store.
- Orchestration Tools: Tools such as Apache NiFi, Luigi, Airflow, and Prefect automate and orchestrate ETL workflows by managing job scheduling and execution. The ETL tasks themselves are typically defined with other dedicated libraries or frameworks.
- Data Visualization Libraries: Libraries for visualizing data, such as Matplotlib, Seaborn, Plotly, and Bokeh.
- Machine Learning Libraries: Not exclusive to data engineering, but useful to have at hand. Includes libraries such as scikit-learn, TensorFlow, and PyTorch.
- Big Data Processing Tools: Resources for tools such as Apache Spark and Apache Hadoop.
- Streaming Data Processing: Tools and frameworks for processing streaming data, such as Apache Kafka, Apache Flink, and Apache Storm.
- Data Modeling Tools: Tools that help in designing database schemas, such as dbdiagram.io, ER/Studio, and MySQL Workbench.
- API Development Frameworks: Data engineering often involves building APIs for data access, so this section covers frameworks such as Flask, FastAPI, and Django REST Framework.
- Data Governance and Metadata Management: Tools and frameworks for managing data access, security, and compliance, such as Apache Atlas, Collibra, and Amundsen.
- Cloud SDKs for Python: SDKs such as boto3 for AWS let Python developers interact with cloud services, automate resource management, and use cloud services from within Python applications.
- Cloud Services and Tools: Resources on cloud platforms widely used in data engineering, such as AWS, Azure, and GCP, with a focus on their data storage, processing, and analytics services.
- Data Storage Solutions: Resources on data storage solutions such as relational databases, NoSQL databases, data lakes, and data warehouses.
- Data Quality Tools: Tools that help ensure data quality, such as Great Expectations, Deequ, and Pandas Profiling (now ydata-profiling).
- Learning Resources: Courses, tutorials, blogs, and books that offer in-depth knowledge about data engineering in Python.
- Community and Forums: Forums and communities where developers can ask questions, share knowledge, and stay up to date with the latest trends in data engineering.
- Free datasets and APIs: A collection of free datasets and APIs, useful for getting hands-on experience while learning data engineering.
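
As a rough illustration of the extract, transform, load pattern described in the ETL category above, here is a minimal sketch that uses only the Python standard library. The sample data, the `payments` table, and all function names are hypothetical and do not come from any tool listed here; a real pipeline would typically rely on one of the dedicated frameworks, with an orchestrator scheduling the steps.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source file or API response.
RAW_CSV = "name,amount\nalice,10.5\nbob,not_a_number\ncarol,7\n"


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: title-case names and drop rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that fail the numeric check
        cleaned.append((row["name"].title(), amount))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run the three steps in order and return the rows that were loaded."""
    load(transform(extract(RAW_CSV)), conn)
    query = "SELECT name, amount FROM payments ORDER BY name"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # in-memory database for the demo
    print(run_pipeline(connection))
```

Keeping each step as a plain function mirrors the split described above: the functions define the ETL work, while an orchestration tool would only decide when and in what order to call them.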
We welcome contributions to this repository! If you'd like to add a resource, please submit a pull request or open an issue to suggest changes. Please ensure your suggestions align with the contribution guidelines outlined in the CONTRIBUTING file.
This project is licensed under the MIT License - see the LICENSE file for details.
- Vajo Lukic - Initial work - vajol
- Inspired by Free Datasets and APIs
During my career as a Data Engineer, I've used many free resources that have really helped me grow. I created this repository to give something back to the community that has helped me so much. I hope these resources will help others just as they helped me. Let's help each other and learn together as we move forward on our learning paths!
For any inquiries or comments about this repository, feel free to connect with me on LinkedIn, follow and reach out on Twitter, or subscribe and send your thoughts via my Substack newsletter.
Your feedback and questions are always welcome!