My Awesome Data Ops Resources

A curated list of data operations resources, focused on use by cultural heritage organizations.

Books

Papers and Blogs

ETL

Data Quality

Metadata

Pipeline Engineering

Data Ops Software

Data Pipeline Orchestration

  • Airflow an open-source platform to programmatically author, schedule and monitor data pipelines.
  • Apache Oozie an open-source workflow scheduler system to manage Apache Hadoop jobs.
  • DBT (Data Build Tool) is a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
  • BMC Control-M a digital business automation solution that simplifies and automates diverse batch application workloads.
  • DataKitchen a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
  • Reflow Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
  • ElementL A company currently in stealth, founded by former Facebook director and GraphQL co-creator Nick Schrock. Its open-source project is Dagster.
  • Astronomer.io Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
  • Piperr.io Use Piperr’s pre-built data pipelines across enterprise stakeholders, from IT to analytics and from tech and data science to LoBs.
  • Prefect Technologies Open-source data engineering platform that builds, tests, and runs data workflows.
  • Genie Distributed Big Data Orchestration Service by Netflix
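
Several of the tools above describe authoring pipelines "programmatically". As a rough illustration of that idea only, the sketch below declares tasks and their upstream dependencies as plain Python data and runs them in dependency order. It uses only the standard library (`graphlib`, Python 3.9+) and is not the API of Airflow, Dagster, Prefect, or any other tool listed here; the task names, the `TASKS` table, and `run_pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter


def extract():
    """Stand-in for reading records from a source system (hypothetical data)."""
    return [{"id": 1, "title": " Vase "}, {"id": 2, "title": "Map"}]


def transform(records):
    """Trim stray whitespace from titles."""
    return [{**r, "title": r["title"].strip()} for r in records]


def load(records):
    """Stand-in for writing to a target; returns the number of records handled."""
    return len(records)


# Each task name maps to (function, list of upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run every task after its upstream tasks, passing upstream outputs as arguments."""
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return order, results


order, results = run_pipeline(TASKS)
print(order)
print(results["load"])
```

Real orchestrators add scheduling, retries, logging, and monitoring on top of this dependency-ordering core, which is why the sketch should be read as a mental model rather than a substitute.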

Testing and Production Quality

  • ICEDQ software used to automate the testing of ETL/Data Warehouse and Data Migration.
  • Naveego A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
  • DataKitchen a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
  • FirstEigen Automatic Data Quality Rule Discovery and Continuous Data Monitoring
  • Great Expectations Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time).
  • Enterprise Data Foundation Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
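
To make the idea of a pipeline test concrete (a check applied to data when a batch arrives, rather than to code), here is a minimal standard-library sketch. It does not use the Great Expectations API or that of any other tool listed above; the function names, the record fields, and the sample batch are all hypothetical.

```python
def expect_not_null(records, field):
    """Check that every record has a non-empty value for `field`."""
    failed = [r["id"] for r in records if r.get(field) in (None, "")]
    return {"check": f"{field} is not null", "success": not failed, "failed_ids": failed}


def expect_values_in_set(records, field, allowed):
    """Check that every record's `field` value is one of `allowed`."""
    failed = [r["id"] for r in records if r.get(field) not in allowed]
    return {"check": f"{field} in allowed set", "success": not failed, "failed_ids": failed}


def validate_batch(records, checks):
    """Run all checks against one batch and report overall success plus details."""
    report = [check(records) for check in checks]
    return all(item["success"] for item in report), report


# Hypothetical batch of catalogue records arriving from an upstream step.
batch = [
    {"id": "obj-1", "title": "Bronze vase", "type": "object"},
    {"id": "obj-2", "title": "", "type": "object"},
    {"id": "obj-3", "title": "Harbour map", "type": "mapp"},
]

checks = [
    lambda rows: expect_not_null(rows, "title"),
    lambda rows: expect_values_in_set(rows, "type", {"object", "map", "photograph"}),
]

ok, report = validate_batch(batch, checks)
print(ok)
for item in report:
    print(item["check"], item["failed_ids"])
```

A pipeline would typically halt or quarantine the batch when `ok` is false, so that bad records never reach downstream consumers.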

Deployment Automation and Development Sandbox Creation

  • Jenkins a ‘CI/CD’ tool used by software development teams to deploy code from development into production
  • DataKitchen a DataOps Platform that supports the deployment of all data analytics code and configuration.
  • Amaterasu is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.
  • Meltano aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle.

Data Science Model Deployment

  • Domino accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
  • Hydrosphere.io deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.
  • Open Data Group a software solution that facilitates the deployment of analytics using models.
  • ParallelM moves machine learning into production, automates orchestration, and manages the ML pipeline.
  • Seldon streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
  • Metis Machine Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
  • Datatron Automate deployment and monitoring of AI Models.
  • DSFlow Go from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley’s best practices.
  • Datmo Datmo tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
  • MLflow An open source platform for the complete machine learning lifecycle, from Databricks.
  • Studio.ML Studio is a model management framework written in Python to help simplify and expedite your model building experience.
  • Comet.ML Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.
  • Polyaxon An open source platform for reproducible machine learning at scale.
  • Missinglink.ai MissingLink helps data engineers streamline and automate the entire deep learning lifecycle.
  • Kubeflow The Machine Learning Toolkit for Kubernetes
  • Verta.ai Models are the new code!

License

CC0