curated list of awesome tools and libraries for specific domains
- python
- plotting raster https://github.com/fmaussion/salem
- raster handling http://xarray.pydata.org/en/stable/
- multi dimensional arrays http://xarray.pydata.org/en/stable/
- spatial data including joins (works with dask) http://geopandas.org
- cleaning of addresses: https://github.com/openvenues/libpostal
- postgis
- multi dimensional
- hadoop
- http://www.geomesa.org
- https://github.com/DataSystemsLab/GeoSpark
- https://github.com/harsha2010/magellan
- https://github.com/locationtech/geowave
- https://github.com/locationtech/geotrellis
- https://github.com/Esri/spatial-framework-for-hadoop and https://github.com/Esri/gis-tools-for-hadoop, as well as their Java API https://github.com/Esri/geometry-api-java
- http://www.nltk.org/book/
- https://github.com/keon/awesome-nlp
- https://github.com/JohnSnowLabs/spark-nlp
- https://github.com/databricks/spark-corenlp (check the license extra carefully for commercial setups)
- pyspark with https://spacy.io
- https://explosion.ai
- https://github.com/clulab/processors
- https://github.com/google/sling
- https://github.com/facebookresearch/faiss
- https://github.com/bplank/bilstm-aux
- https://github.com/facebookresearch/fastText
- https://github.com/facebookresearch/InferSent
- parsing HTML
- clustering
- general operations
- logging & alerting
- certificates
- https://certbot.eff.org and https://letsencrypt.org for free and automated HTTPS/SSL certificates
- hadoop monitoring
- testing
- data quality
- packer base images
small
- prediction
- feature extraction
hadoop
- handling & prediction
- https://github.com/sryza/spark-timeseries
- https://spark-summit.org/2016/events/huohua-a-distributed-time-series-analysis-framework-for-spark/
- https://github.com/twosigma/flint
- https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html
- correlation https://github.com/Sotera/correlation-approximation
- https://github.com/sryza/spark-timeseries
- anomaly detection
- storage
model metadata
- https://github.com/IDSIA/sacred
- http://studio.ml (also hyper opt)
- https://github.com/mitdbg/modeldb
- https://dataversioncontrol.com
- https://www.comet.ml
- https://aetros.com
model building
- feature engineering
- small
- http://scikit-learn.org/stable/
- R
- python
- hadoop
- ensembling
- specific great models
- gradient boosted trees
- xgboost
- lightgbm
- catboost https://github.com/catboost/catboost
- gradient boosted trees
- visualization of results
model serving
- own API wrapper around original model code
- http://clipper.ai
- https://www.acumos.org
- https://polyaxon.com
- http://vespa.ai
- https://github.com/RedisLabsModules/redis-ml
- https://riseml.com
- https://github.com/Hydrospheredata/mist
- https://github.com/Azure/ai-toolkit-iot-edge
- https://www.dominodatalab.com and various other cloud data science workbenches
- https://datmo.com
- https://aws.amazon.com/de/sagemaker/
model serialization
hyperparameter tuning
e2e
ml solutions
bridging python / r and big data
- http://blog.madhukaraphatak.com/pipe-in-spark/
- sparklyr
- https://github.com/apple/turicreate out-of-core models on medium-sized data
graph processing
- hadoop
- non hadoop
- https://neo4j.com (single master, multi slave cluster possible)
- tutorial
- telco hadoop geospatial
- https://www.youtube.com/watch?v=VtvP54Xo3Ek&feature=youtu.be
- streaming and declarative models: https://www.youtube.com/watch?v=Do7C4UJyWCM
- ml
- ml pipelines https://www.youtube.com/watch?v=cpR6Vkp7ImA
- shingles and pipelines https://www.youtube.com/watch?v=qkrh35IF2SU, https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science
- gradient boosting comparison: https://www.youtube.com/watch?v=5CWwwtEM2TA
- streaming
- kafka https://www.youtube.com/watch?v=MNPI925PFD0
- spark streaming in depth https://www.youtube.com/watch?v=hyZU_bw1-ow
- python https://github.com/mrocklin/streamz
- python
- https://python-graph-gallery.com for inspiration
- seaborn
- R
- ggplot2 + great themes
- javascript
bi & dashboarding
- https://metabase.com
- https://looker.com
- python
- https://github.com/stitchfix/pyxley notebooks
- jupyter
- zeppelin
type safety
- stan
- pymc3
- https://github.com/uber/pyro
- https://www.cockroachlabs.com (spanner)
- https://www.snowflake.net/de/
- https://snowplowanalytics.com/products/snowplow-open-source/
- hbase-spark
- via phoenix spark
- https://github.com/hortonworks/shc-release/tree/HDP-2.6.3.0-235-tag
- postgres on GPUs http://www.brytlyt.com
- improved cassandra: scylla http://www.scylladb.com
- https://www.mapd.com/platform/
- https://clickhouse.yandex
time series DBs
big real time analytics and data integration
- https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
- https://www.quora.com/Should-I-use-Gobblin-or-Spark-Streaming-to-injest-data-from-Kafka-to-HDFS/answer/Prithiviraj-Damodaran
- typesafe configuration
- https://cir.is/docs/validation
- https://github.com/pureconfig/pureconfig
- founding / payments https://stripe.com/atlas
- errors
- https://github.com/actionml/universal-recommender
- https://github.com/DataSystemsLab/recdb-postgresql
- apache atlas
- cloudera navigator
- https://www.waterlinedata.com (hadoop only)
- https://alation.com (all)
- https://www.privitar.com
- data mining