Scalable ML Systems related to Database Technologies

A list of papers on adopting machine learning techniques for data management tasks, focusing mainly on those published in top data management venues.

Large Scale Machine Learning

DB4ML - An In-Memory Database Kernel with Machine Learning Support. SIGMOD 2020, 159-173. Paper
Optimizing Machine Learning Workloads in Collaborative Environments. SIGMOD 2020, 1701-1716. Paper
Dynamic Parameter Allocation in Parameter Servers. PVLDB 13(11), 1877-1890, 2020. Paper
Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers. PVLDB 12(11), 1399-1413, 2019. Paper
PS2: Parameter Server on Spark. SIGMOD 2019: 376-388. Paper
MLlib*: Fast Training of GLMs Using Spark MLlib. ICDE 2019: 1778-1789. Paper
On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. PVLDB 11(12): 1755-1768, 2018. Paper
FlexPS: Flexible Parallelism Control in Parameter Server Architecture. PVLDB 11(5): 566-579, 2018. Paper
A Cost-based Optimizer for Gradient Descent Optimization. SIGMOD 2017, 977-992. Paper
SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017. Paper
SystemML: Declarative Machine Learning on Spark. PVLDB 9(13): 1425-1436, 2016. Paper Project
MLbase: A Distributed Machine-learning System. CIDR 2013. Paper Project

An Intermediate Representation for Optimizing Machine Learning Pipelines. PVLDB 12(11), 1553-1567, 2019. Paper
Democratizing Data Science through Interactive Curation of ML Pipelines. SIGMOD 2019, 1171-1188. Paper
Helix: Holistic Optimization for Accelerating Iterative Machine Learning. PVLDB 12(4), 446-460, 2018. Paper
KeystoneML: Optimizing pipelines for large-scale advanced analytics. ICDE 2017: 535-546. Paper Project

Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent. SIGMOD 2019, 1517-1534. Paper
SketchML: Accelerating Distributed Machine Learning with Data Sketches. SIGMOD 2018, 1269-1284. Paper
Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12): 960-971, 2016. Paper

SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra. PVLDB 13(11), 1919-1932, 2020. Paper
Enabling and Optimizing Non-linear Feature Interactions in Linear Algebra Over Normalized Data. SIGMOD 2019, 1571-1588. Paper
Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning. PVLDB 12(7): 807-821, 2019. Paper
A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics. PVLDB 11(13): 2168-2182, 2018. Paper Project
Towards Linear Algebra over Normalized Data. PVLDB 10(11): 1214-1225, 2017. Paper
Scalable Linear Algebra on a Relational Database System. ICDE 2017: 523-534. Paper
Learning Generalized Linear Models Over Normalized Data. SIGMOD 2015: 1969-1984. Paper

Vertica-ML: Distributed Machine Learning in Vertica Database. SIGMOD 2020, 755-768. Paper
Declarative Recursive Computation on an RDBMS. PVLDB 12(7): 822-835, 2019. Paper
In-Database Learning with Sparse Tensors. PODS 2018: 325-340. Paper
ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4), 348-361, 2018. Paper
The BUDS Language for Distributed Bayesian Machine Learning. SIGMOD 2017, 961-976. Paper
Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? PVLDB 11(3), 366-379, 2017. Paper
Learning Linear Regression Models over Factorized Joins. SIGMOD 2016, 3-18. Paper
The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 5(12): 1700-1711, 2012. Paper

Sketching Linear Classifiers over Data Streams. SIGMOD 2018: 757-772. Paper Code
DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions. SIGMOD 2018, 1363-1376. Paper
Scalable Training of Hierarchical Topic Models. PVLDB 11(7), 826-839, 2018. Paper
LDA*: A Robust and Large-scale Topic Modeling System. PVLDB 10(11), 1406-1417, 2017. Paper
Scalable Kernel Density Classification via Threshold-Based Pruning. SIGMOD 2017, 945-959. Paper
Heterogeneity-aware Distributed Parameter Servers. SIGMOD 2017: 463-478. Paper
WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. PVLDB 9(10): 744-755, 2016. Paper
Exploiting Matrix Dependency for Efficient Distributed Matrix Computation. SIGMOD 2015: 93-105. Paper

BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. SIGMOD 2019: 1135-1152. Paper
Snuba: Automating Weak Supervision to Label Training Data. PVLDB 12(3), 223-236, 2018. Paper
Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB 11(3), 269-282, 2017. Paper