Readings_MLDB

Paper list about adopting machine learning techniques into data management tasks.

Scalable ML Systems related to Database Technologies

The list mainly covers papers published in top data management venues.

Systems for Big Data

  • DB4ML - An In-Memory Database Kernel with Machine Learning Support. SIGMOD 2020, 159-173.  Paper

  • Optimizing Machine Learning Workloads in Collaborative Environments. SIGMOD 2020, 1701-1716.  Paper

  • Dynamic Parameter Allocation in Parameter Servers. PVLDB 13(11), 1877-1890, 2020.  Paper

  • Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers. PVLDB 12(11), 1399-1413, 2019.  Paper

  • PS2: Parameter Server on Spark. SIGMOD 2019: 376-388.  Paper

  • MLlib*: Fast Training of GLMs Using Spark MLlib. ICDE 2019: 1778-1789.  Paper

  • On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. PVLDB 11(12): 1755-1768, 2018. Paper

  • FlexPS: Flexible Parallelism Control in Parameter Server Architecture. PVLDB 11(5): 566-579, 2018. Paper

  • A Cost-based Optimizer for Gradient Descent Optimization. SIGMOD 2017, 977-992.  Paper

  • SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017.  Paper

  • SystemML: Declarative Machine Learning on Spark. PVLDB 9(13): 1425-1436, 2016. Paper Project

  • MLbase: A Distributed Machine-learning System. CIDR 2013. Paper  Project

Pipeline

  • An Intermediate Representation for Optimizing Machine Learning Pipelines. PVLDB 12(11), 1553-1567, 2019.  Paper

  • Democratizing Data Science through Interactive Curation of ML Pipelines. SIGMOD 2019, 1171-1188.  Paper

  • Helix: Holistic Optimization for Accelerating Iterative Machine Learning. PVLDB 12(4), 446-460, 2018. Paper

  • KeystoneML: Optimizing pipelines for large-scale advanced analytics. ICDE 2017: 535-546.  Paper Project

Compression

  • Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent. SIGMOD 2019, 1517-1534.  Paper

  • SketchML: Accelerating Distributed Machine Learning with Data Sketches. SIGMOD 2018, 1269-1284. Paper

  • Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12): 960-971, 2016. Paper

Linear Algebra

  • SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra. PVLDB 13(11), 1919-1932, 2020.  Paper

  • Enabling and Optimizing Non-linear Feature Interactions in Linear Algebra Over Normalized Data. SIGMOD 2019, 1571-1588.  Paper

  • Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning. PVLDB 12(7): 807-821, 2019.  Paper

  • A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics. PVLDB 11(13): 2168-2182, 2018.  Paper Project

  • Towards Linear Algebra over Normalized Data. PVLDB 10(11): 1214-1225, 2017. Paper

  • Scalable Linear Algebra on a Relational Database System. ICDE 2017: 523-534. Paper

  • Learning Generalized Linear Models Over Normalized Data. SIGMOD 2015: 1969-1984.  Paper

Relying on Database Systems

  • Vertica-ML: Distributed Machine Learning in Vertica Database. SIGMOD 2020, 755-768.  Paper

  • Declarative Recursive Computation on an RDBMS. PVLDB 12(7): 822-835, 2019.  Paper

  • In-Database Learning with Sparse Tensors. PODS 2018: 325-340.  Paper

  • ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4), 348-361, 2018. Paper

  • The BUDS Language for Distributed Bayesian Machine Learning. SIGMOD 2017, 961-976. Paper

  • Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? PVLDB 11(3), 366-379, 2017. Paper

  • Learning Linear Regression Models over Factorized Joins. SIGMOD 2016, 3-18. Paper

  • The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 5(12): 1700-1711, 2012. Paper

Specific Algorithms

  • Sketching Linear Classifiers over Data Streams. SIGMOD 2018: 757-772.  Paper  Code

  • DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions. SIGMOD 2018, 1363-1376.  Paper

  • Scalable Training of Hierarchical Topic Models. PVLDB 11(7), 826-839, 2018.  Paper

  • LDA*: A Robust and Large-scale Topic Modeling System. PVLDB 10(11), 1406-1417, 2017.  Paper

  • Scalable Kernel Density Classification via Threshold-Based Pruning. SIGMOD 2017, 945-959. Paper

  • Heterogeneity-aware Distributed Parameter Servers. SIGMOD 2017: 463-478. Paper

  • WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. PVLDB 9(10): 744-755, 2016.  Paper

  • Exploiting Matrix Dependency for Efficient Distributed Matrix Computation. SIGMOD 2015: 93-105. Paper

Training Data Acquisition

  • BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. SIGMOD 2019: 1135-1152. Paper

  • Snuba: Automating Weak Supervision to Label Training Data. PVLDB 12(3), 223-236, 2018. Paper

  • Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB 11(3), 269-282, 2017. Paper