Compressive Big Data Analytics (CBDA)
The theoretical foundations of Big Data Science are not yet fully developed. The CBDA project investigates a new Big Data theory for high-throughput analytics and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data Analytics (CBDA) iteratively generates random (sub)samples from the Big Data collection, applies established techniques to develop model-based or non-parametric inference on each sample, repeats the (re)sampling and inference steps many times, and finally uses bootstrapping techniques to quantify probabilities, estimate likelihoods, or assess the accuracy of the findings. This approach may provide a scalable solution that avoids some of the challenges of Big Data management and analytics. CBDA sampling is conducted at the data-element level, not at the case level, so the sampled values need not be consistent across all data elements (e.g., high-throughput random sampling of cases and of variables within cases). An alternative approach is to use Bayesian methods to investigate the theoretical properties (e.g., asymptotics as sample sizes increase to infinity while the data remain sparse) of model-free inference based entirely on the complete dataset, without any parametric or model-limited restrictions.
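The iterative subsample–infer–bootstrap loop described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the sample-mean estimator stands in for any model-based or non-parametric inference step, and the function name and fraction parameters are hypothetical. Note that, as in CBDA, each iteration draws a random subset of cases and a random subset of variables, so sampled values are not consistent across data elements.

```python
import numpy as np

def cbda_estimate(data, n_iter=200, case_frac=0.1, var_frac=0.5, seed=None):
    """Sketch of CBDA-style inference (hypothetical helper):
    repeatedly draw random subsamples of both cases (rows) and
    variables (columns), compute an estimate on each subsample,
    then summarize the distribution of estimates bootstrap-style."""
    rng = np.random.default_rng(seed)
    n_cases, n_vars = data.shape
    estimates = []
    for _ in range(n_iter):
        # data-element-level sampling: random cases AND random variables
        cases = rng.choice(n_cases, size=max(1, int(case_frac * n_cases)),
                           replace=False)
        cols = rng.choice(n_vars, size=max(1, int(var_frac * n_vars)),
                          replace=False)
        sub = data[np.ix_(cases, cols)]
        # placeholder inference step; any estimator/model could go here
        estimates.append(sub.mean())
    estimates = np.asarray(estimates)
    # point estimate plus a bootstrap-style 95% interval over iterations
    return estimates.mean(), np.percentile(estimates, [2.5, 97.5])

# usage: recover the grand mean of a large matrix from small subsamples
data = np.random.default_rng(0).normal(loc=5.0, size=(10_000, 100))
point, ci = cbda_estimate(data, seed=1)
```

Because each iteration touches only a small fraction of the data, the full collection never needs to be loaded or modeled at once, which is the scalability argument made above.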
This project investigates the parallels between the established field of compressive sensing (CS) for signal representation, reconstruction, recovery, and denoising, and the new field of Big Data analytics and inference. Ultimately, the project will develop the foundational principles of scientific inference based on compressive Big Data analytics. We investigate methods for efficient data aggregation, compressive analytics, scientific inference, and interactive data-interrogation services, including high-dimensional data visualization, as well as the integration of research and education. Specific applications include neuroimaging-genetics studies of Alzheimer’s disease, predictive modeling of cancer treatment outcomes, and high-throughput data analytics using graphical pipeline workflows.