15 June 2022

Machine Learning for X and X for Machine Learning

After the recent advances in Artificial Intelligence (AI), and especially in Machine Learning (ML) and Deep Learning (DL)*, various other computer science fields have raced to "blend" their existing methods with ML/DL. Such a blend can go in two directions: either using ML to advance the field, or using the methods developed in the field to improve ML. A common slogan for combining ML with a computer science field is "ML for X or X for ML", where X can be, for instance, any of {databases, systems, reasoning, semantic web}.

In this blog post, we focus on the case where X = big data management. Work on ML for data management and on data management for ML has been appearing for several years now. Both directions have had great impact in academia, with dedicated new conferences popping up, and in industry, with several companies either improving their technology with ML or providing scalable and efficient solutions for ML.

Databloom.ai is one of the first companies to embrace both directions within a single product. Within Databloom's product, Blossom, we use ML to improve the optimizer and thus provide better performance to users, and we also use well-known big data management techniques to speed up federated learning.



Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai


ML for Data Management

In a previous blog post we discussed how we plan to incorporate ML into Blossom's optimizer. Having an ML model predict the runtime of an execution plan has several benefits. First, system performance can improve substantially because the optimizer is able to find very efficient plans. For example, for k-means, an ML-based optimizer [1] produces very efficient plans by finding the right platform combination and achieves 7x better runtime performance than a highly tuned cost-based optimizer. Second, the hard part of manually tuning the optimizer's cost model disappears: once training data (plans and their runtimes) has been collected, an ML model can be trained on it.
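To make the idea concrete, here is a minimal sketch of a learned runtime predictor steering plan selection. It is not Blossom's optimizer and not the model from [1]: the plan features, the toy training data, and the names `featurize`, `RuntimePredictor`, and `choose_plan` are all assumptions made for illustration, and a simple k-nearest-neighbour regressor stands in for whatever model a real system would train.

```python
# Illustrative sketch only: every name and number here is hypothetical.
from math import dist


def featurize(plan):
    """Encode a plan as (input size in GB, operators on platform A, operators on platform B)."""
    return (plan["input_gb"], plan["ops_on_a"], plan["ops_on_b"])


class RuntimePredictor:
    """k-nearest-neighbour regressor over plan features (stand-in for a real ML model)."""

    def __init__(self, k=2):
        self.k = k
        self.samples = []

    def fit(self, plans, runtimes):
        # Training data is just (plan, observed runtime) pairs; no cost formulas to tune.
        self.samples = [(featurize(p), r) for p, r in zip(plans, runtimes)]
        return self

    def predict(self, plan):
        x = featurize(plan)
        nearest = sorted(self.samples, key=lambda s: dist(s[0], x))[: self.k]
        return sum(runtime for _, runtime in nearest) / len(nearest)


def choose_plan(model, candidates):
    """Pick the candidate plan with the lowest predicted runtime."""
    return min(candidates, key=model.predict)


# Toy execution log: plans that were run and their runtimes in seconds.
training_plans = [
    {"input_gb": 10, "ops_on_a": 4, "ops_on_b": 0},
    {"input_gb": 10, "ops_on_a": 0, "ops_on_b": 4},
    {"input_gb": 10, "ops_on_a": 2, "ops_on_b": 2},
    {"input_gb": 13, "ops_on_a": 2, "ops_on_b": 2},
]
training_runtimes = [120.0, 95.0, 40.0, 52.0]

model = RuntimePredictor(k=2).fit(training_plans, training_runtimes)

# Two candidate plans for the same query: single-platform vs. mixed-platform.
candidates = [
    {"input_gb": 11, "ops_on_a": 4, "ops_on_b": 0},
    {"input_gb": 11, "ops_on_a": 2, "ops_on_b": 2},
]
best = choose_plan(model, candidates)
print(best, model.predict(best))
```

In this toy log the mixed-platform plans ran fastest, so the predictor steers the choice towards the mixed-platform candidate. The point of the sketch is the workflow: collect plans and runtimes, fit a model, and rank candidate plans by predicted runtime.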

Data Management for ML

In an earlier blog post we discussed how Blossom, a federated data lakehouse analytics framework, can provide an efficient solution for federated learning by following principles from data management. To enable federated learning, we are working towards a parameter server architecture. However, simplistic parameter servers are very inefficient due to excessive network communication (i.e., a large number of messages sent through the network), so we use data management techniques such as increasing locality, enabling latency hiding, and exploiting bulk messaging. The database community has already exploited such optimizations and shown that they can lead to training times that are more than an order of magnitude faster [2].
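As a rough illustration of one of these techniques, bulk messaging, the sketch below compares a worker that sends every parameter update as its own message with one that buffers updates locally and ships them in batches. It is a toy, single-process model rather than Blossom's or NuPS's implementation: the classes `ParameterServer` and `BufferingWorker`, the `flush_every` parameter, and the message counter are assumptions made for illustration, and locality and latency hiding are not modelled.

```python
# Illustrative sketch only: every name here is hypothetical.
from collections import defaultdict


class ParameterServer:
    """Holds the model parameters and counts how many messages it receives."""

    def __init__(self):
        self.params = defaultdict(float)
        self.messages_received = 0

    def push(self, updates):
        """Apply a dict of {key: delta}; each call models one network message."""
        self.messages_received += 1
        for key, delta in updates.items():
            self.params[key] += delta


class BufferingWorker:
    """Aggregates updates locally and sends them to the server in bulk."""

    def __init__(self, server, flush_every):
        self.server = server
        self.flush_every = flush_every
        self.buffer = defaultdict(float)
        self.pending = 0

    def update(self, key, delta):
        # Updates to the same key are summed locally before being sent.
        self.buffer[key] += delta
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        if self.pending:
            self.server.push(dict(self.buffer))
            self.buffer.clear()
            self.pending = 0


def run(flush_every, updates):
    server = ParameterServer()
    worker = BufferingWorker(server, flush_every)
    for key, delta in updates:
        worker.update(key, delta)
    worker.flush()  # send whatever is still buffered
    return server


# 100 gradient-style updates spread over 4 parameters.
updates = [(f"w{i % 4}", 0.5) for i in range(100)]

naive = run(flush_every=1, updates=updates)   # one message per update
bulk = run(flush_every=25, updates=updates)   # one message per 25 updates

print(naive.messages_received, bulk.messages_received)
```

Both runs end with the same parameter values, but the buffered worker sends far fewer messages. A real system also has to decide how stale a buffered update may become, which this sketch ignores.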

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.

[2] Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl: NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. SIGMOD Conference 2022: 481-495.

* We use the term ML to refer to both Machine Learning and Deep Learning.
** Icons appearing in the figure were created by Creative Stall Premium - Flaticon





