12 May 2022

Towards a Learning-based Query Optimizer

Query optimization is at the core of any data management/analytics system. It is the process of determining the best way to execute an input query or task (i.e., the execution plan). Query optimization is composed of three sub-processes: (i) the enumeration of the different execution plans, (ii) the cost estimation of each subplan, required to determine which one is the best, and (iii) the cardinality estimation of subplans (i.e., how many tuples a subplan will output), which is crucial because it affects the cost of the plan. Recent research in the field of data management has begun to leverage the power of machine learning (ML) to solve these tasks more effectively and efficiently. In this blog post, we will focus on using ML for estimating the cost of subplans.

Traditional optimizers come with a cost model, that is, a set of mathematical formulas that encode the cost of each operator and aggregate these costs to estimate the cost of a query plan. However, coming up with a cost model in a federated setting, such as the one Blossom is built for, is not only very challenging but may also lead to suboptimal performance. There are several reasons for that: (i) cost-based optimizers assume linear functions, which do not capture the real system behaviour, (ii) they require access to statistics stored on the different platforms, which may not be possible, and (iii) they need fine-tuning to really model the system behaviour, which is very important yet can be very time-consuming. The plot below shows up to an order of magnitude better performance with a well-tuned cost model.

30 March 2022

The Missing Piece in ML-based Query Optimization

Machine Learning (ML) has not only become omnipresent in our everyday lives (with self-driving cars, digital personal assistants, chatbots etc.) but has also started spreading to our core technological systems, such as databases and operating systems. In the area of databases, there is a large body of work aiming at optimizing data management components, from index building and knob tuning to query optimization. In query optimization alone, ML is used in place of many optimizer components, such as cardinality estimation, the cost model, and join enumeration. In this blog post, we focus on the case of using an ML model in place of a cost model and go from traditional cost-based query optimization to the newly proposed ML-based query optimization.


ML-based query optimization

Generally speaking, given a user query, a query optimizer finds the best way to execute this query (i.e., the execution plan) so that the runtime is minimized. For example, Databloom Blossom's query optimizer is responsible for figuring out which platform combination is best for executing a query so that the runtime of the given query is as low as possible. To do so, traditional cost-based optimizers use a cost model (a set of mathematical formulas) that captures the cost of executing a plan and an enumeration algorithm that searches through different plans until it finds the one with the smallest cost. Going from a cost-based query optimizer to an ML-based one, the cost model is replaced with an ML model (usually a regression model) that outputs an estimate of the runtime of a plan [1]. The enumeration algorithm then simply invokes the ML model while searching through different plans to determine the one with the lowest estimated runtime.
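To make the interplay between enumeration and the learned model concrete, here is a minimal sketch in Python. It is not Blossom's optimizer: the platform names, the execution logs, and the one-feature regression (runtime as a linear function of input cardinality, fitted per platform) are all assumptions made for illustration, and the cost of moving data between platforms is ignored.

```python
from itertools import product


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical execution logs: platform -> [(input tuples, runtime in seconds)]
LOGS = {
    "spark": [(1_000, 5.0), (100_000, 6.0), (1_000_000, 15.0)],
    "postgres": [(1_000, 0.1), (100_000, 10.0), (1_000_000, 100.0)],
}

# "Training": one regression model per platform, learned from the logs.
MODEL = {platform: fit_line(*zip(*rows)) for platform, rows in LOGS.items()}


def predict_runtime(plan):
    """Estimate the runtime of a plan given as [(platform, input cardinality), ...]."""
    total = 0.0
    for platform, cardinality in plan:
        slope, intercept = MODEL[platform]
        total += slope * cardinality + intercept
    return total


def enumerate_plans(cardinalities):
    """Yield every assignment of a platform to each operator (exhaustive search)."""
    for assignment in product(MODEL, repeat=len(cardinalities)):
        yield list(zip(assignment, cardinalities))


def best_plan(cardinalities):
    """Pick the enumerated plan with the lowest predicted runtime."""
    return min(enumerate_plans(cardinalities), key=predict_runtime)


if __name__ == "__main__":
    # Two operators: one on a small input, one on a large input.
    print(best_plan([1_000, 1_000_000]))
```

The enumeration here is exhaustive, which only works for tiny plans; a real optimizer would prune the search space. The point of the sketch is only that the enumerator treats the model as a black box that maps a plan to a runtime estimate.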


The hurdle of training data collection

It is well known by now that ML models can only be as good as their training data. The same holds in ML-based query optimization: 

17 March 2022

Challenges and opportunities towards AI solutions adoption

Artificial intelligence solutions have been continuously revolutionizing the industry over the last decades. The benefits delivered by these technologies are numerous and diverse; among others: the capacity to improve work efficiency, analyze big datasets, automate infrastructure for easy scaling, and enhance customer experience. Nowadays companies are challenging themselves to obtain benefits from these technologies, even enabling whole organizational transformations that boost the capacity of the companies from their core.

The mere existence of these opportunities implies risk. A company's competitors can and will eventually exploit the capabilities of AI solutions, gaining significant competitive advantages. This creates pressure to implement AI strategies quickly, often with architectural deficiencies rooted in general misconceptions among business people and AI specialists.

This blog post explores two important problems to consider when integrating AI solutions with business strategies, with the aim of providing solutions for effective future adoption.

1. Implementing AI-based solutions without clarifying the problem or the business objectives: AI is a wide collection of technologies whose correct usage in the right situation can effectively help an organization achieve strategic goals. One of the most common problems is the adoption of AI tools to "have as many AI applications across our enterprise solution", and then sell ourselves as "the AI-powered solution".

This integration of AI technologies cannot be developed in random order. There is evidence of an important correlation between the value gained with AI solutions and how integrated these initiatives are with the general digital strategy [1]. This pattern suggests that the impact of AI must be planned with respect to the entire business model, not in single, narrowly scoped projects. In this way, executives are in a better position to appreciate and plan new paths to transform their business model using AI methods.

2. Lack of knowledge in work units: This problem has two sides. On the one hand, there is a well-known lack of highly qualified AI professionals. Usually, companies try to mitigate this problem by outsourcing these professionals. Nevertheless, this practice only works for particular projects with specific objectives, and the adoption of the final solution among business areas is often unclear. On the other hand, the second problem lies on the consumption side of the technology. It is crucial that business managers develop a genuine interest and learn to cherish the disruptive capabilities that AI solutions can bring to business problems. In general, companies will never be able to exploit the benefits of AI if middle management does not understand how technical solutions can improve the quality, efficiency, response time and resilience of the service their organizational area offers.

From the last points we can state that, for meaningful adoption of AI technologies, a reasonable way to proceed is the creation of AI teams, each of them tightly aligned with a business area that has particular problems to be solved. In this scenario, the communication between AI specialists and management will be reflected in the quality of the final solution: the business side needs to convey its domain goals, and the AI specialists need to convey the requirements to co-develop, calibrate and tailor AI solutions.

In an effort to make this communication easy, Databloom is preparing a web system called Blossom Studio that supports the generation of ML pipelines, from data extraction to final consumption. The interface provides a simplified overview of the operators and plans the organization has built; then, in an intuitive way, the interface allows users to work with these assets and generate new transformation operators. Using the editor, the user is able to concatenate all these resources into new, parameterizable AI solutions without the need for development skills, just general AI knowledge.

All objects in the editor are customizable and can be annotated with all the relevant information business users need to tailor these solutions to new needs. The work of different work units can be easily integrated, and the preparation of multi-department assets can be split among business areas as separate operators that can be merged into a single one.

As a company, we expect that these features will facilitate the elaboration of intricate execution plans, making collaboration on the implementation of AI solutions more natural for today's work environments.


[1] https://sloanreview.mit.edu/projects/winning-with-ai/

16 March 2022

A decade after ‘data is the new soil’, it’s now the time for Blossom

When the Internet started, people were hard-pressed to find a name for using the Internet. You have probably heard about "surfing / drowning / diving on the internet". Those three concepts referred to the capacity we had to explore and use the Internet. Each step forward on the Internet meant being hit by tons of data; at that moment, we also started coining metaphors for data.

Using metaphors provides an easy way to compare two concepts or objects, one abstract and complex, and the other familiar; they need to share some characteristics because the familiar concept will provide a way to understand the abstract. Nevertheless, the abstract concept will have many familiar representations hiding some attributes as it becomes more complex. 

When people started talking about "Data is the new oil" in 2006, they were referring to data as the source of a kind of energy that moves the world into a new era. And, if so, data also creates byproducts for those who work it. Those byproducts are often called new digital business models.
In essence, the problem is that defining your data as oil doesn't tell you anything about maintenance, re-usage, exploration, quality or storage. Nor does it tell you whether the user is really creating new digitally driven business models; it's just a massive amount of data. Like oil.

Because of the shortcomings of the slogan “Data is the new oil”, one exciting approach appeared in 2010 [1] with the new slogan "Data is the new soil". The slogan was presented in David McCandless's TED Talk about the data economy and industry, in which David said:

Data is the new soil, because for me, it feels like a fertile, creative medium. Over the years, online, we've laid down a huge amount of information and data, and we irrigate it with networks and connectivity, and it's been worked and tilled by unpaid workers and governments.

The re-definition of the data economy as the new soil slept under the ever-flowing information of the Internet for almost a decade; why did it not reach our ears then? Meanwhile, during the last decade, many other metaphors appeared, like "Data is new air", "Data is the new gold", "Data is the new ocean", and many others. We classify data as a natural resource, byproduct, industrial product, market, liquid, trendy or part of the body. Each of these metaphors has a particular meaning and is better than the others from some perspectives and worse from others.

With data sovereignty concepts and data protection laws becoming more and more present, and with new ways of capturing, generating, and processing data, such as machine learning techniques like federated learning, many players will need to think about how they manipulate data. However, in the current situation, coordination and communication across the actors will only be possible when each member has the same abstract concept of data in mind, leaving the metaphors behind and using only mature concepts.

Metaphors are good as a first approach to help understand the importance of data in the transition to a data-driven economy, and therefore society. However, we are a decade away from “Data is the new soil”, and it is time to switch our minds and stop searching for the fitting metaphor for data. Instead, we need to start to blossom a complete concept for the decentralization of data in the upcoming data world.

What is your definition of data? Do we still need data metaphors? We, at Databloom, believe that one needs a data analytics framework where one can fit one's own definition of data. The data analytics framework should be flexible enough to adapt to the application needs. This is our driving goal in our development of Blossom.


[1] https://archives.cjr.org/the_news_frontier/data_is_the_new_soil.php

[2] http://dismagazine.com/discussion/73298/sara-m-watson-metaphors-of-big-data/

[3] https://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en


28 February 2022

Poisoning attacks in Federated Learning

Federated learning is a double-edged sword in that it is designed to ensure data privacy, yet unfortunately, it opens a door for adversaries to exploit the system easily. One of the popular attack vectors is a poisoning attack.

What is a poisoning attack?

A poisoning attack aims to degrade machine learning models. Poisoning attacks can be classified into two categories: data poisoning and model poisoning attacks.

Data poisoning attacks aim to contaminate the training data to indirectly degrade the performance of machine learning models [1]. Data poisoning attacks can be broadly classified into two categories: (1) label flipping attacks, in which an attacker "flips" labels of training data [2], and (2) backdoor attacks, in which an attacker injects new or manipulated training data, resulting in misclassification at inference time [3]. An attacker may perform global or targeted data poisoning attacks. Targeted attacks are harder to detect because they manipulate only a specific class and leave the data for other classes intact.
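As an illustration of a targeted label flipping attack, the following Python sketch relabels a fraction of one class in a malicious client's local dataset and leaves all other classes untouched. The function name, the dataset layout (a list of (features, label) pairs), and the parameters are hypothetical and not taken from the cited papers.

```python
import random


def flip_labels(dataset, source, target, fraction, seed=0):
    """Return a copy of `dataset` in which `fraction` of the samples
    labelled `source` are relabelled as `target` (targeted label flipping)."""
    rng = random.Random(seed)
    # Indices of the samples that belong to the attacked class.
    candidates = [i for i, (_, label) in enumerate(dataset) if label == source]
    chosen = set(rng.sample(candidates, int(len(candidates) * fraction)))
    return [
        (features, target if i in chosen else label)
        for i, (features, label) in enumerate(dataset)
    ]


if __name__ == "__main__":
    # Toy local dataset: ten samples of class 1 and ten of class 7.
    clean = [([float(i)], 1) for i in range(10)] + [([float(i)], 7) for i in range(10)]
    poisoned = flip_labels(clean, source=1, target=7, fraction=0.5)
    print(sum(1 for _, label in poisoned if label == 1), "samples of class 1 remain")
```

Because the features are unchanged and only one class is affected, the poisoned dataset looks statistically similar to the clean one, which is what makes this kind of attack hard to spot.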

On the other hand, a model poisoning attack aims to directly manipulate local models to compromise the accuracy of global models. Model poisoning attacks can be performed as untargeted or targeted attacks similar to data poisoning attacks. An untargeted attack aims to degrade the overall performance and achieve a denial of service. A targeted attack is a more sophisticated way to corrupt model updates for subtasks while maintaining high accuracy on global objectives [1].

A federated learning system is inherently vulnerable to poisoning attacks because the central server has no access to the local data and models of individual participants. Targeted attacks make the problem worse, as it is extremely hard for a central server to identify attacks in models that retain high accuracy on global objectives.

How can we defend?

Approaches to defend against poisoning attacks can be classified into two categories: (1) Robust Aggregation and (2) Anomaly Detection.

A typical aggregation method in a federated learning system is averaging the local models to obtain a global model. For example, in FedSGD [4], each client computes gradients in each training round, sends them to a central server, and the server averages the gradients. For better efficiency, in FedAvg [5], each client computes the gradient and updates the model for multiple batches, sends the model parameters to the server, and the server averages the model parameters. These average-based approaches are susceptible to poisoning attacks, because a single extreme update can shift the average arbitrarily. The research focus has thus been on how to aggregate the model parameters while minimizing the influence of malicious updates, for example with median aggregation [6], trimmed mean aggregation [6], Krum aggregation [7], or adaptive averaging algorithms [1].
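The difference between plain averaging and robust aggregation can be sketched in a few lines of Python. The sketch below assumes that each client update is a flat list of floats of equal length; the function names are illustrative, the plain average is unweighted (FedAvg as published weights clients by their number of samples), and Krum and adaptive averaging are not shown.

```python
from statistics import mean, median


def average_agg(updates):
    """Coordinate-wise unweighted mean of the client updates."""
    return [mean(column) for column in zip(*updates)]


def median_agg(updates):
    """Coordinate-wise median of the client updates."""
    return [median(column) for column in zip(*updates)]


def trimmed_mean_agg(updates, trim):
    """Coordinate-wise mean after dropping the `trim` smallest and
    `trim` largest values of every coordinate."""
    aggregated = []
    for column in zip(*updates):
        kept = sorted(column)[trim:len(column) - trim]
        aggregated.append(mean(kept))
    return aggregated


if __name__ == "__main__":
    honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
    poisoned = [100.0, -100.0]  # one malicious client
    updates = honest + [poisoned]
    print("average:     ", average_agg(updates))
    print("median:      ", median_agg(updates))
    print("trimmed mean:", trimmed_mean_agg(updates, trim=1))
```

With one malicious client out of five, the plain average is pulled far away from the honest updates, while the median and the trimmed mean stay close to them. The trimmed mean requires choosing `trim` at least as large as the number of attackers one expects per coordinate.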

A more proactive way to defend against poisoning attacks is filtering malicious updates through anomaly detection. Model updates of malicious clients are often distinguishable from those of honest clients. Clustering-based approaches check model updates at the central server, cluster them, and exclude suspicious clusters from aggregation [8]. Behavior-based defense methods measure differences in model updates between clients and exclude malicious model updates from aggregation [8].
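A very simple filter in the spirit of the behavior-based idea can be sketched as follows. This is not the method of [8]: the distance measure (Euclidean distance to the coordinate-wise median), the cutoff rule, and the `factor` parameter are assumptions chosen to keep the example short.

```python
from math import dist
from statistics import median


def filter_updates(updates, factor=3.0):
    """Keep only the updates whose distance to the coordinate-wise median
    is at most `factor` times the median of all such distances."""
    center = [median(column) for column in zip(*updates)]
    distances = [dist(update, center) for update in updates]
    cutoff = factor * median(distances)
    return [u for u, d in zip(updates, distances) if d <= cutoff]


if __name__ == "__main__":
    honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
    poisoned = [100.0, -100.0]
    kept = filter_updates(honest + [poisoned])
    print(len(kept), "of 5 updates kept")
```

The surviving updates would then be passed to the aggregation step. A filter like this only catches updates that differ visibly from the majority; a targeted attack that stays close to honest updates would pass through it.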


Federated learning has emerged only recently, and thus the research on attacks against it is still in its early stages. To take full advantage of the promising potential of federated learning, a lot of research effort is needed to provide robustness against poisoning attacks.


14 February 2022

Federated Learning (Part II): The Blossom Framework

This is the second post of our Federated Learning (FL) series. In our previous post, we introduced FL as a distributed machine learning (ML) approach where raw data at different workers is not moved out of the workers. We now take a dive into Databloom Blossom, a federated data lakehouse analytics framework, which provides a solution for federated learning.

The research and industry community have already started to provide multiple systems in the arena of federated learning. TensorFlow Federated [1], Flower [2], and OpenFL [3] are just a few examples of such systems. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way using a single system interface. 

What Is the Problem?
Yet, there are still several open problems that have not been tackled by these solutions, such as preserving data privacy, model debugging, reducing wall-clock training times, and reducing the trained model size, all of which are important. Among all of these open problems, there is one of crucial importance: supporting end-to-end pipelines.

Currently, users must have good knowledge of several big data systems to be able to create their end-to-end pipelines. They must know everything from data preparation techniques to ML algorithms. Furthermore, users must also have good coding skills to put all the pieces (systems) together in a single end-to-end pipeline. The FL setting only exacerbates the problem.

Databloom Blossom Overview
Databloom offers Blossom, a Federated Data Lakehouse Analytics platform that helps users build their end-to-end federated pipelines. Blossom covers the entire spectrum of analytics in end-to-end pipelines and executes them in a federated way. In particular, Blossom allows users to focus solely on the logic of their applications, instead of worrying about the system, execution, and deployment details.

Figure 1. Blossom general architecture

Figure 1 illustrates the general architecture of Blossom. Overall, Blossom comes with two simple interfaces for users to develop their pipelines: Python (FedPy) for data scientists and a graphical dashboard (FedUX) for users in general. 

02 February 2022

Encryption with HDFS and kerberos

We at databloom.ai deal with a large ecosystem of big data implementations, most notably HDFS with encryption in flight and at rest. We also see a lot of misconfigurations and want to shed some light on the topic with this technical article. We use plain Apache Hadoop, but the same technical background applies to other distributions like Cloudera.

Encryption of data was and is the hottest topic in terms of data protection and prevention against theft, misuse and manipulation. Hadoop HDFS supports full transparent encryption in transit and at rest [1], based on Kerberos implementations, often used within multiple trusted Kerberos domains.


Hadoop KMS provides a REST API with built-in SPNEGO and HTTPS support, and mostly comes bundled with a pre-configured Apache Tomcat within your preferred Hadoop distribution.
To make encryption transparent for the user and the system, each encrypted zone is associated with a SEZK (single encryption zone key), created through interaction between the NameNode (NN) and the KMS when the zone is defined as an encryption zone. Each file within that zone has its own DEK (data encryption key). This behavior is fully transparent: when a new file is created, the NN directly asks the KMS for a new EDEK (encrypted data encryption key), encrypted with the zone's key, and adds it to the file’s metadata.

How the encryption flow in HDFS with kerberos works:

Figure. Encryption flow in HDFS with Kerberos


When a client wants to read a file in an encrypted zone, the NN provides the EDEK together with a zone key version, and the client asks the KMS to decrypt the EDEK. If the client has permission to read that zone (POSIX), the client uses the returned DEK to read the file. From a DFS node's perspective, the nodes only ever see an encrypted data stream.
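The key handling described above follows an envelope encryption pattern, which can be mimicked with a small toy model in Python. Nothing here is Hadoop code: the classes ToyKMS and ToyNameNode, the helper functions, and the XOR-based toy_cipher are hypothetical stand-ins, the cipher is not secure, and Kerberos authentication and POSIX permission checks are left out entirely.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (NOT secure). Applying it twice with the
    same key restores the original data."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Holds the zone keys; they never leave this object."""

    def __init__(self):
        self._zone_keys = {}

    def create_key(self, name):
        self._zone_keys[name] = secrets.token_bytes(32)

    def generate_edek(self, name):
        dek = secrets.token_bytes(32)                    # fresh per-file DEK
        return toy_cipher(self._zone_keys[name], dek)    # returned only as EDEK

    def decrypt_edek(self, name, edek):
        return toy_cipher(self._zone_keys[name], edek)   # EDEK -> DEK


class ToyNameNode:
    """Stores one EDEK per file as metadata; never sees a DEK."""

    def __init__(self, kms, zone_key_name):
        self.kms = kms
        self.zone_key_name = zone_key_name
        self.edeks = {}

    def create_file(self, path):
        self.edeks[path] = self.kms.generate_edek(self.zone_key_name)
        return self.zone_key_name, self.edeks[path]

    def open_file(self, path):
        return self.zone_key_name, self.edeks[path]


def client_write(nn, kms, path, plaintext):
    key_name, edek = nn.create_file(path)
    dek = kms.decrypt_edek(key_name, edek)
    return toy_cipher(dek, plaintext)   # this ciphertext is what data nodes store


def client_read(nn, kms, path, ciphertext):
    key_name, edek = nn.open_file(path)
    dek = kms.decrypt_edek(key_name, edek)
    return toy_cipher(dek, ciphertext)


if __name__ == "__main__":
    kms = ToyKMS()
    kms.create_key("zone-key")
    nn = ToyNameNode(kms, "zone-key")
    stored = client_write(nn, kms, "/enc_zones/data/a.txt", b"secret payload")
    print(client_read(nn, kms, "/enc_zones/data/a.txt", stored))
```

The sketch shows the separation of duties: the zone key stays inside the KMS, the NN only stores encrypted DEKs, and the storage side only ever receives ciphertext.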

Setup and Use

Hadoop KMS is a cryptographic key management server based on Hadoop's KeyProvider API and was first introduced in Hadoop 2.6.
Enabling KMS in Apache Hadoop takes a few lines of configuration. It is important to know that KMS doesn’t work without a working Kerberos implementation. Additionally, there are more configuration parameters that need to be known, especially in a multi-domain Kerberos environment, depending on the multi-homed setup of a scaled cluster [3].

First, KMS uses the same rule-based mechanism as HDFS when a trusted Kerberos environment is used. That means the same filtering rules that exist in core-site.xml need to be added to kms-site.xml to get the encryption working for all trusted domains.

In these rules, the terms trusted.domain / main.domain are placeholders describing the original and the trusted Kerberos domain. From an administrative standpoint, the usage is straightforward:

hadoop key create KEYNAME #(one time key creation)
hadoop fs -mkdir /enc_zones/data
hdfs crypto -createZone -keyName KEYNAME -path /enc_zones/data
hdfs crypto -listZones

First we create a key, then the directory we want to encrypt in HDFS, and finally we turn that directory into an encryption zone using the key we created. The last command lists the existing zones.
This directory is now only accessible to users who are granted access via HDFS POSIX permissions; others are not able to change or read files. To give superusers the possibility to create backups without decrypting and re-encrypting, a virtual path prefix for distCp (/.reserved/raw) [2] is available. This prefix allows the block-wise copy of encrypted files for backup and DR purposes.

The use of distCp for encrypted zones can cause some mishaps. It is highly recommended to have identical encrypted zones on both sides to avoid problems later. A potential distCp command for encrypted zones could look like:

hadoop distcp -px hdfs://source-cluster-namenode:8020/.reserved/raw/enc_zones/data hdfs://target-cluster-namenode:8020/.reserved/raw/enc_zones/data

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html
[2] https://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html
[3] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
