
Integrating new plugins into Blossom (Part 1)

Databloom Blossom is a federated data lakehouse analytics framework that provides a solution for federated learning. Blossom supports the entire generation of distributed ML pipelines, from data extraction to consumption, covering access to data sources, data preparation, feature engineering, and Federated Learning.

This post is part of a series that walks users through creating their own Blossom plugin, which contains custom logical operators, mappings that tell Blossom how to process these operators on different execution platforms, and conversion channels that transform output data types into formats suitable for processing by any available platform.

In this first part, we will explain several Blossom concepts necessary to implement new features. Keep in mind that Blossom is a cross-platform processing framework, so the computations involved are not trivial. We will go deep into the abstractions that support the integration of technologies and that help Blossom optimize operators that can run on different platforms and implementations.

Blossom Plan

A Blossom Plan is the base structure for the optimization of a data analytics pipeline. It contains the set of operators that compose a workload to be processed, and it provides methods to prune the search space, traverse the plan starting from its sink operators, and apply transformations to operators to obtain different possible implementations.
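To make the traversal idea concrete, here is a minimal sketch, under the assumption of a simple operator graph; the class and method names below are illustrative, not the real Blossom API. The plan keeps its sink operators and walks upstream toward the sources, visiting each operator once.

```java
import java.util.*;

// Illustrative sketch (not the real Blossom API): a plan holds its sink
// operators and traverses upstream, from sinks toward sources.
class PlanSketch {
    static class Op {
        final String name;
        final List<Op> inputs = new ArrayList<>();
        Op(String name) { this.name = name; }
        Op from(Op producer) { inputs.add(producer); return this; }
    }

    final List<Op> sinks = new ArrayList<>();

    // Depth-first traversal starting at the sinks; visits each operator once.
    List<String> traverseFromSinks() {
        List<String> order = new ArrayList<>();
        Deque<Op> stack = new ArrayDeque<>(sinks);
        Set<Op> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            Op op = stack.pop();
            if (!seen.add(op)) continue;
            order.add(op.name);
            stack.addAll(op.inputs);
        }
        return order;
    }
}
```

Starting from the sinks is what lets the optimizer see every operator that actually contributes to a result; operators not reachable from any sink never appear in the traversal.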

Operators

Blossom supports operators designed for different processing models, from relational database operators to Flink stream-processing operators. The framework provides this interoperability through several layers of abstraction that keep components as general as possible; as a result, developers only need to implement already-defined interfaces.

Operator Interface

The Operator interface describes any node in a Blossom Plan and its principal components. An implementation of this interface must specify a type for the operator, the configuration used to process and optimize it, and methods to manipulate its input and output slots, which control what data the operator, as a unit of processing, will receive and produce.

Some operators handle several input sources, e.g., binary operators such as joins, while others, such as replicate operators, produce multiple output streams to be processed by different operators. An input slot and an output slot connect two operators, creating a producer-consumer dependency between them.
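The producer-consumer wiring can be sketched as follows; this is a hypothetical, simplified slot model for illustration, not Blossom's actual classes. Connecting an output to an input slot records the dependency on both sides.

```java
import java.util.*;

// Hypothetical, simplified slot model: wiring a producer's output to one
// input slot of a consumer records a producer-consumer dependency.
class SlotSketch {
    static class Operator {
        final String name;
        final Operator[] inputProducers;          // one entry per input slot
        final List<Operator> consumers = new ArrayList<>();
        Operator(String name, int numInputSlots) {
            this.name = name;
            this.inputProducers = new Operator[numInputSlots];
        }
    }

    // Connect the producer's output to one input slot of the consumer.
    static void connect(Operator producer, Operator consumer, int inputSlot) {
        consumer.inputProducers[inputSlot] = producer;
        producer.consumers.add(consumer);
    }
}
```

A join, with its two input slots, simply gets two `connect` calls, one per upstream operator.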

Input/Output Cardinality Abstract Operator

As suggested in the previous section, different operators require different numbers of input and output slots. Source operators require only an output slot because they do not receive any input, and sink operators require only an input slot because they do not transmit results to other operators. To classify operators cleanly, Blossom provides the UnaryToUnaryOperator, UnarySource, UnarySink, and BinaryToUnaryOperator classes to handle each specific case. Each input and output slot is defined by a DatasetType, which keeps track of the type and structure of the data transferred between operators through that slot.

Going further, let's review the abstract class BinaryToUnaryOperator. It receives three generic type parameters, corresponding to the operator's two input types and its single output type. By extending this class, the user can model Join, Union, and Intersect operators.
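The shape of that hierarchy can be sketched like this; the names follow the text, but the bodies are illustrative stand-ins rather than the real Blossom classes. A join fixes the three type parameters: two input types and one output type.

```java
import java.util.Map;

// Sketch of the cardinality hierarchy: two input slots, one output slot.
// Bodies are illustrative stand-ins, not the real Blossom classes.
abstract class BinaryToUnaryOperatorSketch<In0, In1, Out> {
    int numInputSlots()  { return 2; }
    int numOutputSlots() { return 1; }
}

// A join can be modeled by fixing the three type parameters: it consumes
// L and R elements and produces (L, R) pairs.
class JoinSketch<L, R> extends BinaryToUnaryOperatorSketch<L, R, Map.Entry<L, R>> { }
```

A Union over a single element type would instead fix all three parameters to the same type, e.g. `BinaryToUnaryOperatorSketch<T, T, T>`.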


Blossom Operator Classes

Blossom Operator Classes are the actual nodes that compose a BlossomPlan. The purpose of a Blossom operator class is to define a promised functionality that could be implemented on different processing platforms; the Blossom community usually calls these platform-independent operators. A Blossom Operator Class does not describe how a specific functionality will be delivered; that depends entirely on the underlying platform Blossom uses to run the operator.

Any operator class that extends an input/output cardinality abstract operator is a Blossom Operator Class. Let's review the CountOperator class: CountOperator<Type> extends UnaryToUnaryOperator<Type, Long>, meaning that it receives a generic type and returns a Long value. Therefore, the only restriction on platforms implementing this operator is that, at execution time, a CountOperator will receive a stream of Type elements and, after processing them, must return a single Long value. Any platform that wants to support CountOperator must follow that pattern.



Channels

A Channel in Blossom is the interface that interconnects different sets of operators; in other words, a channel is the glue that connects one operator to another. Imagine an operator "Source" running in Java and reading tuples from a local file; the output of "Source" will be a collection of tuples. In this case, the output channel provided by Source is a Java Collection. A Java Collection channel can only be used as input to operators that accept the Java Collection format. To allow platforms other than Java to accept the output of Source, this Java Collection must be converted into another format.
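A conversion between channel formats can be sketched as below; the class and method names are hypothetical, chosen only to mirror the Source example above. A Java Collection channel is re-packaged as a Stream channel so that a consumer expecting streams can read the same tuples.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a conversion channel: the Java "Source" exposes
// its output as a Collection; a converter re-packages it as a Stream so a
// consumer that only accepts streams can process it.
class ChannelSketch {
    // The payload a Java Source operator exposes after execution.
    static Collection<String> javaCollectionChannel() {
        return List.of("t1", "t2", "t3");
    }

    // Conversion: Collection channel -> Stream channel.
    static Stream<String> toStreamChannel(Collection<String> channel) {
        return channel.stream();
    }
}
```

In a real cross-platform plan the target format would typically belong to another engine entirely (for example a Flink DataSet), but the principle is the same: a conversion sits between the producer's channel and the consumer's expected input.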

Execution Operators

A CountOperator cannot run unless a specific behavior is given. An Execution Operator implements the procedure an operator follows to execute correctly on a specific platform. Let's look at two examples:

JavaCountOperator<Type> extends CountOperator<Type> and implements the interface JavaExecutionOperator. On the Java platform, the evaluate method gives behavior to this ExecutionOperator; notice from the code extract that the operator uses a Java Collection channel and, after casting the channel to a Collection, uses the standard Collection.size method to get the result.

On the other hand, FlinkCountOperator<Type> extends CountOperator<Type> and implements the interface FlinkExecutionOperator. The code extract shows that in this case the channel must be a Flink DataSetChannel, and the operation is equally trivial: it returns the result of Flink's DataSet.count method.
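The pattern behind both examples can be condensed into a sketch; the class bodies below are simplified stand-ins (the real Blossom and Flink operators manage channels, slots, and configuration omitted here). One platform-independent CountOperator promises the functionality, and a per-platform execution operator supplies the behavior.

```java
import java.util.*;

// Simplified stand-ins for the classes discussed above: a platform-
// independent CountOperator plus one execution operator per platform.
abstract class UnaryToUnaryOperatorSketch<In, Out> { }

// Promises "count the inputs, return a Long" without saying how.
abstract class CountOperatorSketch<T> extends UnaryToUnaryOperatorSketch<T, Long> {
    abstract long evaluate(Object channelPayload);
}

// Java platform: the channel carries a java.util.Collection, so counting
// is just Collection.size().
class JavaCountOperatorSketch<T> extends CountOperatorSketch<T> {
    @Override
    long evaluate(Object channelPayload) {
        return ((Collection<?>) channelPayload).size();
    }
}
```

A hypothetical FlinkCountOperatorSketch would follow the same shape, delegating to Flink's DataSet.count instead of Collection.size.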

The execution operators of a Blossom operator are all the implementation alternatives by which it can be included in an executable plan. To decide which alternative is more efficient for a given optimization goal, Blossom compares estimates of the resources required by the different ExecutionOperators running on the available platforms.
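As a toy version of that selection step, assuming cost estimates are already available per alternative, the choice reduces to picking the cheapest one. Real Blossom derives these estimates from cardinalities and platform resource models; the numbers in the usage below are made up for illustration.

```java
import java.util.*;

// Toy selection step: given an estimated cost per execution alternative,
// pick the cheapest. (Blossom's optimizer computes such estimates from
// cardinality and resource models; here they are given directly.)
class AlternativeChooser {
    static String cheapest(Map<String, Double> estimatedCosts) {
        return Collections.min(estimatedCosts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

For example, with hypothetical costs `{"JavaCountOperator": 5.0, "FlinkCountOperator": 2.5}`, the Flink alternative would be selected.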

In the next part of this tutorial, we will review how Blossom optimizes a plan, what pieces of code our plugins must provide to allow this, and a real example of a custom executor that adds a Postgres platform.

