08 July 2022

Regulation-Compliant Federated Data Processing

Federated data processing has long been a standard model for the virtual integration of disparate data sources, where each source retains a certain amount of autonomy. While early federated technologies resulted from mergers, acquisitions, and specialized corporate applications, the recent demand for decentralized data storage and computation in information marketplaces and for geo-distributed data analytics has made federated data services an indispensable component of the data systems market.

At the same time, growing concerns about data privacy, propelled by regulations across the world, have brought federated data processing under the purview of regulatory bodies.

This series of blog posts will discuss the challenges in building regulation-compliant federated data processing systems and our initiatives at Databloom that strive to make compliance a first-class citizen in our Blossom data platform.

 
Federated Data Processing

Running analytics in a federated environment requires distributed data processing capabilities that (1) provide a unified query interface to analyze distributed and decentralized data, (2) transparently translate a user-specified query into a so-called query execution plan, and (3) execute plan operators across compute sites. Here, a critical component in the processing pipeline is the query optimizer. Typically, a query optimizer considers distributed execution strategies (e.g., distributing query operators such as joins or aggregations across compute nodes) and the communication cost between compute nodes, and it annotates each plan operator with a global property that describes where, i.e., at which site, that operator is processed. For example, a two-way join query over data sources in Asia, Europe, and North America may be executed by first joining the data in North America and Europe and then joining the result with the data in Asia.
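
To make this concrete, below is a minimal, purely illustrative sketch of a site-annotated plan. The classes and dataset names are hypothetical (they are not Blossom's or Wayang's actual API); the point is only that every operator carries an execution-site property, and choosing these annotations together with the join order is exactly the optimizer's job.

  // Purely illustrative sketch (not Blossom's or Wayang's actual API): a plan whose
  // operators carry a "site" property chosen by the optimizer.
  enum Site { ASIA, EUROPE, NORTH_AMERICA }

  record ScanOp(String dataset, Site site) { }
  record JoinOp(Object left, Object right, Site site) { }

  class SiteAnnotatedPlanExample {
      public static void main(String[] args) {
          // The two-way join example from above: first join the North American and
          // European sources (executed in Europe), then join the result with the
          // Asian source (executed in Asia).
          ScanOp na   = new ScanOp("orders_na", Site.NORTH_AMERICA);
          ScanOp eu   = new ScanOp("customers_eu", Site.EUROPE);
          ScanOp asia = new ScanOp("shipments_asia", Site.ASIA);

          JoinOp naEu = new JoinOp(na, eu, Site.EUROPE);   // ships NA data to Europe
          JoinOp full = new JoinOp(naEu, asia, Site.ASIA); // ships the intermediate result to Asia
          System.out.println(full);
      }
  }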

Growing Data Regulations---A New Challenge

As one can notice, federated queries implicitly ship data (i.e., intermediate query results) between compute sites. While several performance aspects, such as bandwidth, latency, communication cost, and compute capabilities, have received great attention, the federated nature of data processing has recently been challenged by data transfer regulations (or policies) that restrict the movement of data across geographical (or institutional) borders, or by any other rules of data protection that may apply to the data being transferred between certain sites.

European directives, for example, allow transferring only certain information fields (or combinations thereof), such as non-personal information or information that cannot be related to a person. Likewise, regulations in Asia may also impose restrictions on data transfer. Non-compliance with such regulatory obligations has attracted fines to the tune of billions of dollars. It is, therefore, crucial to consider compliance with legal requirements when analyzing federated data.

Data Transfer Regulations through the GDPR lens

As of now, most countries around the world have various data protection laws---with the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) being the most prominent---that impose restrictions on how data is stored, processed, and transferred.

Let's take GDPR as an example. GDPR articles 44--50 explicitly deal with the transfer of data across national borders. Among these, there are two articles and one recital wherein the legal requirements for transferring data fundamentally affect federated data processing. 

Article 45: Transfers on the basis of an adequacy decision. 

The article dictates that the transfer of data may take place without any specific authorization, e.g., when there is adequate data protection at the site to which the data is being transferred, or when the data is not subject to regulation (i.e., when it does not fall under the definition of personal data in Article 4(1)).

Article 46: Transfers subject to appropriate safeguards. 

This article prescribes that (when Article 45 does not apply) data transfer can take place under "appropriate safeguards". Based on the European Data Protection Board (EDPB) recommendations on measures that supplement transfer tools, pseudonymisation of data (as defined in Article 4(5)) is considered an effective supplementary measure.

Recital 108: Transfers under measures that compensate lack of data protection.

Data that has been adequately anonymized (i.e., when the resulting data no longer falls under Article 4(1), as described in Recital 26) is outside the ambit of the GDPR and can therefore be transferred.


Depending on the data and on where that data is being transferred, the above regulations can be classified into:

  1. No restrictions on transfer: Some data may be transferred unconditionally, and some only to certain locations.
  2. Conditional restrictions on transfer: For some data, only derived information (such as aggregates), or the data only after anonymization, can be transferred to (certain) locations.
  3. Complete ban on transfer: Some data must not be transferred outside its origin under any circumstances.

Compliance-by-Design: The Challenges



Rather than ad-hoc solutions that make data processing regulation-compliant, what is required is a more holistic approach that provides appropriate safeguards to data controllers (entities that decide what data is processed and how) and data processors (entities that process data on behalf of a controller) within a federated data processing system.

In the context of federated data processing, three aspects must be revisited:

  1. First and foremost, data processing systems must offer declarative policy specification languages, which make it easy for controllers to specify data regulations (see the sketch after this list). Policy specification languages should take into account the type of data, its processing, as well as the location of processing. Regulations may affect the processing of an entire dataset, a subset of it, or even information that is derived from it. Policy specifications must also account for the heterogeneity of data formats (e.g., graph, relational, or textual data).
  2. The second aspect of ensuring that compliance is at the core of federated data processing is integrating legal aspects into query rewriting and optimization. A system must be able to transparently translate user queries into compliant execution plans.
  3. Lastly, federated systems must offer capabilities to decentralize query execution, which the compliant plan may also require. We need query executors that can efficiently orchestrate queries over different platforms across multiple clouds or data silos.
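
As an illustration of the first point, a declarative transfer policy could look roughly like the following sketch. The TransferPolicy record and its fields are purely hypothetical (this is not Blossom's actual policy language); the sketch only shows the kind of information such a specification needs to capture: which attributes of which dataset may leave a location, in what form, and towards which destinations.

  import java.util.List;

  // Hypothetical policy representation; not Blossom's actual policy language.
  record TransferPolicy(String dataset,
                        List<String> attributes,
                        String allowedForm,               // e.g., "RAW", "AGGREGATE_ONLY", "ANONYMIZED"
                        List<String> allowedDestinations) { }

  class PolicyExample {
      public static void main(String[] args) {
          // "Order amounts stored in the EU may only leave the EU as aggregates,
          //  and only towards North America."
          TransferPolicy policy = new TransferPolicy(
                  "orders_eu",
                  List.of("amount", "order_date"),
                  "AGGREGATE_ONLY",
                  List.of("NORTH_AMERICA"));
          System.out.println(policy);
      }
  }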

Summary

Today, regulation-compliant data processing is a major challenge, driven by regulatory bodies across the world. In this blog post, we analyzed data transfer regulations from the perspective of the GDPR and discussed key research challenges for including compliance aspects in federated data processing. Compliance is at the core of our Blossom data platform. In the next blog post, we will discuss how Databloom's Blossom data platform addresses some of the aforementioned challenges and ensures regulation-compliant data processing across multiple clouds, geo-locations, and data platforms.

References

[1] Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Query Processing. SIGMOD Conference 2021: 181-193
[2] Kaustubh Beedkar, David Brekardin, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Compliant Geo-distributed Data Processing in Action. Proc. VLDB Endow. 14(12): 2843-2846 (2021)
[3] Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, Volker Markl: Navigating Compliance with Data Transfers in Federated Data Processing. IEEE Data Eng. Bull. 45(1): 50-61 (2022)

04 July 2022

Internationalization: The challenges of building multilingual web applications

At Databloom, we value diversity. We are a multicultural company with team members from different parts of the world, and we speak a wide variety of languages, such as English, French, German, Greek, Hindi, Korean, and Spanish. Data science teams are just as diverse! For that reason, in Databloom's Blossom Studio, we plan to introduce internationalization and localization features to make our application multilingual.

In that context, we want to discuss several aspects that we found relevant when trying to implement a multilingual application. In addition, we also want to share some resources that we found helpful when putting some of these concepts into practice.

1. Translation Methods

We have two main options: automatic machine translation, where an external service performs the translation for us (e.g., the Google Translate API, Amazon Translate), and human translation, where we manually provide the translated texts.

Generally speaking, it is helpful to use online translation services when the content of the website or application is going to be updated regularly and frequently over time (as with blogs and newspapers). On the other hand, if the website's content will not change substantially, it is preferable to use custom translations, since they give us greater control over the quality and accuracy of the translated content. Regardless of the chosen method, we also need to consider where we'll store the translated texts (frontend or backend) and how we'll access that information when required.

2. Localization Formats

Localization formats refer to the conventions used to represent different aspects of the language in a given country or region. They include topics such as content layout and the formatting of dates and numbers.

  • Content Layout: Different languages may have different writing modes concerning the directionality of their texts. While in the case of Indo-European languages such as English, the writing mode is typically left-to-right (LTR), languages like Arabic and Hebrew have a right-to-left (RTL) script. On the other hand, other scripts, typically of Asian origin, can also be written in a top-to-bottom direction, as is the case with Chinese and Japanese. In this context, our web application should be able to correctly adjust its content layout when displaying another language with a different writing mode, just like in the example shown in Figure 1.

Figure 1. This website (https://www.un.org/) supports different writing modes (RTL, LTR) [1].


A recommendation we can give regarding this point is to apply CSS styles using logical, direction-relative values (start, end) instead of physical ones (left, right). In this way, if the directionality attribute in the markup is dynamically modified (e.g., dir="ltr" to dir="rtl"), it won't be necessary to alter the respective CSS styles.

  • Date and time formats: Some countries may share the same base language but may have different ways of displaying dates, as is the case for the US and the UK. While in the former the date format is MM/DD/YYYY, in the latter it is DD/MM/YYYY.


Figure 2. US and UK date format comparison

  • Number formats: Countries may use distinct approaches to represent their numbers. Not only may they use particular ways to depict each digit (as is the case between Latin languages and Chinese), but they may also employ different characters to separate thousands and decimals. For example, the US uses the comma character (",") as the thousands separator. In contrast, many European and Latin American countries use that same character as the decimal separator.

Figure 3. US and Chile number format comparison


3. User Experience


We must also consider some UX and UI elements in our web application to manage multiple languages appropriately, like the following: 

  • Language Switch: Usually corresponds to a dropdown list or a select element that displays all the languages the web application supports. The user should be able to change the language when clicking on one of the list items.
  • User Navigation: We must decide whether each language will have a specific address displayed to the user. For example, for the English version, you can have the following address: https://www.mycoolwebsite.com/en, while for the German version, you may have this one: https://www.mycoolwebsite.com/de. In the case of single-page applications using Vue.js, this feature is made possible by the Vue Router extension. To simulate that browsing experience, you can create different routes for each selected language.

  • User Preferred Language: A nice-to-have feature is to display the user's preferred language right from the start. We can use browser properties like Navigator.language to capture that information in our application. This read-only property returns a string, in a standard format, representing the user's preferred language in that particular browser [2].
  • Fallback Language: We also should have a fallback language to present to the user in case the preferred language is unavailable or if we don't have translations for a specific language. Typically, the fallback language should be the one your customers use the most.
  • Persistence: Another nice-to-have feature is for choices made by the user to persist over time, even when refreshing the web application. One way to accomplish this is by storing the user's selected language in local storage using the Window.localStorage property [3].

4. Resources

Implementing all these features may seem like a daunting task. Fortunately, there are several internationalization packages that you can use to make things happen, so you don't have to reinvent the wheel. One that we highly recommend for Vue applications is vue-i18n. This plug-in addresses many of the features mentioned above, such as number and date formats, and also covers other aspects we didn't mention here, like noun pluralization. Furthermore, it works well with both the Options API and the Composition API.

Other resources you can use to get started:

References:

26 June 2022

Apache Wayang: More than a Big Data Abstraction

Recently, Paul King (V.P. and Chair of Groovy PMC) highlighted the big data abstraction [1] that Apache Wayang [2] provides. He mainly showed that users specify an application in a logical plan (a Wayang Plan) that is platform agnostic: Apache Wayang, in turn, transforms a logical plan into a set of execution (physical) operators to be executed by specific underlying processing platforms, such as Apache Flink and Apache Spark.

In this post, we elaborate on the cross-platform optimizer that comes with Apache Wayang, which decides how to generate execution plans. When a user specifies an application as a so-called Wayang plan,

25 June 2022

Integrating new plugins into Blossom (Part 1)

Databloom Blossom is a federated data lakehouse analytics framework that provides a solution for federated learning. Blossom supports generating entire distributed ML pipelines, from data extraction to consumption, covering access to data sources, data preparation, feature engineering, and federated learning.

This blog post is part of a series that encourages users to create their own personalized Blossom plugin, which contains custom logical operators and mappings to process these operators on different execution platforms, as well as conversion channels to transform output data types into formats suitable for processing by any available platform.

In this first part, we will explain several Blossom concepts that are necessary to implement new features. Please keep in mind that Blossom is a cross-platform processing framework, so the computations involved are not trivial. We will go deep into the abstractions that support the integration of technologies and help Blossom optimize operators that can run on different platforms and implementations.

Blossom Plan

A Blossom Plan is the base structure for the optimization process of a data analytics pipeline. It contains a set of operators composing some workload to be processed. A Blossom Plan includes methods to prune the search space, to traverse the plan from its sink operators, and to apply transformations over operators to obtain different possible implementations.

Operators

Blossom supports operators designed to run following different processing models, from relational database operators to Flink stream processing operators. The framework that provides this interoperability works through several layers of abstraction to integrate components in a way that is as general as possible. Therefore, developers only need to implement already-defined interfaces.

Operator Interface

The Operator interface describes any node in a Blossom Plan and its principal components. An implementation of this interface must specify a type for the operator, specific configurations to process and optimize it, and methods to manipulate input and output slots, controlling what data this operator, as a unit of processing, will receive and produce.

Some binary operators handle several input sources (e.g., join operators), while replicate operators produce multiple output streams to be processed by different operators. Input and output slots connect two operators, creating a producer-consumer dependency between them.

Input/Output Cardinality Abstract Operator

As suggested in the previous section, different operators require a different number of input/output slots. Source operators require only an output slot because they do not receive any input, and sink operators require only an input slot because they do not transmit results to other operators. For a better classification of operators, Blossom incorporates the UnaryToUnaryOperator, UnarySource, UnarySink, and BinaryToUnaryOperator classes to handle each specific case. Input and output slots are defined by a DatasetType that keeps track of the type and structure of the data transferred between operators through a slot.

Going further with this explanation, let's review the abstract class BinaryToUnaryOperator. It receives three generic types[1] corresponding to the operator's two input types and a single output type. By extending this class, the user can model join, union, and intersect operators.
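
For illustration, the following self-contained sketch models a union-like operator on top of a simplified stand-in for such an abstract class. It is not Blossom's real code (the real BinaryToUnaryOperator has a richer constructor and slot handling); it only shows how the two input types and the single output type line up.

  import java.util.ArrayList;
  import java.util.List;

  // Simplified stand-in for a "two inputs, one output" abstract operator.
  abstract class SimpleBinaryToUnaryOperator<In0, In1, Out> {
      abstract List<Out> apply(List<In0> left, List<In1> right);
  }

  // A union-like operator: both inputs and the output share the same type T.
  class SimpleUnionOperator<T> extends SimpleBinaryToUnaryOperator<T, T, T> {
      @Override
      List<T> apply(List<T> left, List<T> right) {
          List<T> out = new ArrayList<>(left);
          out.addAll(right);
          return out;
      }
  }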


Blossom Operator Classes

Blossom Operator Classes are the actual nodes that compose a BlossomPlan. The purpose of a Blossom operator class is to define a promised functionality that could be implemented on different processing platforms; the Blossom community usually calls these operators platform-independent operators. Blossom Operator Classes do not describe how a specific functionality will be delivered; that is tightly dependent on each underlying platform that Blossom can use to run the operator.

Any operator class that extends an input/output cardinality abstract operator is a Blossom Operator Class. Let's review the CountOperator Blossom Operator Class: CountOperator<Type> extends UnaryToUnaryOperator<Type, Long>, meaning that it receives elements of a generic type and returns a Long value. Therefore, the only restriction on platforms implementing this operator is that, at execution time, a CountOperator will receive a stream of Type elements and, after processing them, must return a single Long value. Any platform that wants to support CountOperator must follow that pattern.



Channels

A Channel in Blossom is the interface that interconnects different sets of operators; in other words, a channel is the glue that connects one operator with another. Imagine an operator "Source" running in Java and reading tuples from a local file; the output of "Source" will be a collection of tuples. In this case, the output channel provided by Source is a Java Collection. A Java collection channel can only be used as input to operators that accept the Java Collection format as input. To allow platforms other than Java to accept the output of Source, it is mandatory to convert this Java Collection into another format.
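
The sketch below illustrates that idea with an assumed helper (not Blossom's actual channel-conversion machinery): it converts a Java collection, i.e., the output channel of the Java Source, into a Flink DataSet that Flink execution operators can consume.

  import java.util.List;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;

  // Illustrative conversion between a Java collection channel and a Flink DataSet.
  class CollectionToFlinkConversion {
      static <T> DataSet<T> convert(List<T> collectionChannelPayload, ExecutionEnvironment env) {
          // Flink can ingest an in-memory Java collection directly.
          return env.fromCollection(collectionChannelPayload);
      }
  }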

Execution Operators

A CountOperator cannot run unless a specific behavior is given. An execution operator implements the procedure that an operator follows for its correct execution on a specific platform. Let's see two examples:

JavaCountOperator<Type> extends CountOperator<Type> and implements the interface JavaExecutionOperator. In the case of the Java platform, the evaluate method gives behavior to this ExecutionOperator; notice from the code extract that the operator uses a Java Collection channel and, after casting the channel to a Collection, uses the standard Collection.size method to get the result.
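
A minimal sketch of that behavior, with an assumed signature (this is not Blossom's actual JavaCountOperator), could look as follows; it only illustrates "cast the collection channel's payload, then call size()".

  import java.util.Collection;

  class JavaCountSketch {
      // Assumed signature: the channel payload carries the upstream output as a Java collection.
      static long evaluate(Object collectionChannelPayload) {
          Collection<?> input = (Collection<?>) collectionChannelPayload;
          return input.size();  // Collection.size() returns an int, widened to long here
      }
  }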

On the other hand, FlinkCountOperator<Type> extends CountOperator<Type> and implements the interface FlinkExecutionOperator. The code extract shows that in this case the channel must be a Flink DataSetChannel, and the operation is also trivial, returning the result of Flink's DataSet.count method.
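
A corresponding hedged sketch for the Flink case (again with an assumed signature, not Blossom's actual FlinkCountOperator):

  import org.apache.flink.api.java.DataSet;

  class FlinkCountSketch {
      static <T> long evaluate(DataSet<T> dataSetChannelPayload) throws Exception {
          // DataSet.count() triggers execution and returns the number of elements.
          return dataSetChannelPayload.count();
      }
  }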

The execution operators of an operator are all the implementation alternatives through which a Blossom operator can be included as part of an executable plan. To decide which alternative is more efficient given a certain optimization goal, Blossom compares estimates of the resources required by the different ExecutionOperators running on the available platforms.

In the next part of this tutorial, we will review how Blossom manages to optimize a plan, what pieces of code must be provided in our plugins to allow this, and a real example of a custom executor that includes a Postgres platform.

References

20 June 2022

The 3 pillars for effectively and efficiently leading remote work teams

We live in a time when distances are no longer a problem. Many communication and meeting-management tools allow us to interact with people from very diverse geographical locations. But although these tools are a great help for contacting our friends and family, the question one might ask is: are they able to coordinate, manage, and monitor our work teams in the same way? Maybe not by themselves.


At Databloom, we have taken on this challenge from a perspective of cultural plurality and the richness that comes from having developers and engineers with diverse skills and knowledge but with a common purpose: achieving organizational goals in time, quality, and innovation.

The foundations for meeting this challenge are diverse, but at least the following three fundamental pillars for managing successful teams should be highlighted.

Pillar 1. Form teams with high standards of quality and commitment:


This is undoubtedly the first step for good management. Human groups, particularly work teams whose members are diverse in age, experience, knowledge, and even personality, must share essential common values. Some of these values are teamwork, generosity and patience in sharing knowledge, and professionalism, coupled with a high sense of commitment and of belonging to the organization.

Our frontend development team at Databloom is fortunate to have highly professional, responsible, collaborative, and creative members. Undoubtedly, each of them has strong skills in programming, application development, technology integration, abstraction, critical thinking, and motivation for innovation. But none of this has real value if it is not accompanied by great human and professional quality, as well as an entrepreneurial spirit and the courage to face new challenges.

Pillar 2. Establish clear and measurable personal, group, and organizational goals.


When each team member has clearly defined tasks, responsibilities, and deadlines, and when everyone is also aware of how each of these responsibilities is part of a larger team objective and how meeting those objectives leads to achieving the goals of the organization, then you achieve the collaborative culture that enables team members to complement each other when executing a task.

No less important is establishing personal and team goals to reach a common standard in handling technologies, programming languages, and knowledge of development best practices.

One of the main challenges we face in our frontend development team is that research and experimentation are fundamental pieces of a good development process.

Pillar 3. The role played by the team leader.


The team leader must be a facilitator, not just a task controller. His first responsibility is to get to know each member of his team deeply. He must know the strengths and weaknesses as well as the potential of each one of his collaborators. He must establish relationships of trust and respect between each of the members and build bridges for collaborative work and collaborative problem-solving.

The team leader must provide fluid, permanent and simple communication channels.

Each member of the team must be sure that their leader is available to help them find a solution or look for alternatives that allow them to overcome the obstacles and doubts that appear in the development process.

If we have the three pillars mentioned above, then whatever tool you use to manage your work meetings, follow-up meetings, or simply conversations to review obstacles and problems (Google Meet, Zoom, WhatsApp, Microsoft Teams, etc.), you have a high probability of managing your work team efficiently, with the certainty that both you and each of the members will feel confident to raise concerns and ask for help whenever necessary:

Help that you will certainly find in a friendly, reliable, and professional environment.

At Databloom, we believe in a collaborative culture, where the role that each member plays is extremely important; it is therefore everyone's duty to be supportive and provide timely help when it is required. We establish regular and focused meetings, set short-term goals, and continuously monitor activities.

We give Databloom developers enough autonomy to manage their time and commitments and, more importantly, the freedom to create and propose their ideas, thereby enriching the final product and the work environment.

15 June 2022

Machine Learning for X and X for Machine Learning

After the recent advances in Artificial Intelligence (AI), and especially in Machine Learning (ML) and Deep Learning (DL)*, various other computer science fields have gone into a race of "blending" their existing methods with ML/DL. There are two directions to enable such a blend: Either using ML to advance the field, or using the methods developed in the field to improve ML. A commonly used slogan when combining ML with a computer science field is: ML for X or X for ML, where X can be, for instance, any of {databases, systems, reasoning, semantic web}. 

In this blog post, we focus on cases where X = big data management. We have been observing works on ML for data management and on data management for ML for several years now. Both directions have a great impact in academia, with dedicated new conferences popping up, as well as in industry, with several companies working on either improving their technology with ML or providing scalable and efficient solutions for ML.

Databloom.ai is one of the first companies to embrace both directions within a single product. Within Databloom's product, Blossom, we utilize ML to improve the optimizer and, thus, provide better performance to users, but we also utilize well-known big data management techniques to speed up federated learning.

 

Databloom.ai is the first to embrace both directions within a single product


Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai

 

ML for Data Management

In a previous blog post we discussed how we plan to incorporate ML into Blossom's optimizer. Having an ML model that predicts the runtime of an execution plan can have several benefits. First, system performance can rocket, as the optimizer is able to find very efficient plans. For example, for k-means, an ML-based optimizer [1] outputs very efficient plans by finding the right platform combination and achieves 7x better runtime performance than a highly tuned cost-based optimizer! Second, the hard part of manually tuning a cost model in an optimizer vanishes: with just collected training data (plans and their runtimes), you can easily train an ML model.


Data Management for ML

In an earlier blog post we discussed how Blossom, a federated data lakehouse analytics framework, can provide an efficient solution for federated learning by following principles from data management. To enable federated learning, we are working towards a parameter server architecture. However, as simplistic parameter servers are very inefficient due to excessive network communication (i.e., a large number of messages sent through the network), we utilize data management techniques such as increasing locality, enabling latency hiding, and exploiting bulk messaging. The database community has already exploited such optimizations and shown that they can lead to more than an order of magnitude faster training times [2].

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.

[2] Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl: NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. SIGMOD Conference 2022: 481-495.

* We use the term ML to refer to both Machine Learning and Deep Learning.
** Icons appearing in the figure were created by Creative Stall Premium - Flaticon

24 May 2022

Thinking of Switching to Vue? A Lightweight but Powerful Framework

Since I started developing user interfaces with GWT, several years have passed, and I have gone through different technologies, such as Angular v1.x and v2.x. Now, arriving at Databloom, I have met Vue (v3.x). I have been lucky to see how browsers have grown to become capable not only of displaying informative content but also of acting as a multipurpose box, offering functionality of all kinds that was previously only available in specialized standalone applications developed in Java, C, VB, etc.
Each of these and other technologies allowed us (developers) to shape various functionalities for the content displayed in the browser, limited only by imagination and the needs of the requirements. However, unlike years ago, we now find ourselves with powerful frontend tools and frameworks, which not only allow us to build web applications but also give us the job (and the choice) of knowing which one to pick according to our needs. In this decision, a series of factors combine to help us choose the frontend technology that best suits our needs and allows us to develop our applications while making the most effective and efficient use of our organisation's resources: the previous experience of the development team, the team's age, the deadlines for our commitments, and fashions, all of which I have encountered in the different companies in which I have worked.

Databloom has opted for a decision based on the relationship between the speed of adopting the technology vs. the cost of learning it.
Considering the richness that the framework offers according to our needs, we decided to adopt Vue 3 to build our frontend.
We have been working with this "new" framework for several months now, and we have found only pleasant surprises in the process of building our application: a lightweight framework with all the power of the "bigger" frameworks. Furthermore,

Vue is powered by many plugins that fill the gaps it has, while offering a much smoother learning curve.

Today, even novice team members with no previous experience in frontend development have a much friendlier experience, allowing them to add value to the product in a much shorter time. In addition, if you come from Angular 2.x, for example, the learning and adaptation will be even more natural, and you will be able to take advantage of everything TypeScript offers in the construction of your components.
I must say that not everything is perfect. There are things to keep in mind about Vue 3 versus other, more structured frameworks that I did not like so much. For instance, when working in the "Composition API" mode, there are variants of it that make the creation of the component (the <script> tag) not as transparent as one would like. Yet, they offer one advantage or another over the cleaner option (<script setup lang="ts">), which is more similar to how the controller layer of a view is built in Angular 2.x, for example. The same goes for the view-model binding process, which forces you to wrap your property in an object (ref, reactive, shallowRef, etc.); at first glance it is not as intuitive as just defining a property in the controller and binding it in the view. Nevertheless, once you get used to these details, you can use them to your advantage, and you will have different options to handle the various scenarios that can arise in the process of building your application.

We have found a powerful ally in Vue when building the face of our system, Blossom (the Blossom Studio).

Using Vue, we hope to develop and offer a robust product that is a useful and effective tool for optimizing the processing of "Big Data".
