05 October 2022

The Data Mesh - should you adopt it?

In reality, not every firm is a good fit for implementing a Data Mesh.

=> Larger enterprises that experience uncertainty and change in their operations and environment are the primary target audience for Data Mesh. 

=> Data Mesh is definitely an unnecessary expense if your organization's data requirements are modest and remain constant over time.

What is a "Data Mesh"?

Data Mesh is a strategic approach to modern data management that focuses on delivering useful and safe data products, and a way to support an organization's journey toward digital transformation. Data Mesh's major goal is to move beyond the established centralized data management techniques of data warehouses and data lakes. By giving data producers and data consumers the ability to access and handle data without the hassle of involving the data lake or data warehouse team, Data Mesh emphasizes organizational agility. Its decentralized approach distributes data ownership to domain-specific teams that use, control, and manage data as a product. From Wikipedia, the definition of Data Mesh:
"Data mesh is a socio-technical approach to build a decentralized data architecture by leveraging a domain-oriented, self-serve design (in a software development perspective), and borrows Eric Evans’ theory of domain-driven design and Manuel Pais’ and Matthew Skelton’s theory of team topologies. Data mesh mainly concerns about the data itself, taking the Data Lake and the pipelines as a secondary concern. The main proposition is scaling analytical data by domain-oriented decentralization. With data mesh, the responsibility for analytical data is shifted from the central data team to the domain teams, supported by a data platform team that provides a domain-agnostic data platform."
The four underlying ideas of the Data Mesh concept are as follows:
  • Data as a Product
  • Domain-focused ownership
  • Self-Service Infrastructure for Data
  • Federated Computational Governance
Let's examine the four principles in more detail.

Data as a Product

Data products are created by the domain and consumed by users or downstream domains in order to provide business value. Data products are distinct from conventional data marts in that they are self-contained and responsible for all infrastructure, security, and provenance concerns connected to maintaining the data's accuracy. By establishing a clear line of ownership and responsibility, data products enhance business intelligence and machine learning efforts. They can be used by other data products or by end users directly.

Domain-focused ownership

To comprehend domain-driven data, we must first understand what a domain is. A domain is a group of people gathered around a common practical business goal. According to Data Mesh, the domain should be in charge of managing the data that is related to and generated by its business function. The ingestion, transformation, and delivery of data to end users are the domain's responsibilities. The domain eventually makes its data available as data products, which it owns throughout their entire lifecycle.

Self-Service Infrastructure for Data

For members of the domains to easily develop and maintain their data products, a self-serve data infrastructure must provide a wide range of capabilities. The infrastructure engineering team that supports the self-serve data platform is primarily focused on managing and running the numerous technologies in use. In other words, the domains are concerned with data, while the team working on the self-serve data platform is concerned with technology. The independence of the domains serves as a barometer for the self-serve data platform's performance.

Federated Computational Governance

Conventional data governance can be viewed as a barrier to generating value from data. By integrating governance concerns into the workflow of the domains, Data Mesh makes a new approach possible. Although data governance has many facets, usage metrics and reporting are especially important when talking about Data Mesh. The volume and type of data usage are key data points for determining the worth, and consequently the success, of specific data products.

Business reasons and advantages of Data Mesh

Data Mesh deployment encourages organizational agility for businesses that want to prosper in a volatile economic environment. Every firm must be able to react to environmental changes with a low-cost, high-reward strategy. Changes in regulatory requirements, the need to comply with new analytics requirements, and the introduction of new data sources are all factors that will cause an organization's data management processes to change. In light of these dynamics, current approaches to data management are frequently built on intricate and tightly coupled ETL between operational and analytical systems that struggles to adapt in time to meet business objectives. The goal of Data Mesh is to offer a more adaptable approach to data so that it can react effectively to such changes.

Deep Technology: Required to set up and run a Data Mesh 

Technology capabilities are a crucial enabler for putting a Data Mesh into operation. New technology is probably necessary for a number of reasons:
  1. To provide interoperability between those new technologies, which is crucial for lowering the friction of exploiting them.
  2. To allow domains to function independently and concentrate on data, their primary priority, rather than technology.
  3. To enable the purchase of new data platforms online and the seamless exploitation of the data they expose.
  4. To enable automatic reporting of governance elements throughout the data mesh, including data product usage, standards compliance, and customer feedback.

Participants: From a central data team to decentralized domains

A Data Mesh journey will involve significant organizational changes and modifications to people's responsibilities. Existing employees will be essential to a successful Data Mesh adoption because they bring rich tacit knowledge to the journey. Therefore, a realignment of current data-focused staff, as well as the transfer of data ownership from a central data team to decentralized domains, should be considered. In addition, reward systems and managerial structures may need to change.

Process optimization: Internal organizational changes 

By adopting Data Mesh, the business will need to adjust its internal processes to encourage a resilient and flexible data architecture. If we take data governance into account, new processes for defining, implementing, and enforcing data policies will be needed. These processes will affect how data is accessed, managed, and used in established daily activities and well-known processes. Additionally, a well-designed Data Mesh enables process mining over the whole data lifecycle chain, allowing much more efficient process management and design.

Data Lakes, Data Fabrics and Data Mesh - what is what?

The data lake is a technology approach whose primary goal has traditionally been to serve as a single repository to which data can be moved as easily as possible, with the central team in charge of managing it. While data lakes can provide significant business value, they are not without flaws. The main issue is that once data is moved to the lake, it loses context. For example, we may have several files containing a customer definition, one from a logistics system, one from payments, and one from marketing; which one is correct for my use? Furthermore, because the data in the data lake has not been pre-processed, data issues will inevitably arise. This creates a significant barrier to using the data to address the original business question because the data consumer will typically need to communicate with the data lake team to comprehend and resolve data issues. 

Data Mesh, on the other hand, is more than just technology; it combines both technology and organizational aspects, such as the concepts of data ownership, data quality, and autonomy. As a result, data consumers have a clear line of sight regarding data quality and ownership, and data issues can be discovered and resolved much more efficiently. Data can eventually be used and trusted.

A data fabric focuses on a collection of multiple technology capabilities that work together to create an interface for data consumers. Proponents of data fabric argue for automating many data management chores with technologies such as ML to allow end users to access data more easily. There is some utility in this for simple data use, but for more complicated circumstances, or when business knowledge has to be integrated into the data, the limits of a data fabric become obvious.

That said, a data fabric could be used as part of a Data Mesh self-serve platform, exposing data to domains that can then embed their business expertise into the final data product.

Blossom Sky creates a Data Mesh for your data infrastructure

Organizations that are ready to adopt Data Mesh will want assistance in connecting their data sources in order to achieve a rapid win with Databloom's next-gen Data Mesh. Basically, these two steps need to be taken to bring your enterprise to the next level of future-proof data management:

Connect to the data sources where the data is stored.

The first step in starting your Data Mesh journey is to connect to data sources. A fundamental Data Mesh implementation idea is to connect your data sources by using your existing investments: data lakes or data warehouses; cloud or on-premises; structured warehouses or unstructured lakehouses. In contrast to the single-source-of-truth method of first centralizing all your data, you use and query the data where it sits. This is many of our clients' first Data Mesh win, since our open core and attachable connectors enable customers to connect to data sources such as SQL, text, Big Data, or TensorFlow.
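As a rough illustration of the "query the data where it sits" idea, here is a minimal sketch using the open-source Apache Wayang Java API, which Blossom builds on; it is not Blossom's connector API, and the file path, job name, and word-count workload are placeholders. Wayang reads the source in place, and its optimizer decides whether the Java or Spark platform executes each step.

    import java.util.Arrays;
    import java.util.Collection;

    import org.apache.wayang.api.JavaPlanBuilder;
    import org.apache.wayang.basic.data.Tuple2;
    import org.apache.wayang.core.api.Configuration;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.spark.Spark;

    public class QueryDataWhereItSits {
        public static void main(String[] args) {
            // Register the platforms Wayang may use; the optimizer decides where each step runs.
            WayangContext context = new WayangContext(new Configuration())
                    .withPlugin(Java.basicPlugin())
                    .withPlugin(Spark.basicPlugin());

            // Read a text source in place and count word occurrences, without first
            // centralizing the data into a warehouse or lake.
            Collection<Tuple2<String, Integer>> wordCounts = new JavaPlanBuilder(context)
                    .withJobName("Query data where it sits")       // placeholder job name
                    .withUdfJarOf(QueryDataWhereItSits.class)
                    .readTextFile("file:///path/to/orders.log")    // placeholder source
                    .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                    .map(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Tuple2::getField0,
                            (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
                    .collect();

            wordCounts.forEach(System.out::println);
        }
    }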

Make it possible for teams to create data products. 

After providing a data team with the data they want, the next step is to teach them how to turn data sets into data products, and then to establish a data product library or catalog. Blossom Studio includes a catalog that allows you to quickly search for, find, and identify data products that may be of interest.

Creating data products is a strong competency: you can rapidly produce and deploy data products across the enterprise, enabling your data consumers to move quickly from discovery to ideation and insight.

Building and maintaining a Data Mesh 


The Blossom Development Environment (BDE) or Apache Wayang will be useful for those who are eager to begin, or are just beginning, their Data Mesh journey. In fact, many customers book a Solution Architect with us to assist them with this challenging and rewarding task. With the appropriate plan, it does not tie up much labor and can be low cost and low risk with a great payoff. The goal of a session with our Solution Architects is to determine how Data Mesh will fit into your business from a technology, people, and process standpoint. You'll also be able to assess your strengths and limitations, which will help you, when you're ready to start your Data Mesh transformation program, to curate the learnings, accelerate where you can move swiftly, and plan remedial work where you need it. A session with one of our specialized architects consists of a three-hour consultation in which we discuss:
  • The scope and the choice of use case.
  • Which pre-MVP environments must be established for early design and enablement efforts.
  • How to design, improve, and use data products.
  • And finally, how to embrace the Data Mesh as part of your data strategy.

Book a consultation with us via our contact form; we are here to help you on your journey to a future-ready, data-driven organization.

08 July 2022

Regulation-Compliant Federated Data Processing

Federated data processing has been a standard model for virtual integration of disparate data sources, where each source upholds a certain amount of autonomy. While early federated technologies resulted from mergers, acquisitions, and specialized corporate applications, recent demand for decentralized data storage and computation in information marketplaces and for Geo-distributed data analytics has made federated data services an indispensable component in the data systems market. 

At the same time, growing concerns about data privacy, propelled by regulations across the world, have brought federated data processing under the purview of regulatory bodies.

This series of blog posts will discuss challenges in building regulation-compliant federated data processing systems and our initiatives at Databloom that strive toward making compliance a first-class citizen in our Blossom data platform.

 
Federated Data Processing

Running analytics in a federated environment requires distributed data processing capabilities that (1) provide a unified query interface to analyze distributed and decentralized data, (2) transparently translate a user-specified query into a so-called query execution plan, and (3) can execute plan operators across compute sites. Here, a critical component in the processing pipeline is the query optimizer. Typically, a query optimizer considers distributed execution strategies (involving distributing query operators, such as join or aggregation, across compute nodes) and the communication cost between compute nodes, and introduces a global property that describes where, i.e., at which site, each plan operator is processed. For example, a two-way join query over data sources in Asia, Europe, and North America may be executed by first joining the data in North America and Europe and then joining with the data in Asia.
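To make the optimizer's reasoning concrete, here is a toy sketch in Java with entirely hypothetical table sizes and selectivity. It only illustrates how an optimizer might compare two join orders by the volume of data shipped between sites; it is not an actual optimizer or cost model.

    import java.util.Map;

    // Toy illustration (all numbers hypothetical) of how a federated optimizer might
    // compare join orders by the volume of data shipped between sites.
    public class JoinOrderCost {

        // Estimated size (in GB) of the base data at each site.
        static final Map<String, Double> SIZE_GB = Map.of(
                "NorthAmerica", 40.0, "Europe", 25.0, "Asia", 60.0);

        // Assumed selectivity: the intermediate join result of two sites is a fraction
        // of the smaller input.
        static double joinResultGb(String a, String b) {
            return 0.2 * Math.min(SIZE_GB.get(a), SIZE_GB.get(b));
        }

        // Plan shape: ship `first` to `second`'s site, join there, then ship the
        // intermediate result to the third site for the final join.
        static double shippedGb(String first, String second) {
            return SIZE_GB.get(first) + joinResultGb(first, second);
        }

        public static void main(String[] args) {
            double planA = shippedGb("NorthAmerica", "Europe");  // then join with Asia
            double planB = shippedGb("Asia", "Europe");          // then join with North America
            System.out.printf("Join NA and Europe first, then Asia: %.1f GB shipped%n", planA);
            System.out.printf("Join Asia and Europe first, then NA: %.1f GB shipped%n", planB);
            System.out.println("The optimizer picks the cheaper order: "
                    + (planA <= planB ? "NA-Europe first" : "Asia-Europe first"));
        }
    }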

Growing Data Regulations---A New Challenge

As one can notice, federated queries implicitly ship data (i.e., intermediate query results) between compute sites. While performance aspects such as bandwidth, latency, communication cost, and compute capabilities have received great attention, the federated nature of data processing has recently been challenged by data transfer regulations (or policies) that restrict the movement of data across geographical (or institutional) borders, or by other data protection rules that may apply to the data being transferred between certain sites.

European directives, for example, allow the transfer of only certain information fields (or combinations thereof), such as non-personal information or information not relatable to a person. Likewise, regulations in Asia may also impose restrictions on data transfer. Non-compliance with such regulatory obligations has attracted fines to the tune of billions of dollars. It is, therefore, crucial to consider compliance with legal aspects when analyzing federated data.

Data Transfer Regulations through the GDPR lens

As of now most countries around the world have various data protection laws---with the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) being the most prominent---that impose restrictions on how data is stored, processed, and transferred.

Let's take GDPR as an example. GDPR articles 44--50 explicitly deal with the transfer of data across national borders. Among these, there are two articles and one recital wherein the legal requirements for transferring data fundamentally affect federated data processing. 

Article 45: Transfers on the basis of an adequacy decision. 

The article dictates that a transfer of data may take place without any specific authorization, e.g., when there is adequate data protection at the site to which the data is being transferred, or when the data is not subject to regulation (i.e., when the data does not meet the definition of personal data in Article 4(1)).

Article 46: Transfers subject to appropriate safeguards. 

This article prescribes that (in the absence of applicability of Article 45) data transfer can take place under "appropriate safeguards". Based on the European Data Protection Board (EDPB) recommendations that supplement transfer tools, pseudonymisation of data (as defined under Article 4(5)) is considered as an effective supplementary method.

Recital 108: Transfers under measures that compensate lack of data protection.

Data after adequate anonymization (i.e., when resulting data does not fall under Article 4(1) and as described in Recital 26) does not fall under the ambit of GDPR and therefore can be transferred.


Depending on the data and on where that data is being transferred, the above regulations can be classified as follows (a short code sketch after the list illustrates the idea):

  1. No restrictions on transfer: Some data may be transferred unconditionally, and some only to certain locations.
  2. Conditional restrictions on transfer: For some data, only derived information (such as aggregates), or only anonymized versions, can be transferred to (certain) locations.
  3. Complete ban on transfer: Some data must not be transferred outside its location under any circumstances.
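As an illustration of how such a classification might be captured in code, the following is a hypothetical, simplified policy model with the three categories above; the dataset name, regions, and permission logic are invented for the example and do not correspond to any actual Blossom API (or to legal advice).

    import java.util.Set;

    // Hypothetical, simplified model of the three classes of transfer restrictions.
    public class TransferPolicyExample {

        enum Restriction { UNRESTRICTED, CONDITIONAL, BANNED }

        record TransferPolicy(String dataset, Restriction restriction,
                              Set<String> allowedRegions, boolean requiresAnonymization) {

            boolean permits(String targetRegion, boolean dataIsAnonymized) {
                return switch (restriction) {
                    case UNRESTRICTED -> true;
                    case CONDITIONAL -> allowedRegions.contains(targetRegion)
                            && (!requiresAnonymization || dataIsAnonymized);
                    case BANNED -> false;
                };
            }
        }

        public static void main(String[] args) {
            // Conditional restriction: only an anonymized derivative may leave the
            // source site, and only to the EU.
            TransferPolicy patients = new TransferPolicy(
                    "patients", Restriction.CONDITIONAL, Set.of("EU"), true);

            System.out.println(patients.permits("EU", true));   // true
            System.out.println(patients.permits("US", true));   // false
            System.out.println(patients.permits("EU", false));  // false
        }
    }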

Compliance-by-Design: The Challenges



Rather than ad-hoc solutions to make data processing regulation-compliant, what is required is a more holistic approach that provides appropriate safeguards to data controllers (entities that decide what data is processed and how) and data processors (entities that process data on behalf of a controller) within a federated data processing system.

In the context of federated data processing, three aspects must be revisited:

  1. First and foremost, data processing systems must offer declarative policy specification languages, which make it simple for controllers to specify data regulations. Policy specification languages should take into account the type of data, its processing, and the location of processing. Regulations may affect the processing of an entire dataset, a subset of it, or even information derived from it. Policy specifications must also keep in mind the heterogeneity of data formats (e.g., graph, relational, or textual data).
  2. The second aspect of ensuring that compliance is at the core of federated data processing is integrating legal aspects into query rewriting and optimization. A system must be able to transparently translate user queries into compliant execution plans.
  3. Lastly, federated systems must offer capabilities to decentralize query execution, which the compliant plan may also require. We need query executors that can efficiently orchestrate queries over different platforms across multiple clouds or data silos.

Summary

Today, regulation-compliant data processing is a major challenge, driven by regulatory bodies across the world. In this blog post, we analyzed data transfer regulations from the perspective of GDPR and discussed key research challenges for including compliance aspects in federated data processing. Compliance is at the core of our Blossom data platform. In the next blog post, we will discuss how Databloom's Blossom data platform addresses some of the aforementioned challenges and ensures regulation-compliant data processing across multiple clouds, geo-locations, and data platforms.

04 July 2022

Internationalization: The challenges of building multilingual web applications

At Databloom, we value diversity. We are a multicultural company with team members from different parts of the world, where we speak a wide variety of languages, such as English, French, German, Greek, Hindi, Korean, and Spanish. Data science teams are also so diverse! For that reason, in Databloom's Blossom Studio, we plan to introduce internationalization and localization features to make our application multilingual.

In that context, we want to discuss several aspects that we found relevant when trying to implement a multilingual application. In addition, we want to share some resources that we found helpful when putting some of these concepts into practice.

1. Translation Methods

We have two main options: automatic machine translation, where an external service performs the translation for us (e.g., the Google Translate API, Amazon Translate), and human translation, where we manually provide the translated texts.

Generally speaking, it is helpful to use online translation services when the content of the website or application is going to be updated regularly and frequently over time (as with blogs and newspapers). On the other hand, if the website's content will not change substantially, it is preferable to use custom translations, since they give us greater control over the quality and accuracy of the translated content. Regardless of the chosen method, we also need to consider where we'll store the translated texts (frontend or backend) and how we'll access that information when required.

2. Localization Formats

Localization formats are the formats used to represent different aspects of the language in each locale. They include topics such as content layout and the formatting of dates and numbers.

  • Content Layout: Different languages may have different writing modes concerning the directionality of their texts. While in the case of Indo-European languages such as English, the writing mode is typically left-to-right (LTR), languages like Arabic and Hebrew have a right-to-left (RTL) script. On the other hand, other scripts, typically of Asian origin, can also be written in a top-to-bottom direction, as is the case with Chinese and Japanese. In this context, our web application should be able to correctly adjust its content layout when displaying another language with a different writing mode, just like in the example shown in Figure 1.

Figure 1. This website (https://www.un.org/) supports different writing modes (RTL, LTR) [1].


A recommendation we can give regarding this point is to apply CSS styles using logical positions (start, end) instead of physical ones (left, right). In this way, if the directionality attribute in the markup is dynamically modified (e.g., dir = "ltr" to dir = "rtl"), it won't be necessary to alter the respective CSS styles.

  • Date and time formats: Some countries may share the same base language but have different ways of displaying dates, as is the case for the US and the UK. While the former uses the date format MM/DD/YYYY, the latter uses DD/MM/YYYY.


Figure 2. US and UK date format comparison

  • Number formats: Countries may use distinct approaches to represent their numbers. Not only may they use particular ways to depict each digit (as is the case between Latin languages and Chinese), but they can also employ different characters to separate thousands and decimals. For example, the US uses the comma character (",") as the thousands separator. In contrast, many European and Latin-American countries use that same character as the decimal separator (see the short example after Figure 3).
Figure 3. US and Chile number format comparison
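These rules are locale data rather than application logic, so any locale-aware library reproduces them (in a Vue application, vue-i18n offers similar number and datetime formatting). As a quick, hedged illustration outside the browser, the standard Java locale APIs produce the formats compared above; the sample date and amount are made up.

    import java.text.NumberFormat;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.time.format.FormatStyle;
    import java.util.Locale;

    public class LocaleFormats {
        public static void main(String[] args) {
            // Same date, different day/month order depending on the locale.
            LocalDate date = LocalDate.of(2022, 7, 4);
            DateTimeFormatter shortDate = DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT);
            System.out.println(shortDate.withLocale(Locale.US).format(date));  // e.g. 7/4/22
            System.out.println(shortDate.withLocale(Locale.UK).format(date));  // e.g. 04/07/2022

            // Same number, different thousands and decimal separators.
            double amount = 1234567.89;
            System.out.println(NumberFormat.getNumberInstance(Locale.US).format(amount));                      // 1,234,567.89
            System.out.println(NumberFormat.getNumberInstance(Locale.forLanguageTag("es-CL")).format(amount)); // 1.234.567,89
        }
    }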


3. User Experience


We must also consider some UX and UI elements in our web application to manage multiple languages appropriately, like the following: 

  • Language Switch: Usually corresponds to a dropdown list or a select element that displays all the languages the web application supports. The user should be able to change the language when clicking on one of the list items.
  • User Navigation: We must decide if each language will have a specific address displayed to the user. For example, the English version may live at https://www.mycoolwebsite.com/en, while the German version may live at https://www.mycoolwebsite.com/de. In single-page applications built with Vue.js, this feature is made possible by the Vue Router extension. To simulate that browsing experience, you can create different routes for each selected language.

  • User Preferred Language: A nice-to-have feature is to display the user's preferred language right from the start. We can use browser properties like Navigator.language to capture that information in our application. This read-only property returns a string in a standard format representing the user's preferred language in that particular browser [2].
  • Fallback Language: We also should have a fallback language to present to the user in case the preferred language is unavailable or if we don't have translations for a specific language. Typically, the fallback language should be the one your customers use the most.
  • Persistence: Another nice-to-have feature is that choices made by the user persist over time, even when refreshing the web application. One way to accomplish this is by storing the user's selected language in local storage using the Window.localStorage property [3].

4. Resources

Implementing all these features may seem like a daunting task. Fortunately, there are several internationalization packages that you can use to make things happen, so you don't have to reinvent the wheel. For example, one that we highly recommend for Vue applications is vue-i18n. This plug-in addresses many of the features mentioned above, such as number and date formats, and also considers other attributes we didn't mention here, like noun pluralization. Furthermore, it works well with both Options API and Composition API.

Other resources you can use to get you started:

References:















26 June 2022

Apache Wayang: More than a Big Data Abstraction

Recently, Paul King (V.P. and Chair of the Groovy PMC) highlighted the big data abstraction [1] that Apache Wayang [2] provides. He mainly showed that users specify an application in a logical plan (a Wayang plan) that is platform-agnostic: Apache Wayang, in turn, transforms the logical plan into a set of execution (physical) operators to be executed by specific underlying processing platforms, such as Apache Flink and Apache Spark.
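As a minimal sketch of that platform-agnostic style, based on the public Apache Wayang Java API, the same logical pipeline can be handed to contexts with different registered platforms, and the optimizer maps its operators to the corresponding execution operators. The collection, job name, and plugin combinations are illustrative, and the sketch assumes the respective Wayang platform modules (wayang-java, wayang-spark, wayang-flink) are on the classpath.

    import java.util.Collection;
    import java.util.List;

    import org.apache.wayang.api.JavaPlanBuilder;
    import org.apache.wayang.core.api.WayangContext;
    import org.apache.wayang.flink.Flink;
    import org.apache.wayang.java.Java;
    import org.apache.wayang.spark.Spark;

    public class PlatformAgnosticPlan {

        // The logical pipeline is declared once; it never mentions a platform.
        static Collection<Integer> squaresOfEvens(WayangContext context) {
            return new JavaPlanBuilder(context)
                    .withJobName("Platform-agnostic demo")
                    .loadCollection(List.of(1, 2, 3, 4, 5, 6))
                    .filter(n -> n % 2 == 0)
                    .map(n -> n * n)
                    .collect();
        }

        public static void main(String[] args) {
            // Same logical plan, different sets of available execution platforms:
            // the optimizer maps the operators to Java, Spark, or Flink accordingly.
            WayangContext javaAndSpark = new WayangContext()
                    .withPlugin(Java.basicPlugin()).withPlugin(Spark.basicPlugin());
            WayangContext javaAndFlink = new WayangContext()
                    .withPlugin(Java.basicPlugin()).withPlugin(Flink.basicPlugin());

            System.out.println(squaresOfEvens(javaAndSpark));
            System.out.println(squaresOfEvens(javaAndFlink));
        }
    }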

In this post, we elaborate on the cross-platform optimizer that comes with Apache Wayang, which decides how to generate execution plans. When a user specifies an application in the so-called Wayang plan,

25 June 2022

Integrating new plugins into Blossom (Part 1)

Databloom Blossom is a federated data lakehouse analytics framework that provides a solution for federated learning. Blossom supports the entire generation of distributed ML pipelines from data extraction to consumption, covering access to data sources, data preparation, feature engineering, and Federated Learning.

This blog post is part of a series that encourages users to create their own personalized Blossom plugin, which contains custom logical operators and mappings to process these operators on different execution platforms, as well as conversion channels to transform output data types into formats suitable for processing by any available platform.

This first part explains several Blossom concepts necessary to implement new features. Please keep in mind that Blossom is a cross-platform processing framework, so the computations are not trivial. We will go deep into the abstractions that support the integration of technologies and help Blossom optimize operators that can run on different platforms and implementations.

Blossom Plan

The Blossom Plan is the base structure for the optimization process of a data analytics pipeline. It contains a set of operators composing some workload to be processed. A Blossom Plan includes methods to prune the search space, traverse the plan from its sink operators, and apply transformations over operators to obtain different possible implementations.

Operators

Blossom supports operators designed to run under different processing models, from relational database operators to Flink stream processing operators. The framework that provides this interoperability works through several layers of abstraction to integrate components in as general a way as possible. Therefore, developers only need to implement already-defined interfaces.

Operator Interface

The Operator interface describes any node in a Blossom Plan and its principal components. An implementation of this interface must specify a type for the operator, the configurations used to process and optimize it, and methods to manipulate input and output slots, controlling what data this operator, as a unit of processing, will receive and produce.

Some binary operators handle several input sources, e.g., join operators, while replicate operators produce multiple output streams to be processed by different operators. Input and output slots connect two operators, creating a producer-consumer dependency between them.
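The following is a deliberately simplified, hypothetical sketch of that contract, not the real Blossom/Wayang interface (which carries additional type metadata and optimization hooks); it only shows how wiring an output slot into an input slot creates the producer-consumer dependency:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified sketch: an operator exposes the slots it consumes and produces.
    interface Operator {
        List<InputSlot> inputs();    // slots through which the operator receives data (empty for sources)
        List<OutputSlot> outputs();  // slots through which the operator emits data (empty for sinks)
    }

    // An input slot records which upstream output currently feeds it.
    class InputSlot {
        OutputSlot occupant;
    }

    // An output slot knows every downstream input it feeds.
    class OutputSlot {
        final List<InputSlot> consumers = new ArrayList<>();
    }

    class Slots {
        // Wiring an output slot into an input slot creates the producer-consumer
        // dependency between two operators described above.
        static void connect(OutputSlot producerSide, InputSlot consumerSide) {
            consumerSide.occupant = producerSide;
            producerSide.consumers.add(consumerSide);
        }
    }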

Input/Output Cardinality Abstract Operator

As suggested in the previous section, different Operators require a different number of Input/Output slots. Source Operators require only an Output Slot because they do not receive any Input, and Sink Operators require only an Input Slot because they do not transmit results to other operators. For a better classification of operators, Blossom incorporates UnaryToUnaryOperator, UnarySource, UnarySink, and BinaryToUnaryOperator classes to handle every specific case. Input and Output Slot are defined by a DatasetType that keeps track of the type and structure being transferred between operators through a slot.

Going further with this explanation, let's review the abstract class BinaryToUnaryOperator. It receives three generic types [1] corresponding to the operator's two input types and its single output type. By extending this class, the user can model Join, Union, and Intersect operators.
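A hypothetical sketch of what extending this abstraction looks like; here BinaryToUnaryOperator is reduced to a bare generic shell and Pair stands in for a tuple type, so this illustrates the pattern rather than the actual class:

    import java.util.function.Function;

    // Bare shell of the abstraction described above: two typed inputs, one typed output.
    // The real class also attaches a DatasetType to each slot.
    abstract class BinaryToUnaryOperator<Input0, Input1, Output> { }

    // Minimal pair type standing in for a tuple of joined elements.
    record Pair<A, B>(A left, B right) { }

    // Modeling a platform-independent Join by fixing the three generic types: it consumes
    // a stream of T and a stream of U and produces pairs of elements with matching keys.
    class JoinOperator<T, U, Key> extends BinaryToUnaryOperator<T, U, Pair<T, U>> {
        final Function<T, Key> leftKey;
        final Function<U, Key> rightKey;

        JoinOperator(Function<T, Key> leftKey, Function<U, Key> rightKey) {
            this.leftKey = leftKey;
            this.rightKey = rightKey;
        }
    }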


Blossom Operator Classes

Blossom Operator Classes are the actual nodes that compose a Blossom Plan. The purpose of a Blossom operator class is to define a promised functionality that could be implemented on different processing platforms. The Blossom community usually calls these operators platform-independent operators. Blossom Operator Classes do not describe how a specific functionality will be delivered; that is tightly dependent on each underlying platform that Blossom can use to run the operator.

Any operator class that extends an Input/Output Cardinality Abstract Operator is a Blossom Operator Class. Let's review the CountOperator Blossom Operator Class: CountOperator<Type> extends UnaryToUnaryOperator<Type, Long>, meaning that it receives a generic type and returns a Long value. Therefore, the only restriction on platforms implementing this operator is that, at execution time, a CountOperator will receive a stream of Type elements and, after processing them, must return a single Long value. Any platform that wants to support CountOperator must follow that pattern.
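A sketch of the same idea in code, with UnaryToUnaryOperator reduced to an empty placeholder: the platform-independent class fixes only the input and output types and deliberately contains no behavior.

    // Placeholder mirroring the class named in the text: one typed input, one typed output.
    abstract class UnaryToUnaryOperator<Input, Output> { }

    // Platform-independent promise: "a stream of Type in, a single Long out".
    class CountOperator<Type> extends UnaryToUnaryOperator<Type, Long> {
        // Intentionally empty: *how* the count is computed is left to the execution
        // operator of whichever platform ends up running it.
    }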



Channels

A Channel in Blossom is the interface that interconnects different sets of operators; in other words, a channel is the glue that connects one operator to another. Imagine an operator "Source" running in Java and reading tuples from a local file; the output of "Source" will be a collection of tuples. In this case, the output channel provided by Source is a Java Collection. A Java collection channel can only be used as input to operators that accept a Java Collection as their input format. To allow platforms other than Java to accept the output of Source, this Java Collection must be converted into another format.
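To make the conversion idea concrete, here is a stripped-down, hypothetical example (not Blossom's actual channel-conversion code): the payload of a Java collection channel is repackaged so that a Flink operator can consume it.

    import java.util.List;

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class CollectionToFlinkChannel {
        public static void main(String[] args) throws Exception {
            // What a Java "Source" operator would expose through a collection channel.
            List<String> javaCollectionChannel = List.of("a", "b", "c");

            // The conversion step that makes the same data consumable as a Flink DataSet.
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> flinkChannel = env.fromCollection(javaCollectionChannel);

            System.out.println(flinkChannel.count());  // 3
        }
    }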

Execution Operators

A CountOperator cannot run unless a specific behavior is given. An execution operator implements the procedure an operator follows for its correct execution on a specific platform. Let's see two examples:

JavaCountOperator<Type> extends CountOperator<Type> and implements the interface JavaExecutionOperator. In the case of the Java platform, the evaluate method gives this ExecutionOperator its behavior; note from the code extract that the operator uses a Java CollectionChannel and, after casting the channel's payload to a Collection, uses the standard Collection.size method to get the result.
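A simplified sketch of that behavior, reusing the CountOperator placeholder from the sketch above; the real evaluate method also receives executor and optimization-context arguments that are omitted here.

    import java.util.Collection;

    // Simplified, hypothetical sketch of the Java execution operator described above.
    class JavaCountOperator<Type> extends CountOperator<Type> {

        // Count by casting the collection channel's payload and using Collection.size().
        @SuppressWarnings("unchecked")
        public long evaluate(Object collectionChannelPayload) {
            Collection<Type> input = (Collection<Type>) collectionChannelPayload;
            return input.size();
        }
    }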

On the other hand, FlinkCountOperator<Type> extends CountOperator<Type> and implements the interface FlinkExecutionOperator. The code extract shows that, in this case, the channel must be a Flink DataSetChannel, and the operation is equally trivial, returning the result of Flink's DataSet.count method.
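And a similarly simplified sketch of the Flink counterpart: the same promised functionality, delivered by delegating to Flink's DataSet.count.

    import org.apache.flink.api.java.DataSet;

    // Simplified, hypothetical sketch of the Flink execution operator described above.
    class FlinkCountOperator<Type> extends CountOperator<Type> {

        public long evaluate(DataSet<Type> dataSetChannel) throws Exception {
            return dataSetChannel.count();
        }
    }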

The execution operators of an operator are all the implementation alternatives through which a Blossom operator can be included in an executable plan. To decide which alternative is more efficient for a given optimization goal, Blossom compares estimates of the resources required by the different ExecutionOperators running on the available platforms.

In the next part of this tutorial, we will review how Blossom optimizes a plan, what pieces of code must be provided in our plugins to allow this, and a real example of a custom executor that includes a Postgres platform.

References

20 June 2022

The 3 pillars for effectively and efficiently leading remote work teams

We live in a time when distance is no longer a problem. Many communication and meeting-management tools allow us to interact with people from very diverse geographical locations. But although these tools are a great help for contacting our friends and family, the question one might ask is: can they coordinate, manage, and monitor our work teams in the same way? Maybe not by themselves.


At Databloom, we have taken on this challenge from a perspective of cultural plurality and the richness that comes from having developers and engineers with diverse skills and knowledge but a common purpose: achieving organizational goals in time, quality, and innovation.

The foundations for meeting this challenge are diverse, but at least the following three fundamental pillars for managing successful teams should be highlighted.

Pillar 1. Form teams with high standards of quality and commitment:


This is undoubtedly the first step toward good management. Human groups, and particularly work teams whose members differ in age, experience, knowledge, and even personality, must share essential common values. Some of these values are teamwork, generosity and patience in sharing knowledge, and professionalism, coupled with a strong sense of commitment and of belonging to the organization.

Our frontend development team at Databloom is fortunate to have highly professional, responsible, collaborative, and creative members. Undoubtedly, each of them has strong skills in programming, application development, technology integration, abstraction, critical thinking, and motivation for innovation. But none of this has real value if it is not accompanied by great human and professional quality, as well as an entrepreneurial spirit and the courage to face new challenges.

Pillar 2. Establish clear and measurable personal, group, and organizational goals.


When each team member has clearly defined tasks, responsibilities, and deadlines, and is also aware of how each of these responsibilities forms part of a larger team objective and how meeting those objectives leads to achieving the goals of the organization, then you achieve the collaborative culture that enables team members to complement each other when executing a task.

No less important is establishing personal and team goals for reaching a common standard in handling technologies and programming languages, and in knowledge of development best practices.

One of the main challenges we face in our frontend development team is making research and experimentation fundamental pieces of a good development process.

Pillar 3. The role played by the team leader.


The team leader must be a facilitator, not just a task controller. His first responsibility is to get to know each member of his team deeply. He must know the strengths and weaknesses as well as the potential of each one of his collaborators. He must establish relationships of trust and respect between each of the members and build bridges for collaborative work and collaborative problem-solving.

The team leader must provide fluid, permanent and simple communication channels.

Each member of the team must be sure that their leader is available to help them find a solution or look for alternatives that allow them to overcome the obstacles and doubts that appear in the development process.

If we have the three pillars mentioned above, then whatever tool you use to manage your work meetings, follow-up meetings, or simply conversations to review obstacles and problems (Google Meet, Zoom, WhatsApp, Microsoft Teams, etc.), you have a high probability of managing your work team efficiently, with the certainty that both you and each of its members will feel confident raising concerns and asking for help whenever necessary:

Help that you will certainly find in a friendly, reliable, and professional environment.

At Databloom we believe in a collaborative culture, where the role each member plays is extremely important; it is therefore everyone's duty to be a support and provide timely help when it is required. We establish regular, focused meetings, set short-term goals, and continuously monitor activities.

We give Databloom developers enough autonomy to manage their time and commitments and, more importantly, the freedom to create and propose their ideas, thereby enriching the final product and the work environment.

15 June 2022

Machine Learning for X and X for Machine Learning

After the recent advances in Artificial Intelligence (AI), and especially in Machine Learning (ML) and Deep Learning (DL)*, various other computer science fields have gone into a race of "blending" their existing methods with ML/DL. There are two directions to enable such a blend: Either using ML to advance the field, or using the methods developed in the field to improve ML. A commonly used slogan when combining ML with a computer science field is: ML for X or X for ML, where X can be, for instance, any of {databases, systems, reasoning, semantic web}. 

In this blog post, we focus on cases where X = big data management. We have been observing work on ML for data management and on data management for ML for several years now. Both directions have a great impact in academia, with dedicated new conferences popping up, as well as in industry, with several companies working on either improving their technology with ML or providing scalable and efficient solutions for ML.

Databloom.ai is one of the first companies to embrace both directions within a single product. Within Databloom's product, Blossom, we utilize ML to improve the optimizer and thus provide better performance to users, and we also utilize well-known big data management techniques to speed up federated learning.

 

Databloom.ai is the first to embrace both directions within a single product


Machine Learning for Big Data Management and Big Data Management for Machine Learning in databloom.ai

 

ML for Data Management

In a previous blog post we discussed how we plan to incorporate ML into Blossom's optimizer. Having an ML model that predicts the runtime of an execution plan has several benefits. First, system performance can soar because the optimizer is able to find very efficient plans. For example, for k-means, an ML-based optimizer [1] outputs very efficient plans by finding the right platform combination, achieving 7x better runtime performance than a highly tuned cost-based optimizer! Second, the hard work of manually tuning a cost model in an optimizer vanishes. With just the collected training data (plans and their runtimes), you can easily train an ML model.
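A toy sketch of where such a model plugs in, with invented plan features and a linear stand-in for the trained model (the real approach is described in [1]): the optimizer only needs a function from plan features to predicted runtime and picks the candidate with the lowest prediction.

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.ToDoubleFunction;

    public class LearnedOptimizerSketch {

        // Hypothetical plan features; a real model would use many more.
        record PlanFeatures(String plan, double inputGb, int sparkOperators, int javaOperators) { }

        public static void main(String[] args) {
            // A model trained offline on (plan features, measured runtime) pairs would be
            // loaded here; this linear stand-in only illustrates the interface.
            ToDoubleFunction<PlanFeatures> predictedRuntime =
                    f -> 2.0 * f.inputGb() + 5.0 * f.javaOperators() + 1.0 * f.sparkOperators();

            List<PlanFeatures> candidates = List.of(
                    new PlanFeatures("all-Java", 10.0, 0, 4),
                    new PlanFeatures("all-Spark", 10.0, 4, 0),
                    new PlanFeatures("mixed", 10.0, 2, 2));

            // Plan enumeration stays the same; only the cost estimate is learned.
            PlanFeatures best = candidates.stream()
                    .min(Comparator.comparingDouble(predictedRuntime))
                    .orElseThrow();

            System.out.println("Chosen plan: " + best.plan());
        }
    }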


Data Management for ML

In an earlier blog post we discussed how Blossom, a federated data lakehouse analytics framework, can provide an efficient solution for federated learning by following principles from data management. To enable federated learning, we are working towards a parameter server architecture. However, as simplistic parameter servers are very inefficient due to excessive network communication (i.e., a large number of messages sent through the network), we utilize data management techniques such as increasing locality, enabling latency hiding, and exploiting bulk messaging. The database community has already exploited such optimizations and has shown that they can lead to more than an order of magnitude faster training times [2].

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.

[2] Alexander Renz-Wieland, Rainer Gemulla, Zoi Kaoudi, Volker Markl: NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. SIGMOD Conference 2022: 481-495.

* We use the term ML to refer to both Machine Learning and Deep Learning.
** Icons appearing in the figure were created by Creative Stall Premium - Flaticon

 


 

 

 

 

