Federated Learning: Old Wine in a New Bottle?

Machine Learning (ML) is a branch of Artificial Intelligence (AI). The main idea of ML is to enable systems to learn from historical data so that they can predict output values for new inputs. The beauty is that ML does not require systems to be explicitly programmed to do so, and it needs little human intervention.

With the growing volumes of data in today’s world, ML has gained unprecedented popularity. We can achieve today what was unimaginable yesterday: from predicting cancer risk from mammograms to polyglot AI translators. As a result, ML has become a key competitive differentiator for many companies, and ML-powered software is quickly becoming omnipresent in our lives. A key point in ML is that, in general, the more training data is available, the more accurate the predictive models tend to be.

The Appearance of Distributed ML
While ML has become quite a powerful technology, its hunger for training data makes it hard to build ML models on a single machine. It is not unusual to see training data sizes on the order of hundreds of gigabytes to terabytes, for example in the Earth Observation domain. This has created the need to build ML models over data that is distributed across multiple storage nodes.

Distributed ML aims at learning ML models using multiple compute nodes, both to cope with larger training datasets and to improve performance and model accuracy [1]. Thus, distributed ML helps organizations and individuals draw meaningful conclusions from vast amounts of training data. Healthcare and advertising are just two of the sectors that benefit most from distributed ML.

There exist two fundamental ways to perform distributed ML: data parallelism and model parallelism [2]. Figure 1 illustrates these two approaches for distributed ML.

In the data parallelism approach, the system horizontally partitions the input training data, usually into as many partitions as there are available compute nodes (workers), and distributes each partition to a different worker. It then sends the same model features to each worker, which in turn learns a local model using its data partition as input. Finally, the workers send their local models to a central place, where the system merges them into a single global model.
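
The data-parallel flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`partition`, `fit_local`, `merge`) are hypothetical, the "model" is a one-dimensional least-squares line, and merging by size-weighted averaging is one possible merge rule, not necessarily the one a given system uses.

```python
from statistics import fmean


def partition(rows, num_workers):
    """Horizontally split the rows into num_workers roughly equal partitions."""
    return [rows[i::num_workers] for i in range(num_workers)]


def fit_local(rows):
    """Least-squares fit of y = w * x + b on a single worker's partition."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = fmean(xs), fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in rows) / var
    return w, my - w * mx


def merge(local_models, sizes):
    """Merge local models into a global one (assumed rule: size-weighted average)."""
    total = sum(sizes)
    w = sum(m[0] * n for m, n in zip(local_models, sizes)) / total
    b = sum(m[1] * n for m, n in zip(local_models, sizes)) / total
    return w, b


# Toy, noise-free data drawn from y = 2x + 1, so all partitions agree.
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]

parts = partition(data, num_workers=3)             # one partition per worker
local_models = [fit_local(p) for p in parts]       # each worker trains locally
global_model = merge(local_models, [len(p) for p in parts])
```

Note that the workers here still receive their partitions from a central place that holds the full dataset, which is exactly the assumption that distinguishes this setting from federated learning.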

The model parallelism approach, in contrast, partitions the model features and sends each model partition to a different worker, which in turn builds a local model using the same input data. That is, the entire input training data is replicated on all workers. The system then brings these local models to a central place and aggregates them into a single global model.
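
As a rough sketch of model parallelism, the example below splits the weights of a linear model across workers by feature, while every worker sees the full input. The `Worker` class, `split_features`, and `train` are hypothetical names introduced for illustration. The example also assumes that workers exchange partial predictions in every training round, a detail that goes beyond the description above.

```python
def split_features(num_features, num_workers):
    """Assign each worker a contiguous slice of feature indices."""
    step = -(-num_features // num_workers)  # ceiling division
    return [range(s, min(s + step, num_features))
            for s in range(0, num_features, step)]


class Worker:
    """Holds one partition of the model: the weights of its feature slice."""

    def __init__(self, feature_ids):
        self.feature_ids = feature_ids
        self.weights = [0.0] * len(feature_ids)

    def partial_predict(self, row):
        # Contribution of this worker's weights to the prediction for one row.
        return sum(w * row[j] for w, j in zip(self.weights, self.feature_ids))

    def update(self, rows, residuals, lr):
        # Gradient step on this worker's weights only, using the full input.
        n = len(rows)
        for k, j in enumerate(self.feature_ids):
            grad = sum(r * row[j] for r, row in zip(residuals, rows)) / n
            self.weights[k] -= lr * grad


def train(rows, targets, num_workers, epochs=300, lr=0.5):
    workers = [Worker(ids) for ids in split_features(len(rows[0]), num_workers)]
    for _ in range(epochs):
        # Partial predictions from all workers are summed into full predictions.
        preds = [sum(w.partial_predict(row) for w in workers) for row in rows]
        residuals = [p - t for p, t in zip(preds, targets)]
        for w in workers:
            w.update(rows, residuals, lr)
    # Aggregate the model partitions into a single global weight vector.
    return [wt for w in workers for wt in w.weights]


# Toy data generated from the weights [1.0, 2.0, -1.0, 0.5].
rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
targets = [1.0, 2.0, -1.0, 0.5, 2.5]
learned = train(rows, targets, num_workers=2)
```

Here the global model is simply the concatenation of the per-worker weight slices. Real systems typically partition much larger models, such as the layers of a neural network.
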
"Yet, although powerful, distributed ML has a core assumption that limits its applicability: one needs to have control and access over the entire training data."
In an increasing number of cases, however, one cannot get direct access to the raw data, so distributed ML cannot be applied. The healthcare domain is a typical example.

The Emergence of Federated Learning
The concept of federated learning (FL) was first introduced by Google in 2017 [3]. Yet, the concept of federated analytics/databases dates from the 1980s [4]. Similar to federated databases, FL aims at bringing computation to where the data is.

Federated learning (FL) is basically a distributed ML approach but, in contrast to traditional distributed ML, the raw data held by the workers never leaves them. The workers own the data, and they are the only ones with control over and direct access to it. Generally speaking, FL makes it possible to learn from a more diverse set of datasets that reside at different independent/autonomous locations.

Ensuring data privacy is crucial in today’s world, where data privacy is becoming one of society’s main concerns. For example, many governments have passed laws, such as GDPR [5] and CCPA [6], to control how data is stored and processed. FL enables organizations and individuals to train ML models across multiple autonomous parties without compromising data privacy. During training, the parties share their local models so that they can learn from each other’s local models. Thus, organizations and individuals can leverage others’ data to learn more robust ML models than they could using their own data only.
"The beauty of FL is that it enables organizations and individuals to collaborate towards a common goal without sacrificing data privacy."
Multiple participants collaboratively train a model on their sensitive data and exchange only the learnt local models. Figure 2 illustrates the general architecture of FL.
FL also leverages the two fundamental execution modes to build models across multiple participants: horizontal learning (data parallelism) and vertical learning (model parallelism).
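
To make the horizontal case concrete, here is a minimal sketch in which each participant trains on its private data and shares only model parameters. It assumes a coordinator that combines the local models by sample-weighted averaging, a common aggregation rule usually called federated averaging. The `Client` class and `federated_round` function are hypothetical names, and the model is a toy linear regression.

```python
class Client:
    """A participant whose raw data stays private."""

    def __init__(self, rows):
        self._rows = rows  # (x, y) pairs; only this client reads them

    @property
    def num_samples(self):
        return len(self._rows)

    def local_train(self, w, b, epochs=5, lr=0.05):
        """Run a few gradient steps on private data; return parameters only."""
        n = len(self._rows)
        for _ in range(epochs):
            gw = sum((w * x + b - y) * x for x, y in self._rows) / n
            gb = sum((w * x + b - y) for x, y in self._rows) / n
            w, b = w - lr * gw, b - lr * gb
        return w, b


def federated_round(clients, w, b):
    """One round: send out the global model, collect and average local models."""
    updates = [(c.local_train(w, b), c.num_samples) for c in clients]
    total = sum(n for _, n in updates)
    new_w = sum(m[0] * n for m, n in updates) / total
    new_b = sum(m[1] * n for m, n in updates) / total
    return new_w, new_b


# Three participants, each holding private samples of y = 2x + 1.
clients = [
    Client([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]),
    Client([(3.0, 7.0), (4.0, 9.0)]),
    Client([(-1.0, -1.0), (-2.0, -3.0), (5.0, 11.0)]),
]

w, b = 0.0, 0.0
for _ in range(500):
    w, b = federated_round(clients, w, b)
# w and b end up close to 2 and 1 on this toy data.
```

The coordinator only ever sees `(w, b)` pairs and sample counts. A production setup would also need secure communication and, often, extra protections for the shared parameters, which this sketch leaves out.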

FL is a powerful technology that allows organizations and individuals to collaborate towards the same goal without sacrificing data privacy.
"Old wine in a new bottle? We can at least conclude that FL is a mixture of federated databases with distributed ML."
Yet, there are a few aspects that make FL unique:

  • In contrast to distributed ML, FL prevents organizations and individuals from accessing data that belongs to other organizations/individuals.
  • FL is geo-distributed in essence, while distributed ML is an on-premise technology.
  • One of the main goals of FL is safeguarding data privacy, whereas this is a nice-to-have feature in federated databases. Distributed ML does not address it because it assumes full control of the data.
  • While federated databases assume a relational data model, FL does not make any assumption about the underlying data model.
