Posts

Showing posts from February, 2022

The Data Mesh - should you adopt?

In practice, not every firm is a good fit for a Data Mesh. Its primary target audience is larger enterprises that experience uncertainty and change in their operations and environment. A Data Mesh is an unnecessary expense if your organization's data requirements are modest and remain constant over time. What is a "Data Mesh"? Because it focuses on delivering useful and safe data products, Data Mesh is a strategic approach to modern data management and a way to support an organization's journey toward digital transformation. Data Mesh's major goal is to advance beyond the established centralized data management techniques of using data warehouses and data lakes. By giving data producers and data consumers the ability to access and handle data without having to involve the data lake or data warehouse team, Data Mesh emphasizes organizational agility. Data Mesh's dec

Poisoning attacks in Federated Learning

Federated learning is a double-edged sword: it is designed to ensure data privacy, yet it also opens a door for adversaries to exploit the system. One of the popular attack vectors is a poisoning attack. What is a poisoning attack? A poisoning attack aims to degrade machine learning models and can be classified into two categories: data poisoning and model poisoning attacks. Data poisoning attacks aim to contaminate the training data to indirectly degrade the performance of machine learning models [1]. They can be broadly classified into two categories: (1) label flipping attacks, in which an attacker "flips" labels of training data [2], and (2) backdoor attacks, in which an attacker injects new or manipulated training data, resulting in misclassification at inference time [3]. An attacker may perform global or targeted data poisoning attacks. Targeted attacks are harder to detect, as they only manipulate a specific class an
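The label flipping attack mentioned in this excerpt can be illustrated with a small sketch. It is a toy example under stated assumptions, not the method used in the post or in the cited papers: the helper name `flip_labels`, its parameters, and the sample labels are all hypothetical, and a real attack would operate on one worker's local training set.

```python
import random


def flip_labels(labels, source, target, fraction, seed=0):
    """Return a copy of `labels` in which a fraction of the `source`
    labels is replaced by `target` (a targeted label flipping attack).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Only samples of the attacked class are candidates for flipping.
    candidates = [i for i, y in enumerate(labels) if y == source]
    n_flip = int(len(candidates) * fraction)
    chosen = set(rng.sample(candidates, n_flip))
    return [target if i in chosen else y for i, y in enumerate(labels)]


# Toy local dataset of a malicious worker: class 1 is the attacked class.
clean = [0, 1, 1, 0, 1, 1, 0, 1]
poisoned = flip_labels(clean, source=1, target=0, fraction=0.5)
changed = [i for i, (a, b) in enumerate(zip(clean, poisoned)) if a != b]
```

Because only one class is touched and the remaining labels stay intact, the poisoned set looks mostly normal, which is one reason targeted attacks are hard to spot.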

Federated Learning (Part II): The Blossom Framework

This is the second post of our Federated Learning (FL) series. In our previous post, we introduced FL as a distributed machine learning (ML) approach where raw data stays on the workers that hold it. We now take a dive into Databloom Blossom, a federated data lakehouse analytics framework, which provides a solution for federated learning. The research and industry communities have already started to provide multiple systems in the arena of federated learning. TensorFlow Federated [1], Flower [2], and OpenFL [3] are just a few examples of such systems. All these systems allow organizations and individuals (users) to deploy their ML tasks in a simple and federated way using a single system interface. What Is the Problem? Yet, there are still several open problems that have not been tackled by these solutions, such as preserving data privacy, model debugging, reducing wall-clock training times, and reducing the trained model size. All are of equal importance. Among all o

Encryption with HDFS and Kerberos

We at databloom.ai deal with a large ecosystem of big data implementations, most notably HDFS with encryption in flight and at rest. We also see a lot of misconfigurations and want to shed some light on the topic with this technical article. We use plain Apache Hadoop, but the same technical background applies to other distributions like Cloudera. Encryption of data was and is the hottest topic in terms of data protection and prevention against theft, misuse and manipulation. Hadoop HDFS supports full transparent encryption in transit and at rest [1], based on Kerberos implementations, often used within multiple trusted Kerberos domains. Technology Hadoop KMS provides a REST API with built-in SPNEGO and HTTPS support, and mostly comes bundled with a pre-configured Apache Tomcat within your preferred Hadoop distribution. To make encryption transparent for the user and the system, each encrypted zone is associated with a SEZK (single encryption zone key), created when the zone is defi