Showing posts from January, 2022

The Data Mesh - should you adopt it?

In reality, not every organization is a good fit for a Data Mesh. Its primary audience is larger enterprises that face uncertainty and change in their operations and environment. If your organization's data requirements are modest and stay constant over time, a Data Mesh is most likely an unnecessary expense.

What is a "Data Mesh"? Data Mesh is a strategic approach to modern data management that supports an organization's journey toward digital transformation by focusing on the delivery of useful and safe data products. Its major goal is to move beyond the established centralized data management techniques built around data warehouses and data lakes. By giving data producers and data consumers the ability to access and handle data without the hassle of involving a central data lake or data warehouse team, Data Mesh emphasizes organizational agility. Data Mesh's dec…

Combined Federated Data Services with Blossom and Flower

In an earlier post in November, Jorge wrote about what Federated Learning (FL) is and why top data scientists use FL to train and productionize ML and AI models; underlining the rising use of Federated Learning, we also see high adoption in China [1]. When it comes to Federated Learning frameworks, we typically find two leading open source projects: Apache Wayang [2] (maintained by databloom) and Flower [3] (maintained by Adap). At first glance both frameworks seem to do the same thing, but, as usual, a second look tells another story. How does Flower differ from Wayang? Flower is a federated learning system, written in Python, that supports a large number of training and AI frameworks. The beauty of Flower is its strategy concept [4]: the data scientist can define which framework is used and how. Flower delivers the model to the chosen framework, watches the execution, collects the computed results, and starts the next cycle. That makes Federated Learning in Python easy, but also limit…
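The cycle described above (ship the model to clients, train locally, aggregate the results, repeat) can be sketched without any framework at all. Below is a minimal, illustrative FedAvg loop in plain Python on a toy linear model; the function names and the model are assumptions made for illustration, not Flower's actual API.

```python
def client_update(weights, local_data, lr=0.02, epochs=5):
    """One client's local training: gradient descent on the toy model y = w * x."""
    w = weights
    for _ in range(epochs):
        # Mean gradient of squared error over this client's local data only.
        grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
        w -= lr * grad
    return w

def fed_avg(client_weights, client_sizes):
    """Server-side aggregation: average the updates, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

# Two clients whose private data both follow y = 2x; the data never leaves them.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

global_w = 0.0
for _round in range(10):  # each round is one cycle of the strategy loop
    updates = [client_update(global_w, data) for data in clients]
    global_w = fed_avg(updates, [len(d) for d in clients])

print(round(global_w, 2))  # prints 2.0
```

The strategy concept generalizes exactly this: swap `fed_avg` for another aggregation rule, or `client_update` for a different training framework, without changing the orchestration loop.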

Apache Spark vs. Apache Flink - stop it, why not better together?

In the world of batch processing there are currently two big players: Apache Spark [1] and Apache Flink [2]. Apache Spark was born in batch processing and was built to perform as well as possible in that environment. Apache Flink, on the other hand, is a system that, depending on its configuration, can run in batch mode, but it is designed for streaming environments. Both can execute dataflows and process terabytes without any problem, and both were built to extract the best performance from the machines on which they are running. Reviewing feature by feature and design by design, it is possible to find that each of them has some ideal use cases where it is unbeatable. If the environment and workflow are well known, and their characteristics match one of the use cases of the big players, sit back and enjoy the performance; however, how do you know that you belong to one of those cases? Do those use cases cover your daily workload? I…