20 January 2022

Apache Spark vs Apache Flink, Stop it, why not better together?

In the world of batch processing, there are currently two big guys out there: Apache Spark[1] and Apache Flink[2]. Apache Spark was born in batch processing and was built to perform as well as possible in that environment. Apache Flink, on the other hand, can run in batch mode depending on its configuration, but it is designed for streaming environments. Both, however, can execute dataflows and process terabytes without any problem, and both were built to extract the best performance from the machines on which they are running.

Reviewing them feature by feature and design by design, you will find that each has some ideal use cases where it is unbeatable. If the environment and workflow are well known, and their characteristics match one of those use cases, sit back and enjoy the performance. But how do you know that you belong to one of those cases? Do the use cases cover your daily workload? Is the workload similar enough to the assumptions the developers made when they built the platform? Is the current execution time the best possible for the workload? How can you be sure?

Many people these days have the same questions in mind. Some pick the most popular platform; others prefer the one with the biggest community, to get answers faster; another group takes the one that is fastest in a synthetic benchmark; and the experienced ones choose the one that offers the best performance on a selected workload. Nevertheless, even those who experiment face a new question: “Is my workload still in the area where the platform shines?” New questions about which platform is best for the current workload keep appearing every day.

A year ago, a group of researchers asked the same questions[3], and they concluded that each workload, query, and amount of data has its own best configuration in terms of resources, and even its own best platform. For example, nobody would think it is a good idea to start an Apache Spark or Apache Flink cluster to process 1GB of data, simply because it is a waste of money. This example raises further questions: "When is it a good idea to start the cluster?" and "What is the proper size of the cluster?"

Another example: several companies start with a small amount of data and a few processing scripts, sometimes even in bash, that are capable of meeting their needs. Over time, however, the scripts get slower and slower because the amount of data they process keeps growing. In many cases the team then realizes that it needs to start thinking about one of the two big guys. In this example, the question becomes how to monitor the processing queries: by time? By the memory they use? By the input size?

The questions raised here are just a tiny fraction of the ones that bother many developers and managers; however, there is a system that helps with them. Apache Wayang[4] is a cross-platform data processing system that considers several factors to pick the combination of platforms and configuration that runs the query in the shortest time possible.

The goal of Apache Wayang is to detect when to change the platform. This means that the query needs to be implemented just once, and if the data grows, the query will keep running with the best performance possible. For example, if the input size is 1GB today and the data grows to 100GB in one month, Apache Wayang will detect the right moment to switch platforms.

If a workload that focuses on just relational processing starts moving slowly toward graph processing, Apache Wayang will execute the queries that require graph processing on specialized execution engines, while the rest of the workload continues running as usual. And if a faster platform appears, just plug it into Apache Wayang and see whether the platform offers an advantage for the current workload. If it does, the workload will be executed on that platform.
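To make the idea of switching platforms as data grows a bit more concrete, here is a toy sketch in Python. It is not Apache Wayang's API, nor its optimizer: the `Platform` class, the `pick_platform` function, the platform names, and every number are made up for illustration. It only assumes that each platform has a fixed startup cost plus a cost per gigabyte, and picks the cheapest estimate for a given input size.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Platform:
    """Hypothetical platform description with an invented linear cost model."""
    name: str
    startup_seconds: float   # fixed cost paid before any data is processed
    seconds_per_gb: float    # marginal cost for each gigabyte of input

    def estimated_seconds(self, input_gb: float) -> float:
        return self.startup_seconds + self.seconds_per_gb * input_gb


def pick_platform(platforms: List[Platform], input_gb: float) -> Platform:
    """Return the platform with the lowest estimated run time for this input size."""
    return min(platforms, key=lambda p: p.estimated_seconds(input_gb))


# Illustrative numbers only: a cheap-to-start single node vs. a cluster
# that is slow to start but scales better.
platforms = [
    Platform("single-node", startup_seconds=1.0, seconds_per_gb=20.0),
    Platform("cluster", startup_seconds=60.0, seconds_per_gb=2.0),
]

small_choice = pick_platform(platforms, input_gb=1)      # data today
large_choice = pick_platform(platforms, input_gb=100)    # data in one month

# "Plugging in" a new platform is just one more entry in the list;
# the query itself does not change.
with_new_engine = platforms + [
    Platform("new-engine", startup_seconds=30.0, seconds_per_gb=1.0)
]
new_choice = pick_platform(with_new_engine, input_gb=100)

print(small_choice.name, large_choice.name, new_choice.name)
```

With these invented costs, the single node wins for 1GB, the cluster wins for 100GB, and the newly added engine takes over the 100GB case as soon as it is registered. A real system has to consider many more factors than input size, but the principle of comparing estimates instead of hard-coding a platform is the same.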

These days, we need to stop asking which platform is the best, because each has some characteristic where it is unbeatable. A company's workload needs the best platforms, and forcing it to run on a single platform is not fair to it; today's workloads need various platforms working together. Apache Wayang answers the question “Which is better?” with: "Tell me your query, and I will tell you which combination is the best."

