Who of you has right now the reasonable certainty that all your Spark jobs are performing at their maximum without wasting more computational resources than necessary? If so, what information are you basing it on? Would you be able to quantify the waste of resources in economic terms? In my on-field 7 years’ experience at Agile Lab, about 5 to 10% maximum of Spark users really know and master what’s happening inside their processes.
Apache Spark is one of the most successful open-source projects in the world, and that’s because it could make big (huge) data processing extremely easy by hiding its distributed computation complexity inside its engine. This allowed every programmer without specific knowledge to approach and use the Spark API to manipulate a distributed dataset as it was a collection of objects in the local memory.
Unfortunately, writing an algorithm or an ETL pipeline in Spark, leveraging its built-in APIs, doesn’t guarantee it will be able to linearly scale up with the data increase, nor it always auto-optimize itself in order to perform at its best, according to the given computational resources.
Even now, after many years after its mass adoption, it’s not so unusual to incur into underperforming or resources overusing solutions. This is because distributed-systems-inexperienced developers are put without worries to write computationally complex processes as if they were just like “normal” Java applications.
But what are the turns of this in terms of business?
Execution time multiplied by the allocated resources (CPU and RAM) is directly proportional to the economic costs and, in the Cloud world, this is even more impacting than on-premise.
Typically, Big Data infrastructures are very expensive and it’s not so rare to find out Spark jobs allocating hundreds of vcores and GB of memory.
So basically, either inefficient jobs occupy a certain amount of resources for more time than necessary, or they allocate more resources than necessary thus wasting them during their runtime. Such wastes repeat themselves every time these jobs are executed, thus making it to all effects a recurring cost (OPEX).
How can we know if our jobs are efficient or not?
First rule – Don’t believe what developers declare about these jobs and here’s why:
- The tuning process is long and boring, in fact it requires to rerun the job over and over, simulating different scenarios and collecting performance data in an extremely scrupulous fashion. These are repetitive and high precision tasks that most of developers just avoid doing.
- It’s difficult to optimize something you don’t have a reference benchmark against, so the common practice is thinking to have reached the top performance too soon, also because questioning ourselves is always a difficult path to undertake.
- Usually, they don’t precisely know the dimensions over which to optimize and they don’t have the correct tools to instrument the jobs.
Let’s make an example of how I’ve usually seen Spark jobs tuning done.
The easiest approach is to provide the job with a certain amount of resources and then progressively reduce them until it doesn’t crash, then increase them a little so to have an extra safety margin.
But can we really trust this process? What happens if the input data volume is not constant day after day? Are we sure we can’t optimize the code and making it more efficient? Did we verify that the crash wasn’t caused by data skewness on a specific node?
The answer is not trivial and it requires experience and different debug and instrumentation tools. For instance, let me tell you about an episode that will show you how much the process mentioned above is superficial.
It occurred to me once to troubleshoot a job allocating about 100 Spark executors with 20GB RAM each, so a total of 2TB RAM. It was an import/ingestion job scheduled to run every day and lasting about 30 minutes, thus costing about 200$ per run. The developers team did try to reduce the amount of allocated executor’s memory, but faced OutOfMemory issues, so they were resigned it was performing the best possible. The number of executors was so high since they actually noticed little improvements in the job duration time by increasing the overall parallelism.
After the analysis, it came out that only one of the executors was actually working for the first 28 minutes (out of 30) and that it needed 20GB of RAM to complete its tasks. We modified the source code and the files structure then and got to have the job to linearly scale up and completing in just 12 minutes with only 20 executors having 20GB RAM each, thus making it a 16$ cost per run (67.000$ save per year!).
To conclude, my strong advice for all the IT executives managing Spark-based projects/systems/applications and — most of all — their associated costs is to always ask detailed reports about the performance tuning processes (before going into production), such as:
- Audit of all the tests attempts, with related configuration, input data e timings of all the main elaboration steps.
- Memory and CPU profiling of all the executors JVMs (not the hosts!)
- Data skewness analysis, to understand if data is uniformly and efficiently distributable across the executor’s cores during all the elaboration steps.
- Time lost by executors, to verify that there are no hot-spotting ones thus keeping most of the others idle.
- Worst case scenario simulations, in terms of data and in terms of cluster events, such as YARN or the cluster resource manager not able to allocate the requested containers, input data doubles or accumulates for days due to unpredicted system stops, etc …
- Opportunities to optimize the code base and effort spent in this direction.
I know this seems to be an expensive extra activity, but they are CAPEX costs that positively impact the OPEX ones. If they tell you no benefits/improvements come out of this type of tasks, then beware!
Would you like to go deeper? Take a look at Agile Lab Spark Radar to find out how you can easily optimize your Spark applications performance.