Spark Remote Debugging
Hi everybody! I’m a Big Data Engineer @ Agile Lab, a remote-first Big Data engineering and R&D firm located in Italy. Our main focus is to build Big Data and AI systems, in a very challenging — yet awesome — environment.
At Agile Lab we use, among other technologies, Apache Spark as a processing engine in various projects. Spark is fast, and it is simple for a developer to write code that can run right away, but, as with regular programs, it is not always easy to understand what a Spark job is doing.
This article will focus on how a developer can remotely debug a running Spark Scala/Java application (running on YARN) using IntelliJ IDEA, but all the Spark and environment configurations also hold for other IDEs.
Agent JDWP, licence to debug
To perform remote debugging of a Spark job, we leverage the JDWP agent (Java Debug Wire Protocol) that defines a communication protocol between a debugger and a running JVM. JDWP defines only the format and layout of packets exchanged by the debugger and the target JVM, while the transport protocol can be chosen by the user. Usually, the available transport mechanisms are shared memory (dt_shmem) and socket (dt_socket) but only the latter, which uses a TCP socket connection to communicate, can be used for remote debugging.
So, to enable remote debugging, we must configure the target JVM with the following Java property, which makes it act as a JDWP server to which our IDE can connect:
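Reconstructed from the parameter breakdown that follows (with port 4747 as in the example), the JVM option is typically:

```shell
# JDWP agent option: open a TCP debug socket on port 4747 and suspend the
# JVM until a debugger attaches.
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4747
```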
The property above tells the JVM to load the JDWP agent and wait for a socket connection on the specified port. In particular:
transport=dt_socket tells the agent to use socket as the desired transport mechanism.
server=y means that the JVM will act as a JDWP server: it will listen for a debugger client to attach to it.
suspend=y tells the JVM whether it must wait for a debugger connection before executing the main function. If this is set to false (n), the JVM starts executing the main function immediately while still listening for a debugger connection.
address=4747 specifies the port on which the debug socket will listen. In the example, the target JVM will listen on port 4747 for incoming client connections.
We will leverage the JDWP agent for all the following remote debugging scenarios, so remember that you can always adjust the configurations listed above to fit your use case.
You must choose your Spark deployment …but choose wisely
Before delving into debugging your application, here’s a quick recap of how a Spark job executes on a cluster; each Spark job requires:
* a process called the driver, which runs all the standard Java code of the application
* one or more executor processes, which run the code defined inside the transformations and actions of the RDDs/Datasets.
This means that in a realistic scenario we will have different JVMs running at the same time (often on different nodes): one for the driver and one for each executor.
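To make the driver/executor distinction concrete, here is a sketch of how the JDWP agent from the previous section could be attached to the driver JVM via spark-submit; the main class, jar name, and port are placeholders, while `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` are the standard Spark settings for passing extra JVM flags:

```shell
# Sketch: submit a Spark job on YARN with the JDWP agent attached to the
# driver JVM, which will suspend until a debugger connects on port 4747.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4747" \
  my-app.jar

# To debug code running inside the executors instead, attach the agent to the
# executor JVMs (typically with suspend=n, so executors do not block waiting
# for a debugger before starting):
#   --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=4747"
```

Since the driver and each executor are separate JVMs, possibly on different nodes, each one you want to debug needs its own agent configuration and its own debugger connection.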