Spark Architecture (How Spark Works)

Ayush Singh
3 min read · Jun 21, 2022

Spark is a distributed computing platform, and every Spark application is a distributed application.

What is a cluster?

A cluster is a pool of physical computers viewed as a single system. For example, suppose there are 4 nodes (physical computers) in a cluster, each with 16 CPU cores and 64 GB of RAM, and all of these nodes are connected through the Hadoop YARN cluster manager. The total cluster capacity is therefore 64 CPU cores and 256 GB of RAM.

Suppose I want to run a Spark application on this 4-node cluster, where one node is the master and the other three are worker nodes. When I run the spark-submit command, the request goes to the YARN resource manager. The resource manager creates one Application Master container on a worker node and starts my application's main method inside that container (a container is an isolated runtime environment that comes with a CPU and memory allocation).
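As a rough sketch of what that main method can look like, assuming a single-file PySpark application (the file name pi_app.py and all resource numbers below are placeholders I picked for illustration):

```python
# Hypothetical submit command (run from a shell; values are placeholders):
#   spark-submit --master yarn --deploy-mode cluster \
#                --num-executors 3 --executor-cores 4 --executor-memory 8G \
#                pi_app.py
#
# pi_app.py -- the "main method" that runs inside the Application Master container.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("pi-app").getOrCreate()

    # A trivial distributed job: the driver splits this range into tasks
    # and the executors compute the partial counts.
    count = (spark.sparkContext
                  .parallelize(range(1_000_000), 8)
                  .filter(lambda x: x % 2 == 0)
                  .count())
    print(count)

    spark.stop()
```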

Note: Spark is written in Scala, and Scala is a JVM language, so Spark's core always runs in a JVM. The developers wanted to make Spark available to Python developers as well, so they created a Java wrapper on top of the Scala core, and then a Python wrapper on top of the Java wrapper. This Python wrapper is known as PySpark :)

Let’s go inside the driver container and see what happens there:

The container runs the main method of the application. Suppose the code is Python code. PySpark is designed to start the Java main method internally, so the PySpark application starts a JVM application. Once the JVM application is up, the Python wrapper calls the Java wrapper over a Py4J connection (Py4J allows a Python application to call into a Java application).
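To make that mechanism concrete, here is a minimal plain-Py4J sketch (this is not PySpark's internal code); it assumes a Java GatewayServer is already listening on the default port, which is exactly what the PySpark launcher arranges for you:

```python
# Minimal Py4J sketch: Python driving objects that live inside a JVM.
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                      # connect to the running JVM gateway
jrandom = gateway.jvm.java.util.Random()     # create a java.util.Random inside the JVM
print(jrandom.nextInt(100))                  # the method call executes in the JVM,
                                             # the result travels back to Python
```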

The application driver distributes the work to others by creating executors.
How does this happen? After starting the application, the driver goes back to the YARN resource manager and asks for more containers. The resource manager creates additional containers on the worker nodes and hands them over to the driver. The driver then starts a Spark executor in each of these containers, one executor per container. A Spark executor is nothing but a JVM application.
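As a hedged illustration of how these executors are requested: the property names below are real Spark settings, but the sizes are made up and should be tuned to your cluster. In cluster mode the same values are usually passed to spark-submit as --num-executors, --executor-cores and --executor-memory.

```python
from pyspark.sql import SparkSession

# Ask YARN for 3 executor containers, each with 4 cores and 8 GB of heap.
# (The driver negotiates these containers with the resource manager;
#  one Spark executor JVM runs inside each container.)
spark = (SparkSession.builder
         .appName("executor-sizing-demo")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```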

Note that on the executors, a Python worker (runtime) is created alongside the JVM executor to run plain Python whenever you use Python functionality that is not Java-wrappable, e.g. some Python-specific libraries and Python UDFs.
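For example, a plain Python UDF like the hypothetical shout function below has no Java equivalent, so each executor ships rows to its Python worker, runs the function there, and sends the results back to the JVM:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Plain Python logic with no Java counterpart -> executed in the Python worker.
@udf(returnType=StringType())
def shout(s):
    return s.upper() + "!"

df = spark.createDataFrame([("hello",), ("spark",)], ["word"])
df.select(shout("word").alias("loud")).show()
```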
