
In Azure Synapse, the system configuration of a Spark pool looks like the one below, where the number of executors, vCores, and memory is defined by default.


mubhashk_0-1612465171017.png


 


Some users may need to change the number of executors or the amount of memory assigned to a Spark session at execution time.


 


Usually, we can reconfigure these by navigating to the Spark pool in the Azure portal and setting the configuration on the pool by uploading a text file that looks like this:


 


mubhashk_1-1612465200341.png


 


mubhashk_2-1612465230427.png
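
As an illustration only (the exact properties in the screenshots above may differ), the uploaded file follows the standard spark-defaults.conf format of one property name and value per line, for example:

spark.executor.instances 4
spark.executor.cores 4
spark.executor.memory 8g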


 


But in the Synapse Spark pool, a few of these user-defined configurations get overridden by the Spark pool's default values.


 


What should be the next step to persist these configurations at the Spark pool session level?


 


For notebooks:


If we want to configure a session with more executors than are defined at the system level (in this case 2 executors, as we saw above), we can use the sample code below to start the session with 4 executors. This lets a session logically acquire more executors than the pool default.


mubhashk_3-1612465255185.png
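
As a rough sketch of the same idea (the exact cell contents in the screenshot may differ), Synapse notebooks accept a %%configure magic in the first cell to set session-level resources; the -f flag forces the session to restart with the new settings, and the values below are only examples:

%%configure -f
{
    "numExecutors": 4,
    "executorCores": 4,
    "executorMemory": "8g"
}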


Execute the code below to confirm that the number of executors is the same as defined for the session, which is 4:


mubhashk_4-1612465272088.png
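
As a simple alternative (assuming the session was started with the configuration above), you can read the value back directly:

# Read back the executor count configured for this session
print(spark.conf.get("spark.executor.instances"))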


You can also see these executors in the Spark UI if you want to cross-verify:


mubhashk_5-1612465293726.png


A list of the available session configurations is summarized here.


 


We can also set up the desired session-level configuration in an Apache Spark job definition:


 


For Apache Spark Job:


 


If we want to add these configurations to our job, we have to set them when we initialize the Spark session or Spark context; for example, for a PySpark job:


 


Spark Session:


 


from pyspark.sql import SparkSession

if __name__ == "__main__":

    # create a Spark session with the necessary configuration
    spark = SparkSession \
        .builder \
        .appName("testApp") \
        .config("spark.executor.instances", "4") \
        .config("spark.executor.cores", "4") \
        .getOrCreate()


 


Spark Context:


 


from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

    # create a Spark context with the necessary configuration
    conf = (
        SparkConf()
        .setAppName("testApp")
        .set("spark.hadoop.validateOutputSpecs", "false")
        .set("spark.executor.cores", "4")
        .set("spark.executor.instances", "4")
    )
    sc = SparkContext(conf=conf)


 


 


Hopefully this helps you configure the number of executors for a job or notebook as needed.


 


 
