How to Submit a Spark Job via REST API?


In my last article, I explained how to submit a Spark job using the spark-submit command. Alternatively, we can use the Spark standalone master REST API (RESTful) to submit a Scala or Python (PySpark) job or application.

In this article, I will explain how to submit Scala and PySpark (Python) jobs using the REST API, how to get the status of the application, and finally how to kill the application, with examples.

1. Spark Standalone mode REST API

Spark standalone mode provides a REST API to run a Spark job. Below I explain how to use some of these REST APIs from the curl command, but in practice you can integrate these calls into your web UI application or any RESTful client.

1.1 Enable REST API

By default the REST API service is disabled; you can enable it by adding the below configuration to the spark-defaults.conf file.


# Enable REST API
spark.master.rest.enabled true

After you add the property, make sure you restart the master (and workers) for the change to take effect. The REST submission server listens on port 6066 by default (configurable with spark.master.rest.port).


# Start the standalone master
./sbin/start-master.sh
# Start a worker and register it with the master
./sbin/start-slave.sh spark://192.168.1.1:7077

And make sure the standalone cluster is up and running by accessing the below URL. Replace the IP address and port according to your setup.


http://192.168.1.1:8080
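
If you prefer to check this programmatically, the standalone master web UI also exposes a JSON summary of the cluster state. Here is a minimal Python sketch, assuming the requests library is installed and the master UI runs on 192.168.1.1:8080 as above; adjust the host and port to your environment.


import requests

# Master web UI host/port from the setup above (adjust to your environment)
MASTER_UI = "http://192.168.1.1:8080"

# The standalone master web UI serves a JSON view of the cluster at /json/
resp = requests.get(MASTER_UI + "/json/", timeout=5)
resp.raise_for_status()

cluster = resp.json()
print("Master status:", cluster.get("status"))            # expected: "ALIVE"
print("Registered workers:", len(cluster.get("workers", [])))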

If you do not enable this property, you will get an error similar to the following when you attempt to submit the application.


This Page Cannot Be Displayed
The system cannot communicate with the external server (spark-master-ip).
The Internet server may be busy, maybe permanently down, or maybe unreachable because of network problems.
Please check the spelling of the Internet address entered.
If it is correct, try this request later.

1.2 Spark Submit REST API Request

We use the REST API /v1/submissions/create to submit an application to the standalone cluster. With this request you need to provide, among other things, the class you want to run as mainClass, any command-line arguments in appArgs, and the location of the jar file in appResource.

As mentioned at the beginning, I demonstrate the REST API here using the curl command.


curl -X POST http://192.168.1.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "/home/hduser/sparkbatchapp.jar",
  "sparkProperties": {
    "spark.executor.memory": "8g",
    "spark.master": "spark://192.168.1.1:7077",
    "spark.driver.memory": "8g",
    "spark.driver.cores": "2",
    "spark.eventLog.enabled": "false",
    "spark.app.name": "Spark REST API - PI",
    "spark.submit.deployMode": "cluster",
    "spark.jars": "/home/user/spark-examples_versionxx.jar",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "2.4.0",
  "mainClass": "org.apache.spark.examples.SparkPi",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": [
    "80"
  ]
}'

This submits the job to the cluster and returns the following response, which contains the application ID in the submissionId field.


{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20200923223841-0001",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true
}
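
If you want to integrate the submission into an application instead of calling curl, the same request can be sent with any HTTP client. Below is a minimal Python sketch using the requests library; the paths, master URL, and Spark version are just the assumed values from the curl example above and should be adjusted to your setup.


import requests

# REST submission endpoint of the standalone master (port 6066 by default)
SUBMIT_URL = "http://192.168.1.1:6066/v1/submissions/create"

payload = {
    "action": "CreateSubmissionRequest",
    "appResource": "/home/hduser/sparkbatchapp.jar",       # jar containing the application
    "mainClass": "org.apache.spark.examples.SparkPi",
    "clientSparkVersion": "2.4.0",
    "appArgs": ["80"],
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.master": "spark://192.168.1.1:7077",
        "spark.app.name": "Spark REST API - PI",
        "spark.submit.deployMode": "cluster",
        "spark.driver.memory": "8g",
        "spark.driver.cores": "2",
        "spark.executor.memory": "8g",
    },
}

resp = requests.post(
    SUBMIT_URL,
    json=payload,
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
resp.raise_for_status()

submission = resp.json()
print("Submitted:", submission.get("submissionId"))   # e.g. driver-20200923223841-0001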

1.3 Submitting PySpark using REST API

The below example submits the PySpark example spark_pi.py located at /home/user/ with the command-line argument 80. Note that for a Python application, mainClass is set to org.apache.spark.deploy.SparkSubmit, and the script is passed both as the appResource and as the first entry of appArgs.


curl -X POST http://192.168.1.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "file:/home/user/spark_pi.py",
  "sparkProperties": {
    "spark.executor.memory": "8g",
    "spark.master": "spark://192.168.1.1:7077",
    "spark.driver.memory": "8g",
    "spark.driver.cores": "2",
    "spark.eventLog.enabled": "false",
    "spark.app.name": "Spark REST API - PI",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "2.4.0",
  "mainClass": "org.apache.spark.deploy.SparkSubmit",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "action": "CreateSubmissionRequest",
  "appArgs": [ "/home/user/spark_pi.py",  "80" ]
}'
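
The contents of spark_pi.py are not part of this article; for completeness, here is a minimal sketch of what such a script could look like, assuming it follows the classic Pi estimation example and takes the number of partitions as its only argument.


import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Number of partitions, passed via appArgs (80 in the request above)
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    spark = SparkSession.builder.appName("Spark REST API - PI").getOrCreate()

    def inside(_):
        # Sample a random point in the unit square and test if it falls inside the circle
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()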

1.4 Status of the Job from REST API

You can either use the Spark UI to monitor your job or submit the following REST API request to get the status of the application. Make sure you specify the driver application ID (the submissionId) you got from the previous request.


curl http://192.168.1.1:6066/v1/submissions/status/driver-20200923223841-0001

This results in the below response.


{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true,
  "workerHostPort" : "192.168.1.1:38451",
  "workerId" : "worker-20200923223841-192.168.1.2-34469"
}
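
In a script, you may want to poll this endpoint until the driver reaches a terminal state. Below is a minimal Python sketch, assuming the requests library and the submission ID from the earlier response.


import time
import requests

STATUS_URL = "http://192.168.1.1:6066/v1/submissions/status/{}"
submission_id = "driver-20200923223841-0001"   # from the create response

# Driver states that indicate the job is no longer running
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED", "ERROR"}

while True:
    status = requests.get(STATUS_URL.format(submission_id), timeout=5).json()
    state = status.get("driverState")
    print("Driver state:", state)
    if state in TERMINAL_STATES:
        break
    time.sleep(10)   # wait before polling again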

1.5 Kill the Job

Sometimes we may need to kill a running job; below is the REST API request to do so.


# Kill the Job
curl -X POST http://192.168.1.1:6066/v1/submissions/kill/driver-20200923223841-0001

This results in the below response.


{
  "action" : "KillSubmissionResponse",
  "message" : "Kill request for driver-20200923223841-0001 submitted",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true
}
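
The same call from Python, as a short sketch reusing the assumptions above (master host, port 6066, and the submission ID from the create response):


import requests

KILL_URL = "http://192.168.1.1:6066/v1/submissions/kill/{}"
submission_id = "driver-20200923223841-0001"   # from the create response

# The kill endpoint expects a POST with no request body
resp = requests.post(KILL_URL.format(submission_id), timeout=5).json()
print("Kill accepted:", resp.get("success"), "-", resp.get("message"))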

2. Using the REST API for YARN Resource Manager

Submitting an application to YARN using the REST API is a little tricky, and I will cover it in a future article once I am able to submit one successfully; meanwhile, please refer to the below links.

In case you do not succeed, try using Cloudera Livy. According to the Livy documentation, it supports the following (see the short sketch after this list).

  • Interactive Scala, Python, and R shells
  • Batch submissions in Scala, Java, Python
  • Multiple users can share the same server (impersonation support)
  • Can be used for submitting jobs from anywhere with REST
  • Does not require any code change to your programs
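
As an illustration only, here is a minimal Python sketch of submitting the same PySpark script as a Livy batch. It assumes a Livy server is running on its default port 8998 and uses Livy's POST /batches endpoint; adjust the host and file path to your environment.


import requests

# Livy server endpoint (8998 is Livy's default port)
LIVY_URL = "http://192.168.1.1:8998/batches"

payload = {
    "file": "file:/home/user/spark_pi.py",   # application to run
    "args": ["80"],                          # command-line arguments
    "name": "Spark REST API - PI via Livy",
}

resp = requests.post(LIVY_URL, json=payload, headers={"Content-Type": "application/json"})
resp.raise_for_status()

batch = resp.json()
print("Livy batch id:", batch.get("id"), "state:", batch.get("state"))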

Conclusion

In this article, you have learned how to submit a Spark application using the standalone mode REST API, how to get the status of the application, and how to kill it, and finally you got some pointers on using the YARN REST API and Livy.

Happy Learning !!

