PySpark Apply udf to Multiple Columns

  • Post category:PySpark
  • Post last modified:December 16, 2022

How do you apply a PySpark UDF to multiple columns, or to all columns, of a DataFrame?

Let’s create a PySpark DataFrame and apply the UDF on multiple columns.

# Import
from pyspark.sql import SparkSession

# Create SparkSession
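spark = SparkSession.builder.appName('') \
    .getOrCreate()

# Prepare data (a single sample row)
data = [('James','','Smith','1991-04-01')]

# Column names; 'dob' is an assumed name for the date column
columns = ['firstname','middlename','lastname','dob']
df = spark.createDataFrame(data, columns)
df.show()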


Yields the output below.

[Image: pyspark apply udf multiple columns]

PySpark UDF on Multiple Columns

The example below passes three columns (firstname, middlename, and lastname) to a single UDF.

# imports
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# udf function
def concat(x, y, z):
    return x + ' ' + y + ' ' + z

concat_cols = udf(concat, StringType())

# using udf
df.withColumn("Full_Name", concat_cols(df.firstname, df.middlename, df.lastname)) \
    .show()

Yields the output below.

[Image: pyspark udf multiple columns]
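
The same idea extends to any number of columns, including all of them. The following is a minimal sketch rather than part of the original example: join_non_empty and with_joined_column are hypothetical helper names, and it assumes the selected columns hold strings. Because the Python function accepts a variable number of arguments, one UDF can receive three columns or every column in df.columns, and None values are skipped instead of causing an error.

```python
def join_non_empty(*parts):
    """Join the non-empty parts with single spaces; None and '' are skipped."""
    return ' '.join(p for p in parts if p)


def with_joined_column(df, new_name, column_names=None):
    """Add new_name, built by passing the chosen columns (default: all) to one UDF."""
    # Imported here so join_non_empty() can be unit-tested without Spark installed
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    names = column_names if column_names is not None else df.columns
    join_udf = udf(join_non_empty, StringType())
    # Unpack one Column object per name into the UDF call
    return df.withColumn(new_name, join_udf(*[col(n) for n in names]))
```

For example, with_joined_column(df, 'Full_Name', ['firstname', 'middlename', 'lastname']).show() would add the full name, and omitting the column list would join every column. Empty parts are dropped, so the result contains single spaces where the concat example above would leave a double space.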

PySpark Pandas apply()

We can also use the pandas DataFrame.apply() method through the pandas API on Spark (pyspark.pandas). Below is a simple example that adds the first two values of each row.

# Imports
import pyspark.pandas as ps
import numpy as np

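technologies = {
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    # Assumed second column for illustration; add() below needs at least two columns
    'Discount': [1000, 2500, 1500, 1200, 3000],
}

# Create a DataFrame
psdf = ps.DataFrame(technologies)

# Add the first two values of each row
def add(data):
    return data.iloc[0] + data.iloc[1]

addDF = psdf.apply(add, axis=1)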
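
Because the Fee column contains a missing value, the add function above returns NaN for that row. The following is a minimal sketch of a row function that treats missing values as zero. The name add_skip_nan is hypothetical, and the function only assumes that a row can be iterated value by value, so it can be checked with a plain list.

```python
import math


def add_skip_nan(row):
    """Sum the values of one row, treating None and NaN as 0."""
    total = 0.0
    for value in row:
        if value is None:
            continue
        number = float(value)
        if math.isnan(number):
            continue
        total += number
    return total


# Quick check with plain lists standing in for DataFrame rows
print(add_skip_nan([20000, 1000]))
print(add_skip_nan([float('nan'), 3000]))
```

With the DataFrame above, it could be passed in place of add, for example psdf.apply(add_skip_nan, axis=1). Whether zero is the right substitute for a missing value depends on the data.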
