PySpark Apply UDF to Multiple Columns

How to apply a PySpark udf to multiple or all columns of the DataFrame?

Let’s create a PySpark DataFrame and apply the UDF on multiple columns.

# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('') \
        .getOrCreate()

# Prepare data (rows after the first are illustrative,
# added to complete the truncated example)
data = [('James','','Smith','1991-04-01'),
        ('Michael','Rose','','2000-05-19'),
        ('Robert','','Williams','1978-09-05')]
columns = ["firstname","middlename","lastname","dob"]

# Create DataFrame
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)


Yields below output.


PySpark UDF on Multiple Columns

The below example passes multiple columns (three, in this case) to the UDF function.

# imports
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# udf function
def concat(x, y, z):
    return x + ' ' + y + ' ' + z

concat_cols = udf(concat, StringType())

# using udf
df.withColumn("Full_Name", concat_cols(df.firstname, df.middlename, df.lastname)) \
  .show(truncate=False)

Yields below output.


PySpark Pandas apply()

We can leverage Pandas DataFrame.apply() by running the Pandas API on Spark. Below is a simple example to give you an idea.

# Imports
import pyspark.pandas as ps
import numpy as np

# Create a DataFrame (the 'Discount' column is assumed here
# to complete the truncated example)
technologies = {
    'Fee' : [20000,25000,30000,22000,np.nan],
    'Discount' : [1000,2500,1500,1200,3000]
}
psdf = ps.DataFrame(technologies)

# Apply a function to each row (axis=1)
def add(data):
    return data[0] + data[1]

addDF = psdf.apply(add, axis=1)
print(addDF)
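Since the Pandas API on Spark mirrors pandas, the same row-wise pattern can be illustrated with plain pandas, and accessing columns by name is less fragile than positional indexing. A minimal sketch, assuming the same Fee column plus a hypothetical Discount column:

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the example above ('Discount' is assumed)
technologies = {
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Discount': [1000, 2500, 1500, 1200, 3000],
}
pdf = pd.DataFrame(technologies)

# Access columns by name instead of position; the identical call
# works on a pyspark.pandas DataFrame (ps.DataFrame(...).apply(add, axis=1))
def add(row):
    return row['Fee'] + row['Discount']

result = pdf.apply(add, axis=1)
print(result)
```

Note that any row containing NaN (the last Fee value here) yields NaN in the result, since NaN propagates through addition.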

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn
