
PySpark Apply udf to Multiple Columns


How do you apply a PySpark UDF to multiple or all columns of a DataFrame?

Let's create a PySpark DataFrame and apply a UDF to multiple columns.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
                    .getOrCreate()

# Prepare data
data = [('James','','Smith','1991-04-01'),
  ('Michael','Rose','','2000-05-19'),
  ('Robert','','Williams','1978-09-05'),
  ('Maria','Anne','Jones','1967-12-01'),
  ('Jen','Mary','Brown','1980-02-17')
]

columns = ["firstname","middlename","lastname","dob"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show(truncate=False)

Yields below output.

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)

+---------+----------+--------+----------+
|firstname|middlename|lastname|dob       |
+---------+----------+--------+----------+
|James    |          |Smith   |1991-04-01|
|Michael  |Rose      |        |2000-05-19|
|Robert   |          |Williams|1978-09-05|
|Maria    |Anne      |Jones   |1967-12-01|
|Jen      |Mary      |Brown   |1980-02-17|
+---------+----------+--------+----------+

PySpark UDF on Multiple Columns

The example below passes multiple columns (three, in this case) to the UDF.


# Imports
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Function that concatenates three string values
# (note: this raises an error if any of the values is None/null)
def concat(x, y, z):
    return x + ' ' + y + ' ' + z

# Register the function as a UDF that returns a string
concat_cols = udf(concat, StringType())

# Apply the UDF to three columns
df.withColumn("Full_Name", concat_cols(df.firstname, df.middlename, df.lastname)) \
  .show()

Yields below output.

+---------+----------+--------+----------+----------------+
|firstname|middlename|lastname|       dob|       Full_Name|
+---------+----------+--------+----------+----------------+
|    James|          |   Smith|1991-04-01|    James  Smith|
|  Michael|      Rose|        |2000-05-19|   Michael Rose |
|   Robert|          |Williams|1978-09-05|Robert  Williams|
|    Maria|      Anne|   Jones|1967-12-01|Maria Anne Jones|
|      Jen|      Mary|   Brown|1980-02-17|  Jen Mary Brown|
+---------+----------+--------+----------+----------------+
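
For simple string concatenation like this, the built-in concat_ws() function does the same job without the serialization overhead a Python UDF carries:


# Built-in alternative to the UDF above
from pyspark.sql.functions import concat_ws

df.withColumn("Full_Name", concat_ws(" ", df.firstname, df.middlename, df.lastname)) \
  .show()

To apply a UDF to all columns of the DataFrame, loop over df.columns and rewrite each column in turn. Below is a minimal sketch; to_upper is a single-column UDF defined here just for illustration, and it assumes every column holds strings (as in our example DataFrame):


# Single-column UDF that upper-cases a string, passing nulls through
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

# Apply the UDF to every column of the DataFrame
df_upper = df
for col_name in df.columns:
    df_upper = df_upper.withColumn(col_name, to_upper(df_upper[col_name]))
df_upper.show(truncate=False)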

PySpark Pandas apply()

We can also leverage Pandas DataFrame.apply() through the Pandas API on Spark (the pyspark.pandas module). Below is a simple example to give you an idea.


# Imports
import pyspark.pandas as ps
import numpy as np

# Prepare data
technologies = {
    'Fee': [20000,25000,30000,22000,np.nan],
    'Discount': [1000,2500,1500,1200,3000]
}

# Create a pandas-on-Spark DataFrame
psdf = ps.DataFrame(technologies)
print(psdf)

# Row-wise function that sums the Fee and Discount columns
def add(data):
    return data['Fee'] + data['Discount']

# axis=1 applies the function to each row
addDF = psdf.apply(add, axis=1)
print(addDF)
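
For simple arithmetic like this, a vectorized column expression is usually preferable to apply(), since it avoids calling a Python function on every row. A minimal sketch using the same psdf:


# Vectorized alternative: add the columns directly and store the result
psdf['Total'] = psdf['Fee'] + psdf['Discount']
print(psdf)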