
How do you apply a PySpark UDF to multiple or all columns of a DataFrame?


Let’s create a PySpark DataFrame and apply a UDF to multiple of its columns.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
                    .getOrCreate()

# Prepare data
data = [('James','','Smith','1991-04-01'),
  ('Michael','Rose','','2000-05-19'),
  ('Robert','','Williams','1978-09-05'),
  ('Maria','Anne','Jones','1967-12-01'),
  ('Jen','Mary','Brown','1980-02-17')
]

columns = ["firstname","middlename","lastname","dob"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show(truncate=False)

Yields the DataFrame schema and its contents as output.

PySpark UDF on Multiple Columns

The below example passes multiple columns (three in this case) to the UDF function.


# imports
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# udf function that concatenates three values
def concat(x, y, z):
    return x + ' ' + y + ' ' + z

# Register the Python function as a UDF returning a string
concat_cols = udf(concat, StringType())

# Using the udf on multiple columns
df.withColumn("Full_Name", concat_cols(df.firstname, df.middlename, df.lastname)) \
  .show()

Yields a new DataFrame with a Full_Name column containing the concatenated first, middle, and last names.
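If you need to apply a UDF to all columns rather than a specific few, one common approach is to rebuild the select list with a list comprehension over df.columns. Below is a minimal sketch of this idea; to_upper is a hypothetical single-column UDF used only for illustration and is not part of the example above.


# Apply a UDF to every column by rebuilding the select list (sketch)
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Hypothetical UDF: upper-case a string value, passing None through unchanged
to_upper = udf(lambda v: v.upper() if v is not None else None, StringType())

# Apply the UDF to all columns of df, keeping the original column names
df.select([to_upper(col(c)).alias(c) for c in df.columns]).show(truncate=False)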

PySpark Pandas apply()

We can also leverage Pandas DataFrame.apply() by running the Pandas API on Spark. Below is a simple example to give you an idea.


# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Fee' :[20000,25000,30000,22000,np.nan],
    'Discount':[1000,2500,1500,1200,3000]
}
# Create a pandas-on-Spark DataFrame
psdf = ps.DataFrame(technologies)
print(psdf)

# Row-wise function that adds the Fee and Discount columns
def add(data):
    return data['Fee'] + data['Discount']

addDF = psdf.apply(add, axis=1)
print(addDF)
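Another way to combine pandas with UDFs over multiple columns is a vectorized pandas_udf, which receives whole pandas Series instead of one row at a time. Below is a minimal sketch, assuming the names DataFrame (df) created earlier in this post; concat_pd is an illustrative name, not part of the examples above.


# A vectorized pandas UDF that concatenates three columns (sketch)
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def concat_pd(first: pd.Series, middle: pd.Series, last: pd.Series) -> pd.Series:
    return first + ' ' + middle + ' ' + last

# Apply the pandas UDF to multiple columns
df.withColumn("Full_Name", concat_pd(df.firstname, df.middlename, df.lastname)).show()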
