How to apply a PySpark UDF to multiple or all columns of a DataFrame?
Let’s create a PySpark DataFrame and apply a UDF to multiple columns.
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
    .getOrCreate()
# Prepare data
data = [('James','','Smith','1991-04-01'),
    ('Michael','Rose','','2000-05-19'),
    ('Robert','','Williams','1978-09-05'),
    ('Maria','Anne','Jones','1967-12-01'),
    ('Jen','Mary','Brown','1980-02-17')
]
columns = ["firstname","middlename","lastname","dob"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show(truncate=False)
Yields below output.
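# Output
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)

+---------+----------+--------+----------+
|firstname|middlename|lastname|dob       |
+---------+----------+--------+----------+
|James    |          |Smith   |1991-04-01|
|Michael  |Rose      |        |2000-05-19|
|Robert   |          |Williams|1978-09-05|
|Maria    |Anne      |Jones   |1967-12-01|
|Jen      |Mary      |Brown   |1980-02-17|
+---------+----------+--------+----------+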
PySpark UDF on Multiple Columns
The below example passes multiple columns (three, in this case) to the UDF.
# imports
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# UDF that concatenates three string columns with spaces
def concat(x, y, z):
    return x + ' ' + y + ' ' + z

concat_cols = udf(concat, StringType())
# Apply the UDF to three columns to derive a new column
df.withColumn("Full_Name", concat_cols(df.firstname, df.middlename, df.lastname)) \
    .show()
Yields below output.
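# Output
+---------+----------+--------+----------+----------------+
|firstname|middlename|lastname|       dob|       Full_Name|
+---------+----------+--------+----------+----------------+
|    James|          |   Smith|1991-04-01|    James  Smith|
|  Michael|      Rose|        |2000-05-19|   Michael Rose |
|   Robert|          |Williams|1978-09-05|Robert  Williams|
|    Maria|      Anne|   Jones|1967-12-01|Maria Anne Jones|
|      Jen|      Mary|   Brown|1980-02-17|  Jen Mary Brown|
+---------+----------+--------+----------+----------------+
To apply a UDF to all columns of the DataFrame, there is no single built-in call; a common pattern is a select() with a list comprehension that wraps every column in the UDF. Below is a minimal sketch, assuming every column is a string; upper_udf is a hypothetical UDF introduced here just for illustration.
# A minimal sketch: apply a single-argument UDF to every column.
# Assumes all columns are strings; upper_udf is a hypothetical example UDF.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())

# Wrap each column in the UDF and keep the original column names
df.select([upper_udf(col(c)).alias(c) for c in df.columns]).show()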
5. PySpark Pandas apply()
We can leverage Pandas DataFrame.apply() through the pandas API on Spark (the pyspark.pandas module). Below is a simple example to give you an idea.
# Imports
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
# Create a DataFrame
psdf = ps.DataFrame(technologies)
print(psdf)
# Sum the Fee and Discount columns of each row
def add(data):
    return data['Fee'] + data['Discount']

addDF = psdf.apply(add, axis=1)
print(addDF)
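Assuming the rows come back in insertion order, this prints roughly the following; note how the NaN Fee propagates into the sum.
# Output
       Fee  Discount
0  20000.0      1000
1  25000.0      2500
2  30000.0      1500
3  22000.0      1200
4      NaN      3000
0    21000.0
1    27500.0
2    31500.0
3    23200.0
4        NaN
dtype: float64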
Related Articles
- PySpark apply Function to Column
- PySpark Add a New Column to DataFrame
- PySpark selectExpr()
- PySpark transform() Function with Example
- PySpark foreach() Usage with Examples
- PySpark UDF (User Defined Function)
- PySpark Where Filter Function | Multiple Conditions