The PySpark sql.DataFrame.selectExpr() is a transformation that is used to execute a SQL expression and returns a new updated DataFrame. This is similar to select() transformation with an ability to run SQL like expressions.
1. PySpark selectExpr() Syntax & Usage
PySpark selectExpr()
is a function of DataFrame that is similar to select(), the difference is it takes a set of SQL expressions in a string to execute. This gives the ability to run SQL like expressions without creating a temporary table and views.
selectExpr()
just has one signature that takes SQL expression in a String and returns a new DataFrame. Note like select() it doesn’t have a signature to take the Column type.
2. Syntax of selectExpr()
The following is the syntax of selectExpr() function.
# Syntax of selectExpr()
DataFrame.selectExpr(*expr)
3. PySpark selectExpr() Example
When you use selectExpr() you need to provide the complete expression in a String. In the below example, I am adding a month from another column to the date column. So here, I have used the add_months(), tod_date() and cast() functions without importing any SQL functions.
# using selectExpr()
# Increment month of the date
df.selectExpr("date","increment", \
"add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int)) as inc_date") \
.show()
4. Using select() with expr()
Alternatively, you can also write the same statement using expr()
SQL function on select. But the same cannot achieve without expr() function. To use expr(), you need to import it from pyspark.sql.functions
.
# using select() and expr()
from pyspark.sql.functions import expr, col
df.select(col("date"),col("increment"), \
expr("add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int))").alias("inc_date")) \
.show()
5. Complete Example
Following is the complete example of PySpark selectExpr() function.
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
.getOrCreate()
data = [("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)]
df = spark.createDataFrame(data,schema = ["date","increment"])
df.show()
# Increment month of the date
df.selectExpr("date","increment", \
"add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int)) as inc_date") \
.show()
from pyspark.sql.functions import expr, col
df.select(col("date"),col("increment"), \
expr("add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int))").alias("inc_date")) \
.show()
6. Conclusion
In this article, you have learned the PySpark selectExpr() function syntax and usage with an example. The selectExpr() function is used to run SQL like expression on the DataFrame.
Happy Learning !!
Related Articles
- PySpark map() Transformation
- PySpark mapPartitions()
- Pyspark Select Distinct Rows
- PySpark Select Top N Rows From Each Group
- PySpark Select First Row of Each Group?
- PySpark Pandas UDF Example
- PySpark Apply Function to Column
- PySpark flatMap() Transformation
- PySpark RDD Transformations with examples
- PySpark between() range of values
- PySpark max() – Different Methods Explained
- PySpark sum() Columns Example
- PySpark union two DataFrames
- PySpark Broadcast Variable
- PySpark Broadcast Join
- PySpark persist() Example
- PySpark Apply udf to Multiple Columns