• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:6 mins read
You are currently viewing PySpark selectExpr()

The PySpark sql.DataFrame.selectExpr() is a transformation that is used to execute a SQL expression and returns a new updated DataFrame. This is similar to select() transformation with an ability to run SQL like expressions.

1. PySpark selectExpr() Syntax & Usage

PySpark selectExpr() is a function of DataFrame that is similar to select(), the difference is it takes a set of SQL expressions in a string to execute. This gives the ability to run SQL like expressions without creating a temporary table and views.

selectExpr() just has one signature that takes SQL expression in a String and returns a new DataFrame. Note like select() it doesn’t have a signature to take the Column type.

2. Syntax of selectExpr()

The following is the syntax of selectExpr() function.


# Syntax of selectExpr()
DataFrame.selectExpr(*expr)

3. PySpark selectExpr() Example

When you use selectExpr() you need to provide the complete expression in a String. In the below example, I am adding a month from another column to the date column. So here, I have used the add_months(), tod_date() and cast() functions without importing any SQL functions.


# using selectExpr()
# Increment month of the date
df.selectExpr("date","increment", \
              "add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int)) as inc_date") \
    .show()

4. Using select() with expr()

Alternatively, you can also write the same statement using expr() SQL function on select. But the same cannot achieve without expr() function. To use expr(), you need to import it from pyspark.sql.functions.


# using select() and expr() 
from pyspark.sql.functions import expr, col
df.select(col("date"),col("increment"), \
      expr("add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int))").alias("inc_date")) \
    .show()

5. Complete Example

Following is the complete example of PySpark selectExpr() function.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
                    .getOrCreate()
                    
data = [("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)]
df = spark.createDataFrame(data,schema = ["date","increment"])
df.show()

# Increment month of the date
df.selectExpr("date","increment", \
              "add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int)) as inc_date") \
    .show()

from pyspark.sql.functions import expr, col
df.select(col("date"),col("increment"), \
      expr("add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int))").alias("inc_date")) \
    .show()

6. Conclusion

In this article, you have learned the PySpark selectExpr() function syntax and usage with an example. The selectExpr() function is used to run SQL like expression on the DataFrame.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium