PySpark lit() – Add Literal or Constant to DataFrame

Spread the love

PySpark SQL functions lit() and typedLit() are used to add a new column to DataFrame by assigning a literal or constant value. Both these functions return Column type as return type.

Both of these are available in PySpark by importing pyspark.sql.functions

First, let’s create a DataFrame.


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("111",50000),("222",60000),("333",40000)]
columns= ["EmpId","Salary"]
df = spark.createDataFrame(data = data, schema = columns)

lit() Function to Add Constant Column

PySpark lit() function is used to add constant or literal value as a new column to the DataFrame.

Creates a [[Column]] of literal value. The passed in object is returned directly if it is already a [[Column]]. If the object is a Scala Symbol, it is converted into a [[Column]] also. Otherwise, a new [[Column]] is created to represent the literal value

Let’s take a look at some examples.

Example 1: Simple usage of lit() function

Let’s see an example of how to create a new column with constant value using lit() Spark SQL function. On the below snippet, we are creating a new column by adding a literal ‘1’ to PySpark DataFrame.


from pyspark.sql.functions import col,lit
df2 = df.select(col("EmpId"),col("Salary"),lit("1").alias("lit_value1"))
df2.show(truncate=False)

+-----+------+----------+
|EmpId|Salary|lit_value1|
+-----+------+----------+
|  111| 50000|         1|
|  222| 60000|         1|
|  333| 40000|         1|
+-----+------+----------+

Adding the same constant literal to all records in DataFrame may not be real-time useful so let’s see another example.

Example 2 : lit() function with withColumn

The following example shows how to use pyspark lit() function using withColumn to derive a new column based on some conditions.


from pyspark.sql.functions import when, lit, col
df3 = df2.withColumn("lit_value2", when(col("Salary") >=40000 & col("Salary") <= 50000,lit("100")).otherwise(lit("200")))
df3.show(truncate=False)

Below is the output for the above code snippet.


+-----+------+----------+----------+
|EmpId|Salary|lit_value1|lit_value2|
+-----+------+----------+----------+
|  111| 50000|         1|       100|
|  222| 60000|         1|       200|
|  333| 40000|         1|       100|
+-----+------+----------+----------+

typedLit() Function – Syntax

Difference between lit() and typedLit() is that, typedLit function can handle collection types e.g.: Array, Dictionary(map) e.t.c. Unfortunately, I could not find this function in PySpark, when I find it, I will add an example.

Complete Example of How to Add Constant Column


import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("111",50000),("222",60000),("333",40000)]
columns= ["EmpId","Salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import col,lit
df2 = df.select(col("EmpId"),col("Salary"),lit("1").alias("lit_value1"))
df2.show(truncate=False)
from pyspark.sql.functions import when
df3 = df2.withColumn("lit_value2", when(col("Salary") >=40000 & col("Salary") <= 50000,lit("100")).otherwise(lit("200")))
df3.show(truncate=False)

Conclusion:

You have learned multiple ways to add a constant literal value to DataFrame using PySpark lit() function and have learned the difference between lit and typedLit functions.

When possible try to use predefined PySpark functions as they are a little bit more compile-time safety and perform better when compared to user-defined functions. If your application is critical on performance try to avoid using custom UDF functions as these are not guarantee on performance.

Happy Learning !!

Naveen (NNK)

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing PySpark lit() – Add Literal or Constant to DataFrame