• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:10 mins read
You are currently viewing PySpark Add a New Column to DataFrame

In this PySpark article, I will explain different ways to add a new column to DataFrame using withColumn(), select(), sql(), Few ways include adding a constant column with a default value, derive based out of another column, add a column with NULL/None value, adding multiple columns e.t.c

1. Add New Column to DataFrame Examples

To see all these with examples first, let’s create a PySpark DataFrame.


# Create sample data
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
  ('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+

2. Add New Column with Constant Value

In PySpark, to add a new column to DataFrame use lit() function by importing from pyspark.sql.functions. lit() function takes a constant value you wanted to add and returns a Column type. In case you want to add a NULL/None use lit(None). From the below example first adds a literal constant value 0.3 to a DataFrame and the second adds a None.


# Add new constant column
from pyspark.sql.functions import lit
df.withColumn("bonus_percent", lit(0.3)) \
  .show()
# Add New column with NULL
df.withColumn("DEFAULT_COL", lit(None)) \
  .show()

# Output
+---------+--------+------+------+-------------+
|firstname|lastname|gender|salary|bonus_percent|
+---------+--------+------+------+-------------+
|    James|   Smith|     M|  3000|          0.3|
|     Anna|    Rose|     F|  4100|          0.3|
|   Robert|Williams|     M|  6200|          0.3|
+---------+--------+------+------+-------------+

3. Add Column Based on Another Column of DataFrame

You can also add a column to DataFrame based on another existing column value, this is the most used way.


# Add column from existing column
df.withColumn("bonus_amount", df.salary*0.3) \
  .show()

# Output
+---------+--------+------+------+------------+
|firstname|lastname|gender|salary|bonus_amount|
+---------+--------+------+------+------------+
|    James|   Smith|     M|  3000|       900.0|
|     Anna|    Rose|     F|  4100|      1230.0|
|   Robert|Williams|     M|  6200|      1860.0|
+---------+--------+------+------+------------+

# Add column by concatinating existing columns
from pyspark.sql.functions import concat_ws
df.withColumn("name", concat_ws(",","firstname","lastname")) \
  .show()

# Output
+---------+--------+------+------+---------------+
|firstname|lastname|gender|salary|           name|
+---------+--------+------+------+---------------+
|    James|   Smith|     M|  3000|    James,Smith|
|     Anna|    Rose|     F|  4100|      Anna,Rose|
|   Robert|Williams|     M|  6200|Robert,Williams|
+---------+--------+------+------+---------------+

4. Add Column Value Based on Condition

Sometimes you may need to add a constant/literal value based on the condition, to do so you can use when otherwise and lit() together.


# Add Column using when otherwise condition
from pyspark.sql.functions import when
df.withColumn("grade", \
   when((df.salary < 4000), lit("A")) \
     .when((df.salary >= 4000) & (df.salary <= 5000), lit("B")) \
     .otherwise(lit("C")) \
  ).show()

# Output
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|grade|
+---------+--------+------+------+-----+
|    James|   Smith|     M|  3000|    A|
|     Anna|    Rose|     F|  4100|    B|
|   Robert|Williams|     M|  6200|    C|
+---------+--------+------+------+-----+

5. Add Column When not Exists on DataFrame

In order to add a column when not exists, you should check if desired column name exists in PySpark DataFrame, you can get the DataFrame columns using df.columns, now add a column conditionally when not exists in df.columns.


# Add column Using if condition
if 'dummy' not in df.columns:
   df.withColumn("dummy",lit(None))

6. Add Multiple Columns using Map

You can add multiple columns to PySpark DataFrame in several ways if you wanted to add a known set of columns you can easily do it by chaining withColumn() or using select(). However, sometimes you may need to add multiple columns after applying some transformations, In that case, you can use either map() or foldLeft(). Let’s see an example with a map.

I don’t have a real-time scenario to add multiple columns, below is just a skeleton on how to use. I will update this once I have a Scala example.


# Let's assume DF has just 3 columns c1,c2,c3
df2 = df.rdd.map(row=>{
# apply transformation on these columns and derive multiple columns and store these column vlaues into c5,c6,c7,c8,c9,10
(c1,c2,c5,c6,c7,c8,c9,c10)
})

If you closely look at the above snippet, DataFrame has 3 columns and we are deriving multiple columns dynamically from existing columns by applying transformations, this can be a split() function or any custom UDF.

7. Add Column to DataFrame using select()

So far most of the examples covered above use withColumn() to add a column, you can also achieve all the above examples using select() transformation.


# Add column using select
df.select("firstname","salary", lit(0.3).alias("bonus")).show()
df.select("firstname","salary", lit(df.salary * 0.3).alias("bonus_amount")).show()
df.select("firstname","salary", current_date().alias("today_date")).show()

8. Add Column to DataFrame using SQL Expression

Below are similar example using PySpark SQL expression


# Add columns to DataFrame using SQL
df.createOrReplaceTempView("PER")
df2=spark.sql("select firstname,salary, '0.3' as bonus from PER")
df3=spark.sql("select firstname,salary, salary * 0.3 as bonus_amount from PER")
df4=dfspark.sql("select firstname,salary, current_date() as today_date from PER")
df5=spark.sql("select firstname,salary, " +
          "case salary when salary < 4000 then 'A' "+
          "else 'B' END as grade from PER")

Conclusion

In this article, you have learned multiple ways to add a new column to PySpark DataFrame that includes adding a constant column, based on the existing column, when a column not exists, add multiple columns with Python examples.

Happy Learning !!

Reference

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium