In this PySpark article, I will explain different ways to add a new column to a DataFrame using withColumn(), select(), and sql(). These include adding a constant column with a default value, deriving a column from another column, adding a column with a NULL/None value, adding multiple columns, etc.
1. Add New Column to DataFrame Examples
- Add New Column with Default Constant Value or None/Null
- Add Column Based on Another Column
- Add Column Based on Condition
- Add Column When not Exists
- Add Multiple Columns using map() Transformation
- Add Column to DataFrame using select()
- Add Column to DataFrame using SQL Expression
To see all these with examples, let's first create a PySpark DataFrame.
# Create sample data
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
        ('Robert','Williams','M',6200)]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
# Output
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
| James| Smith| M| 3000|
| Anna| Rose| F| 4100|
| Robert|Williams| M| 6200|
+---------+--------+------+------+
2. Add New Column with Constant Value
In PySpark, to add a new column to a DataFrame, use the lit() function imported from pyspark.sql.functions. lit() takes a constant value you want to add and returns a Column type. In case you want to add a NULL/None value, use lit(None). In the example below, the first statement adds the literal constant value 0.3 to the DataFrame and the second adds a None.
# Add new constant column
from pyspark.sql.functions import lit
df.withColumn("bonus_percent", lit(0.3)) \
.show()
# Add New column with NULL
df.withColumn("DEFAULT_COL", lit(None)) \
.show()
# Output (first statement)
+---------+--------+------+------+-------------+
|firstname|lastname|gender|salary|bonus_percent|
+---------+--------+------+------+-------------+
| James| Smith| M| 3000| 0.3|
| Anna| Rose| F| 4100| 0.3|
| Robert|Williams| M| 6200| 0.3|
+---------+--------+------+------+-------------+
3. Add Column Based on Another Column of DataFrame
You can also add a column to a DataFrame based on the value of another existing column; this is the most common approach.
# Add column from existing column
df.withColumn("bonus_amount", df.salary*0.3) \
.show()
# Output
+---------+--------+------+------+------------+
|firstname|lastname|gender|salary|bonus_amount|
+---------+--------+------+------+------------+
| James| Smith| M| 3000| 900.0|
| Anna| Rose| F| 4100| 1230.0|
| Robert|Williams| M| 6200| 1860.0|
+---------+--------+------+------+------------+
# Add column by concatenating existing columns
from pyspark.sql.functions import concat_ws
df.withColumn("name", concat_ws(",","firstname","lastname")) \
.show()
# Output
+---------+--------+------+------+---------------+
|firstname|lastname|gender|salary| name|
+---------+--------+------+------+---------------+
| James| Smith| M| 3000| James,Smith|
| Anna| Rose| F| 4100| Anna,Rose|
| Robert|Williams| M| 6200|Robert,Williams|
+---------+--------+------+------+---------------+
4. Add Column Value Based on Condition
Sometimes you may need to add a constant/literal value based on a condition. To do so, use when(), otherwise(), and lit() together.
# Add column using when otherwise condition
from pyspark.sql.functions import when, lit

df.withColumn("grade",
    when(df.salary < 4000, lit("A"))
    .when((df.salary >= 4000) & (df.salary <= 5000), lit("B"))
    .otherwise(lit("C"))
).show()
# Output
+---------+--------+------+------+-----+
|firstname|lastname|gender|salary|grade|
+---------+--------+------+------+-----+
| James| Smith| M| 3000| A|
| Anna| Rose| F| 4100| B|
| Robert|Williams| M| 6200| C|
+---------+--------+------+------+-----+
5. Add Column When not Exists on DataFrame
In order to add a column only when it does not exist, first check whether the desired column name is present in the DataFrame. You can get the DataFrame columns using df.columns, then add the column conditionally when it is not in df.columns.
# Add column using an if condition
from pyspark.sql.functions import lit

if 'dummy' not in df.columns:
    df = df.withColumn("dummy", lit(None))
6. Add Multiple Columns using Map
You can add multiple columns to a PySpark DataFrame in several ways. If you want to add a known set of columns, you can easily do it by chaining withColumn() or by using select(). However, sometimes you may need to add multiple columns after applying transformations; in that case, you can use the RDD map() transformation (foldLeft() is available only in the Scala API). Let's see an example with map().
The snippet below is just a skeleton showing the approach; replace the placeholder derivation with your own transformations.
# Let's assume df has just 3 columns: c1, c2, c3.
# Apply transformations on these columns, derive additional
# column values (here c4 as a placeholder), and return a new
# tuple per row, then convert the RDD back to a DataFrame.
df2 = df.rdd.map(lambda row: (row.c1, row.c2, row.c3, row.c1 + row.c2)) \
            .toDF(["c1", "c2", "c3", "c4"])
If you look closely at the above snippet, the DataFrame has 3 columns and we derive multiple columns dynamically from the existing columns by applying transformations; the derivation can be a split() function or any custom UDF.
7. Add Column to DataFrame using select()
So far, most of the examples above use withColumn() to add a column; you can also achieve the same results using the select() transformation.
# Add column using select
from pyspark.sql.functions import lit, current_date

df.select("firstname", "salary", lit(0.3).alias("bonus")).show()
df.select("firstname", "salary", (df.salary * 0.3).alias("bonus_amount")).show()
df.select("firstname", "salary", current_date().alias("today_date")).show()
8. Add Column to DataFrame using SQL Expression
Below are similar examples using PySpark SQL expressions.
# Add columns to DataFrame using SQL
df.createOrReplaceTempView("PER")
df2 = spark.sql("select firstname, salary, 0.3 as bonus from PER")
df3 = spark.sql("select firstname, salary, salary * 0.3 as bonus_amount from PER")
df4 = spark.sql("select firstname, salary, current_date() as today_date from PER")
df5 = spark.sql("select firstname, salary, " +
                "case when salary < 4000 then 'A' " +
                "else 'B' end as grade from PER")
Conclusion
In this article, you have learned multiple ways to add a new column to a PySpark DataFrame, including adding a constant column, deriving a column from an existing one, adding a column only when it does not exist, and adding multiple columns, all with Python examples.
Happy Learning !!
Related Articles
- PySpark lit() – Add Literal or Constant to DataFrame
- PySpark Replace Column Values in DataFrame
- PySpark Retrieve DataType & Column Names of DataFrame
- PySpark Replace Empty Value With None/null on DataFrame
- PySpark Find Maximum Row per Group in DataFrame
- PySpark Select First Row of Each Group?
- PySpark date_format() – Convert Date to String format