PySpark DataFrame provides a drop() method to drop a single column or multiple columns from a DataFrame. In this article, I will explain the ways to drop columns using PySpark (Spark with Python), with examples.
Related: Drop duplicate rows from DataFrame
First, let’s create a PySpark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
("Michael","Rose","","40288","California",4300), \
("Robert","","Williams","42114","Florida",1400), \
("Maria","Anne","Jones","39192","Florida",5500), \
("Jen","Mary","Brown","34561","NewYork",3000) \
)
columns= ["firstname","middlename","lastname","id","location","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
This yields the output below.
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
1. PySpark DataFrame drop() syntax
PySpark drop() takes self and *cols as arguments, where *cols is one or more columns to remove, and returns a new DataFrame without them. The sections below explain this with examples.
drop(self, *cols)
2. Drop Column From DataFrame
First, let’s see how to drop a single column from a PySpark DataFrame. Three different ways are shown below. To use the second one, you need to import col from pyspark.sql.functions.
df.drop("firstname") \
.printSchema()
# Requires: from pyspark.sql.functions import col
df.drop(col("firstname")) \
.printSchema()
df.drop(df.firstname) \
.printSchema()
The above three examples drop the column “firstname” from the DataFrame. You can use whichever one suits your needs.
root
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
3. Drop Multiple Columns from DataFrame
This passes several column names as arguments to the drop() function, either listed directly or unpacked from a list or tuple with *. It removes all of the named columns from the DataFrame.
df.drop("firstname","middlename","lastname") \
.printSchema()
cols = ("firstname","middlename","lastname")
df.drop(*cols) \
.printSchema()
The above two examples remove more than one column at a time from the DataFrame. Both yield the same output.
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
4. Complete Example
Below is a complete example of how to drop one column or multiple columns from a PySpark DataFrame.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
("Michael","Rose","","40288","California",4300), \
("Robert","","Williams","42114","Florida",1400), \
("Maria","Anne","Jones","39192","Florida",5500), \
("Jen","Mary","Brown","34561","NewYork",3000) \
)
columns= ["firstname","middlename","lastname","id","location","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
df.drop("firstname") \
.printSchema()
df.drop(col("firstname")) \
.printSchema()
df.drop(df.firstname) \
.printSchema()
df.drop("firstname","middlename","lastname") \
.printSchema()
cols = ("firstname","middlename","lastname")
df.drop(*cols) \
.printSchema()
This complete example is also available in the PySpark Examples GitHub project for reference.
Thanks for reading, and happy learning!