PySpark DataFrame provides a drop() method to drop a single column/field or multiple columns from a DataFrame. In this article, I will explain ways to drop columns using PySpark (Spark with Python) examples.
Related: Drop duplicate rows from DataFrame
First, let’s create a PySpark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Sample rows: first name, middle name, last name, id, location, salary
simpleData = (("James","","Smith","36636","NewYork",3100),
    ("Michael","Rose","","40288","California",4300),
    ("Robert","","Williams","42114","Florida",1400),
    ("Maria","Anne","Jones","39192","Florida",5500),
    ("Jen","Mary","Brown","34561","NewYork",3000)
  )
columns = ["firstname","middlename","lastname","id","location","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
This yields the output below.
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
1. PySpark DataFrame drop() syntax
PySpark drop() takes self and *cols as arguments, so it accepts one or more columns and returns a new DataFrame without them. The sections below explain it with examples.
drop(self, *cols)
2. Drop Column From DataFrame
First, let’s see how to drop a single column from a PySpark DataFrame. The three examples below pass the column as a name string, as a col() expression, and as a DataFrame attribute. To use the second form, you need to import col with from pyspark.sql.functions import col
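# 1. Pass the column name as a string
df.drop("firstname") \
  .printSchema()
# 2. Pass a Column expression; this form needs the col import
from pyspark.sql.functions import col
df.drop(col("firstname")) \
  .printSchema()
# 3. Pass the column as a DataFrame attribute
df.drop(df.firstname) \
  .printSchema()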
The above three examples drop the column “firstname” from the DataFrame. You can use whichever form suits your need.
root
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
3. Drop Multiple Columns from DataFrame
To remove more than one column, pass each column name to drop() as a separate string argument. If the names are already in a list or tuple, unpack it with * so that every element becomes its own argument.
df.drop("firstname","middlename","lastname") \
.printSchema()
cols = ("firstname","middlename","lastname")
df.drop(*cols) \
.printSchema()
The above two examples remove more than one column at a time from the DataFrame. Both yield the same output.
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: long (nullable = true)
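When the columns to remove follow a naming pattern, the sequence passed to drop() can be built from the schema instead of typed by hand. The following is a minimal sketch, assuming the column list from this article; the variable `name_cols` is made up for illustration, and in practice `df.columns` would supply the list. The final call is left as a comment because it needs the `df` created earlier.

```python
# Column names as defined earlier in this article (df.columns returns the same list).
columns = ["firstname", "middlename", "lastname", "id", "location", "salary"]

# Collect every column whose name ends with "name".
name_cols = [c for c in columns if c.endswith("name")]
print(name_cols)  # ['firstname', 'middlename', 'lastname']

# With the DataFrame from this article, the list would then be unpacked:
# df.drop(*name_cols).printSchema()
```

Building the list first also makes it easy to print or log exactly which columns are about to be removed.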
4. Complete Example
Below is a complete example of how to drop one column or multiple columns from a PySpark DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Sample rows: first name, middle name, last name, id, location, salary
simpleData = (("James","","Smith","36636","NewYork",3100),
    ("Michael","Rose","","40288","California",4300),
    ("Robert","","Williams","42114","Florida",1400),
    ("Maria","Anne","Jones","39192","Florida",5500),
    ("Jen","Mary","Brown","34561","NewYork",3000)
  )
columns = ["firstname","middlename","lastname","id","location","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
# Drop a single column: by name, by col() expression, and by DataFrame attribute
df.drop("firstname") \
  .printSchema()
df.drop(col("firstname")) \
  .printSchema()
df.drop(df.firstname) \
  .printSchema()
# Drop multiple columns: as separate arguments, then by unpacking a tuple
df.drop("firstname","middlename","lastname") \
  .printSchema()
cols = ("firstname","middlename","lastname")
df.drop(*cols) \
  .printSchema()
This complete example is also available at the PySpark Examples GitHub project for reference.
Thanks for reading and Happy Learning !!