PySpark – Drop One or Multiple Columns From DataFrame

PySpark DataFrame provides a drop() method to drop a single column or multiple columns from a DataFrame. In this article, I will explain several ways to drop columns using PySpark (Spark with Python), with examples.

Related: Drop duplicate rows from DataFrame

First, let’s create a PySpark DataFrame.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
    ("Michael","Rose","","40288","California",4300), \
    ("Robert","","Williams","42114","Florida",1400), \
    ("Maria","Anne","Jones","39192","Florida",5500), \
    ("Jen","Mary","Brown","34561","NewYork",3000) \
  )
columns= ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()

This yields the output below.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

1. PySpark DataFrame drop() syntax

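PySpark drop() takes self and *cols as arguments, so you can pass one or more columns in a single call. It returns a new DataFrame and leaves the original unchanged. The sections below explain this with examples.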


drop(self, *cols)

2. Drop Column From DataFrame

First, let’s see how to drop a single column from a PySpark DataFrame. The example below shows three different ways: by column name, by col(), and by DataFrame attribute. To use the second one, you need to import col with from pyspark.sql.functions import col.


df.drop("firstname") \
  .printSchema()

# col must be imported for the next form
from pyspark.sql.functions import col
df.drop(col("firstname")) \
  .printSchema()  
  
df.drop(df.firstname) \
  .printSchema()   

The above three examples drop the column “firstname” from the DataFrame. You can use whichever one suits your needs.


root
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

3. Drop Multiple Columns from DataFrame

drop() accepts multiple column names as separate string arguments, so you can remove more than one column in a single call. To drop every column named in a list or tuple, unpack it with * when calling drop().


df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

The above two examples remove more than one column at a time from the DataFrame. Both yield the same output.


root
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

4. Complete Example

Below is a complete example of how to drop one column or multiple columns from a PySpark DataFrame.


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
    ("Michael","Rose","","40288","California",4300), \
    ("Robert","","Williams","42114","Florida",1400), \
    ("Maria","Anne","Jones","39192","Florida",5500), \
    ("Jen","Mary","Brown","34561","NewYork",3000) \
  )
columns= ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

df.drop("firstname") \
  .printSchema()
  
df.drop(col("firstname")) \
  .printSchema()  
  
df.drop(df.firstname) \
  .printSchema()

df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

This complete example is also available in the PySpark Examples GitHub project for reference.

Thanks for reading and Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and manage Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.


This Post Has 2 Comments

  1. Anonymous

    how to remove only one column, when there are multiple columns with the same name ??

    1. NNK

      Have you tried dropping column by index?
