PySpark – Drop One or Multiple Columns From DataFrame


PySpark DataFrame provides a drop() method that removes a single column or multiple columns from a DataFrame. In this article, I will explain several ways to drop columns using PySpark (Spark with Python), with examples.

Related: Drop duplicate rows from DataFrame

First, let’s create a PySpark DataFrame.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100),
    ("Michael","Rose","","40288","California",4300),
    ("Robert","","Williams","42114","Florida",1400),
    ("Maria","Anne","Jones","39192","Florida",5500),
    ("Jen","Mary","Brown","34561","NewYork",3000)
  )
columns = ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()

This yields the following output.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

1. PySpark DataFrame drop() syntax

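drop() is a DataFrame method that takes variable-length arguments (*cols): one or more column names, or a Column object. It returns a new DataFrame without those columns and leaves the original DataFrame unchanged. The sections below show examples.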


drop(self, *cols)
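
One behavior worth knowing: per the PySpark documentation, calling drop() with a column name that is not in the schema is a no-op rather than an error, so a typo in a name can go unnoticed. Below is a minimal sketch of a stricter wrapper. The name drop_strict is hypothetical (it is not part of PySpark), and the helper only assumes that df exposes .columns and .drop() the way a PySpark DataFrame does.

```python
def drop_strict(df, *names):
    """Drop the named columns, raising if any name is not in df.columns.

    Hypothetical helper, not part of PySpark. `df` is assumed to be a
    PySpark DataFrame; only `.columns` and `.drop()` are used.
    """
    # Exact (case-sensitive) comparison against the schema's column names.
    # This is an assumption of the sketch and is stricter than Spark's
    # default name resolution.
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    return df.drop(*names)
```

With the sample DataFrame above, drop_strict(df, "firstname") would behave like df.drop("firstname"), while drop_strict(df, "first_name") would raise a ValueError instead of silently returning all columns.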

2. Drop Column From DataFrame

First, let's see how to drop a single column from a PySpark DataFrame. The three ways shown below are equivalent. To use the second form, you need to import col: from pyspark.sql.functions import col


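from pyspark.sql.functions import col

# 1. Drop by column name
df.drop("firstname") \
  .printSchema()

# 2. Drop by Column object (needs the col import above)
df.drop(col("firstname")) \
  .printSchema()

# 3. Drop by DataFrame attribute reference
df.drop(df.firstname) \
  .printSchema()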

All three examples above drop the column “firstname” from the DataFrame. Use whichever one suits your need.


root
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

3. Drop Multiple Columns from DataFrame

To drop more than one column, pass several column names to drop(). Because drop() takes variable-length arguments, a list or tuple of names has to be unpacked with * before it is passed in.


df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

The two examples above each remove more than one column at a time from the DataFrame. Both yield the same output.


root
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)
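
When the columns to remove are not known up front, the tuple passed to drop() can be computed from df.columns instead of being written by hand. The sketch below drops every column whose name satisfies a predicate. The name drop_matching is hypothetical (not a PySpark API), and the helper relies only on df.columns and df.drop().

```python
def drop_matching(df, predicate):
    """Drop every column whose name makes predicate(name) true.

    Hypothetical helper, not part of PySpark; uses only `df.columns`
    and `df.drop()`.
    """
    to_drop = [name for name in df.columns if predicate(name)]
    if not to_drop:
        # Nothing matched, so return the DataFrame as it is.
        return df
    # Unpack the list, since drop() takes names as separate arguments.
    return df.drop(*to_drop)
```

For the sample DataFrame, drop_matching(df, lambda c: c.endswith("name")) would remove firstname, middlename and lastname, which is the same result as the two examples above.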

4. Complete Example

Below is a complete example of how to drop one column or multiple columns from a PySpark DataFrame.


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100),
    ("Michael","Rose","","40288","California",4300),
    ("Robert","","Williams","42114","Florida",1400),
    ("Maria","Anne","Jones","39192","Florida",5500),
    ("Jen","Mary","Brown","34561","NewYork",3000)
  )
columns= ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

df.drop("firstname") \
  .printSchema()
  
df.drop(col("firstname")) \
  .printSchema()  
  
df.drop(df.firstname) \
  .printSchema()

df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

This complete example is also available in the PySpark Examples GitHub project for reference.

Thanks for reading and Happy Learning !!

Naveen (NNK)


This Post Has 2 Comments

  1. Anonymous

    how to remove only one column, when there are multiple columns with the same name ??

    1. NNK

      Have you tried dropping column by index?
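
drop() itself has no positional form, so dropping by index has to be built from other DataFrame calls. One possible sketch: rename every column to a unique temporary name with toDF(), select all positions except the unwanted one, then restore the original names. The function name drop_by_index and the temporary names _c0, _c1, ... are hypothetical and chosen only for illustration.

```python
def drop_by_index(df, index):
    """Return df without the column at position `index` (0-based).

    Hypothetical helper, not part of PySpark. Works positionally, so it
    can remove one of several columns that share the same name.
    """
    names = df.columns
    if not 0 <= index < len(names):
        raise IndexError(f"column index {index} out of range")
    # Unique temporary names make every position addressable, even when
    # the original names collide.
    temp = [f"_c{i}" for i in range(len(names))]
    keep = [i for i in range(len(names)) if i != index]
    renamed = df.toDF(*temp)
    # Select the surviving positions, then put the original names back.
    return renamed.select(*[temp[i] for i in keep]).toDF(*[names[i] for i in keep])
```

For a DataFrame whose columns are id, name, id, calling drop_by_index(df, 2) would keep only the first id and name. When the duplicate names come from a join, passing a column reference from one side, as in joined.drop(df2.id), is another option to consider.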
