PySpark – Drop One or Multiple Columns From DataFrame

PySpark DataFrame provides a drop() method to drop a single column/field or multiple columns from a DataFrame/Dataset. In this article, I will explain ways to drop columns using PySpark (Spark with Python) example.

Related: Drop duplicate rows from DataFrame

First, let’s create a PySpark DataFrame.


spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
    ("Michael","Rose","","40288","California",4300), \
    ("Robert","","Williams","42114","Florida",1400), \
    ("Maria","Anne","Jones","39192","Florida",5500), \
    ("Jen","Mary","Brown","34561","NewYork",3000) \
  )
columns= ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()

This yields below output.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

1. PySpark DataFrame drop() syntax

PySpark drop() takes self and *cols as arguments. In the below sections, I’ve explained with examples.


drop(self, *cols)

2. Drop Column From DataFrame

First let’s see a how-to drop a single column from PySpark DataFrame. Below explained three different ways. To use second signature you need to import pyspark.sql.functions import col


df.drop("firstname") \
  .printSchema()
""" import col is required """  
df.drop(col("firstname")) \
  .printSchema()  
  
df.drop(df.firstname) \
  .printSchema()   

The above 3 examples drops column “firstname” from DataFrame. You can use either one of these according to your need.


root
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)

3. Drop Multiple Columns from DataFrame

This uses an array string as an argument to drop() function. This removes more than one column (all columns from an array) from a DataFrame.


df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

The above two examples remove more than one column at a time from DataFrame. These both yield the same output.


root
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: integer (nullable = true)

4. Complete Example

Below is a complete example of how to drop one column or multiple columns from a PySpark DataFrame.


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = (("James","","Smith","36636","NewYork",3100), \
    ("Michael","Rose","","40288","California",4300), \
    ("Robert","","Williams","42114","Florida",1400), \
    ("Maria","Anne","Jones","39192","Florida",5500), \
    ("Jen","Mary","Brown","34561","NewYork",3000) \
  )
columns= ["firstname","middlename","lastname","id","location","salary"]

df = spark.createDataFrame(data = simpleData, schema = columns)

df.printSchema()
df.show(truncate=False)

df.drop("firstname") \
  .printSchema()
  
df.drop(col("firstname")) \
  .printSchema()  
  
df.drop(df.firstname) \
  .printSchema()

df.drop("firstname","middlename","lastname") \
    .printSchema()

cols = ("firstname","middlename","lastname")

df.drop(*cols) \
   .printSchema()

This complete example is also available at PySpark Examples Github project for references.

Thanks for reading and Happy Learning !!

5. Related Articles

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

PySpark – Drop One or Multiple Columns From DataFrame