PySpark Create DataFrame from List

  • Post author:
  • Post category:PySpark
  • Post last modified:January 17, 2024
  • Reading time:5 mins read

In PySpark, we often need to create a DataFrame from a list, In this article, I will explain creating DataFrame and RDD from List using PySpark examples.

list is a data structure in Python that holds a collection/tuple of items. List items are enclosed in square brackets, like [data1, data2, data3].

In PySpark, when you have data in a list that means you have a collection of data in a PySpark driver. When you create a DataFrame, this collection is going to be parallelized.

First, let’ create a list of data.


dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

Here, we have 4 elements in a list. now let’s convert this to a DataFrame.


deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields below output. Here we have assigned columns to a DataFrame from a list.


root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Now, let’s add a columns using Schema.


from pyspark.sql.types import StructType,StructField, StringType
deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields the same output as above. You can also create a DataFrame from a list of Row type.


# Using list of Row type
from pyspark.sql import Row
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

Finally, let’s create an RDD from a list. Note that RDDs are not schema based hence we cannot add column names to RDD.


# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

Once you have an RDD, you can also convert this into DataFrame.

Complete example of creating DataFrame from list

Below is a complete to create PySpark DataFrame from list.


import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Using List
dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF1 = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)

# Using list of Row type
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

deptDF2 = spark.createDataFrame(data=dept2, schema = deptColumns)
deptDF2.printSchema()
deptDF2.show(truncate=False)

# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

This complete example is also available at PySpark github project.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ @ LinkedIn

Leave a Reply

This Post Has One Comment

  1. Abhi

    Good Blog for beginner to understand basics with ease .. Thanks