• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:5 mins read
You are currently viewing PySpark Create DataFrame from List

In PySpark, we often need to create a DataFrame from a list, In this article, I will explain creating DataFrame and RDD from List using PySpark examples.

Advertisements

list is a data structure in Python that holds a collection/tuple of items. List items are enclosed in square brackets, like [data1, data2, data3].

In PySpark, when you have data in a list that means you have a collection of data in a PySpark driver. When you create a DataFrame, this collection is going to be parallelized.

First, let’ create a list of data.


dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

Here, we have 4 elements in a list. now let’s convert this to a DataFrame.


deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields below output. Here we have assigned columns to a DataFrame from a list.


root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Now, let’s add a columns using Schema.


from pyspark.sql.types import StructType,StructField, StringType
deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields the same output as above. You can also create a DataFrame from a list of Row type.


# Using list of Row type
from pyspark.sql import Row
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

Finally, let’s create an RDD from a list. Note that RDDs are not schema based hence we cannot add column names to RDD.


# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

Once you have an RDD, you can also convert this into DataFrame.

Complete example of creating DataFrame from list

Below is a complete to create PySpark DataFrame from list.


import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Using List
dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF1 = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)

# Using list of Row type
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

deptDF2 = spark.createDataFrame(data=dept2, schema = deptColumns)
deptDF2.printSchema()
deptDF2.show(truncate=False)

# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

This complete example is also available at PySpark github project.

Happy Learning !!

This Post Has One Comment

  1. Abhi

    Good Blog for beginner to understand basics with ease .. Thanks

Comments are closed.