PySpark Create DataFrame from List

Naveen Nelamali

4 years ago

In PySpark, we often need to create a DataFrame from a list, In this article, I will explain creating DataFrame and RDD from List using PySpark examples.

A list is a data structure in Python that holds a collection/tuple of items. List items are enclosed in square brackets, like [data1, data2, data3].

In PySpark, when you have data in a list that means you have a collection of data in a PySpark driver. When you create a DataFrame, this collection is going to be parallelized.

First, let’ create a list of data.


dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

Here, we have 4 elements in a list. now let’s convert this to a DataFrame.


deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields below output. Here we have assigned columns to a DataFrame from a list.


root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Now, let’s add a columns using Schema.


from pyspark.sql.types import StructType,StructField, StringType
deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields the same output as above. You can also create a DataFrame from a list of Row type.


# Using list of Row type
from pyspark.sql import Row
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

Finally, let’s create an RDD from a list. Note that RDDs are not schema based hence we cannot add column names to RDD.


# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

Once you have an RDD, you can also convert this into DataFrame.

Complete example of creating DataFrame from list

Below is a complete to create PySpark DataFrame from list.


import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Using List
dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

deptSchema = StructType([       
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

deptDF1 = spark.createDataFrame(data=dept, schema = deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)

# Using list of Row type
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

deptDF2 = spark.createDataFrame(data=dept2, schema = deptColumns)
deptDF2.printSchema()
deptDF2.show(truncate=False)

# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

This complete example is also available at PySpark github project.

Happy Learning !!

Complete example of creating DataFrame from list

Related Articles