
PySpark Create DataFrame from List


In PySpark, we often need to create a DataFrame from a list. In this article, I will explain how to create a DataFrame and an RDD from a list using PySpark examples.

A list is a data structure in Python that holds an ordered collection of items. List items are enclosed in square brackets, like [data1, data2, data3].

In PySpark, when you have data in a list, that data is a collection held in the PySpark driver. When you create a DataFrame from it, the collection is parallelized, that is, split into partitions and distributed for processing.

First, let's create a list of data.

dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

Here, we have 4 elements in the list. Now let's convert it to a DataFrame.

deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

This yields the output below. Here we have assigned column names to the DataFrame from a list.

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

Now, let's define the columns using an explicit schema.

from pyspark.sql.types import StructType, StructField, StringType, LongType
# The schema must match the data: a string name and an integer id per row
deptSchema = StructType([       
    StructField('dept_name', StringType(), True),
    StructField('dept_id', LongType(), True)
])

deptDF = spark.createDataFrame(data=dept, schema = deptSchema)

This yields the same output as above. You can also create a DataFrame from a list of Row type.

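# Using list of Row type
from pyspark.sql import Row
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]
deptDF2 = spark.createDataFrame(data=dept2, schema = deptColumns)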

Finally, let's create an RDD from a list. Note that RDDs are not schema-based, so we cannot add column names to an RDD.

# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

Once you have an RDD, you can also convert it into a DataFrame.

Complete example of creating DataFrame from list

Below is a complete example that creates a PySpark DataFrame from a list.

import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName('').getOrCreate()

#Using List
dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]

deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)

# Explicit schema matching the data: a string name and an integer id
deptSchema = StructType([       
    StructField('dept_name', StringType(), True),
    StructField('dept_id', LongType(), True)
])

deptDF1 = spark.createDataFrame(data=dept, schema = deptSchema)

# Using list of Row type
dept2 = [Row("Finance",10), 
        Row("Marketing",20), 
        Row("Sales",30), 
        Row("IT",40) 
      ]

deptDF2 = spark.createDataFrame(data=dept2, schema = deptColumns)

# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)

This complete example is also available in the PySpark GitHub project.

Happy Learning !!
