In PySpark, we often need to create a DataFrame from a list. In this article, I will explain how to create a DataFrame and an RDD from a list, with PySpark examples.
A list is a data structure in Python that holds an ordered collection of items. List items are enclosed in square brackets, like [data1, data2, data3].
In PySpark, when you have data in a list, that data lives on the PySpark driver. When you create a DataFrame from it, the collection is parallelized across the cluster.
First, let’s create a list of data.
dept = [("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
]
Here, we have four elements in the list. Now let’s convert it to a DataFrame.
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
This yields the output below. Here we have assigned column names to the DataFrame from a list.
root
|-- dept_name: string (nullable = true)
|-- dept_id: long (nullable = true)
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
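As mentioned earlier, createDataFrame() distributes this driver-side list across the cluster. As a quick, optional check (the exact count depends on your Spark configuration), you can inspect how many partitions back the DataFrame:
# The driver-side list is now distributed; the partition count depends
# on spark.default.parallelism and your cluster setup.
print(deptDF.rdd.getNumPartitions())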
Now, let’s define the columns with an explicit schema using StructType. The field names and types must match the data, so here we declare dept_name as a string and dept_id as a long.
from pyspark.sql.types import StructType, StructField, StringType, LongType
deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', LongType(), True)
])
deptDF = spark.createDataFrame(data=dept, schema=deptSchema)
deptDF.printSchema()
deptDF.show(truncate=False)
This yields the same output as above. You can also create a DataFrame from a list of Row objects, as shown below.
# Using list of Row type
from pyspark.sql import Row
dept2 = [Row("Finance",10),
Row("Marketing",20),
Row("Sales",30),
Row("IT",40)
]
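As with the tuple list, pass this list of Row objects to createDataFrame(); the full conversion appears in the complete example below. Here is a small sketch, including a variation with named Row fields (not used in this article’s main example) where the column names come from the rows themselves:
deptDF2 = spark.createDataFrame(data=dept2, schema=deptColumns)
deptDF2.show(truncate=False)

# Variation: with keyword arguments, Row carries the field names,
# so no separate schema argument is needed.
dept3 = [Row(dept_name="Finance", dept_id=10),
         Row(dept_name="Marketing", dept_id=20)]
spark.createDataFrame(dept3).show()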
Finally, let’s create an RDD from a list. Note that RDDs are not schema-based, hence we cannot assign column names to an RDD.
# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)
Once you have an RDD, you can also convert it into a DataFrame, as sketched below.
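A minimal sketch of that conversion: PySpark adds a toDF() method to RDDs once a SparkSession is active, and spark.createDataFrame() also accepts an RDD directly.
# Convert the RDD to a DataFrame, supplying the column names
deptDF3 = rdd.toDF(deptColumns)
deptDF3.printSchema()

# Equivalent: pass the RDD straight to createDataFrame()
deptDF4 = spark.createDataFrame(rdd, schema=deptColumns)
deptDF4.show(truncate=False)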
Complete example of creating a DataFrame from a list
Below is a complete example of creating a PySpark DataFrame from a list.
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, LongType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# Using a list
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', LongType(), True)
])
deptDF1 = spark.createDataFrame(data=dept, schema=deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)
# Using list of Row type
dept2 = [Row("Finance",10),
Row("Marketing",20),
Row("Sales",30),
Row("IT",40)
]
deptDF2 = spark.createDataFrame(data=dept2, schema=deptColumns)
deptDF2.printSchema()
deptDF2.show(truncate=False)
# Convert list to RDD
rdd = spark.sparkContext.parallelize(dept)
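If you want to verify the RDD contents, collect() brings them back to the driver (fine for a four-element list; avoid it on large datasets):
# Bring the distributed elements back to the driver for inspection
print(rdd.collect())
# [('Finance', 10), ('Marketing', 20), ('Sales', 30), ('IT', 40)]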
This complete example is also available at the PySpark GitHub project.
Happy Learning !!
Related Articles
- Convert PySpark RDD to DataFrame
- Create a PySpark DataFrame from Multiple Lists
- PySpark Collect() – Retrieve data from DataFrame
- PySpark Create RDD with Examples
- How to Convert PySpark Column to List?
- PySpark parallelize() – Create RDD from a list data
- Dynamic way of doing ETL through Pyspark
- PySpark Get Number of Rows and Columns
- PySpark Join Types | Join Two DataFrames