
Create a PySpark DataFrame from Multiple Lists


In PySpark, you can create a DataFrame from multiple lists (two or more) using Python’s zip() function. The zip() function combines the lists into tuples, and passing the resulting tuples to the createDataFrame() method creates a DataFrame from the multiple lists.

In Python, a list is a collection of objects that can hold different types of data. When you use it with PySpark, the collection of data initially lives on the PySpark driver. When you create a DataFrame from it, the data is parallelized across the nodes of the cluster.


Below is the syntax of the zip() function.


# Syntax of zip() with lists
zip(list1, list2, ..., listN)
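As a quick plain-Python illustration (no Spark needed), zip() pairs up elements by position, and it stops at the shortest input list:

```python
# zip() pairs elements of the lists by position
names = ["Ricky", "Bunny", "Coco"]
ages = [10, 15, 20]

rows = list(zip(names, ages))
print(rows)  # [('Ricky', 10), ('Bunny', 15), ('Coco', 20)]

# zip() stops at the shortest input, so a missing value silently drops a row
short_rows = list(zip(names, [10, 15]))
print(short_rows)  # [('Ricky', 10), ('Bunny', 15)]
```

This truncation behavior is worth keeping in mind: if your column lists have unequal lengths, zip() drops the extra elements without raising an error.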

1. Create PySpark DataFrame using Multiple Lists

Creating a PySpark DataFrame from multiple lists (two or more) involves the PySpark SQL module and the createDataFrame() method. First, create a SparkSession, which is the entry point to PySpark functionality, and define the lists that you want to combine into a DataFrame. Each list represents a column in the DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SparkByExamples").getOrCreate()

# Sample Data with two lists
names = ["Ricky", "Bunny", "Coco"]
ages = [10, 15, 20]

Now, apply the zip() function to the names and ages lists and pass the result to the createDataFrame() function, as shown in the snippet below, to create a DataFrame from the two lists.


# Create DataFrame from multiple(two) lists
df1 = spark.createDataFrame(zip(names, ages), ["Name", "Age"])
df1.show()

In the above code, zip() combines the elements of the names and ages lists into tuples: [("Ricky", 10), ("Bunny", 15), ("Coco", 20)]. The createDataFrame() method then converts this list of tuples into the DataFrame df1. The resulting DataFrame df1 has two columns, "Name" and "Age", with the corresponding values from the provided lists. Below is the output.


# Output
+-----+---+
| Name|Age|
+-----+---+
|Ricky| 10|
|Bunny| 15|
| Coco| 20|
+-----+---+

Alternatively, you can use the code below. Here, the lists are provided as a dictionary where the keys are the column names and the values are the corresponding lists.


# Create DataFrame
data = {"Name": names, "Age": ages}
df1 = spark.createDataFrame(list(zip(*data.values())), schema=list(data.keys()))
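To see what that one-liner does, here is the same dictionary unpacking in plain Python. zip(*data.values()) transposes the column lists into row tuples, and data.keys() supplies the column names; this illustrates only the zip mechanics, not anything Spark-specific:

```python
# Column-oriented dictionary: keys are column names, values are column lists
data = {"Name": ["Ricky", "Bunny", "Coco"], "Age": [10, 15, 20]}

# The * operator unpacks the column lists as separate arguments to zip(),
# which transposes them into one tuple per row
rows = list(zip(*data.values()))
columns = list(data.keys())

print(columns)  # ['Name', 'Age']
print(rows)     # [('Ricky', 10), ('Bunny', 15), ('Coco', 20)]
```

These two pieces, rows and columns, are exactly what createDataFrame() receives as its data and schema arguments in the snippet above.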

2. Create PySpark DataFrame using three Lists

Here, we create a DataFrame from three lists: names, ages, and country. Below is the code snippet.


# Create Data from three lists
names = ["Ricky", "Bunny", "Coco"]
ages = [10, 15, 20]
country = ["India", "UK", "USA"]

# Create DataFrame from three lists
df2 = spark.createDataFrame(zip(names, ages, country), ["Name", "Age", "Country"])
df2.show()

The zip() function combines the three lists into tuples: [("Ricky", 10, "India"), ("Bunny", 15, "UK"), ("Coco", 20, "USA")]. The createDataFrame() method on the Spark session (spark) then converts this list of tuples into the PySpark DataFrame df2. The column names are specified as ["Name", "Age", "Country"].

Yields below output.


# Output
+-----+---+-------+
| Name|Age|Country|
+-----+---+-------+
|Ricky| 10|  India|
|Bunny| 15|     UK|
| Coco| 20|    USA|
+-----+---+-------+

3. Create DataFrame using a List of Tuples

We can also create a PySpark DataFrame directly from a list of tuples. In the example below, we create a list of tuples named students, where each tuple holds information about one student (name, age, subject). The students list is then passed to createDataFrame() along with the column names (["Name", "Age", "Subject"]), which creates the DataFrame.


# Create sample Data using tuple.
students = [("Ricky", 10, "English"), ("Bunny", 15, "Mathematics"), ("Coco", 20, "Arts")]

# Create DataFrame out of list of tuples.
df4 = spark.createDataFrame(students, ["Name", "Age", "Subject"])
df4.show()

Yields below output.


# Output
+-----+---+-----------+
| Name|Age|    Subject|
+-----+---+-----------+
|Ricky| 10|    English|
|Bunny| 15|Mathematics|
| Coco| 20|       Arts|
+-----+---+-----------+

4. Create DataFrame using Multiple Lists Representing a Row

The examples above use lists where each list holds values of a single data type and represents a DataFrame column. If instead each list represents a row, for example a single list holding a name, an age, and a country (different data types), you can use the example below.


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create lists of Data
student1 = ["Ricky",10,"India"]
student2 = ["Bunny", 15,"UK"]
student3 = ["Coco", 20, "USA"]

# Define the schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Convert lists of Data into tuples
data = [tuple(student1),
        tuple(student2),
        tuple(student3)]

# Create a DataFrame
df5 = spark.createDataFrame(data, schema=schema)
df5.show()

Yields below output.


# Output
+-----+---+-------+
| Name|Age|Country|
+-----+---+-------+
|Ricky| 10|  India|
|Bunny| 15|     UK|
| Coco| 20|    USA|
+-----+---+-------+

Since each list represents a row of the DataFrame, the code converts the provided Python lists (student1, student2, student3) into tuples and then creates the PySpark DataFrame df5 from these tuples, following the specified schema. The resulting DataFrame has the columns "Name", "Age", and "Country", with the data corresponding to the provided students.
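The list-to-tuple conversion on its own is plain Python; each row list becomes a row tuple whose element order must match the column order of the schema. A minimal sketch of just that step:

```python
# Each list is one row of the future DataFrame
student1 = ["Ricky", 10, "India"]
student2 = ["Bunny", 15, "UK"]
student3 = ["Coco", 20, "USA"]

# Convert each row list into a row tuple; element order matches the
# schema's column order: Name, Age, Country
data = [tuple(s) for s in (student1, student2, student3)]
print(data)
# [('Ricky', 10, 'India'), ('Bunny', 15, 'UK'), ('Coco', 20, 'USA')]
```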

Conclusion

In this article, you have learned how to create a PySpark DataFrame from multiple lists and from lists of tuples. The createDataFrame() method, combined with the zip() function, converts lists and tuples into a tabular structure. The resulting DataFrames enable efficient data manipulation and analysis in a distributed computing environment.

Keep Learning!!
