In Polars, the join()
function is used to combine two DataFrames based on a common key or index. It is similar to SQL joins and the Pandas merge()
function. Polars supports various types of joins, including inner, left, outer, and cross joins.
Key Points –
- Combines two DataFrames based on a common key or index, similar to SQL joins or Pandas’
merge()
. - Supports
inner
,left
,outer
, andcross
joins to handle different merging scenarios. - Polars’ join operations are optimized for high-speed execution, especially with large datasets.
- Allows joining on multiple columns by specifying them in
left_on
andright_on
parameters. - Automatically appends a suffix (default
_right
) to overlapping column names to avoid conflicts. - Ensures that columns in
left_on
andright_on
are appropriately aligned for the join.
Quick Examples of Polars Join
Following are quick examples of the polars join() method.
# Quick examples on join() method
# Inner Join (matches rows based on the "Courses" column)
result = df1.join(df2, on="Courses", how="inner")
# Left Join (preserves all rows from df1)
result = df1.join(df2, on="Courses", how="left")
# Right Join (keeps all rows from df2)
result = df1.join(df2, on="Courses", how="right")
# Anti Join
result = df1.join(df2, on="Courses", how="anti")
# Semi Join
result = df1.join(df2, on="Courses", how="semi")
# Outer Join (keeps all rows from both DataFrames)
result = df1.join(df2, on="Courses", how="outer")
Polars DataFrame.join() Introduction
Following is the syntax of the Polars DataFrame join() method.
# Syntax of polars DataFrame.join()
DataFrame.join(
other: DataFrame,
left_on: str | list[str] | None = None,
right_on: str | list[str] | None = None,
on: str | list[str] | None = None,
how: str = "inner",
suffix: str = "_right"
)
Parameters of the Polars DataFrame.join()
Following are the parameters of the polars DataFrame.join() method.
other
– The other DataFrame to join with.left_on
– Columns or expressions to join on in the current DataFrame.right_on
– Columns or expressions to join on in the other DataFrame.on
– Columns to match in both DataFrames. Use this when column names in both are identical.how
– Join type ("inner"
,"left"
,"outer"
,‘semi’
,‘anti’
,"cross"
). Default is"inner"
.inner
– (default): Only matching rows.left
– All rows from the left DataFrame, matching rows from the right.outer
– All rows from both DataFrames.semi
– Rows in the left DataFrame with a match in the right.anti
– Rows in the left DataFrame without a match in the right.cross
– Returns the Cartesian product of rows from both tables
suffix
– Suffix to add to overlapping column names from theother
DataFrame.
Usage of Polars Join
The Polars join()
method is used to combine two DataFrames based on a common column or index. It supports different types of joins to meet a variety of data manipulation needs.
Now, let’s create two DataFrames using data from a dictionary.
import polars as pl
technologies = {
'Courses':["Spark","PySpark","Python","Pandas"],
'Fees' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
df1 = pl.DataFrame(technologies)
print("First DataFrame:\n", df1)
technologies2 = {
'Courses':["Spark","Java","Python","C++"],
'Discount':[2000,2300,1200,2000]
}
df2 = pl.DataFrame(technologies2)
print("Second DataFRame:\n", df2)
Yields below output.
Inner Join (Default join type)
An Inner Join
in Polars is the default join type. It combines rows from two DataFrames based on a matching column, keeping only the rows where there is a match in both DataFrames. If there are no matching rows, those rows are excluded from the result.
# Inner Join (matches rows based on the "Courses" column)
result = df1.join(df2, on="Courses", how="inner")
print("Inner join:\n",result)
Here,
- The
Inner Join
includes only the rows where there is a matching value in bothdf1
anddf2
for theCourses
column. - The result contains rows for
Spark
andPython
because they exist in bothdf1
anddf2
. The rows forPySpark
andPandas
are excluded because they do not have matching entries indf2
.
Left Join
A Left Join
in Polars keeps all the rows from the left DataFrame and combines them with matching rows from the right DataFrame based on a specified column. If there are no matches in the right DataFrame, the result will contain null
values in the columns of the right DataFrame.
# Left Join (preserves all rows from df1)
result = df1.join(df2, on="Courses", how="left")
print("Left Join:\n",result)
# Output:
# Left Join:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 20000 ┆ 30days ┆ 2000 │
│ PySpark ┆ 25000 ┆ 40days ┆ null │
│ Python ┆ 22000 ┆ 35days ┆ 1200 │
│ Pandas ┆ 30000 ┆ 50days ┆ null │
└─────────┴───────┴──────────┴──────────┘
Here,
- The
Left Join
keeps all rows fromdf1
(the left DataFrame), and adds matching rows fromdf2
(the right DataFrame). - If there is no match in
df2
, like withPySpark
andPandas
, the correspondingDiscount
value isnull
. - Rows like
Spark
andPython
match in both DataFrames, so theirDiscount
values are included.
Here’s an example of a left join with coalesce=True
in Polars. This ensures that if any column from the right DataFrame contains null
values after the join, they will be replaced with corresponding values from the left DataFrame.
# Use left Join with coalesce=True
result = df1.join(df2, on="Courses", how="left", coalesce=True)
print(result)
# Output:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 20000 ┆ 30days ┆ 2000 │
│ PySpark ┆ 25000 ┆ 40days ┆ null │
│ Python ┆ 22000 ┆ 35days ┆ 1200 │
│ Pandas ┆ 30000 ┆ 50days ┆ null │
└─────────┴───────┴──────────┴──────────┘
Right Join
A Right Join
in Polars keeps all rows from the right DataFrame (df2
) and includes matching rows from the left DataFrame (df1
) where the join condition is met. If there are no matches, the columns from the left DataFrame will contain null values.
# Right Join (keeps all rows from df2)
result = df1.join(df2, on="Courses", how="right")
print("Right Join:\n",result)
# Output:
# Right Join:
# shape: (4, 4)
┌───────┬──────────┬─────────┬──────────┐
│ Fees ┆ Duration ┆ Courses ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═══════╪══════════╪═════════╪══════════╡
│ 20000 ┆ 30days ┆ Spark ┆ 2000 │
│ null ┆ null ┆ Java ┆ 2300 │
│ 22000 ┆ 35days ┆ Python ┆ 1200 │
│ null ┆ null ┆ C++ ┆ 2000 │
└───────┴──────────┴─────────┴──────────┘
Here,
- The
Right Join
ensures all rows fromdf2
are retained in the result. - Matching rows from
df1
(based on theCourses
column) are included, while unmatched rows fromdf1
contributenull
values.
Anti Join
An Anti Join
in Polars is used to find rows from the left DataFrame (df1
) that do not have a matching row in the right DataFrame (df2
). It is useful for filtering out rows that have a corresponding match in another DataFrame based on the join condition.
# Anti Join
result = df1.join(df2, on="Courses", how="anti")
print("Anti Join:\n", result)
# Output:
# Anti Join:
# shape: (2, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees ┆ Duration │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════════╪═══════╪══════════╡
│ PySpark ┆ 25000 ┆ 40days │
│ Pandas ┆ 30000 ┆ 50days │
└─────────┴───────┴──────────┘
Here,
- The
Anti Join
returns the rows from the left DataFrame (df1
) where there is no matching value in the right DataFrame (df2
). - In this example,
PySpark
andPandas
are the rows fromdf1
that don’t have corresponding entries indf2
, so they are returned in the result. - Rows like
Spark
andPython
are excluded because they have corresponding matches indf2
.
Semi Join
A Semi Join
in Polars is used to find rows from the left DataFrame (df1
) that have a matching row in the right DataFrame (df2
) based on a specified column. Unlike other joins, it does not include columns from the right DataFrame in the result; it only filters rows from the left DataFrame.
# Semi Join
result = df1.join(df2, on="Courses", how="semi")
print("Semi Join:\n", result)
# Output:
# Semi Join:
# shape: (2, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees ┆ Duration │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════════╪═══════╪══════════╡
│ Spark ┆ 20000 ┆ 30days │
│ Python ┆ 22000 ┆ 35days │
└─────────┴───────┴──────────┘
Here,
- The
Semi Join
includes only the rows fromdf1
where there is a matching value indf2
based on theCourses
column. - The resulting DataFrame contains only rows from
df1
, and no columns fromdf2
are included. - In this example,
Spark
andPython
are included because they exist in bothdf1
anddf2
. - Rows like
PySpark
andPandas
are excluded because they do not have corresponding matches indf2
.
Outer Join
An Outer Join
in Polars combines all rows from both the left (df1
) and right (df2
) DataFrames, keeping all records and filling with null values where there is no match on the join column. An Outer Join ensures that no rows are lost from either DataFrame.
# Outer Join (keeps all rows from both DataFrames)
result = df1.join(df2, on="Courses", how="outer")
print(result)
# Output:
# shape: (6, 5)
┌─────────┬───────┬──────────┬───────────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Courses_right ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪═══════════════╪══════════╡
│ Spark ┆ 20000 ┆ 30days ┆ Spark ┆ 2000 │
│ null ┆ null ┆ null ┆ Java ┆ 2300 │
│ Python ┆ 22000 ┆ 35days ┆ Python ┆ 1200 │
│ null ┆ null ┆ null ┆ C++ ┆ 2000 │
│ PySpark ┆ 25000 ┆ 40days ┆ null ┆ null │
│ Pandas ┆ 30000 ┆ 50days ┆ null ┆ null │
└─────────┴───────┴──────────┴───────────────┴──────────┘
Conclusion
In this article, I have explained how to polars join()
DataFrames using the join()
method. This method is a versatile and efficient tool for combining DataFrames, supporting various join types and options for handling missing data and overlapping columns.
Happy Learning!!