• Post author:
  • Post category:Polars
  • Post last modified:December 20, 2024
  • Reading time:14 mins read

In Polars, the join() function is used to combine two DataFrames based on a common key or index. It is similar to SQL joins and the Pandas merge() function. Polars supports various types of joins, including inner, left, outer, and cross joins.

Advertisements

Key Points –

  • Combines two DataFrames based on a common key or index, similar to SQL joins or Pandas’ merge().
  • Supports inner, left, outer, and cross joins to handle different merging scenarios.
  • Polars’ join operations are optimized for high-speed execution, especially with large datasets.
  • Allows joining on multiple columns by specifying them in left_on and right_on parameters.
  • Automatically appends a suffix (default _right) to overlapping column names to avoid conflicts.
  • Ensures that columns in left_on and right_on are appropriately aligned for the join.

Quick Examples of Polars Join

Following are quick examples of the polars join() method.


# Quick examples on join() method

# Inner Join (matches rows based on the "Courses" column)
result = df1.join(df2, on="Courses", how="inner")

# Left Join (preserves all rows from df1)
result = df1.join(df2, on="Courses", how="left")

# Right Join (keeps all rows from df2)
result = df1.join(df2, on="Courses", how="right")

# Anti Join 
result = df1.join(df2, on="Courses", how="anti")

# Semi Join 
result = df1.join(df2, on="Courses", how="semi")

# Outer Join (keeps all rows from both DataFrames)
result = df1.join(df2, on="Courses", how="outer")

Polars DataFrame.join() Introduction

Following is the syntax of the Polars DataFrame join() method.


# Syntax of polars DataFrame.join()
DataFrame.join(
    other: DataFrame,
    left_on: str | list[str] | None = None,
    right_on: str | list[str] | None = None,
    on: str | list[str] | None = None,
    how: str = "inner",
    suffix: str = "_right"
)

Parameters of the Polars DataFrame.join()

Following are the parameters of the polars DataFrame.join() method.

  • other – The other DataFrame to join with.
  • left_on – Columns or expressions to join on in the current DataFrame.
  • right_on – Columns or expressions to join on in the other DataFrame.
  • on – Columns to match in both DataFrames. Use this when column names in both are identical.
  • how – Join type ("inner", "left", "outer",  ‘semi’, ‘anti’, "cross"). Default is "inner".
    • inner – (default): Only matching rows.
    • left – All rows from the left DataFrame, matching rows from the right.
    • outer – All rows from both DataFrames.
    • semi – Rows in the left DataFrame with a match in the right.
    • anti – Rows in the left DataFrame without a match in the right.
    • cross – Returns the Cartesian product of rows from both tables
  • suffix – Suffix to add to overlapping column names from the other DataFrame.

Usage of Polars Join

The Polars join() method is used to combine two DataFrames based on a common column or index. It supports different types of joins to meet a variety of data manipulation needs.

Now, let’s create two DataFrames using data from a dictionary.


import polars as pl

technologies = {
    'Courses':["Spark","PySpark","Python","Pandas"],
    'Fees' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
df1 = pl.DataFrame(technologies)
print("First DataFrame:\n", df1)

technologies2 = {
    'Courses':["Spark","Java","Python","C++"],
    'Discount':[2000,2300,1200,2000]
              }
df2 = pl.DataFrame(technologies2)
print("Second DataFRame:\n", df2)

Yields below output.

polars join

Inner Join (Default join type)

An Inner Join in Polars is the default join type. It combines rows from two DataFrames based on a matching column, keeping only the rows where there is a match in both DataFrames. If there are no matching rows, those rows are excluded from the result.


# Inner Join (matches rows based on the "Courses" column)
result = df1.join(df2, on="Courses", how="inner")
print("Inner join:\n",result)

Here,

  • The Inner Join includes only the rows where there is a matching value in both df1 and df2 for the Courses column.
  • The result contains rows for Spark and Python because they exist in both df1 and df2. The rows for PySpark and Pandas are excluded because they do not have matching entries in df2.
polars join

Left Join

A Left Join in Polars keeps all the rows from the left DataFrame and combines them with matching rows from the right DataFrame based on a specified column. If there are no matches in the right DataFrame, the result will contain null values in the columns of the right DataFrame.


# Left Join (preserves all rows from df1)
result = df1.join(df2, on="Courses", how="left")
print("Left Join:\n",result)

# Output:
# Left Join:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ null     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ Pandas  ┆ 30000 ┆ 50days   ┆ null     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The Left Join keeps all rows from df1 (the left DataFrame), and adds matching rows from df2 (the right DataFrame).
  • If there is no match in df2, like with PySpark and Pandas, the corresponding Discount value is null.
  • Rows like Spark and Python match in both DataFrames, so their Discount values are included.

Here’s an example of a left join with coalesce=True in Polars. This ensures that if any column from the right DataFrame contains null values after the join, they will be replaced with corresponding values from the left DataFrame.


# Use left Join with coalesce=True
result = df1.join(df2, on="Courses", how="left", coalesce=True)
print(result)

# Output:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ null     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ Pandas  ┆ 30000 ┆ 50days   ┆ null     │
└─────────┴───────┴──────────┴──────────┘

Right Join

A Right Join in Polars keeps all rows from the right DataFrame (df2) and includes matching rows from the left DataFrame (df1) where the join condition is met. If there are no matches, the columns from the left DataFrame will contain null values.


# Right Join (keeps all rows from df2)
result = df1.join(df2, on="Courses", how="right")
print("Right Join:\n",result)

# Output:
# Right Join:
# shape: (4, 4)
┌───────┬──────────┬─────────┬──────────┐
│ Fees  ┆ Duration ┆ Courses ┆ Discount │
│ ---   ┆ ---      ┆ ---     ┆ ---      │
│ i64   ┆ str      ┆ str     ┆ i64      │
╞═══════╪══════════╪═════════╪══════════╡
│ 20000 ┆ 30days   ┆ Spark   ┆ 2000     │
│ null  ┆ null     ┆ Java    ┆ 2300     │
│ 22000 ┆ 35days   ┆ Python  ┆ 1200     │
│ null  ┆ null     ┆ C++     ┆ 2000     │
└───────┴──────────┴─────────┴──────────┘

Here,

  • The Right Join ensures all rows from df2 are retained in the result.
  • Matching rows from df1 (based on the Courses column) are included, while unmatched rows from df1 contribute null values.

Anti Join

An Anti Join in Polars is used to find rows from the left DataFrame (df1) that do not have a matching row in the right DataFrame (df2). It is useful for filtering out rows that have a corresponding match in another DataFrame based on the join condition.


# Anti Join 
result = df1.join(df2, on="Courses", how="anti")
print("Anti Join:\n", result)

# Output:
# Anti Join:
# shape: (2, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ PySpark ┆ 25000 ┆ 40days   │
│ Pandas  ┆ 30000 ┆ 50days   │
└─────────┴───────┴──────────┘

Here,

  • The Anti Join returns the rows from the left DataFrame (df1) where there is no matching value in the right DataFrame (df2).
  • In this example, PySpark and Pandas are the rows from df1 that don’t have corresponding entries in df2, so they are returned in the result.
  • Rows like Spark and Python are excluded because they have corresponding matches in df2.

Semi Join

A Semi Join in Polars is used to find rows from the left DataFrame (df1) that have a matching row in the right DataFrame (df2) based on a specified column. Unlike other joins, it does not include columns from the right DataFrame in the result; it only filters rows from the left DataFrame.


# Semi Join 
result = df1.join(df2, on="Courses", how="semi")
print("Semi Join:\n", result)

# Output:
# Semi Join:
# shape: (2, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   │
│ Python  ┆ 22000 ┆ 35days   │
└─────────┴───────┴──────────┘

Here,

  • The Semi Join includes only the rows from df1 where there is a matching value in df2 based on the Courses column.
  • The resulting DataFrame contains only rows from df1, and no columns from df2 are included.
  • In this example, Spark and Python are included because they exist in both df1 and df2.
  • Rows like PySpark and Pandas are excluded because they do not have corresponding matches in df2.

Outer Join

An Outer Join in Polars combines all rows from both the left (df1) and right (df2) DataFrames, keeping all records and filling with null values where there is no match on the join column. An Outer Join ensures that no rows are lost from either DataFrame.


# Outer Join (keeps all rows from both DataFrames)
result = df1.join(df2, on="Courses", how="outer")
print(result)

# Output:
# shape: (6, 5)
┌─────────┬───────┬──────────┬───────────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Courses_right ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---           ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ str           ┆ i64      │
╞═════════╪═══════╪══════════╪═══════════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ Spark         ┆ 2000     │
│ null    ┆ null  ┆ null     ┆ Java          ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ Python        ┆ 1200     │
│ null    ┆ null  ┆ null     ┆ C++           ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ null          ┆ null     │
│ Pandas  ┆ 30000 ┆ 50days   ┆ null          ┆ null     │
└─────────┴───────┴──────────┴───────────────┴──────────┘

Conclusion

In this article, I have explained how to polars join() DataFrames using the join() method. This method is a versatile and efficient tool for combining DataFrames, supporting various join types and options for handling missing data and overlapping columns.

Happy Learning!!

References