Polars DataFrame update() – Usage & Examples

In Polars, the update() method allows you to modify specific values in a DataFrame using values from another DataFrame. It works similarly to Pandas’ update() function but with the efficiency of Polars. By default, it applies the non-null values from the other DataFrame, matched by row index. You can customize this behavior through its parameters, making it useful for modifying specific values while preserving the rest of the data.

In this article, I will explain the Polars DataFrame update() method, covering its syntax, parameters, and usage, and show how it returns a new Polars DataFrame with values updated according to the specified parameters.

Key Points –

  • The update() method is used to modify values in a Polars DataFrame using another DataFrame while keeping the existing structure.
  • It allows updates based on a common key column (on parameter) or row indices if no key is specified.
  • Supports "left", "inner", and "full" joins to control which rows get updated.
  • Only columns present in both DataFrames are updated, while others remain unchanged.
  • By default, None values from the updating DataFrame do not overwrite existing values unless include_nulls=True is specified.
  • Since Polars is optimized for performance, update() is typically much faster than the equivalent operation in Pandas.
  • Instead of on, left_on and right_on parameters can be used to join on different column names.
  • If how="left" (default), unmatched rows from the original DataFrame remain unchanged.
  • The method ensures only existing rows are modified and does not introduce new columns unless explicitly merged using other methods.

Syntax of Polars DataFrame update() Method

The following is the syntax of the Polars DataFrame update() method.


# Syntax of update()
DataFrame.update(
    other: DataFrame,
    on: str | Sequence[str] | None = None,
    how: Literal['left', 'inner', 'full'] = 'left',
    *,
    left_on: str | Sequence[str] | None = None,
    right_on: str | Sequence[str] | None = None,
    include_nulls: bool = False,
) -> DataFrame

Parameters of the Polars DataFrame update()

Following are the parameters of the update() method.

  • other (DataFrame) – The DataFrame containing new values that will update the current DataFrame.
  • on (str | Sequence[str] | None, default=None) – Column(s) to use as the join key. If None, updates are performed based on row indices.
  • how (Literal['left', 'inner', 'full'], default='left') – Specifies the type of join to use for the update:
    • "left" – Updates only matching rows while keeping all rows from the original DataFrame.
    • "inner" – Updates only rows that exist in both DataFrames.
    • "full" – Updates existing rows and includes new rows from other.
  • left_on (str | Sequence[str] | None, default=None) – Column(s) from the left DataFrame to use as keys. Overrides on.
  • right_on (str | Sequence[str] | None, default=None) – Column(s) from the right DataFrame (other) to use as keys. Overrides on.
  • include_nulls (bool, default=False) – If True, allows None (null) values from other to overwrite existing values in the original DataFrame.

Return Value

This function returns a new DataFrame with updated values.
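
The returned DataFrame is a new object; update() does not modify the original in place. Below is a minimal sketch (with made-up column names) illustrating this:


import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
other = pl.DataFrame({"a": [10, 20]})

# update() returns a new DataFrame rather than mutating df
result = df.update(other)
print(df["a"].to_list())      # [1, 2]  - the original is unchanged
print(result["a"].to_list())  # [10, 20] - the returned frame holds the new values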

Usage of Polars DataFrame update() Method

The update() method in Polars allows you to update values in a DataFrame based on another DataFrame. It is useful for merging new data, correcting values, or synchronizing information between DataFrames.

First, let’s create a Polars DataFrame.


import polars as pl

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Pandas"],
    'Fee': [22000, 25000, 24000, 26000],
    'Discount': [1000, 1200, 2500, 2000],
    'Duration': ['35days', '40days', '65days', '50days']
}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

Yields below output.

# Output:
# Original DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 1000     ┆ 35days   │
│ PySpark ┆ 25000 ┆ 1200     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ 26000 ┆ 2000     ┆ 50days   │
└─────────┴───────┴──────────┴──────────┘

By default, when you use update() without specifying a key (on parameter), it updates the DataFrame based on row index positions.


# Creating an update DataFrame (using row index positions)
updates = pl.DataFrame({
    "Fee": [None, 27000],  # None means Fee for row index 0 remains unchanged
    "Discount": [1800, 3000]  # Update row index 0 & 1
})

# Applying update (by default, updates rows by index)
updated_df = df.update(updates)
print("Updated DataFrame:\n", updated_df)

Here,

  • If no on key is provided, updates happen by row index.
  • Missing values (None) in the update DataFrame do not overwrite existing values.
  • Only columns that exist in both DataFrames are updated.

# Output:
# Updated DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 1800     ┆ 35days   │
│ PySpark ┆ 27000 ┆ 3000     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ 26000 ┆ 2000     ┆ 50days   │
└─────────┴───────┴──────────┴──────────┘

Update with a Join Key (on)

Instead of updating by row index, you can update values based on a key column using the on parameter. This works like a SQL-style update, where values are modified only for matching rows.


# Update DataFrame (contains updates for some courses)
updates = pl.DataFrame({
    "Courses": ["Spark", "Pandas"],  # Matching key column
    "Fee": [23000, 27000],  # Updated values
    "Discount": [1500, 1800]  # Updated values
})

# Applying update using 'on' key
updated_df = df.update(updates, on="Courses")
print("\nUpdated DataFrame:\n", updated_df)

# Output:
# Updated DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 23000 ┆ 1500     ┆ 35days   │
│ PySpark ┆ 25000 ┆ 1200     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ 27000 ┆ 1800     ┆ 50days   │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The "Courses" column is used as the join key (on="Courses").
  • Only rows with matching "Courses" values are updated.
  • Other rows remain unchanged.
  • Columns that do not exist in the update DataFrame are left as-is (Duration is not affected).
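
As the last point notes, only columns present in the update DataFrame are modified. You can use this to update a single column by passing just the key and that column. A small sketch with hypothetical values, reusing df from above:


# Update only the Fee column; Discount and Duration are left untouched
fee_updates = pl.DataFrame({
    "Courses": ["Hadoop"],   # key of the row to change
    "Fee": [26500]           # hypothetical new value
})

updated_df = df.update(fee_updates, on="Courses")
print(updated_df)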

Updating with left_on and right_on

Sometimes, the column names in the two DataFrames don’t match. In such cases, you can use left_on and right_on to specify the keys explicitly.


# Update DataFrame with a different column name
updates = pl.DataFrame({
    "Course_ID": ["Spark", "Pandas"],  # Different key column name
    "Fee": [25000, 29000],  # Updated values
    "Discount": [1800, 2200]  # Updated values
})

# Applying update with left_on and right_on
updated_df = df.update(updates, left_on="Courses", right_on="Course_ID")
print("Updated DataFrame:\n", updated_df)

# Output:
# Updated DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 25000 ┆ 1800     ┆ 35days   │
│ PySpark ┆ 25000 ┆ 1200     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ 29000 ┆ 2200     ┆ 50days   │
└─────────┴───────┴──────────┴──────────┘

Here,

  • left_on="Courses" and right_on="Course_ID" ensure proper column mapping.
  • Only rows with matching values in the respective columns get updated.
  • Other columns remain unchanged (Duration is unaffected).
  • Useful when column names do not match between DataFrames.
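
If you would rather keep a single key name, an equivalent approach (assuming you are free to rename columns) is to align the key with rename() first and then use on:


# Sketch: rename the key column in the update frame, then join on "Courses"
renamed_updates = updates.rename({"Course_ID": "Courses"})
updated_df = df.update(renamed_updates, on="Courses")
print(updated_df)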

Updating with how="inner" (Only Matching Rows)

By default, update() uses a left join, meaning it updates the existing DataFrame while keeping all original rows. However, if you only want to keep rows that have a match in the update DataFrame, use how="inner".


# Update DataFrame (contains updates for some courses)
updates = pl.DataFrame({
    "Courses": ["Spark", "Pandas"],  # Matching key column
    "Fee": [25000, 30000],  # Updated values
    "Discount": [1200, 1500]  # Updated values
})

# Applying update with how="inner"
updated_df = df.update(updates, on="Courses", how="inner")
print("Updated DataFrame (Only Matching Rows):\n", updated_df)

# Output:
# Updated DataFrame (Only Matching Rows):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 25000 ┆ 1200     ┆ 35days   │
│ Pandas  ┆ 30000 ┆ 1500     ┆ 50days   │
└─────────┴───────┴──────────┴──────────┘

Here,

  • how="inner" removes rows that don’t have a match in the update DataFrame.
  • Only "Spark" and "Pandas" remain because they exist in both DataFrames.
  • The other rows ("PySpark", "Hadoop") are removed.
  • Useful when you want to filter out non-matching rows while updating.
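
Roughly the same result can be produced by filtering first and then updating; this sketch only illustrates the semantics, not how update() is implemented internally:


# Sketch: keep only courses that appear in the update frame, then update them
matching = df.filter(pl.col("Courses").is_in(updates["Courses"]))
updated_df = matching.update(updates, on="Courses")
print(updated_df)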

Updating with how="full" (Include New Rows)

Using how="full" ensures that all rows from both DataFrames are included in the result. If a row is present in both DataFrames, its values are updated. If a row exists only in the update DataFrame (other), it is appended as a new row.


# Update DataFrame (contains updates and new rows)
updates = pl.DataFrame({
    "Courses": ["Spark", "Pandas", "Scala"],  # "Scala" is a new course
    "Fee": [23000, 27000, 28000],  # Updated values for matching courses
    "Discount": [1500, 1800, 3000]  # Updated values
})

# Applying update with how="full"
updated_df = df.update(updates, on="Courses", how="full")
print("Updated DataFrame (Including New Rows):\n", updated_df)

# Output:
# Updated DataFrame (Including New Rows):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 23000 ┆ 1500     ┆ 35days   │
│ PySpark ┆ 25000 ┆ 1200     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ 27000 ┆ 1800     ┆ 50days   │
│ Scala   ┆ 28000 ┆ 3000     ┆ null     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • how="full" includes all rows from both DataFrames.
  • "Spark" and "Pandas" got updated.
  • "PySpark" and "Hadoop" remained unchanged.
  • "Scala" (a new row in the update DataFrame) was added.
  • Missing values (null) appear if a column exists only in one DataFrame (Duration for "Scala" is null).
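
Because columns missing from the update DataFrame come back as null for the new rows, you may want to fill them afterwards. A short follow-up sketch, where the fill value is only a placeholder:


# Fill the null Duration introduced for the new "Scala" row
filled_df = updated_df.with_columns(pl.col("Duration").fill_null("N/A"))
print(filled_df)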

Using include_nulls=True to Overwrite with Nulls

By default, null values in the update DataFrame do not overwrite existing values in the original DataFrame. If you want to allow null values to overwrite existing values, use include_nulls=True.


# Update DataFrame (contains null values)
updates = pl.DataFrame({
    "Courses": ["Spark", "Pandas"],
    "Fee": [23000, None],       # Pandas' Fee is set to NULL
    "Discount": [None, 1800],   # Spark's Discount is set to NULL
    "Duration": [None, None]    # Duration will be NULL for Spark & Pandas
})

# Applying update with include_nulls=True
updated_df = df.update(updates, on="Courses", how="left", include_nulls=True)
print("Updated DataFrame (Null Overwrites Allowed):\n", updated_df)

# Output:
# Updated DataFrame (Null Overwrites Allowed):
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Discount ┆ Duration │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ i64      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 23000 ┆ null     ┆ null     │
│ PySpark ┆ 25000 ┆ 1200     ┆ 40days   │
│ Hadoop  ┆ 24000 ┆ 2500     ┆ 65days   │
│ Pandas  ┆ null  ┆ 1800     ┆ null     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • include_nulls=True allows None values in the update DataFrame to overwrite existing values.
  • "Spark": Discount and Duration became null.
  • "Pandas": Fee and Duration became null.
  • "PySpark" and "Hadoop" remained unchanged because they were not in the update DataFrame.
  • Default behavior (include_nulls=False) would have ignored these null values.
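
Keep in mind that include_nulls=True applies to every column shared between the two DataFrames. To null out only specific columns, pass just the key and those columns; a hedged sketch:


# Null out only Discount; Fee and Duration keep their existing values
discount_only = updates.select(["Courses", "Discount"])
updated_df = df.update(discount_only, on="Courses", include_nulls=True)
print(updated_df)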

Conclusion

In this article, I have explained the Polars DataFrame update() method, including its syntax, parameters, usage, and how it returns a new DataFrame with updated values based on the specified options.

Happy Learning!!
