Polars DataFrame sample() Method

  • Post category: Polars
  • Post last modified: February 14, 2025

In Polars, the sample() method randomly samples rows from a DataFrame. It is useful for testing, data exploration, visualization, or any quick analysis where a smaller subset of a large dataset is enough. Note that sample() operates on a DataFrame that is already in memory, so it reduces the cost of downstream work rather than the cost of loading the data. It also supports shuffling the result and reproducible sampling through a seed.


In this article, I will explain the Polars DataFrame sample() method, covering its syntax, parameters, and usage. This function returns a new DataFrame with the sampled rows while keeping the original DataFrame intact.

Key Points –

  • The sample() method is used to randomly select rows from a Polars DataFrame.
  • You can specify the number of rows to sample using the n parameter.
  • Alternatively, you can use the fraction parameter to sample a percentage of rows.
  • n and fraction cannot be used together; if neither is given, a single row is sampled.
  • Setting with_replacement=True allows the same row to be picked multiple times.
  • The shuffle parameter determines whether the sampled rows are randomly reordered; without it, the row order is neither stable nor fully random.
  • Using seed ensures reproducibility, meaning the same random sample is generated each time.
  • If n is greater than the number of available rows and with_replacement=False, an error occurs.
  • It returns a new DataFrame containing the sampled rows, leaving the original DataFrame unchanged.
  • This function is useful for random row selection, bootstrapping, and testing with subsets of data.

Polars DataFrame sample() Introduction

Following is the syntax of the sample() method.


# Syntax of sample()
DataFrame.sample(
    n: int | Series | None = None,
    *,
    fraction: float | Series | None = None,
    with_replacement: bool = False,
    shuffle: bool = False,
    seed: int | None = None,
) -> DataFrame

Parameters of the Polars DataFrame sample()

Following are the parameters of the sample() method.

  • n – (int | Series | None, default=None)
    • The number of rows to sample.
    • Cannot be combined with fraction; defaults to 1 if fraction is also None.
  • fraction – (float | Series | None, default=None)
    • The fraction of rows to sample (e.g., 0.2 means 20% of rows).
    • Cannot be combined with n.
  • with_replacement – (bool, default=False)
    • If True, sampling is done with replacement (allows duplicates).
    • If False, sampling is without replacement.
  • shuffle – (bool, default=False)
    • If True, the sampled rows are shuffled.
    • If False, the order of the returned rows is neither stable nor fully random.
  • seed – (int | None, default=None)
    • A random seed for reproducibility.

Return Value

This function returns a new Polars DataFrame containing the sampled rows.

Usage of Polars DataFrame sample() Method

The sample() method in Polars allows you to randomly extract a subset of rows from a DataFrame. You can define the number of rows (n), a fraction of rows (fraction), enable sampling with replacement, and customize other options.

First, let’s create a Polars DataFrame.


import polars as pl

# Creating a new Polars DataFrame
technologies = {
    'Courses': ["Spark", "Hadoop", "Python", "Pandas"],
    'Fees': [22000, 25000, 20000, 26000],
    'Duration': ['30days', '50days', '40days', '60days'],
    'Discount': [1000, 1500, 1200, 2000]
}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

This prints the original DataFrame, which has 4 rows and 4 columns (Courses, Fees, Duration, Discount).

You can use the sample() method in Polars to select a fixed number of rows (n) randomly from the DataFrame.


# Sampling 2 random rows
df2 = df.sample(n=2)
print("Sampled DataFrame:\n", df2)

Here,

  • n=2 ensures exactly 2 rows are selected at random.
  • Each run may return different results unless a seed is set.
  • The original DataFrame remains unchanged.
  • If you want reproducible results, use seed (e.g., df.sample(n=2, seed=42)).

Sample a Fraction of Rows (fraction)

You can sample a fraction of the rows of a Polars DataFrame by passing the fraction parameter to the sample() method.


# Sampling 50% of the rows
df2 = df.sample(fraction=0.5)
print("Sampled DataFrame:\n", df2)

# Output:
# Sampled DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Pandas  ┆ 26000 ┆ 60days   ┆ 2000     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • fraction=0.5 selects 50% of the rows randomly.
  • Each run may return different results unless a seed is set.
  • The original DataFrame remains unchanged.
  • Works well for downsampling large datasets.

Use a Random Seed (seed) for Reproducibility

By default, the sample() method in Polars uses a new random seed on every call, so results generally differ between executions. To get consistent results across multiple runs, pass the seed parameter; the same seed selects the same rows each time the code is executed.


# Sample 50% of rows with a fixed seed
df2 = df.sample(fraction=0.5, seed=42)
print("Sampled DataFrame (with seed=42):\n", df2)

# Output:
# Sampled DataFrame (with seed=42):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • Setting seed=42 (or any fixed number) ensures that the random sampling results are the same every time the code runs.
  • fraction=0.5 selects 50% of the rows randomly.
  • Useful for experiments, testing, and reproducibility in data analysis.
  • Works with both n (fixed number of rows) and fraction (percentage-based sampling).

Shuffle Sampled Rows (shuffle=True)

To shuffle the sampled rows in a Polars DataFrame, pass shuffle=True to the sample() method. The selected rows are then returned in random order.


# Sample 50% of rows and shuffle them
df2 = df.sample(fraction=0.5, shuffle=True, seed=42)
print("Sampled and Shuffled DataFrame:\n", df2)

# Output:
# Sampled and Shuffled DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • shuffle=True randomizes the order of sampled rows.
  • Without shuffle, the order of the returned rows is neither stable nor fully random, so do not rely on it matching the original order.
  • To ensure reproducibility, use seed (e.g., df.sample(n=2, shuffle=True, seed=42)).

Sample with Replacement (with_replacement=True)

You can use the with_replacement=True parameter in the sample() method in Polars to allow rows to be selected multiple times.


# Sample 6 rows with replacement (some rows may appear multiple times)
df2 = df.sample(n=6, with_replacement=True, seed=42)
print("Sampled DataFrame (with replacement):\n", df2)

# Output:
# Sampled DataFrame (with replacement):
# shape: (6, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
│ Python  ┆ 20000 ┆ 40days   ┆ 1200     │
│ Python  ┆ 20000 ┆ 40days   ┆ 1200     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • n=6 selects 6 rows (more than the original 4 rows).
  • with_replacement=True allows rows to appear multiple times in the sampled output.
  • seed=42 ensures reproducibility.
  • Useful when generating bootstrap samples. Can be combined with shuffle and seed for controlled randomness.

Conclusion

In summary, the sample() method in Polars offers a versatile approach to randomly selecting rows from a DataFrame. It supports both fixed row sampling (n) and fraction-based sampling (fraction). Additionally, options like shuffling (shuffle=True), replacement (with_replacement=True), and setting a random seed (seed) enhance its flexibility for various use cases.

Happy Learning!!
