In Polars, the sample() method randomly samples rows from a DataFrame. It is useful for testing, data exploration, visualization, and quick analysis when a smaller subset of a large dataset is enough. It also supports shuffling the sampled rows and reproducible results through a seed.
In this article, I will explain the Polars DataFrame sample() method, covering its syntax, parameters, and usage. The method returns a new DataFrame with the sampled rows and leaves the original DataFrame intact.
Key Points –
- The sample() method randomly selects rows from a Polars DataFrame.
- You can specify the number of rows to sample using the n parameter.
- Alternatively, you can use the fraction parameter to sample a proportion of the rows.
- n and fraction cannot be used together; if neither is given, a single row is sampled.
- Setting with_replacement=True allows the same row to be picked multiple times.
- The shuffle parameter determines whether the sampled rows are randomly reordered.
- Using seed makes the sample reproducible, so the same rows are selected on each run.
- If n is greater than the number of available rows and with_replacement=False, an error occurs.
- It returns a new DataFrame containing the sampled rows, leaving the original DataFrame unchanged.
- This method is useful for random row selection, bootstrapping, and testing with subsets of data.
Polars DataFrame sample() Introduction
Let’s look at the syntax of the sample() method.
# Syntax of sample()
DataFrame.sample(
n: int | Series | None = None,
*,
fraction: float | Series | None = None,
with_replacement: bool = False,
shuffle: bool = False,
seed: int | None = None,
) → DataFrame
Parameters of the Polars DataFrame sample()
Following are the parameters of the sample() method.
- n – (int | Series | None, default=None) The number of rows to sample. Cannot be combined with fraction; if both are omitted, one row is sampled.
- fraction – (float | Series | None, default=None) The fraction of rows to sample (e.g., 0.2 means 20% of rows). Cannot be combined with n.
- with_replacement – (bool, default=False) If True, sampling is done with replacement, so the same row may appear more than once. If False, each row is selected at most once.
- shuffle – (bool, default=False) If True, the sampled rows are shuffled. If False, the order of the returned rows is neither guaranteed to match the original order nor fully random.
- seed – (int | None, default=None) A random seed for reproducibility. If None, a random seed is generated for each call.
Return Value
This function returns a new Polars DataFrame containing the sampled rows.
Usage of Polars DataFrame sample() Method
The sample() method in Polars allows you to randomly extract a subset of rows from a DataFrame. You can define the number of rows (n) or a fraction of rows (fraction), enable sampling with replacement, and customize other options.
First, let’s create a Polars DataFrame.
import polars as pl
# Creating a new Polars DataFrame
technologies = {
    'Courses': ["Spark", "Hadoop", "Python", "Pandas"],
    'Fees': [22000, 25000, 20000, 26000],
    'Duration': ['30days', '50days', '40days', '60days'],
    'Discount': [1000, 1500, 1200, 2000]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
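This yields the output below.
# Output:
# Original DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
│ Python  ┆ 20000 ┆ 40days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 60days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘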
You can use the sample() method in Polars to select a fixed number of rows (n) randomly from the DataFrame.
# Sampling 2 random rows
df2 = df.sample(n=2)
print("Sampled DataFrame:\n", df2)
Here,
- n=2 ensures exactly 2 rows are selected at random.
- Each run may return different results unless a seed is set.
- The original DataFrame remains unchanged.
- If you want reproducible results, use seed (e.g., df.sample(n=2, seed=42)).
Sample a Fraction of Rows (fraction)
You can sample a fraction of rows from a Polars DataFrame using the sample() method along with the fraction parameter.
# Sampling 50% of the rows
df2 = df.sample(fraction=0.5)
print("Sampled DataFrame:\n", df2)
# Output:
# Sampled DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Pandas  ┆ 26000 ┆ 60days   ┆ 2000     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
Here,
- fraction=0.5 selects 50% of the rows randomly.
- Each run may return different results unless a seed is set.
- The original DataFrame remains unchanged.
- Works well for downsampling large datasets.
Use a Random Seed (seed) for Reproducibility
By default, the sample() method in Polars produces different results with each execution. To maintain consistency across multiple runs, use the seed parameter. With a fixed seed, the same rows are selected every time the code is executed with the same data and Polars version.
# Sample 50% of rows with a fixed seed
df2 = df.sample(fraction=0.5, seed=42)
print("Sampled DataFrame (with seed=42):\n", df2)
# Output:
# Sampled DataFrame (with seed=42):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘
Here,
- Setting seed=42 (or any fixed number) ensures that the random sampling results are the same every time the code runs.
- fraction=0.5 selects 50% of the rows randomly.
- Useful for experiments, testing, and reproducibility in data analysis.
- Works with both n (fixed number of rows) and fraction (percentage-based sampling).
Shuffle Sampled Rows (shuffle=True)
To shuffle the sampled rows in a Polars DataFrame, use the shuffle=True parameter in the sample() method. This ensures that the selected rows appear in a random order.
# Sample 50% of rows and shuffle them
df2 = df.sample(fraction=0.5, shuffle=True, seed=42)
print("Sampled and Shuffled DataFrame:\n", df2)
# Output:
# Sampled and Shuffled DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘
Here,
- shuffle=True randomizes the order of sampled rows.
- Without shuffle, the order of the returned rows is not guaranteed: it is neither necessarily the original order nor fully random.
- To ensure reproducibility, use seed (e.g., df.sample(n=2, shuffle=True, seed=42)).
Sample with Replacement (with_replacement=True)
You can use the with_replacement=True parameter in the sample() method in Polars to allow rows to be selected multiple times.
# Sample 6 rows with replacement (some rows may appear multiple times)
df2 = df.sample(n=6, with_replacement=True, seed=42)
print("Sampled DataFrame (with replacement):\n", df2)
# Output:
# Sampled DataFrame (with replacement):
# shape: (6, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
│ Hadoop  ┆ 25000 ┆ 50days   ┆ 1500     │
│ Python  ┆ 20000 ┆ 40days   ┆ 1200     │
│ Python  ┆ 20000 ┆ 40days   ┆ 1200     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
Here,
- n=6 selects 6 rows (more than the original 4 rows).
- with_replacement=True allows rows to appear multiple times in the sampled output.
- seed=42 ensures reproducibility.
- Useful when generating bootstrap samples.
- Can be combined with shuffle and seed for controlled randomness.
Conclusion
In summary, the sample() method in Polars offers a versatile approach to randomly selecting rows from a DataFrame. It supports both fixed row sampling (n) and fraction-based sampling (fraction). Additionally, options like shuffling (shuffle=True), replacement (with_replacement=True), and setting a random seed (seed) enhance its flexibility for various use cases.
Happy Learning!!