  • Post category:Pandas
  • Post last modified:July 8, 2024

In pandas, the sample() function returns a random sample of rows (or columns) from a DataFrame. It is particularly useful for testing, for exploration, or for working with a manageable subset of a large dataset.

In this article, I will explain the pandas DataFrame sample() method: its syntax, parameters, and usage, and how it returns a new object of the same type as the caller containing the sampled items.

Key Points –

  • The sample() function selects rows randomly from a DataFrame, allowing for randomized data exploration or partitioning.
  • It accepts parameters like n (number of rows), frac (fraction of rows), and replace (whether sampling with or without replacement).
  • By setting random_state, you can ensure reproducibility of the sampled results across different runs of the code.
  • If neither n nor frac is specified, sample() returns a single row by default.
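As a quick check of the last point, here is a minimal sketch that calls sample() with no arguments. The frame df_demo and its values are illustrative assumptions, not part of the article's dataset.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [22000, 25000, 23000],
})

# Neither n nor frac is given, so pandas falls back to a single row
one_row = df_demo.sample()
print(one_row.shape)  # (1, 2)
```

Which row comes back changes from run to run because no random_state is set.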

Pandas DataFrame sample() Introduction

Following is the syntax of the pandas DataFrame sample() function.


# Syntax of Pandas dataframe sample()
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Parameters of the DataFrame sample() Function

Following are the parameters of the DataFrame sample() function.

  • n – int, optional. Number of items to return. Defaults to 1 if frac is None. Cannot be used with frac.
  • frac – float, optional. Fraction of axis items to return. For example, frac=0.5 returns 50% of the rows. Cannot be used with n.
  • replace – bool, default False. Whether to sample with replacement. If True, the same row can be selected more than once.
  • weights – str or ndarray-like, optional. Weights that set each row's probability of being selected. A string is interpreted as the name of a DataFrame column holding the weights. Weights that do not sum to 1 are normalized. If not provided, rows are sampled with equal probability.
  • random_state – int, array-like, BitGenerator, numpy.random.RandomState, or numpy.random.Generator, optional. Seed or generator for the random number generator. If set to a particular integer, the same rows are returned on every run, making the sample reproducible.
  • axis – {0 or ‘index’, 1 or ‘columns’, None}, default None. Axis to sample. 0 or ‘index’ samples rows, and 1 or ‘columns’ samples columns. With the default None, a DataFrame is sampled by rows.
  • ignore_index – bool, default False. If True, the resulting index will be labeled 0, 1, …, n – 1.
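The axis and ignore_index parameters are not used in the examples later in this article, so here is a minimal sketch of both. The frame df_demo and the seed value 0 are assumptions chosen for illustration, and ignore_index requires pandas 1.3 or later.

```python
import pandas as pd

# Illustrative frame (hypothetical, a trimmed version of the course data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
})

# axis=1 (or axis="columns") samples columns instead of rows
two_cols = df_demo.sample(n=2, axis=1, random_state=0)
print(two_cols.columns.tolist())

# ignore_index=True relabels the sampled rows 0, 1, ..., n - 1
reindexed = df_demo.sample(n=3, ignore_index=True, random_state=0)
print(reindexed.index.tolist())  # [0, 1, 2]
```

Without ignore_index=True, the sampled rows keep their original index labels, which is usually what you want when you need to trace rows back to the source DataFrame.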

Return Value

It returns a new object of the same type as the caller (here, a DataFrame) containing the sampled rows or columns. The original DataFrame is not modified.
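To illustrate the return type, the sketch below samples from both a DataFrame and a Series. The names df_demo, sampled_df, and sampled_series are hypothetical.

```python
import pandas as pd

# Illustrative frame (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Sampling a DataFrame gives a DataFrame; sampling a Series gives a Series
sampled_df = df_demo.sample(n=2, random_state=1)
sampled_series = df_demo["Fee"].sample(n=2, random_state=1)

print(type(sampled_df).__name__)      # DataFrame
print(type(sampled_series).__name__)  # Series
print(len(df_demo))                   # 5, the caller is left unchanged
```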

Usage of Pandas DataFrame sample() Function

The sample() function can draw a fixed number of rows, a fraction of the rows, or a weighted selection, with or without replacement.

To run some examples of the pandas DataFrame sample() function, let’s create a pandas DataFrame using data from a dictionary.


import pandas as pd

# Course details used throughout the examples
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
    'Duration': ['35days', '35days', '40days', '30days', '25days']
}

df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

This yields the output below.

[Image: pandas DataFrame sample]

Sample Three Rows from a DataFrame

To sample 3 rows from a DataFrame, call the sample() function with the parameter n=3. Here it randomly selects 3 of the 5 rows in df, the DataFrame of courses, fees, discounts, and durations created above.


# Sample 3 rows from the DataFrame
df2 = df.sample(n=3)
print("Sampled DataFrame:\n", df2)

This yields output like the one below. The selected rows vary between runs.

[Image: pandas DataFrame sample]

Sample 20% of the Rows from a DataFrame

Alternatively, to sample 20% of the rows from a DataFrame, set the frac parameter of the sample() function to 0.2.


# Sample 20% of the rows from a dataframe
df2 = df.sample(frac=0.2)
print("Sampled DataFrame (20%):\n", df2)

# Output (the selected row varies between runs):
# Sampled DataFrame (20%):
#    Courses    Fee  Discount Duration
# 0   Spark  22000      1000   35days

In the above example, sample() with frac=0.2 randomly selects 20% of the rows. Since the DataFrame has 5 rows, the result contains 1 row, and which row is returned varies between runs.
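The number of rows returned for a given frac can be checked directly. The sketch below assumes that pandas rounds frac times the row count to the nearest integer, which matches the results shown in this article. The frame df_demo is a hypothetical stand-in.

```python
import pandas as pd

# Illustrative frame with 5 rows (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# 20% of 5 rows is 1 row; 60% of 5 rows is 3 rows
one_fifth = df_demo.sample(frac=0.2)
three_fifths = df_demo.sample(frac=0.6)

print(len(one_fifth))     # 1
print(len(three_fifths))  # 3
```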

Sample with Replacement

To sample with replacement from a DataFrame, you can set the replace parameter to True in the sample() function.


# Sample 50% of the rows from the DataFrame with replacement
df2 = df.sample(frac=0.5, replace=True)
print("Sampled 50% of the rows with replacement:\n", df2)

# Output:
# Sampled 50% of the rows with replacement:
#     Courses    Fee  Discount Duration
# 3   Python  24000      1200   30days
# 1  PySpark  25000      2300   35days

In the above example, sample() uses frac=0.5 to randomly select 50% of the rows and replace=True to allow sampling with replacement, so the same row can appear more than once in the result.
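One practical consequence of replace=True is that you can draw more rows than the DataFrame contains. The sketch below shows this and the ValueError raised when the same request is made without replacement. The frame df_demo, the seed 0, and the flag raised are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame with 5 rows (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# With replacement, 8 draws from 5 rows are allowed, so some rows must repeat
oversampled = df_demo.sample(n=8, replace=True, random_state=0)
print(len(oversampled))              # 8
print(oversampled.index.is_unique)   # False

# Without replacement, asking for more rows than exist raises ValueError
try:
    df_demo.sample(n=8)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```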

Sample with a Given Random State for Reproducibility

To sample with a given random state for reproducibility in pandas, you can use the random_state parameter within the sample() function. This parameter ensures that the same rows are sampled every time you run the code, which is useful for consistency in analysis or debugging.


# Sample 50% of the rows from the DataFrame 
# With a specific random state
df2 = df.sample(frac=0.5, random_state=42)
print("Sampled 50% of the rows with random state:\n", df2)

# Output:
# Sampled 50% of the rows with random state:
#     Courses    Fee  Discount Duration
# 1  PySpark  25000      2300   35days
# 4   Pandas  26000      2500   25days

In the above example, sample() uses frac=0.5 to randomly select 50% of the rows and random_state=42 to ensure reproducibility. Sampling is done without replacement here because replace keeps its default value of False.
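A simple way to confirm reproducibility is to sample twice with the same seed and compare the results. This is a minimal sketch; df_demo, first, and second are hypothetical names.

```python
import pandas as pd

# Illustrative frame (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Two calls with the same integer seed select the same rows in the same order
first = df_demo.sample(frac=0.5, random_state=42)
second = df_demo.sample(frac=0.5, random_state=42)

print(first.equals(second))  # True
```

The exact rows chosen for a given seed may still differ across pandas or NumPy versions, so a seed guarantees repeatability within one environment rather than across all of them.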

Sample with Weights

Sampling with weights in pandas allows you to specify probabilities for each row to be selected. This is useful when you want certain rows to have a higher probability of being sampled than others.


# Define weights 
weights = [0.1, 0.3, 0.2, 0.2, 0.2]

# Sample with weights
df2 = df.sample(n=3, weights=weights, replace=True, random_state=42)
print("Sampled rows with weights:\n", df2)

# Output:
# Sampled rows with weights:
#     Courses    Fee  Discount Duration
# 1  PySpark  25000      2300   35days
# 4   Pandas  26000      2500   25days
# 3   Python  24000      1200   30days

In the above example, weights is a list that assigns a selection probability to each row. The row at index 1 (PySpark) has the highest weight (0.3), so it is the most likely to be sampled. Passing weights=weights to sample() draws rows according to these probabilities. n=3 specifies the number of rows to sample, replace=True allows sampling with replacement, and random_state=42 ensures reproducibility.
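Weights can also come from a column of the DataFrame by passing the column name as a string. The sketch below assumes a hypothetical Weight column whose values do not sum to 1 (pandas normalizes them) and gives one row a weight of 0 so that it can never be selected.

```python
import pandas as pd

# Illustrative frame; the "Weight" column is a hypothetical addition
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Weight": [0, 3, 2, 2, 3],  # relative weights, normalized by pandas
})

# Pass the column name instead of a list of weights
picked = df_demo.sample(n=3, weights="Weight", random_state=42)
print(picked["Courses"].tolist())

# A row with weight 0 is never drawn
print("Spark" in picked["Courses"].tolist())  # False
```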

FAQ on Pandas DataFrame sample() Function

What is the purpose of the sample() function in pandas?

The sample() function in pandas is used to randomly sample rows (or columns) from a DataFrame. It helps in creating a smaller subset of the data for analysis or testing purposes.

How do you sample a specific number of rows?

To sample a specific number of rows from a pandas DataFrame, you can use the sample() function with the n parameter. The n parameter specifies the exact number of rows you want to randomly select.

How do you sample a fraction of rows?

To sample a fraction of rows from a pandas DataFrame, you can use the sample() function with the frac parameter. The frac parameter specifies the fraction of rows you want to randomly select.
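As one special case, frac=1 returns every row in random order, which is a common way to shuffle a DataFrame. This is a minimal sketch; df_demo and the seed 7 are illustrative assumptions.

```python
import pandas as pd

# Illustrative frame (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# frac=1 keeps all rows but returns them in a random order
shuffled = df_demo.sample(frac=1, random_state=7)
print(len(shuffled))  # 5
print(sorted(shuffled.index.tolist()))  # [0, 1, 2, 3, 4]
```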

How do you sample with weights?

To sample rows from a pandas DataFrame with weights, you can use the sample() function with the weights parameter. The weights parameter allows you to specify the probability of each row being selected.

What happens if you specify both n and frac?

You cannot use both n and frac simultaneously. If both are specified, pandas will raise a ValueError.
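The error can be reproduced with a short sketch like the one below. The frame df_demo and the flag raised are hypothetical, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Illustrative frame (hypothetical data)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Passing both n and frac is rejected with a ValueError
try:
    df_demo.sample(n=2, frac=0.5)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```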

Conclusion

In this article, you have learned the syntax, parameters, and usage of the pandas DataFrame sample() function, and how it returns a new object of the same type as the original containing items randomly sampled from it.

Happy Learning!!