In pandas, the sample() function generates a random sample of rows from a DataFrame. It is particularly useful for testing, exploration, or working with a subset of a large dataset.
In this article, I will explain the Pandas DataFrame sample() method: its syntax, parameters, and usage, and how it returns a new object of the same type as the caller containing the sampled items.
Key Points –
- The sample() function selects rows randomly from a DataFrame, allowing for randomized data exploration or partitioning.
- It accepts parameters like n (number of rows), frac (fraction of rows), and replace (whether sampling with or without replacement).
- By setting random_state, you can ensure reproducibility of the sampled results across different runs of the code.
- If neither n nor frac is specified, sample() returns a single row by default.
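Two of the points above, the one-row default and the use of sample() for partitioning, can be sketched as follows. This is a minimal illustration, not a recommended splitting recipe: the small DataFrame and the names one_row, train, and test are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# With neither n nor frac given, sample() returns exactly one row.
one_row = df.sample()
print(len(one_row))  # 1

# Partitioning: draw 80% of the rows, then keep the remaining rows
# as a hold-out set by dropping the sampled index labels.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)
print(len(train), len(test))  # 4 1
```

Dropping by index label only works cleanly when the index labels are unique, as they are here.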
Pandas DataFrame sample() Introduction
Following is the syntax of the pandas DataFrame sample() function.
# Syntax of Pandas dataframe sample()
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Parameters of the DataFrame sample() Function
Following are the parameters of the DataFrame sample() function.
- n – int, optional. Number of items to return. Defaults to one row if frac is None. Cannot be used with frac.
- frac – float, optional. Fraction of axis items to return. For example, frac=0.5 returns 50% of the rows. Cannot be used with n.
- replace – bool, default False. Whether to sample with replacement. If True, the same row can be selected more than once.
- weights – str or ndarray-like, optional. Weights to assign to the sampled rows. If not provided, rows are sampled with equal probability.
- random_state – int or numpy.random.RandomState, optional. Seed for the random number generator. If set to a particular integer, the same rows are returned on every run, making the sample reproducible.
- axis – {0 or 'index', 1 or 'columns', None}, default None. Axis to sample. 0 or 'index' samples rows, and 1 or 'columns' samples columns. With the default of None, a DataFrame samples rows.
- ignore_index – bool, default False. If True, the resulting index will be labeled 0, 1, …, n – 1.
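The axis and ignore_index parameters are not exercised in the examples later in this article, so here is a short sketch of both. It assumes a pandas version that supports ignore_index (it was added in pandas 1.3). The variable names two_cols and three_rows are made up for this illustration.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
})

# axis=1 samples columns instead of rows; all five rows are kept.
two_cols = df.sample(n=2, axis=1, random_state=0)
print(two_cols.shape)  # (5, 2)

# ignore_index=True relabels the sampled rows 0, 1, ..., n - 1
# instead of keeping their original index labels.
three_rows = df.sample(n=3, ignore_index=True, random_state=0)
print(three_rows.index.tolist())  # [0, 1, 2]
```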
Return Value
It returns a new DataFrame containing the sampled rows.
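As a quick check of the return value, the sketch below shows that sample() leaves the caller untouched and returns an object of the caller's type: a DataFrame from a DataFrame, and a Series from a Series. The data and variable names are assumptions for this example only.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

sampled_df = df.sample(n=2, random_state=1)
sampled_series = df["Fee"].sample(n=2, random_state=1)

print(type(sampled_df).__name__)      # DataFrame
print(type(sampled_series).__name__)  # Series

# The original DataFrame still has all five rows.
print(len(df))  # 5

# Sampled rows keep the index labels they had in the original.
print(bool(sampled_df.index.isin(df.index).all()))  # True
```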
Usage of Pandas DataFrame sample() Function
The sample() function can draw a fixed number of rows, a fraction of the rows, or a weighted sample from a DataFrame.
To run some examples of the Pandas DataFrame sample() function, let’s create a Pandas DataFrame using data from a dictionary.
import pandas as pd

# Course details used throughout the examples
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
    'Duration': ['35days', '35days', '40days', '30days', '25days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
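# Output:
# Create DataFrame:
# Courses Fee Discount Duration
# 0 Spark 22000 1000 35days
# 1 PySpark 25000 2300 35days
# 2 Hadoop 23000 1000 40days
# 3 Python 24000 1200 30days
# 4 Pandas 26000 2500 25days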
Sample Three Rows from a DataFrame
To sample 3 rows from a DataFrame, call sample() with n=3. Using the DataFrame df created above from the technologies dictionary, the following code randomly selects 3 of its 5 rows.
# Sample 3 rows from the DataFrame
df2 = df.sample(n=3)
print("Sampled DataFrame:\n", df2)
Because the rows are chosen at random, the three rows returned differ from run to run unless random_state is set.
Sample 20% of the Rows from a DataFrame
Alternatively, to sample 20% of the rows from a DataFrame, set the frac parameter of sample() to 0.2.
# Sample 20% of the rows from a dataframe
df2 = df.sample(frac=0.2)
print("Sampled DataFrame (20%):\n", df2)
# Output:
# Sampled DataFrame (20%):
# Courses Fee Discount Duration
# 0 Spark 22000 1000 35days
In the above example, sample() with frac=0.2 randomly selects 20% of the rows. Since the DataFrame has 5 rows, this returns a single row, and the row shown will vary between runs.
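To make the relationship between frac and the number of returned rows concrete, here is a small sketch on a five-row DataFrame. It also shows frac=1, which returns every row once in random order and is commonly used to shuffle a DataFrame. The data and the names one_row and shuffled are assumptions for this illustration.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# 20% of 5 rows is a single row.
one_row = df.sample(frac=0.2, random_state=1)
print(len(one_row))  # 1

# frac=1 returns every row exactly once, in random order (a shuffle).
shuffled = df.sample(frac=1, random_state=1)
print(len(shuffled))  # 5
print(sorted(shuffled.index.tolist()))  # [0, 1, 2, 3, 4]
```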
Sample with Replacement
To sample with replacement from a DataFrame, set the replace parameter to True in the sample() function.
# Sample 50% of the rows from the DataFrame with replacement
df2 = df.sample(frac=0.5, replace=True)
print("Sampled 50% of the rows with replacement:\n", df2)
# Output:
# Sampled 50% of the rows with replacement:
# Courses Fee Discount Duration
# 3 Python 24000 1200 30days
# 1 PySpark 25000 2300 35days
In the above example, frac=0.5 randomly selects 50% of the rows, and replace=True allows the same row to be selected more than once.
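One practical consequence of replace=True is that the sample may be larger than the DataFrame itself. The sketch below, using made-up data and variable names, draws 8 rows from a 5-row DataFrame and shows that the same request without replacement raises a ValueError.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# With replacement, n may exceed the number of rows.
big = df.sample(n=8, replace=True, random_state=0)
print(len(big))  # 8

# Eight draws from five rows must repeat at least one row.
print(big.index.has_duplicates)  # True

# Without replacement, asking for more rows than exist fails.
try:
    df.sample(n=8)  # replace defaults to False
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```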
Sample with a Given Random State for Reproducibility
To make a sample reproducible, pass the random_state parameter to sample(). With a fixed random_state, the same rows are sampled every time you run the code, which is useful for consistency in analysis or debugging.
# Sample 50% of the rows from the DataFrame
# With a specific random state
df2 = df.sample(frac=0.5, random_state=42)
print("Sampled 50% of the rows with random state:\n", df2)
# Output:
# Sampled 50% of the rows with random state:
# Courses Fee Discount Duration
# 1 PySpark 25000 2300 35days
# 4 Pandas 26000 2500 25days
In the above example, frac=0.5 randomly selects 50% of the rows and random_state=42 ensures reproducibility. Because replace is left at its default of False, no row can appear twice.
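A simple way to convince yourself of this is to draw the sample twice with the same seed and compare the results. The sketch below uses made-up data and names (first, second) for the purpose.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Two independent calls with the same seed select the same rows
# in the same order.
first = df.sample(frac=0.5, random_state=42)
second = df.sample(frac=0.5, random_state=42)
print(first.equals(second))  # True
```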
Sample with Weights
Sampling with weights in pandas allows you to specify probabilities for each row to be selected. This is useful when you want certain rows to have a higher probability of being sampled than others.
# Define weights
weights = [0.1, 0.3, 0.2, 0.2, 0.2]
# Sample with weights
df2 = df.sample(n=3, weights=weights, replace=True, random_state=42)
print("Sampled rows with weights:\n", df2)
# Output:
# Sampled rows with weights:
# Courses Fee Discount Duration
# 1 PySpark 25000 2300 35days
# 4 Pandas 26000 2500 25days
# 3 Python 24000 1200 30days
In the above example, weights is a list that assigns a selection probability to each row. Row 1 (PySpark) has the highest weight (0.3), so it is the most likely to be sampled. Passing weights=weights to sample() draws rows according to these probabilities; n=3 specifies the number of rows to sample, replace=True allows sampling with replacement, and random_state=42 ensures reproducibility.
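Two further behaviours of weights are worth a quick sketch: a weight of 0 excludes a row entirely, and weights can also be given as the name of a DataFrame column. Weights that do not sum to 1 are normalized by pandas. The data and the names zero_weights, picked, and by_discount are assumptions for this illustration.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Discount": [1000, 2300, 1000, 1200, 2500],
})

# Rows with weight 0 can never be selected, so drawing three rows
# without replacement must return exactly rows 2, 3 and 4.
# These weights sum to 3; pandas normalizes them to sum to 1.
zero_weights = [0, 0, 1, 1, 1]
picked = df.sample(n=3, weights=zero_weights, random_state=0)
print(sorted(picked.index.tolist()))  # [2, 3, 4]

# A column name can be used as the weights: rows with a larger
# Discount are proportionally more likely to be drawn.
by_discount = df.sample(n=2, weights="Discount", random_state=0)
print(len(by_discount))  # 2
```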
FAQ on Pandas DataFrame sample() Function
The sample() function in pandas is used to randomly sample rows (or columns) from a DataFrame. It helps in creating a smaller subset of the data for analysis or testing purposes.
To sample a specific number of rows from a pandas DataFrame, you can use the sample() function with the n parameter. The n parameter specifies the exact number of rows you want to randomly select.
To sample a fraction of rows from a pandas DataFrame, you can use the sample() function with the frac parameter. The frac parameter specifies the fraction of rows you want to randomly select.
To sample rows from a pandas DataFrame with weights, you can use the sample() function with the weights parameter. The weights parameter allows you to specify the probability of each row being selected.
You cannot use both n and frac simultaneously. If both are specified, pandas will raise a ValueError.
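The last point can be checked directly. The following sketch, on made-up data, passes both n and frac and catches the resulting ValueError; the exact wording of the error message may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical five-row DataFrame used only for this illustration.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Passing both n and frac is rejected.
try:
    df.sample(n=2, frac=0.5)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```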
Conclusion
In this article, you have learned the Pandas DataFrame sample() function: its syntax, parameters, and usage, and how it returns a new object of the same type as the original, containing n items randomly sampled from the original object.
Happy Learning!!