In pandas, the `sample()` function is used to generate a random sample of rows from a DataFrame. It is particularly useful for testing, for exploration, or when you want to work with a subset of a large dataset.

In this article, I will explain the pandas DataFrame `sample()` method, covering its syntax, parameters, and usage, and how it returns a new object of the same type as the caller containing the sampled items.

**Key Points –**

- The `sample()` function selects rows randomly from a DataFrame, allowing for randomized data exploration or partitioning.
- It accepts parameters like `n` (number of rows), `frac` (fraction of rows), and `replace` (whether sampling with or without replacement).
- By setting `random_state`, you can ensure reproducibility of the sampled results across different runs of the code.
- If neither `n` nor `frac` is specified, `sample()` returns a single row by default.

## Pandas DataFrame sample() Introduction

Following is the syntax of the pandas DataFrame sample() function.

```
# Syntax of Pandas dataframe sample()
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
```

### Parameters of the DataFrame sample() Function

Following are the parameters of the DataFrame sample() function.

- `n` – int, optional. Number of items to return. Defaults to 1 if `frac` is `None`. Cannot be used with `frac`.
- `frac` – float, optional. Fraction of axis items to return. For example, `frac=0.5` returns 50% of the rows. Cannot be used with `n`.
- `replace` – bool, default `False`. Whether to sample with replacement. If `True`, the same row can be selected more than once.
- `weights` – str or ndarray-like, optional. Weights to assign to the sampled rows. If not provided, rows are sampled with equal probability.
- `random_state` – int or `numpy.random.RandomState`, optional. Seed for the random number generator. If set to a particular integer, the same rows are returned on every run, making the sample reproducible.
- `axis` – {0 or 'index', 1 or 'columns'}, default `None`, which means rows for a DataFrame. Axis to sample. `0` or `'index'` samples rows, and `1` or `'columns'` samples columns.
- `ignore_index` – bool, default `False`. If `True`, the resulting index will be labeled 0, 1, …, n – 1.
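As a rough illustration of the parameters that the later examples do not exercise, the sketch below uses a small hypothetical DataFrame (`demo`, which is not part of this article's data) to show the default single-row sample, column sampling with `axis=1`, and index relabeling with `ignore_index=True`. It assumes a pandas version that supports `ignore_index` (1.3 or later).

```python
import pandas as pd

# Hypothetical demo data, used only for this illustration
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": [100, 200, 300, 400],
})

# With neither n nor frac, sample() returns one row
one_row = demo.sample(random_state=0)

# axis=1 samples columns instead of rows
two_cols = demo.sample(n=2, axis=1, random_state=0)

# ignore_index=True relabels the result 0, 1, ..., n - 1
relabeled = demo.sample(n=3, ignore_index=True, random_state=0)

print(one_row.shape)          # (1, 3)
print(two_cols.shape)         # (4, 2)
print(list(relabeled.index))  # [0, 1, 2]
```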

### Return Value

It returns a new DataFrame containing the sampled rows.
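The following minimal sketch, built on a hypothetical `demo` DataFrame, suggests how to check that the result is a new object of the same type as the caller and that the original data is left unchanged.

```python
import pandas as pd

# Hypothetical demo data, used only for this illustration
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("abcde")})

# Sampling a DataFrame gives a DataFrame; sampling a Series gives a Series
sampled_df = demo.sample(n=2, random_state=1)
sampled_series = demo["x"].sample(n=2, random_state=1)

print(type(sampled_df).__name__)      # DataFrame
print(type(sampled_series).__name__)  # Series
print(len(demo))                      # 5, the original is unchanged
```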

## Usage of Pandas DataFrame sample() Function

The `sample()` function in pandas is quite versatile and useful for generating random samples from a DataFrame.

To run some examples of the pandas DataFrame `sample()` function, let's create a DataFrame from a dictionary.

```
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
'Fee' :[22000, 25000, 23000, 24000, 26000],
'Discount':[1000, 2300, 1000, 1200, 2500],
'Duration':['35days', '35days', '40days', '30days', '25days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
```


## Sample Three Rows from a DataFrame

To sample 3 rows from a DataFrame, you can use the `sample()` function from pandas. Here we use the DataFrame `df` created above from the dictionary `technologies`, which contains details about different courses, their fees, discounts, and durations. Calling `sample()` with the parameter `n=3` randomly selects 3 rows from the DataFrame.

```
# Sample 3 rows from the DataFrame
df2 = df.sample(n=3)
print("Sampled DataFrame:\n", df2)
```

Because the rows are selected at random, the output differs from run to run.

## Sample 20% of the Rows from a DataFrame

Alternatively, to sample 20% of the rows from a DataFrame using the `sample()` function in pandas, you can specify the `frac` parameter as 0.2.

```
# Sample 20% of the rows from a dataframe
df2 = df.sample(frac=0.2)
print("Sampled DataFrame (20%):\n", df2)
# Output (one possible result; the selected row varies between runs):
# Sampled DataFrame (20%):
#    Courses    Fee  Discount Duration
# 0    Spark  22000      1000   35days
```

In the above example, the `sample()` function is called with the parameter `frac=0.2` to randomly select 20% of the rows from the DataFrame, which is one row for this five-row DataFrame. Printing the sampled DataFrame shows the randomly selected row.
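To make the relationship between `frac` and the number of returned rows concrete, here is a small sketch on a hypothetical ten-row DataFrame (`demo`). It also shows `frac=1`, which is a common way to shuffle all rows.

```python
import pandas as pd

# Hypothetical demo data with 10 rows
demo = pd.DataFrame({"id": range(10)})

# frac=0.2 of 10 rows gives 2 rows
fifth = demo.sample(frac=0.2, random_state=0)

# frac=1 returns every row in random order, which shuffles the DataFrame
shuffled = demo.sample(frac=1, random_state=0)

print(len(fifth))     # 2
print(len(shuffled))  # 10
print(sorted(shuffled["id"]) == list(range(10)))  # True
```

If a clean index is needed after shuffling, `reset_index(drop=True)` can be chained onto the result.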

## Sample with Replacement

To sample with replacement from a DataFrame, you can set the `replace` parameter to `True` in the `sample()` function.

```
# Sample 50% of the rows from the DataFrame with replacement
df2 = df.sample(frac=0.5, replace=True)
print("Sampled 50% of the rows with replacement:\n", df2)
# Output (one possible result; the selected rows vary between runs):
# Sampled 50% of the rows with replacement:
#    Courses    Fee  Discount Duration
# 3   Python  24000      1200   30days
# 1  PySpark  25000      2300   35days
```

In the above example, the `sample()` function is called with `frac=0.5` to randomly select 50% of the rows, and with `replace=True` to allow sampling with replacement, so the same row can appear more than once.
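One practical consequence of `replace=True` is that the sample can be larger than the DataFrame itself. The sketch below, using a hypothetical three-row `demo` DataFrame, draws six rows with replacement and shows that the same request without replacement raises a `ValueError`.

```python
import pandas as pd

# Hypothetical demo data with 3 rows
demo = pd.DataFrame({"id": [0, 1, 2]})

# With replacement, the sample may be larger than the DataFrame
bootstrap = demo.sample(n=6, replace=True, random_state=0)
print(len(bootstrap))  # 6

# Without replacement, asking for more rows than exist fails
try:
    demo.sample(n=6)
    oversample_failed = False
except ValueError as err:
    oversample_failed = True
    print("ValueError:", err)
```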

## Sample with a Given Random State for Reproducibility

To sample with a given random state for reproducibility in pandas, you can use the `random_state` parameter within the `sample()` function. This parameter ensures that the same rows are sampled every time you run the code, which is useful for consistency in analysis or debugging.

```
# Sample 50% of the rows from the DataFrame
# With a specific random state
df2 = df.sample(frac=0.5, random_state=42)
print("Sampled 50% of the rows with random state:\n", df2)
# Output:
# Sampled 50% of the rows with random state:
#    Courses    Fee  Discount Duration
# 1  PySpark  25000      2300   35days
# 4   Pandas  26000      2500   25days
```

In the above example, the `sample()` function is called with `frac=0.5` to randomly select 50% of the rows and with `random_state=42` to ensure reproducibility. This example samples without replacement, since `replace` is left at its default of `False`.
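As a hedged sketch of why a fixed seed matters, the code below checks that two calls with the same `random_state` return identical results, then uses that property for a simple reproducible split. The `demo` DataFrame and the names `train` and `holdout` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical demo data with 10 rows
demo = pd.DataFrame({"id": range(10), "value": range(100, 110)})

# The same seed yields the same sample
first = demo.sample(frac=0.5, random_state=42)
second = demo.sample(frac=0.5, random_state=42)
print(first.equals(second))  # True

# Reproducible split: sample 80% for training, keep the rest as a holdout
train = demo.sample(frac=0.8, random_state=42)
holdout = demo.drop(train.index)
print(len(train), len(holdout))  # 8 2
```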

## Sample with Weights

Sampling with weights in pandas allows you to specify probabilities for each row to be selected. This is useful when you want certain rows to have a higher probability of being sampled than others.

```
# Define weights
weights = [0.1, 0.3, 0.2, 0.2, 0.2]
# Sample with weights
df2 = df.sample(n=3, weights=weights, replace=True, random_state=42)
print("Sampled rows with weights:\n", df2)
# Output:
# Sampled rows with weights:
#    Courses    Fee  Discount Duration
# 1  PySpark  25000      2300   35days
# 4   Pandas  26000      2500   25days
# 3   Python  24000      1200   30days
```

In the above example, `weights` is a list that assigns a selection probability to each row. Row 1 (`PySpark`) has the highest weight (0.3), so it is more likely to be sampled than any other row. Calling `sample()` with `weights=weights` samples rows according to these probabilities. `n=3` specifies the number of rows to sample, `replace=True` allows sampling with replacement, and `random_state=42` ensures reproducibility.
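Since `weights` also accepts a string, a column of the DataFrame can serve as the weights. The sketch below assumes a hypothetical `demo` DataFrame with a weight column named `w`; the values do not sum to 1, and pandas normalizes them. Rows whose weight is zero are never drawn.

```python
import pandas as pd

# Hypothetical demo data; "w" is an illustrative weight column
demo = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "w": [0.0, 0.0, 1.0, 3.0],
})

# Passing a column name uses that column's values as weights
picked = demo.sample(n=20, weights="w", replace=True, random_state=0)

# Only rows with a non-zero weight ("c" and "d") can appear in the sample
print(sorted(picked["item"].unique()))
```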

## FAQ on Pandas DataFrame sample() Function

**What is the purpose of the sample() function in pandas?**

The `sample()` function in pandas is used to randomly sample rows (or columns) from a DataFrame. It helps in creating a smaller subset of the data for analysis or testing purposes.

**How do you sample a specific number of rows?**

To sample a specific number of rows from a pandas DataFrame, you can use the `sample()` function with the `n` parameter. The `n` parameter specifies the exact number of rows you want to randomly select.

**How do you sample a fraction of rows?**

To sample a fraction of rows from a pandas DataFrame, you can use the `sample()` function with the `frac` parameter. The `frac` parameter specifies the fraction of rows you want to randomly select.

**How do you sample with weights?**

To sample rows from a pandas DataFrame with weights, you can use the `sample()` function with the `weights` parameter. The `weights` parameter allows you to specify the probability of each row being selected.

**What happens if you specify both n and frac?**

You cannot use both `n` and `frac` simultaneously. If both are specified, pandas will raise a `ValueError`.
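A minimal sketch of this behavior, using a hypothetical `demo` DataFrame, catches the error so the message can be inspected:

```python
import pandas as pd

# Hypothetical demo data
demo = pd.DataFrame({"id": range(5)})

try:
    # Passing both n and frac is not allowed
    demo.sample(n=2, frac=0.5)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```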

## Conclusion

In this article, you have learned about the pandas DataFrame `sample()` function, including its syntax, parameters, and usage, and how it returns a new object of the same type as the original containing items randomly sampled from it.

Happy Learning!!

