You can create separate train and test samples from a pandas DataFrame by using `DataFrame.sample()` or by applying the `train_test_split()` function from scikit-learn's `sklearn.model_selection` module. In this article, I will explain how to create train and test DataFrames by splitting the rows of a DataFrame.

The examples explained here will help you split a pandas DataFrame into two random samples (80% and 20%) for training and testing. Such splits are most meaningful on a large dataset. To keep things simple, this article uses a very small DataFrame, but you can apply the same approaches to split DataFrames in your real projects.

**Key Points –**

- Utilize the `sample` method in Pandas to randomly select rows from the DataFrame for creating test and train sets.
- Ensure proper stratification if dealing with imbalanced classes by using the `stratify` parameter in the splitting process to maintain class distribution in both train and test sets.
- Use the `train_test_split` function from the `sklearn.model_selection` module to split the DataFrame into training and testing sets.
- Define the test size (proportion of the dataset to include in the test split) and optionally set a random state for reproducibility.
- After splitting, you’ll typically have separate DataFrames for training and testing data, which can then be used for model training and evaluation.
- Specify the features and target variable for the split operation.

## 1. Quick Examples to Create Test and Train Samples

If you are in a hurry, below are some quick examples of creating test and train samples from a Pandas DataFrame.

```
# Quick examples to create test and train samples
import numpy as np
from sklearn.model_selection import train_test_split

# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

# Using train_test_split()
train, test = train_test_split(df, test_size=0.2)

# Splitting features (X) and target (y) with train_test_split()
X = df.drop(columns=['Courses'])  # features; df itself is left unchanged
y = df['Courses']                 # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Using numpy.random.rand()
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
```

Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns `Courses`, `Fee`, and `Duration`.

```
# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
```

Yields below output.

```
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
3 Python 24000 None
4 PySpark 26000 NaN
```

## 2. Using DataFrame.sample() Method To get Test & Train Samples

`DataFrame.sample()` returns a random sample of rows from the DataFrame. You can use this to select the train and test samples.

The `random_state` parameter seeds the random number generator used for sampling. By setting `random_state`, we can reproduce the same split of the data across multiple calls.

With `frac=0.8`, `sample()` draws a random 80% of the rows for the train set. Passing the sampled index to `DataFrame.drop()` then leaves the remaining 20% of the rows as the test set.

```
# Using DataFrame.sample() Method by random_state arg.
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)
```

Yields below output.

```
# Output:
Courses Fee Duration
3 Python 24000 None
4 PySpark 26000 NaN
0 Spark 22000 30days
1 PySpark 25000 50days
```
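To check the reproducibility claim, here is a minimal sketch. The helper name `split_with_sample` is hypothetical (it is not a pandas function), and the approach assumes the DataFrame index is unique, since `drop()` removes rows by index label.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

def split_with_sample(frame, frac=0.8, seed=200):
    """Hypothetical helper: return (train, test) built with DataFrame.sample()."""
    train = frame.sample(frac=frac, random_state=seed)
    # Assumes a unique index: drop() removes every row whose label is in train
    test = frame.drop(train.index)
    return train, test

train_a, test_a = split_with_sample(df)
train_b, test_b = split_with_sample(df)

# The same seed selects the same rows on every call
same_split = train_a.index.equals(train_b.index)
# No row appears in both the train and the test set
no_overlap = train_a.index.intersection(test_a.index).empty
print(same_split, no_overlap, len(train_a), len(test_a))
```

If your index contains duplicate labels, calling `reset_index(drop=True)` before splitting is one way to make the `drop()` step safe.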

## 3. Use sklearn to Create Test and Train Samples

The `train_test_split()` function of the `sklearn` library can handle Pandas DataFrames as well as arrays. Therefore, we can simply call it with the DataFrame and the other parameters. By default, it shuffles the rows before splitting.

**test_size**: This parameter represents the proportion of the dataset to include in the test split. If neither `test_size` nor `train_size` is specified, `test_size` defaults to 0.25, so the resulting split consists of 75% train and 25% test data.

```
# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)
```

Yields below output.

```
# Output:
Courses Fee Duration
4 PySpark 26000 NaN
2 Spark 23000 30days
3 Python 24000 None
1 PySpark 25000 50days
```
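Because the call above does not fix a seed, its output changes from run to run. The following sketch shows two optional parameters of `train_test_split()` that control this: `random_state` and `shuffle`. The seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# A fixed random_state makes the shuffle, and therefore the split, repeatable
train1, test1 = train_test_split(df, test_size=0.2, random_state=42)
train2, test2 = train_test_split(df, test_size=0.2, random_state=42)
print(train1.index.equals(train2.index))

# shuffle=False keeps the original row order:
# the leading rows become the train set and the trailing rows the test set
train_ordered, test_ordered = train_test_split(df, test_size=0.2, shuffle=False)
print(train_ordered.index.tolist(), test_ordered.index.tolist())
```

Turning shuffling off is mainly useful for ordered data such as time series, where the test rows should come after the train rows.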

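## 4. Splitting Features and Target with sklearn.model_selection

`sklearn.model_selection` is the scikit-learn module that provides tools for splitting data and evaluating models, including `train_test_split()`. It is a module, not a function, so it cannot be called directly. You train your model on one dataset and then test it against another. When a DataFrame holds both the features and the target, you can separate the target column first and pass both to `train_test_split()` so that the rows stay aligned.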

```
# Split features (X) and target (y) with train_test_split()
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Courses'])  # features; df itself is left unchanged
y = df['Courses']                 # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train)
```

Yields below output.

```
# Output:
Fee Duration
4 26000 NaN
0 22000 30days
2 23000 30days
3 24000 None
```
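When features and target are split in one call, each pair of outputs keeps the same row labels. As a small sketch of that property, the code below rejoins `X_train` and `y_train` into a single train DataFrame. The seed value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Duration": ["30days", "50days", "30days", None, np.nan],
})

X = df.drop(columns=["Courses"])  # features; df is not modified
y = df["Courses"]                 # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Features and target share the same index, so they can be joined back together
train_df = X_train.join(y_train)
print(train_df)
```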

## 5. Using numpy.random.rand() Method

`np.random.rand()` generates random numbers from the standard uniform distribution over [0, 1) and returns them as a NumPy array.

`np.random.rand(len(df))` is an array of size `len(df)` with uniformly distributed float values in the range [0, 1). The `< 0.8` comparison is applied element-wise and returns a new boolean array: values `< 0.8` become `True` and values `>= 0.8` become `False`. Because each row is assigned independently, the split is only approximately 80/20, and on a small DataFrame the proportions can vary noticeably between runs.

```
# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)
```

Yields below output.

```
# Output:
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
```

NOTE: As long as `msk` is a boolean NumPy array with one entry per row, `df[msk]`, `df.iloc[msk]`, and `df.loc[msk]` return the same result.
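For a repeatable mask, one option is NumPy's seeded `Generator` API instead of the global `np.random.rand()`. This is a swapped-in variant rather than the method shown above, and the 1000-row DataFrame and the seed value 0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data, large enough for the proportions to settle near 80/20
df = pd.DataFrame({"Fee": range(1000)})

rng = np.random.default_rng(seed=0)   # seeded generator gives a repeatable mask
msk = rng.random(len(df)) < 0.8       # boolean NumPy array, one entry per row

train = df[msk]
test = df[~msk]

# Every row lands in exactly one of the two sets
print(len(train) + len(test) == len(df))
# Sizes are close to, but usually not exactly, 800 and 200
print(len(train), len(test))
# With a boolean NumPy array, all three indexers select the same rows
print(train.equals(df.loc[msk]), train.equals(df.iloc[msk]))
```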

## 6. Complete Examples of Create Test and Train Samples DataFrame

Below is a complete example of creating test and train samples from a pandas DataFrame.

```
# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

technologies = {
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan]
}
df = pd.DataFrame(technologies)
print(df)

# Using train_test_split()
train, test = train_test_split(df, test_size=0.2)
print(train)

# Using numpy.random.rand()
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
print(test)

# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)

# Splitting features (X) and target (y) with train_test_split()
X = df.drop(columns=['Courses'])  # features; df itself is left unchanged
y = df['Courses']                 # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train)
```

## Frequently Asked Questions on Creating Test and Train Samples from DataFrame

**How do I split a DataFrame into train and test sets using Pandas?**

You can use various methods, but a common approach is to use the `train_test_split` function from the `sklearn.model_selection` module. This function allows you to specify the size of the test set and can handle the splitting process efficiently.

**Can I specify the features and target variable separately when creating train and test sets?**

You can specify the columns representing your features and the target variable independently when splitting the DataFrame. This allows you to ensure that the train and test sets contain the appropriate data for modeling and evaluation.

**What should I consider when splitting data for machine learning tasks?**

It’s crucial to consider factors such as data distribution, class imbalance, and the size of your dataset. Techniques like stratified sampling can help maintain the distribution of classes in both train and test sets, ensuring that your model learns effectively and generalizes well to unseen data.
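As a sketch of stratified sampling, the example below passes the label column to the `stratify` parameter of `train_test_split()`. The DataFrame, its `label` column, and the 20/80 class ratio are made up for illustration, and the seed value 0 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 20 "rare" rows and 80 "common" rows
df = pd.DataFrame({
    "Fee": range(100),
    "label": ["rare"] * 20 + ["common"] * 80,
})

# stratify keeps the class proportions the same in both splits
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)

print(train["label"].value_counts())
print(test["label"].value_counts())
```

Note that stratification needs at least two rows in every class, so it would fail on the five-row example DataFrame used in this article, where `Python` appears only once.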

**How can I validate the effectiveness of my train-test split?**

You can evaluate the performance of your model using metrics such as accuracy, precision, recall, or F1-score on the test set. Additionally, techniques like cross-validation can provide more robust estimates of model performance by averaging results over multiple train-test splits.
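The repeated train-test splits behind cross-validation can be sketched with scikit-learn's `KFold`. This example only produces the splits and does not train or score a model. The ten-row DataFrame, the choice of 5 folds, and the seed value 0 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"Fee": range(10)})  # synthetic data

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen_test = []
for train_idx, test_idx in kf.split(df):
    # kf.split() yields positional indices, so use iloc to select rows
    train = df.iloc[train_idx]
    test = df.iloc[test_idx]
    fold_sizes.append((len(train), len(test)))
    seen_test.extend(test.index.tolist())

# Each fold trains on 8 rows and tests on the remaining 2
print(fold_sizes)
# Across the 5 folds, every row is used for testing exactly once
print(sorted(seen_test))
```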

**Should I shuffle the data before splitting into train and test sets?**

Shuffling the data before splitting helps prevent any bias that may arise from the ordering of the dataset. This ensures that both the train and test sets are representative of the overall data distribution and can improve the generalization ability of your model.

## Conclusion

In this article, you have learned how to create test and train samples from a pandas DataFrame by using `DataFrame.sample()` together with `DataFrame.drop()`, by applying the `train_test_split()` function from `sklearn.model_selection`, and by building a random boolean mask with `numpy.random.rand()`, with examples.
