Creating separate test and train samples from a pandas DataFrame can be achieved by using DataFrame.sample(), or by applying sklearn’s train_test_split() function from the model_selection module. In this article, I will explain how to create test and train DataFrames by splitting the rows of a DataFrame.
The examples explained here will help you split the pandas DataFrame into two random samples (80% and 20%) for training and testing. Splitting like this makes the most sense when you have a large dataset; to keep this article simple I use a small DataFrame, but you can apply the same approaches to split DataFrames in your real projects.
Key Points –
- Use the sample() method in Pandas to randomly select rows from the DataFrame for creating test and train sets.
- Ensure proper stratification when dealing with imbalanced classes by passing the stratify parameter during the split, so the class distribution is maintained in both train and test sets (see the sketch after this list).
- Use the train_test_split() function from the sklearn.model_selection module to split the DataFrame into training and testing sets.
- Define the test size (the proportion of the dataset to include in the test split) and optionally set a random state for reproducibility.
- After splitting, you’ll typically have separate DataFrames for training and testing data, which can then be used for model training and evaluation.
- Specify the features and target variable for the split operation.
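Here is a minimal sketch of a stratified split. The DataFrame and its label column are hypothetical, made up only to illustrate the stratify parameter; they are not part of this article’s example data.
# Hypothetical stratified split: keep the class ratio of the 'label' column
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.DataFrame({'feature': range(10),
                     'label': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]})  # imbalanced classes
train, test = train_test_split(data, test_size=0.3, stratify=data['label'], random_state=42)
print(train['label'].value_counts())  # class ratio preserved in the train set
print(test['label'].value_counts())   # and in the test set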
1. Quick Examples to Create Test and Train Samples
If you are in a hurry, below are some quick examples of how to create test and train samples from a pandas DataFrame.
# Quick examples to create test and train samples
# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
# Using sklearn's train_test_split()
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
# Splitting by index with train_test_split() from sklearn.model_selection
from sklearn.model_selection import train_test_split
y = df.pop('Courses')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.iloc[X_train]
# Using numpy.random.rand()
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the column names Courses, Fee, and Duration.
# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
3 Python 24000 None
4 PySpark 26000 NaN
2. Using DataFrame.sample() Method to Get Test & Train Samples
DataFrame.sample() returns a random sample of rows from the DataFrame. You can use this to select the train sample and then drop those rows to get the test sample.
The random_state parameter controls the shuffling applied to the data before the split. By defining the random_state, we can reproduce the same split of the data across multiple calls.
Because sample() picks rows at random, the data is effectively shuffled before splitting; frac=0.8 keeps a random 80% of the rows for training, and dropping their index from the original DataFrame leaves the remaining 20% as the test set.
# Using DataFrame.sample() with the random_state argument
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)
Yields below output.
# Output:
Courses Fee Duration
3 Python 24000 None
4 PySpark 26000 NaN
0 Spark 22000 30days
1 PySpark 25000 50days
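To verify the reproducibility claim above, here is a quick check (a minimal sketch, not part of the original example): drawing the sample twice with the same random_state returns the same rows.
# The same random_state produces the same sample on every call
s1 = df.sample(frac=0.8, random_state=200)
s2 = df.sample(frac=0.8, random_state=200)
print(s1.index.equals(s2.index))  # True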
3. Use sklearn to Create Test and Train Samples
The train_test_split() function of the sklearn library can handle pandas DataFrames as well as arrays, so we can simply call it with the DataFrame and the desired parameters.
test_size: This parameter represents the proportion of the dataset that should be included in the test split. The default value is 0.25, meaning that if we don’t specify test_size, the resulting split consists of 75% train and 25% test data.
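As a quick illustration of that default (a minimal sketch, separate from the main example below), calling train_test_split() without test_size on our 5-row DataFrame yields 3 train rows and 2 test rows.
# With no test_size given, the default 0.25 applies (75% train / 25% test)
from sklearn.model_selection import train_test_split
train_def, test_def = train_test_split(df)
print(len(train_def), len(test_def))  # 3 2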
# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)
Yields below output.
# Output:
Courses Fee Duration
4 PySpark 26000 NaN
2 Spark 23000 30days
3 Python 24000 None
1 PySpark 25000 50days
4. Using the sklearn.model_selection Module
sklearn.model_selection is the module that provides tools for splitting data and evaluating models on data they haven’t seen. Selecting a proper model allows you to generate accurate results: you train your model on one dataset and then test it against another. In the example below, train_test_split() is applied to the DataFrame index, with the Courses column popped off as the target, so the training rows can be recovered from the DataFrame with iloc.
# Split by index using train_test_split() from sklearn.model_selection
from sklearn.model_selection import train_test_split
y = df.pop('Courses')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
print(X.iloc[X_train])
Yields below output.
# Output:
Fee Duration
4 26000 NaN
0 22000 30days
2 23000 30days
3 24000 None
5. Using Numpy.random.rand() Method
np.random.rand() generates random numbers from the standard uniform distribution (i.e., the uniform distribution over [0, 1)) and returns them as a NumPy array, a data structure used for storing and manipulating numeric data.
np.random.rand(len(df)) is therefore an array of size len(df) with uniformly distributed float values in the range [0, 1). The < 0.8 comparison is applied element-wise and produces a boolean mask: values below 0.8 become True and values greater than or equal to 0.8 become False, so roughly 80% of the rows fall into the training set.
# Using numpy.random.rand() method
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
print(test)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
NOTE: Alternatively, as long as msk is of dtype bool (here it is a NumPy boolean array), df[msk], df.loc[msk], and df.iloc[msk] all return the same result.
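A quick check of that claim (a minimal sketch, reusing the msk array from above):
# All three selections return identical rows for a NumPy boolean mask
print(df[msk].equals(df.loc[msk]))   # True
print(df[msk].equals(df.iloc[msk]))  # True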
6. Complete Examples of Create Test and Train Samples DataFrame
Below are some complete examples of creating test and train samples from a pandas DataFrame.
# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)
# Using numpy.random.rand() method
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
print(test)
# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)
# Split by index using train_test_split() from sklearn.model_selection
from sklearn.model_selection import train_test_split
y = df.pop('Courses')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
print(X.iloc[X_train])
Frequently Asked Questions on Creating Test and Train Samples from DataFrame
How do you split a pandas DataFrame into train and test sets?
You can use various methods, but a common approach is to use the train_test_split function from the sklearn.model_selection module. This function allows you to specify the size of the test set and handles the splitting process efficiently.
How do you handle the features and the target variable when splitting?
You can specify the columns representing your features and the target variable independently when splitting the DataFrame. This allows you to ensure that the train and test sets contain the appropriate data for modeling and evaluation.
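A minimal sketch of this, using Fee as a stand-in feature and Courses as the target (the column choice is illustrative only):
# Hypothetical split of features (X) and target (y) from the example DataFrame
from sklearn.model_selection import train_test_split
X = df[['Fee']]    # feature column(s)
y = df['Courses']  # target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)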
What should you consider when splitting a DataFrame into train and test sets?
It’s crucial to consider factors such as data distribution, class imbalance, and the size of your dataset. Techniques like stratified sampling can help maintain the distribution of classes in both train and test sets, ensuring that your model learns effectively and generalizes well to unseen data.
How do you evaluate the model after splitting?
You can evaluate the performance of your model using metrics such as accuracy, precision, recall, or F1-score on the test set. Additionally, techniques like cross-validation can provide more robust estimates of model performance by averaging results over multiple train-test splits.
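Here is a minimal cross-validation sketch; the classifier and the random feature matrix are hypothetical stand-ins, not data from this article.
# Cross-validation averages scores over several train-test splits
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo = np.random.rand(100, 3)        # hypothetical feature matrix
y_demo = np.random.randint(0, 2, 100)  # hypothetical binary labels
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(scores.mean())  # average accuracy over the 5 folds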
Why is it important to shuffle the data before splitting?
Shuffling the data before splitting helps prevent any bias that may arise from the ordering of the dataset. This ensures that both the train and test sets are representative of the overall data distribution and can improve the generalization ability of your model.
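For reference, train_test_split() shuffles by default; here is a minimal sketch of turning it off when row order must be preserved (for example, with time-series data).
# Shuffling is on by default; pass shuffle=False to keep the original row order
train_ord, test_ord = train_test_split(df, test_size=0.2, shuffle=False)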
Conclusion
In this article, you have learned how to create test and train samples from a pandas DataFrame by using DataFrame.sample() together with DataFrame.drop(), by applying sklearn’s train_test_split() from the model_selection module, and by building a boolean mask with numpy.random.rand(), with examples.
Related Articles
- Pandas Convert String Column To DateTime
- Convert List of Dictionaries to Pandas DataFrame
- Sum Pandas DataFrame Columns With Examples
- How to Print Pandas DataFrame without Index
- Check If a Column Exists in Pandas DataFrame
- How to Split Pandas DataFrame?
- Pandas Get Total / Sum of Columns
- How to Change Column Name in Pandas
- Convert Pandas Timestamp to Datetime
- Pandas Check If DataFrame is Empty
- Pandas Add Column based on Another Column
- Pandas Get First Row Value of a Given Column
- How to Generate Time Series Plot in Pandas