You can create train and test samples from a pandas DataFrame by using DataFrame.sample(), by applying the train_test_split() function from sklearn’s model_selection module, or by building a random boolean mask with numpy.random.rand(). In this article, I will explain how to create train and test samples by splitting the rows of a DataFrame.
The examples explained here will help you split a pandas DataFrame into two random samples (80% and 20%) for training and testing. Such a split is most meaningful when you have a large dataset. To keep this article simple, I am using a small DataFrame, but you can use the same approaches in your real projects.
1. Quick Examples to Create Test and Train Samples
If you are in a hurry, below are some quick examples of creating test and train samples from a pandas DataFrame.
# Below are some quick examples
import numpy as np
from sklearn.model_selection import train_test_split

# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

# Using train_test_split() from sklearn.model_selection
train, test = train_test_split(df, test_size=0.2)

# Using numpy.random.rand()
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]

# Splitting features (X) and target (y) with train_test_split()
# Note: df.pop() removes the 'Courses' column from df in place
y = df.pop('Courses')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # X_train holds index labels, so select with .loc
Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, and Duration.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
3 Python 24000 None
4 PySpark 26000 NaN
2. Using DataFrame.sample() Method to Get Test & Train Samples
DataFrame.sample() returns a random sample of rows from the DataFrame. You can use it to select the train sample, and then pass the sampled index to DataFrame.drop() to keep the remaining rows as the test sample.
The random_state parameter seeds the random number generator used for sampling. By setting random_state, we can reproduce the same split of the data across multiple calls.
# Using DataFrame.sample() with the random_state argument
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)
Yields below output.
Courses Fee Duration
3 Python 24000 None
4 PySpark 26000 NaN
0 Spark 22000 30days
1 PySpark 25000 50days
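As a quick sanity check, the sketch below rebuilds the sample DataFrame and confirms that the train and test samples do not overlap and that repeating the call with the same random_state gives the same rows. The helper name split_with_sample is hypothetical, introduced only for this illustration, and the approach assumes the DataFrame's index labels are unique.

```python
import numpy as np
import pandas as pd

def split_with_sample(df, frac=0.8, seed=200):
    """Hypothetical helper: split df into (train, test) with DataFrame.sample()."""
    train = df.sample(frac=frac, random_state=seed)
    # drop() removes rows by label, so this assumes unique index labels
    test = df.drop(train.index)
    return train, test

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
})

train, test = split_with_sample(df)
train_again, _ = split_with_sample(df)

# Sizes of the two samples (4 and 1 for this 5-row DataFrame)
print(len(train), len(test))
# True when the same seed reproduces the same training rows
print(train.index.equals(train_again.index))
# True when no row appears in both samples
print(train.index.intersection(test.index).empty)
```

If the index contains duplicate labels, calling df.reset_index(drop=True) before splitting is one way to make the drop() step safe.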
3. Use sklearn to Create Test and Train Samples
The train_test_split() function of the sklearn library can handle pandas DataFrames as well as arrays. Therefore, we can simply call it with the DataFrame and the other parameters.
test_size: This parameter represents the proportion of the dataset to include in the test split. If neither test_size nor train_size is specified, it defaults to 0.25, so the resulting split consists of 75% train and 25% test data.
# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)
Yields below output.
Courses Fee Duration
4 PySpark 26000 NaN
2 Spark 23000 30days
3 Python 24000 None
1 PySpark 25000 50days
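Because train_test_split() shuffles the rows by default, the output above changes from run to run. The sketch below shows two of its other parameters: random_state, which makes the shuffle repeatable, and shuffle=False, which keeps the original row order. The seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

# The same random_state gives the same split on every call
train_a, test_a = train_test_split(df, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=42)
print(train_a.index.equals(train_b.index))

# shuffle=False takes the leading rows as train and the trailing rows as test
train_ord, test_ord = train_test_split(df, test_size=0.2, shuffle=False)
print(list(train_ord.index), list(test_ord.index))
```

An ordered split like the second one is mostly useful when the row order carries meaning, for example in time-ordered data.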
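4. Splitting Features and Target With sklearn.model_selection
model_selection is the sklearn module that provides train_test_split(); it is not a function itself. In supervised learning, you train a model on one dataset and then test it against another, so the features (X) and the target (y) must be split consistently. The example below removes the Courses column as the target, splits the row index labels together with the target, and then selects the training rows of X by label.
# Split the row index and the target with train_test_split()
from sklearn.model_selection import train_test_split
y = df.pop('Courses')  # removes 'Courses' from df in place
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # X_train holds index labels, so select with .loc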
Yields below output.
Fee Duration
4 26000 NaN
0 22000 30days
2 23000 30days
3 24000 None
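As an alternative to splitting the index, train_test_split() also accepts the feature DataFrame and the target Series directly and splits them along the same rows. The sketch below is one way to do this without modifying df; using DataFrame.drop() in place of pop() and the seed value 0 are choices made for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, None],
})

# Build X and y as new objects so df keeps its 'Courses' column
X = df.drop(columns='Courses')
y = df['Courses']

# Passing X and y together splits both along the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# True when features and target stay aligned row by row
print(X_train.index.equals(y_train.index))
# True because df itself was not modified
print('Courses' in df.columns)
```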
5. Using numpy.random.rand() Method
np.random.rand() generates random numbers from the standard uniform distribution (i.e., the uniform distribution over [0, 1)) and returns them as a NumPy array.
np.random.rand(len(df)) is an array of size len(df) with uniformly distributed float values in the range [0, 1). The < 0.8 comparison is applied element-wise and returns a new boolean array: values < 0.8 become True and values >= 0.8 become False. Because each row is assigned independently, the split is only approximately 80/20, especially on a small DataFrame.
# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
NOTE: As long as msk is a NumPy array of dtype bool, df[msk], df.iloc[msk], and df.loc[msk] return the same result.
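A random mask assigns each row independently, so the sample sizes are not guaranteed. The sketch below contrasts a seeded mask with a permutation-based split that cuts at an exact 80% boundary. It uses NumPy's Generator API (np.random.default_rng) instead of np.random.rand(), which is a substitution made here for reproducibility, and the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatable results

# Mask approach: each row lands in train with probability 0.8,
# so the sizes vary with the seed and either side may even be empty
msk = rng.random(len(df)) < 0.8
train, test = df[msk], df[~msk]

# Permutation approach: shuffle row positions, then cut at exactly 80%
positions = rng.permutation(len(df))
cut = int(0.8 * len(df))
train_exact = df.iloc[positions[:cut]]
test_exact = df.iloc[positions[cut:]]

# The mask always partitions the rows, whatever the sizes turn out to be
print(len(train) + len(test) == len(df))
# The permutation split has fixed sizes
print(len(train_exact), len(test_exact))
```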
6. Complete Example of Creating Test and Train Samples From a DataFrame
Below is a complete example of creating test and train samples from a pandas DataFrame.
# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)
# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)
# Using DataFrame.sample()
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)
# Splitting features (X) and target (y) with train_test_split()
# Note: df.pop() removes the 'Courses' column from df in place
y = df.pop('Courses')
X = df
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.loc[X_train]  # X_train holds index labels, so select with .loc
Conclusion
In this article, you have learned how to create test and train samples from a pandas DataFrame by using DataFrame.sample() together with DataFrame.drop(), by applying the train_test_split() function from sklearn’s model_selection module, and by using a numpy.random.rand() mask, with examples.