You can get/select a list of pandas DataFrame columns based on data type in several ways. In this article, I will explain different ways to get all column names of a given data type (for example, object) and to get column names of multiple data types, with examples. To select int columns use int64, to select float columns use float64, and to select DateTime columns use datetime64[ns].
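For a quick preview, below is a minimal sketch of these selections using select_dtypes(); it assumes the df created in the next section, and also uses NumPy's np.number as a convenient catch-all for numeric columns.
# Minimal sketch: get column names by data type (assumes the df created below)
import numpy as np
int_cols = list(df.select_dtypes(include='int64').columns)            # ['Fee']
float_cols = list(df.select_dtypes(include='float64').columns)        # ['Discount']
date_cols = list(df.select_dtypes(include='datetime64[ns]').columns)  # ['StartDate']
numeric_cols = list(df.select_dtypes(include=np.number).columns)      # ['Fee', 'Discount']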
Key Points –
- Use the select_dtypes() method in Pandas to filter DataFrame columns by data type.
- Specify the data types you want to select using the include parameter, providing a list of data types or a single data type.
- Alternatively, you can exclude specific data types by using the exclude parameter with select_dtypes(), as shown in the sketch after this list.
- This method allows for efficient selection of columns based on their data type, facilitating data manipulation and analysis.
- It’s particularly useful for tasks such as data cleaning, feature engineering, or statistical analysis where segregating columns by data type is necessary.
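The exclude parameter isn’t demonstrated elsewhere in this article, so here is a minimal sketch; it assumes the same technologies DataFrame created below.
# Minimal sketch: exclude object columns instead of including a type
df_non_object = df.select_dtypes(exclude='object')
sel_cols = list(df_non_object.columns)
print(sel_cols)
# Output:
# ['Fee', 'Discount', 'StartDate']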
1. Quick Examples of Getting a List of DataFrame Columns Based on Data Type
If you are in a hurry, below are some quick examples of how to get a list of DataFrame columns based on the data type.
# Below are the quick examples
# Select column names of object data type
sel_cols = list(df.select_dtypes(include='object'))
# Returns DataFrame with selected columns
df2 = df.select_dtypes(include='object')
# Alternate way to get column names by data type
sel_cols = [column for column, is_type in (df.dtypes=="object").items() if is_type]
# Get DataFrame column names of multiple data types
sel_cols = list(df.select_dtypes(include=['object', 'datetime64[ns]' ]).columns)
# Get DataFrame column names of multiple data types
sel_cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]
# By Using groupby
col = df.columns.to_series().groupby(df.dtypes).groups
col2 = {k.name: v for k, v in col.items()}
Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the column names Courses, Fee, Duration, Discount and StartDate.
# Create DataFrame
import pandas as pd
import numpy as np
technologies = [
("Spark", 22000,'30days',1000.0,"2021-11-21"),
("PySpark",25000,'50days',2300.0,"2020-08-21"),
("Hadoop",23000,'55days',1500.0,"2021-10-02")
]
df = pd.DataFrame(technologies, columns=['Courses', 'Fee', 'Duration', 'Discount', "StartDate"])
df['StartDate'] = pd.to_datetime(df['StartDate'], format='%Y-%m-%d')
print(df)
# Use DataFrame.dtypes to get data types of all columns
print(df.dtypes)
Yields below output.
# Output:
   Courses    Fee Duration  Discount  StartDate
0    Spark  22000   30days    1000.0 2021-11-21
1  PySpark  25000   50days    2300.0 2020-08-21
2   Hadoop  23000   55days    1500.0 2021-10-02
Courses              object
Fee                   int64
Duration             object
Discount            float64
StartDate    datetime64[ns]
dtype: object
As you see above, you can get the data types of all columns using df.dtypes. You can also get the same using df.infer_objects().dtypes.
2. Get DataFrame Column Names of a Selected Data Type
Using the DataFrame.select_dtypes() method, you can get the pandas DataFrame column names based on the data type.
# Select column names of object data type
sel_cols = list(df.select_dtypes(include='object'))
print(sel_cols)
# Output:
# ['Courses', 'Duration']
Here,
- pd.DataFrame(technologies, columns=['Courses', 'Fee', 'Duration', 'Discount', "StartDate"]) creates a DataFrame df from the technologies list with the specified column names.
- pd.to_datetime(df['StartDate'], format='%Y-%m-%d') converts the StartDate column to datetime format.
- df.select_dtypes(include='object') selects columns of object data type.
- list() converts the resulting pandas Index object to a list.
- print(sel_cols) prints the list of column names with object data type.
If you want to select DataFrame columns based on their data types, you can use the select_dtypes() method of the DataFrame. For instance, df.select_dtypes(include='object') selects columns with object data type.
# Returns DataFrame with selected columns.
df2 = df.select_dtypes(include='object')
print(df2)
# Output:
# Courses Duration
# 0 Spark 30days
# 1 PySpark 50days
# 2 Hadoop 55days
Alternatively, if you are using an older version of pandas, you can use the approach below to get column names by data type.
# Alternate way to get column names by data type
sel_cols = [column for column, is_type in (df.dtypes=="object").items() if is_type]
3. Get DataFrame Column Names of Multiple Data Types
You can use the DataFrame.select_dtypes() method to get the pandas DataFrame column names of multiple data types. The example below selects the columns that are either of type object or datetime64[ns] and prints their names.
# Get DataFrame column names of multiple data types
sel_cols = list(df.select_dtypes(include=['object', 'datetime64[ns]' ]).columns)
print(sel_cols)
# Output:
# ['Courses', 'Duration', 'StartDate']
Another way to get the same output is with a list comprehension over df.columns.
# Get column names of multiple data types using a list comprehension
sel_cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]
print(sel_cols)
# Output:
# ['Courses', 'Duration', 'StartDate']
4. Use DataFrame.columns.to_series() & groupby() Function
Let’s see another approach to get column names by data type. It uses the groupby() function in combination with df.columns.to_series() to group DataFrame columns based on their data types.
# By Using groupby
col = df.columns.to_series().groupby(df.dtypes).groups
print(col)
# Output:
{int64: ['Fee'], float64: ['Discount'], datetime64[ns]: ['StartDate'], object: ['Courses', 'Duration']}
Here,
- df.columns.to_series() converts the DataFrame columns into a pandas Series. This operation transforms the DataFrame’s columns into a one-dimensional Series, preserving the column names as the Series index.
- .groupby(df.dtypes) groups the Series by data type, so columns that share the same data type end up together. The result is a GroupBy object where each group is keyed by the data type of its elements.
- .groups accesses the groups created by the groupby operation. It returns a dictionary-like object where the keys are the unique data types found in the DataFrame, and the values are the indices of the columns belonging to each data type.
- col stores the resulting dictionary. Each key represents a data type, and its corresponding value is a list of column names with that data type.
- print(col) prints the dictionary, displaying the grouped columns by data type.
To get the same mapping keyed by data-type names (plain strings), convert each dtype key to its name.
# Get all columns for each data type.
col2 = {k.name: v for k, v in col.items()}
print(col2)
# Output:
# {'int64': Index(['Fee'], dtype='object'), 'float64':
# Index(['Discount'], dtype='object'), 'datetime64[ns]':
# Index(['StartDate'], dtype='object'), 'object': Index(['Courses',
# 'Duration'], dtype='object')}
5. Use DataFrame.dtypes & DataFrame.loc[] Method
You can use the DataFrame.dtypes attribute to access the data types of each column in the DataFrame, and then use boolean indexing with DataFrame.loc[] to select columns based on their data types. Start by building a boolean mask from the dtypes attribute.
# Use DataFrame.dtypes method
mask = df.dtypes == np.float64
print(mask)
# Output:
# Courses      False
# Fee          False
# Duration     False
# Discount      True
# StartDate    False
# dtype: bool
You can use df.loc[:, mask] to look at just those columns with the desired dtype.
# Use DataFrame.loc[] Method
mask = df.dtypes == np.float64
df2 = df.loc[:, mask]
print(df2)
# Output:
# Discount
# 0 1000.0
# 1 2300.0
# 2 1500.0
Now you can apply Numpy.round() (or any other function) to those columns and assign the result back.
# Use Numpy.round() Method
mask = df.dtypes == np.float64
df2 = np.round(df.loc[:, mask], 2)
print(df2)
# Output:
# Discount
# 0 1000.0
# 1 2300.0
# 2 1500.0
# Use DataFrame.loc[] & Numpy.round() method
mask = df.dtypes == np.float64
df.loc[:, mask] = np.round(df.loc[:, mask], 2)
print(df)
# Output:
#    Courses    Fee Duration  Discount  StartDate
# 0    Spark  22000   30days    1000.0 2021-11-21
# 1  PySpark  25000   50days    2300.0 2020-08-21
# 2   Hadoop  23000   55days    1500.0 2021-10-02
6. Use DataFrame.dtypes to Get Data Types of All Columns
To retrieve the data types of all columns in a DataFrame, you can access the DataFrame.dtypes attribute directly. If you want to know the data types of all the columns at once, use the plural of dtype, dtypes. For example: df.dtypes.
# Use DataFrame.dtypes to get data types of all columns
df2 = df.dtypes
print(df2)
# Use DataFrame.infer_objects().dtypes method
df2 = df.infer_objects().dtypes
print(df2)
Yields below output.
# Output:
Courses              object
Fee                   int64
Duration             object
Discount            float64
StartDate    datetime64[ns]
dtype: object
The dtypes attribute also gives you a single column’s data type. Use DataFrame.dtypes['column_name'] to get the data type of a single column.
# Get data type of single column
df2 = df.dtypes['Discount']
print(df2)
# Output:
# float64
# Use DataFrame['column'].dtype to get data type of single column
df2 = df['Discount'].dtype
print(df2)
# Output:
# float64
Complete Example of Getting a List of DataFrame Columns Based on Data Type
import pandas as pd
import numpy as np
technologies = [
    ("Spark", 22000, '30days', 1000.0, "2021-11-21"),
    ("PySpark", 25000, '50days', 2300.0, "2020-08-21"),
    ("Hadoop", 23000, '55days', 1500.0, "2021-10-02")
]
df = pd.DataFrame(technologies, columns=['Courses', 'Fee', 'Duration', 'Discount', "StartDate"])
df['StartDate'] = pd.to_datetime(df['StartDate'], format='%Y-%m-%d')
print(df)
# Use DataFrame.dtypes to get data types of all columns
df2 = df.dtypes
print(df2)
# Use DataFrame.infer_objects().dtypes method
df2 = df.infer_objects().dtypes
print(df2)
# Get data type of single column
df2 = df.dtypes['Discount']
print(df2)
# Use DataFrame['column'].dtype to get data type of single column
df2 = df['Discount'].dtype
print(df2)
# Use DataFrame.columns.to_series() & groupby() function
df2 = df.columns.to_series().groupby(df.dtypes).groups
print(df2)
# Get all 'object' dtype columns
df2 = df.select_dtypes(include='object').columns
print(df2)
# Get list columns Using DataFrame.select_dtypes()
df2 = list(df.select_dtypes(include='object').columns)
print(df2)
# Use DataFrame.dtypes method
mask = df.dtypes == np.float64
print(mask)
# Use DataFrame.loc[] Method
mask = df.dtypes == np.float64
df2 = df.loc[:, mask]
print(df2)
# Use Numpy.round() Method
mask = df.dtypes == np.float64
df2 = np.round(df.loc[:, mask], 2)
print(df2)
# Use DataFrame.loc[] & Numpy.round() method
mask = df.dtypes == np.float64
df.loc[:, mask] = np.round(df.loc[:, mask], 2)
print(df)
Frequently Asked Questions on Getting DataFrame Columns by Data Type
How do I select DataFrame columns based on their data type?
You can use the select_dtypes() method of the DataFrame to select columns based on their data types. For example, to get columns of type int64, you can use df.select_dtypes(include='int64').
How do I select columns of multiple data types?
Pass a list of data types to the include parameter of the select_dtypes() method. For example, to get columns of types int64 and float64, you can use df.select_dtypes(include=['int64', 'float64']).
How can I get the data types of all columns in a DataFrame?
You can use the dtypes attribute of the DataFrame to get a Series containing the data type of each column. Then, you can use boolean indexing or other methods to filter columns based on data types.
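For instance, here is a minimal sketch of filtering column names through the dtypes Series, using the df from this article.
# Minimal sketch: filter column names through the dtypes Series
int_cols = df.dtypes[df.dtypes == 'int64'].index.tolist()
print(int_cols)
# Output:
# ['Fee']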
How can I group DataFrame columns by data type?
You can convert the DataFrame columns into a Series using df.columns.to_series() and then use the groupby() function along with df.dtypes to group columns by their data types.
Can I use boolean indexing to filter columns by data type?
Yes. You can use boolean indexing with df.dtypes to filter columns based on data types. For example, df.loc[:, df.dtypes == 'int64'] returns the columns with data type int64. (Note that df[df.dtypes == 'int64'] would not work, since plain [] indexing with a boolean mask operates on rows, not columns.)
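A minimal sketch of this pattern with the df used throughout this article:
# Minimal sketch: select int64 columns with a boolean mask and DataFrame.loc[]
mask = df.dtypes == 'int64'
df_int = df.loc[:, mask]
print(df_int)
# Output:
#      Fee
# 0  22000
# 1  25000
# 2  23000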
Conclusion
In this article, you have learned how to get a list of pandas DataFrame columns based on data type using the DataFrame.dtypes, DataFrame.columns.to_series(), DataFrame.groupby(), DataFrame.loc[] and DataFrame.select_dtypes() methods, with more examples.
Happy Learning !!
Related Articles
- Change the Order of Pandas DataFrame Columns
- How to Change Position of a Column in Pandas
- Pandas Shuffle DataFrame Rows Examples
- How to Change Column Name in Pandas
- Convert String to Float in Pandas DataFrame
- Convert Float to Integer in Pandas DataFrame
- Count NaN Values in Pandas DataFrame
- Get Unique Rows in Pandas DataFrame
- Convert Pandas Timestamp to Datetime
- Pandas apply() Return Multiple Columns
- pandas rename multiple columns
- Apply Multiple Filters to Pandas DataFrame or Series
- Append Pandas DataFrames Using for Loop
- How to Union Pandas DataFrames using Concat?