  • Post category:Pandas
  • Post last modified:March 27, 2024
Pandas Get DataFrame Columns by Data Type

You can get a list of pandas DataFrame column names based on data type in several ways. In this article, I will explain how to get all the column names of a single data type (for example, object) and of multiple data types, with examples. To select integer columns use int64, to select float columns use float64, and to select datetime columns use datetime64[ns].

Key Points –

  • Use the select_dtypes() method in Pandas to filter DataFrame columns by data type.
  • Specify the data types you want to select using the include parameter, providing a list of data types or a single data type.
  • Alternatively, you can exclude specific data types by using the exclude parameter with select_dtypes().
  • This method allows for efficient selection of columns based on their data type, facilitating data manipulation and analysis.
  • It’s particularly useful for tasks such as data cleaning, feature engineering, or statistical analysis where segregating columns by data type is necessary.
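As a minimal sketch of the include and exclude parameters described above, the following uses a small illustrative DataFrame. The sample values and the names int_cols and non_float_cols are assumptions made for this example. The text column is created with dtype=object explicitly, since some pandas versions may infer a dedicated string dtype for text instead.

```python
import pandas as pd

# Small illustrative DataFrame; dtypes are set explicitly to keep the result predictable
df = pd.DataFrame({
    "Courses": pd.Series(["Spark", "PySpark"], dtype=object),
    "Fee": pd.Series([22000, 25000], dtype="int64"),
    "Discount": pd.Series([1000.0, 2300.0], dtype="float64"),
})

# include keeps only the columns of the listed dtype(s)
int_cols = list(df.select_dtypes(include="int64").columns)
print(int_cols)

# exclude keeps every column except those of the listed dtype(s)
non_float_cols = list(df.select_dtypes(exclude="float64").columns)
print(non_float_cols)
```

Both calls return a new DataFrame, so taking .columns afterwards gives only the names.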

1. Quick Examples of Get List of DataFrame Columns Based on Data Type

If you are in a hurry, below are some quick examples of how to get a list of DataFrame columns based on the data type.


# Below are the quick examples

# Select column names of object data type
sel_cols = list(df.select_dtypes(include='object'))

# Returns DataFrame by selected column names
df2=df.select_dtypes(include='object')

# Alternate way to get column names by data type
sel_cols = [column for column, is_type in (df.dtypes=="object").items() if is_type]

# Get DataFrame column names of multiple data types
sel_cols = list(df.select_dtypes(include=['object', 'datetime64[ns]' ]).columns)

# Same result using a list comprehension over df.columns
sel_cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]

# By Using groupby
col = df.columns.to_series().groupby(df.dtypes).groups

col2 = {k.name: v for k, v in col.items()}

Now, let’s create a DataFrame with a few rows and columns, execute these examples and validate results. Our DataFrame contains column names Courses, Fee, Duration, Discount and StartDate.


# Create DataFrame
import pandas as pd
import numpy as np
technologies = [
            ("Spark", 22000,'30days',1000.0,"2021-11-21"),
            ("PySpark",25000,'50days',2300.0,"2020-08-21"),
            ("Hadoop",23000,'55days',1500.0,"2021-10-02")
            ]
df = pd.DataFrame(technologies,columns = ['Courses','Fee','Duration','Discount', "StartDate"])
df['StartDate'] = pd.to_datetime(df['StartDate'], format='%Y-%m-%d')
print(df)

# Use Dataframe.dtypes to get data types of all columns
print(df.dtypes)

This yields the output below.


# Output:
   Courses    Fee Duration  Discount  StartDate
0    Spark  22000   30days    1000.0 2021-11-21
1  PySpark  25000   50days    2300.0 2020-08-21
2   Hadoop  23000   55days    1500.0 2021-10-02

Courses              object
Fee                   int64
Duration             object
Discount            float64
StartDate    datetime64[ns]
dtype: object

As shown above, df.dtypes returns the data type of every column. df.infer_objects().dtypes returns the same result here, because infer_objects() only changes object columns whose values can be stored as a more specific type.
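To illustrate when infer_objects() makes a difference, here is a small sketch. It assumes a hypothetical DataFrame named df_obj whose numeric values were deliberately stored as object, which is not the case for the article's df.

```python
import pandas as pd

# Hypothetical DataFrame whose numeric values are stored as generic Python objects
df_obj = pd.DataFrame(
    {"Fee": [22000, 25000], "Discount": [1000.0, 2300.0]},
    dtype=object,
)

# Dtype names before and after inference
before = [str(t) for t in df_obj.dtypes]
after = [str(t) for t in df_obj.infer_objects().dtypes]

print(before)
print(after)
```

infer_objects() returns a new DataFrame, so df_obj itself keeps its object columns unless you assign the result back.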

2. Get DataFrame Column Names of a Selected Data Type

Using the DataFrame.select_dtypes() method, you can get the pandas DataFrame column names based on the data type.


# Select column names of object data type
sel_cols = list(df.select_dtypes(include='object'))
print(sel_cols)

# Output:
# ['Courses', 'Duration']

Here,

  • pd.DataFrame(technologies, columns=['Courses', 'Fee', 'Duration', 'Discount', "StartDate"]) creates a DataFrame df from the technologies list with specified column names.
  • pd.to_datetime(df['StartDate'], format='%Y-%m-%d') converts the ‘StartDate’ column to datetime format.
  • df.select_dtypes(include='object') selects columns of object data type.
  • list() iterates over the resulting DataFrame, which yields its column labels, and collects them into a list.
  • print(sel_cols) prints the list of column names with object data type.
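Besides exact dtype names, select_dtypes() also accepts generic selectors such as 'number' and 'datetime'. The sketch below assumes a small illustrative DataFrame; the variable names numeric_cols and datetime_cols are made up for this example.

```python
import pandas as pd

# Illustrative DataFrame with text, integer, float and datetime columns
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark"],
    "Fee": [22000, 25000],
    "Discount": [1000.0, 2300.0],
    "StartDate": pd.to_datetime(["2021-11-21", "2020-08-21"]),
})

# 'number' matches every numeric dtype (integers and floats of any width)
numeric_cols = list(df.select_dtypes(include="number").columns)
print(numeric_cols)

# 'datetime' matches datetime64 columns without naming a specific resolution
datetime_cols = list(df.select_dtypes(include="datetime").columns)
print(datetime_cols)
```

Generic selectors are handy when you do not want to hard-code a width such as int64 or a resolution such as [ns].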

If you want the columns themselves rather than just their names, use the DataFrame returned by select_dtypes() directly. For instance, df.select_dtypes(include='object') returns a DataFrame containing only the columns with the object data type.


# Returns DataFrame with selected columns.
df2=df.select_dtypes(include='object')
print(df2)

# Output:
#   Courses Duration
# 0    Spark   30days
# 1  PySpark   50days
# 2   Hadoop   55days

Alternatively, if you are using an older pandas version that lacks select_dtypes(), you can compare df.dtypes against the type name and keep the matching column names.


# Alternate way to get column names by data type
sel_cols = [column for column, is_type in (df.dtypes=="object").items() if is_type]
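The same comparison can also be written with boolean indexing on the dtypes Series itself. This is a sketch on an illustrative DataFrame; the text columns are created with dtype=object explicitly because some pandas versions may infer a dedicated string dtype for text.

```python
import pandas as pd

# Illustrative DataFrame; text columns are forced to object dtype
df = pd.DataFrame({
    "Courses": pd.Series(["Spark", "PySpark"], dtype=object),
    "Fee": [22000, 25000],
    "Duration": pd.Series(["30days", "50days"], dtype=object),
})

# df.dtypes is a Series indexed by column name, so filtering it
# with a boolean mask leaves only the matching column names in its index
dtypes = df.dtypes
sel_cols = dtypes[dtypes == "object"].index.tolist()
print(sel_cols)
```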

3. Get DataFrame Column Names of Multiple Data Types

You can use the DataFrame.select_dtypes() method to get the pandas DataFrame column names of multiple data types by passing a list to include. The example below selects the columns whose type is either 'object' or 'datetime64[ns]' and prints their names.


# Get DataFrame column names of multiple data types
sel_cols = list(df.select_dtypes(include=['object', 'datetime64[ns]' ]).columns)
print(sel_cols)

# Output:
# ['Courses', 'Duration', 'StartDate']

A list comprehension over df.columns is another way to get the same output.


sel_cols = [c for c in df.columns if df[c].dtype in ['object', 'datetime64[ns]']]
print(sel_cols)
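If you prefer not to compare against dtype strings, the predicate functions in pandas.api.types can be used in the same kind of list comprehension. The following is a sketch under the assumption of a small illustrative DataFrame, with the text column forced to object dtype.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_object_dtype

# Illustrative DataFrame; Courses is forced to object dtype
df = pd.DataFrame({
    "Courses": pd.Series(["Spark", "PySpark"], dtype=object),
    "Fee": [22000, 25000],
    "StartDate": pd.to_datetime(["2021-11-21", "2020-08-21"]),
})

# Keep columns that are either object or any datetime64 variant
sel_cols = [
    c for c in df.columns
    if is_object_dtype(df[c]) or is_datetime64_any_dtype(df[c])
]
print(sel_cols)
```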

4. Use DataFrame.columns.to_series() & groupby() Function

Let’s see another approach to get column names by data type: combine df.columns.to_series() with the groupby() function to group the DataFrame columns based on their data types.


# By Using groupby
col = df.columns.to_series().groupby(df.dtypes).groups
print(col)

# Output:
# {int64: ['Fee'], float64: ['Discount'], datetime64[ns]: ['StartDate'], object: ['Courses', 'Duration']}

Here,

  • df.columns.to_series() converts the DataFrame columns into a pandas Series whose index and values are both the column names.
  • .groupby(df.dtypes) groups that Series by data type, so columns that have the same data type land in the same group. The result is a GroupBy object where each group is keyed by a data type.
  • .groups accesses the groups created by the groupby operation. It returns a dictionary-like object where the keys are the unique data types found in the DataFrame, and the values are the labels of the columns belonging to each data type.
  • col = ... stores the resulting mapping in the variable col.
  • print(col) prints the mapping, displaying the columns grouped by data type.
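The same grouping can be sketched in plain Python, without groupby(), by looping over df.dtypes. This is an alternative shown for illustration only; the DataFrame and the name cols_by_dtype are assumptions for this example.

```python
from collections import defaultdict

import pandas as pd

# Illustrative DataFrame with one integer and two float columns
df = pd.DataFrame({
    "Fee": [22000, 25000],
    "Discount": [1000.0, 2300.0],
    "Tax": [1.5, 2.5],
})

# Map each dtype name (as a string) to the list of columns having that dtype
cols_by_dtype = defaultdict(list)
for column, dtype in df.dtypes.items():
    cols_by_dtype[str(dtype)].append(column)

cols_by_dtype = dict(cols_by_dtype)
print(cols_by_dtype)
```

Using str(dtype) as the key gives plain string keys and plain lists as values, which are easy to look up and serialize.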

To key the result by data type name (a string) instead of the dtype object, rebuild it with a dictionary comprehension.


# Get all columns for each data type.
col2 = {k.name: v for k, v in col.items()}
print(col2)

# Output:
# {'int64': Index(['Fee'], dtype='object'), 'float64': 
# Index(['Discount'], dtype='object'), 'datetime64[ns]':
# Index(['StartDate'], dtype='object'), 'object': Index(['Courses', 
# 'Duration'], dtype='object')}

5. Use DataFrame.dtypes & DataFrame.loc[] Method

You can use the DataFrame.dtypes attribute to access the data type of each column in the DataFrame, and then use boolean indexing with DataFrame.loc[] to select columns based on their data types. First, build a boolean mask from the dtypes attribute.


# Use DataFrame.dtypes method
mask = df.dtypes == np.float64
print(mask)

# Output:
# Courses      False
# Fee          False
# Duration     False
# Discount      True
# StartDate    False
# dtype: bool

You can use df.loc[:,mask] to look at just those columns with the desired dtype.


# Use DataFrame.loc[] Method
mask = df.dtypes == np.float64
df2 =df.loc[:, mask]
print(df2)

# Output:
#   Discount
# 0    1000.0
# 1    2300.0
# 2    1500.0

Now you can apply numpy.round() (or any other function) to those columns and assign the result back.


# Use Numpy.round() Method
mask = df.dtypes == np.float64
df2 = np.round(df.loc[:, mask], 2)
print(df2)

# Output:
#   Discount
# 0    1000.0
# 1    2300.0
# 2    1500.0

# Use DataFrame.loc[] & Numpy.round() method
mask = df.dtypes == np.float64
df.loc[:, mask] = np.round(df.loc[:, mask], 2)
print(df)

# Output:
#    Courses    Fee Duration  Discount  StartDate
# 0    Spark  22000   30days    1000.0 2021-11-21
# 1  PySpark  25000   50days    2300.0 2020-08-21
# 2   Hadoop  23000   55days    1500.0 2021-10-02
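Because the Discount values above have no decimals, rounding leaves them unchanged. The sketch below uses made-up values with extra decimals so the effect is visible, and selects the float columns with select_dtypes() and DataFrame.round() instead of a mask; it is an illustrative variant, not a replacement for the approach above.

```python
import pandas as pd

# Illustrative values with more than two decimals
df = pd.DataFrame({
    "Fee": [22000, 25000],
    "Discount": [1000.456, 2300.123],
})

# Collect the float64 column names, round them, and assign the result back
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].round(2)
print(df)
```

Only the float columns are touched, so Fee keeps its integer dtype.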

6. Use DataFrame.dtypes to Get Data Types of All Columns

DataFrame.dtypes is an attribute, so you access it directly without calling it. It returns the data types of all columns at once (note the plural dtypes, as opposed to the dtype of a single column). For example: df.dtypes.


# Use Dataframe.dtypes to get data types of all columns
df2 = df.dtypes
print(df2)

# Use DataFrame.infer_objects().dtypes method
df2 = df.infer_objects().dtypes
print(df2)

This yields the output below.


# Output:
Courses              object
Fee                   int64
Duration             object
Discount            float64
StartDate    datetime64[ns]
dtype: object

To get the data type of a single column, index DataFrame.dtypes by the column name, or use the dtype attribute of the column itself.


# Get data type of single column
df2 = df.dtypes['Discount']
print(df2)

# Output:
# float64

# Use Series.dtype to get the data type of a single column
df2 = df['Discount'].dtype
print(df2)

# Output:
# float64
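Once you have a single column's dtype, you can compare it to a type name or inspect its attributes. The sketch below assumes a small illustrative DataFrame; the variable names are made up for this example.

```python
import pandas as pd

# Illustrative DataFrame with one integer and one float column
df = pd.DataFrame({"Fee": [22000, 25000], "Discount": [1000.0, 2300.0]})

discount_dtype = df["Discount"].dtype

# A NumPy dtype compares equal to its string name
is_float64 = discount_dtype == "float64"

# name is the full dtype name; kind is a one-letter category
# ('f' for floats, 'i' for signed integers)
name = discount_dtype.name
kind = discount_dtype.kind
fee_kind = df["Fee"].dtype.kind

print(is_float64, name, kind, fee_kind)
```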

Complete Example of Getting a List of DataFrame Columns Based on Data Type


import pandas as pd
import numpy as np
technologies = [
            ("Spark", 22000,'30days',1000.0),
            ("PySpark",25000,'50days',2300.0),
            ("Hadoop",23000,'55days',1500.0)
            ]
df = pd.DataFrame(technologies,columns = ['Courses','Fee','Duration','Discount'])
print(df)

# Use Dataframe.dtypes to get data types of all columns
df2 = df.dtypes
print(df2)

# Use DataFrame.infer_objects().dtypes method
df2 = df.infer_objects().dtypes
print(df2)

# Get data type of single column
df2 = df.dtypes['Discount']
print(df2)

# Use DataFrame.dtypes to get single column
df2 = df['Discount'].dtype
print(df2)

# Use DataFrame.columns.to_series() & groupby() function
df2 = df.columns.to_series().groupby(df.dtypes).groups
print(df2)

# Get all 'object' dtype columns
df2 = df.select_dtypes(include='object').columns
print(df2)

# Get list columns Using DataFrame.select_dtypes()
df2 = list(df.select_dtypes(include='object').columns)
print(df2)

# Use DataFrame.dtypes method
mask = df.dtypes == np.float64
print(mask)

# Use DataFrame.loc[] Method
mask = df.dtypes == np.float64
df2 =df.loc[:, mask]
print(df2)

# Use Numpy.round() Method
mask = df.dtypes == np.float64
df2 = np.round(df.loc[:, mask], 2)
print(df2)

# Use DataFrame.loc[] & Numpy.round() method
mask = df.dtypes == np.float64
df.loc[:, mask] = np.round(df.loc[:, mask], 2)
print(df)

Frequently Asked Questions on Get DataFrame Columns by Data Type

How can I get columns of a specific data type in a DataFrame?

You can use the select_dtypes() method of the DataFrame to select columns based on their data types. For example, to get columns of type ‘int64’, you can use df.select_dtypes(include='int64').

How can I get columns of multiple data types in a DataFrame?

Pass a list of data types to the include parameter of the select_dtypes() method. For example, to get columns of types ‘int64’ and ‘float64’, you can use df.select_dtypes(include=['int64', 'float64']).

How can I get the names of columns by data type?

You can use the dtypes attribute of the DataFrame to get a Series containing the data types of each column. Then, you can use boolean indexing or other methods to filter columns based on data types.

How can I group DataFrame columns by data type?

You can convert the DataFrame columns into a Series using df.columns.to_series() and then use the groupby() function along with df.dtypes to group columns by their data types.

How can I access DataFrame columns by data type using boolean indexing?

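You can use boolean indexing with df.dtypes to filter columns based on data types. Because the mask is indexed by column name, apply it along the column axis with DataFrame.loc[]. For example, df.loc[:, df.dtypes == 'int64'] will return the columns with data type ‘int64’.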

Conclusion

In this article, you have learned how to get a list of pandas DataFrame columns based on data type using DataFrame.dtypes, DataFrame.columns.to_series() with groupby(), DataFrame.loc[], and DataFrame.select_dtypes(), with examples.

Happy Learning !!


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
