  • Post last modified:March 27, 2024
pandas DataFrame.astype() – Examples

The DataFrame.astype() function casts a pandas object to a specified data type (dtype). It supports string, float, int, datetime, and many other dtypes supported by NumPy. This comes in handy when you want to cast a DataFrame column from one data type to another.


Key Points –

  • Use the astype() method to convert strings to integers in Pandas.
  • Ensure the string values represent valid integers; otherwise, the conversion raises an error.
  • Handle missing or non-numeric values appropriately to avoid conversion issues.
  • Consider using pd.to_numeric() for more flexibility and error handling.
  • Verify that the resulting integer dtype matches your expectations and is compatible with downstream analysis.
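As a minimal sketch of the pd.to_numeric() point above, the snippet below uses hypothetical sample data in which one value is not a valid integer. It assumes that replacing the bad value with 0 is acceptable, which may not hold for your data.

```python
import pandas as pd

# Hypothetical sample data: the last value is not a valid integer
fees = pd.Series(["20000", "25000", "n/a"])

# errors="coerce" turns unparseable values into NaN instead of raising
numeric_fees = pd.to_numeric(fees, errors="coerce")
print(numeric_fees.dtype)         # float64, because NaN requires a float dtype
print(numeric_fees.isna().sum())  # number of values that could not be parsed

# Assumption: a missing fee may be treated as 0 before casting to int
clean_fees = numeric_fees.fillna(0).astype("int64")
print(clean_fees.tolist())
```

Coercing first and casting afterwards keeps the decision about bad values explicit instead of letting astype() fail on the first one.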

DataFrame.astype() Syntax

The following is the syntax of DataFrame.astype(). The function takes the dtype, copy, and errors parameters.


# astype() Syntax
DataFrame.astype(dtype, copy=True, errors='raise')

The following are the parameters of the astype() function.

  • dtype – Accepts a numpy.dtype or Python type to cast the entire pandas object to the same type. Use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type, to cast one or more of the DataFrame’s columns.
  • copy – Default True. Returns a copy when copy=True (be very careful when setting copy=False, as changes to values may then propagate to other pandas objects).
  • errors – Default ‘raise’.
    • Use ‘raise’ to raise an exception when the cast fails because the data is invalid for the target type.
    • Use ‘ignore’ to suppress the exception. On error, the original object is returned.
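The parameters above imply that astype() returns a new object rather than modifying the DataFrame in place. The following is a small sketch of that behavior with the default arguments. The sample data and the name converted are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data stored as strings
df = pd.DataFrame({"Fee": ["20000", "25000"], "Discount": ["1000", "2300"]})

# astype() returns a new DataFrame; the result must be assigned to keep it
converted = df.astype("int64")

print(df.dtypes)         # the original columns still hold strings
print(converted.dtypes)  # the returned DataFrame holds integers
```

This is why the examples in this article assign the result back, for instance with df = df.astype(...).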

DataFrame.astype() – Cast All Columns Data Type (dtype)

By default, the pandas astype() function tries to cast every DataFrame column to the specified numpy.dtype or Python type (int, string, float, datetime, and so on). If any column cannot be cast because it contains invalid data or NaN, the call raises an error (for example ‘ValueError: invalid literal’) and the whole operation fails.
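To illustrate the two failure cases mentioned above, here is a hedged sketch that tries an integer cast on a column with an invalid string and on a column with NaN. The sample values and the samples and failures names are assumptions for this example. The exact error messages vary between pandas versions, so they are printed rather than quoted.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one with an invalid string, one with a missing value
samples = {
    "invalid string": pd.Series(["20000", "abc"], dtype=object),
    "missing value": pd.Series([20000.0, np.nan]),
}

failures = {}
for label, series in samples.items():
    try:
        series.astype("int64")
    except ValueError as exc:
        # Both cases raise a ValueError (or a subclass of it)
        failures[label] = str(exc)

for label, message in failures.items():
    print(f"{label}: {message}")
```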

The below example demonstrates casting all columns data types.


# DataFrame.astype() - Cast All Columns Data Type (dtype)
import pandas as pd
import numpy as np
# Create DataFrame from Dictionary
technologies = {
    'Fee' :["20000","25000","26000"],
    'Discount':["1000","2300","1500"]
              }
df = pd.DataFrame(technologies)
print(df.dtypes)

# Output:
# Fee         object
# Discount    object
# dtype: object

DataFrame.dtypes returns the column names and dtypes of all DataFrame columns. Note that every column in the above DataFrame has the object type.

Now let’s cast the columns to a 64-bit signed integer. You can pass numpy.int64, 'int64', or 'int' as the dtype; numpy.int_ and 'int' map to the platform’s default integer, which is 64-bit on most systems. To cast to a 32-bit signed integer, use numpy.int32 or 'int32'.


# Cast all columns to int (the three calls below are equivalent alternatives)
df = df.astype(np.int64)
df = df.astype('int64')
df = df.astype('int')

print(df.dtypes)

# Output:
# Fee         int64
# Discount    int64
# dtype: object

Notice that it updated all columns with the new dtype.
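The 32-bit variant mentioned above works the same way. The following sketch rebuilds the same sample data and casts it with 'int32'. The name df32 is only illustrative.

```python
import pandas as pd

# Same sample data as in the article, stored as strings
df = pd.DataFrame({
    "Fee": ["20000", "25000", "26000"],
    "Discount": ["1000", "2300", "1500"],
})

# Cast all columns to a 32-bit signed integer
df32 = df.astype("int32")
print(df32.dtypes)

# A smaller integer type uses less memory per value
print(df32.memory_usage(index=False))
```

Keep in mind that int32 can only hold values up to 2,147,483,647, so it suits columns whose values are known to stay within that range.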

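Let’s cast it to a string type. Passing 'string' selects the pandas StringDtype extension type. Passing str or numpy.str_ also converts the values to strings, but the resulting dtype label can differ (for example, object in older pandas versions). Recent pandas versions may display the dtype below as string[python].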


# Cast all columns to string
df = df.astype('string')
print(df.dtypes)

# Output:
# Fee         string
# Discount    string
# dtype: object
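One practical effect of the string cast is that the .str accessor becomes available on the column. The sketch below assumes a hypothetical numeric Fee column. The names as_string and fee_lengths are made up for this example.

```python
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"Fee": [20000, 25000, 26000]})

# Cast to the pandas string dtype
as_string = df.astype("string")
print(as_string.dtypes)

# String methods are now available through the .str accessor
fee_lengths = as_string["Fee"].str.len()
print(fee_lengths.tolist())
```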

Let’s cast it to a float type using numpy.float64, 'float64', or 'float'. (numpy.float_ was an alias for numpy.float64 that was removed in NumPy 2.0.)


# Cast all columns to float
df = df.astype('float')
print(df.dtypes)

# Output:
# Fee         float64
# Discount    float64
# dtype: object

Change Specific Column Type

Similarly, you can change the type of a specific column with the Series.astype() function. Since each DataFrame column is a pandas Series, select the column as a Series and call astype() on it. In the example below, df.Fee or df['Fee'] returns a Series object.


# Cast specific column type
df.Fee = df.Fee.astype('int')
# (or)
df.Fee = df['Fee'].astype('int')
print(df.dtypes)

# Output (when Discount still holds the original strings):
# Fee          int64
# Discount    object
# dtype: object

astype() – Cast Multiple Columns Using Dict

The dtype parameter of the astype() function also accepts a dictionary in the format {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type (int, string, float, datetime, and so on). Use it to cast one or more DataFrame columns in a single call.


# Astype() - Cast Multiple Columns Using Dict 
import pandas as pd
import numpy as np
# Create DataFrame from Dictionary
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :["20000","25000","26000"],
    'Duration':['30day','40days','35days'],
    'Discount':["1000","2300","1500"]
              }

df = pd.DataFrame(technologies)
print(df.dtypes)

# Output:
# Courses     object
# Fee         object
# Duration    object
# Discount    object
# dtype: object

Now, by using the pandas DataFrame.astype() function, cast the Courses column to string, Fee column to int and Discount column to float.


# Apply cast type for multiple columns
df2 = df.astype({'Courses':'string','Fee':'int','Discount':'float'})
print(df2.dtypes)

# Output:
# Courses      string
# Fee           int64
# Duration     object
# Discount    float64
# dtype: object
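When the dictionary contains a key that is not a column of the DataFrame, astype() raises a KeyError. The sketch below shows one possible way to guard against that by filtering the mapping first. The wanted mapping and the Tax column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Fee": ["20000", "25000"], "Discount": ["1000", "2300"]})

# Hypothetical mapping; 'Tax' is not a column of df
wanted = {"Fee": "int64", "Discount": "float64", "Tax": "float64"}

try:
    df.astype(wanted)
    missing_key_error = False
except KeyError:
    # Dictionary keys must be existing column labels
    missing_key_error = True

# Keep only the entries whose column actually exists
usable = {col: dtype for col, dtype in wanted.items() if col in df.columns}
df2 = df.astype(usable)
print(df2.dtypes)
```

Whether silently skipping unknown columns is appropriate depends on the use case. In a strict pipeline, letting the KeyError surface may be the better choice.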

astype() with raise or ignore Error

Finally, let’s see how to raise or ignore errors while casting by using the errors parameter. Its default value is raise, which means an exception is raised when a cast fails because the data is invalid for the target type.

The Courses column of our DataFrame holds string data. Let’s cast it to int and see what happens.


# Raise error when unable to cast
df.Courses = df.Courses.astype('int')

# Output:
# ValueError: invalid literal for int() with base 10: 'Spark'

As you can see, it raised an error because the cast failed. Now let’s suppress the exception by passing ignore to the errors parameter. With this setting, when an error occurs, astype() ignores it and returns the original object unchanged.


# Ignore error when unable to cast
df.Courses = df.Courses.astype('int', errors='ignore')
print(df.dtypes)

# Output (df is returned unchanged, so every column is still object):
# Courses     object
# Fee         object
# Duration    object
# Discount    object
# dtype: object
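Suppressing errors hides which columns failed. As an alternative sketch, the hypothetical helper cast_columns below (not part of pandas) casts each column separately with the default errors='raise' and records the failures, so they can be inspected afterwards.

```python
import pandas as pd

def cast_columns(df, dtypes):
    """Cast each column separately; return (new_df, {column: error message})."""
    result = df.copy()
    failed = {}
    for col, dtype in dtypes.items():
        try:
            result[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as exc:
            # Leave the column as it was and remember why the cast failed
            failed[col] = str(exc)
    return result, failed

df = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": ["20000", "25000"]})
casted, failed = cast_columns(df, {"Courses": "int64", "Fee": "int64"})

print(casted.dtypes)
print(failed)
```

Columns that cast cleanly are updated, and the failed dictionary makes it explicit that Courses could not be converted.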

Frequently Asked Questions on pandas DataFrame.astype() 

What does the astype() method do in pandas DataFrame?

The astype() method in pandas DataFrame is used to change the data type of one or more columns to a specified data type. It enables users to convert the data type of columns within a DataFrame, facilitating data manipulation and analysis.

What data types can be converted using astype()?

The astype() method can be used to convert columns to various data types such as int, float, string, datetime, category, etc.
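The category type from the list above is not shown elsewhere in this article, so here is a brief sketch with a hypothetical Courses column that contains repeated values.

```python
import pandas as pd

# Hypothetical column with repeated values
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Spark"]})

# Cast the column to the categorical dtype
df["Courses"] = df["Courses"].astype("category")

print(df["Courses"].dtype)
print(df["Courses"].cat.categories.tolist())  # the distinct values
```

A categorical column stores each distinct value once and keeps integer codes per row, which can reduce memory usage for columns with few unique values.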

How does astype() handle invalid conversions?

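If the conversion cannot be performed (e.g., due to invalid values in the column), astype() raises an error by default (errors='raise'). With errors='ignore', it suppresses the error and returns the original object unchanged. It does not coerce invalid values; for that, use a function such as pd.to_numeric() with errors='coerce'.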

Can astype() be used to convert multiple columns simultaneously?

The astype() method in pandas DataFrame can be used to convert multiple columns simultaneously. You can achieve this by passing a dictionary where the keys are the column names and the values are the target data types to which you want to convert the corresponding columns. This allows for efficient conversion of multiple columns in a single operation.

Are there alternative methods to astype() for type conversion in pandas?

Besides astype(), pandas provides other methods like pd.to_numeric(), pd.to_datetime(), and pd.to_timedelta() for more specialized conversions. Additionally, you can use custom functions or apply transformations using lambda functions.
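As a short sketch of the specialized converters named above, the snippet below parses hypothetical date and duration strings. The sample values are made up, and errors="coerce" is used so that an unparseable date becomes NaT instead of raising.

```python
import pandas as pd

# Hypothetical inputs: one date string is invalid
raw_dates = pd.Series(["2024-03-27", "not a date"])
raw_durations = pd.Series(["30 days", "40 days"])

# Unparseable dates become NaT with errors="coerce"
dates = pd.to_datetime(raw_dates, errors="coerce")
durations = pd.to_timedelta(raw_durations)

print(dates.isna().tolist())
print(durations.dt.days.tolist())
```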

How does astype() handle missing values (NaN) during conversion?

Missing values (NaN) are preserved when the target type can represent them, such as float or a nullable dtype like 'Int64'. Casting a column that contains NaN to a NumPy integer type such as int64 raises an error, so handle missing values appropriately before or after using astype().
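The following sketch shows the nullable-integer route on a hypothetical float column with one missing value. Note the capital I in 'Int64', which selects the pandas nullable integer type rather than the NumPy int64 type.

```python
import numpy as np
import pandas as pd

# Hypothetical float column with a missing value
fees = pd.Series([20000.0, np.nan, 26000.0])

# 'Int64' keeps the missing value as <NA> instead of raising
nullable = fees.astype("Int64")

print(nullable.dtype)
print(nullable.isna().tolist())
```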

Conclusion

In this article, I have explained the pandas DataFrame.astype() syntax, with examples of casting an entire DataFrame, a specific column, and multiple columns to a numpy.dtype or Python type (int, string, float, datetime, and so on).


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.