PySpark Retrieve DataType & Column Names of DataFrame

You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes or df.schema, and you can retrieve the data type of a specific column by name using df.schema["name"].dataType. Let's see all of these with PySpark (Python) examples.

1. PySpark Retrieve All Column Names and DataTypes

By using df.dtypes you can retrieve all column names and data types (DataType) of a PySpark DataFrame as a list of tuples. Iterate over the list and get the column name and data type from each tuple.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [(1,"Robert"), (2,"Julia")]
df = spark.createDataFrame(data, ["id","name"])

#Get all column names and their types
for col in df.dtypes:
    print(col[0]+" , "+col[1])

# Prints column name and data type
# id , bigint
# name , string
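
Since df.dtypes returns plain (name, type) tuples, you can also filter column names by their type string. A minimal sketch (filtering for string columns is just an illustrative choice):


#Get names of all string-typed columns (sketch)
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
print(string_cols)
# ['name']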

Similarly, by using df.schema you can find all column data types and names; schema returns a PySpark StructType that includes the metadata of the DataFrame columns. Use df.schema.fields to get the list of StructField objects and iterate through it to get each field's name and type.


#Get all column names and their types
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))

This yields the same output as above.
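
Because df.schema.fields gives you StructField objects, you can also match columns on the actual DataType class using isinstance(). A small sketch, again picking out the string columns:


from pyspark.sql.types import StringType

#Get names of fields whose dataType is a StringType instance (sketch)
string_fields = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
print(string_fields)
# ['name']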

2. Get DataType of a Specific Column Name

If you want to retrieve the data type of a specific DataFrame column by name, use the examples below.


#Get data type of a specific column
print(df.schema["name"].dataType)
#StringType

#Get data type of a specific column from dtypes
print(dict(df.dtypes)['name'])
#string
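
DataType objects also support equality comparison, so you can branch on a specific column's type. A minimal sketch:


from pyspark.sql.types import StringType

#Compare the column's DataType against an expected type (sketch)
if df.schema["name"].dataType == StringType():
    print("'name' is a string column")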

3. PySpark Get All Column Names as a List

You can get all column names of a DataFrame as a list of strings by using df.columns.


#Get all column names from DataFrame
print(df.columns)

# Prints all column names as a list
# ['id', 'name']
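
Since df.columns is a plain Python list, a common use is checking whether a column exists before referencing it. A minimal sketch:


#Check for a column before using it (sketch)
if "name" in df.columns:
    df.select("name").show()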

4. Get DataFrame Schema

As you may already know, use df.printSchema() to display the column names and types on the console.


df.printSchema()

#root
# |-- id: long (nullable = true)
# |-- name: string (nullable = true)

5. Get Column Nullable Property & Metadata

Let’s see how to check whether a column accepts null values (nullable) and how to get the metadata of a column.


#Check if a column accepts null values
print(df.schema["name"].nullable)  # True
#Get the column metadata
print(df.schema["name"].metadata)  # {}
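
The metadata dictionary is empty ({}) unless metadata was set when the schema was defined. As an illustration, here is one way to attach custom metadata through an explicit StructType; the "desc" key below is just a made-up example:


from pyspark.sql.types import StructType, StructField, StringType, LongType

#Build a schema that attaches custom metadata to the 'name' column (sketch)
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True, metadata={"desc": "person name"})
])
df2 = spark.createDataFrame(data, schema)
print(df2.schema["name"].metadata)
# {'desc': 'person name'}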

6. Other Ways to Get DataFrame Schema


The schema object also provides simpleString(), json(), and jsonValue(), which return the schema in other formats. Note that these are methods, so call them with parentheses; without them, Python prints the bound method itself rather than the schema.


print(df.schema.simpleString())
# struct<id:bigint,name:string>

print(df.schema.json())
# {"fields":[{"metadata":{},"name":"id","nullable":true,"type":"long"},{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}

print(df.schema.jsonValue())
# {'type': 'struct', 'fields': [{'name': 'id', 'type': 'long', 'nullable': True, 'metadata': {}}, {'name': 'name', 'type': 'string', 'nullable': True, 'metadata': {}}]}

7. Conclusion

In summary, you can retrieve the names and data types (DataType) of all DataFrame columns by using df.dtypes and df.schema, and you can use several StructField properties to get additional details about PySpark DataFrame columns.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium

This Post Has 6 Comments

  1. Rob

    I want to be able to return a list of columns by datatype and use the subset of values in a udf

  2. Rob

    Basically I want to identify specific datatypes and depending on the datatypes run validations on the data to confirm they are in the correct format.
    I have a raw and a clean dataframe schema. The raw schema is untyped, string types. The clean one will have floats, strings, dates etc. I have a function that will pass in the values from the raw dataframe that takes in a parameter of “datatype”. If I can pass into the function all the columns which are of a particular datatype dynamically, I can call the function once and pass in the different datatypes from the “clean” dataframe to validate the values.
    i.e. run_date, datetime | currency, string | rate, float | cost, float

    I would then want to use the output of df.dtypes to show me all of the float columns and pass into the function that validates that they are floats.
    Then all the datetime columns and validate as true dates.

    Does that make sense?

  3. NNK

    Hi Rob, I didn’t quite get what you are trying to do. Why are you converting data types list to dataframe?

  4. Rob

    Hi, thanks for the post. To a novice like myself this is very useful. How would you output the dataset into a dataframe? I’ve tried using something like this

    df.dtypes.to_frame(‘df2’).reset_index()

    but get
    ‘list’ object has no attribute ‘to_frame’

  5. NNK

    Thanks. It’s a typo and corrected now.

  6. Anonymous

    It’s df.dtypes not df.dttypes
