You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes and df.schema, and you can also retrieve the data type of a specific column by name using df.schema["name"].dataType. Let’s see all of these with PySpark (Python) examples.
1. PySpark Retrieve All Column Names and Data Types
By using df.dtypes you can retrieve all column names and data types of a PySpark DataFrame as a list of tuples. Iterate over the list to get the column name and data type from each tuple.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [(1,"Robert"), (2,"Julia")]
df = spark.createDataFrame(data, ["id","name"])

# Get all column names and their data types
for col in df.dtypes:
    print(col[0] + " , " + col[1])

# Prints column name and data type
# id , bigint
# name , string
Similarly, by using df.schema, you can find all column names and data types; schema returns a PySpark StructType which includes the metadata of the DataFrame columns. Use df.schema.fields to get the list of StructField objects and iterate through it to get the name and type of each column.
# Get all column names and their data types
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))
This yields the same output as above.
2. Get DataType of a Specific Column Name
If you want to retrieve the data type of a specific DataFrame column by name, use the example below.
# Get the data type of a specific column
print(df.schema["name"].dataType)
# StringType

# Get the data type of a specific column from dtypes
print(dict(df.dtypes)['name'])
# string
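If you need to branch on the type programmatically rather than just print it, one option is to compare the returned DataType against the type classes in pyspark.sql.types. A minimal sketch (the isinstance check below is illustrative and not part of the original example):
from pyspark.sql.types import StringType

# Compare the column's DataType against a type class
if isinstance(df.schema["name"].dataType, StringType):
    print("'name' is a StringType column")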
3. PySpark Get All Column Names as a List
You can get all column names of a DataFrame as a list of strings by using df.columns.
# Get all column names from the DataFrame
print(df.columns)

# Prints all column names as a list
# ['id', 'name']
4. Get DataFrame Schema
As you would already know, you can use df.printSchema() to display the column names and data types to the console.
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
5. Get Column Nullable Property & Metadata
Let’s see how to find out whether a column accepts null values (nullable) and how to get the metadata of the column.
df.schema["name"].nullable
metaData=df.schema["name"].metadata
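To inspect these properties for every column at once, a small sketch (not part of the original example) can loop over df.schema.fields and print the nullable flag and metadata of each field:
# Print name, type, nullable flag, and metadata for every column
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable, field.metadata)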
6. Other Ways to Get DataFrame Schema
Besides printSchema(), the StructType returned by df.schema also provides simpleString(), json(), and jsonValue() methods to get the schema in other formats.
print(df.schema.simpleString())
# struct<id:bigint,name:string>

print(df.schema.json())
# {"fields":[{"metadata":{},"name":"id","nullable":true,"type":"long"},{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}

print(df.schema.jsonValue())
# {'type': 'struct', 'fields': [{'name': 'id', 'type': 'long', 'nullable': True, 'metadata': {}}, {'name': 'name', 'type': 'string', 'nullable': True, 'metadata': {}}]}
7. Conclusion
In summary, you can retrieve the names and data types (DataType) of all DataFrame columns by using df.dtypes and df.schema, and you can also use several StructField methods to get additional details of the PySpark DataFrame columns.
Happy Learning !!
Related Articles
- PySpark Find Maximum Row per Group in DataFrame
- PySpark SparkContext Explained
- What is PySpark DataFrame?
- PySpark Replace Column Values in DataFrame
- PySpark alias() Column & DataFrame Examples
- PySpark DataFrame groupBy and Sort by Descending Order
- PySpark Count of Non null, nan Values in DataFrame
- PySpark Replace Empty Value With None/null on DataFrame
I want to be able to return a list of columns by datatype and use the subset of values in a UDF.
Basically I want to identify specific datatypes and, depending on the datatype, run validations on the data to confirm they are in the correct format.
I have a raw and a clean dataframe schema. The raw schema is untyped, all string types. The clean one will have floats, strings, dates, etc. I have a function that takes in the values from the raw dataframe along with a “datatype” parameter. If I can dynamically pass into the function all the columns which are of a particular datatype, I can call the function once and pass in the different datatypes from the “clean” dataframe to validate the values.
i.e. run_date, datetime | currency, string | rate, float | cost, float
I would then want to use the output of df.dtypes to show me all of the float columns and pass them into the function that validates that they are floats.
Then all the datetime columns, and validate that they are true dates.
Does that make sense?
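For readers with a similar requirement, one possible approach is to group the column names by the type string returned from df.dtypes and then select each group from the DataFrame. A minimal sketch against the example DataFrame used in this article (the grouping logic is illustrative, not a complete validation framework):
# Group column names by their data type string
cols_by_type = {}
for name, dtype in df.dtypes:
    cols_by_type.setdefault(dtype, []).append(name)

print(cols_by_type)
# e.g. {'bigint': ['id'], 'string': ['name']}

# Select only the string columns and pass them on for validation
string_cols_df = df.select(cols_by_type.get("string", []))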
Hi Rob, I didn’t quite get what you are trying to do. Why are you converting the data types list to a dataframe?
Hi, thanks for the post. To a novice like myself this is very useful. How would you output the dataset into a dataframe? I’ve tried using something like this
df.dtypes.to_frame('df2').reset_index()
but get
'list' object has no attribute 'to_frame'
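Since df.dtypes returns a plain Python list of (name, type) tuples rather than a pandas object, it doesn't have a to_frame() method; one way to get a tabular view is to build a pandas DataFrame from the list. A small sketch, assuming pandas is installed:
import pandas as pd

# df.dtypes is a list of (column name, type) tuples
dtypes_pdf = pd.DataFrame(df.dtypes, columns=["column", "type"])
print(dtypes_pdf)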
Thanks. It’s a typo and corrected now.
It’s df.dtypes not df.dttypes