PySpark Retrieve DataType & Column Names of DataFrame

You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes or df.schema, and you can retrieve the data type of a specific column by name using df.schema["name"].dataType. Let's see all of these with PySpark (Python) examples.

1. PySpark Retrieve All Column Names and DataTypes

By using df.dtypes you can retrieve all column names and data types (DataType) of a PySpark DataFrame as a list of tuples. Iterate over the list and get the column name and data type from each tuple.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [(1,"Robert"), (2,"Julia")]
df = spark.createDataFrame(data, ["id","name"])

#Get all column names and their types
for col in df.dtypes:
    print(col[0]+" , "+col[1])

# Prints column name and data type
# id , bigint
# name , string
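
Since df.dtypes returns plain (name, type) tuples, you can also filter column names by their type string. A minimal sketch (filtering for string columns is just an illustrative choice):


#Get names of all string-typed columns (sketch)
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
print(string_cols)
# ['name']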

Similarly, by using df.schema you can find all column data types and names; schema returns a PySpark StructType that includes the metadata of the DataFrame columns. Use df.schema.fields to get the list of StructField objects and iterate through it to get each field's name and type.


#Get all column names and their types
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))

This yields the same output as above.
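
Because df.schema.fields gives you StructField objects, you can also match columns on the actual DataType class using isinstance(). A small sketch, again picking out the string columns:


from pyspark.sql.types import StringType

#Get names of fields whose dataType is a StringType instance (sketch)
string_fields = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
print(string_fields)
# ['name']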

2. Get DataType of a Specific Column Name

If you want to retrieve the data type of a specific DataFrame column by name, use the examples below.


#Get data type of a specific column
print(df.schema["name"].dataType)
#StringType

#Get data type of a specific column from dtypes
print(dict(df.dtypes)['name'])
#string
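
DataType objects also support equality comparison, so you can branch on a specific column's type. A minimal sketch:


from pyspark.sql.types import StringType

#Compare the column's DataType against an expected type (sketch)
if df.schema["name"].dataType == StringType():
    print("'name' is a string column")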

3. PySpark Get All Column Names as a List

You can get all column names of a DataFrame as a list of strings by using df.columns.


#Get all column names from DataFrame
print(df.columns)

# Prints all column names as a list
# ['id', 'name']
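
Since df.columns is a plain Python list, a common use is checking whether a column exists before referencing it. A minimal sketch:


#Check for a column before using it (sketch)
if "name" in df.columns:
    df.select("name").show()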

4. Get DataFrame Schema

As you may already know, use df.printSchema() to display the column names and types on the console.


df.printSchema()

#root
# |-- id: long (nullable = true)
# |-- name: string (nullable = true)

5. Get Column Nullable Property & Metadata

Let’s see how to check whether a column accepts null values (nullable) and how to get the metadata of a column.


#Check if a column accepts null values
print(df.schema["name"].nullable)  # True
#Get the column metadata
print(df.schema["name"].metadata)  # {}
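
The metadata dictionary is empty ({}) unless metadata was set when the schema was defined. As an illustration, here is one way to attach custom metadata through an explicit StructType; the "desc" key below is just a made-up example:


from pyspark.sql.types import StructType, StructField, StringType, LongType

#Build a schema that attaches custom metadata to the 'name' column (sketch)
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True, metadata={"desc": "person name"})
])
df2 = spark.createDataFrame(data, schema)
print(df2.schema["name"].metadata)
# {'desc': 'person name'}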

6. Other Ways to Get DataFrame Schema


The schema object also provides simpleString(), json(), and jsonValue(), which return the schema in other formats. Note that these are methods, so call them with parentheses; without them, Python prints the bound method itself rather than the schema.


print(df.schema.simpleString())
# struct<id:bigint,name:string>

print(df.schema.json())
# {"fields":[{"metadata":{},"name":"id","nullable":true,"type":"long"},{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}

print(df.schema.jsonValue())
# {'type': 'struct', 'fields': [{'name': 'id', 'type': 'long', 'nullable': True, 'metadata': {}}, {'name': 'name', 'type': 'string', 'nullable': True, 'metadata': {}}]}

7. Conclusion

In summary, you can retrieve the names and data types (DataType) of all DataFrame columns by using df.dtypes and df.schema, and you can use several StructField properties to get additional details about PySpark DataFrame columns.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium

This Post Has 6 Comments

  1. Rob

    I want to be able to return a list of columns by datatype and use the subset of values in a udf

  2. Rob

    Basically I want to identify specific datatypes and depending on the datatypes run validations on the data to confirm they are in the correct format.
    I have a raw and a clean dataframe schema. The raw schema is untyped, string types. The clean one will have floats, strings, dates etc. I have a function that will pass in the values from the raw dataframe that takes in a parameter of “datatype”. If I can pass into the function all the columns which are of a particular datatype dynamically, I can call the function once and pass in the different datatypes from the “clean” dataframe to validate the values.
    i.e. run_date, datetime | currency, string | rate, float | cost, float

    I would then want to use the output of df.dtypes to show me all of the float columns and pass into the function that validates that they are floats.
    Then all the datetime columns and validate as true dates.

    Does that make sense?

  3. NNK

    Hi Rob, I didn’t quite get what you are trying to do. Why are you converting data types list to dataframe?

  4. Rob

    Hi, thanks for the post. To a novice like myself this is very useful. How would you output the dataset into a dataframe? I’ve tried using something like this

    df.dtypes.to_frame(‘df2’).reset_index()

    but get
    ‘list’ object has no attribute ‘to_frame’

  5. NNK

    Thanks. It’s a typo and corrected now.

  6. Anonymous

    It’s df.dtypes not df.dttypes
