
PySpark Retrieve DataType & Column Names of DataFrame


You can find the names and data types (DataType) of all PySpark DataFrame columns by using df.dtypes or df.schema, and you can retrieve the data type of a specific column by name using df.schema["name"].dataType. Let's see all of these with PySpark (Python) examples.

1. PySpark Retrieve All Column DataType and Names

By using df.dtypes you can retrieve all column names and data types of a PySpark DataFrame as a list of tuples. Iterate over the list to get the column name and data type from each tuple.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [(1,"Robert"), (2,"Julia")]
df = spark.createDataFrame(data, ["id","name"])

# Get all column names and their types
for col in df.dtypes:
    print(col[0] + " , " + col[1])

# Prints column name and data type
# id , bigint
# name , string

Similarly, by using df.schema, you can find all column names and data types; schema returns a PySpark StructType, which includes the metadata of the DataFrame columns. Use df.schema.fields to get the list of StructField objects and iterate through it to get the name and type of each field.


# Get all column names and their types
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))

This yields the same output as above.
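Since df.schema.fields gives you full StructField objects, you can also filter columns by their data type. Below is a minimal sketch (the string_cols variable name is just for illustration) that collects the names of all string columns:


from pyspark.sql.types import StringType

# Collect the names of all columns whose data type is StringType
string_cols = [f.name for f in df.schema.fields
               if isinstance(f.dataType, StringType)]
print(string_cols)
# ['name']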

2. Get DataType of a Specific Column Name

If you want to retrieve the data type of a specific DataFrame column by name, use the examples below.


# Get data type of a specific column
print(df.schema["name"].dataType)
# StringType

# Get data type of a specific column from dtypes
print(dict(df.dtypes)['name'])
# string
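If you need to branch on a column's type, compare its dataType against the PySpark type classes with isinstance. A minimal sketch (the printed message is just illustrative):


from pyspark.sql.types import StringType, LongType

# Check whether the "name" column holds strings
if isinstance(df.schema["name"].dataType, StringType):
    print("'name' is a StringType column")

# The same check works for any type, e.g. LongType for "id"
print(isinstance(df.schema["id"].dataType, LongType))
# True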

3. PySpark Get All Column Names as a List

You can get all column names of a DataFrame as a list of strings by using df.columns.


# Get all column names from DataFrame
print(df.columns)

# Prints all column names as a list
# ['id', 'name']
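Because df.columns is a plain Python list, it is handy for quick membership checks, for example, verifying that a column exists before selecting it. A minimal sketch:


# Select the column only if it exists in the DataFrame
if "name" in df.columns:
    df.select("name").show()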

4. Get DataFrame Schema

As you probably already know, you can use df.printSchema() to display the column names and types on the console.


df.printSchema()

# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
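printSchema() is especially useful for nested schemas, where child fields are shown indented. A small sketch, assuming a hypothetical address struct column:


# Hypothetical nested schema to illustrate the indented tree output
data2 = [(1, ("New York", "10001"))]
df2 = spark.createDataFrame(data2, "id int, address struct<city:string, zip:string>")
df2.printSchema()

# root
#  |-- id: integer (nullable = true)
#  |-- address: struct (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- zip: string (nullable = true)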

5. Get Column Nullable Property & Metadata

Let's see how to check whether a column accepts null values (nullable) and how to get the metadata of a column.


df.schema["name"].nullable
metaData=df.schema["name"].metadata
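To inspect these properties for every column at once, iterate over the schema fields. A short sketch (the exact type repr, e.g. LongType() vs LongType, depends on your PySpark version):


# Print name, type, nullability, and metadata for every column
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable, field.metadata)

# id LongType() True {}
# name StringType() True {}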

6. Other Ways to Get DataFrame Schema


These are methods on the schema object, so remember to call them with parentheses; otherwise Python prints the bound method instead of the schema.


print(df.schema.simpleString())
# struct<id:bigint,name:string>

print(df.schema.json())
# {"fields":[{"metadata":{},"name":"id","nullable":true,"type":"long"},{"metadata":{},"name":"name","nullable":true,"type":"string"}],"type":"struct"}

print(df.schema.jsonValue())
# {'type': 'struct', 'fields': [{'name': 'id', 'type': 'long', 'nullable': True, 'metadata': {}}, {'name': 'name', 'type': 'string', 'nullable': True, 'metadata': {}}]}
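One practical use of jsonValue() is persisting a schema and rebuilding it later with StructType.fromJson(), which is part of the PySpark API. A minimal round-trip sketch:


from pyspark.sql.types import StructType

# Round-trip the schema through its JSON (dict) representation
restored = StructType.fromJson(df.schema.jsonValue())
print(restored == df.schema)
# True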

7. Conclusion

In summary, you can retrieve the names and data types (DataType) of all DataFrame columns by using df.dtypes and df.schema, and you can use several StructField properties to get additional details about the PySpark DataFrame columns.

Happy Learning !!
