• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:4 mins read
You are currently viewing PySpark Refer Column Name With Dot (.)

Problem: I have a PySpark (Spark with Python) DataFrame with a dot in the Column names, could you please explain how to access/refer to this column with the dot by using <a href="https://sparkbyexamples.com/pyspark/pyspark-withcolumn/">withColumn()</a> or <a href="https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/">select()</a> transformations. Below is the PySpark DataFrame with column name.fname with dot.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data=[("James",23),("Ann",40)]
df=spark.createDataFrame(data).toDF("name.fname","gender")

PySpark DataFrame Column Name with Dot (.)

Solution: Generally as a best practice column names should not contain special characters except underscore (_) however, sometimes we may need to handle it.

In order to access PySpark/Spark DataFrame Column Name with a dot from wihtColumn() & select(), you just need to enclose the column name with backticks (`)

Using Column Name with Dot on select().


from pyspark.sql.functions import col
df.select(col("`name.fname`")).show()
df.select(df["`name.fname`"]).show()
+----------+
|name.fname|
+----------+
|     James|
|       Ann|
+----------+

Using Column Name with Dot on withColumn()


df.withColumn("new_col",col("`name.fname`").substr(1,2)).show()
+----------+------+-------+
|name.fname|gender|new_col|
+----------+------+-------+
|     James|    23|     Ja|
|       Ann|    40|     An|
+----------+------+-------+

Using on filter(), for more example refer to PySpark Filter() Usage & Examples


df.filter(col("`name.fname`").startswith("J")).show()
+----------+------+
|name.fname|gender|
+----------+------+
|     James|    23|
+----------+------+

Replace All Column Names from Dot (.) to underscore (_)

Have a column name with a dot leads us into confusion as in PySpark/Spark dot notation is used to refer to the nested column of the struct type. so if possible try to replace all column names with dot to underscore before processing it.


new_cols=(column.replace('.', '_') for column in df.columns)
df2 = df.toDF(*new_cols)
df2.show()

+----------+------+
|name_fname|gender|
+----------+------+
|     James|    23|
|       Ann|    40|
+----------+------+

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium