• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:4 mins read
You are currently viewing PySpark Refer Column Name With Dot (.)

Problem: I have a PySpark (Spark with Python) DataFrame with a dot in the Column names, could you please explain how to access/refer to this column with the dot by using <a href="https://sparkbyexamples.com/pyspark/pyspark-withcolumn/">withColumn()</a> or <a href="https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/">select()</a> transformations. Below is the PySpark DataFrame with column name.fname with dot.


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data=[("James",23),("Ann",40)]
df=spark.createDataFrame(data).toDF("name.fname","gender")

PySpark DataFrame Column Name with Dot (.)

Solution: Generally as a best practice column names should not contain special characters except underscore (_) however, sometimes we may need to handle it.

In order to access PySpark/Spark DataFrame Column Name with a dot from wihtColumn() & select(), you just need to enclose the column name with backticks (`)

Using Column Name with Dot on select().


from pyspark.sql.functions import col
df.select(col("`name.fname`")).show()
df.select(df["`name.fname`"]).show()
+----------+
|name.fname|
+----------+
|     James|
|       Ann|
+----------+

Using Column Name with Dot on withColumn()


df.withColumn("new_col",col("`name.fname`").substr(1,2)).show()
+----------+------+-------+
|name.fname|gender|new_col|
+----------+------+-------+
|     James|    23|     Ja|
|       Ann|    40|     An|
+----------+------+-------+

Using on filter(), for more example refer to PySpark Filter() Usage & Examples


df.filter(col("`name.fname`").startswith("J")).show()
+----------+------+
|name.fname|gender|
+----------+------+
|     James|    23|
+----------+------+

Replace All Column Names from Dot (.) to underscore (_)

Have a column name with a dot leads us into confusion as in PySpark/Spark dot notation is used to refer to the nested column of the struct type. so if possible try to replace all column names with dot to underscore before processing it.


new_cols=(column.replace('.', '_') for column in df.columns)
df2 = df.toDF(*new_cols)
df2.show()

+----------+------+
|name_fname|gender|
+----------+------+
|     James|    23|
|       Ann|    40|
+----------+------+

Happy Learning !!