Spark Check String Column Has Numeric Values

Problem: In Spark, I have a string column in a DataFrame and want to check whether this string column has all or any numeric values. Is there a function similar to the isNumeric() function available in other tools/languages?

Solution: Check String Column Has All Numeric Values

Unfortunately, Spark doesn't have an isNumeric() function, so you need to use existing functions to check whether a string column has all or any numeric values. You may be tempted to write a Spark UDF for scenarios like this, but UDFs are not recommended because Spark cannot optimize them and they generally perform worse than built-in functions; a sketch of such a UDF is shown after the sample DataFrame below, for contrast.

Let’s create a simple DataFrame with numeric & alphanumeric columns for our example.


import spark.implicits._
val data = Seq((1,"123456","123456"),
  (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH"))
val df = data.toDF("id","all_numeric","alphanumeric")
df.show()

// +---+-----------+------------+
// | id|all_numeric|alphanumeric|
// +---+-----------+------------+
// |  1|     123456|      123456|
// |  2|    3456234|   ABCD12345|
// |  3|   48973456|    ABCDEFGH|
// +---+-----------+------------+
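
For contrast, here is a minimal sketch of what such a UDF could look like (illustrative only; the built-in approaches below are preferred):


// Illustrative only: a UDF-based isNumeric check. UDFs are opaque to
// the Catalyst optimizer, so prefer built-in functions instead.
import org.apache.spark.sql.functions.{col, udf}
val isNumeric = udf((s: String) => s != null && s.nonEmpty && s.forall(_.isDigit))
df.select(col("alphanumeric"),
    isNumeric(col("alphanumeric")).alias("is_numeric")
  ).show()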

1. Spark Check Column has Numeric Values

The example below creates a new Boolean column 'value' that holds true for numeric values and false for non-numeric ones. To do this, cast the string column to int and check whether the result of the cast is null: the cast() function returns null when it is unable to convert a value to the target type.


// Spark check column has numeric values
import org.apache.spark.sql.functions.{col}
df.select(col("all_numeric"),
    col("all_numeric")
      .cast("int").isNotNull.alias("Value ")
  ).show()

// +-----------+-----+
// |all_numeric|value|
// +-----------+-----+
// |     123456| true|
// |    3456234| true|
// |   48973456| true|
// +-----------+-----+
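
One caveat worth noting: cast("int") also returns null for digit strings that overflow the Int range, so very long numbers would be flagged as non-numeric. If your data can contain such values, casting to a wider type avoids these false negatives; a sketch:


// Digit strings longer than Int allows (e.g. "99999999999") cast to
// null with "int"; a wider decimal type accepts them
df.select(col("all_numeric"),
    col("all_numeric").cast("decimal(38,0)").isNotNull.alias("value")
  ).show()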

2. Filter Rows that have only Numeric Values in a Column

You can use a similar approach to filter rows that have only numeric values in a string column. The example below checks the alphanumeric column and returns only the records whose values are numeric.


// Filter rows that have a numeric value in the column
import org.apache.spark.sql.functions.{col}
df.filter(col("alphanumeric")
    .cast("int").isNotNull
 ).show()

// +---+-----------+------------+
// | id|all_numeric|alphanumeric|
// +---+-----------+------------+
// |  1|     123456|      123456|
// +---+-----------+------------+
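
To answer the "all numeric values" part of the question for an entire column, you can check that no row fails the cast. A minimal sketch:


// Column has only numeric values if no row fails the cast to int
val allNumeric = df.filter(col("alphanumeric").cast("int").isNull).isEmpty
println(allNumeric) // false, since two rows contain letters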

Alternatively, you can use the rlike() function of the Column class to filter rows that have numeric values in a column.


df.filter(col("alphanumeric")
    .rlike("^[0-9]*$")
  ).show()
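
Note that the pattern ^[0-9]*$ matches unsigned integers only, and it also matches the empty string; use ^[0-9]+$ to require at least one digit. If you also want to accept a sign and a decimal point, a broader pattern along these lines could be used (a sketch; adjust it to your own definition of "numeric"):


// Optional sign and optional fractional part; at least one digit required
df.filter(col("alphanumeric")
    .rlike("^-?[0-9]+(\\.[0-9]+)?$")
  ).show()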

You can also get the same result with the typed filter below. Keep in mind that this lambda operates on deserialized rows, so Spark cannot optimize it the way it optimizes Column expressions.


df.filter(row => row.getAs[String]("alphanumeric")
    .matches("""\d+""")
  ).show()

3. PySpark Check String Column has all Numeric Values

A PySpark example to check whether a DataFrame string column has all numeric values.


# PySpark check string column has all numeric values
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [(1,"123456","123456"),
  (2,"3456234","ABCD12345"),(3,"48973456","ABCDEFGH")]
df = spark.createDataFrame(data = data, schema = ["id","all_numeric","alphanumeric"])
df.show()

from pyspark.sql.functions import col
df.select(col("all_numeric"),
          col("all_numeric").cast("int").isNotNull().alias("Value ")
  ).show()

4. PySpark Filter Rows that have only Numeric Values

Filter the rows of a PySpark DataFrame that have only numeric values.


# PySpark filter rows that have only numeric values
df.filter(col("alphanumeric").cast("int").isNotNull()).show()

df.filter(col("alphanumeric").rlike("^[0-9]*$")).show()

5. Using SQL Query Expression


// Filter rows that have numeric values using a SQL expression
df.createOrReplaceTempView("DATA")
spark.sql("select * from DATA where rlike(alphanumeric,'^[0-9]*$')").show()

Conclusion

In this article, you have learned Spark and PySpark examples of how to check whether a DataFrame string column has numeric values, and how to filter rows that contain only numeric values.

Happy Learning !!
