Null values in concat() of Spark

When using the concat() function in Spark Scala to concatenate strings, null values can cause surprising results: if any of the input columns is null, the entire concatenated value is null. In this article, we shall discuss the concat() function and the different ways to handle null values with it.

Code Snippet

The snippet below previews the three approaches covered in this article: coalesce(), concat_ws(), and when()/otherwise(). Each handles null values in a slightly different way, so it’s important to choose the one that best suits your use case.


import org.apache.spark.sql.functions.{coalesce, concat, concat_ws, lit, when}
import spark.implicits._  // enables the $"col" syntax (predefined in spark-shell)

// Using coalesce to replace null values with a default value
df.select(concat(coalesce($"first_name", lit("")),
      coalesce($"middle_name", lit("")),
      coalesce($"last_name", lit(""))).alias("full_name")).show()

// Using concat_ws to concatenate columns; null values are
// skipped automatically (no separator is inserted for them)
df.select(concat_ws(" ", $"first_name", $"middle_name", $"last_name")
     .alias("full_name")).show()

// Using when and otherwise to replace null values with a default value
df.select(concat(
      when($"first_name".isNotNull, $"first_name").otherwise(lit("")),
      when($"middle_name".isNotNull, $"middle_name").otherwise(lit("")),
      when($"last_name".isNotNull, $"last_name").otherwise(lit(""))
    ).alias("full_name")).show()

1. concat() Function

In Spark Scala, the concat() function is used to concatenate two or more string columns together into a single column. The function takes multiple arguments, each of which is a column containing strings, and returns a new column that contains the concatenated strings.

The syntax for using the concat function in Spark Scala is as follows:


import org.apache.spark.sql.functions.{concat, concat_ws}

// Concatenate two columns
df.select(concat($"col1", $"col2").alias("new_col"))

// Concatenate three columns
df.select(concat($"col1", $"col2", $"col3").alias("new_col"))

// Concatenate columns with a delimiter
df.select(concat_ws(", ", $"col1", $"col2", $"col3").alias("new_col"))

Here’s an example:


import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._  // enables toDF() and the $"col" syntax

val df = Seq(("John", "Doe"), ("Jane", "Smith"), ("Mike", "Jones"))
  .toDF("first_name", "last_name")

df.select(concat($"first_name", lit(" "), $"last_name").alias("full_name")).show()

// Output:
// +-----------+
// |  full_name|
// +-----------+
// |   John Doe|
// | Jane Smith|
// | Mike Jones|
// +-----------+

In this example, the concat function is used to concatenate the first_name and last_name columns with a space in between. The resulting full_name column contains the concatenated strings.

Note that the concat() function is meant to be called with two or more columns, all of the same (string) type. If any of the input columns is null, the resulting concatenation is also null.
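
To see this null-propagation behavior directly, here is a minimal sketch (dfWithNull is just an illustrative name) where one row has a null last_name:


import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

val dfWithNull = Seq(("John", "Doe"), ("Jane", null))
  .toDF("first_name", "last_name")

dfWithNull.select(concat($"first_name", lit(" "), $"last_name")
  .alias("full_name")).show()

// +---------+
// |full_name|
// +---------+
// | John Doe|
// |     null|
// +---------+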

2. Handling null values in the concat() function of Spark Scala

As noted above, if any of the input columns passed to concat() is null, the resulting concatenation is also null.

Here are some detailed examples of how to deal with null values when using the concat functionality in Spark Scala:

Example 1: Using coalesce to replace null values with a default value


// Using coalesce to replace null values with a default value
import org.apache.spark.sql.functions.{coalesce, concat, lit}
import spark.implicits._

val df = Seq(("John", null, "Doe"), ("Jane", "A.", "Smith"), ("Mike", "B.", null))
  .toDF("first_name", "middle_name", "last_name")

df.select(concat(coalesce($"first_name", lit("")),
    coalesce($"middle_name", lit("")),
    coalesce($"last_name", lit(""))).alias("full_name")).show()

// Output:
// +-----------+
// |  full_name|
// +-----------+
// |    JohnDoe|
// |JaneA.Smith|
// |     MikeB.|
// +-----------+

In this example, we use the coalesce function to replace null values in each of the three columns with an empty string before concatenating them. The lit("") call creates the empty-string literal that coalesce falls back to when a column value is null. Note that concat() inserts no separator, so the pieces run together (e.g., JaneA.Smith); add lit(" ") arguments between the columns if you want spaces.
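
The default does not have to be an empty string. As a variation on the example above (the "(none)" placeholder is just an illustrative choice), coalesce can substitute any literal, and lit(" ") separators can be added between the parts:


import org.apache.spark.sql.functions.{coalesce, concat, lit}

// Fall back to a visible placeholder instead of an empty string,
// with explicit lit(" ") separators between the parts
df.select(concat(
    coalesce($"first_name", lit("(none)")), lit(" "),
    coalesce($"middle_name", lit("(none)")), lit(" "),
    coalesce($"last_name", lit("(none)"))).alias("full_name")).show()

// +---------------+
// |      full_name|
// +---------------+
// |John (none) Doe|
// |  Jane A. Smith|
// | Mike B. (none)|
// +---------------+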

Example 2: Using concat_ws to concatenate columns and automatically skip null values


// Using concat_ws to concatenate columns; nulls are skipped automatically
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._

val df = Seq(("John", null, "Doe"), ("Jane", "A.", "Smith"), ("Mike", "B.", null))
  .toDF("first_name", "middle_name", "last_name")

df.select(concat_ws(" ", $"first_name", $"middle_name", $"last_name").alias("full_name")).show()

// Output:
// +-------------+
// |    full_name|
// +-------------+
// |     John Doe|
// |Jane A. Smith|
// |      Mike B.|
// +-------------+

In this example, we use the concat_ws function to concatenate the columns with a space separator. concat_ws skips null values entirely, so no extra separator is inserted where a value is missing; this is why John Doe has only a single space even though middle_name is null.
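
If you prefer SQL expressions, the same concat_ws call can be written with selectExpr; this sketch should produce the same output as the DataFrame-API version above:


// concat_ws is also available as a Spark SQL function
df.selectExpr("concat_ws(' ', first_name, middle_name, last_name) AS full_name")
  .show()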

Example 3: Using when and otherwise to replace null values with a default value


import org.apache.spark.sql.functions.{concat, lit, when}
import spark.implicits._

val df = Seq(("John", null, "Doe"), ("Jane", "A.", "Smith"), ("Mike", "B.", null))
  .toDF("first_name", "middle_name", "last_name")

df.select(concat(
    when($"first_name".isNotNull, $"first_name").otherwise(lit("")),
    when($"middle_name".isNotNull, $"middle_name").otherwise(lit("")),
    when($"last_name".isNotNull, $"last_name").otherwise(lit(""))
  ).alias("full_name")).show()

// Output:
// +-----------+
// |  full_name|
// +-----------+
// |    JohnDoe|
// |JaneA.Smith|
// |     MikeB.|
// +-----------+

In this example, we use the when and otherwise functions to replace null values in each column with an empty string before concatenating. isNotNull checks whether the column value is non-null; if it is, the value is used as-is; otherwise, lit("") supplies the empty-string default. As with Example 1, concat() adds no separator, so the pieces run together.
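
Another option, not covered in the three snippets above, is to clean the columns up front with DataFrame.na.fill and then concatenate; here is a sketch assuming the same sample df:


import org.apache.spark.sql.functions.concat

// Replace nulls with empty strings in the listed columns, then concatenate;
// after na.fill, concat never sees a null, so it never returns null
val filled = df.na.fill("", Seq("first_name", "middle_name", "last_name"))
filled.select(concat($"first_name", $"middle_name", $"last_name")
  .alias("full_name")).show()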

3. Conclusion

In conclusion, handling null values in the concat functionality of Spark Scala is important to avoid getting null results in the concatenated strings. This can be done using various Spark Scala functions such as coalesce, concat_ws, when, and otherwise.

Choosing the appropriate function to handle null values depends on the specific use case and the desired output format. Note that these functions can also affect the performance of a Spark job, so consider the trade-off between performance and null handling.
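
If you want to see how Spark evaluates these expressions on your own data, one quick way (a sketch, reusing the concat_ws example above) is to print the query plan:


// Print the plan for the concat_ws version; useful when comparing
// the null-handling approaches before choosing one
df.select(concat_ws(" ", $"first_name", $"middle_name", $"last_name")
  .alias("full_name")).explain()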
