Question: In Spark and PySpark, how do you get the size/length of an ArrayType (array) column, and how do you find the size of a MapType (map/dict) column in a DataFrame? Could you also explain, with an example, how to filter by array/map size?
Example: How to Get the Size of ArrayType and MapType Columns in Spark & PySpark
1. Solution: Get Size/Length of Array & Map DataFrame Column
Spark/PySpark provides the size() SQL function to get the size of array and map type columns in a DataFrame (the number of elements in an ArrayType or MapType column). To use it in Spark with Scala, import org.apache.spark.sql.functions.size; in PySpark, use from pyspark.sql.functions import size. Below are quick snippets showing how to use the size() function.
Related: How to get the length of string column in Spark, PySpark
Note: By default this function returns -1 for null array/map columns. To get null for null input instead, set spark.sql.legacy.sizeOfNull to false or spark.sql.ansi.enabled to true.
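For example, here is a minimal sketch of switching this behavior at runtime through the session configuration (assuming your Spark version still supports the legacy flag):
// Make size() return null instead of -1 for null array/map input
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
// Or, on Spark 3.x, ANSI mode also gives null-for-null semantics
spark.conf.set("spark.sql.ansi.enabled", "true")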
// Filter based on array size of a column
import org.apache.spark.sql.functions.{size,col}
df.filter(size(col("languages")) > 1).show()
// Size of ArrayType & MapType columns
df.withColumn("lang_len",size(col("languages")))
.withColumn("prop_len",size(col("properties")))
.show()
// Using size() on Spark SQL.
df.createOrReplaceTempView("TAB")
spark.sql("select name,size(languages) as lang_len from TAB" +
" where size(languages) > 1").show()
Let’s see this with a detailed example. First, create a DataFrame with array and map type columns.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StringType, ArrayType, MapType}

val arrayData = Seq(
  Row("James", List("Java", "Scala"), Map("hair" -> "black", "eye" -> "brown")),
  Row("Michael", List("Spark", "Java", null), Map("hair" -> "brown", "eye" -> null)),
  Row("Robert", List("CSharp"), Map("hair" -> "red")),
  Row("Washington", null, null),
  Row("Jeferson", List(), Map())
)
val arraySchema = new StructType()
  .add("name", StringType)
  .add("languages", ArrayType(StringType))
  .add("properties", MapType(StringType, StringType))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayData), arraySchema)
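To sanity-check the DataFrame, you can print its schema; with the defaults used above, the output should look roughly like this:
// Inspect the schema of the example DataFrame
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- languages: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- properties: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)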
2. Filter Based on the Size of an Array Type Column
The above Spark DataFrame has a languages column of type Array and a properties column of type Map. The example below shows how to filter the DataFrame to rows where the languages column has more than one element.
// Filter Dataframe using size() of a column
import org.apache.spark.sql.functions.{size,col}
df.filter(size(col("languages")) > 1).show(false)
// +-------+--------------+-----------------------------+
// |name |languages |properties |
// +-------+--------------+-----------------------------+
// |James |[Java, Scala] |[hair -> black, eye -> brown]|
// |Michael|[Spark, Java,]|[hair -> brown, eye ->] |
// +-------+--------------+-----------------------------+
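The same filter works on the MapType column as well; for instance, a quick sketch that keeps only rows whose properties map holds more than one entry:
// Filter rows where the properties map has more than one entry
df.filter(size(col("properties")) > 1).show(false)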
Another example creates new columns with the sizes of the array and map columns. Note that when a column value is null, size() returns -1 by default.
// Get the size of a column to create another column
df.withColumn("lang_len",size(col("languages")))
.withColumn("prop_len",size(col("properties")))
.show(false)
// +----------+--------------+-----------------------------+--------+--------+
// |name |languages |properties |lang_len|prop_len|
// +----------+--------------+-----------------------------+--------+--------+
// |James |[Java, Scala] |[hair -> black, eye -> brown]|2 |2 |
// |Michael |[Spark, Java,]|[hair -> brown, eye ->] |3 |2 |
// |Robert |[CSharp] |[hair -> red] |1 |1 |
// |Washington|null |null |-1 |-1 |
// |Jeferson |[] |[] |0 |0 |
// +----------+--------------+-----------------------------+--------+--------+
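If the -1 sentinel is inconvenient downstream, one alternative (a sketch, not part of the original example) is to map null inputs to 0 explicitly with when()/otherwise():
// Treat a null array as size 0 instead of -1
import org.apache.spark.sql.functions.when
df.withColumn("lang_len",
    when(col("languages").isNull, 0).otherwise(size(col("languages"))))
  .show(false)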
3. PySpark: Get Size/Length of Array & Map Type Columns
In PySpark, the size() function is available via from pyspark.sql.functions import size and returns the number of elements in an array or map type column.
# Filter DataFrame using size() of a column
from pyspark.sql.functions import size, col
df.filter(size("languages") > 2).show(truncate=False)

# Get the size of a column to create another column
df.withColumn("lang_len", size(col("languages"))) \
  .withColumn("prop_len", size(col("properties"))) \
  .show(truncate=False)
4. Spark SQL Example
If you are using Spark SQL, you can also use size()
function that returns the size of an array or map type columns. The Below example creates a new column lang_len
with the size of the array column language
and filters the DataFrame that has minimum languages 2.
// Using size() function on SQL
df.createOrReplaceTempView("TAB")
spark.sql("select name,size(languages) as lang_len from TAB" +
" where size(languages) > 1").show(false)
// +-------+--------+
// |name |lang_len|
// +-------+--------+
// |James |2 |
// |Michael|3 |
// +-------+--------+
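size() behaves the same for map columns in SQL; as a quick sketch against the same temp view:
// Get the number of entries in the properties map via SQL
spark.sql("select name, size(properties) as prop_len from TAB").show(false)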
5. Conclusion
In summary, the SQL function size() returns the number of elements in array or map type DataFrame columns. By default it returns -1 for null values, and you can change this behavior through the Spark configuration options described above.
Happy Learning !!