Spark SQL – Select Columns From DataFrame

In Spark SQL, the select() function is used to select one or multiple columns, nested columns, a column by index, all columns, columns from a list, or columns by regular expression from a DataFrame. select() is a transformation in Spark and returns a new DataFrame with the selected columns. You can also alias column names while selecting.

Following are the different signatures of the select() transformation.


select(cols : Column*) : DataFrame
select(col : String, cols : String*) : DataFrame
select[U1](c1 : TypedColumn[T, U1]) : Dataset[U1]
select[U1, U2](c1 : TypedColumn[T, U1], c2 : TypedColumn[T, U2]) : Dataset[scala.Tuple2[U1, U2]]
// And more select() overloads that take up to 5 TypedColumns U1,U2,U3,U4,U5
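
For the typed variants, select() returns a strongly typed Dataset rather than a DataFrame. Below is a minimal sketch, assuming the df created below, import spark.implicits._ in scope, and a hypothetical case class Person matching its columns:


case class Person(firstname:String, lastname:String, country:String, state:String)

// $"firstname".as[String] is a TypedColumn, so the result is a Dataset[String]
val firstNames = df.as[Person].select($"firstname".as[String])
firstNames.show()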

Let’s see these with examples. First, create a Spark DataFrame.


import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

val data = Seq(("James","Smith","USA","CA"),
  ("Michael","Rose","USA","NY"),
  ("Robert","Williams","USA","CA"),
  ("Maria","Jones","USA","FL")
  )
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show(false)

1. Select Single & Multiple Columns

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.

Related: Select All columns of String or Integer type in Spark

Below are different ways to select single or multiple columns from a DataFrame.


df.select("firstname","lastname").show()

//Using the DataFrame object name
df.select(df("firstname"),df("lastname")).show()

//Using col function, use alias() to get alias name
import org.apache.spark.sql.functions.col
df.select(col("firstname").alias("fname"),col("lastname")).show()

This yields the below output.


+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michael|    Rose|
|   Robert|Williams|
|    Maria|   Jones|
+---------+--------+
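
You can also reference columns with the $ string interpolator, which becomes available after import spark.implicits._; a minimal equivalent sketch:


//Using the $ string interpolator (requires import spark.implicits._)
df.select($"firstname".as("fname"), $"lastname").show()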

2. Select All Columns

Below are different ways to get all columns of a Spark DataFrame. Here we use df.columns to get all columns of a DataFrame as Array[String], convert it to Array[Column] using Scala's map(), and finally pass it to select().


//Show all columns from DataFrame
df.select("*").show()
val columnsAll=df.columns.map(m=>col(m))
df.select(columnsAll:_*).show()
df.select(columns.map(m=>col(m)):_*).show()
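
A Column-based form of the star projection also works; a minimal sketch:


//Star projection as a Column
df.select(col("*")).show()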

3. Select Columns from List

Sometimes you may have to select columns from an Array, List, or Seq of Strings; the below example shows how to do this using a List.


val listCols= List("lastname","country")
df.select(listCols.map(m=>col(m)):_*).show()
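
Since select() also has a String-varargs signature, an equivalent sketch avoids the map() by splitting the list into its head and tail:


//Equivalent using the String varargs signature
df.select(listCols.head, listCols.tail:_*).show()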

4. Select First N Columns

In order to select the first N columns, you can use df.columns to get all the columns of the DataFrame and the slice() method to take the first N. The below snippet selects the first 3 columns.


//Select first 3 columns.
df.select(df.columns.slice(0,3).map(m=>col(m)):_*).show()
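
Similarly, to select the last N columns you could use takeRight(); a minimal sketch selecting the last 2 columns:


//Select last 2 columns
df.select(df.columns.takeRight(2).map(m=>col(m)):_*).show()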

5. Select Column By Position or Index

To select a column by position or index, first get all columns using df.columns and pick the column name by index; you can also use slice() to get the column names between start and end positions.


//Selects 4th column (index starts from zero)
df.select(df.columns(3)).show()
//Selects columns from index 2 to 4
df.select(df.columns.slice(2,4).map(m=>col(m)):_*).show()
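
To select several columns at arbitrary positions, you can map a sequence of indexes over df.columns; a minimal sketch selecting the 1st and 4th columns:


//Select columns at index 0 and 3
val indexes = Seq(0,3)
df.select(indexes.map(i=>col(df.columns(i))):_*).show()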

6. Select Columns by Regular expression

You can use df.colRegex() to select columns based on a regular expression. The below example selects all columns whose names contain the string name. Note the regular expression must be enclosed in backticks.


//Select columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
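
Any Java regular expression can go between the backticks. For example, a sketch that matches only columns ending with name:


//Select columns ending with 'name'
df.select(df.colRegex("`.*name$`")).show()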

7. Select Columns Starts or Ends With

Below are examples of how to select DataFrame columns whose names start with or end with a given string.


df.select(df.columns.filter(f=>f.startsWith("first")).map(m=>col(m)):_*).show()
df.select(df.columns.filter(f=>f.endsWith("name")).map(m=>col(m)):_*).show()

8. Select Nested Struct Columns

When you are processing structured data, most of the time you will have DataFrame columns nested as a struct type (StructType), and you need to use an explicit column qualifier to select them. If you are new to Spark and have not learned StructType yet, I would recommend skipping the rest of this section or first understanding Spark StructType before you proceed.

First, let’s create a new DataFrame with a nested struct type.


//Create a DataFrame with a nested struct column
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType}
val data2 = Seq(Row(Row("James","","Smith"),"OH","M"),
    Row(Row("Anna","Rose",""),"NY","F"),
    Row(Row("Julia","","Williams"),"OH","F"),
    Row(Row("Maria","Anne","Jones"),"NY","M"),
    Row(Row("Jen","Mary","Brown"),"NY","M"),
    Row(Row("Mike","Mary","Williams"),"OH","M")
)

val schema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("middlename",StringType)
      .add("lastname",StringType))
    .add("state",StringType)
    .add("gender",StringType)

val df2 = spark.createDataFrame(
    spark.sparkContext.parallelize(data2),schema)
df2.printSchema()
df2.show(false)

This yields the below schema and DataFrame output. Notice the column name is a struct type which consists of the columns firstname, middlename, and lastname.


root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

+----------------------+-----+------+
|name                  |state|gender|
+----------------------+-----+------+
|[James,, Smith]       |OH   |M     |
|[Anna, Rose, ]        |NY   |F     |
|[Julia, , Williams]   |OH   |F     |
|[Maria, Anne, Jones]  |NY   |M     |
|[Jen, Mary, Brown]    |NY   |M     |
|[Mike, Mary, Williams]|OH   |M     |
+----------------------+-----+------+

Now, let’s select the struct column.


df2.select("name").show(false)

This returns the struct column name as-is.


+----------------------+
|name                  |
+----------------------+
|[James,, Smith]       |
|[Anna, Rose, ]        |
|[Julia, , Williams]   |
|[Maria, Anne, Jones]  |
|[Jen, Mary, Brown]    |
|[Mike, Mary, Williams]|
+----------------------+

In order to get a specific column from a struct, you need to qualify it explicitly with the struct column name.


df2.select("name.firstname","name.lastname").show(false)

This outputs firstname and lastname from the name struct column.


+---------+--------+
|firstname|lastname|
+---------+--------+
|James    |Smith   |
|Anna     |        |
|Julia    |Williams|
|Maria    |Jones   |
|Jen      |Brown   |
|Mike     |Williams|
+---------+--------+
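
The same nested selection can be written with the col() function, which also lets you alias the flattened columns; a minimal sketch:


//Using col() with alias on nested fields
df2.select(col("name.firstname").alias("fname"),
    col("name.lastname").alias("lname")).show(false)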

In order to get all columns from the struct column, select name.*.


df2.select("name.*").show(false)

This yields the below output.


+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James    |          |Smith   |
|Anna     |Rose      |        |
|Julia    |          |Williams|
|Maria    |Anne      |Jones   |
|Jen      |Mary      |Brown   |
|Mike     |Mary      |Williams|
+---------+----------+--------+

9. Complete Example


import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object SelectExamples extends App{

  val spark = SparkSession.builder
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("James","Smith","USA","CA"),
  ("Michael","Rose","USA","NY"),
  ("Robert","Williams","USA","CA"),
  ("Maria","Jones","USA","FL")
  )

  val columns = Seq("firstname","lastname","country","state")

  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.show(false)


  df.select("firstname","lastname").show()
  //Using the DataFrame object name
  df.select(df("firstname"),df("lastname")).show()
  //Using col function
  import org.apache.spark.sql.functions.col
  df.select(col("firstname"),col("lastname")).show()

  //Show all columns
  df.select("*").show()
  val columnsAll=df.columns.map(m=>col(m))
  df.select(columnsAll:_*).show()
  df.select(columns.map(m=>col(m)):_*).show()

  //Show columns from list
  val listCols= List("lastname","country")
  df.select(listCols.map(m=>col(m)):_*).show()

  //Show first few columns
  df.select(df.columns.slice(0,3).map(m=>col(m)):_*).show(1)

  //Show columns by index or position
  df.select(df.columns(3)).show(3)

  //Show columns from start and end index
  df.select(df.columns.slice(2,4).map(m=>col(m)):_*).show(3)

  //Show columns by regular expression
  df.select(df.colRegex("`^.*name*`")).show()

  df.select(df.columns.filter(f=>f.startsWith("first")).map(m=>col(m)):_*).show(3)
  df.select(df.columns.filter(f=>f.endsWith("name")).map(m=>col(m)):_*).show(3)

  //Show Nested columns
  val data2 = Seq(Row(Row("James","","Smith"),"OH","M"),
    Row(Row("Anna","Rose",""),"NY","F"),
    Row(Row("Julia","","Williams"),"OH","F"),
    Row(Row("Maria","Anne","Jones"),"NY","M"),
    Row(Row("Jen","Mary","Brown"),"NY","M"),
    Row(Row("Mike","Mary","Williams"),"OH","M")
  )

  val schema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("middlename",StringType)
      .add("lastname",StringType))
    .add("state",StringType)
    .add("gender",StringType)
  
  val df2 = spark.createDataFrame(spark.sparkContext.parallelize(data2),schema)
  df2.printSchema()
  df2.show(false)
  df2.select("name").show(false)
  df2.select("name.firstname","name.lastname").show(false)
  df2.select("name.*").show(false)
}

This example is also available at the Spark Scala GitHub project for reference.

10. Conclusion

In this article, you have learned that select() is a transformation function of the DataFrame, used to select single or multiple columns. You have also learned how to select nested struct elements from the DataFrame.

Happy Learning !!
