In Spark SQL, the select() function is used to select one or multiple columns, nested columns, columns by index, all columns, columns from a list, or columns matching a regular expression from a DataFrame. select() is a transformation in Spark and returns a new DataFrame with the selected columns. You can also alias column names while selecting.
- Select a Single & Multiple Columns
- Select All Columns
- Select Columns From List
- Select First N Columns
- Select Column by Position or Index
- Select Column by Regular expression
- Select Columns That Start or End With
- Select a Nested Column
Following are the different signatures of the select() transformation.
select(cols : Column*) : DataFrame
select(col : String, cols : String*) : DataFrame
select[U1](c1 : TypedColumn[T, U1]) : Dataset[U1]
select[U1, U2](c1 : TypedColumn[T, U1], c2 : TypedColumn[T, U2]) : Dataset[scala.Tuple2[U1, U2]]
// And more select() overloads that take up to 5 TypedColumns U1, U2, U3, U4, U5
Let's see these with examples. First, create a Spark DataFrame.
val spark = SparkSession.builder
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
val data = Seq(("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show(false)
1. Select Single & Multiple Columns
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
Related: Select All columns of String or Integer type in Spark
Below are different ways to select single or multiple columns from a DataFrame.
df.select("firstname","lastname").show()
//Using Dataframe object name
df.select(df("firstname"),df("lastname")).show()
//Using col function, use alias() to get alias name
import org.apache.spark.sql.functions.col
df.select(col("firstname").alias("fname"),col("lastname")).show()
This yields the below output.
+---------+--------+
|firstname|lastname|
+---------+--------+
| James| Smith|
| Michael| Rose|
| Robert|Williams|
| Maria| Jones|
+---------+--------+
2. Select All Columns
Below are different ways to get all columns of a Spark DataFrame. Here we use df.columns to get all the columns of a DataFrame as Array[String], convert it to Array[Column] using Scala map(), and finally pass it to select().
//Show all columns from DataFrame
df.select("*").show()
val columnsAll=df.columns.map(m=>col(m))
df.select(columnsAll:_*).show()
df.select(columns.map(m=>col(m)):_*).show()
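The `:_*` syntax above is plain Scala, not Spark: select(cols: Column*) is a varargs method, and `seq: _*` expands a sequence into individual varargs arguments, while map(m => col(m)) first converts each String name into a Column. A minimal Spark-free sketch of the same mechanics, using String in place of Column to keep it self-contained:

```scala
// Spark-free sketch of a varargs method like select(cols: Column*).
// String stands in for Column so the example runs without Spark.
def select(cols: String*): List[String] = cols.toList

val names = Seq("firstname", "lastname")

// A Seq cannot be passed to a varargs parameter directly;
// names: _* expands it into individual arguments.
val a = select(names: _*)
val b = select("firstname", "lastname")

// In the real Spark API you would first map each name to a Column:
// df.select(names.map(col): _*)
```

Both calls produce the same result, which is why df.select(columns:_*) and df.select("firstname","lastname") behave identically.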
3. Select Columns from List
Sometimes you may have to select column names from an Array, List, or Seq of String. The below example provides a snippet of how to do this using a list.
val listCols= List("lastname","country")
df.select(listCols.map(m=>col(m)):_*).show()
4. Select First N Columns
In order to select the first N columns, you can use df.columns to get all the columns of the DataFrame and then use the slice() method to select the first N of them. The below snippet selects the first 3 columns.
//Select first 3 columns.
df.select(df.columns.slice(0,3).map(m=>col(m)):_*).show()
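slice(from, until) here is standard Scala collections behavior: the start index is inclusive and the end index is exclusive, which is why slice(0, 3) returns exactly three columns. A quick Spark-free check on a plain array:

```scala
// df.columns is an Array[String]; slice uses an inclusive start
// and an exclusive end, so slice(0, 3) keeps indices 0, 1, 2.
val cols = Array("firstname", "lastname", "country", "state")
val firstThree = cols.slice(0, 3)
```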
5. Select Column By Position or Index
To select a column by position or index, first get all columns using df.columns and pick the column name by index; you can also use slice() to get column names between start and end positions.
//Selects 4th column (index starts from zero)
df.select(df.columns(3)).show()
//Selects columns from index 2 to 4
df.select(df.columns.slice(2,4).map(m=>col(m)):_*).show()
6. Select Columns by Regular expression
You can use df.colRegex() to select columns based on a regular expression. The below example selects all columns whose name contains the string name.
//Select columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
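To my understanding, colRegex() matches the pattern against the entire column name, similar to Scala's String.matches. The pattern ^.*name* above therefore matches any name containing "nam" followed by zero or more "e"s, which can be checked without Spark:

```scala
// Spark-free check of the pattern used with colRegex above.
// String.matches requires the whole name to match:
// ^.*name* = anything, then the literal "nam", then zero or more "e"s.
val cols = Seq("firstname", "lastname", "country", "state")
val matched = cols.filter(_.matches("^.*name*"))
```

Only firstname and lastname survive the filter; country and state contain no "nam".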
7. Select Columns That Start or End With
Below are some examples of how to select DataFrame columns whose names start with or end with a given string.
df.select(df.columns.filter(f=>f.startsWith("first")).map(m=>col(m)):_*).show()
df.select(df.columns.filter(f=>f.endsWith("name")).map(m=>col(m)):_*).show()
8. Select Nested Struct Columns
When you are processing structured data, you will often have Spark DataFrame columns nested with a struct type (StructType), and you need to use an explicit column qualifier to select them. If you are new to Spark and have not learned StructType yet, I recommend skipping the rest of this section or first understanding Spark StructType before you proceed.
First, let’s create a new DataFrame with a nested struct type.
//Show Nested columns
import org.apache.spark.sql.types.{StringType, StructType}
val data2 = Seq(Row(Row("James","","Smith"),"OH","M"),
Row(Row("Anna","Rose",""),"NY","F"),
Row(Row("Julia","","Williams"),"OH","F"),
Row(Row("Maria","Anne","Jones"),"NY","M"),
Row(Row("Jen","Mary","Brown"),"NY","M"),
Row(Row("Mike","Mary","Williams"),"OH","M")
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("state",StringType)
.add("gender",StringType)
val df2 = spark.createDataFrame(
spark.sparkContext.parallelize(data2),schema)
df2.printSchema()
df2.show(false)
This yields the below schema output. Notice that the column name is a struct type consisting of the columns firstname, middlename, and lastname.
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
+----------------------+-----+------+
|name |state|gender|
+----------------------+-----+------+
|[James,, Smith] |OH |M |
|[Anna, Rose, ] |NY |F |
|[Julia, , Williams] |OH |F |
|[Maria, Anne, Jones] |NY |M |
|[Jen, Mary, Brown] |NY |M |
|[Mike, Mary, Williams]|OH |M |
+----------------------+-----+------+
Now, let's select the struct column.
df2.select("name").show(false)
This returns the struct column name as-is.
+----------------------+
|name |
+----------------------+
|[James,, Smith] |
|[Anna, Rose, ] |
|[Julia, , Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
|[Mike, Mary, Williams]|
+----------------------+
In order to get a specific column from a struct, you need to qualify it explicitly with the struct column name.
df2.select("name.firstname","name.lastname").show(false)
This outputs firstname and lastname from the name struct column.
+---------+--------+
|firstname|lastname|
+---------+--------+
|James |Smith |
|Anna | |
|Julia |Williams|
|Maria |Jones |
|Jen |Brown |
|Mike |Williams|
+---------+--------+
In order to get all columns from the struct column, use name.*.
df2.select("name.*").show(false)
This yields the below output.
+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James |null |Smith |
|Anna |Rose | |
|Julia | |Williams|
|Maria |Anne |Jones |
|Jen |Mary |Brown |
|Mike |Mary |Williams|
+---------+----------+--------+
9. Complete Example
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object SelectExamples extends App{
val spark = SparkSession.builder
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
val data = Seq(("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show(false)
df.select("firstname","lastname").show()
//Using Dataframe object name
df.select(df("firstname"),df("lastname")).show()
//Using col function
import org.apache.spark.sql.functions.col
df.select(col("firstname"),col("lastname")).show()
//Show all columns
df.select("*").show()
val columnsAll=df.columns.map(m=>col(m))
df.select(columnsAll:_*).show()
df.select(columns.map(m=>col(m)):_*).show()
//Show columns from list
val listCols= List("lastname","country")
df.select(listCols.map(m=>col(m)):_*).show()
//Show first few columns
df.select(df.columns.slice(0,3).map(m=>col(m)):_*).show(1)
//Show columns by index or position
df.select(df.columns(3)).show(3)
//Show columns from start and end index
df.select(df.columns.slice(2,4).map(m=>col(m)):_*).show(3)
//Show columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
df.select(df.columns.filter(f=>f.startsWith("first")).map(m=>col(m)):_*).show(3)
df.select(df.columns.filter(f=>f.endsWith("name")).map(m=>col(m)):_*).show(3)
//Show Nested columns
val data2 = Seq(Row(Row("James","","Smith"),"OH","M"),
Row(Row("Anna","Rose",""),"NY","F"),
Row(Row("Julia","","Williams"),"OH","F"),
Row(Row("Maria","Anne","Jones"),"NY","M"),
Row(Row("Jen","Mary","Brown"),"NY","M"),
Row(Row("Mike","Mary","Williams"),"OH","M")
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("state",StringType)
.add("gender",StringType)
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(data2),schema)
df2.printSchema()
df2.show(false)
df2.select("name").show(false)
df2.select("name.firstname","name.lastname").show(false)
df2.select("name.*").show(false)
}
This example is also available at Spark Scala GitHub Project for reference.
10. Conclusion
In this article, you have learned that select() is a transformation function on a DataFrame used to select one or more columns, and you have also learned how to select nested elements from a DataFrame.
Happy Learning !!
Related Articles:
- How to Add and Update DataFrame Columns in Spark
- How to Rename a DataFrame Column
- How to Drop DataFrame Column
- Spark Join Types
- Spark DataFrame groupBy()
- Spark Union() & UnionAll() Examples
- Spark Distinct Rows from DataFrame