Spark cannot resolve given input columns

The error Exception in thread “main” org.apache.spark.sql.AnalysisException: cannot resolve <given-column> given input columns: [columns]; is typically raised in Spark when you refer to a column that does not exist in the DataFrame or input data source.

The sections below explain how to resolve Exception in thread “main” org.apache.spark.sql.AnalysisException: cannot resolve <given-column> given input columns: [columns] and walk through complete examples that trigger it.

1. How to resolve the error cannot resolve given input columns

To resolve the cannot resolve given input columns error in Spark, carefully check the column names that you reference in your SQL query or DataFrame operation. Make sure that they are spelled correctly and that they exist in the input data source. The list in square brackets at the end of the error message shows the columns Spark actually found.

Here are some additional troubleshooting steps that you can take:

  1. Check the schema of the input DataFrame or SQL table to make sure that the column names are correct.
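  2. Verify that the input source actually produced columns. An empty DataFrame that was created with a schema still has its columns, but reading an empty file (for example, an empty CSV) without a schema can yield a DataFrame with no columns to reference.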
  3. Ensure that you have correctly specified the input data source and that it is properly loaded into your Spark application.
  4. If the data source carries no usable column names (for example, a CSV file without a header row), specify the schema explicitly by calling schema() on the reader before load().

By taking these steps, you should be able to resolve the cannot resolve given input columns error in Spark and successfully execute your SQL query or DataFrame operation.
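
The first troubleshooting step can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the application name, the sample rows, and the wanted and missing values are assumptions made for the example and do not come from any particular pipeline.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]").appName("InspectColumns")
  .getOrCreate()
import spark.implicits._

// Illustrative DataFrame; substitute the one that raises the error
val df = Seq((1, "John", 20), (2, "Jane", 25)).toDF("id", "name", "age")

// Print the column names and types that Spark resolved
df.printSchema()

// Columns the query intends to use (hypothetical list for this sketch)
val wanted = Seq("id", "full_name")

// Exact string comparison; Spark itself matches names case-insensitively by default
val missing = wanted.filterNot(c => df.columns.contains(c))

println(s"Available: ${df.columns.mkString(", ")}")
println(s"Missing: ${missing.mkString(", ")}")
```

Running a check like this before the failing query shows which of the referenced names are absent from the DataFrame.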

2. Examples of the error cannot resolve given input columns

Here are some examples of when you might encounter the org.apache.spark.sql.AnalysisException: cannot resolve given input columns error:

Example 1: Incorrect column name

Suppose you have a DataFrame df with columns id, name, and age. If you try to run the following query, you will encounter the error:


// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark:SparkSession = SparkSession.builder()
    .master("local[1]").appName("SparkByExamples.com")
    .getOrCreate()

// Create DataFrame
import spark.implicits._
val data = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
val df = data.toDF("id", "name", "age")

// Throws AnalysisException because the full_name column is not present in the DataFrame
df.select("id","full_name")
    .show()

In this case, the error message will say something like: “cannot resolve ‘full_name’ given input columns: [id, name, age]”.

The problem is that full_name is not a valid column name in the df DataFrame. To fix the error, you should use the correct column name, which is name.


val result = df.select("id", "name")
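
If the column names come from user input or configuration, it may help to catch the exception instead of letting the job terminate. The sketch below assumes the same sample data as Example 1; the attempt value and the printed message are illustrative choices, not part of the Spark API.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Success, Try}

val spark = SparkSession.builder()
  .master("local[1]").appName("CatchAnalysisException")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
  .toDF("id", "name", "age")

// select() is analyzed eagerly, so the exception is raised here, before show()
val attempt = Try(df.select("id", "full_name"))

attempt match {
  case Success(result) => result.show()
  case Failure(e: AnalysisException) =>
    // The message lists the columns that were available
    println(s"Column lookup failed: ${e.getMessage}")
  case Failure(other) => throw other
}
```

Because the analysis happens when select() is called, wrapping only that call is enough to intercept the error; no action such as show() needs to run first.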

Example 2: DataFrame not loaded properly

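Suppose you load a DataFrame from a CSV file using the read method in Spark SQL, but the path does not point to the data you expect (for example, it points to an empty file). Referencing a column afterwards produces the error:


// Load from CSV file
val df = spark.read.format("csv")
           .load("path/to/incorrect/file.csv")

// Fails because the DataFrame has no columns
df.select("_c0").show()

The error message will say something like: “cannot resolve ‘_c0’ given input columns: []”. Note that when the path does not exist at all, Spark typically fails earlier, at load(), with an AnalysisException stating that the path does not exist.

In this case, the problem is that the DataFrame was not loaded properly because the file path is incorrect, so it contains no columns. To fix the error, you should provide the correct path to the CSV file.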


// Load from CSV file
val df = spark.read.format("csv")
      .load("path/to/correct/file.csv")
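
One way to catch this kind of problem early is to validate the path and the resulting columns before running any query. The following is only a sketch: pathExists is a hypothetical helper written for this example (built on the Hadoop FileSystem API), and the temporary file stands in for a real input so that the snippet runs on its own.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]").appName("ValidateInputPath")
  .getOrCreate()

// Write a small throwaway CSV file so the sketch is self-contained
val tmp = Files.createTempFile("people", ".csv")
Files.write(tmp, "1,John,20\n2,Jane,25\n".getBytes("UTF-8"))

// Hypothetical helper: true if the path exists on its file system
def pathExists(p: String): Boolean = {
  val path = new Path(p)
  path.getFileSystem(spark.sparkContext.hadoopConfiguration).exists(path)
}

val input = tmp.toString
require(pathExists(input), s"Input path not found: $input")

val df = spark.read.format("csv").load(input)

// An empty column list means nothing was read, so any column reference would fail
require(df.columns.nonEmpty, s"No columns were read from $input")
df.printSchema()
```

Failing fast with a message that names the path is usually easier to diagnose than a column-resolution error raised further down the pipeline.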

Example 3: DataFrame schema not specified

Suppose you load a DataFrame from a CSV file that does not have a header row. If you load the file without specifying the schema and then refer to columns by the names you expect, you will get the error:


// Load without header
val df = spark.read.format("csv").option("header", "false")
    .load("path/to/file.csv")

// Fails because the columns are named _c0, _c1, _c2
df.select("id").show()

The error message will say something like: “cannot resolve ‘id’ given input columns: [_c0, _c1, _c2]”.

In this case, the problem is that no schema was specified and there is no header row to take names from, so Spark assigns default column names (_c0, _c1, and so on). To fix the error, you should explicitly specify the schema for the input data source.


// Create schema
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType), StructField("age", IntegerType)))

// Use schema while reading a CSV file
val df = spark.read.format("csv").option("header", "false").schema(schema).load("path/to/file.csv")
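
When only the column names are needed and string-typed values are acceptable, a lighter alternative to a full schema is to rename the default columns positionally with toDF. This is a sketch under that assumption; the temporary file and the raw and named values are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]").appName("RenameDefaultColumns")
  .getOrCreate()

// Write a small headerless CSV file so the sketch is self-contained
val tmp = Files.createTempFile("people", ".csv")
Files.write(tmp, "1,John,20\n2,Jane,25\n".getBytes("UTF-8"))

// Without a header or schema, Spark names the columns _c0, _c1, _c2
val raw = spark.read.format("csv").option("header", "false").load(tmp.toString)
println(raw.columns.mkString(", "))

// Rename positionally; without a schema every column is read as a string
val named = raw.toDF("id", "name", "age")
named.select("id", "name").show()
```

The explicit schema shown earlier remains the better choice when column types matter, since toDF changes only the names.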

3. Conclusion

In conclusion, the org.apache.spark.sql.AnalysisException: cannot resolve given input columns error occurs when there is a mismatch between the input data source and the SQL query or DataFrame operation. This can be caused by referencing a non-existent column, loading the input data source incorrectly, or failing to specify the schema for the input data source.

To fix this error, you should carefully check the column names that you are referencing, verify that the input data source contains data and is loaded correctly, and explicitly specify the schema for the input data source if necessary. By taking these steps, you can resolve the error and successfully execute your Spark SQL queries and DataFrame operations.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.