Apache Spark 3.5.0 Most Important Features

Apache Spark 3.5.0 resolved more than 1,300 issues and includes several significant features and enhancements over previous versions. These changes make Apache Spark 3.5.0 a significant release that addresses many pain points and limitations of earlier versions and improves both performance and usability.

What are the key features and improvements released in Spark 3.5.0?

The following are some of the key features and improvements:

  • Spark Connect: This release extends the general availability of Spark Connect with support for Scala and Go clients, distributed training and inference support, and enhanced compatibility for Structured Streaming.
  • PySpark and SQL Functionality: New functionality has been introduced in PySpark and SQL, including the SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions.
  • Distributed Training with DeepSpeed: The release simplifies distributed training with DeepSpeed, making it more accessible.
  • Structured Streaming: This release introduces watermark propagation among operators and the dropDuplicatesWithinWatermark operation, enhancing the capabilities of Structured Streaming.
  • English SDK: The English SDK for Apache Spark brings Generative AI to Apache Spark, so results can be obtained from plain English instructions.

Spark Connect

Spark Connect, first released with a Python client in Spark 3.4.0, is a decoupled client-server architecture that enables remote access to Spark clusters. It uses the DataFrame API on the client side and unresolved logical plans as the communication protocol. Decoupling the client from the server makes it possible to use Spark and its extensive ecosystem from remote locations. For example, with Spark Connect users can connect IDEs, notebooks, and modern data applications directly to Spark clusters from remote servers.
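
As a rough sketch of what this looks like from the Python side, a client builds an `sc://` connection string and passes it to `SparkSession.builder.remote()`. The helper names `connect_url` and `remote_session` below are hypothetical, the host is a placeholder, and 15002 is assumed as the port because it is the Spark Connect default. The client also needs the Connect extras, for example via `pip install "pyspark[connect]"`.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def remote_session(host: str, port: int = 15002):
    """Open a SparkSession against a remote Spark Connect server.

    PySpark is imported lazily so that connect_url() can be used
    (and tested) on machines where PySpark is not installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(connect_url(host, port)).getOrCreate()
```

With a Spark Connect server already running on the same machine, `remote_session("localhost").range(5).show()` would build the plan on the client and execute it on the server.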

New Spark Connect features in Spark 3.5

  • In Spark 3.5.0, Spark Connect supports both Scala and Go clients.
  • It adds support for more Python libraries, including the Pandas API on Spark and PyTorch-based distributed ML.
  • It adds Structured Streaming support for both Python and Scala.
  • It includes a range of compatibility improvements between Spark native and Spark Connect clients across Python and Scala.

English SDK for Apache Spark

The English SDK for Apache Spark is a tool that translates plain English instructions into PySpark objects, such as DataFrames. Its primary objective is to make Spark more user-friendly and accessible by bringing Generative AI to Apache Spark. With it, you can obtain results from straightforward, easy-to-understand English instructions, which opens Spark up to a broader audience.


The English SDK for Apache Spark can be installed using the pip command. Install pyspark-ai and its other dependencies:


# Install pyspark-ai
%pip install pyspark-ai

# Install openai
%pip install openai

# Install langchain
%pip install langchain
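
Once the packages are installed, usage might look like the sketch below. It assumes the `SparkAI` class and the `.ai.transform()` DataFrame accessor from pyspark-ai, and it assumes an LLM provider is configured (with the default settings this typically means an `OPENAI_API_KEY` environment variable). The helper names `activate_english_sdk` and `transform_in_english` are hypothetical wrappers added for illustration.

```python
def activate_english_sdk():
    """Create and activate the English SDK (needs pyspark-ai and an LLM API key)."""
    # Imported lazily so this module loads even without pyspark-ai installed.
    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()  # attaches the `.ai` accessor to Spark DataFrames
    return spark_ai


def transform_in_english(df, instruction: str):
    """Ask the SDK to turn an English instruction into a new DataFrame."""
    return df.ai.transform(instruction)
```

After calling `activate_english_sdk()`, something like `transform_in_english(sales_df, "total revenue per country, highest first")` would return a regular DataFrame, where `sales_df` stands for any DataFrame you already have. The generated query comes from a language model, so it is worth checking the result before relying on it.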

Improved PySpark Features

This release includes many significant updates and new features for Apache Spark in Python (PySpark). The following are some of the notable ones:

  1. Support for positional parameters in Python’s sql() function (SPARK-44140) has been introduced, allowing for more flexible SQL operations.
  2. The sql() function in Spark now supports parameterized SQL (SPARK-41666), enhancing the versatility of SQL queries.
  3. Python user-defined table functions (UDTF) are now supported (SPARK-43797), expanding the range of operations that can be performed with user-defined functions.
  4. Users can set the Python executable for UDF and pandas function APIs in workers during runtime (SPARK-43574), enabling better integration with Python environments.
  5. The dir() function in pyspark.sql.dataframe.DataFrame has been updated to include columns (SPARK-43270), making it easier to explore the DataFrame structure.
  6. TimestampNTZType has been exposed in pyspark.sql.types (SPARK-43759), expanding support for timestamp data.
  7. Support for nested timestamp types (SPARK-43545) further enhances the handling of structured data.
  8. UserDefinedType can be used to create DataFrames from pandas DataFrames and convert to pandas (SPARK-43817), providing more flexibility in data interchange.
  9. The PySpark Protobuf API now includes the descriptor binary option (SPARK-43799), extending support for binary data.
  10. Generic tuples can now be accepted as typing hints for Pandas UDF (SPARK-43886), allowing for more expressive type hints.
  11. A new array_prepend function has been added (SPARK-41233), enhancing array manipulation capabilities.
  12. The assertDataFrameEqual utility function has been introduced (SPARK-44061), facilitating data validation.

Improved Spark SQL Features

Improvements to Spark Core

Conclusion

I have covered a fraction of the features released in Apache Spark 3.5.0. For the complete list, please refer to Spark’s official documentation.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium