Spark Streaming – Different Output Modes Explained

Let’s look at the differences between the complete, append, and update output modes (outputMode) in Spark Streaming. outputMode() describes what data is written to the data sink (console, Kafka, etc.) when there is new data available in the streaming input (Kafka, socket, etc.).
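
The examples in this article assume df is a streaming DataFrame with a single string column named value. As a minimal setup sketch (the host, port, and application name below are placeholders), it could be created from a socket source like this:

import org.apache.spark.sql.SparkSession

// Create a local SparkSession for the examples (settings are illustrative).
val spark = SparkSession.builder()
  .appName("OutputModeExamples")
  .master("local[*]")
  .getOrCreate()

// Streaming DataFrame with one row per line received on the socket;
// each row has a single string column named "value".
val df = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9090)
  .load()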

Streaming – Append Output Mode

OutputMode in which only the new rows in the streaming DataFrame/Dataset will be written to the sink.

This is the default mode. Use append as the output mode, outputMode("append"), when you want to write only the new rows to the output sink.


  // Write only the rows added since the last trigger to the console sink.
  df.writeStream
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()
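
Note that append mode assumes rows written to the sink are final and never change. If the query contains a streaming aggregation, append mode requires a watermark on an event-time column so that Spark knows when a window is closed and its result is final; without one, the query fails with an AnalysisException. A minimal sketch, assuming the input also has an event-time column named timestamp:

import org.apache.spark.sql.functions.{window, col}

// Append mode on an aggregation needs a watermark; the column name and
// durations here are illustrative.
val windowedCounts = df
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

windowedCounts.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()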

Streaming – Complete Output Mode

OutputMode in which all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are updates.

Use complete as the output mode, outputMode("complete"), when you want to aggregate the data and write the entire result to the sink on every trigger. This mode is supported only when the query contains streaming aggregations. One example would be counting the words in streaming data, aggregating the counts with previous data, and writing the complete result to the sink.


import org.apache.spark.sql.functions.{explode, split, col}

// Split each line into words and count the occurrences of each word
// seen so far across the stream.
val wordCountDF = df.select(explode(split(col("value"), " ")).alias("word"))
  .groupBy("word").count()

// Write the full, updated aggregate table to the console on every trigger.
wordCountDF.writeStream
           .format("console")
           .outputMode("complete")
           .start()
           .awaitTermination()
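
To make this concrete, suppose the first micro-batch receives the line "hello spark" and the second receives "hello world". In complete mode, the console sink prints the entire aggregate table on every trigger (the output below is illustrative):

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|hello|    1|
|spark|    1|
+-----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|hello|    2|
|spark|    1|
|world|    1|
+-----+-----+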

If you use complete mode on a non-aggregated stream, the following exception is thrown:

“org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets”

Streaming – Update Output Mode

OutputMode in which only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are updates.

It is similar to complete mode with one exception: update mode, outputMode("update"), writes only the aggregated results that were updated since the last trigger to the data sink, not the entire aggregated result as complete mode does. If the streaming data is not aggregated, update mode behaves the same as append mode.


// Same word-count aggregation as in the complete-mode example.
val wordCountDF = df.select(explode(split(col("value"), " ")).alias("word"))
  .groupBy("word").count()

// Write only the aggregated rows that changed in this trigger to the console.
wordCountDF.writeStream
           .format("console")
           .outputMode("update")
           .start()
           .awaitTermination()
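
With the same input as before ("hello spark" in the first micro-batch, "hello world" in the second), update mode writes only the rows whose counts changed in each trigger, so the unchanged spark row is not re-emitted in the second batch (the output below is illustrative):

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|hello|    1|
|spark|    1|
+-----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|hello|    2|
|world|    1|
+-----+-----+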

Conclusion:

In this article, you have learned what output mode (outputMode) is in Spark Streaming, the differences between the append, complete, and update modes, and how each mode affects the results when writing streaming data to a data sink.

Next steps

You can also read the articles Streaming JSON files from a folder and from TCP socket to learn about different ways of streaming. I would also recommend reading Spark Streaming + Kafka Integration and Structured Streaming with Kafka for more knowledge of structured streaming.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
