Apache Spark is (at the time of writing) the most popular “big” data processing library these days. While the Dataframe has a really nice DSL to express pipelined transformations, it does suffer from lack of type safety and is little too imperative IMO. The transformations tend to get hairy and cumbersome to read and reason about especially if there are multiple intermediate steps or operators.

Here I am presenting a way to decompose such pipelines (I use this term within the context of a single spark application having multiple transformations) so you get smaller steps: good for testing and improves…

Play is one of my favorite frameworks both from the perspective of ease of use, design and extensibility. I always used Akka Http (which has a very elegant DSL) and now that Akka powers Play, I decided to give it a chance and I was suitably impressed. There are tons of great features supported out of the box, other than a very intuitive model, including:

  • non-blocking and blazing fast
  • robust JSON support
  • websockets
  • pluggable architecture and many others

Play has multiple ways to declare an Action and all these different methods are really just instances of the ActionBuilder trait. This…

It is awkward in scala to work with ListenableFuture that is returned when you are using cassandra java driver to make asynchronous calls.

The gist below converts the ListenableFuture to scala Future that you can compose.

Many modern streaming applications need to provide statistics/aggregate data in realtime. I have see too many such applications that simply add Redis to the mix to perform such aggregations: last X values, rolling mean etc.

I propose a simpler mechanism to do this if you already using Kafka and KafkaStreams.

I will demonstrate below how to build a Restful service which can return the last 5 values sent to a specified topic and key. This is an extremely low latency query that could be used to

  • monitor kafka topics
  • allow interactive query on kafka
case class RawData(id:String, value:Double, ts:Long)
object KTableFromTopic…

In this post I will show how to use the very powerful flatMapWithState in Spark structured streaming to aggregate arbitrary state. To make things interesting, I chose to integrate a Java rule engine drools to map over streaming data. This is a pretty powerful paradigm. It enables the following:

  • business rules are external to the application so can be changed independently. Changes can be made by business analysts.
  • Extends the power of rules execution by leveraging the distributed execution model provided by Spark.

So, here are the details. I will be sharing the full source code on github in the…

Vishal Chawla

I am a recovering entrepreneur. I dabble in everything data related and am fluent in scala, spark, kafka, cassandra and flink.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store