Building Stateful Serverless Applications

Vipul Vaibhaw
4 min read · Oct 2, 2021

If your business deals with streaming data, you may already be familiar with event-driven software architectures. This architecture differs from the traditional request-driven model. Whenever you want to design a system where, say, anomalies need to be detected on the fly in real time, an event-driven architecture is a good choice.

You might be wondering: what is an event? It is a very valid question. Let me clear up one doubt first: an event is not a notification! An event is a change in state. It can be anything from a mouse click (user generated) to a sensor output (external source). It can also come from the system itself, such as loading a program. A very good blog on this topic: How a company built a distributed state machine?
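To make "an event is a change in state" concrete, here is a minimal Python sketch. The `Event` class and the `diff` helper are hypothetical names invented for this illustration; they do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    source: str     # who produced the change, e.g. "user", "sensor", "system"
    key: str        # which piece of state changed
    old_value: Any  # value before the change
    new_value: Any  # value after the change


def diff(source: str, before: dict, after: dict) -> list:
    """Emit one Event per key whose value differs between two snapshots."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            events.append(Event(source, key, old, new))
    return events


before = {"temperature_c": 21.5, "door": "closed"}
after = {"temperature_c": 23.0, "door": "closed"}
events = diff("sensor", before, after)
```

Only the temperature changed between the two snapshots, so only one event is produced. The unchanged door produces nothing, because nothing happened to it.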

There are, of course, certain benefits to using an event-driven architecture, but we won’t go into great detail in this article.

Subscribe to my newsletter for more articles on data engineering, artificial intelligence and distributed systems: https://distributeddataengines.substack.com/

Why Stateful Serverless Event-driven Pipelines?

Stateless Serverless Pipelines -

We need to understand that, classically, stream processing engines model a pipeline as a directed acyclic graph (DAG) of operators.

The classical use-cases which require stateless serverless setups are -

  1. Data Anonymization
  2. Counting events in a Tumbling Window of last 5 seconds (Using DataStream API — Apache Flink).
  3. Converting one timestamp to another.

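The use cases above can be stateless because processing the next event does not depend on anything remembered from earlier events. The window count is the borderline case: its counter lives only for the duration of one window and is discarded when the window closes.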
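The three use cases above can be sketched in plain Python. This is an illustration only, not Flink code: the event layout (`user_id`, `ts`) and the function names are assumptions, and hashing an ID is a stand-in for whatever anonymization scheme you actually need.

```python
import hashlib
from collections import Counter
from datetime import datetime, timezone


def anonymize(event: dict) -> dict:
    """Replace the user id with its SHA-256 digest (illustrative only)."""
    out = dict(event)
    out["user_id"] = hashlib.sha256(event["user_id"].encode("utf-8")).hexdigest()
    return out


def to_iso_utc(event: dict) -> dict:
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    out = dict(event)
    out["ts"] = datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat()
    return out


def pipeline(events):
    """Each event is handled on its own; nothing is carried to the next one."""
    for event in events:
        yield to_iso_utc(anonymize(event))


def tumbling_counts(events, size_s: int = 5) -> dict:
    """Count events per non-overlapping window of size_s seconds."""
    counts = Counter(event["ts"] // size_s for event in events)
    return dict(sorted(counts.items()))


raw = [
    {"user_id": "alice", "ts": 0},
    {"user_id": "bob", "ts": 3},
    {"user_id": "alice", "ts": 5},
]
processed = list(pipeline(raw))
counts = tumbling_counts(raw)
```

Because `anonymize` and `to_iso_utc` look only at the event in hand, any worker can process any event in any order, which is what makes this kind of pipeline easy to scale out.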

Usually, the following architecture can be used to design a scalable stateless serverless architecture.

Stateful Serverless Pipelines -

Stateless serverless pipelines have stood the test of time. Engineers have also built stateful pipelines on top of the classical computational model by attaching external data stores to them. However, we need something more to achieve real-time performance: accessing a database, or any other I/O, can become a bottleneck for real-time applications.
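For contrast with the stateless case, here is a sketch of an operator that must remember something between events: a per-sensor anomaly detector. The state lives in a local dictionary next to the computation, so handling an event needs no database round trip. The class and function names, the warm-up of 5 readings and the 3-sigma rule are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance (Welford's algorithm), updated per event."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


state = {}  # per-key state kept in local memory, not in a remote store


def is_anomaly(key: str, value: float, threshold: float = 3.0) -> bool:
    """Flag a value far from the running mean, then fold it into the state."""
    stats = state.setdefault(key, RunningStats())
    anomalous = (
        stats.n >= 5
        and stats.std > 0
        and abs(value - stats.mean) > threshold * stats.std
    )
    stats.update(value)
    return anomalous


readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 50.0]
flags = [is_anomaly("sensor-1", r) for r in readings]
```

Unlike the stateless operators, the answer for a reading here depends on every reading that came before it for the same sensor. If each call had to fetch and store `RunningStats` in an external database, that round trip would sit on the critical path of every event.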

Let’s look at a use case whose computational model benefits from moving away from a DAG.

The impact of tweets on Bitcoin Price

Stateful Computation Graph

We have seen many instances where the price of bitcoin was affected by a tweet.

What if we built a real-time analytics engine that predicts whether to buy or sell bitcoin based on tweets? I know this is unreasonable as actual finance, but hey, let us focus on the engineering only. Shall we?

In order to build something like that, we need the following things -

  1. Word2vec: convert tweets to vectors so that the predictive model can take them as input.
  2. Num2vec: convert prices to vectors so that the predictive model can use them.
  3. A predictive model: to predict whether to buy or sell.

Now, we need the ability to update the predictive model when it is wrong, and to update the word2vec model if needed. In other words, we need to pass feedback upstream, which means having a loop in the computation model, something a DAG does not allow.
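The feedback loop described above can be sketched as a toy, single-process simulation in Python. This is not Flink's API: `Runtime`, `embedder` and `predictor` are hypothetical names, the weight update is a placeholder rather than real word2vec training, the price input is left out, and the true outcome travels with the tweet only to keep the example short.

```python
from collections import deque


class Runtime:
    """Toy dispatcher: routes messages to named functions and keeps their state."""

    def __init__(self):
        self.functions = {}
        self.state = {}
        self.queue = deque()

    def register(self, name, fn, initial_state):
        self.functions[name] = fn
        self.state[name] = initial_state

    def send(self, target, message):
        self.queue.append((target, message))

    def run(self):
        while self.queue:
            target, message = self.queue.popleft()
            self.functions[target](self, self.state[target], message)


def embedder(rt, state, msg):
    """Stand-in for word2vec: scores a tweet, and learns from feedback."""
    if msg["kind"] == "tweet":
        words = msg["text"].lower().split()
        score = sum(state["weights"].get(w, 0.0) for w in words)
        rt.send("predictor", {"kind": "features", "words": words,
                              "score": score, "outcome": msg["outcome"]})
    elif msg["kind"] == "feedback":
        for w in msg["words"]:
            state["weights"][w] = state["weights"].get(w, 0.0) + 0.5 * msg["direction"]


def predictor(rt, state, msg):
    """Decides buy or sell; on a wrong call, sends feedback back upstream."""
    decision = "buy" if msg["score"] > 0 else "sell"
    state["decisions"].append(decision)
    if decision != msg["outcome"]:
        direction = 1.0 if msg["outcome"] == "buy" else -1.0
        rt.send("embedder", {"kind": "feedback", "words": msg["words"],
                             "direction": direction})


rt = Runtime()
rt.register("embedder", embedder, {"weights": {}})
rt.register("predictor", predictor, {"decisions": []})

tweets = [("bitcoin to the moon", "buy"), ("moon soon", "buy"), ("bitcoin banned", "sell")]
for text, outcome in tweets:
    rt.send("embedder", {"kind": "tweet", "text": text, "outcome": outcome})
    rt.run()

decisions = rt.state["predictor"]["decisions"]
weights = rt.state["embedder"]["weights"]
```

The message from `predictor` back to `embedder` is the cycle: it is an edge that a DAG of operators cannot express. In this run the first call is wrong, the feedback raises the weight of "moon", and the second tweet is then classified correctly.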

Apache Flink provides the Stateful Functions API alongside the classical DataStream and Table APIs, which will help us build a stateful serverless event-driven analytics engine.

Apache Flink — Stateful API

Handling state in a serverless environment is a challenge. Apache Flink offers a new set of APIs to help us build serverless stateful event-driven systems.

Stream processing has some useful properties: compatibility, state management and performance. Apache Flink combines these properties with the simplicity and light weight of FaaS (Functions-as-a-Service).

Apache Flink’s Stateful APIs give you superpowers to build computational models for event-driven systems in which multiple functions manage state and message each other with exactly-once semantics.

Check out this link if you want to read more: https://statefun.io/

Final Thoughts

Please feel free to reach out to me if you are interested in building data systems at your organisation. I would be more than happy to help.

If you like my articles and want me to write more, feel free to buy me a coffee! 😁

I hope you enjoyed this read. If so, please share this article with your friends; let’s build a solid community. I will be back next week with another well-thought-out, well-researched article delivered straight to your inbox. See you!

You can connect with me on Twitter or LinkedIn.

Originally published at https://distributeddataengines.substack.com on October 2, 2021.

Vipul Vaibhaw

I am passionate about computer engineering. Building scalable distributed systems| Web3 | Data Engineering | Contact me — vaibhaw[dot]vipul[at]gmail[dot]com