It’s fair to say that the route of simply building an app and getting it out there is pretty much gone. The old adage that “ideas are plentiful but it’s all about the execution” still holds. The execution aspect can be taken to another level though: assume you have a competitor and the outcome is going to be decided on one metric, time.
Time of Execution is the difference between making the sale and the customer going somewhere else.
The more streaming applications I work with, the more fanatical I’ve become about reducing latency, and it comes down to one basic question:
How can I make this end-to-end application faster?
Say, for example, we receive messages over HTTP (a web endpoint, REST or what have you) and they’re sent to some form of persistence at the other side. There’s a start point, some actions and an end goal. I can write this down in a diagram easily enough….
- Client sends message to endpoint address.
- Producer sends message to topic.
- Consumer processes the message.
- It’s persisted (that could be to an S3 bucket or a database, for example).
From left to right everything has a time implication. I’ve put some basic millisecond times in my diagram, merely guesses.
- Client sends message to endpoint address. (200ms)
- Producer sends message to topic. (10ms)
- Consumer processes the message. (1500ms)
- It’s persisted (that could be to an S3 bucket or a database, for example). (2500ms)
So from start to end we’re estimating the time to complete the entire process at 4210ms, or just over four seconds. That’s probably a worst-case number; things may work in our favour and the times may be much faster. Either way, log it, keep a record and review the min/max/average times.
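As a rough illustration of the “log it and review it” point, here’s a minimal Clojure sketch of wrapping a step in a timer and summarising the recorded values. The function names are my own for illustration; it’s just the shape of the measurement, not any particular library.

```clojure
(ns latency.timing)

(defn timed
  "Runs the zero-argument function f, returning its result and the elapsed
   time in milliseconds."
  [f]
  (let [start  (System/nanoTime)
        result (f)]
    {:result     result
     :elapsed-ms (/ (- (System/nanoTime) start) 1e6)}))

(defn summarise
  "Given a collection of elapsed-ms values, reports the min/max/average."
  [timings]
  {:min (apply min timings)
   :max (apply max timings)
   :avg (/ (reduce + timings) (count timings))})

;; Collect :elapsed-ms for each message as it's processed, then review:
;; (summarise [1480.2 1510.7 1623.9]) ;; => min 1480.2, max 1623.9, avg ~1538
```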
Perhaps if I put it in other terms: assume you lose $5 of revenue for every lost transaction and you’re losing 20 a day due to the speed of response. That’s 20 × $5 × 365 = $36,500 a year…. that just will not do.
What are the things I can change?
So what are the things that can be changed? Well, in the instance above, not a lot. If this were a Kafka setup I’d be using the Confluent REST Proxy for my endpoint, so nothing can be done there.
Topic management is handled by Kafka itself, though there are a few things we can do here in the tuning, such as putting as much RAM in the box as possible and reducing any form of disk write I/O (disk writes will slow things down).
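To make that a little more concrete, these are the sort of broker-level knobs involved. The values below are examples only, not recommendations, and anything like this needs testing against your own workload. The RAM point is largely about the OS page cache, which Kafka leans on heavily, so free memory on the broker box isn’t wasted.

```properties
# server.properties - illustrative values only, not recommendations.
# Leave flushing to the OS page cache rather than forcing frequent fsyncs;
# forced flushes are exactly the disk write I/O that slows brokers down.
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=9223372036854775807
# More threads for network and disk work on a well-provisioned box.
num.network.threads=8
num.io.threads=16
# Larger socket buffers for high-throughput producers and consumers.
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
```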
The consumer is the one thing we have a reasonable amount of control over, as there’s a good chance it’s been coded by someone, whether in house or outsourced. Know thy code and test the routine in isolation. The nice thing about functional languages like Clojure is that I can test functions outside of frameworks if they’re coded right: it’s data in and data out.
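To make that concrete, here’s the kind of isolated test I mean, with a hypothetical enrich function standing in for the consumer’s transform. No broker and no framework required: just data in, data out.

```clojure
(ns consumer.transform-test
  (:require [clojure.test :refer [deftest is]]))

;; Hypothetical transform used by the consumer: pure data in, data out.
(defn enrich [message]
  (assoc message
         :id           (str (java.util.UUID/randomUUID))
         :processed-at (System/currentTimeMillis)))

;; The function can be exercised (and timed) without Kafka or any framework.
(deftest enrich-keeps-the-payload
  (let [out (enrich {:payload "hello"})]
    (is (= "hello" (:payload out)))
    (is (string? (:id out)))))
```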
Persistence is an interesting one. It’s rarely in your control. If it’s a database system then, like the Kafka settings, you might have some leverage over the configuration, but that’s about it. When you get into cloud-based storage like Amazon S3 you are at the mercy of the connection, the bandwidth and the service at the other end; you have very little control over what happens. The endpoint is usually where the most latency will occur.
With AWS Kinesis you don’t have that kind of luxury; it’s pretty much set for you and that’s that. You can increase the shard count and add more consumers, but each shard still has fixed read limits (on the order of five reads per second per shard), so you’re scaling out and paying more in the long run. If you want total millisecond control, then it’s Kafka I’d be going for.
That Was Too Simple…
Consumers are expected to do anything from the simple, like passing a message through into storage as in the previous example, to things that are more complex: perhaps a cleaning transformation in the consumer plus a call to a third-party API to get an id from some of the message data. More milliseconds to think about and some interesting thoughts on the process.
Here’s my new process model.
We can assume the HTTP REST to topic section remains the same; the consumer is doing the work here.
- Receive the message from the topic.
- Consumer starts processing.
- Do transformation 1; this might add a unique UUID and a timestamp to mark our processing block.
- In parallel, an API call to get ids from a third party.
- Combine the results from the transform and getids into the outgoing map (sketched in code after this list).
- Persist the results as previously done.
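Onyx would express this as a workflow graph; as a plain-Clojure sketch of the same shape, with hypothetical transform and get-ids functions standing in for the real work, the two branches run in parallel and the elapsed time is dominated by the slower one.

```clojure
(ns consumer.workflow)

;; Hypothetical third-party lookup - in reality an HTTP call taking ~2500ms.
(defn get-ids [message]
  (Thread/sleep 2500)
  {:external-ids [101 102]})

;; Hypothetical local transform taking ~200ms: adds a uuid and a timestamp.
(defn transform [message]
  (Thread/sleep 200)
  (assoc message
         :uuid         (str (java.util.UUID/randomUUID))
         :processed-at (System/currentTimeMillis)))

;; Run the API call on another thread while the transform runs locally, then
;; merge both results into the outgoing map. Elapsed time is roughly
;; max(200, 2500), not 200 + 2500.
(defn process [message]
  (let [ids         (future (get-ids message))
        transformed (transform message)]
    (merge transformed @ids)))
```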
When functions are split out (as they can be in the Onyx framework, for example) the latency floor for the workflow graph is set by the node that takes the longest to complete. In our case that’s the getids function.
Let’s put some timings down and see how things look.
I’ve amended the consumer, as that’s where the changes have happened. There’s 750ms from the inbound message being picked up, 200ms for the transform and 2500ms for the third-party API call. Since the transform runs in parallel with the API call, only the longer branch counts, so our 4210ms process now becomes 200 + 10 + 750 + 2500 + 2500 = 5960ms.
We can tune one thing, the transform function, but there’s not a lot we can do about the third-party API call. Once again, like the persistence, this one is really out of our control: there’s bandwidth, the connection and someone else’s server to take into account.
The Kafka Log Will Just Catch It All
Yes, it will, and you hope the consumers will catch up during any gaps in the incoming message delivery. What I’d like, all in all, is to know those consumers are processing as fast as they can.
It’s a fine balance of message size, message retention (TTL) and consumer throughput. I’ve seen first hand consumers fail on a Friday and every broker fill up and die by the Monday. (No 24/7 support, so not my problem.) 🙂
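To illustrate the size and retention side of that balance, these are the kind of topic-level settings involved; the values are examples only and would need sizing against your own traffic.

```properties
# Illustrative topic-level settings - example values, not recommendations.
# Keep messages for seven days...
retention.ms=604800000
# ...or until the partition reaches roughly 5GB, whichever comes first.
retention.bytes=5368709120
# Cap individual message size so worst-case disk growth stays predictable.
max.message.bytes=1048576
```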
Enough disk space to handle the worst-case scenarios is vital. Speaking of which….
Expect the Worst Case Scenario
When using third-party services, including AWS or any other cloud provider, work on the rule that these services do go down. What’s the backup plan? Do you requeue the message and go around again? What should happen to messages that fail mid-consumer: return them to the queue or dump them to a failed-message topic?
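Here’s a minimal sketch of that decision in consumer code, assuming hypothetical process! and send! functions wrapping the real processing logic and producer; the retry limit and topic names are made up for illustration.

```clojure
(ns consumer.failure)

(defn handle
  "Tries to process a message; on failure either requeues it (go around
   again) or, after a few attempts, dumps it to a failed-message topic."
  [process! send! message]
  (try
    (process! message)
    (catch Exception e
      (if (< (:attempts message 0) 3)
        ;; Go around again: requeue with an incremented attempt count.
        (send! "incoming-topic" (update message :attempts (fnil inc 0)))
        ;; Give up: park it on a failed-message topic for later inspection.
        (send! "failed-messages" (assoc message :error (.getMessage e)))))))
```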
Whenever there’s a third party involved you have to plan for the worst. Even with services I’ve written myself, or have been involved in, I still ask the question, “what happens if this goes down?”, and while the team might look at me like I have ten heads (remotely) because it’s never gone down…. well, that doesn’t mean it never will. And the day it does, my consumers won’t really know about it; they’ll just fail repeatedly.
You Can Control Your Latency
There are things you can control and things you can’t. From the outset think about the things you can control and how to measure each step in the process. Is there a bottleneck? Is it code that could be refactored? Does the Docker container add latency over the network? What about your Marathon/Mesos deploys? The list is long and not everything needs dealing with; some things will be more critical than others.
One thing I’ve learned is every streaming application is different. You can learn from your previous gains and losses but it’s still a case of starting on paper and seeing what can be shaved off where and what’s not in your control.
Ultimately it’s all fun and it’s all learning.