Kafka acting up like a toddler is a symptom and not the cause, you’re doing something to it for it to act that way.
Strive to reduce latency. Remember the Rule of 72.
Use frameworks but remember there’s added latency so make sure you can tune it.
Log everything.
Time everything.
Know your message size and define the broker memory on “write throughput * 30” seconds accordingly.
Know your log retention size and time, if you don’t know it then it’s probably 156 hours.
Ensure your consumers can neatly die if they need to without wrecking the other consumers.
That niggle inside saying something ain’t right, heed it.
If another company can process thirty million messages a second, you can too.
Bonus: Tea Solves Everything