
When choosing streaming ingestion over batch ingestion, we should consider the following points:
- If I ingest the data in real time, can downstream systems handle the rate of data flow?
- Do I need millisecond-level real-time data, or would a micro-batch approach work, ingesting data every minute or so? (A sketch of a micro-batch trigger follows this list.)
- What benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement over batch?
- Will streaming be costlier?
- Are my streaming pipeline and system reliable and robust if infrastructure fails?
- Should I use a managed service (Kinesis, Google Pub/Sub, Dataflow), or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? Who will manage it? What are the costs and trade-offs?
- If I’m deploying an ML model, what benefits do I gain from online predictions and possibly continuous training?
As you can see, while streaming-first might seem like a good idea, it isn’t always straightforward; it inherently brings extra costs and complexities.
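
To make the micro-batch question concrete, here is a minimal sketch using Spark Structured Streaming, one of the engines mentioned above. The broker address, topic name, and output paths are hypothetical placeholders, and the Kafka connector package is assumed to be on the Spark classpath; the point is only that a one-minute processing trigger often delivers "near real time" data with far less operational pressure than true millisecond-level streaming.

```python
# Minimal sketch, not a production pipeline: micro-batch ingestion from Kafka
# with Spark Structured Streaming. All names/paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("micro-batch-ingestion-sketch")
    .getOrCreate()
)

# Read events from a Kafka topic as a streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker address
    .option("subscribe", "events")                      # hypothetical topic name
    .load()
)

# Land the raw events as Parquet once per minute instead of continuously;
# the trigger interval sets the micro-batch cadence.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/raw/events")                         # hypothetical output path
    .option("checkpointLocation", "/data/checkpoints/events")   # hypothetical checkpoint dir
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```

If a one-minute cadence is acceptable, a sketch like this keeps downstream systems from being overwhelmed by per-event writes while still delivering data far faster than daily or hourly batch jobs.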