
When choosing streaming ingestion over batch ingestion, we should consider the following points:
- If I ingest the data in real time, can downstream systems handle the rate of data flow?
- Do I need millisecond-level real-time data, or would a micro-batch approach work, ingesting data every minute or so? (A sketch of a micro-batch trigger follows this list.)
- What benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement over batch?
- Will streaming be costlier?
- Are my streaming pipeline and system reliable and robust if infrastructure fails?
- Should I use a managed service (Kinesis, Google Pub/Sub, Dataflow), or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? Who will manage it? What are the costs and trade-offs?
- If I’m deploying an ML model, what benefits do I gain from online predictions and possibly continuous training?
As you can see, while streaming-first might seem like a good idea, it isn’t always straightforward; it inherently brings extra costs and complexities.
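
To make the micro-batch question concrete, here is a minimal sketch using Spark Structured Streaming, one of the engines mentioned above. The broker address, topic name, and output paths are hypothetical placeholders, and the Kafka connector package is assumed to be on the Spark classpath; the point is only that a one-minute processing trigger often delivers "near real time" data with far less operational pressure than true millisecond-level streaming.

```python
# Minimal sketch, not a production pipeline: micro-batch ingestion from Kafka
# with Spark Structured Streaming. All names/paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("micro-batch-ingestion-sketch")
    .getOrCreate()
)

# Read events from a Kafka topic as a streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker address
    .option("subscribe", "events")                      # hypothetical topic name
    .load()
)

# Land the raw events as Parquet once per minute instead of continuously;
# the trigger interval sets the micro-batch cadence.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/raw/events")                         # hypothetical output path
    .option("checkpointLocation", "/data/checkpoints/events")   # hypothetical checkpoint dir
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```

If a one-minute cadence is acceptable, a sketch like this keeps downstream systems from being overwhelmed by per-event writes while still delivering data far faster than daily or hourly batch jobs.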