How to choose between Batch ingestion and Stream Ingestion

Posted by : at

Category : data_engineering


When we choose streaming ingestion over batch ingestion, we should consider below points:

  1. If I ingest the data in real time, can downstream system handle the rate of data flow?
  2. Do I need ms real-time data? Or micro batch approach work, e.g. every minute or so?
  3. What benefits do I realize by implementing streaming? if I get data in real time, what actions can I take on that data that would be an improvement upon batch?
  4. Will streaming be costlier?
  5. Are my streaming pipeline and system reliable and robust if infrastructure fails?
  6. Should I use managed service (Kinesis, Google Pub/Sub, Dataflow) or stand up my own instances of Kafka,Flink, Spark, Pulsar etc? who will manage it? What are the costs and trade-offs?
  7. If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?

As you can see, streaming-first might seem like a good idea, but it’s not always straightforward; extra costs and complexities inherently occur.


Advertisement
About Mohit Manna

Hi, my name is Mohit Manna. I am a Data Engineer who knows about coding and other stuffs

Star
Tags