
Source: Fundamentals of Data Engineering (Reis, J.)
The storage stage touches many other stages of the data engineering lifecycle, such as ingestion, transformation, and serving. Few storage solutions act purely as storage: S3 is object storage but also offers query capabilities, and Kafka can function simultaneously as an ingestion, storage, and query system for messages, with object storage being a standard layer.
Key considerations when evaluating a storage system:
- Is this storage solution compatible with the architecture’s required write and read speeds?
- Will storage create a bottleneck for downstream processes?
- Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
- Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
- Will downstream users and processes be able to retrieve data within the required service-level agreement (SLA)?
- Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
- Is this a pure storage solution (e.g., object storage), or does it support complex query patterns (e.g., a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
- How are you tracking master data, golden records, data quality, and data lineage for data governance? (The book has more to say on these in “Data Management”.)
- How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
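The future-scale question above can be made concrete with a little arithmetic. The following is a minimal sketch, not a method from the book: the `Capacity` class, the growth rate, and every number in it are hypothetical and only illustrate comparing a projected workload against a system's limits on total storage, read operation rate, and write volume.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Capacity:
    """Hypothetical container used for both a workload and a system's limits."""
    total_storage_bytes: float
    read_ops_per_sec: float
    write_bytes_per_sec: float


def project(current: Capacity, annual_growth: float, years: int) -> Capacity:
    """Scale every dimension by compound growth (assumes uniform growth)."""
    factor = (1 + annual_growth) ** years
    return Capacity(*(getattr(current, f.name) * factor for f in fields(current)))


def exceeded(projected: Capacity, limits: Capacity) -> list[str]:
    """Return the names of the dimensions where the projection passes the limit."""
    return [
        f.name
        for f in fields(limits)
        if getattr(projected, f.name) > getattr(limits, f.name)
    ]


# Made-up numbers: 10 TB stored, 2,000 reads/s, 50 MB/s written today.
today = Capacity(10e12, 2_000, 50e6)
# Made-up limits for the storage system under evaluation.
limits = Capacity(100e12, 5_000, 400e6)

in_three_years = project(today, annual_growth=0.5, years=3)
print(exceeded(in_three_years, limits))
```

With these assumed figures, storage and write volume stay within limits but the read rate does not, which is the kind of single-dimension bottleneck the checklist asks you to look for.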
We should also keep the data access frequency in mind:
Data access frequency determines the “temperature” of data, which falls into one of two categories:
- Hot data: retrieved many times a day. A subcategory, lukewarm data, is accessed less often, for example every week or month.
- Cold data: seldom queried and suited to archival systems. Cold storage is cheaper than the hot layer, but retrieval is not immediate: you have to request the data in advance, and it is first moved into the hot layer before it becomes available to access.
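As a rough illustration of mapping access frequency to temperature, here is a minimal sketch. The thresholds are assumptions chosen for the example (the notes only say "many times a day" and "every month or week"), and the `Temperature` and `classify` names are hypothetical.

```python
from enum import Enum


class Temperature(Enum):
    HOT = "hot"
    LUKEWARM = "lukewarm"
    COLD = "cold"


def classify(
    accesses_per_day: float,
    hot_min_per_day: float = 2.0,            # assumed cutoff for "many times a day"
    lukewarm_min_per_day: float = 1 / 31,    # assumed cutoff for "at least monthly"
) -> Temperature:
    """Classify data by its average number of accesses per day."""
    if accesses_per_day >= hot_min_per_day:
        return Temperature.HOT
    if accesses_per_day >= lukewarm_min_per_day:
        return Temperature.LUKEWARM
    return Temperature.COLD


# Example: a table read 50 times a day, weekly, and roughly once a year.
for rate in (50, 1 / 7, 1 / 365):
    print(rate, classify(rate).value)
```

In practice the cutoffs would come from your own access logs and the pricing of each storage tier, so treat the defaults here as placeholders.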