Compare Availability vs Durability vs Consistency #
Availability
- Definition: Will your application or data be accessible when you need it?
- Measured As: Uptime percentage over time (e.g., per month or year)
- Typical Targets:
- 99.95% → ~22 minutes of downtime/month
- 99.99% (4 9’s) → ~4.5 minutes/month
- 99.999% (5 9’s) → ~26 seconds/month
- How to Improve:
- Have multiple copies/instances
- Use multi-zone or multi-region deployments
- Add load balancers and health checks
- Implement replication and auto-failover
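The downtime figures under Typical Targets follow from simple arithmetic. A minimal sketch, assuming a 30-day month; the function name `allowed_downtime_minutes` is illustrative and not part of any cloud SDK:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for target in (99.95, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows {minutes:.2f} min/month ({minutes * 60:.0f} s)")
```

Passing a different `period_minutes` (for example, the minutes in a year) gives the yearly budget for the same target.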
Durability
- Definition: Will your data still exist years from now, even through hardware failures or disasters?
- Measured As: Probability of data loss over time
- Typical Targets:
- 99.999999999% (11 9’s) durability
- Means: Store 10 million files for 10,000 years and expect to lose about 1 file (treating the figure as an annual, per-file durability)
- Why It Matters:
- Once data is lost, it cannot be recovered
- Critical for financial data, medical records, backups, and archives
- How to Improve:
- Store multiple copies of data
- Distribute copies across zones and regions
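A durability target can be turned into a rough estimate of expected data loss. This is a back-of-the-envelope sketch, assuming the durability figure is an annual, per-object probability and that losses are independent; `expected_objects_lost` and its `nines` parameter are hypothetical names:

```python
def expected_objects_lost(num_objects: int, years: float, nines: int = 11) -> float:
    """Estimate how many objects are lost over a period.

    `nines` is the number of nines in the durability target,
    e.g. 11 for 99.999999999%, so the annual loss probability is 10**-nines.
    """
    annual_loss_probability = 10.0 ** -nines
    return num_objects * years * annual_loss_probability


if __name__ == "__main__":
    # 10 million objects kept for 10,000 years at 11 nines
    print(expected_objects_lost(10_000_000, 10_000))
```

The estimate scales linearly, so doubling either the object count or the retention period doubles the expected loss.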
Availability vs Durability
Concept | Focus | Metric Example | Goal |
---|---|---|---|
Availability | Data is accessible now | 99.99% (4 9’s) uptime | Avoid downtime |
Durability | Data is never lost | 99.999999999% (11 9’s) | Prevent permanent data loss |
What is Consistency?
- Scenario: You update your data. Should that update be visible in all replicas immediately, or is it okay if some replicas take a few seconds to catch up?
- Goal: Choose the right consistency model based on the balance between speed and data accuracy.
Examples of Consistency Models
1. Strong Consistency
- Definition: All replicas return the same, most recent value after a write
- How It Works: Synchronous replication. A write must be acknowledged by all replicas (or, in many systems, a quorum of them) before success is confirmed
- Use Case: Banking transactions, inventory systems
- Trade-offs:
- High integrity, but slower writes, because each write waits for replication to finish
2. Eventual Consistency
- Definition: All replicas will eventually reflect the latest value — but may show different values for a short period
- How It Works: Asynchronous replication — write completes in one node, and propagates later
- Use Case: Social media posts, product reviews, user feeds
- Trade-offs:
- High performance and scalability, but temporary inconsistency
- Suitable when speed is more important than immediate accuracy
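The difference between the two models can be shown with a toy in-memory store. This is a sketch under simplifying assumptions (a single process, no network, no failures); the `ReplicatedStore` class and its methods are invented for illustration and do not correspond to any real database API:

```python
from collections import deque


class ReplicatedStore:
    """Toy key-value store with one primary (replica 0) and several followers."""

    def __init__(self, replica_count: int, synchronous: bool):
        self.replicas = [{} for _ in range(replica_count)]
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet applied to the followers

    def write(self, key, value):
        self.replicas[0][key] = value
        if self.synchronous:
            # Strong consistency: update every follower before returning.
            for replica in self.replicas[1:]:
                replica[key] = value
        else:
            # Eventual consistency: return now, replicate later.
            self.pending.append((key, value))

    def propagate(self):
        """Apply queued writes to the followers (stands in for background replication)."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value

    def read(self, key, replica_index: int):
        return self.replicas[replica_index].get(key)


if __name__ == "__main__":
    eventual = ReplicatedStore(replica_count=3, synchronous=False)
    eventual.write("likes", 5)
    print(eventual.read("likes", 2))  # follower has not caught up yet
    eventual.propagate()
    print(eventual.read("likes", 2))  # follower now reflects the write
```

With `synchronous=True`, a read from any replica sees the write immediately, at the cost of doing all the replication work inside `write`.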
Choose the right database based on your Consistency requirement
Consistency Model | Example Databases |
---|---|
Strong Consistency | Cloud Spanner, Azure SQL DB, Amazon Aurora, Amazon RDS |
Eventual Consistency | Amazon DynamoDB (default), Azure Cosmos DB (supports eventual) |
Compare RTO vs RPO #
RTO vs RPO – What’s the Difference?
- Scenario: If a system crashes or data is lost, your business wants to know: How soon can we recover? and How much data can we afford to lose?
- Goal: Understand and define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for your applications
RTO (Recovery Time Objective)
- Definition: Maximum acceptable downtime after a failure
- Focus: How quickly can you restore service?
- Example:
- RTO = 1 hour → System must be back online within 1 hour
RPO (Recovery Point Objective)
- Definition: Maximum acceptable data loss (measured in time)
- Focus: How much data can you afford to lose?
- Example:
- RPO = 10 minutes → You can tolerate losing 10 minutes of data
RTO vs RPO
Metric | What It Measures | Example | Focus |
---|---|---|---|
RTO | Max downtime allowed | Recover in 1 hour | Time to recover |
RPO | Max data loss allowed | Lose max 10 minutes of data | Data protection |
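Both objectives can be checked against the timeline of an incident. A minimal sketch, assuming the only recovery source is the most recent backup; the function `meets_objectives` and its field names are hypothetical:

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup: datetime, failure: datetime, restored: datetime,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare an incident timeline against RPO and RTO targets."""
    data_loss_window = failure - last_backup  # data written after the last backup is lost
    downtime = restored - failure             # time the service was unavailable
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = meets_objectives(
        last_backup=datetime(2050, 7, 24, 9, 52),
        failure=datetime(2050, 7, 24, 10, 0),
        restored=datetime(2050, 7, 24, 10, 45),
        rpo=timedelta(minutes=10),
        rto=timedelta(hours=1),
    )
    print(result)
```

In this example the backup is 8 minutes old at the time of failure and service returns after 45 minutes, so both the 10-minute RPO and the 1-hour RTO are met.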
Choosing the Right Strategy
RTO/RPO Requirement | Strategy Example |
---|---|
High RTO + High RPO | Scheduled backups, manual restore |
Medium RTO/RPO | Snapshot-based recovery |
Low RTO + Low RPO | Active-active replication with automatic failover |
Compare Data Formats vs Data Stores #
Types of Data Formats
- Structured: Tables, rows, and columns — Typically used by Relational databases (e.g., bank records)
- Semi-Structured: JSON, XML, key-value pairs — Typically used by NoSQL databases
- Unstructured: Files like images, audio, video, PDFs
Types of Data Stores
- Definition: Services that store, manage, and retrieve different types of data
- Types:
- Relational Databases – Structured transactional data
- NoSQL Databases – Flexible document, key-value, or graph data
- Analytical Databases – Petabyte-scale analytics and reporting
- Object/Block/File Storage – Used for backups, media, and other unstructured data
Which Data Format for which Data Store?
- Structured: Relational and Analytical
- Semi-Structured: NoSQL
- Unstructured: Object/Block/File Storage
Where is Structured Data stored? OLAP vs OLTP? #
Relational Databases (Structured)
- Tables with Rows & Columns: Fixed schema, strict relationships
- Used in
- OLTP: Fast reads/writes for high-volume transactions
- OLAP: Analytical queries over massive data (stored in columnar format)
OLTP (Online Transaction Processing):
- Small, frequent transactions – Ex: Transfer money
- Examples:
- Banking system – money transfers, balance checks
- E-commerce – order placement, payment updates
- Reservation systems – booking tickets, seat updates
OLAP (Online Analytical Processing):
- Run complex queries on historical data
- Examples:
- Sales trends by region over months
- Customer behavior analytics
- Financial reporting and dashboards
OLTP vs OLAP
Aspect | OLTP | OLAP |
---|---|---|
Purpose | Day-to-day transactions | Complex data analysis |
Query Type | Short, simple, real-time queries | Long-running, analytical queries |
Examples | Bank transfers, orders, bookings | Sales reports, user behavior analysis |
Terminology | Transactional Databases | Analytical Databases, Data Warehouse |
Data Structure | Row-based | Column-based |
Typical Architecture | One Large Node with standby (Some modern databases are Distributed) | Distributed |
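The row-based versus column-based distinction in the table can be illustrated with plain Python data structures. This is only a conceptual sketch (real engines store data on disk in far more elaborate formats), and the sample orders and function names are made up for illustration:

```python
# Row-based layout: each record is kept together (typical for OLTP).
row_store = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EU", "amount": 200},
]


def to_column_store(rows):
    """Pivot rows into one list per column (typical for OLAP)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


column_store = to_column_store(row_store)


def find_order(rows, order_id):
    """OLTP-style access: fetch one complete record by its key."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


def total_by_region(columns):
    """OLAP-style access: scan only the two columns the query needs."""
    totals = {}
    for region, amount in zip(columns["region"], columns["amount"]):
        totals[region] = totals.get(region, 0) + amount
    return totals
```

`find_order(row_store, 2)` returns a whole order in one step, while `total_by_region(column_store)` never touches the `order_id` column, which is the property that makes columnar layouts efficient for analytics.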
Relational Databases in the Cloud
Type | AWS | Google Cloud | Azure |
---|---|---|---|
OLTP | Amazon Relational Database Service, Amazon Aurora | Cloud SQL, Cloud Spanner | Azure Database for MySQL and Azure Database for PostgreSQL, Azure SQL Database |
OLAP | Amazon Redshift | BigQuery | Azure Synapse Analytics |
Global vs Regional Relational OLTP Databases
Feature | Global Database | Regional Database |
---|---|---|
Definition | A database deployed across multiple regions, offering low-latency global access and automatic replication | A database deployed in a single region; all data and compute stay local |
Availability | Very High – Resilient to regional failures | High – Limited to availability zones in one region |
Cost | Higher due to multi-region storage and replication | Lower, with cost tied to a single region |
Example – AWS | Amazon Aurora Global Database | Amazon RDS |
Example – Azure | Azure SQL with geo-replication (creates a continuously synchronized, readable secondary database) | Azure SQL Database (single region), Azure Database (MySQL/PostgreSQL/..) |
Example – Google Cloud | Cloud Spanner (multi-region instance) | Cloud SQL |
Where is Semi-Structured Data stored? (NoSQL Databases) #
Why Semi-Structured Data?
- Scenario: Imagine building a product catalog or user profile where each record has different fields — some users have a Twitter handle, others do not
- Need: A flexible format that can evolve with your application without changing the database schema every time
- Definition: Data with some structure but not rigid like relational tables
- Examples: JSON documents, key-value pairs, graphs
- Use Cases: User profiles, product catalogs, and similar records with varying fields
NoSQL Databases are used to store Semi-Structured Data
- NoSQL = Not Only SQL: Flexible schema, high performance, and horizontal scalability
- Designed For: Massive scale and rapid changes in data format
- Adaptable: App controls the schema instead of the database
Important NoSQL Database Types
- 1: Document Databases
- 2: Key-Value Databases
- 3: Graph Databases
- 4: Column-Family Databases
1: Document Databases
- Data stored as JSON-like documents
- Each document has a unique key
- Structure can vary from document to document
- Use Cases: Shopping cart, user profile, product catalog
- Managed Services: Amazon DynamoDB, Amazon DocumentDB, Azure Cosmos DB (SQL API), Google Cloud Firestore
"user_profiles": [
  {
    "id": 101,
    "name": "Alice",
    "email": "[email protected]",
    "twitter": "@alice_dev"
  },
  {
    "id": 102,
    "name": "Bob",
    "email": "[email protected]"
    // Bob has no Twitter handle
  }
]
"product_catalog": [
  {
    "id": "A1",
    "name": "Smartphone",
    "brand": "BrandX",
    "camera_specs": "12MP",
    "battery_life": "10h"
  },
  {
    "id": "B2",
    "name": "Laptop",
    "brand": "BrandY",
    "ram": "16GB",
    "storage": "512GB SSD"
    // No camera_specs for laptops
  }
]
2: Key-Value Databases
- Data stored as a key and its corresponding value
- Very fast lookups by key
- Use Cases: Caching, session management
- Managed Services: Amazon DynamoDB, Azure Cosmos DB (Table API), Google Cloud Firestore
// session1
{
  "key": "abc123",
  "value": {
    "userId": "u001",
    "loginTime": "2050-07-24T10:00:00Z",
    "role": "admin"
  }
}
// session2
{
  "key": "xyz789",
  "value": {
    "userId": "u002",
    "loginTime": "2050-07-24T10:05:00Z",
    "role": "viewer"
  }
}
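Session management with a key-value store usually comes down to "put by key, get by key". The sketch below uses an in-memory dictionary in place of a managed service; the `SessionStore` class is hypothetical, and the expiry (TTL) behaviour is an added assumption that the notes above do not describe:

```python
import time


class SessionStore:
    """Toy key-value session store with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._data = {}  # key -> (value, expiry timestamp)

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + self.ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._data[key]  # expired sessions are dropped on access
            return None
        return value


if __name__ == "__main__":
    store = SessionStore(ttl_seconds=1800)
    store.put("abc123", {"userId": "u001", "role": "admin"})
    print(store.get("abc123"))
```

The optional `now` argument exists only so the expiry logic can be exercised without waiting for real time to pass.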
3: Graph Databases
- Data modeled using nodes and relationships (edges)
- Great for capturing and querying complex relationships
- Use Cases: Social networks, recommendation engines, fraud detection
- Managed Services: Amazon Neptune, Azure Cosmos DB Gremlin API, Spanner Graph
{
  "nodes": [
    { "id": "u1", "name": "Ranga" },
    { "id": "u2", "name": "Ravi" },
    { "id": "u3", "name": "John" },
    { "id": "u4", "name": "Sathish" }
  ],
  "edges": [
    { "from": "u1", "to": "u2", "label": "FRIEND" },
    { "from": "u2", "to": "u3", "label": "FRIEND" },
    { "from": "u3", "to": "u4", "label": "FRIEND" },
    { "from": "u4", "to": "u1", "label": "FRIEND" }
  ]
}
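A typical graph query on data like this is "friends of friends". The sketch below reuses the four users and four FRIEND edges from the example and assumes that friendship is mutual (the edges are treated as undirected); the helper names are illustrative and this is not how a graph database is implemented internally:

```python
from collections import defaultdict

# (from, to) pairs taken from the FRIEND edges in the example above
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1")]


def build_adjacency(edge_list):
    """Build an undirected adjacency map: user id -> set of friend ids."""
    adjacency = defaultdict(set)
    for source, target in edge_list:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def friends_of_friends(adjacency, user):
    """Return users two hops away who are not already direct friends."""
    direct = set(adjacency[user])
    candidates = set()
    for friend in direct:
        candidates |= adjacency[friend]
    return candidates - direct - {user}


if __name__ == "__main__":
    graph = build_adjacency(edges)
    print(friends_of_friends(graph, "u1"))
```

A recommendation engine could use the result as a list of "people you may know" for the given user.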
4: Column-Family Databases
- Data stored in rows and columns grouped into families
- Sparse format – rows don’t need to have all columns
- Use Cases: IoT data, time-series data, analytics
- Managed Services: Amazon Keyspaces (Cassandra), Azure Cosmos DB Cassandra API, Google Cloud Bigtable
Example
{
  // Unique identifier for the row
  // typically identifies a device, user, or service
  "rowKey": "device123",
  "columnFamilies": {
    // First column family stores time-based log entries
    "logs": {
      // Timestamp as column name, log message as value
      "2050-07-24T10:00:00Z": "Temperature: 32°C",
      "2050-07-24T10:01:00Z": "Temperature: 33°C",
      "2050-07-24T10:02:00Z": "Temperature: 34°C"
    },
    // Second column family stores system statuses
    "status": {
      // Same timestamp as column name, status message as value
      "2050-07-24T10:00:00Z": "OK",
      "2050-07-24T10:01:00Z": "OK",
      "2050-07-24T10:02:00Z": "ALERT: Temp threshold exceeded"
    }
  }
}
Where is Unstructured Data stored? (File, Block, or Object storage) #
Why Unstructured Data?
- Scenario: Imagine building YouTube, with its videos, thumbnails, subtitles, and logs. None of this content fits neatly into a table.
- Need: A way to store and retrieve large files like videos, images, documents, and logs — without predefined structure.
- Definition: Data without a fixed schema or format (e.g., audio, video, PDFs, images, binaries)
- Examples: Uploaded files, media content, backup archives, sensor logs
- Handled Using: File, block, or object storage based on access pattern and use case
Types of Storage for Unstructured Data
- 1: Block Storage
- 2: File Storage
- 3: Object Storage
1: Block Storage
- Low-level storage used like a hard disk
- High performance and low latency for workloads such as VM disks and databases
- Use Cases: OS disks, DB volumes
2: File Storage
- Shared file systems accessed over a network using file paths
- Ideal for applications needing traditional file structure and shared access
- Use Cases: Team file shares, CMS systems
3: Object Storage
- Data stored as objects (data + metadata + unique ID)
- Accessed using REST APIs — no mounting required
- Scalable, cost-effective, and durable
- Use Cases: Media hosting, backups, logs, static websites
What is Block Storage? #
Why Block Storage in Cloud?
- Scenario: Imagine running a virtual machine or a database in the cloud. You need a reliable, fast disk that behaves like a physical hard drive.
- Block Storage: Provides raw storage volumes that can be attached to servers and used just like local disks.
- Goal: Attach virtual disks to compute resources like VMs, containers, and databases
Key Characteristics
- Raw Volumes: You format and mount it like a traditional disk
- Detachable and Reusable: Can be attached, detached, and re-attached to different servers
- Persistent: Data stays intact even if the VM is stopped or restarted
Use Cases
- Virtual Machines: OS and data disks for cloud-based servers
- Databases: High-speed transactional storage for SQL and NoSQL engines
Choice 1: Types of Block Storage
- Persistent Block Storage (e.g., Network Attached like EBS)
- Stored separately and connected over the network
- Retains data across VM stops, starts, or replacements
- Ideal for critical data like databases or file systems
- Temporary Block Storage (e.g., Instance Store)
- Physically attached to the VM host
- Very fast, but data is lost if the VM is stopped or terminated, or if the underlying host fails
- Best for temporary data like cache or scratch files
Choice 2: HDD (Hard Disk Drive) vs SSD (Solid State Drive)
- HDD (Hard Disk Drive)
- Transactional Performance: Lower – slower for frequent reads/writes
- Throughput: High – good for sequential access of large files
- Strength: Best for large, sequential data processing
- Use Cases:
- Big data workloads
- Log processing
- Backup or archival
- Cost: Lower – budget-friendly
- SSD (Solid State Drive)
- Transactional Performance: High – excellent for fast, frequent access
- Throughput: High – handles both small and large data well
- Strength: Great for small, random and sequential I/O
- Use Cases:
- Databases
- Web servers
- Operating system volumes
- Cost: Higher – but justified for high performance
Cloud Managed Services for Block Storage
- AWS: Amazon EBS, Instance Store
- Azure: Azure Managed Disks, Temporary Disks
- Google Cloud: Persistent Disks, Local SSDs
What is File Storage? #
Why File Storage?
- Scenario: Imagine a team working on shared documents, code files, or project reports. Everyone needs access to the same files, organized in folders.
- File Storage: A storage system that organizes data in a familiar folder and file format, accessible over a shared network.
What is File Storage?
- Definition: A way to store data in hierarchical structure – folders and files
- Goal: Share files easily between users or systems
- Access Method: Files accessed using standard protocols like NFS or SMB
- NFS: A protocol designed to share files over a network in Linux/Unix systems
- SMB: A protocol used mainly in Windows environments to share files
How File Storage Works
- Mounted Volumes: File storage is mounted like a network drive
- Shared Access: Multiple users or servers can read/write files
- Folders & Files: Organize data just like on your laptop
Benefits of File Storage
- Easy to Use: Familiar structure – folders, paths, extensions
- Shared Access: Ideal for collaboration
- Reliable: Supports backups, snapshots, and versioning
Cloud Managed Services for File Storage
- AWS: Amazon EFS (Elastic File System), Amazon FSx
- Azure: Azure Files
- Google Cloud: Filestore
What is Object Storage? #
Why Object Storage in Cloud?
- Scenario: Imagine millions of users uploading photos, videos, and documents through a website or mobile app — and accessing them anytime, from anywhere. Traditional file systems struggle to scale, manage access, or serve global traffic efficiently.
- Object Storage: Designed to store and retrieve large volumes of unstructured data (like media and backups) with high durability, scalability, and global accessibility.
- Goal: Store and retrieve any type of file at scale using a simple API
Key Characteristics
- Flat Structure: No folders — everything is stored in a bucket with a unique key
- Scalable: Can handle billions of files without performance loss
- Durable: Data is automatically replicated across multiple zones or regions
- Access via HTTP APIs: Easy to integrate with applications, websites, and mobile apps
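The flat structure is easiest to see in a toy model: a bucket is a mapping from keys to objects, and "folders" are just shared key prefixes. The `Bucket` class below is a hypothetical in-memory stand-in, not the API of Amazon S3, Azure Blob Storage, or Cloud Storage:

```python
from typing import Optional


class Bucket:
    """Toy object store: a flat mapping of key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, key: str, data: bytes, metadata: Optional[dict] = None):
        self._objects[key] = (data, dict(metadata or {}))

    def get_object(self, key: str):
        return self._objects[key]

    def list_keys(self, prefix: str = ""):
        """Emulate folder browsing by filtering on a key prefix."""
        return sorted(key for key in self._objects if key.startswith(prefix))


if __name__ == "__main__":
    bucket = Bucket()
    bucket.put_object("photos/2050/cat.jpg", b"...", {"content-type": "image/jpeg"})
    bucket.put_object("photos/2050/dog.jpg", b"...", {"content-type": "image/jpeg"})
    bucket.put_object("logs/app.log", b"...")
    print(bucket.list_keys("photos/"))
```

The slashes in the keys carry no special meaning to the store itself; listing by prefix is what makes them look like a directory tree.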
Use Cases
- Backup and Archiving: Reliable, long-term storage
- Web Content: Store and serve images, videos, documents
- Big Data and Analytics: Ingest large files for processing
- Application Storage: Store logs, exports, reports
Choice 1: Versioning
- Purpose: Keep multiple versions of an object to protect against accidental deletes or overwrites
- Use Case: Restore a previous version of a file or track changes over time
Choice 2: Storage Tiers/Storage Class
- Purpose: Store data in the most cost-effective tier based on how often it is accessed
- Balance Performance and Price: Pay more for speed when needed, save more when data is rarely used
- Example Tiers
- Standard or Hot: For frequently accessed data
- Infrequent Access or Cool: For data accessed less often
- Archive: For long-term storage with slower retrieval
Choice 3: Lifecycle Rules
- Purpose: Automatically transition or delete objects based on age or other characteristics
- Use Case: Move old files to cheaper storage (e.g., archive) or delete them after a set period to save costs
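A lifecycle rule is essentially a mapping from object age to an action. The sketch below shows that decision logic only; the thresholds (30, 90, and 365 days) and the tier names are assumptions chosen for illustration, and real services express such rules as configuration rather than code:

```python
def lifecycle_action(age_days: int,
                     cool_after: int = 30,
                     archive_after: int = 90,
                     delete_after: int = 365) -> str:
    """Pick the storage tier (or deletion) for an object of the given age."""
    if age_days >= delete_after:
        return "delete"
    if age_days >= archive_after:
        return "archive"
    if age_days >= cool_after:
        return "infrequent-access"
    return "standard"


if __name__ == "__main__":
    for age in (5, 45, 120, 400):
        print(age, "days:", lifecycle_action(age))
```

Checking the oldest threshold first keeps the rules mutually exclusive, so each object receives exactly one action.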
Cloud Managed Services for Object Storage
- AWS: Amazon S3 (Standard, Infrequent Access, Glacier, ...)
- Azure: Azure Blob Storage (Hot, Cool, Archive, ...)
- Google Cloud: Cloud Storage (Standard, Nearline, Coldline, Archive, ...)
Interesting Thing to Know
- Amazon Glacier Service: Amazon Glacier was launched as a standalone service for long-term storage with very low cost.
- Optimized for Archiving: Great for backups, compliance data, and archived content that can wait minutes or hours to be retrieved.
- Integrated into S3 Over Time: Glacier is now offered as a storage class within Amazon S3
- Simple Management: Use the S3 interface to store data in Glacier, with no need to manage a separate service.
- All Clouds Offer Archive Storage as a Storage Class/Tier: AWS (Glacier), Azure (Archive), GCP (Coldline, Archive)
What is Hybrid Storage? #
Why Hybrid Storage?
- Scenario: Imagine storing large datasets which need to be accessed on-premises — some frequently accessed, some rarely touched. Keeping all of it in the cloud may be slow. Keeping it all on-prem may limit scalability.
- Hybrid Storage: Combines on-premises and cloud storage to balance cost, speed, and control.
What is Hybrid Storage?
- Definition: A storage solution that bridges local (on-premises) storage with cloud storage
- Goal: Enable seamless data access and movement between on-prem and cloud environments
- Use Cases:
- Gradual cloud migration
- Burst storage for peak workloads
- Archive and backup to cloud
Benefits of Hybrid Storage
- Scalability: Cloud extends your capacity without new hardware
- Cost Efficiency: Store only hot data locally, cold data in the cloud
- Performance: Local access for critical data
- Data Protection: Cloud backup improves disaster recovery
Cloud Managed Services for Hybrid Storage
- AWS: AWS Storage Gateway
- Azure: Azure File Sync
- Google Cloud: Google Cloud Filestore (with Hybrid Connectivity)
What is the need for Data Analytics? #
Why Data Analytics?
- Scenario: You have tons of raw data — purchases, transactions, sensor readings. But without analysis, it’s just noise.
- Data Analytics: The process of analyzing raw data to extract useful insights
- Goal: Make data-driven decisions to improve business outcomes
Data Sources Can Include
- Transactions: Purchases, payments, logs
- Sensor & IoT Data: Temperature, pressure, GPS
- External Feeds: Weather, stock prices, social media
- Internal Systems: CRM, HR, finance systems
Key Benefits of Data Analytics
- Uncover Trends: Understand customer behavior, market shifts, and product performance
- Identify Weaknesses: Spot bottlenecks, inefficiencies, or risks early
- Improve Outcomes: Enhance efficiency, customer satisfaction, and profitability
Why Data Analytics Workflow?
- Scenario: You collect tons of raw data from multiple sources. But raw data is not useful until it's cleaned, processed, and visualized.
- Workflow: Follows a clear step-by-step path from raw data to business insights.
Data Ingestion
- Goal: Collect raw data from multiple sources
- Sources: Websites, IoT sensors, apps, logs, transactions
- Modes:
- Batch: Data loaded periodically
- Stream: Data ingested in real-time (e.g. user clicks, weather sensors)
Data Processing
- Goal: Make data usable for analysis
- Clean: Remove duplicates and errors
- Filter: Eliminate irrelevant or outlier data
- Transform: Convert to a consistent format or structure
- Aggregate: Combine data for summaries or insights
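The four processing steps can be sketched as small functions chained together. The sample sensor readings, the field names, and the valid temperature range are all invented for illustration; a production pipeline would typically run on a managed processing service rather than in plain Python:

```python
raw_readings = [
    {"device": "d1", "ts": 1, "temp_c": "32"},
    {"device": "d1", "ts": 1, "temp_c": "32"},   # duplicate
    {"device": "d1", "ts": 2, "temp_c": "34"},
    {"device": "d2", "ts": 1, "temp_c": "bad"},  # unparsable value
    {"device": "d2", "ts": 2, "temp_c": "500"},  # outlier
    {"device": "d2", "ts": 3, "temp_c": "30"},
]


def clean(records):
    """Remove exact duplicates and records whose value cannot be parsed."""
    seen, result = set(), []
    for record in records:
        key = (record["device"], record["ts"], record["temp_c"])
        if key in seen:
            continue
        seen.add(key)
        try:
            float(record["temp_c"])
        except ValueError:
            continue
        result.append(record)
    return result


def filter_outliers(records, low=-50.0, high=100.0):
    """Drop readings outside an assumed plausible range."""
    return [r for r in records if low <= float(r["temp_c"]) <= high]


def transform(records):
    """Convert the temperature from text to a number."""
    return [{**r, "temp_c": float(r["temp_c"])} for r in records]


def aggregate(records):
    """Compute the average temperature per device."""
    sums = {}
    for r in records:
        total, count = sums.get(r["device"], (0.0, 0))
        sums[r["device"]] = (total + r["temp_c"], count + 1)
    return {device: total / count for device, (total, count) in sums.items()}


averages = aggregate(transform(filter_outliers(clean(raw_readings))))
```

Each stage takes a list and returns a new one, so stages can be reordered or tested independently.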
Data Storage
- Where: Store in a data warehouse (or in a data lake, when raw data must be kept)
- Goal: Centralize data for easy access and future analysis
Data Querying
- What: Run SQL-like queries to analyze trends, identify patterns
Data Visualization
- Why: Charts and dashboards make insights easier to understand
- Impact: Helps leadership spot trends, outliers, and make informed decisions
Cloud Managed Services for Data Analytics
Stage | AWS | Azure | Google Cloud |
---|---|---|---|
Streaming Ingestion | Amazon Kinesis | Azure Event Hubs | Pub/Sub |
Data Processing (ETL / Data Prep) | AWS Glue | Azure Data Factory | Dataflow, Dataprep |
Data Warehouse and Querying | Amazon Redshift | Azure Synapse Analytics | BigQuery |
Data Visualization | Amazon QuickSight | Power BI | Looker |
Compare Data Warehouse vs Data Lake #
The 3Vs of Big Data
- Volume: Massive datasets — from terabytes to petabytes to exabytes
- Variety: Mix of structured (tables), semi-structured (JSON), and unstructured (videos, logs)
- Velocity: Speed of data arrival — batch (hourly/daily) or real-time (streams)
What if the Data We're NOT Capturing Becomes Valuable Later?
- Businesses may miss future insights if they only store processed data
- Storing raw data now gives the flexibility to run AI, ML, or new analytics later
- A Data Lake solves this — store everything today, use what you need tomorrow
Data Warehouse vs Data Lake
- Data Warehouse
- Stores processed, structured data optimized for fast SQL queries
- Used for business intelligence, dashboards, and reporting
- Examples: Teradata, Amazon Redshift, Google BigQuery, Azure Synapse Analytics
- Data Lake
- Stores raw, unprocessed data — compressed and cost-efficient
- Can handle any format (CSV, JSON, images, audio, logs, etc.)
- Supports on-demand exploration, AI/ML, and analytics workflows
- Built on object storage
- Examples: Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2
How They Work Together
- Data Lake holds everything — raw logs, events, images, clickstreams, ...
- Data Warehouse pulls from the lake — after processing
- Modern Tools: Services like Google BigQuery, Azure Synapse Analytics, and Amazon Athena can query data directly from the data lake — no need to move or duplicate
Cloud Managed Services for Big Data Storage and Analytics
- AWS:
- Storage: Amazon S3
- Warehouse: Amazon Redshift
- Query-over-lake: Amazon Athena
- Azure:
- Storage: Azure Data Lake Storage Gen2
- Warehouse: Azure Synapse Analytics
- Query-over-lake: Synapse Serverless SQL
- Google Cloud:
- Storage: Google Cloud Storage
- Warehouse: BigQuery
- Query-over-lake: BigQuery External Tables