Compare Availability vs Durability vs Consistency #


Availability

  • Definition: Will your application or data be accessible when you need it?
  • Measured As: Uptime percentage over time (e.g., per month or year)
  • Typical Targets:
    • 99.95% → ~22 minutes of downtime/month
    • 99.99% (4 9’s) → ~4.5 minutes/month
    • 99.999% (5 9’s) → ~26 seconds/month
  • How to Improve:
    • Have multiple copies/instances
    • Use multi-zone or multi-region deployments
    • Add load balancers and health checks
    • Implement replication and auto-failover
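The targets above are simple arithmetic on the uptime percentage; a quick sketch to check them:

```python
def downtime_per_month(availability_pct: float, minutes_in_month: int = 30 * 24 * 60) -> float:
    """Return the maximum allowed downtime (in minutes) per 30-day month
    for a given availability percentage."""
    return minutes_in_month * (1 - availability_pct / 100)

print(round(downtime_per_month(99.95), 1))     # ~21.6 minutes/month
print(round(downtime_per_month(99.99), 2))     # ~4.32 minutes/month
print(round(downtime_per_month(99.999) * 60))  # ~26 seconds/month
```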

Durability

  • Definition: Will your data still exist years from now, even through hardware failures or disasters?
  • Measured As: Probability of data loss over time
  • Typical Targets:
    • 99.999999999% (11 9’s) durability
    • Means: Store 10 million objects and, on average, expect to lose one object every 10,000 years
  • Why It Matters:
    • Once data is lost, it cannot be recovered
    • Critical for financial data, medical records, backups, archives
  • How to Improve:
    • Store multiple copies of data
    • Distribute copies across zones and regions
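The 11 9's figure can be sanity-checked with a back-of-the-envelope calculation, assuming an independent annual loss probability per object:

```python
def expected_objects_lost_per_year(num_objects: int, durability: float) -> float:
    """Expected annual object loss, assuming each object is independently
    lost with probability (1 - durability) per year."""
    return num_objects * (1 - durability)

# 11 9's of durability: storing 10 million objects, you expect to lose
# about one object every 10,000 years on average.
loss_per_year = expected_objects_lost_per_year(10_000_000, 0.99999999999)
print(loss_per_year)      # ~0.0001 objects/year
print(1 / loss_per_year)  # ~10,000 years per object lost
```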

Availability vs Durability

| Concept | Focus | Example Metric | Goal |
| --- | --- | --- | --- |
| Availability | Data is accessible now | 99.99% (4 9's) uptime | Avoid downtime |
| Durability | Data is never lost | 99.999999999% (11 9's) | Prevent permanent data loss |

What is Consistency?

  • Scenario: You update your data. Should that update be visible in all replicas immediately, or is it okay if some replicas take a few seconds to catch up?
  • Goal: Choose the right consistency model based on the balance between speed and data accuracy.

Examples of Consistency Models

1. Strong Consistency

  • Definition: All replicas return the same, most recent value after a write
  • How It Works: Synchronous replication — write must complete in all replicas before confirming success
  • Use Case: Banking transactions, inventory systems
  • Trade-offs:
    • High integrity, but slower performance

2. Eventual Consistency

  • Definition: All replicas will eventually reflect the latest value — but may show different values for a short period
  • How It Works: Asynchronous replication — write completes in one node, and propagates later
  • Use Case: Social media posts, product reviews, user feeds
  • Trade-offs:
    • High performance and scalability, but temporary inconsistency
    • Suitable when speed is more important than immediate accuracy
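A toy model of asynchronous replication makes the lag visible (the class and method names here are illustrative, not any real database API):

```python
# Toy model of eventual consistency: a write lands on the primary
# immediately and reaches the replicas only when replicate() runs.
class EventuallyConsistentStore:
    def __init__(self, replica_count: int = 2):
        self.primary: dict = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending: list = []  # writes not yet propagated

    def write(self, key, value):
        self.primary[key] = value          # acknowledged right away
        self.pending.append((key, value))  # replication happens later

    def read(self, replica_index: int, key):
        return self.replicas[replica_index].get(key)  # may be stale

    def replicate(self):
        """Propagate pending writes (in real systems this lag is
        typically milliseconds to seconds)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("post:1", "Hello, world")
print(store.read(0, "post:1"))  # None -> replica is still stale
store.replicate()
print(store.read(0, "post:1"))  # 'Hello, world' -> replicas caught up
```

Strong consistency would make `write` wait for every replica to acknowledge before returning, trading latency for the guarantee that no read is ever stale.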

Choose the right database based on your Consistency requirement

| Consistency Model | Example Databases |
| --- | --- |
| Strong Consistency | Cloud Spanner, Azure SQL Database, Amazon Aurora, Amazon RDS |
| Eventual Consistency | Amazon DynamoDB (default), Azure Cosmos DB (supports eventual) |

Compare RTO vs RPO #


RTO vs RPO – What’s the Difference?

  • Scenario: If a system crashes or data is lost, your business wants to know: How soon can we recover? and How much data can we afford to lose?
  • Goal: Understand and define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for your applications

RTO (Recovery Time Objective)

  • Definition: Maximum acceptable downtime after a failure
  • Focus: How quickly can you restore service?
  • Example:
    • RTO = 1 hour → System must be back online within 1 hour

RPO (Recovery Point Objective)

  • Definition: Maximum acceptable data loss (measured in time)
  • Focus: How much data can you afford to lose?
  • Example:
    • RPO = 10 minutes → You can tolerate losing 10 minutes of data
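With periodic backups, the worst-case data loss equals the backup interval (a failure just before the next backup loses everything since the last one), so checking an RPO is a one-line comparison. This is a sketch of the reasoning, not a provider API:

```python
def meets_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """True if a periodic-backup schedule satisfies the RPO: the worst-case
    data loss (the full backup interval) must not exceed the RPO."""
    return backup_interval_minutes <= rpo_minutes

# RPO = 10 minutes
print(meets_rpo(backup_interval_minutes=60, rpo_minutes=10))  # False
print(meets_rpo(backup_interval_minutes=5, rpo_minutes=10))   # True
```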

RTO vs RPO

| Metric | What It Measures | Example | Focus |
| --- | --- | --- | --- |
| RTO | Max downtime allowed | Recover in 1 hour | Time to recover |
| RPO | Max data loss allowed | Lose max 10 minutes of data | Data protection |

Choosing the Right Strategy

| RTO/RPO Requirement | Strategy Example |
| --- | --- |
| High RTO + High RPO | Scheduled backups, manual restore |
| Medium RTO/RPO | Snapshot-based recovery |
| Low RTO + Low RPO | Active-active replication with automatic failover |


Compare Data Formats vs Data Stores #


Types of Data Formats

  • Structured: Tables, rows, and columns — Typically used by Relational databases (e.g., bank records)
  • Semi-Structured: JSON, XML, key-value pairs — Typically used by NoSQL databases
  • Unstructured: Files like images, audio, video, PDFs

Types of Data Stores

  • Definition: Services that store, manage, and retrieve different types of data
  • Types:
    • Relational Databases – Structured transactional data
    • NoSQL Databases – Flexible document, key-value, or graph data
    • Analytical Databases – Petabyte-scale analytics and reporting
    • Object/Block/File Storage – Used for backups, media, and other unstructured data

Which Data Format for which Data Store?

  • Structured: Relational and Analytical
  • Semi-Structured: NoSQL
  • Unstructured: Object/Block/File Storage

Where is Structured Data stored? OLAP vs OLTP? #


Relational Databases (Structured)

  • Tables with Rows & Columns: Fixed schema, strict relationships
  • Used In:
    • OLTP: Fast reads/writes for high-volume transactions
    • OLAP: Analytical queries over massive data (stored in columnar format)

OLTP (Online Transaction Processing):

  • Small, frequent transactions – Ex: Transfer money
  • Examples:
    • Banking system – money transfers, balance checks
    • E-commerce – order placement, payment updates
    • Reservation systems – booking tickets, seat updates

OLAP (Online Analytical Processing):

  • Run complex queries on historical data
  • Examples:
    • Sales trends by region over months
    • Customer behavior analytics
    • Financial reporting and dashboards

OLTP vs OLAP

| Aspect | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Day-to-day transactions | Complex data analysis |
| Query Type | Short, simple, real-time queries | Long-running, analytical queries |
| Examples | Bank transfers, orders, bookings | Sales reports, user behavior analysis |
| Terminology | Transactional databases | Analytical databases, data warehouses |
| Data Structure | Row-based | Column-based |
| Typical Architecture | One large node with standby (some modern databases are distributed) | Distributed |
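The row-based vs column-based distinction is easiest to see side by side; this sketch lays out the same orders both ways:

```python
# Row-oriented layout (OLTP): each record is stored together.
orders_rows = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 200},
]

# Column-oriented layout (OLAP): one array per column.
orders_columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120, 80, 200],
}

# OLTP-style access: fetch one whole row by id (row layout is ideal).
order = next(o for o in orders_rows if o["id"] == 2)
print(order)  # {'id': 2, 'region': 'US', 'amount': 80}

# OLAP-style access: aggregate one column across all rows. The columnar
# layout reads only the 'amount' array and skips every other column.
print(sum(orders_columns["amount"]))  # 400
```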

Relational Databases in the Cloud

| Type | AWS | Google Cloud | Azure |
| --- | --- | --- | --- |
| OLTP | Amazon RDS (Relational Database Service), Amazon Aurora | Cloud SQL, Cloud Spanner | Azure Database for MySQL/PostgreSQL, Azure SQL Database |
| OLAP | Amazon Redshift | BigQuery | Azure Synapse Analytics |

Global vs Regional Relational OLTP Databases

| Feature | Global Database | Regional Database |
| --- | --- | --- |
| Definition | Deployed across multiple regions, offering low-latency global access and automatic replication | Deployed in a single region; all data and compute stay local |
| Availability | Very high – resilient to regional failures | High – limited to availability zones in one region |
| Cost | Higher due to multi-region storage and replication | Lower, with cost tied to a single region |
| Example – AWS | Amazon Aurora Global Database | Amazon RDS |
| Example – Azure | Azure SQL Database with geo-replication (creates a continuously synchronized, readable secondary database) | Azure SQL Database (single region), Azure Database for MySQL/PostgreSQL/.. |
| Example – Google Cloud | Cloud Spanner (multi-region instance) | Cloud SQL |

Where is Semi-Structured Data stored? (NoSQL Databases) #


Why Semi-Structured Data?

  • Scenario: Imagine building a product catalog or user profile where each record has different fields — some users have a Twitter handle, others do not
  • Need: A flexible format that can evolve with your application without changing the database schema every time
  • Definition: Data with some structure but not rigid like relational tables
  • Examples: JSON documents, key-value pairs, graphs
  • Use Cases: Profiles, product catalogs, ..

NoSQL Databases are used to store Semi-Structured Data

  • NoSQL = Not Only SQL: Flexible schema, high performance, and horizontal scalability
  • Designed For: Massive scale and rapid changes in data format
  • Adaptable: App controls the schema instead of the database

Important NoSQL Database Types

  • 1: Document Databases
  • 2: Key-Value Databases
  • 3: Graph Databases
  • 4: Column-Family Databases

1: Document Databases

  • Data stored as JSON-like documents
  • Each document has a unique key
  • Structure can vary from document to document
  • Use Cases: Shopping cart, user profile, product catalog
  • Managed Services: Amazon DynamoDB, Amazon DocumentDB, Azure Cosmos DB (SQL API), Google Cloud Firestore
"user_profiles": [
  {
    "id": 101,
    "name": "Alice",
    "email": "[email protected]",
    "twitter": "@alice_dev"
  },
  {
    "id": 102,
    "name": "Bob",
    "email": "[email protected]"
    // Bob has no Twitter handle
  }
]

"product_catalog": [
  {
    "id": "A1",
    "name": "Smartphone",
    "brand": "BrandX",
    "camera_specs": "12MP",
    "battery_life": "10h"
  },
  {
    "id": "B2",
    "name": "Laptop",
    "brand": "BrandY",
    "ram": "16GB",
    "storage": "512GB SSD"
    // No camera_specs for laptops
  }
]
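A minimal sketch of querying such documents: because documents need not share a schema, the code must tolerate missing fields, such as Bob's absent Twitter handle.

```python
user_profiles = [
    {"id": 101, "name": "Alice", "email": "[email protected]", "twitter": "@alice_dev"},
    {"id": 102, "name": "Bob", "email": "[email protected]"},  # no Twitter handle
]

# .get() returns None when 'twitter' is absent, so documents with
# different shapes can be filtered with the same query.
with_twitter = [u["name"] for u in user_profiles if u.get("twitter")]
print(with_twitter)  # ['Alice']
```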

2: Key-Value Databases

  • Data stored as a key and its corresponding value
  • Very fast lookups by key
  • Use Cases: Caching, session management
  • Managed Services: Amazon DynamoDB, Azure Cosmos DB (Table API), Google Cloud Firestore
//session1
{
  "key": "abc123",
  "value": {
    "userId": "u001",
    "loginTime": "2050-07-24T10:00:00Z",
    "role": "admin"
  }
}

//session2
{
  "key": "xyz789",
  "value": {
    "userId": "u002",
    "loginTime": "2050-07-24T10:05:00Z",
    "role": "viewer"
  }
}
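Session caching is the classic key-value workload. Below is a minimal in-memory sketch with expiry; it is illustrative only, since real services (Redis, DynamoDB with TTL, etc.) provide this behavior natively.

```python
import time

# Minimal key-value session store with time-to-live (TTL) expiry.
class SessionStore:
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds: float):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazily evict expired sessions
            return None
        return value

sessions = SessionStore()
sessions.put("abc123", {"userId": "u001", "role": "admin"}, ttl_seconds=1800)
print(sessions.get("abc123")["userId"])  # u001
print(sessions.get("missing"))           # None -> unknown key
```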

3: Graph Databases

  • Data modeled using nodes and relationships (edges)
  • Great for capturing and querying complex relationships
  • Use Cases: Social networks, recommendation engines, fraud detection
  • Managed Services: Amazon Neptune, Azure Cosmos DB Gremlin API, Spanner Graph
{
  "nodes": [
    { "id": "u1", "name": "Ranga" },
    { "id": "u2", "name": "Ravi" },
    { "id": "u3", "name": "John" },
    { "id": "u4", "name": "Sathish" }
  ],
  "edges": [
    { "from": "u1", "to": "u2", "label": "FRIEND" },
    { "from": "u2", "to": "u3", "label": "FRIEND" },
    { "from": "u3", "to": "u4", "label": "FRIEND" },
    { "from": "u4", "to": "u1", "label": "FRIEND" }
  ]
}
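The friend graph above can be queried with a breadth-first search; here is a sketch over a plain adjacency list (real graph databases use query languages such as Gremlin or Cypher for this instead):

```python
from collections import deque

edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1")]

# Build an undirected adjacency list from the FRIEND edges.
adjacency: dict = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def friends_within(start: str, max_hops: int) -> set:
    """Breadth-first search: everyone reachable within max_hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    seen.discard(start)
    return seen

print(friends_within("u1", 1))  # direct friends: {'u2', 'u4'}
print(friends_within("u1", 2))  # friends-of-friends included
```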

4: Column-Family Databases

  • Data stored in rows and columns grouped into families
  • Sparse format – rows don’t need to have all columns
  • Use Cases: IoT data, time-series data, analytics
  • Managed Services: Amazon Keyspaces (Cassandra), Azure Cosmos DB Cassandra API, Google Cloud Bigtable

Example

{
  // Unique identifier for the row 
  // typically identifies a device, user, or service
  "rowKey": "device123",

  "columnFamilies": {
    
    // First column family stores time-based log entries
    "logs": {
      // Timestamp as column name, log message as value
      "2050-07-24T10:00:00Z": "Temperature: 32°C",
      "2050-07-24T10:01:00Z": "Temperature: 33°C",
      "2050-07-24T10:02:00Z": "Temperature: 34°C"
    },

    // Second column family stores system statuses
    "status": {
      // Same timestamp as column name, status message as value
      "2050-07-24T10:00:00Z": "OK",
      "2050-07-24T10:01:00Z": "OK",
      "2050-07-24T10:02:00Z": "ALERT: Temp threshold exceeded"
    }
  }
}
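A sketch of reading such a row in plain Python; one status column is omitted here to show that sparse rows (rows missing some columns) are perfectly legal:

```python
row = {
    "rowKey": "device123",
    "columnFamilies": {
        "logs": {
            "2050-07-24T10:00:00Z": "Temperature: 32°C",
            "2050-07-24T10:01:00Z": "Temperature: 33°C",
            "2050-07-24T10:02:00Z": "Temperature: 34°C",
        },
        # Sparse: this family has no 10:01 column, which is fine.
        "status": {
            "2050-07-24T10:00:00Z": "OK",
            "2050-07-24T10:02:00Z": "ALERT: Temp threshold exceeded",
        },
    },
}

def latest(row: dict, family: str) -> str:
    """ISO-8601 timestamps sort lexicographically, so the latest
    column is simply the maximum key in the family."""
    columns = row["columnFamilies"][family]
    return columns[max(columns)]

print(latest(row, "logs"))    # Temperature: 34°C
print(latest(row, "status"))  # ALERT: Temp threshold exceeded
```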

Where is Unstructured Data stored? (File, Block, or Object storage) #


Why Unstructured Data?

  • Scenario: Imagine building YouTube — videos, thumbnails, subtitles, and logs. All this content doesn’t fit into a table.
  • Need: A way to store and retrieve large files like videos, images, documents, and logs — without predefined structure.
  • Definition: Data without a fixed schema or format (e.g., audio, video, PDFs, images, binaries)
  • Examples: Uploaded files, media content, backup archives, sensor logs
  • Handled Using: File, block, or object storage based on access pattern and use case

Types of Storage for Unstructured Data

  • 1: Block Storage
  • 2: File Storage
  • 3: Object Storage

1: Block Storage

  • Low-level storage used like a hard disk
  • High performance for structured workloads (e.g., VM disks, databases)
  • Use Cases: OS disks, DB volumes

2: File Storage

  • Shared file systems accessed over network using file paths
  • Ideal for applications needing traditional file structure and shared access
  • Use Cases: Team file shares, CMS systems

3: Object Storage

  • Data stored as objects (data + metadata + unique ID)
  • Accessed using REST APIs — no mounting required
  • Scalable, cost-effective, and durable
  • Use Cases: Media hosting, backups, logs, static websites

What is Block Storage? #


Why Block Storage in Cloud?

  • Scenario: Imagine running a virtual machine or a database in the cloud. You need a reliable, fast disk that behaves like a physical hard drive.
  • Block Storage: Provides raw storage volumes that can be attached to servers and used just like local disks.
  • Goal: Attach virtual disks to compute resources like VMs, containers, and databases

Key Characteristics

  • Raw Volumes: You format and mount it like a traditional disk
  • Detachable and Reusable: Can be attached, detached, and re-attached to different servers
  • Persistent: Data stays intact even if the VM is stopped or restarted

Use Cases

  • Virtual Machines: OS and data disks for cloud-based servers
  • Databases: High-speed transactional storage for SQL and NoSQL engines

Choice 1: Types of Block Storage

  • Persistent Block Storage (e.g., Network Attached like EBS)
    • Stored separately and connected over the network
    • Retains data across VM stops, starts, or replacements
    • Ideal for critical data like databases or file systems
  • Temporary Block Storage (e.g., Instance Store)
    • Physically attached to the VM host
    • Very fast but data is lost if VM is terminated
    • Best for temporary data like cache or scratch files

Choice 2: HDD (Hard Disk Drive) vs SSD (Solid State Drive)

  • HDD (Hard Disk Drive)
    • Transactional Performance: Lower – slower for frequent reads/writes
    • Throughput: High – good for sequential access of large files
    • Strength: Best for large, sequential data processing
    • Use Cases:
      • Big data workloads
      • Log processing
      • Backup or archival
    • Cost: Lower – budget-friendly
  • SSD (Solid State Drive)
    • Transactional Performance: High – excellent for fast, frequent access
    • Throughput: High – handles both small and large data well
    • Strength: Great for small, random and sequential I/O
    • Use Cases:
      • Databases
      • Web servers
      • Operating system volumes
    • Cost: Higher – but justified for high performance

Cloud Managed Services for Block Storage

  • AWS: Amazon EBS, Instance Store
  • Azure: Azure Managed Disks, Temporary Disks
  • Google Cloud: Persistent Disks, Local SSDs

What is File Storage? #


Why File Storage?

  • Scenario: Imagine a team working on shared documents, code files, or project reports. Everyone needs access to the same files, organized in folders.
  • File Storage: A storage system that organizes data in a familiar folder and file format, accessible over a shared network.

What is File Storage?

  • Definition: A way to store data in hierarchical structure – folders and files
  • Goal: Share files easily between users or systems
  • Access Method: Files accessed using standard protocols like NFS or SMB
    • NFS: A protocol designed to share files over a network in Linux/Unix systems
    • SMB: A protocol used mainly in Windows environments to share files

How File Storage Works

  • Mounted Volumes: File storage is mounted like a network drive
  • Shared Access: Multiple users or servers can read/write files
  • Folders & Files: Organize data just like on your laptop

Benefits of File Storage

  • Easy to Use: Familiar structure – folders, paths, extensions
  • Shared Access: Ideal for collaboration
  • Reliable: Supports backups, snapshots, and versioning

Cloud Managed Services for File Storage

  • AWS: Amazon EFS (Elastic File System), Amazon FSx
  • Azure: Azure Files
  • Google Cloud: Filestore

What is Object Storage? #


Why Object Storage in Cloud?

  • Scenario: Imagine millions of users uploading photos, videos, and documents through a website or mobile app — and accessing them anytime, from anywhere. Traditional file systems struggle to scale, manage access, or serve global traffic efficiently.
  • Object Storage: Designed to store and retrieve large volumes of unstructured data (like media and backups) with high durability, scalability, and global accessibility.
  • Goal: Store and retrieve any type of file at scale using a simple API

Key Characteristics

  • Flat Structure: No folders — everything is stored in a bucket with a unique key
  • Scalable: Can handle billions of files without performance loss
  • Durable: Data is automatically replicated across multiple zones or regions
  • Access via HTTP APIs: Easy to integrate with applications, websites, and mobile apps

Use Cases

  • Backup and Archiving: Reliable, long-term storage
  • Web Content: Store and serve images, videos, documents
  • Big Data and Analytics: Ingest large files for processing
  • Application Storage: Store logs, exports, reports

Choice 1: Versioning

  • Purpose: Keep multiple versions of an object to protect against accidental deletes or overwrites
  • Use Case: Restore a previous version of a file or track changes over time
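A minimal sketch of the idea behind versioning (the `VersionedBucket` class is invented for illustration): each overwrite appends a new version instead of replacing the old one, so earlier versions stay restorable.

```python
# Toy versioned object store: overwrites append, never replace.
class VersionedBucket:
    def __init__(self):
        self._versions: dict = {}  # key -> list of values, oldest first

    def put(self, key, value):
        self._versions.setdefault(key, []).append(value)

    def get(self, key, version=None):
        """Return the latest version by default, or a specific one."""
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version]

bucket = VersionedBucket()
bucket.put("report.txt", "draft")
bucket.put("report.txt", "final")
print(bucket.get("report.txt"))             # 'final' (latest)
print(bucket.get("report.txt", version=0))  # 'draft' (restorable)
```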

Choice 2: Storage Tiers/Storage Class

  • Purpose: Store data in the most cost-effective tier based on how often it is accessed
  • Balance Performance and Price: Pay more for speed when needed, save more when data is rarely used
  • Example Tiers
    • Standard or Hot: For frequently accessed data
    • Infrequent Access or Cool: For data accessed less often
    • Archive: For long-term storage with slower retrieval

Choice 3: Lifecycle Rules

  • Purpose: Automatically transition or delete objects based on age or other characteristics
  • Use Case: Move old files to cheaper storage (e.g., archive) or delete them after a set period to save costs
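Lifecycle rules are essentially age-based tier transitions. A toy policy makes the mechanics concrete; the thresholds below are illustrative, not any provider's defaults.

```python
# Illustrative lifecycle policy: (max age in days, tier), checked in order.
LIFECYCLE_RULES = [
    (30, "standard"),    # under 30 days old  -> hot tier
    (90, "infrequent"),  # 30-90 days old     -> cool tier
    (365, "archive"),    # 90-365 days old    -> archive tier
]

def tier_for_age(age_days: int):
    """Return the tier for an object of this age, or None to delete it."""
    for max_age, tier in LIFECYCLE_RULES:
        if age_days < max_age:
            return tier
    return None  # past the last threshold -> expire the object

print(tier_for_age(5))     # standard
print(tier_for_age(45))    # infrequent
print(tier_for_age(200))   # archive
print(tier_for_age(1000))  # None -> delete
```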

Cloud Managed Services for Object Storage

  • AWS: Amazon S3 (Standard, IA, Glacier, ..)
  • Azure: Azure Blob Storage (Hot, Cool, Archive, ..)
  • Google Cloud: Cloud Storage (Standard, Nearline, Coldline, Archive, ..)

Interesting Thing to Know

  • Amazon Glacier Service: Amazon Glacier was launched as a standalone service for long-term storage with very low cost.
  • Optimized for Archiving: Great for backups, compliance data, and archived content that can wait minutes or hours to be retrieved.
  • Over Time, Integrated into S3: Now offered as an Amazon S3 storage class
  • Simple Management: Use S3 interface to store data in Glacier – no need to manage a new service.
  • All Clouds Offer Archive Storage as a Storage Class/Tier: AWS (Glacier), Azure (Archive), GCP (Coldline, Archive)

What is Hybrid Storage? #


Why Hybrid Storage?

  • Scenario: Imagine storing large datasets which need to be accessed on-premises — some frequently accessed, some rarely touched. Keeping all of it in the cloud may be slow. Keeping it all on-prem may limit scalability.
  • Hybrid Storage: Combines on-premises and cloud storage to balance cost, speed, and control.

What is Hybrid Storage?

  • Definition: A storage solution that bridges local (on-premises) storage with cloud storage
  • Goal: Enable seamless data access and movement between on-prem and cloud environments
  • Use Cases:
    • Gradual cloud migration
    • Burst storage for peak workloads
    • Archive and backup to cloud

Benefits of Hybrid Storage

  • Scalability: Cloud extends your capacity without new hardware
  • Cost Efficiency: Store only hot data locally, cold data in the cloud
  • Performance: Local access for critical data
  • Data Protection: Cloud backup improves disaster recovery

Cloud Managed Services for Hybrid Storage

  • AWS: AWS Storage Gateway
  • Azure: Azure File Sync
  • Google Cloud: Google Cloud Filestore (with Hybrid Connectivity)

What is the need for Data Analytics? #


Why Data Analytics?

  • Scenario: You have tons of raw data — purchases, transactions, sensor readings. But without analysis, it’s just noise.
  • Data Analytics: The process of analyzing raw data to extract useful insights
  • Goal: Make data-driven decisions to improve business outcomes

Data Sources Can Include

  • Transactions: Purchases, payments, logs
  • Sensor & IoT Data: Temperature, pressure, GPS
  • External Feeds: Weather, stock prices, social media
  • Internal Systems: CRM, HR, finance systems

Key Benefits of Data Analytics

  • Uncover Trends: Understand customer behavior, market shifts, and product performance
  • Identify Weaknesses: Spot bottlenecks, inefficiencies, or risks early
  • Improve Outcomes: Enhance efficiency, customer satisfaction, and profitability

Why Data Analytics Workflow?

  • Scenario: You collect tons of raw data from multiple sources. But raw data is not useful until it's cleaned, processed, and visualized.
  • Workflow: Follows a clear step-by-step path from raw data to business insights.

Data Ingestion

  • Goal: Collect raw data from multiple sources
  • Sources: Websites, IoT sensors, apps, logs, transactions
  • Modes:
    • Batch: Data loaded periodically
    • Stream: Data ingested in real-time (e.g. user clicks, weather sensors)

Data Processing

  • Goal: Make data usable for analysis
  • Clean: Remove duplicates and errors
  • Filter: Eliminate irrelevant or outlier data
  • Transform: Convert to a consistent format or structure
  • Aggregate: Combine data for summaries or insights

Data Storage

  • Where: Store in a data warehouse
  • Goal: Centralize data for easy access and future analysis

Data Querying

  • What: Run SQL-like queries to analyze trends, identify patterns

Data Visualization

  • Why: Charts and dashboards make insights easier to understand
  • Impact: Helps leadership spot trends, outliers, and make informed decisions
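The whole workflow can be sketched end to end on a toy dataset: ingest raw events, clean them, aggregate, then query the result.

```python
# Ingestion: raw events collected from an app (toy data).
raw_events = [
    {"user": "u1", "region": "EU", "amount": 120},
    {"user": "u1", "region": "EU", "amount": 120},  # duplicate record
    {"user": "u2", "region": "US", "amount": -5},   # invalid record
    {"user": "u3", "region": "US", "amount": 80},
]

# Processing: de-duplicate and drop invalid rows.
seen, cleaned = set(), []
for event in raw_events:
    key = tuple(sorted(event.items()))
    if key not in seen and event["amount"] > 0:
        seen.add(key)
        cleaned.append(event)

# Storage + querying: aggregate revenue by region (a GROUP BY, in effect).
revenue_by_region: dict = {}
for event in cleaned:
    revenue_by_region[event["region"]] = (
        revenue_by_region.get(event["region"], 0) + event["amount"]
    )

print(revenue_by_region)  # {'EU': 120, 'US': 80}
# Visualization would chart this summary in a dashboard tool.
```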

Cloud Managed Services for Data Analytics

| Stage | AWS | Azure | Google Cloud |
| --- | --- | --- | --- |
| Streaming Ingestion | Amazon Kinesis | Azure Event Hubs | Pub/Sub |
| Data Processing (ETL / Data Prep) | AWS Glue | Azure Data Factory | Dataflow, Dataprep |
| Data Warehouse and Querying | Amazon Redshift | Azure Synapse Analytics | BigQuery |
| Data Visualization | Amazon QuickSight | Power BI | Looker |

Compare Data Warehouse vs Data Lake #


The 3Vs of Big Data

  • Volume: Massive datasets — from terabytes to petabytes to exabytes
  • Variety: Mix of structured (tables), semi-structured (JSON), and unstructured (videos, logs)
  • Velocity: Speed of data arrival — batch (hourly/daily) or real-time (streams)

What if the Data We're NOT Capturing Becomes Valuable Later?

  • Businesses may miss future insights if they only store processed data
  • Storing raw data now gives the flexibility to run AI, ML, or new analytics later
  • A Data Lake solves this — store everything today, use what you need tomorrow

Data Warehouse vs Data Lake

  • Data Warehouse
    • Stores processed, structured data optimized for fast SQL queries
    • Used for business intelligence, dashboards, and reporting
    • Examples: Teradata, Amazon Redshift, Google BigQuery, Azure Synapse Analytics
  • Data Lake
    • Stores raw, unprocessed data — compressed and cost-efficient
    • Can handle any format (CSV, JSON, images, audio, logs, etc.)
    • Supports on-demand exploration, AI/ML, and analytics workflows
    • Built on object storage
    • Examples: Amazon S3, Google Cloud Storage, Azure Data Lake Storage Gen2

How They Work Together

  • Data Lake holds everything — raw logs, events, images, clickstreams, ...
  • Data Warehouse pulls from the lake — after processing
  • Modern Tools: Services like Google BigQuery, Azure Synapse Analytics, and Amazon Athena can query data directly from the data lake — no need to move or duplicate

Cloud Managed Services for Big Data Storage and Analytics

  • AWS:

    • Storage: Amazon S3
    • Warehouse: Amazon Redshift
    • Query-over-lake: Amazon Athena
  • Azure:

    • Storage: Azure Data Lake Storage Gen2
    • Warehouse: Azure Synapse Analytics
    • Query-over-lake: Synapse Serverless SQL
  • Google Cloud:

    • Storage: Google Cloud Storage
    • Warehouse: BigQuery
    • Query-over-lake: BigQuery External Tables