Chapter 4 - Scaling Databases (Summary)

Storage Services #stateful-services #consistency #redundancy

  • Stateful services like storage have mechanisms for consistency and redundancy to avoid data loss
  • They use consensus protocols like Paxos for strong consistency or eventual consistency mechanisms
  • Tradeoffs involve consistency, complexity, security, latency, performance
  • Stateless services are preferred; state is kept only in dedicated stateful services

Key Point: Understand the tradeoffs involved in choosing consistency mechanisms for stateful services.

Strong vs Weak Consistency #strong-consistency #weak-consistency

  • In strong consistency, all parallel processes see accesses in the same order (sequentially)
  • Only one consistent state is observed across processes
  • In weak/eventual consistency, different processes may see variables in different states

Key Point: Strong consistency provides a single linear view, while eventual consistency allows temporary inconsistencies.

Types of Storage #database #sql #nosql #column-oriented #key-value #document #graph #file-storage #block-storage

  • SQL: Relational databases with tables and keys; provide ACID properties
  • NoSQL: Umbrella term for databases that relax some SQL/ACID properties
  • Column-oriented: Data organized into columns (e.g. Cassandra, HBase)
  • Key-value: Data as key-value pairs, keys are hashable (e.g. Memcached, Redis)
  • Document: Key-value with larger values in formats like JSON, YAML (e.g. MongoDB)
  • Graph: Designed for efficient relationship storage (e.g. Neo4j, RedisGraph)
  • File Storage: Data stored in files/directories; conceptually key-value, with the path as the key
  • Block Storage: Data in fixed-size blocks, used in low-level storage systems

Key Point: Understand the different types of storage and their characteristics to choose the appropriate one for your requirements.
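
To make the key-value and document models concrete, here is a minimal sketch using plain Python dicts (no real database; the store names and keys are illustrative) showing the same user record in both styles:

```python
import json

# Key-value: an opaque value addressed by a hashable key (Memcached/Redis style).
kv_store = {}
kv_store["user:42:name"] = "Ada"
kv_store["user:42:email"] = "ada@example.com"

# Document: the value is a structured document (MongoDB style) whose fields
# can be read and queried individually.
doc_store = {}
doc_store["user:42"] = json.dumps({"name": "Ada", "email": "ada@example.com"})

doc = json.loads(doc_store["user:42"])
print(kv_store["user:42:name"])  # Ada
print(doc["email"])              # ada@example.com
```

The key-value model forces the application to know the exact key for each field, while the document model groups related fields under one key.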

Replication #replication #partitioning #sharding #fault-tolerance #scalability #latency

  • Replication makes copies (replicas) of data on different nodes
  • Partitioning divides data into subsets; sharding distributes those partitions across nodes
  • Enables fault-tolerance, higher storage capacity, throughput, and lower latency

Replica Distribution #replica-distribution #rack-awareness #data-center-awareness

  • Typical design: one replica on the same rack as the original, another on a different rack or data center
  • Sharding provides storage scaling, memory scaling, parallel processing, and data locality

Key Point: Replication and sharding are key techniques for scalability, fault-tolerance, and performance in distributed systems.
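
A common way to assign a record to a shard is a stable hash of its key modulo the shard count. The sketch below is a simplified illustration; production systems often use consistent hashing instead, so that adding or removing a node reshuffles only a fraction of the keys:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard id with a stable hash. Deterministic: the same
    key always lands on the same shard for a fixed num_shards."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

shard = shard_for("user:42", 4)
print(shard)  # a fixed shard id in 0..3
```

Note the weakness this illustrates: changing `num_shards` remaps almost every key, which is exactly what consistent hashing mitigates.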

Single-Leader Replication #single-leader #read-scalability #write-bottleneck

  • All writes occur on a single leader node, replicated to followers
  • Scales reads by increasing replicas, but writes are still bottlenecked
  • Examples: MySQL, Postgres replication configurations

Multi-Level Replication #multi-level-replication #consistency-delay

  • Multiple levels of followers, like a pyramid, to scale reads further
  • Each node replicates to followers it can handle
  • Tradeoff: consistency is further delayed down the levels

Key Point: Single-leader replication is simple but has limitations in write scalability and consistency.
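
The single-leader flow can be sketched in a few lines of Python (class names are illustrative; real systems replicate asynchronously over the network, not via direct method calls):

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader(Replica):
    """All writes go through the leader, which forwards them to followers."""
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)
        for f in self.followers:  # synchronous replication, for simplicity
            f.apply(key, value)

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 1)
print(followers[0].data["x"])  # 1 - reads can be served from any follower
```

Adding more `Replica` objects scales reads, but every write still funnels through the single `Leader`, which is the write bottleneck the section describes.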

Multi-Leader Replication #multi-leader #write-scalability #race-conditions

  • Multiple leader nodes, writes can occur on any leader
  • Each leader replicates writes to all other nodes
  • Introduces race conditions and consistency problems

Consistency Problems #consistency-problems #clock-skew #sequence-importance

  • Sequence is important for operations like UPDATE and DELETE
  • Using timestamps doesn't work due to clock skew (imperfect clock sync)
  • Different replicas may process the same operations in different orders

Key Point: Multi-leader replication enables write scalability but requires handling race conditions and consistency issues.
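
Why ordering matters can be shown with a toy example: two leaders receive the same two concurrent writes in opposite orders, and naively applying them last-write-wins leaves the replicas in different final states (the operation values are illustrative):

```python
def apply_ops(ops):
    """Apply key/value writes in the order received; the last write wins."""
    state = {}
    for key, value in ops:
        state[key] = value
    return state

op_a = ("balance", 100)  # UPDATE received first at leader A
op_b = ("balance", 200)  # concurrent UPDATE received first at leader B

leader_a = apply_ops([op_a, op_b])  # A applies its local write, then B's
leader_b = apply_ops([op_b, op_a])  # B applies its local write, then A's
print(leader_a["balance"], leader_b["balance"])  # 200 100 - replicas diverge
```

Timestamps cannot reliably break the tie because of clock skew, which is why such systems need conflict-resolution schemes (e.g., version vectors or application-level merges).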

Leaderless Replication #leaderless #quorum #consensus

  • All nodes are equal, reads and writes can occur on any node
  • Introduces race conditions, handled using the concept of quorum
  • Quorum is the minimum number of nodes required for consensus
  • Examples: Cassandra, Dynamo, Riak, Voldemort

Quorum Configurations #quorum-configurations #write-quorum #read-quorum

  • For n nodes, strong consistency requires read quorum + write quorum > n; a common choice is n/2 + 1 for both
  • Otherwise only eventual consistency holds, and UPDATE/DELETE operations may be applied inconsistently

Key Point: Leaderless replication requires careful quorum configuration to balance consistency and performance.
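
The quorum-overlap condition is simple to express: reads are strongly consistent when the read and write quorums must share at least one node, i.e., r + w > n. A small sketch:

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Read and write quorums overlap in at least one node when r + w > n,
    so every read is guaranteed to contact a node with the latest write."""
    return r + w > n

n = 5
majority = n // 2 + 1                                  # 3 of 5 nodes
print(is_strongly_consistent(n, majority, majority))   # True
print(is_strongly_consistent(n, 1, 1))                 # False: eventual only
```

Lowering w or r below majority trades consistency for lower latency, which is the tuning knob systems like Cassandra expose per request.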

Key Takeaways

  1. Consistency vs. Availability Tradeoff: Understand the tradeoffs involved in choosing consistency mechanisms (strong, eventual) and their impact on availability and performance.

  2. Replication Strategies: Grasp the different replication strategies (single-leader, multi-leader, leaderless) and their advantages, limitations, and consistency implications.

  3. Storage Types: Familiarize yourself with the different types of storage (SQL, NoSQL, column-oriented, key-value, document, graph, file, block) and their suitable use cases.

  4. Scalability Techniques: Leverage techniques like replication, partitioning, and sharding to achieve fault-tolerance, increased storage capacity, throughput, and lower latency.

  5. Distributed Systems Complexities: Be aware of complexities in distributed systems, such as race conditions, clock skew, and the need for consensus algorithms and quorum configurations.

  6. Further Reading: Refer to resources like "Designing Data-Intensive Applications" by Martin Kleppmann for more in-depth coverage of topics like consensus algorithms, failover problems, and multi-leader replication.

References:

  1. Partitioning Vs Sharding

Scaling Storage Capacity with Sharded Databases

  • As the database size grows beyond the capacity of a single host, consider using sharded storage solutions like HDFS or Cassandra, which are horizontally scalable and can support large storage capacities by adding more hosts. #sharded-storage #horizontal-scaling #large-clusters

Aggregating Events

  • Database writes are expensive to scale, so aim to reduce write rates through sampling and aggregation techniques. #reduce-writes #sampling #aggregation
  • Aggregating events combines multiple events into a single event, resulting in fewer database writes. #event-aggregation #reduced-writes
  • In multi-tier aggregation, each layer aggregates events from the previous tier, progressively reducing the number of hosts. #multi-tier #progressive-reduction
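
Aggregation before writing can be sketched as collapsing many raw events into one count per event type, so the database receives one row per type instead of one per event (event shapes here are illustrative):

```python
from collections import Counter

def aggregate(events):
    """Collapse raw events into per-type counts: one aggregated write
    replaces len(events) individual writes."""
    return Counter(e["type"] for e in events)

events = [{"type": "click"}, {"type": "click"}, {"type": "view"}]
counts = aggregate(events)
print(counts)  # 2 clicks, 1 view - three events become two rows
```

In multi-tier aggregation, each tier would run this same step over the partial counts produced by the tier below it.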

Batch and Streaming ETL

  • ETL (Extract, Transform, Load) is the process of copying data from sources to a destination system with different data representation. #etl
  • Batch processing refers to periodically processing data in batches, while streaming processes data in real-time as a continuous flow. #batch-processing #streaming-processing
  • An ETL pipeline consists of a Directed Acyclic Graph (DAG) of tasks, where nodes represent tasks, and ancestors are dependencies. #dag #tasks #dependencies
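
The DAG-of-tasks idea can be sketched with the standard-library `graphlib` module (Python 3.9+): each task maps to the set of tasks it depends on, and a topological sort yields a valid execution order. Task names are illustrative:

```python
from graphlib import TopologicalSorter

# Each task maps to its dependencies (ancestors in the DAG).
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform_join": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform_join"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # both extracts before the join, the join before the load
```

Schedulers like Airflow and Luigi are, at their core, this topological ordering plus scheduling, retries, and monitoring.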

ETL Tools

  • Common batch tools include Airflow and Luigi. #batch-tools
  • Common streaming tools include Kafka, Flink, Flume, and Scribe. #streaming-tools

Messaging Terminology

  • Message broker: Translates messaging protocols between sender and receiver (e.g., Kafka, RabbitMQ). #message-broker
  • Event streaming: Continuous flow of events processed in real-time (e.g., Kafka). #event-streaming

Push vs. Pull

  • Push is better for lossy applications like live audio/video streaming, where failed data delivery is not resent. #push #lossy-applications
  • Pull is better when the consumer has a continuously high load, making an empty queue unlikely. #pull #high-load
  • Pull is also better when the consumer is firewalled or the dependency has frequent changes, reducing the number of push requests. #pull #firewalled #frequent-changes
  • Push can be more scalable when collecting data from many sources, avoiding the complexity of maintaining multiple crawlers. #push #data-collection #crawlers

Kafka vs. RabbitMQ

  • Kafka is more complex but provides a superset of RabbitMQ's capabilities and can replace RabbitMQ, but not vice versa. #kafka #rabbitmq #capabilities
  • Kafka is designed for scalability, reliability, and availability, with more complex setup and ZooKeeper dependency. #kafka #scalability #reliability #availability #zookeeper
  • Kafka provides durable message storage with replication across racks and data centers. #kafka #durability #replication

Lambda Architecture

  • A data-processing architecture running batch and streaming pipelines in parallel. #lambda-architecture
  • The fast pipeline trades off consistency and accuracy for lower latency using techniques like approximation algorithms, in-memory databases, and potentially no replication for faster processing. #fast-pipeline #approximation #in-memory-databases #no-replication
  • The slow pipeline uses MapReduce databases like Hive and Spark with HDFS, prioritizing consistency and accuracy over low latency. #slow-pipeline #mapreduce #consistency #accuracy
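
The serving side of a Lambda architecture can be sketched as merging a precomputed batch view with a real-time view covering events since the batch cutoff (view contents here are made up for illustration):

```python
# Slow pipeline output: accurate, but only up to the last batch run.
batch_view = {"clicks": 10_000}
# Fast pipeline output: approximate counts streamed since the batch cutoff.
speed_view = {"clicks": 37}

def query(metric):
    """Serve queries by combining the batch view with the real-time delta."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))  # 10037
```

When the next batch run completes, its view absorbs the streamed events and the speed view resets, bounding how long the approximate numbers are served.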

References:

  1. Airflow Vs Luigi

  2. Priority Queues in RabbitMQ

Denormalization

  • #ConsistentData - Normalization avoids duplicate data, ensuring consistency across tables.
  • #FasterInsertsAndUpdates - With normalized data, only one table needs to be modified per insert or update.
  • #SmallerDatabaseSize - No duplicate data leads to smaller tables.
  • Denormalization deliberately duplicates data to avoid join overhead, trading these benefits for lower read latency. #read-latency

Caching Strategies

  • #ImprovedPerformance - Caches use memory, which is faster and more expensive than disk-based databases.
  • #IncreasedAvailability - If the database is unavailable, cached data can still be served, albeit limited to what is cached.
  • #EnhancedScalability - Caches can serve frequently requested data, reducing load on the backend and improving scalability.

Read Strategies

  • #ReadFromCache - In cache-aside, the application first reads from the cache.
  • #CacheMiss - On a cache miss, the application reads from the database and writes the data to the cache.
  • #ReadHeavyLoads - Cache-aside is best for read-heavy loads.
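
The cache-aside read path can be sketched in a few lines, using dicts to stand in for the cache and the database (names are illustrative):

```python
cache = {}
database = {"user:42": "Ada"}

def read(key):
    """Cache-aside: check the cache first; on a miss, read the database
    and populate the cache so subsequent reads are served from memory."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value

first = read("user:42")        # miss: fetched from the database, now cached
second = read("user:42")       # hit: served from the cache
print(first == second)         # True
```

Only the first read pays the database round trip, which is why cache-aside favors read-heavy workloads.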

Write Strategies

  • #ConsistentCache - In write-through, every write goes through the cache and then to the database, keeping the cache consistent.
  • #SlowerWrites - Writes are slower since they happen on both the cache and database.
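
The write-through path is the mirror image of the read path above; again a minimal sketch with dicts standing in for both stores:

```python
cache = {}
database = {}

def write_through(key, value):
    """Write-through: update the cache and the database on every write.
    The cache is never stale, but each write pays for both stores."""
    cache[key] = value
    database[key] = value  # both updated before the write is acknowledged

write_through("user:42", "Ada")
print(cache["user:42"] == database["user:42"])  # True - cache stays consistent
```

A write-back variant would return after updating only the cache and flush to the database later, which is faster but risks losing acknowledged writes.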

Caching as a Separate Service

  • #StatelessServices - Services are designed to be stateless, with requests randomly assigned to hosts, reducing the likelihood of cache hits on individual hosts.
  • #IndependentScaling - Caching services can scale independently of the services they serve, using optimized hardware or virtual machines.

Data Caching

  • #PrivateInformation - Private information, such as bank account details, must never be cached.
  • #RevalidateChangingData - Information like flight ticket or hotel room availability can be cached but should be revalidated against the origin server.

Cache Invalidation

  • #MaxAge - Setting a max-age value to control cache expiration.
  • #Fingerprinting - Fingerprinting files with hashes or query parameters to ensure correct file versions are served.
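
Fingerprinting can be sketched as embedding a content hash in the filename, so a changed file gets a new URL and clients can never be served a stale cached copy (the naming scheme here is one common convention, not a standard):

```python
import hashlib

def fingerprint(filename: str, content: bytes) -> str:
    """Insert a short content hash before the extension, e.g.
    app.js -> app.<hash>.js. Identical content yields identical names,
    so unchanged files stay cacheable indefinitely."""
    digest = hashlib.md5(content).hexdigest()[:8]
    name, dot, ext = filename.rpartition(".")
    return f"{name}.{digest}.{ext}" if dot else f"{filename}.{digest}"

v1 = fingerprint("app.js", b"console.log('v1')")
v2 = fingerprint("app.js", b"console.log('v2')")
print(v1 != v2)  # True - new content, new URL, old cache entries bypassed
```

With fingerprinted names, the files themselves can be served with a very long max-age, since any change produces a different URL.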

Cache Warming

  • #AdditionalComplexity - Implementing cache warming adds complexity, especially for large caching services with thousands of hosts.
  • #AdditionalTraffic - Cache warming generates additional traffic by querying services and databases to populate the cache.
  • #ServiceLoad - Services and databases may not be able to handle the load from cache warming.

Cache Strategies

| Cache Strategy | Description | When to Use |
| --- | --- | --- |
| Cache-Aside | Check cache first; on miss, read from database and write to cache | Read-heavy workloads, simple implementation |
| Read-Through | Cache handles reads, fetches from database on miss and caches data | Read-heavy workloads, abstracted caching logic |
| Write-Through | Every write goes to cache and database (consistent but slower) | Data consistency is critical, read-heavy workloads |
| Write-Back/Behind | Write to cache; cache periodically flushes to database (faster) | Write-heavy workloads, data consistency not critical |
| Write-Around | Write directly to database; cache updated on miss | Write-heavy workloads, data integrity is critical |

ℹ️ Optimize your system's performance and scalability with these database and caching insights!

🚀 Designing stateful vs. stateless services:

  • Stateful services are complex; designs lean towards stateless with shared stateful services.

💾 Storage categories:

  • Understand database types: SQL and NoSQL (column-oriented, key-value, document, graph), plus file, block, and object storage.

🔍 Data storage decisions:

  • Choose between database and other storage categories.

🔄 Replication techniques:

  • Single-leader, multi-leader, leaderless replication, and more for scaling databases.

🔒 Sharding for scaling:

  • Essential when a database surpasses single-host storage capacity.

⚙️ Database optimization:

  • Minimize writes, aggregate events, utilize Lambda architecture, denormalize for read latency.

🔍 Query optimization:

  • Cache frequent queries, employ read strategies, understand cache tradeoffs.

🔃 Cache management:

  • Cache-aside for read-heavy loads; consider read-through, write-through, and write-back strategies.

💡 Cache best practices:

  • Use dedicated caching service, cache public not private data, employ effective cache invalidation strategies.

🔍 Cache warming:

  • Speed up first user access, but be aware of disadvantages.