Chapter 4 - Scaling Databases (Summary)

Storage Services #stateful-services #consistency #redundancy

  • Stateful services like storage have mechanisms for consistency and redundancy to avoid data loss
  • They use consensus protocols like Paxos for strong consistency or eventual consistency mechanisms
  • Tradeoffs involve consistency, complexity, security, latency, performance
  • Stateless services are preferred; state is kept only in dedicated stateful services

Key Point: Understand the tradeoffs involved in choosing consistency mechanisms for stateful services.

Strong vs Weak Consistency #strong-consistency #weak-consistency

  • In strong consistency, all parallel processes see accesses in the same order (sequentially)
  • Only one consistent state is observed across processes
  • In weak/eventual consistency, different processes may see variables in different states

Key Point: Strong consistency provides a single linear view, while eventual consistency allows temporary inconsistencies.

Types of Storage #database #sql #nosql #column-oriented #key-value #document #graph #file-storage #block-storage

  • SQL: Relational databases with tables and keys; provide ACID properties
  • NoSQL: Umbrella term for databases that relax some SQL/ACID properties
  • Column-oriented: Data organized into columns (e.g. Cassandra, HBase)
  • Key-value: Data as key-value pairs, keys are hashable (e.g. Memcached, Redis)
  • Document: Key-value with larger values in formats like JSON, YAML (e.g. MongoDB)
  • Graph: Designed for efficient relationship storage (e.g. Neo4j, RedisGraph)
  • File Storage: Data stored in files/directories; conceptually key-value, with the path as the key
  • Block Storage: Data in fixed-size blocks, used in low-level storage systems

Key Point: Understand the different types of storage and their characteristics to choose the appropriate one for your requirements.
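
To make the key-value and document models concrete, here is a minimal sketch using plain Python dicts (no real database; the store names and keys are illustrative) showing the same user record in both styles:

```python
import json

# Key-value: an opaque value addressed by a hashable key (Memcached/Redis style).
kv_store = {}
kv_store["user:42:name"] = "Ada"
kv_store["user:42:email"] = "ada@example.com"

# Document: the value is a structured document (MongoDB style) whose fields
# can be read and queried individually.
doc_store = {}
doc_store["user:42"] = json.dumps({"name": "Ada", "email": "ada@example.com"})

doc = json.loads(doc_store["user:42"])
print(kv_store["user:42:name"])  # Ada
print(doc["email"])              # ada@example.com
```

The key-value model forces the application to know the exact key for each field, while the document model groups related fields under one key.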

Replication #replication #partitioning #sharding #fault-tolerance #scalability #latency

  • Replication makes copies (replicas) of data on different nodes
  • Partitioning divides data into subsets; sharding distributes those partitions across nodes
  • Enables fault-tolerance, higher storage capacity, throughput, and lower latency

Replica Distribution #replica-distribution #rack-awareness #data-center-awareness

  • Typical design: one replica on the same rack as the original, another on a different rack or data center
  • Sharding provides storage scaling, memory scaling, parallel processing, and data locality

Key Point: Replication and sharding are key techniques for scalability, fault-tolerance, and performance in distributed systems.
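
A common way to assign a record to a shard is a stable hash of its key modulo the shard count. The sketch below is a simplified illustration; production systems often use consistent hashing instead, so that adding or removing a node reshuffles only a fraction of the keys:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard id with a stable hash. Deterministic: the same
    key always lands on the same shard for a fixed num_shards."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

shard = shard_for("user:42", 4)
print(shard)  # a fixed shard id in 0..3
```

Note the weakness this illustrates: changing `num_shards` remaps almost every key, which is exactly what consistent hashing mitigates.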

Single-Leader Replication #single-leader #read-scalability #write-bottleneck

  • All writes occur on a single leader node, replicated to followers
  • Scales reads by increasing replicas, but writes are still bottlenecked
  • Examples: MySQL, Postgres replication configurations

Multi-Level Replication #multi-level-replication #consistency-delay

  • Multiple levels of followers, like a pyramid, to scale reads further
  • Each node replicates to followers it can handle
  • Tradeoff: consistency is further delayed down the levels

Key Point: Single-leader replication is simple but has limitations in write scalability and consistency.
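
The single-leader flow can be sketched in a few lines of Python (class names are illustrative; real systems replicate asynchronously over the network, not via direct method calls):

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader(Replica):
    """All writes go through the leader, which forwards them to followers."""
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)
        for f in self.followers:  # synchronous replication, for simplicity
            f.apply(key, value)

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 1)
print(followers[0].data["x"])  # 1 - reads can be served from any follower
```

Adding more `Replica` objects scales reads, but every write still funnels through the single `Leader`, which is the write bottleneck the section describes.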

Multi-Leader Replication #multi-leader #write-scalability #race-conditions

  • Multiple leader nodes, writes can occur on any leader
  • Each leader replicates writes to all other nodes
  • Introduces race conditions and consistency problems

Consistency Problems #consistency-problems #clock-skew #sequence-importance

  • Sequence is important for operations like UPDATE and DELETE
  • Using timestamps doesn't work due to clock skew (imperfect clock sync)
  • Different replicas may process the same operations in different orders

Key Point: Multi-leader replication enables write scalability but requires handling race conditions and consistency issues.
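
Why ordering matters can be shown with a toy example: two leaders receive the same two concurrent writes in opposite orders, and naively applying them last-write-wins leaves the replicas in different final states (the operation values are illustrative):

```python
def apply_ops(ops):
    """Apply key/value writes in the order received; the last write wins."""
    state = {}
    for key, value in ops:
        state[key] = value
    return state

op_a = ("balance", 100)  # UPDATE received first at leader A
op_b = ("balance", 200)  # concurrent UPDATE received first at leader B

leader_a = apply_ops([op_a, op_b])  # A applies its local write, then B's
leader_b = apply_ops([op_b, op_a])  # B applies its local write, then A's
print(leader_a["balance"], leader_b["balance"])  # 200 100 - replicas diverge
```

Timestamps cannot reliably break the tie because of clock skew, which is why such systems need conflict-resolution schemes (e.g., version vectors or application-level merges).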

Leaderless Replication #leaderless #quorum #consensus

  • All nodes are equal, reads and writes can occur on any node
  • Introduces race conditions, handled using the concept of quorum
  • Quorum is the minimum number of nodes required for consensus
  • Examples: Cassandra, Dynamo, Riak, Voldemort

Quorum Configurations #quorum-configurations #write-quorum #read-quorum

  • For n nodes, strong consistency requires read quorum + write quorum > n; a common choice is n/2 + 1 for both
  • Otherwise only eventual consistency holds, and UPDATE/DELETE operations may be applied inconsistently

Key Point: Leaderless replication requires careful quorum configuration to balance consistency and performance.
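
The quorum-overlap condition is simple to express: reads are strongly consistent when the read and write quorums must share at least one node, i.e., r + w > n. A small sketch:

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Read and write quorums overlap in at least one node when r + w > n,
    so every read is guaranteed to contact a node with the latest write."""
    return r + w > n

n = 5
majority = n // 2 + 1                                  # 3 of 5 nodes
print(is_strongly_consistent(n, majority, majority))   # True
print(is_strongly_consistent(n, 1, 1))                 # False: eventual only
```

Lowering w or r below majority trades consistency for lower latency, which is the tuning knob systems like Cassandra expose per request.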

Key Takeaways

  1. Consistency vs. Availability Tradeoff: Understand the tradeoffs involved in choosing consistency mechanisms (strong, eventual) and their impact on availability and performance.

  2. Replication Strategies: Grasp the different replication strategies (single-leader, multi-leader, leaderless) and their advantages, limitations, and consistency implications.

  3. Storage Types: Familiarize yourself with the different types of storage (SQL, NoSQL, column-oriented, key-value, document, graph, file, block) and their suitable use cases.

  4. Scalability Techniques: Leverage techniques like replication, partitioning, and sharding to achieve fault-tolerance, increased storage capacity, throughput, and lower latency.

  5. Distributed Systems Complexities: Be aware of complexities in distributed systems, such as race conditions, clock skew, and the need for consensus algorithms and quorum configurations.

  6. Further Reading: Refer to resources like "Designing Data-Intensive Applications" by Martin Kleppmann for more in-depth coverage of topics like consensus algorithms, failover problems, and multi-leader replication.

References:

  1. Partitioning Vs Sharding

Scaling Storage Capacity with Sharded Databases

  • As the database size grows beyond the capacity of a single host, consider using sharded storage solutions like HDFS or Cassandra, which are horizontally scalable and can support large storage capacities by adding more hosts. #sharded-storage #horizontal-scaling #large-clusters

Aggregating Events

  • Database writes are expensive to scale, so aim to reduce write rates through sampling and aggregation techniques. #reduce-writes #sampling #aggregation
  • Aggregating events combines multiple events into a single event, resulting in fewer database writes. #event-aggregation #reduced-writes
  • In multi-tier aggregation, each layer aggregates events from the previous tier, progressively reducing the number of hosts. #multi-tier #progressive-reduction
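
Aggregation before writing can be sketched as collapsing many raw events into one count per event type, so the database receives one row per type instead of one per event (event shapes here are illustrative):

```python
from collections import Counter

def aggregate(events):
    """Collapse raw events into per-type counts: one aggregated write
    replaces len(events) individual writes."""
    return Counter(e["type"] for e in events)

events = [{"type": "click"}, {"type": "click"}, {"type": "view"}]
counts = aggregate(events)
print(counts)  # 2 clicks, 1 view - three events become two rows
```

In multi-tier aggregation, each tier would run this same step over the partial counts produced by the tier below it.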

Batch and Streaming ETL

  • ETL (Extract, Transform, Load) is the process of copying data from sources to a destination system with different data representation. #etl
  • Batch processing refers to periodically processing data in batches, while streaming processes data in real-time as a continuous flow. #batch-processing #streaming-processing
  • An ETL pipeline consists of a Directed Acyclic Graph (DAG) of tasks, where nodes represent tasks, and ancestors are dependencies. #dag #tasks #dependencies
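
The DAG-of-tasks idea can be sketched with the standard-library `graphlib` module (Python 3.9+): each task maps to the set of tasks it depends on, and a topological sort yields a valid execution order. Task names are illustrative:

```python
from graphlib import TopologicalSorter

# Each task maps to its dependencies (ancestors in the DAG).
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform_join": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform_join"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # both extracts before the join, the join before the load
```

Schedulers like Airflow and Luigi are, at their core, this topological ordering plus scheduling, retries, and monitoring.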

ETL Tools

  • Common batch tools include Airflow and Luigi. #batch-tools
  • Common streaming tools include Kafka, Flink, Flume, and Scribe. #streaming-tools

Messaging Terminology

  • Message broker: Translates messaging protocols between sender and receiver (e.g., Kafka, RabbitMQ). #message-broker
  • Event streaming: Continuous flow of events processed in real-time (e.g., Kafka). #event-streaming

Push vs. Pull

  • Push is better for lossy applications like live audio/video streaming, where failed data delivery is not resent. #push #lossy-applications
  • Pull is better when the consumer has a continuously high load, making an empty queue unlikely. #pull #high-load
  • Pull is also better when the consumer is firewalled or the dependency has frequent changes, reducing the number of push requests. #pull #firewalled #frequent-changes
  • Push can be more scalable when collecting data from many sources, avoiding the complexity of maintaining multiple crawlers. #push #data-collection #crawlers

Kafka vs. RabbitMQ

  • Kafka is more complex but provides a superset of RabbitMQ's capabilities and can replace RabbitMQ, but not vice versa. #kafka #rabbitmq #capabilities
  • Kafka is designed for scalability, reliability, and availability, with more complex setup and ZooKeeper dependency. #kafka #scalability #reliability #availability #zookeeper
  • Kafka provides durable message storage with replication across racks and data centers. #kafka #durability #replication

Lambda Architecture

  • A data-processing architecture running batch and streaming pipelines in parallel. #lambda-architecture
  • The fast pipeline trades off consistency and accuracy for lower latency using techniques like approximation algorithms, in-memory databases, and potentially no replication for faster processing. #fast-pipeline #approximation #in-memory-databases #no-replication
  • The slow pipeline uses MapReduce databases like Hive and Spark with HDFS, prioritizing consistency and accuracy over low latency. #slow-pipeline #mapreduce #consistency #accuracy
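
The serving side of a Lambda architecture can be sketched as merging a precomputed batch view with a real-time view covering events since the batch cutoff (view contents here are made up for illustration):

```python
# Slow pipeline output: accurate, but only up to the last batch run.
batch_view = {"clicks": 10_000}
# Fast pipeline output: approximate counts streamed since the batch cutoff.
speed_view = {"clicks": 37}

def query(metric):
    """Serve queries by combining the batch view with the real-time delta."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))  # 10037
```

When the next batch run completes, its view absorbs the streamed events and the speed view resets, bounding how long the approximate numbers are served.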

References:

  1. Airflow Vs Luigi

  2. Priority Queues in RabbitMQ

Denormalization

  • #ConsistentData - Normalization avoids duplicate data, ensuring consistency across tables.
  • #FasterInsertsAndUpdates - With normalized data, only one table needs to be modified per insert or update.
  • #SmallerDatabaseSize - No duplicate data leads to smaller tables.
  • Denormalization deliberately duplicates data to avoid join overhead, trading these benefits for lower read latency. #read-latency

Caching Strategies

  • #ImprovedPerformance - Caches use memory, which is faster and more expensive than disk-based databases.
  • #IncreasedAvailability - If the database is unavailable, cached data can still be served, albeit limited to what is cached.
  • #EnhancedScalability - Caches can serve frequently requested data, reducing load on the backend and improving scalability.

Read Strategies

  • #ReadFromCache - In cache-aside, the application first reads from the cache.
  • #CacheMiss - On a cache miss, the application reads from the database and writes the data to the cache.
  • #ReadHeavyLoads - Cache-aside is best for read-heavy loads.
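
The cache-aside read path can be sketched in a few lines, using dicts to stand in for the cache and the database (names are illustrative):

```python
cache = {}
database = {"user:42": "Ada"}

def read(key):
    """Cache-aside: check the cache first; on a miss, read the database
    and populate the cache so subsequent reads are served from memory."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value

first = read("user:42")        # miss: fetched from the database, now cached
second = read("user:42")       # hit: served from the cache
print(first == second)         # True
```

Only the first read pays the database round trip, which is why cache-aside favors read-heavy workloads.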

Write Strategies

  • #ConsistentCache - In write-through, every write goes through the cache and then to the database, keeping the cache consistent.
  • #SlowerWrites - Writes are slower since they happen on both the cache and database.
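
The write-through path is the mirror image of the read path above; again a minimal sketch with dicts standing in for both stores:

```python
cache = {}
database = {}

def write_through(key, value):
    """Write-through: update the cache and the database on every write.
    The cache is never stale, but each write pays for both stores."""
    cache[key] = value
    database[key] = value  # both updated before the write is acknowledged

write_through("user:42", "Ada")
print(cache["user:42"] == database["user:42"])  # True - cache stays consistent
```

A write-back variant would return after updating only the cache and flush to the database later, which is faster but risks losing acknowledged writes.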

Caching as a Separate Service

  • #StatelessServices - Services are designed to be stateless, with requests randomly assigned to hosts, reducing the likelihood of cache hits on individual hosts.
  • #IndependentScaling - Caching services can scale independently of the services they serve, using optimized hardware or virtual machines.

Data Caching

  • #PrivateInformation - Private information, such as bank account details, must never be cached.
  • #RevalidateChangingData - Information like flight ticket or hotel room availability can be cached but should be revalidated against the origin server.

Cache Invalidation

  • #MaxAge - Setting a max-age value to control cache expiration.
  • #Fingerprinting - Fingerprinting files with hashes or query parameters to ensure correct file versions are served.
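
Fingerprinting can be sketched as embedding a content hash in the filename, so a changed file gets a new URL and clients can never be served a stale cached copy (the naming scheme here is one common convention, not a standard):

```python
import hashlib

def fingerprint(filename: str, content: bytes) -> str:
    """Insert a short content hash before the extension, e.g.
    app.js -> app.<hash>.js. Identical content yields identical names,
    so unchanged files stay cacheable indefinitely."""
    digest = hashlib.md5(content).hexdigest()[:8]
    name, dot, ext = filename.rpartition(".")
    return f"{name}.{digest}.{ext}" if dot else f"{filename}.{digest}"

v1 = fingerprint("app.js", b"console.log('v1')")
v2 = fingerprint("app.js", b"console.log('v2')")
print(v1 != v2)  # True - new content, new URL, old cache entries bypassed
```

With fingerprinted names, the files themselves can be served with a very long max-age, since any change produces a different URL.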

Cache Warming

  • #AdditionalComplexity - Implementing cache warming adds complexity, especially for large caching services with thousands of hosts.
  • #AdditionalTraffic - Cache warming generates additional traffic by querying services and databases to populate the cache.
  • #ServiceLoad - Services and databases may not be able to handle the load from cache warming.

Cache Strategies

| Cache Strategy | Description | When to Use |
| --- | --- | --- |
| Cache-Aside | Check cache first; on miss, read from database and write to cache | Read-heavy workloads, simple implementation |
| Read-Through | Cache handles reads, fetches from database on miss and caches data | Read-heavy workloads, abstracted caching logic |
| Write-Through | Every write goes to cache and database (consistent but slower) | Data consistency is critical, read-heavy workloads |
| Write-Back/Behind | Write to cache; cache periodically flushes to database (faster) | Write-heavy workloads, data consistency not critical |
| Write-Around | Write directly to database; cache updated on miss | Write-heavy workloads, data integrity is critical |

ℹ️ Optimize your system's performance and scalability with these database and caching insights!

🚀 Designing stateful vs. stateless services:

  • Stateful services are complex; designs lean towards stateless with shared stateful services.

💾 Storage categories:

  • Understand database types: SQL and NoSQL (column-oriented, key-value, document, graph), plus file, block, and object storage.

🔍 Data storage decisions:

  • Choose between database and other storage categories.

🔄 Replication techniques:

  • Single-leader, multi-leader, leaderless replication, and more for scaling databases.

🔒 Sharding for scaling:

  • Essential when a database surpasses single-host storage capacity.

⚙️ Database optimization:

  • Minimize writes, aggregate events, utilize Lambda architecture, denormalize for read latency.

🔍 Query optimization:

  • Cache frequent queries, employ read strategies, understand cache tradeoffs.

🔃 Cache management:

  • Cache-aside for read-heavy loads; consider read-through, write-through, and write-back strategies.

💡 Cache best practices:

  • Use dedicated caching service, cache public not private data, employ effective cache invalidation strategies.

🔍 Cache warming:

  • Speed up first user access, but be aware of disadvantages.