- Stateful services like storage have mechanisms for consistency and redundancy to avoid data loss
- They use consensus protocols like Paxos for strong consistency, or lighter-weight replication mechanisms for eventual consistency
- Tradeoffs involve consistency, complexity, security, latency, and performance
- Stateless services are preferred, state is kept only in stateful services
Key Point: Understand the tradeoffs involved in choosing consistency mechanisms for stateful services.
- In strong consistency, all parallel processes see accesses in the same order (sequentially)
- Only one consistent state is observed across processes
- In weak/eventual consistency, different processes may see variables in different states
Key Point: Strong consistency provides a single linear view, while eventual consistency allows temporary inconsistencies.
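The difference can be seen in a toy model of eventual consistency: two in-memory replicas accept writes independently and converge only after an anti-entropy sync. Everything here (the dict replicas, the `sync` function, the version-tagged values) is an illustrative sketch, not a real system's API.

```python
# Toy model of eventual consistency: two replicas accept writes
# independently and converge after an anti-entropy sync pass.

def sync(replica_a: dict, replica_b: dict) -> None:
    """Merge two replicas; the higher-versioned entry wins."""
    for key in set(replica_a) | set(replica_b):
        va = replica_a.get(key, (0, None))   # (version, value)
        vb = replica_b.get(key, (0, None))
        winner = max(va, vb)                 # higher version wins
        replica_a[key] = winner
        replica_b[key] = winner

a, b = {}, {}
a["x"] = (1, "hello")          # write lands on replica A only

stale = b.get("x")             # replica B has not seen the write yet
sync(a, b)                     # anti-entropy pass
assert stale is None           # a reader on B briefly saw no value
assert a == b                  # replicas have converged
```

Under strong consistency the stale read on replica B would not be allowed to happen; under eventual consistency it is acceptable as long as replicas converge.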
Types of Storage #database #sql #nosql #column-oriented #key-value #document #graph #file-storage #block-storage
- SQL: Relational, with tables and keys; provides ACID properties
- NoSQL: Umbrella term for databases that relax some SQL guarantees (e.g., full ACID or the relational model)
- Column-oriented: Data organized into columns (e.g. Cassandra, HBase)
- Key-value: Data as key-value pairs, keys are hashable (e.g. Memcached, Redis)
- Document: Key-value with larger values in formats like JSON, YAML (e.g. MongoDB)
- Graph: Designed for efficient relationship storage (e.g. Neo4j, RedisGraph)
- File Storage: Data stored in files and directories; behaves like a key-value store with file paths as keys
- Block Storage: Data in fixed-size blocks, used in low-level storage systems
Key Point: Understand the different types of storage and their characteristics to choose the appropriate one for your requirements.
- Replication makes copies (replicas) of data on different nodes
- Partitioning divides data into subsets, sharding distributes partitions across nodes
- Enables fault-tolerance, higher storage capacity, throughput, and lower latency
- Typical design: one replica on same rack, one on different rack/data center
- Sharding provides storage scaling, memory scaling, parallel processing, and data locality
Key Point: Replication and sharding are key techniques for scalability, fault-tolerance, and performance in distributed systems.
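A minimal sketch of how these two techniques combine: a stable hash assigns each key to a shard, and the replica for a shard is placed on a node in a different rack (the "one replica on a different rack" design above). Node names, rack layout, and shard count are all invented for illustration.

```python
# Hash partitioning plus rack-aware replica placement (illustrative).
import hashlib

NODES = {
    "node1": "rack-a", "node2": "rack-a",
    "node3": "rack-b", "node4": "rack-b",
}

def shard_for(key: str, num_shards: int = 4) -> int:
    # Stable hash so every client maps a key to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def replicas_for(key: str) -> list:
    """Primary on one node, replica on a node in a different rack."""
    names = sorted(NODES)
    primary = names[shard_for(key) % len(names)]
    other_rack = [n for n in names if NODES[n] != NODES[primary]]
    return [primary, other_rack[shard_for(key) % len(other_rack)]]

primary, replica = replicas_for("user:42")
assert NODES[primary] != NODES[replica]   # survives a whole-rack failure
```

Real systems typically use consistent hashing rather than plain modulo so that adding a node moves only a fraction of the keys, but the placement idea is the same.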
- All writes occur on a single leader node, replicated to followers
- Scales reads by increasing replicas, but writes are still bottlenecked
- Examples: MySQL, Postgres replication configurations
- Multiple levels of followers, like a pyramid, to scale reads further
- Each node replicates to followers it can handle
- Tradeoff: consistency is further delayed down the levels
Key Point: Single-leader replication is simple but has limitations in write scalability and consistency.
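The routing rule behind single-leader replication fits in a few lines: writes must go to the leader, while reads can fan out across followers. The host names below are made up; this is a sketch of the idea, not a driver for any particular database.

```python
# Single-leader request routing: writes to the leader, reads
# spread across followers (host names are illustrative).
import random

LEADER = "db-leader"
FOLLOWERS = ["db-follower-1", "db-follower-2", "db-follower-3"]

def route(operation: str) -> str:
    """Return the host that should serve this operation."""
    if operation in ("INSERT", "UPDATE", "DELETE"):
        return LEADER                 # writes cannot be scaled out
    return random.choice(FOLLOWERS)  # reads scale with replica count

assert route("UPDATE") == LEADER
assert route("SELECT") in FOLLOWERS
```

Adding followers grows read capacity, but every write still funnels through `LEADER`, which is exactly the write bottleneck noted above.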
- Multiple leader nodes, writes can occur on any leader
- Each leader replicates writes to all other nodes
- Introduces race conditions and consistency problems
- Sequence is important for operations like UPDATE and DELETE
- Using timestamps doesn't work due to clock skew (imperfect clock sync)
- Different replicas may process the same operations in different orders
Key Point: Multi-leader replication enables write scalability but requires handling race conditions and consistency issues.
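The clock-skew problem can be shown concretely: a naive last-write-wins merge keyed on wall-clock timestamps can keep the *earlier* write when one leader's clock runs fast. The timestamps and values below are fabricated to illustrate the failure mode.

```python
# Why wall-clock timestamps are unsafe for conflict resolution:
# with clock skew, "last write wins" can keep the earlier write.

def lww_merge(write_a, write_b):
    """Naive last-write-wins: keep the write with the larger timestamp."""
    return write_a if write_a[0] >= write_b[0] else write_b

# Real order of events: the write on A happened FIRST, then the one on B.
write_on_a = (1000.0, "old value")   # leader A's clock runs ~5s fast
write_on_b = (997.0, "new value")    # leader B's clock is accurate

winner = lww_merge(write_on_a, write_on_b)
assert winner[1] == "old value"      # the newer write was silently lost
```

This is why multi-leader systems resort to logical clocks, version vectors, or application-level conflict resolution instead of trusting timestamps.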
- All nodes are equal, reads and writes can occur on any node
- Introduces race conditions, handled using the concept of quorum
- Quorum is the minimum number of nodes that must acknowledge a read or write for the operation to succeed
- Examples: Cassandra, Dynamo, Riak, Voldemort
- For strong consistency, the read and write quorums must overlap (R + W > n); e.g., for n nodes, set both to n/2 + 1
- Otherwise the system is only eventually consistent, and UPDATE/DELETE operations may be applied inconsistently
Key Point: Leaderless replication requires careful quorum configuration to balance consistency and performance.
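The quorum rule is just arithmetic: a read overlaps with the latest write on at least one node whenever R + W > N. A minimal sketch:

```python
# Quorum check for leaderless replication: a read is guaranteed to
# hit at least one node holding the latest write when R + W > N.
N = 5  # number of replicas

def is_strongly_consistent(read_quorum: int, write_quorum: int) -> bool:
    return read_quorum + write_quorum > N

majority = N // 2 + 1                               # 3 of 5
assert is_strongly_consistent(majority, majority)   # R=W=3: 3+3 > 5
assert not is_strongly_consistent(2, 3)             # 2+3 = 5: read may miss the write
```

Systems like Cassandra expose this directly as tunable consistency levels (e.g., ONE, QUORUM, ALL), letting each request trade consistency against latency.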
- Consistency vs. Availability Tradeoff: Understand the tradeoffs involved in choosing consistency mechanisms (strong, eventual) and their impact on availability and performance.
- Replication Strategies: Grasp the different replication strategies (single-leader, multi-leader, leaderless) and their advantages, limitations, and consistency implications.
- Storage Types: Familiarize yourself with the different types of storage (SQL, NoSQL, column-oriented, key-value, document, graph, file, block) and their suitable use cases.
- Scalability Techniques: Leverage techniques like replication, partitioning, and sharding to achieve fault-tolerance, increased storage capacity, throughput, and lower latency.
- Distributed Systems Complexities: Be aware of complexities in distributed systems, such as race conditions, clock skew, and the need for consensus algorithms and quorum configurations.
- Further Reading: Refer to resources like "Designing Data-Intensive Applications" by Martin Kleppmann for more in-depth coverage of topics like consensus algorithms, failover problems, and multi-leader replication.
- As the database size grows beyond the capacity of a single host, consider using sharded storage solutions like HDFS or Cassandra, which are horizontally scalable and can support large storage capacities by adding more hosts. #sharded-storage #horizontal-scaling #large-clusters
- Database writes are expensive to scale, so aim to reduce write rates through sampling and aggregation techniques. #reduce-writes #sampling #aggregation
- Aggregating events combines multiple events into a single event, resulting in fewer database writes. #event-aggregation #reduced-writes
- In multi-tier aggregation, each layer aggregates events from the previous tier, progressively reducing the number of hosts. #multi-tier #progressive-reduction
- ETL (Extract, Transform, Load) is the process of copying data from sources to a destination system with different data representation. #etl
- Batch processing refers to periodically processing data in batches, while streaming processes data in real-time as a continuous flow. #batch-processing #streaming-processing
- An ETL pipeline consists of a Directed Acyclic Graph (DAG) of tasks, where nodes represent tasks, and ancestors are dependencies. #dag #tasks #dependencies
- Common batch tools include Airflow and Luigi. #batch-tools
- Common streaming tools include Kafka, Flink, Flume, and Scribe. #streaming-tools
- Message broker: Translates messaging protocols between sender and receiver (e.g., Kafka, RabbitMQ). #message-broker
- Event streaming: Continuous flow of events processed in real-time (e.g., Kafka). #event-streaming
- Push is better for lossy applications like live audio/video streaming, where failed data delivery is not resent. #push #lossy-applications
- Pull is better when the consumer has a continuously high load, making an empty queue unlikely. #pull #high-load
- Pull is also better when the consumer is firewalled or the dependency has frequent changes, reducing the number of push requests. #pull #firewalled #frequent-changes
- Push can be more scalable when collecting data from many sources, avoiding the complexity of maintaining multiple crawlers. #push #data-collection #crawlers
- Kafka is more complex but provides a superset of RabbitMQ's capabilities and can replace RabbitMQ, but not vice versa. #kafka #rabbitmq #capabilities
- Kafka is designed for scalability, reliability, and availability, with more complex setup and ZooKeeper dependency. #kafka #scalability #reliability #availability #zookeeper
- Kafka provides durable message storage with replication across racks and data centers. #kafka #durability #replication
- Lambda architecture: a data-processing architecture running batch and streaming pipelines in parallel. #lambda-architecture
- The fast pipeline trades off consistency and accuracy for lower latency using techniques like approximation algorithms, in-memory databases, and potentially no replication for faster processing. #fast-pipeline #approximation #in-memory-databases #no-replication
- The slow pipeline uses MapReduce databases like Hive and Spark with HDFS, prioritizing consistency and accuracy over low latency. #slow-pipeline #mapreduce #consistency #accuracy
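The write-reduction idea from the aggregation bullets above can be sketched in a few lines: many raw events collapse into one aggregated write per key, so the database sees far fewer writes. The event shape and field names are invented for illustration.

```python
# Event aggregation: combine many raw events into one write per key,
# reducing the database write rate (event schema is illustrative).
from collections import Counter

raw_events = [
    {"metric": "page_view", "page": "/home"},
    {"metric": "page_view", "page": "/home"},
    {"metric": "page_view", "page": "/about"},
    {"metric": "page_view", "page": "/home"},
]

def aggregate(events):
    """Collapse per-page events into a single count row per page."""
    counts = Counter(e["page"] for e in events)
    return [{"page": page, "count": n} for page, n in counts.items()]

writes = aggregate(raw_events)
assert len(writes) < len(raw_events)   # 2 writes instead of 4
```

In multi-tier aggregation, each tier would run this same step over the output of the tier below it, progressively shrinking the number of writes reaching the database.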
- #ConsistentData - No duplicate data, ensuring consistency across tables.
- #FasterInsertsAndUpdates - Only one table needs to be queried for inserts and updates.
- #SmallerDatabaseSize - No duplicate data, leading to smaller tables and faster read operations.
- #ImprovedPerformance - Caches use memory, which is faster and more expensive than disk-based databases.
- #IncreasedAvailability - If the database is unavailable, cached data can still be served, albeit limited to what is cached.
- #EnhancedScalability - Caches can serve frequently requested data, reducing load on the backend and improving scalability.
- #ReadFromCache - The application first reads from the cache.
- #CacheMiss - On a cache miss, the application reads from the database and writes the data to the cache.
- #ReadHeavyLoads - Cache-aside is best for read-heavy loads.
- #ConsistentCache - Every write goes through the cache and then to the database, ensuring cache consistency.
- #SlowerWrites - Writes are slower since they happen on both the cache and database.
- #StatelessServices - Services are designed to be stateless, with requests randomly assigned to hosts, reducing the likelihood of cache hits on individual hosts.
- #IndependentScaling - Caching services can scale independently of the services they serve, using optimized hardware or virtual machines.
- #PrivateInformation - Private information, such as bank account details, must never be cached.
- #RevalidateChangingData - Information like flight ticket or hotel room availability can be cached but should be revalidated against the origin server.
- #MaxAge - Setting a max-age value to control cache expiration.
- #Fingerprinting - Fingerprinting files with hashes or query parameters to ensure correct file versions are served.
- #AdditionalComplexity - Implementing cache warming adds complexity, especially for large caching services with thousands of hosts.
- #AdditionalTraffic - Cache warming generates additional traffic by querying services and databases to populate the cache.
- #ServiceLoad - Services and databases may not be able to handle the load from cache warming.
| Cache Strategy | Description | When to Use |
| --- | --- | --- |
| Cache-Aside | Check cache first, on miss read from database and write to cache | Read-heavy workloads, simple implementation |
| Read-Through | Cache handles reads, fetches from database on miss and caches data | Read-heavy workloads, abstracted caching logic |
| Write-Through | Every write goes to cache and database (consistent but slower) | Data consistency is critical, read-heavy workloads |
| Write-Back/Behind | Write to cache, cache periodically flushes to database (faster) | Write-heavy workloads, data consistency not critical |
| Write-Around | Write directly to database, cache updated on miss | Write-heavy workloads, data integrity is critical |
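Cache-aside, the first row of the table, fits in a few lines: read the cache first, fall back to the database on a miss, then populate the cache. The dict-based "cache" and "database" stand in for real stores such as Redis and a SQL database.

```python
# Cache-aside in miniature: check cache, fall back to the database
# on a miss, then populate the cache for subsequent reads.

database = {"user:1": "Alice"}   # stand-in for the source of truth
cache = {}                       # stand-in for e.g. Redis
stats = {"hits": 0, "misses": 0}

def get(key: str):
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = database.get(key)    # read from the source of truth
    if value is not None:
        cache[key] = value       # populate cache for next time
    return value

assert get("user:1") == "Alice"          # miss: fetched from database
assert get("user:1") == "Alice"          # hit: served from cache
assert stats == {"hits": 1, "misses": 1}
```

A production version would add a TTL on cached entries and invalidate (or update) the cache on writes; otherwise the cache can serve stale data indefinitely.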
ℹ️ Optimize your system's performance and scalability with these database and caching insights!
🚀 Designing stateful vs. stateless services:
- Stateful services are complex; designs lean towards stateless with shared stateful services.
💾 Storage categories:
- Understand database types: SQL, NoSQL (column-oriented, key-value, document, graph), File, Block, Object.
🔍 Data storage decisions:
- Choose between database and other storage categories.
🔄 Replication techniques:
- Single-leader, multi-leader, leaderless replication, and more for scaling databases.
🔒 Sharding for scaling:
- Essential when a database surpasses single-host storage capacity.
⚙️ Database optimization:
- Minimize writes, aggregate events, utilize Lambda architecture, denormalize for read latency.
🔍 Query optimization:
- Cache frequent queries, employ read strategies, understand cache tradeoffs.
🔃 Cache management:
- Cache-aside for read-heavy loads; consider read-through, write-through, and write-back strategies.
💡 Cache best practices:
- Use dedicated caching service, cache public not private data, employ effective cache invalidation strategies.
🔍 Cache warming:
- Speed up first user access, but be aware of disadvantages.