Note
These are personal study notes and summaries based on Designing Data-Intensive Applications by Martin Kleppmann, structured for my own learning.
Part 1: Foundations of Data Systems
Chapter 1 Summary: Reliable, Scalable, and Maintainable Applications
The first chapter lays the foundation for understanding data systems. In particular, it discusses three main concerns: reliability, scalability, and maintainability.
A reliable system is one that performs the function the user expects, tolerates faults, maintains good performance, and is secure. Failures may arise from hardware breakdowns, software bugs, or operator mistakes. To mitigate these issues, systems can be designed to be fault-tolerant, which helps mask some failures from the user.
A scalable system can handle increasing load. Load parameters describe a system's load quantitatively (for example, requests per second or read/write ratio), while performance is measured with metrics such as throughput and response time. These metrics can be analyzed using averages, percentiles, and tail latencies. Two common approaches to scaling are horizontal scaling (adding more machines) and vertical scaling (upgrading existing machines). Systems can also be designed to be elastic, adapting dynamically to changes in demand. Stateless systems are generally easier to distribute, while distributing stateful systems is more complex.
A maintainable system is easy for engineers and operations teams to work with. To ensure systems remain maintainable over time, designers focus on how operable, simple, and adaptable they are. Operability can be improved by providing visibility into system behavior, automating tasks, minimizing dependence on individual machines, maintaining good documentation, implementing self-healing mechanisms, and ensuring predictable behavior. Complexity can be reduced through abstraction, making it easier to evolve systems over time.
Key Terms
- Faults: Things that go wrong in a system.
- Fault-tolerant/resilient system: A system that can anticipate faults and deal with them.
- Hardware faults: A fault due to hardware on machines, e.g. hard disk crash, faulty RAM.
- Software errors: A fault due to software, e.g. bug, fault process, service slowdown.
- Human errors: A fault due to humans being unreliable, e.g. configuration error, accidental deletion.
- Telemetry: Used to gather data on the use and performance of applications and their components, e.g. performance metrics, error rates.
- Load parameters: Metrics used to describe the load on a system.
- Throughput: The amount of work or requests a system can handle within a given timeframe or a job on a dataset of a certain size.
- Response time: The time between a client sending a request and receiving a response.
- Latency: The duration that a request is waiting to be handled.
- Outliers: An abnormal value, one that is significantly different from other values.
- Average response time: In practice, this is the arithmetic mean of response times. Not a good metric for understanding typical response time.
- Percentile: A number that describes the value that a given percent of values are lower than. e.g. p50 (50th percentile, aka median), p95 and p99.
- Tail latencies: A small number of values that take longer to process.
- SLAs: Service Level Agreement, contract that defines expected performance and availability of service.
- SLOs: Service Level Objectives, measurable targets used as part of SLAs.
- Head of line blocking: When a small number of requests hold up the system.
- Tail-latency amplification: The chance of getting a slow response time increases if the same user performs multiple calls to the service.
- Vertical scaling: Scaling up, that is, moving to more performant hardware.
- Horizontal scaling: Scaling out, that is, distributing load across multiple machines. aka shared-nothing architecture.
- Elastic system: A system that can automatically scale when an increased load is detected.
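The percentile and tail-latency terms above can be made concrete with a short sketch. This is an illustrative nearest-rank implementation in Python, not taken from the book; real monitoring systems typically use streaming approximations such as HDR histograms or t-digest.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)  # 1-indexed rank
    return ordered[rank - 1]

# Hypothetical response times in milliseconds; the lone 900 ms outlier
# dominates the tail but barely moves the median.
times_ms = [30, 31, 33, 35, 42, 45, 55, 60, 120, 900]
p50 = percentile(times_ms, 50)   # median: 42
p99 = percentile(times_ms, 99)   # tail latency: 900
```

Note how p99 is dominated by a single slow request while the median barely notices it, which is why the average (here about 135 ms) is a poor description of typical response time.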
Quotes n' Tips
- "We need to think of response time not as a single number, but as a distribution of values that you can measure." p. 20
- "Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request it includes network delays and queueing delays." p. 21
- "Accidental complexity arises if it is not inherent in the problem that the software solves but arises only from implementation… The best tool we have for removing accidental complexity is abstraction." p. 25
- "Ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and abstractions." p. 26
Chapter 2 Summary: Data Models and Query Languages
The second chapter delves into data models and query languages. It compares relational models to document models, introduces query languages for data, and explores graph-data models.
Relational Model Versus Document Model
In the relational model, data is organized into relations, and each relation is an unordered collection of tuples. In SQL these correspond to tables and rows.
In the document model, data is stored in the form of documents. The documents themselves are flexible and semi-structured, e.g. JSON or XML.
The relational model powers much of the web today and has dominated for 25-30 years. The early 2010s marked the birth of NoSQL, driven by a need for greater scalability, larger datasets, and higher throughput. There was also an object-relational mismatch: an awkward translation between the way objects are defined in code and in a relational database. This mismatch often forced developers to bridge the gap manually or rely on ORM frameworks.
Document databases often offer performance benefits by storing related data together, making access more efficient in some scenarios. Relational databases on the other hand have support for joins including many-to-many and many-to-one relationships.
Determining the need for data relationships and schema flexibility can help point us to the right data model for a specific use case. Storage locality can be another indicator: there is a performance advantage to accessing an entire document with its related data, but there is a cost as well, since an update typically requires rewriting the whole document.
Query Languages for Data
Imperative code tells the computer to perform certain operations in a certain order. Declarative code, such as SQL, specifies the data you want without prescribing how to get it. The latter is more concise and easier to use, optimize, and parallelize.
MapReduce is a programming model for processing large amounts of data in bulk across many machines; it sits in between declarative and imperative. The logic of the query is expressed with snippets of code which are called repeatedly by the framework. It is built from map (aka collect) and reduce (aka fold or inject), functions that come from functional programming.
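The map/reduce split can be sketched in plain Python with word counting, the canonical MapReduce example. This is a single-machine illustration of the model, not a distributed framework:

```python
from functools import reduce

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each document emits (key, value) pairs.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: the framework groups all emitted values by key.
grouped = {}
for word, count in pairs:
    grouped.setdefault(word, []).append(count)

# Reduce phase: fold the values for each key into a single result.
word_counts = {word: reduce(lambda a, b: a + b, counts)
               for word, counts in grouped.items()}
# word_counts["the"] == 3, word_counts["fox"] == 2
```

In a real MapReduce framework only the map and reduce snippets are user code; the shuffle step and the distribution across machines are handled by the framework.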
Graph-like Data Models
A graph consists of two main parts, vertices and edges. Vertices represent nodes or entities, and edges represent relationships.
You can store graph data in various ways. One way is to use a property graph. A vertex is represented by a unique identifier, outgoing edges, incoming edges, and key-value properties. An edge is represented by a unique identifier, a start vertex (tail vertex), an end vertex (head vertex), a label, and key-value properties. Any vertex can connect to any other vertex, unlike in the network model. You can efficiently traverse the graph, and store varied information through the use of labels.
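The property-graph representation described above can be sketched with two small Python records. This is an illustrative in-memory model (the names and data are invented); real graph databases also index edges by their head vertex for fast backward traversal.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    id: str
    tail: str    # id of the vertex the edge starts at
    head: str    # id of the vertex the edge points to
    label: str
    properties: dict = field(default_factory=dict)

vertices = {
    "alice": Vertex("alice", {"type": "person"}),
    "london": Vertex("london", {"type": "city"}),
}
edges = [Edge("e1", tail="alice", head="london",
              label="lives_in", properties={"since": 2015})]

def outgoing(vertex_id, label=None):
    """Edges leaving a vertex, optionally filtered by label."""
    return [e for e in edges
            if e.tail == vertex_id and (label is None or e.label == label)]
```

Traversal is then just repeated lookups: follow `outgoing("alice", "lives_in")` to reach the `london` vertex.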
In some databases you can query graph data in SQL through a recursive common table expression. Others let you query data via Cypher, a declarative query language created for Neo4j.
The triple-store data model is conceptually the same as the property graph but uses different words. Data is stored in three-part statements: (subject, predicate, object). Some older query languages, such as Datalog, formed the foundation for later ones and also use three-part statements: predicate(subject, object). SPARQL is another query language, used for RDF (Resource Description Framework) as part of the semantic web; it predates Cypher.
Key Terms
- Polyglot persistence: The idea that an application can benefit from using more than one datastore, rather than relying on a single database.
- Impedance mismatch: A disconnect between models.
- Locality: The predictable behaviour of how programs access data in a localized manner either through accessing nearby data or by reusing the same data.
- Normalization: A process of reducing data redundancy and improving data integrity, with removing duplication being one of the key ideas.
- Hierarchical model: A tree-like structure of records with parent-child relationships.
- Network model: A generalization of a hierarchical model, which allows for a record to have multiple parents.
- Access path: A way of accessing a record by following a path from the root record along a chain of links.
- Foreign-Key/document reference: A relationship referenced by a key identifier: a foreign key in the relational model, a document reference in the document model. The identifier is resolved at read time by using a join or follow-up queries.
- Schema-on-read: The structure of the data is implicit and only interpreted when the data is read.
- Schema-on-write: The schema is explicit and database ensures all the written data conforms.
- Recursive common table expression: A way to query graph data in SQL via `WITH RECURSIVE` syntax.
Quotes n' Tips
- "It's not possible to say in general which data model leads to simpler app code, it depends on relationships between data items." p. 43
Chapter 3 Summary: Storage and Retrieval
The third chapter explores how we store data and how we find it again. In particular it explores what happens under the hood during storage and retrieval so that the operations can be optimized.
Data Structures that power your database
Many databases use append-only log files to store key-value pairs efficiently. These work well with hash indexes, which use an in-memory map to link keys to byte offsets in the file. While an index speeds up reads, it adds work to writes, since both the data and the index must be updated. To manage disk space, files are split into segments, and periodic merging and compaction remove duplicate keys. Benefits include fast sequential writes, simple crash recovery, and reduced fragmentation, but range queries are inefficient and all keys must fit in memory.
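A minimal sketch of this scheme in Python, assuming a single segment file, comma-separated records, and keys/values that contain no commas or newlines; compaction and crash recovery are omitted:

```python
import os

class LogWithHashIndex:
    """Append-only log file plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()                       # where this record begins
            f.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset                    # latest offset for a key wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                          # jump straight to the record
            _, _, value = f.readline().decode("utf-8").rstrip("\n").partition(",")
            return value
```

Overwriting a key simply appends a new record and repoints the index entry; compaction would later drop the stale record from the file.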
SSTables and LSM-Trees offer another approach. SSTables store key-value pairs sorted by key, with each key appearing only once per segment file. They use merge sort for compaction, sparse indexes for lookups, and compress blocks of records between index entries. Writes go to a memory-resident memtable and are flushed to disk as SSTables when it grows too big. Reads check the memtable first, then the most recent segments. A write-ahead log ensures durability in case of crashes.
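The memtable-plus-segments read path can be sketched as follows. This is a toy in-memory model: real LSM-trees write segments to disk, keep sparse indexes, and compact in the background, and the flush threshold here is an arbitrary illustration.

```python
import bisect

MEMTABLE_LIMIT = 3  # toy threshold; real systems flush at megabytes

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # recent writes, mutable
        self.segments = []   # immutable sorted (key, value) lists, oldest first

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # Flush: freeze the memtable as a sorted segment (an "SSTable").
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):  # 2. then newest segment first
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Because each segment is sorted, the lookup inside a segment is a binary search; a real SSTable does the same thing against a sparse in-memory index plus a short scan on disk.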
B-Trees, the most common structure in relational databases, also store keys in sorted order but organize them into fixed-size pages. These pages form a tree: lookups follow branching paths from the root down to a leaf. Updates, however, may trigger page splits and multiple writes. Like LSM-Trees, B-Trees use a write-ahead log, and they rely on latches (lightweight locks) for concurrency control.
Advantages of B-Trees:
- Faster reads
- Mature and widely adopted
- Predictable, low-impact compaction
- Strong transactional support
Advantages of LSM-Trees:
- Faster writes
- High write throughput
- Lower write amplification
- Better compression
Some other indexing structures are secondary, clustered, covering, multi-column, and text/fuzzy search indexes.
There are also databases that focus on in-memory storage, such as Memcached; slightly more durable ones like Redis and Couchbase; and more durable ones like VoltDB and MemSQL. One advantage of in-memory databases is that some data models are easier to implement, such as the priority queues Redis offers.
Transaction Processing or Analytics
OLTP: Online Transaction Processing is a system optimized for handling a large number of small transactions: looking up records by key using an index and inserting or updating records. OLAP: Online Analytic Processing is a system optimized for access patterns that scan huge numbers of records, read only a few columns, and calculate aggregates. It is typically used for analysis of historical data.
| Property | OLTP | OLAP |
|---|---|---|
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web app | Internal analyst |
| What data represents | Latest state of data | History of events over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
A data warehouse is a separate database that analysts can query without affecting OLTP operations; it contains a read-only copy of data from all OLTP systems. Many data warehouses use a star schema.
Column-Oriented Storage
Most OLTP database storage is laid out in a row-oriented way; in a document database, an entire document is stored as one contiguous sequence of bytes. That isn't efficient for data analysis, so analytic storage is instead laid out in a column-oriented way (e.g. Parquet). Compression is easier in column-oriented storage, for example using bitmap encoding, and it makes efficient use of CPU cycles. Rows have no inherent sort order, but one can be imposed; the data then needs to be sorted an entire row at a time. The benefit of sorting is that data can be compressed much more efficiently, mainly on the first sort key, with a smaller effect on the second and third. Data warehouses are typically optimized for large read queries, so writes are harder, though LSM-Trees can make them easier. Some databases use materialized aggregates, materialized views, and data cubes to make certain queries faster, at the cost of flexibility.
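Bitmap encoding, mentioned above, can be illustrated in a few lines of Python. This is a plain-list sketch; real column stores pack the bits and typically apply run-length encoding on top.

```python
def bitmap_encode(column):
    """One bitmap per distinct value; bit i is set if row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps

country = ["DE", "US", "US", "DE", "FR"]
bitmaps = bitmap_encode(country)
# bitmaps["US"] == [0, 1, 1, 0, 0]

# A query like `country IN ('DE', 'FR')` becomes a bitwise OR of two bitmaps:
de_or_fr = [a | b for a, b in zip(bitmaps["DE"], bitmaps["FR"])]
# de_or_fr == [1, 0, 0, 1, 1]
```

The payoff is that filters and aggregates turn into cheap bitwise operations over compact bitmaps instead of scans over full rows.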
Key Terms
- Log: Append-only data file with a sequence of records
- Index: An additional structure that is derived from primary data
- Compaction: The process of throwing away duplicate keys in a log, keeping the most recent update for each key.
- Merging: Combining multiple segments.
- LSM-Tree: Log Structured Merge-Tree is a data structure that uses SSTables, memtables, and compaction.
- SSTables: Sorted String Tables, where the sequence of key-value pairs are sorted by keys and appear only once in a segment file.
- memtable: An in-memory balanced tree holding the most recent writes.
- Bloom filter: An in-memory efficient data structure for approximating the contents of a set, good for discovering if a key exists or not.
- Write amplification: One write to a database causes multiple writes to the disk over the lifetime of the DB.
- Transaction: A group of reads and writes that form a logical unit.
- Transaction processing: Allowing clients to make low-latency reads and writes.
- Batch processing: Jobs that run periodically, processing data in bulk.
- ETL: Extract-Transform-Load, a process of getting data into a data warehouse. Data is extracted, cleaned up, transformed and loaded.
- Star schema: aka dimensional modeling, composed of fact tables and dimension tables.
- Snowflake schema: Like star schemas but broken down into further dimensions, more normalized but harder to work with.
- Row-oriented: Where all the values from one row are stored next to each other.
- Column-oriented: Where all the values from each column are stored together.
- Bitmap encoding: Taking a column with n distinct values and turning it into n separate bitmaps. It uses one bitmap for each value and one bit for each row.
- Vectorized processing: Operating on chunks of compressed column data directly.
- Materialized view: A precomputed query result.
- Materialized aggregates: A specific kind of materialized view, focused only on precomputed aggregate functions.
- Data cube: aka OLAP cube, a grid of aggregates grouped by different dimensions.
Quotes n' Tips
- "A balanced B-Tree with n keys always has a depth of O(log n)." p. 72
- "Well-chosen indexes speed up read queries, but every index slows down writes." p. 78
- "Cassandra and HBase use the Bigtable model, so are row-oriented, but they do use column families." p. 86
Chapter 4 Summary: Encoding and Evolution
Formats for Encoding Data
Programs work with (at least) two different representations:
- In memory: objects, structs, lists, hash tables, etc. These are optimized for efficient access and manipulation by the CPU.
- Over the network: a self-contained sequence of bytes. These are encoded.
- Human-readable: JSON, XML, CSV
- Compact/binary variants:
  - JSON variants: BSON, MessagePack, BJSON, UBJSON
  - XML variants: WBXML, Fast Infoset
- Schema-based: Apache Thrift (Facebook), Protocol Buffers (Google), Avro
The benefits of using a schema include compact data representation, serving as documentation, enabling type-checked code, and providing backward compatibility.
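The compatibility benefit can be sketched with Avro-style schema resolution in Python. The dict-based "schemas" here are invented for illustration and are not Avro's actual API: unknown fields written by newer code are ignored, and fields missing from older data are filled from the reader's defaults.

```python
def decode(record, reader_schema):
    """Resolve a record against the reader's schema: ignore unknown fields,
    fill fields missing from the data with the reader's defaults."""
    return {field: record.get(field, default)
            for field, default in reader_schema.items()}

# Backward compatibility: new code reads data written by old code.
new_schema = {"user_id": None, "name": None, "email": "unknown"}
old_record = {"user_id": 42, "name": "Ada"}
view = decode(old_record, new_schema)
# view == {"user_id": 42, "name": "Ada", "email": "unknown"}

# Forward compatibility: old code reads data written by new code.
old_schema = {"user_id": None, "name": None}
new_record = {"user_id": 7, "name": "Grace", "email": "g@example.com"}
# decode(new_record, old_schema) == {"user_id": 7, "name": "Grace"}
```

Real Avro performs this resolution field by field using the writer's and reader's schemas together, but the principle is the same: both sides can evolve as long as new fields have defaults and old fields aren't reused incompatibly.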
Modes of Dataflow
The most common ways data flows from one process to another are via databases, service calls, and asynchronous message passing.
When using a database, one process writes encoded data and another reads it later (or multiple processes do). Tools of different versions may access the data, so backward and forward compatibility are important. Because data often lives longer than the code that uses it, it's important to plan for how formats will evolve over time.
The most common way to communicate over a network is through a client-server model. A server exposes data via an API over the network, and a client connects to the server to make requests to the API. When HTTP is used as the underlying protocol for communicating with a service, it is called a web service. REST is the most common approach.
In asynchronous message-passing systems, messages are delivered to another system through a message broker. This broker can act as a buffer, redeliver messages, send data to multiple recipients, and remain decoupled from the recipient. Data flows one way: the sender does not wait for the message to be delivered. One process sends a message to a queue or topic, and the broker delivers it to consumers or subscribers. These systems typically do not enforce a specific data model, but backward and forward compatibility can make it easier for publishers and consumers to evolve independently.
Key Terms
- Rolling upgrade: aka staged rollout, deploying a new version to a few nodes at a time
- Backward compatibility: Newer code can read data that was written by old code.
- Forward compatibility: Older code can read data that was written by newer code.
- Encoding: aka serialization or marshalling; translation from the in-memory representation to a byte sequence.
- Decoding: aka parsing, deserialization, or unmarshalling; the reverse of encoding.
- Schema evolution: Change of schema over time.
- Writer's schema: The schema the data was encoded with; the writer uses whatever version of the schema it knows about.
- Reader's schema: The schema the application expects when decoding the data.
- Service-Oriented Architecture: aka microservices architecture; decomposing a large application into smaller services based on areas of functionality.
- RPC: Remote Procedure Call, a model that makes a request to a remote network service look the same as calling a function.
- Idempotence: A property ensuring that an operation can be performed multiple times without changing the result.
- Asynchronous: The sender doesn't wait for the message to be delivered.
Quotes n' Tips
- "It's generally a bad idea to use your language's built-in encoding for anything other than very transient purposes." p. 114
- "As long as people agree on what the format is, it often doesn't matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything outweighs most other concerns." p. 115
Part 2: Distributed Data
Chapter 5 Summary: Replication
We replicate data so that a system can continue to work even if some part of it fails, to scale out to more machines, and to keep data geographically close to users. Replication is difficult because the data keeps changing while we replicate it. There are three main algorithms for replication between nodes:
- Single-Leader
- Multi-Leader
- Leaderless
Leaders and Followers
In leader-based replication, every node maintains a copy of the data. All writes are directed to the leader, and the changes are then propagated to the followers. This model is also referred to as active/passive replication. While writes always go to the leader, reads can be served by any replica.
Replication strategies are usually either synchronous (waiting for replicas to confirm) or asynchronous (prioritizing speed over safety).
- Synchronous replication ensures that a follower has an up-to-date copy of the data before a write is considered complete. The downside is that the leader must wait for acknowledgments from followers, which may delay availability.
- Asynchronous replication allows the leader to continue accepting writes without waiting for followers. However, if the leader fails before its changes are fully replicated, data loss is possible. In practice, systems often use a hybrid of both approaches (semi-synchronous) to balance safety and performance.
High availability is essential because node failures are inevitable. If a follower crashes, it can recover by replaying its replication log and requesting any missed updates from the leader. However, if the leader fails, recovery becomes more complex and involves electing a new leader.
Four ways to implement replication logs:
- Statement-based replication: The leader logs every write request (statement) that it executes and sends it to followers. This is generally not used anymore due to problems with nondeterministic functions (e.g. `now()`), autoincrementing columns, and statements that have side effects.
- Write-ahead log (WAL) shipping: The system replicates the leader's low-level storage log, an append-only sequence of bytes. This method is tightly coupled to the storage engine.
- Logical (row-based) replication: Replication is decoupled from the storage engine by using a separate log format: an `insert` contains the new values, a `delete` carries a unique identifier, and an `update` carries the identifier plus the new data. This is easier for applications to consume and allows leaders and followers to run different versions.
- Trigger-based replication: Replication handled at the application layer. It is used for replicating a subset of the data, replicating from one database to another, or custom conflict resolution logic, but is prone to bugs.
Problems with Replication Lag
When using replication, especially asynchronous replication, lag can introduce consistency issues. There are three important guarantees we often want to maintain for a good user experience:
- Read-after-write consistency (also called read-your-writes): After a user submits an update (like editing a post), they should see their change immediately upon refreshing the page. This can be handled by reading from the leader or using metadata (like timestamps) to choose a sufficiently up-to-date replica.
- Monotonic reads: If a user reads data multiple times in a row, it should never appear to go "back in time." Once a user has seen updated information, they shouldn't later see an older version. This can be managed by sticking to the same replica for all reads in a session.
- Consistent prefix reads: When related changes happen in a particular order (like placing items in a cart before checking out), any reader should see those updates in the correct sequence. If the writes are spread across different shards or partitions, we need to ensure readers see them in order, possibly by directing related writes to the same partition.
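The "same replica per session" idea behind monotonic reads can be sketched by hashing the user id to pick a replica. The replica names are hypothetical, and a real router would also have to handle the chosen replica failing:

```python
import hashlib

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical names

def replica_for_user(user_id: str) -> str:
    """Deterministically pin a user to one replica so repeated reads
    in a session never appear to go back in time."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]
```

Because the choice depends only on the user id, every read for that user hits the same replica; routing by user rather than at random is the whole trick.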
Transactions exist as a way for the database to provide stronger consistency guarantees, but they have traditionally been confined to single-node databases; many distributed databases have abandoned them.
Multi-Leader Replication
There are a few scenarios where multi-leader replication is beneficial. In a multi-datacenter deployment we can have a leader in each datacenter, each of which replicates to its own followers. Writes are replicated between datacenters asynchronously, so perceived performance is better, each datacenter can operate independently, and we get tolerance of network problems between datacenters. Clients with offline operation and collaborative editing are two other use cases.
The biggest problem we have with this style of replication is that write conflicts can occur and this requires conflict resolution. This can be handled by:
- Using synchronous conflict detection, at the cost of losing advantages of multi-leader replication.
- Conflict avoidance, making sure writes go to the same leader, though changing leaders becomes complex.
- Converging towards a consistent state, e.g. last write wins, highest ID wins, merging values, or recording the conflict and resolving it later.
- Custom conflict resolution: logic executed on write and/or on read. This applies at the level of an individual row or document.
There are three common topologies:
- Circular topology: Each node replicates to an adjacent node in a loop, reducing load on any single node but increasing replication lag and risk of conflicts.
- Star topology: One central leader node connects to all others and forwards writes to them, simplifying synchronization but creating a single point of bottleneck or failure.
- All-to-all topology: Every node replicates directly with every other, offering the fastest convergence but becoming complex and unscalable as the number of nodes grows.
Leaderless Replication
Leaderless replication occurs when replicas can accept writes directly. It is used by databases such as Dynamo or Cassandra. Some implementations use a coordinator node, but itâs important to note that this is not the same as a leader and does not enforce write ordering.
In this type of system failover is not available. Read requests are sent to several nodes in parallel and versioning is used to determine which data is stale and which data is new. We can catch up on writes in two ways:
- Read repair: Data is read from multiple replicas, stale responses are detected, and the newer value is written back. Values that are rarely read may stay stale for a long time.
- Anti-entropy process: A background process keeps data up to date by copying missing data between replicas; writes aren't copied in any particular order and may be significantly delayed.
"If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we're reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes. You can think of r and w as the minimum number of votes required for the read or write to be valid." — Martin Kleppmann, Designing Data-Intensive Applications, p. 179
Choosing values below the quorum threshold may result in stale reads but offers lower latency and higher availability. Even with a quorum there are no absolute guarantees: writes may end up on different nodes, two writes can occur concurrently, a read may be concurrent with a write, and other edge cases exist. These types of databases are optimized for eventual consistency.
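The quorum condition from the quote can be expressed directly. This sketch uses plain integer versions standing in for the real versioning scheme (version vectors in practice):

```python
def check_quorum(n, w, r):
    """w + r > n guarantees the read set and write set overlap in >= 1 node."""
    return w + r > n

def quorum_read(responses, n, w, r):
    """Pick the newest value from r replica responses of (version, value)."""
    if not check_quorum(n, w, r):
        raise ValueError("w + r must exceed n for quorum overlap")
    if len(responses) < r:
        raise RuntimeError("not enough replicas answered")
    version, value = max(responses)  # highest version wins
    return value

# n=3, w=2, r=2: one of the two responders must have seen the latest write.
latest = quorum_read([(6, "old"), (7, "new")], n=3, w=2, r=2)
# latest == "new"
```

With n=3, w=1, r=1 the condition fails: a read could hit the one replica the write never reached.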
Because monitoring replication lag is difficult, the term "eventual" consistency may need to be qualified or quantified in real-world systems. When a network interruption happens, databases can use a sloppy quorum (writes are stored temporarily on other reachable nodes) together with hinted handoff (those writes are sent on to the appropriate nodes once they recover) to increase write availability and durability.
A few techniques for conflict resolution of writes:
- Last write wins: aka LWW; discards conflicting concurrent writes in favor of the one with the latest timestamp. It guarantees convergence but may silently lose data, which is acceptable in cases like caching where lost writes can be tolerated.
- The "happens-before" relationship and concurrency: Two operations are concurrent if neither happens before (i.e. knows about) the other; causality can be tracked, e.g. with version numbers or version vectors, so that concurrent writes are detected and merged rather than silently dropped.
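Last write wins can be sketched in a few lines; note how the losing writes disappear without any error. The timestamps are illustrative Unix seconds:

```python
def last_write_wins(writes):
    """Resolve conflicting writes by keeping the one with the highest
    timestamp. Convergent (all replicas pick the same winner) but lossy:
    the other writes are silently discarded."""
    return max(writes, key=lambda w: w["timestamp"])

concurrent_writes = [
    {"value": "blue",  "timestamp": 1_700_000_001},
    {"value": "green", "timestamp": 1_700_000_005},
    {"value": "red",   "timestamp": 1_700_000_003},
]
winner = last_write_wins(concurrent_writes)
# winner["value"] == "green"; "blue" and "red" are lost
```

This also shows why LWW depends on clocks: if the replicas' clocks disagree, the "last" write may not be the one the user made most recently.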
Key Terms
- Replica: Each node that stores a copy of the data.
- Leader: A primary node that receives new data.
- Follower: aka Read replica, secondary, hot standby, a node that can receive a copy of the data and read requests.
- Replication Log: aka change stream, a sequential history of data modifications
- Synchronous: When the leader only reports success when the follower has confirmed it received the write.
- Asynchronous: When the leader doesnât wait for confirmation that the write was received.
- Semi-synchronous: When both synchronous and asynchronous followers are in use.
- Log sequence number: aka binlog coordinates, the exact position in a leader's replication log.
- Failover: Promoting a follower to be the new leader after the old leader fails.
- Logical Log: Where replication is decoupled from storage log.
- Change Data Capture: aka CDC, sending content of a database to an external system.
- Trigger: Custom app code that is automatically executed when a data change occurs in a system.
- Read-scaling: Increase capacity for serving read requests by adding more followers.
- Eventual consistency: An effect whereby inconsistency is a temporary state and if you stop writes the followers will eventually catch-up.
- Replication Lag: The time delay between a write on one replica and that write being visible on another replica.
- Convergent conflict resolution: When all replicas must arrive at the same final value when all changes have been replicated.
- Replication topology: Describes the communication paths along which writes are propagated from one node to another.
- Quorum: The minimum number of nodes that must agree to a read or write request for it to be considered successful.
- Sloppy quorum: Allows writes and reads to proceed even if the quorum cannot be met by the intended replica nodes, by temporarily writing to other available nodes instead
- Tombstone: A special marker used in leaderless databases to indicate that a deleted item should not be resurrected by other replicas during synchronization.
- Version vector: A collection of version numbers from all replicas.
Quotes n' Tips
- "If your application requires strong consistency, you may be better off with a system that uses synchronous replication and a single leader." p. 182
- "Eventual consistency does not guarantee that a write will eventually be visible — only that if no new writes are made, eventually all replicas will converge to the same value." p. 183