ch06
dz / designing_data_intensive_applications / ch06Summary
Chapter 6: Partitioning
Node Tree
- MPP
-
partitions
- break_data_up
- combined_with_replication
- document_partitioned_indexes
- key_range
- large_datasets_high_throughput
- main_reason_scalability
- rebalancing
- secondary_indexes
- sharding
- spread_data_load_evenly
- term_partitioned_indexes
Nodes
partitions | |
content | Partitions |
children | break_data_up, combined_with_replication, document_partitioned_indexes, key_range, large_datasets_high_throughput (when to use), main_reason_scalability, rebalancing, secondary_indexes, sharding (AKA), spread_data_load_evenly, term_partitioned_indexes |
break_data_up | |
content | Break data up |
parents | partitions |
sharding | |
content | Sharding |
parents | partitions |
large_datasets_high_throughput | |
content | For large datasets or very high throughput |
parents | partitions |
main_reason_scalability | |
content | Main reason: scalability |
parents | partitions |
combined_with_replication | |
content | Usually combined with replication |
children | leader_follower_part (example) |
parents | partitions |
leader_follower_part | |
content | Leader-follower: partition leader assignment to one node, followers to other |
children | node_more_than_one_part |
parents | combined_with_replication |
spread_data_load_evenly | |
content | Goal: spread data and query load evenly across nodes |
children | key_range_dist, partition_hash_keys, skewed_partitioning |
parents | partitions |
key_range_dist | |
content | Key-range distribution |
children | access_patterns_hotspots, compound_primary_key, keys_sorted_on_each_node, not_good_key_range, part_bounds_adapt_data, range_queries_inefficient |
parents | spread_data_load_evenly |
part_bounds_adapt_data | |
content | Partitions boundaries need to adapt to data |
parents | key_range_dist |
keys_sorted_on_each_node | |
content | Keep keys sorted on each node |
parents | key_range_dist |
skewed_partitioning | |
content | Skewed Partitioning |
children | hotspot_node, nodes_more_data_others |
parents | spread_data_load_evenly |
nodes_more_data_others | |
content | Some nodes have more data than others |
parents | skewed_partitioning |
hotspot_node | |
content | Hotspot node with high load |
parents | skewed_partitioning |
partition_hash_keys | |
content | Parition by hash of keys |
children | assign_part_hash_range, compound_primary_key, good_hash_func, range_queries_inefficient |
parents | spread_data_load_evenly |
access_patterns_hotspots | |
content | Certain access patterns can lead to hotspots |
parents | key_range_dist |
node_more_than_one_part | |
content | NOdes can store more than one partition |
parents | leader_follower_part |
good_hash_func | |
content | A good hash function takes skewed data and makes it uniformly distributed |
children | MD5 (example), fowler_noll_vo (example) |
parents | partition_hash_keys |
MD5 | |
content | MD5 (cassandra, mongodb) |
parents | good_hash_func |
fowler_noll_vo | |
content | Fowler-Noll-Vo (Voldemort) |
parents | good_hash_func |
assign_part_hash_range | |
content | Assign each partition a range of hashes |
children | consistent_hashing |
parents | partition_hash_keys |
consistent_hashing | |
content | Consistent hashing: ranges chosen pseudorandomly |
children | hash_partitioning (AKA) |
parents | assign_part_hash_range |
remarks | rarely used |
range_queries_inefficient | |
content | Range Queries Inefficient |
parents | key_range_dist, partition_hash_keys |
hash_partitioning | |
content | Hash Partitioning |
parents | consistent_hashing |
compound_primary_key | |
content | Compound Primary Key: compromise between key-range hash-range distribution (Cassandra) |
children | several_columns |
parents | key_range_dist, partition_hash_keys |
several_columns | |
content | Several Columns: only first column hashed and used to determine parition. The rest are concatenated index for sorting data in SSTables |
children | good_for_many_to_one |
parents | compound_primary_key |
good_for_many_to_one | |
content | Good for many-to-one relationships |
parents | several_columns |
secondary_indexes | |
content | Secondary Indexes |
children | document_partitioning_index, doesnt_map_neatly_part |
parents | partitions |
document_partitioning_index | |
content | Document Partitioning Index |
children | document_partitioned_index (AKA), local_index (AKA) |
parents | secondary_indexes |
local_index | |
content | Local Index |
children | global_index (vs), scatter_gather |
parents | document_partitioning_index |
global_index | |
content | Global Index |
children | faster_reads_slow_complicated_writes, term_partitioned |
parents | local_index |
doesnt_map_neatly_part | |
content | Doesn't map neatly to partitions |
parents | secondary_indexes |
scatter_gather | |
content | Scatter / Gather |
children | prone_to_tail_latency, query_all_combine (description) |
parents | local_index |
document_partitioned_index | |
content | Document partitioned index |
parents | document_partitioning_index |
query_all_combine | |
content | Query all partitions, combine results |
parents | scatter_gather |
prone_to_tail_latency | |
content | Prone to tail latency amplification |
parents | scatter_gather |
term_partitioned | |
content | term-partitioned |
children | partitions_global_index, up_to_date_dist_trans |
parents | global_index |
partitions_global_index | |
content | Partitions global index |
parents | term_partitioned |
up_to_date_dist_trans | |
content | Up-to-date index requires distributed transactions |
parents | term_partitioned |
faster_reads_slow_complicated_writes | |
content | Reads faster, writers slower and more complicated |
parents | global_index |
rebalancing | |
content | Rebalancing |
children | dont_do_hash_mod_n, dynamic_partitioning, fixed_num_parts, move_load_from_node (description), part_growth_prop_data, partitioning_proportional_nodes |
parents | partitions |
move_load_from_node | |
content | Move load from one node in cluster to another |
parents | rebalancing |
dont_do_hash_mod_n | |
content | Don't do hash mod N |
parents | rebalancing |
fixed_num_parts | |
content | Fixed number of partitions |
children | account_mismatched_hardware, dynamic_partitioning (vs), more_parts_than_nodes, not_good_key_range, partition_number |
parents | rebalancing |
more_parts_than_nodes | |
content | More partitions than nodes, move/steal partitions when new noded added |
parents | fixed_num_parts |
account_mismatched_hardware | |
content | Can even account for mismatched hardware |
parents | fixed_num_parts |
partition_number | |
content | Partition Number: too large and rebalancing is expensive, too small and there's too much overhead. |
parents | fixed_num_parts |
not_good_key_range | |
content | Not good for key-range partitioning |
children | dynamic_partitioning (solution) |
parents | fixed_num_parts, key_range_dist |
dynamic_partitioning | |
content | Dynamic Partitioning |
children | empty_db_single_part (caveat), merge_part_below_thresh, partition_number_adaptive, split_part_exceeds_size, suitable_key_range_hash_part |
parents | not_good_key_range, fixed_num_parts, rebalancing |
split_part_exceeds_size | |
content | Split partition that exceeds certain size |
children | merge_part_below_thresh (related) |
parents | dynamic_partitioning |
merge_part_below_thresh | |
content | Merge partition when it sinks below certain threshold |
parents | split_part_exceeds_size, dynamic_partitioning |
partition_number_adaptive | |
content | Number of partitions adapts to the total data volume |
parents | dynamic_partitioning |
suitable_key_range_hash_part | |
content | Suitable for key-range and hash-partioned data |
parents | dynamic_partitioning |
empty_db_single_part | |
content | An initialized empty database starts with a single partition. All writes initially processed by single node while other nodes are idle. |
parents | dynamic_partitioning |
partitioning_proportional_nodes | |
content | Partitioning proportional to nodes |
children | fixed_part_num_per_node |
parents | rebalancing |
fixed_part_num_per_node | |
content | Fixed number of partitions per node |
parents | partitioning_proportional_nodes |
part_growth_prop_data | |
content | Partition size grows proportional to data size |
children | part_shrinks_when_new_node |
parents | rebalancing |
part_shrinks_when_new_node | |
content | Partition size shrink when new node is added |
parents | part_growth_prop_data |
MPP | |
content | Massively parallel processing (MPP) |
children | parallel_query_exec |
parallel_query_exec | |
content | Parallel query execution |
parents | MPP |
key_range | |
content | Key Range |
children | split_if_part_too_long (description) |
parents | partitions |
split_if_part_too_long | |
content | Split into two subranges if partition gets too long |
parents | key_range |
document_partitioned_indexes | |
content | document partitioned indexes |
children | global, secondary_indexes_stored_same_part, term_partitioned_indexes (related) |
parents | partitions |
global | |
content | global |
parents | document_partitioned_indexes |
term_partitioned_indexes | |
content | Term partitioned indexes |
children | local, partitioned_separately |
parents | partitions, document_partitioned_indexes |
local | |
content | Local |
parents | term_partitioned_indexes |
partitioned_separately | |
content | Partitioned separately |
parents | secondary_indexes_stored_same_part, term_partitioned_indexes |
secondary_indexes_stored_same_part | |
content | Secondary indexes stored on same partition as primary key/value |
children | partitioned_separately (vs) |
parents | document_partitioned_indexes |