ch06
dz / designing_data_intensive_applications / ch06Summary
Chapter 6: Partitioning
Node Tree
-
partitions
- document_partitioned_indexes
- key_range
- large_datasets_high_throughput
- main_reason_scalability
- rebalancing
- secondary_indexes
- sharding
- spread_data_load_evenly
- term_partitioned_indexes
- break_data_up
- combined_with_replication
- MPP
Nodes
| partitions | |
| content | Partitions |
| children | document_partitioned_indexes, key_range, large_datasets_high_throughput (when to use), main_reason_scalability, rebalancing, secondary_indexes, sharding (AKA), spread_data_load_evenly, term_partitioned_indexes, break_data_up, combined_with_replication |
| break_data_up | |
| content | Break data up |
| parents | partitions |
| sharding | |
| content | Sharding |
| parents | partitions |
| large_datasets_high_throughput | |
| content | For large datasets or very high throughput |
| parents | partitions |
| main_reason_scalability | |
| content | Main reason: scalability |
| parents | partitions |
| combined_with_replication | |
| content | Usually combined with replication |
| children | leader_follower_part (example) |
| parents | partitions |
| leader_follower_part | |
| content | Leader-follower: partition leader assignment to one node, followers to other |
| children | node_more_than_one_part |
| parents | combined_with_replication |
| spread_data_load_evenly | |
| content | Goal: spread data and query load evenly across nodes |
| children | key_range_dist, partition_hash_keys, skewed_partitioning |
| parents | partitions |
| key_range_dist | |
| content | Key-range distribution |
| children | keys_sorted_on_each_node, not_good_key_range, part_bounds_adapt_data, range_queries_inefficient, access_patterns_hotspots, compound_primary_key |
| parents | spread_data_load_evenly |
| part_bounds_adapt_data | |
| content | Partitions boundaries need to adapt to data |
| parents | key_range_dist |
| keys_sorted_on_each_node | |
| content | Keep keys sorted on each node |
| parents | key_range_dist |
| skewed_partitioning | |
| content | Skewed Partitioning |
| children | hotspot_node, nodes_more_data_others |
| parents | spread_data_load_evenly |
| nodes_more_data_others | |
| content | Some nodes have more data than others |
| parents | skewed_partitioning |
| hotspot_node | |
| content | Hotspot node with high load |
| parents | skewed_partitioning |
| partition_hash_keys | |
| content | Parition by hash of keys |
| children | good_hash_func, range_queries_inefficient, assign_part_hash_range, compound_primary_key |
| parents | spread_data_load_evenly |
| access_patterns_hotspots | |
| content | Certain access patterns can lead to hotspots |
| parents | key_range_dist |
| node_more_than_one_part | |
| content | NOdes can store more than one partition |
| parents | leader_follower_part |
| good_hash_func | |
| content | A good hash function takes skewed data and makes it uniformly distributed |
| children | fowler_noll_vo (example), MD5 (example) |
| parents | partition_hash_keys |
| MD5 | |
| content | MD5 (cassandra, mongodb) |
| parents | good_hash_func |
| fowler_noll_vo | |
| content | Fowler-Noll-Vo (Voldemort) |
| parents | good_hash_func |
| assign_part_hash_range | |
| content | Assign each partition a range of hashes |
| children | consistent_hashing |
| parents | partition_hash_keys |
| consistent_hashing | |
| content | Consistent hashing: ranges chosen pseudorandomly |
| children | hash_partitioning (AKA) |
| parents | assign_part_hash_range |
| remarks | rarely used |
| range_queries_inefficient | |
| content | Range Queries Inefficient |
| parents | partition_hash_keys, key_range_dist |
| hash_partitioning | |
| content | Hash Partitioning |
| parents | consistent_hashing |
| compound_primary_key | |
| content | Compound Primary Key: compromise between key-range hash-range distribution (Cassandra) |
| children | several_columns |
| parents | partition_hash_keys, key_range_dist |
| several_columns | |
| content | Several Columns: only first column hashed and used to determine parition. The rest are concatenated index for sorting data in SSTables |
| children | good_for_many_to_one |
| parents | compound_primary_key |
| good_for_many_to_one | |
| content | Good for many-to-one relationships |
| parents | several_columns |
| secondary_indexes | |
| content | Secondary Indexes |
| children | document_partitioning_index, doesnt_map_neatly_part |
| parents | partitions |
| document_partitioning_index | |
| content | Document Partitioning Index |
| children | local_index (AKA), document_partitioned_index (AKA) |
| parents | secondary_indexes |
| local_index | |
| content | Local Index |
| children | global_index (vs), scatter_gather |
| parents | document_partitioning_index |
| global_index | |
| content | Global Index |
| children | faster_reads_slow_complicated_writes, term_partitioned |
| parents | local_index |
| doesnt_map_neatly_part | |
| content | Doesn't map neatly to partitions |
| parents | secondary_indexes |
| scatter_gather | |
| content | Scatter / Gather |
| children | prone_to_tail_latency, query_all_combine (description) |
| parents | local_index |
| document_partitioned_index | |
| content | Document partitioned index |
| parents | document_partitioning_index |
| query_all_combine | |
| content | Query all partitions, combine results |
| parents | scatter_gather |
| prone_to_tail_latency | |
| content | Prone to tail latency amplification |
| parents | scatter_gather |
| term_partitioned | |
| content | term-partitioned |
| children | partitions_global_index, up_to_date_dist_trans |
| parents | global_index |
| partitions_global_index | |
| content | Partitions global index |
| parents | term_partitioned |
| up_to_date_dist_trans | |
| content | Up-to-date index requires distributed transactions |
| parents | term_partitioned |
| faster_reads_slow_complicated_writes | |
| content | Reads faster, writers slower and more complicated |
| parents | global_index |
| rebalancing | |
| content | Rebalancing |
| children | dont_do_hash_mod_n, dynamic_partitioning, fixed_num_parts, move_load_from_node (description), part_growth_prop_data, partitioning_proportional_nodes |
| parents | partitions |
| move_load_from_node | |
| content | Move load from one node in cluster to another |
| parents | rebalancing |
| dont_do_hash_mod_n | |
| content | Don't do hash mod N |
| parents | rebalancing |
| fixed_num_parts | |
| content | Fixed number of partitions |
| children | dynamic_partitioning (vs), more_parts_than_nodes, not_good_key_range, partition_number, account_mismatched_hardware |
| parents | rebalancing |
| more_parts_than_nodes | |
| content | More partitions than nodes, move/steal partitions when new noded added |
| parents | fixed_num_parts |
| account_mismatched_hardware | |
| content | Can even account for mismatched hardware |
| parents | fixed_num_parts |
| partition_number | |
| content | Partition Number: too large and rebalancing is expensive, too small and there's too much overhead. |
| parents | fixed_num_parts |
| not_good_key_range | |
| content | Not good for key-range partitioning |
| children | dynamic_partitioning (solution) |
| parents | key_range_dist, fixed_num_parts |
| dynamic_partitioning | |
| content | Dynamic Partitioning |
| children | empty_db_single_part (caveat), merge_part_below_thresh, partition_number_adaptive, split_part_exceeds_size, suitable_key_range_hash_part |
| parents | rebalancing, not_good_key_range, fixed_num_parts |
| split_part_exceeds_size | |
| content | Split partition that exceeds certain size |
| children | merge_part_below_thresh (related) |
| parents | dynamic_partitioning |
| merge_part_below_thresh | |
| content | Merge partition when it sinks below certain threshold |
| parents | dynamic_partitioning, split_part_exceeds_size |
| partition_number_adaptive | |
| content | Number of partitions adapts to the total data volume |
| parents | dynamic_partitioning |
| suitable_key_range_hash_part | |
| content | Suitable for key-range and hash-partioned data |
| parents | dynamic_partitioning |
| empty_db_single_part | |
| content | An initialized empty database starts with a single partition. All writes initially processed by single node while other nodes are idle. |
| parents | dynamic_partitioning |
| partitioning_proportional_nodes | |
| content | Partitioning proportional to nodes |
| children | fixed_part_num_per_node |
| parents | rebalancing |
| fixed_part_num_per_node | |
| content | Fixed number of partitions per node |
| parents | partitioning_proportional_nodes |
| part_growth_prop_data | |
| content | Partition size grows proportional to data size |
| children | part_shrinks_when_new_node |
| parents | rebalancing |
| part_shrinks_when_new_node | |
| content | Partition size shrink when new node is added |
| parents | part_growth_prop_data |
| MPP | |
| content | Massively parallel processing (MPP) |
| children | parallel_query_exec |
| parallel_query_exec | |
| content | Parallel query execution |
| parents | MPP |
| key_range | |
| content | Key Range |
| children | split_if_part_too_long (description) |
| parents | partitions |
| split_if_part_too_long | |
| content | Split into two subranges if partition gets too long |
| parents | key_range |
| document_partitioned_indexes | |
| content | document partitioned indexes |
| children | global, secondary_indexes_stored_same_part, term_partitioned_indexes (related) |
| parents | partitions |
| global | |
| content | global |
| parents | document_partitioned_indexes |
| term_partitioned_indexes | |
| content | Term partitioned indexes |
| children | local, partitioned_separately |
| parents | document_partitioned_indexes, partitions |
| local | |
| content | Local |
| parents | term_partitioned_indexes |
| partitioned_separately | |
| content | Partitioned separately |
| parents | term_partitioned_indexes, secondary_indexes_stored_same_part |
| secondary_indexes_stored_same_part | |
| content | Secondary indexes stored on same partition as primary key/value |
| children | partitioned_separately (vs) |
| parents | document_partitioned_indexes |