designing_data_intensive_applications/ch06

Summary

Chapter 6: Partitioning

Node Tree

MPP
- parallel_query_exec
partitions

Nodes

partitions
content	Partitions
children	break_data_up, combined_with_replication, document_partitioned_indexes, key_range, large_datasets_high_throughput (when to use), main_reason_scalability, rebalancing, secondary_indexes, sharding (AKA), spread_data_load_evenly, term_partitioned_indexes

break_data_up
content	Break data up
parents	partitions

sharding
content	Sharding
parents	partitions

large_datasets_high_throughput
content	For large datasets or very high throughput
parents	partitions

main_reason_scalability
content	Main reason: scalability
parents	partitions

combined_with_replication
content	Usually combined with replication
children	leader_follower_part (example)
parents	partitions

leader_follower_part
content	Leader-follower: each partition's leader is assigned to one node, its followers to other nodes
children	node_more_than_one_part
parents	combined_with_replication

spread_data_load_evenly
content	Goal: spread data and query load evenly across nodes
children	key_range_dist, partition_hash_keys, skewed_partitioning
parents	partitions

key_range_dist
content	Key-range distribution
children	access_patterns_hotspots, compound_primary_key, keys_sorted_on_each_node, not_good_key_range, part_bounds_adapt_data, range_queries_inefficient
parents	spread_data_load_evenly

part_bounds_adapt_data
content	Partition boundaries need to adapt to the data
parents	key_range_dist

keys_sorted_on_each_node
content	Keep keys sorted on each node
parents	key_range_dist
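
A minimal sketch of key-range lookup, assuming three hypothetical split points (`"g"`, `"n"`, `"t"`) that are not from the chapter. Because keys are kept sorted, a binary search over the boundaries finds the owning partition, and a range scan only has to visit the contiguous partitions between its endpoints.

```python
from bisect import bisect_right

# Hypothetical split points: partition 0 holds keys < "g",
# partition 1 holds "g" <= key < "n", partition 2 holds "n" <= key < "t",
# and partition 3 holds everything from "t" upward.
BOUNDARIES = ["g", "n", "t"]


def partition_for(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect_right(BOUNDARIES, key)


def partitions_for_range(start: str, end: str) -> list:
    """Partitions that a scan over the inclusive range [start, end] must visit."""
    return list(range(partition_for(start), partition_for(end) + 1))


if __name__ == "__main__":
    print(partition_for("apple"))          # falls in the first range
    print(partitions_for_range("h", "p"))  # only the partitions overlapping h..p
```

The same lookup also shows how hotspots arise: if every new key sorts after `"t"` (for example, keys prefixed with a timestamp), every write lands on the last partition.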

skewed_partitioning
content	Skewed Partitioning
children	hotspot_node, nodes_more_data_others
parents	spread_data_load_evenly

nodes_more_data_others
content	Some nodes have more data than others
parents	skewed_partitioning

hotspot_node
content	Hotspot node with high load
parents	skewed_partitioning

partition_hash_keys
content	Partition by hash of keys
children	assign_part_hash_range, compound_primary_key, good_hash_func, range_queries_inefficient
parents	spread_data_load_evenly

access_patterns_hotspots
content	Certain access patterns can lead to hotspots
parents	key_range_dist

node_more_than_one_part
content	Nodes can store more than one partition
parents	leader_follower_part

good_hash_func
content	A good hash function takes skewed data and makes it uniformly distributed
children	MD5 (example), fowler_noll_vo (example)
parents	partition_hash_keys

MD5
content	MD5 (Cassandra, MongoDB)
parents	good_hash_func

fowler_noll_vo
content	Fowler-Noll-Vo (Voldemort)
parents	good_hash_func

assign_part_hash_range
content	Assign each partition a range of hashes
children	consistent_hashing
parents	partition_hash_keys
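
A rough sketch of assigning each partition a range of hashes. The 32-bit hash space, the evenly spaced boundaries, and the function names are assumptions made for illustration; they are not how any particular database lays out its ranges.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 32  # assumption: use only the first 32 bits of the MD5 digest


def key_hash(key: str) -> int:
    """Map a key to an integer in [0, HASH_SPACE) using MD5."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_boundaries(num_partitions: int) -> list:
    """Exclusive upper bounds of evenly spaced hash ranges, one per partition."""
    return [HASH_SPACE * (i + 1) // num_partitions for i in range(num_partitions)]


def partition_for(key: str, boundaries: list) -> int:
    """Return the index of the partition whose hash range contains the key."""
    return bisect_right(boundaries, key_hash(key))


if __name__ == "__main__":
    bounds = make_boundaries(4)
    counts = [0] * 4
    # Sequential (skewed) keys still spread across all four hash ranges.
    for i in range(10_000):
        counts[partition_for(f"user:{i}", bounds)] += 1
    print(counts)
```

Adjacent keys such as `user:1` and `user:2` end up in unrelated ranges, which is why the sort order needed for range queries is lost.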

consistent_hashing
content	Consistent hashing: ranges chosen pseudorandomly
children	hash_partitioning (AKA)
parents	assign_part_hash_range
remarks	rarely used

range_queries_inefficient
content	Range queries inefficient under hash partitioning (sort order is lost); key-range partitioning keeps them efficient
parents	key_range_dist, partition_hash_keys

hash_partitioning
content	Hash Partitioning
parents	consistent_hashing

compound_primary_key
content	Compound Primary Key: compromise between key-range and hash-range distribution (Cassandra)
children	several_columns
parents	key_range_dist, partition_hash_keys

several_columns
content	Several Columns: only the first column is hashed and used to determine the partition. The remaining columns act as a concatenated index for sorting data in SSTables
children	good_for_many_to_one
parents	compound_primary_key
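
An illustrative sketch of the compound-key idea, assuming a hypothetical `(user_id, timestamp)` key and four partitions. Only `user_id` is hashed to pick the partition; rows inside the partition stay sorted by the full tuple, so a range scan over one user's timestamps touches a single partition. This is a toy in-memory model, not Cassandra's storage format.

```python
import hashlib
from bisect import bisect_left, insort

NUM_PARTITIONS = 4  # hypothetical fixed partition count (not a node count)

# Each partition keeps its rows sorted by (user_id, timestamp, text).
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(user_id: str) -> int:
    """Hash only the first column of the key to choose the partition."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def insert(user_id: str, timestamp: int, text: str) -> None:
    """Store a row, keeping the partition sorted by the remaining columns."""
    insort(partitions[partition_for(user_id)], (user_id, timestamp, text))


def updates_between(user_id: str, start_ts: int, end_ts: int) -> list:
    """Range scan over one user's rows with start_ts <= timestamp <= end_ts."""
    rows = partitions[partition_for(user_id)]
    lo = bisect_left(rows, (user_id, start_ts))
    hi = bisect_left(rows, (user_id, end_ts + 1))
    return rows[lo:hi]


if __name__ == "__main__":
    insert("alice", 10, "first post")
    insert("bob", 15, "hello")
    insert("alice", 20, "second post")
    insert("alice", 30, "third post")
    print(updates_between("alice", 15, 30))
```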

good_for_many_to_one
content	Good for one-to-many relationships
parents	several_columns

secondary_indexes
content	Secondary Indexes
children	document_partitioning_index, doesnt_map_neatly_part
parents	partitions

document_partitioning_index
content	Document Partitioning Index
children	document_partitioned_index (AKA), local_index (AKA)
parents	secondary_indexes

local_index
content	Local Index
children	global_index (vs), scatter_gather
parents	document_partitioning_index

global_index
content	Global Index
children	faster_reads_slow_complicated_writes, term_partitioned
parents	local_index

doesnt_map_neatly_part
content	Doesn't map neatly to partitions
parents	secondary_indexes

scatter_gather
content	Scatter / Gather
children	prone_to_tail_latency, query_all_combine (description)
parents	local_index

document_partitioned_index
content	Document partitioned index
parents	document_partitioning_index

query_all_combine
content	Query all partitions, combine results
parents	scatter_gather

prone_to_tail_latency
content	Prone to tail latency amplification
parents	scatter_gather
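
A small sketch of scatter/gather over local (document-partitioned) indexes. The partition contents and the `color` field are made up for illustration. Each partition indexes only its own documents, so a query by color has to be sent to every partition and the results merged.

```python
from concurrent.futures import ThreadPoolExecutor


def build_partition(docs: dict) -> dict:
    """Build a partition with a local index (color -> doc ids) over its own docs."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc["color"], []).append(doc_id)
    return {"docs": docs, "index": index}


# Hypothetical data set split across three partitions by document ID.
partitions = [
    build_partition({191: {"color": "red"}, 214: {"color": "black"}}),
    build_partition({515: {"color": "silver"}, 768: {"color": "red"}}),
    build_partition({893: {"color": "red"}}),
]


def query_partition(partition: dict, color: str) -> list:
    """Look up one color in a single partition's local index."""
    return partition["index"].get(color, [])


def scatter_gather(color: str) -> list:
    """Send the query to every partition in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = list(pool.map(lambda p: query_partition(p, color), partitions))
    return sorted(doc_id for ids in per_partition for doc_id in ids)


if __name__ == "__main__":
    print(scatter_gather("red"))
```

The combined answer is only ready once the slowest partition has replied, which is where the tail latency amplification comes from.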

term_partitioned
content	term-partitioned
children	partitions_global_index, up_to_date_dist_trans
parents	global_index

partitions_global_index
content	Partitions global index
parents	term_partitioned

up_to_date_dist_trans
content	Up-to-date index requires distributed transactions
parents	term_partitioned

faster_reads_slow_complicated_writes
content	Reads faster, writes slower and more complicated
parents	global_index
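
A toy sketch of a term-partitioned global index, assuming two index partitions split at a hypothetical boundary (terms sorting before `"m"` go to partition 0, the rest to partition 1). A read for one term goes to exactly one index partition, while a single document write may have to update several.

```python
from collections import defaultdict

# Hypothetical index partitions: each maps a term to the set of matching doc ids.
index_partitions = [defaultdict(set), defaultdict(set)]


def index_partition_for(term: str) -> int:
    """Assumed range split: terms before "m" live in partition 0, others in 1."""
    return 0 if term < "m" else 1


def index_document(doc_id: int, fields: dict) -> set:
    """Add a document to the global index; return the index partitions touched."""
    touched = set()
    for field, value in fields.items():
        term = f"{field}:{value}"
        p = index_partition_for(term)
        index_partitions[p][term].add(doc_id)
        touched.add(p)
    return touched


def lookup(term: str) -> list:
    """Read path: only the one partition that owns the term is consulted."""
    return sorted(index_partitions[index_partition_for(term)].get(term, ()))


if __name__ == "__main__":
    # One write fans out to both index partitions ("color:..." and "make:...").
    print(index_document(191, {"color": "red", "make": "honda"}))
    print(lookup("color:red"))
```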

rebalancing
content	Rebalancing
children	dont_do_hash_mod_n, dynamic_partitioning, fixed_num_parts, move_load_from_node (description), part_growth_prop_data, partitioning_proportional_nodes
parents	partitions

move_load_from_node
content	Move load from one node in cluster to another
parents	rebalancing

dont_do_hash_mod_n
content	Don't do hash mod N
parents	rebalancing
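
A quick sketch of why `hash mod N` rebalances badly, using made-up keys and a hypothetical change from 10 to 11 nodes. It counts how many keys would map to a different node after the change.

```python
import hashlib


def key_hash(key: str) -> int:
    """Stable integer hash of a key (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def moved_fraction(keys: list, old_n: int, new_n: int) -> float:
    """Fraction of keys whose node changes when the node count goes old_n -> new_n."""
    moved = sum(1 for k in keys if key_hash(k) % old_n != key_hash(k) % new_n)
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10_000)]
    print(f"{moved_fraction(keys, 10, 11):.1%} of keys change node")
```

With a uniform hash, a key stays put only when its hash gives the same remainder for both 10 and 11, which happens for about 1 key in 11. Roughly 10 out of 11 keys would therefore be expected to move, far more than the share the new node actually needs.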

fixed_num_parts
content	Fixed number of partitions
children	account_mismatched_hardware, dynamic_partitioning (vs), more_parts_than_nodes, not_good_key_range, partition_number
parents	rebalancing

more_parts_than_nodes
content	More partitions than nodes; when a new node is added, it steals a few partitions from existing nodes
parents	fixed_num_parts
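
A minimal sketch of rebalancing with a fixed number of partitions. The node names and the greedy stealing policy are assumptions for illustration. The key-to-partition mapping never changes; only whole partitions are reassigned to the new node.

```python
from collections import Counter


def add_node(assignment: dict, new_node: str) -> dict:
    """Return a new partition -> node map in which `new_node` steals partitions
    from overloaded nodes until it holds its fair share."""
    assignment = dict(assignment)
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)  # fair share per node, rounded down
    counts = Counter(assignment.values())
    for partition in sorted(assignment):
        if counts[new_node] >= target:
            break
        owner = assignment[partition]
        if counts[owner] > target:
            assignment[partition] = new_node
            counts[owner] -= 1
            counts[new_node] += 1
    return assignment


if __name__ == "__main__":
    # Hypothetical cluster: 12 partitions spread over 3 nodes.
    before = {p: f"node{p % 3}" for p in range(12)}
    after = add_node(before, "node3")
    moved = [p for p in before if before[p] != after[p]]
    print(Counter(after.values()), "moved partitions:", moved)
```

Only the stolen partitions move over the network; every other partition stays where it was.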

account_mismatched_hardware
content	Can even account for mismatched hardware
parents	fixed_num_parts

partition_number
content	Partition Number: too few and each partition is so large that rebalancing is expensive; too many and there's too much overhead.
parents	fixed_num_parts

not_good_key_range
content	Not good for key-range partitioning (fixed boundaries can't follow the data)
children	dynamic_partitioning (solution)
parents	fixed_num_parts, key_range_dist

dynamic_partitioning
content	Dynamic Partitioning
children	empty_db_single_part (caveat), merge_part_below_thresh, partition_number_adaptive, split_part_exceeds_size, suitable_key_range_hash_part
parents	not_good_key_range, fixed_num_parts, rebalancing

split_part_exceeds_size
content	Split partition that exceeds certain size
children	merge_part_below_thresh (related)
parents	dynamic_partitioning
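
A toy sketch of dynamic partitioning by splitting, assuming a hypothetical threshold of four keys per partition (real systems use a size in bytes). The store starts as a single partition covering the whole key space and splits a partition at its median key once it grows past the threshold. Merging shrunken partitions is left out to keep the sketch short.

```python
from bisect import bisect_right, insort

MAX_KEYS = 4  # hypothetical split threshold


class RangePartitionedStore:
    def __init__(self):
        # boundaries[i] is the smallest key of partition i + 1.
        self.boundaries = []
        # Starts as one partition covering the entire key space.
        self.partitions = [[]]

    def put(self, key: str) -> None:
        """Insert a key into its partition, splitting if the partition overflows."""
        i = bisect_right(self.boundaries, key)
        part = self.partitions[i]
        if key not in part:
            insort(part, key)
        if len(part) > MAX_KEYS:
            self._split(i)

    def _split(self, i: int) -> None:
        """Split partition i at its median key into two adjacent partitions."""
        part = self.partitions[i]
        mid = len(part) // 2
        self.boundaries.insert(i, part[mid])
        self.partitions[i:i + 1] = [part[:mid], part[mid:]]


if __name__ == "__main__":
    store = RangePartitionedStore()
    for n in range(10):
        store.put(f"k{n:02d}")
    print(store.boundaries)
    print(store.partitions)
```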

merge_part_below_thresh
content	Merge partition when it sinks below certain threshold
parents	split_part_exceeds_size, dynamic_partitioning

partition_number_adaptive
content	Number of partitions adapts to the total data volume
parents	dynamic_partitioning

suitable_key_range_hash_part
content	Suitable for key-range and hash-partitioned data
parents	dynamic_partitioning

empty_db_single_part
content	A newly initialized, empty database starts with a single partition. All writes are initially processed by a single node while the other nodes sit idle.
parents	dynamic_partitioning

partitioning_proportional_nodes
content	Partitioning proportional to nodes
children	fixed_part_num_per_node
parents	rebalancing

fixed_part_num_per_node
content	Fixed number of partitions per node
parents	partitioning_proportional_nodes

part_growth_prop_data
content	Partition size grows proportional to data size
children	part_shrinks_when_new_node
parents	rebalancing

part_shrinks_when_new_node
content	Partition size shrinks when a new node is added
parents	part_growth_prop_data

MPP
content	Massively parallel processing (MPP)
children	parallel_query_exec

parallel_query_exec
content	Parallel query execution
parents	MPP

key_range
content	Key Range
children	split_if_part_too_long (description)
parents	partitions

split_if_part_too_long
content	Split into two subranges if partition gets too large
parents	key_range

document_partitioned_indexes
content	Document-partitioned indexes
children	local, secondary_indexes_stored_same_part, term_partitioned_indexes (related)
parents	partitions

global
content	Global
parents	term_partitioned_indexes

term_partitioned_indexes
content	Term-partitioned indexes
children	global, partitioned_separately
parents	partitions, document_partitioned_indexes

local
content	Local
parents	document_partitioned_indexes

partitioned_separately
content	Partitioned separately
parents	secondary_indexes_stored_same_part, term_partitioned_indexes

secondary_indexes_stored_same_part
content	Secondary indexes stored on same partition as primary key/value
children	partitioned_separately (vs)
parents	document_partitioned_indexes