ch03
dz / designing_data_intensive_applications / ch03Summary
Chapter 3: Storage and Retrieval
Node Tree
-
storage_engine
- LSM_storage_engine
- index
-
log_structured
-
log
- append_only
-
break_log_segments
-
SSTables
- sorted_string_table
- uses_SSTable
- advantages
- each_key_appears_once
-
how_to_sort
- sort_in_memory
- sort_on_disk
- keep_kv_sorted_key
- memtable
- order_timing_strategies_compaction_merge
- merge_segments
-
SSTables
-
log
- page_oriented
- tradeoff
-
transaction
-
OLTP
- some_oltp_traits
-
storage_engine_schools_of_thought
- update_in_place
-
log_structured
-
log
- append_only
-
break_log_segments
-
SSTables
- sorted_string_table
- uses_SSTable
- advantages
- each_key_appears_once
-
how_to_sort
- sort_in_memory
- sort_on_disk
- keep_kv_sorted_key
- memtable
- order_timing_strategies_compaction_merge
- merge_segments
-
SSTables
-
log
- OLTP_ops
- commercial_transaction
- data_analytics
- rw_group_logical_unit
-
OLTP
- LSM_Trees
- covering_index
- data_warehouse
- fts_fuzzy
- keeping_everything_memory
- primary_key_index
Nodes
storage_engine | |
content | Storage Engine |
children | LSM_storage_engine, index, log_structured, page_oriented |
log_structured | |
content | log-structured |
children | log |
parents | storage_engine_schools_of_thought, storage_engine |
page_oriented | |
content | Page Oriented |
children | b_trees |
parents | storage_engine |
log | |
content | Log |
children | append_only, break_log_segments |
parents | log_structured |
append_only | |
content | append-only |
parents | log |
b_trees | |
content | B-Trees |
children | write_ahead_log, 4_level_tree, balanced, blocks_pages, btree_root, btree_vs_lsm_tree, fractal_trees (variation of b-trees (I think?)), identified_using_address, keep_kv_sorted_key, latches, most_widely_used_index_struct, secondary_key_index |
parents | sort_on_disk, page_oriented |
index | |
content | Index |
children | value, additional_struct_data_derived, efficient_data_struct_lookup, hash_index, incurs_overhead, key |
parents | storage_engine |
efficient_data_struct_lookup | |
content | Efficient data structure for lookup |
parents | index, tradeoff |
incurs_overhead | |
content | Incurs overhead, especially for writes |
parents | index, tradeoff |
tradeoff | |
content | Tradeoff |
children | efficient_data_struct_lookup, incurs_overhead |
additional_struct_data_derived | |
content | Additional structure derived from data |
parents | index |
hash_index | |
content | Hash Index |
children | compaction, kv_store |
parents | index |
kv_store | |
content | Key-Value store |
children | tombstone, bitcask, dictionary |
parents | hash_index |
bitcask | |
content | Bitcask |
children | keys_mem_values_disk |
parents | kv_store |
dictionary | |
content | Dictionary |
parents | kv_store |
compaction | |
content | Compaction |
children | throw_away_dup_keep_recent, merge_segments, order_timing_strategies_compaction_merge |
parents | hash_index |
keys_mem_values_disk | |
content | in-memory keys, values on disk with offsets |
parents | bitcask |
break_log_segments | |
content | Break log into segments |
children | SSTables, merge_segments |
parents | log |
throw_away_dup_keep_recent | |
content | Throw away duplicate keys, keep most recent |
parents | compaction |
tombstone | |
content | Tombstone |
children | deletion_record |
parents | kv_store |
deletion_record | |
content | Deletion record after removing key |
parents | tombstone |
merge_segments | |
content | Merge Segments |
children | most_recent_multi_segs, order_timing_strategies_compaction_merge |
parents | compaction, break_log_segments |
SSTables | |
content | SSTables |
children | sorted_string_table (decrypting the abbreviation), uses_SSTable, advantages, each_key_appears_once, how_to_sort, keep_kv_sorted_key (Similarity to SSTables), memtable, order_timing_strategies_compaction_merge (performance) |
parents | break_log_segments |
sorted_string_table | |
content | Sorted String Table |
parents | SSTables |
advantages | |
content | Advantages |
children | merging_segments_simple, no_need_all_keys_in_memory, records_grouped_compressed |
parents | SSTables |
each_key_appears_once | |
content | Each key appears only once |
children | most_recent_multi_segs |
parents | SSTables |
LSM_Trees | |
content | LSM-Trees |
children | term_dictionary, uses_SSTable, LSM_Trees_perf, LSM_storage_engine, btree_vs_lsm_tree, cascade_SSTables, log_structured_merge_tree (decrypting the abbreviation), secondary_key_index (also used with LSM-Trees) |
most_recent_multi_segs | |
content | Most recent key used when dealing with multiple segments |
parents | each_key_appears_once, merge_segments |
merging_segments_simple | |
content | Merging segments simple and efficient |
parents | advantages |
no_need_all_keys_in_memory | |
content | in-memory representation of keys no longer needed |
children | one_key_offset_every_few_kb |
parents | advantages |
one_key_offset_every_few_kb | |
content | One key offset every few kb is sufficient density |
children | sparse_in_mem_index (terminology) |
parents | no_need_all_keys_in_memory |
sparse_in_mem_index | |
content | sparse in-memory index |
parents | one_key_offset_every_few_kb, records_grouped_compressed |
records_grouped_compressed | |
content | Records can be grouped and compressed |
children | sparse_in_mem_index |
parents | advantages |
how_to_sort | |
content | How to sort? |
children | sort_in_memory, sort_on_disk |
parents | SSTables |
sort_in_memory | |
content | In memory |
children | memtable |
parents | sort_on_disk, how_to_sort |
sort_on_disk | |
content | On Disk |
children | sort_in_memory (in-memory sorting easier than on-disk), b_trees (b-trees main structure for on-disk sorting) |
parents | how_to_sort |
memtable | |
content | memtable |
children | db_crash_recent_writes_lost, in_mem_tree, memtable_read, memtable_write |
parents | SSTables, sort_in_memory |
in_mem_tree | |
content | in-memory tree |
children | AVL, redblack |
parents | memtable |
redblack | |
content | Red-Black tree |
parents | in_mem_tree |
AVL | |
content | AVL |
parents | in_mem_tree |
memtable_write | |
content | Write |
parents | memtable |
memtable_read | |
content | Read |
parents | memtable |
db_crash_recent_writes_lost | |
content | issue: if database crashes the most recent writes (still in memory) will be lost. |
children | crash_solution |
parents | memtable |
crash_solution | |
content | Solution: use append-only log for most recent writes |
parents | db_crash_recent_writes_lost |
log_structured_merge_tree | |
content | Log Structured Merge Tree |
parents | LSM_Trees |
LSM_storage_engine | |
content | LSM storage engine |
parents | LSM_Trees, storage_engine |
uses_SSTable | |
content | Uses SSTable |
parents | SSTables, LSM_Trees |
LSM_Trees_perf | |
content | Performance |
children | nonexist_keys_slow |
parents | LSM_Trees |
term_dictionary | |
children | term_desc (description), lucene |
parents | LSM_Trees |
lucene | |
content | Lucene |
parents | term_dictionary |
term_desc | |
content | Key: Term, Value: List of ID's of all the documents containing term (word) (postings list) |
children | kept_in_SSTable |
parents | term_dictionary |
kept_in_SSTable | |
content | Kept in SSTable-like sorted files |
parents | term_desc |
nonexist_keys_slow | |
content | Nonexistant key look-up slow |
children | bloom_filters (optimization solution) |
parents | LSM_Trees_perf |
bloom_filters | |
content | Bloom filters |
children | bloom_desc (description) |
parents | nonexist_keys_slow |
bloom_desc | |
content | Memory-efficient data structure for approximating the contents in a set |
parents | bloom_filters |
order_timing_strategies_compaction_merge | |
content | Strategies for the order and time of SSTables compaction and merge |
children | size_tiered, leveled |
parents | merge_segments, compaction, SSTables |
size_tiered | |
content | Size Tiered |
children | size_tiered_elab (Elaboration) |
parents | order_timing_strategies_compaction_merge |
size_tiered_elab | |
content | Newer/smaller merged into older/larger |
parents | size_tiered |
leveled | |
content | Leveled |
children | leveled_elab (elaboration) |
parents | order_timing_strategies_compaction_merge |
leveled_elab | |
content | Key range split into smaller tables. Old data moved int levels |
parents | leveled |
cascade_SSTables | |
content | Cascade of SSTables |
parents | LSM_Trees |
most_widely_used_index_struct | |
content | Most widely used indexing structure |
parents | b_trees |
keep_kv_sorted_key | |
content | Keeps key/value pairs sorted by key |
parents | SSTables, b_trees |
blocks_pages | |
content | Blocks/pages |
children | branching_factor, btree_root, fixed_size_page, leaf_page, orphan_page |
parents | b_trees |
fixed_size_page | |
content | Fixed size (4kb traditionally) |
parents | blocks_pages |
identified_using_address | |
content | identified using an on-disk address/location |
children | refs_make_tree |
parents | b_trees |
refs_make_tree | |
content | References make tree |
parents | identified_using_address |
btree_root | |
content | Root |
parents | b_trees, blocks_pages |
leaf_page | |
content | Leaf page: page containing individual keys |
parents | blocks_pages |
branching_factor | |
content | Branching factor: number of references to child pages in one page |
parents | blocks_pages |
balanced | |
content | balanced |
children | n_keys_depth |
parents | b_trees |
n_keys_depth | |
content | N-keys has depth O(n*log(n)) |
parents | balanced |
4_level_tree | |
content | 4-level tree of 4kb pages with branching factor of 500 can store up to 256TB |
parents | b_trees |
orphan_page | |
content | Orphan Page |
children | not_child_to_parent (description), orphan_caused_when (caused by) |
parents | blocks_pages |
not_child_to_parent | |
content | Not a child of any parent page |
parents | orphan_page |
orphan_caused_when | |
content | Can be caused when writing page split gets interrupted. |
parents | orphan_page |
write_ahead_log | |
content | Write-ahead log (WAL) |
children | redo_log (AKA (I think?)) |
parents | b_trees |
redo_log | |
content | redo log |
parents | write_ahead_log |
latches | |
content | latches |
children | lightweight_locks (description) |
parents | b_trees |
lightweight_locks | |
content | Lightweight locks for concurrent access |
parents | latches |
fractal_trees | |
content | fractal trees |
parents | b_trees |
btree_vs_lsm_tree | |
content | B-Trees vs LSM-Trees |
children | write_amplification, btree_vs, lsm_tree_vs, more_predictable |
parents | b_trees, LSM_Trees |
btree_vs | |
content | B-Tree |
children | faster_reads, key_exists_in_one_place, more_mature |
parents | btree_vs_lsm_tree |
lsm_tree_vs | |
content | LSM Tree |
children | better_compression, compaction_interference, faster_writes, higher_throughput_lower_wa, misconfigured_compaction |
parents | btree_vs_lsm_tree |
more_mature | |
content | More Mature |
parents | btree_vs |
faster_reads | |
content | faster reads |
parents | btree_vs |
write_amplification | |
content | Write Amplification |
children | higher_throughput_lower_wa, one_db_write_several_disk_writes (description) |
parents | btree_vs_lsm_tree |
one_db_write_several_disk_writes | |
content | One DB write yields mulitple disk writes over DB lifetime |
parents | write_amplification |
higher_throughput_lower_wa | |
content | Sustain higher throughput, lower write amplification |
parents | lsm_tree_vs, write_amplification |
better_compression | |
content | Better compression |
parents | lsm_tree_vs |
faster_writes | |
content | Faster Writes |
parents | lsm_tree_vs |
more_predictable | |
content | More predictable |
children | impacts_higer_percentiles, queries (more predictable queries) |
parents | btree_vs_lsm_tree, compaction_interference |
compaction_interference | |
content | Compaction sometimes interferes with read/write performance |
children | impacts_higer_percentiles, more_predictable (compaction makes r/w performance less predictable) |
parents | lsm_tree_vs |
impacts_higer_percentiles | |
content | Impacts higher percentiles |
parents | more_predictable, compaction_interference |
remarks | see previous chapters on percentiles |
queries | |
content | queries |
parents | more_predictable |
key_exists_in_one_place | |
content | Key exists in exactly one place in the index |
children | strong_transactional_semantics |
parents | btree_vs |
strong_transactional_semantics | |
content | Strong Transactional Semantics |
parents | key_exists_in_one_place |
misconfigured_compaction | |
content | Misconfigured compaction causes disks to fill up |
parents | lsm_tree_vs |
primary_key_index | |
content | Primary key index |
children | secondary_key_index |
secondary_key_index | |
content | Secondary key index |
children | avoids_duplication |
parents | primary_key_index, b_trees, LSM_Trees |
key | |
content | Key |
children | queries_for |
parents | index |
value | |
content | Value |
children | row |
parents | index |
queries_for | |
content | Queries For |
parents | key |
row | |
content | Row |
children | row_ref (or) |
parents | value |
row_ref | |
content | Ref to row, stored elsewhere |
children | heap_file |
parents | row |
heap_file | |
content | Heap File |
children | avoids_duplication, indexed_row_in_index |
parents | row_ref |
avoids_duplication | |
content | Avoids duplication when multiple secondary indexes are present |
parents | heap_file, secondary_key_index |
indexed_row_in_index | |
content | If hop from index to heap too expensive, store indexed row in index |
children | clustered_index, multi_column_index |
parents | heap_file |
clustered_index | |
content | Clustered Index |
parents | indexed_row_in_index |
clustered_nonclustered_compromise | |
content | Compromise between clustered and nonclustered |
parents | covering_index |
multi_column_index | |
content | Multi-Column Index |
children | concat_index, geospatial_data |
parents | indexed_row_in_index |
concat_index | |
content | Concatenated Index |
parents | multi_column_index |
covering_index | |
content | covering index, or index with included columns |
children | clustered_nonclustered_compromise |
geospatial_data | |
content | geospatial data |
parents | multi_column_index |
fts_fuzzy | |
content | Full-text search and fuzzy indexes |
children | similar_keys, fuzzy_querying, levenshtein_automation |
similar_keys | |
content | similar keys |
parents | fts_fuzzy |
fuzzy_querying | |
content | Fuzzy Querying |
parents | fts_fuzzy |
levenshtein_automation | |
content | levenshtein automation |
children | efficient_search_words_edit_distance (description) |
parents | fts_fuzzy |
efficient_search_words_edit_distance | |
content | Efficient search of words within edit distance |
parents | levenshtein_automation |
keeping_everything_memory | |
content | Keeping everything in memory |
children | NVM, anti_caching, in_memory_db |
in_memory_db | |
content | In-memory databses |
children | avoids_encoding_overhead, disk_durablity, implement_models_difficult_for_disks |
parents | keeping_everything_memory |
disk_durablity | |
content | Use disk as append-only log for durability |
parents | in_memory_db |
implement_models_difficult_for_disks | |
content | Can provide models that are difficult to implement with disk-based indexes |
children | redis_data_structs (examples) |
parents | in_memory_db |
redis_data_structs | |
content | Redis: priority queues and sets |
parents | implement_models_difficult_for_disks |
avoids_encoding_overhead | |
content | Performance advantage comes from avoiding overheads of encoding data in a form that can be written to disk |
parents | in_memory_db |
anti_caching | |
content | Anti-caching |
children | datasets_larger_than_mem (situation when to use it), evict_least_used |
parents | keeping_everything_memory |
datasets_larger_than_mem | |
content | Datasets larger than memory |
parents | anti_caching |
evict_least_used | |
content | Evict least used data from memory and write to disk |
parents | anti_caching |
NVM | |
content | NVM: non-volatile memory |
parents | keeping_everything_memory |
transaction | |
content | Transaction |
children | OLTP, commercial_transaction (business origins), data_analytics, rw_group_logical_unit (description) |
commercial_transaction | |
content | Commercial transaction |
parents | transaction |
rw_group_logical_unit | |
content | Group of reads/writes that form a logical unit |
parents | transaction |
OLTP | |
content | Online Transaction Processing (OLTP) |
children | some_oltp_traits, storage_engine_schools_of_thought, OLTP_ops |
parents | transaction |
data_analytics | |
content | Data Analytics |
children | business_intel |
parents | transaction |
business_intel | |
content | Business intelligence |
children | OLAP |
parents | data_analytics |
OLAP | |
content | Online analytic processing (OLAP) |
children | some_olap_traits, specialized_db_olap, column_oriented_storage |
parents | business_intel |
data_warehouse | |
content | Data Warehouse |
children | specialized_db_olap (description), star_schema, column_oriented_storage, doesnt_affect_user_facing_db, extract_transform_load, materialized_aggregates |
specialized_db_olap | |
content | specialized database for OLAP |
parents | OLAP, data_warehouse |
doesnt_affect_user_facing_db | |
content | Doesn't affect performance of user-facing OLTP databse |
parents | data_warehouse |
extract_transform_load | |
content | extract, transform, load |
parents | data_warehouse |
star_schema | |
content | Star Schema |
children | snowflake_schema, surround_like_star_rays (Where "star schema" gets its name), dimensional_modeling (AKA), fact_table |
parents | data_warehouse |
dimensional_modeling | |
content | Dimensional Modeling |
parents | star_schema |
fact_table | |
content | Fact Table |
children | dimension_tables |
parents | star_schema |
snowflake_schema | |
content | Snowflake schema |
children | dimensions_broken_into_subdimensions |
parents | star_schema |
dimension_tables | |
content | Dimension Tables |
children | surround_like_star_rays |
parents | fact_table |
surround_like_star_rays | |
content | Surround facts table like rays of a star |
parents | star_schema, dimension_tables |
OLTP_ops | |
content | OLTP operations |
children | slicing, dicing, drill_down |
parents | OLTP |
drill_down | |
content | Drill Down |
parents | OLTP_ops |
slicing | |
content | Slicing |
parents | OLTP_ops |
dicing | |
content | dicing |
parents | OLTP_ops |
column_oriented_storage | |
content | Column-oriented storage |
children | sort_order, sorted_different_ways, writing, access_few_columns_at_a_time, compresses_well |
parents | data_warehouse, OLAP |
compresses_well | |
content | Tends to compress well |
children | bitmap_encoding |
parents | column_oriented_storage |
access_few_columns_at_a_time | |
content | Queries tend to access only 3-5 columns at a time |
parents | column_oriented_storage |
bitmap_encoding | |
content | Bitmap Encoding |
parents | compresses_well |
dimensions_broken_into_subdimensions | |
content | Dimensions broken down into subdimensions |
parents | snowflake_schema |
sort_order | |
content | sort order |
children | helpful_for_col_compr |
parents | column_oriented_storage |
helpful_for_col_compr | |
content | Can help with compression of columns |
parents | sort_order |
writing | |
content | writing |
children | more_difficult |
parents | column_oriented_storage |
sorted_different_ways | |
content | sorted in different ways |
children | optimized_diff_queries |
parents | column_oriented_storage |
optimized_diff_queries | |
content | Optimized for different queries |
parents | sorted_different_ways |
more_difficult | |
content | more difficult |
children | lstm_trees (solution) |
parents | writing |
lstm_trees | |
content | LSTM Trees |
parents | more_difficult |
materialized_aggregates | |
content | Materialized Aggregates |
children | cache_common_queries, materialized_view |
parents | data_warehouse |
materialized_view | |
content | Materialized View |
children | datacube |
parents | materialized_aggregates |
cache_common_queries | |
content | Cache common aggregate queries (count, sum, avg, etc) |
parents | materialized_aggregates |
datacube | |
content | datacube/OLAP cube |
parents | materialized_view |
some_olap_traits | |
content | low volume, demanding queries, business analyists |
children | some_oltp_traits (contrasting) |
parents | OLAP |
some_oltp_traits | |
content | high volume, user-facing |
parents | OLTP, some_olap_traits |
storage_engine_schools_of_thought | |
content | storage engine schools of thought |
children | update_in_place, log_structured |
parents | OLTP |
update_in_place | |
content | Update in-place |
parents | storage_engine_schools_of_thought |