designing_data_intensive_applications/ch03

ch03

dz / designing_data_intensive_applications / ch03

Summary

Chapter 3: Storage and Retrieval

Node Tree

Nodes

storage_engine
content Storage Engine
children LSM_storage_engine, index, log_structured, page_oriented

log_structured
content log-structured
children log
parents storage_engine_schools_of_thought, storage_engine

page_oriented
content Page Oriented
children b_trees
parents storage_engine

log
content Log
children append_only, break_log_segments
parents log_structured

append_only
content append-only
parents log

b_trees
content B-Trees
children write_ahead_log, 4_level_tree, balanced, blocks_pages, btree_root, btree_vs_lsm_tree, fractal_trees (variation of b-trees (I think?)), identified_using_address, keep_kv_sorted_key, latches, most_widely_used_index_struct, secondary_key_index
parents sort_on_disk, page_oriented

index
content Index
children value, additional_struct_data_derived, efficient_data_struct_lookup, hash_index, incurs_overhead, key
parents storage_engine

efficient_data_struct_lookup
content Efficient data structure for lookup
parents index, tradeoff

incurs_overhead
content Incurs overhead, especially for writes
parents index, tradeoff

tradeoff
content Tradeoff
children efficient_data_struct_lookup, incurs_overhead

additional_struct_data_derived
content Additional structure derived from data
parents index

hash_index
content Hash Index
children compaction, kv_store
parents index

kv_store
content Key-Value store
children tombstone, bitcask, dictionary
parents hash_index

bitcask
content Bitcask
children keys_mem_values_disk
parents kv_store

dictionary
content Dictionary
parents kv_store

compaction
content Compaction
children throw_away_dup_keep_recent, merge_segments, order_timing_strategies_compaction_merge
parents hash_index

keys_mem_values_disk
content in-memory keys, values on disk with offsets
parents bitcask

break_log_segments
content Break log into segments
children SSTables, merge_segments
parents log

throw_away_dup_keep_recent
content Throw away duplicate keys, keep most recent
parents compaction

tombstone
content Tombstone
children deletion_record
parents kv_store

deletion_record
content Deletion record after removing key
parents tombstone

merge_segments
content Merge Segments
children most_recent_multi_segs, order_timing_strategies_compaction_merge
parents compaction, break_log_segments

SSTables
content SSTables
children sorted_string_table (decrypting the abbreviation), uses_SSTable, advantages, each_key_appears_once, how_to_sort, keep_kv_sorted_key (Similarity to SSTables), memtable, order_timing_strategies_compaction_merge (performance)
parents break_log_segments

sorted_string_table
content Sorted String Table
parents SSTables

advantages
content Advantages
children merging_segments_simple, no_need_all_keys_in_memory, records_grouped_compressed
parents SSTables

each_key_appears_once
content Each key appears only once
children most_recent_multi_segs
parents SSTables

LSM_Trees
content LSM-Trees
children term_dictionary, uses_SSTable, LSM_Trees_perf, LSM_storage_engine, btree_vs_lsm_tree, cascade_SSTables, log_structured_merge_tree (decrypting the abbreviation), secondary_key_index (also used with LSM-Trees)

most_recent_multi_segs
content Most recent key used when dealing with multiple segments
parents each_key_appears_once, merge_segments

merging_segments_simple
content Merging segments simple and efficient
parents advantages

no_need_all_keys_in_memory
content in-memory representation of keys no longer needed
children one_key_offset_every_few_kb
parents advantages

one_key_offset_every_few_kb
content One key offset every few kb is sufficient density
children sparse_in_mem_index (terminology)
parents no_need_all_keys_in_memory

sparse_in_mem_index
content sparse in-memory index
parents one_key_offset_every_few_kb, records_grouped_compressed

records_grouped_compressed
content Records can be grouped and compressed
children sparse_in_mem_index
parents advantages

how_to_sort
content How to sort?
children sort_in_memory, sort_on_disk
parents SSTables

sort_in_memory
content In memory
children memtable
parents sort_on_disk, how_to_sort

sort_on_disk
content On Disk
children sort_in_memory (in-memory sorting easier than on-disk), b_trees (b-trees main structure for on-disk sorting)
parents how_to_sort

memtable
content memtable
children db_crash_recent_writes_lost, in_mem_tree, memtable_read, memtable_write
parents SSTables, sort_in_memory

in_mem_tree
content in-memory tree
children AVL, redblack
parents memtable

redblack
content Red-Black tree
parents in_mem_tree

AVL
content AVL
parents in_mem_tree

memtable_write
content Write
parents memtable

memtable_read
content Read
parents memtable

db_crash_recent_writes_lost
content issue: if database crashes the most recent writes (still in memory) will be lost.
children crash_solution
parents memtable

crash_solution
content Solution: use append-only log for most recent writes
parents db_crash_recent_writes_lost

log_structured_merge_tree
content Log Structured Merge Tree
parents LSM_Trees

LSM_storage_engine
content LSM storage engine
parents LSM_Trees, storage_engine

uses_SSTable
content Uses SSTable
parents SSTables, LSM_Trees

LSM_Trees_perf
content Performance
children nonexist_keys_slow
parents LSM_Trees

term_dictionary
children term_desc (description), lucene
parents LSM_Trees

lucene
content Lucene
parents term_dictionary

term_desc
content Key: Term, Value: List of ID's of all the documents containing term (word) (postings list)
children kept_in_SSTable
parents term_dictionary

kept_in_SSTable
content Kept in SSTable-like sorted files
parents term_desc

nonexist_keys_slow
content Nonexistant key look-up slow
children bloom_filters (optimization solution)
parents LSM_Trees_perf

bloom_filters
content Bloom filters
children bloom_desc (description)
parents nonexist_keys_slow

bloom_desc
content Memory-efficient data structure for approximating the contents in a set
parents bloom_filters

order_timing_strategies_compaction_merge
content Strategies for the order and time of SSTables compaction and merge
children size_tiered, leveled
parents merge_segments, compaction, SSTables

size_tiered
content Size Tiered
children size_tiered_elab (Elaboration)
parents order_timing_strategies_compaction_merge

size_tiered_elab
content Newer/smaller merged into older/larger
parents size_tiered

leveled
content Leveled
children leveled_elab (elaboration)
parents order_timing_strategies_compaction_merge

leveled_elab
content Key range split into smaller tables. Old data moved int levels
parents leveled

cascade_SSTables
content Cascade of SSTables
parents LSM_Trees

most_widely_used_index_struct
content Most widely used indexing structure
parents b_trees

keep_kv_sorted_key
content Keeps key/value pairs sorted by key
parents SSTables, b_trees

blocks_pages
content Blocks/pages
children branching_factor, btree_root, fixed_size_page, leaf_page, orphan_page
parents b_trees

fixed_size_page
content Fixed size (4kb traditionally)
parents blocks_pages

identified_using_address
content identified using an on-disk address/location
children refs_make_tree
parents b_trees

refs_make_tree
content References make tree
parents identified_using_address

btree_root
content Root
parents b_trees, blocks_pages

leaf_page
content Leaf page: page containing individual keys
parents blocks_pages

branching_factor
content Branching factor: number of references to child pages in one page
parents blocks_pages

balanced
content balanced
children n_keys_depth
parents b_trees

n_keys_depth
content N-keys has depth O(n*log(n))
parents balanced

4_level_tree
content 4-level tree of 4kb pages with branching factor of 500 can store up to 256TB
parents b_trees

orphan_page
content Orphan Page
children not_child_to_parent (description), orphan_caused_when (caused by)
parents blocks_pages

not_child_to_parent
content Not a child of any parent page
parents orphan_page

orphan_caused_when
content Can be caused when writing page split gets interrupted.
parents orphan_page

write_ahead_log
content Write-ahead log (WAL)
children redo_log (AKA (I think?))
parents b_trees

redo_log
content redo log
parents write_ahead_log

latches
content latches
children lightweight_locks (description)
parents b_trees

lightweight_locks
content Lightweight locks for concurrent access
parents latches

fractal_trees
content fractal trees
parents b_trees

btree_vs_lsm_tree
content B-Trees vs LSM-Trees
children write_amplification, btree_vs, lsm_tree_vs, more_predictable
parents b_trees, LSM_Trees

btree_vs
content B-Tree
children faster_reads, key_exists_in_one_place, more_mature
parents btree_vs_lsm_tree

lsm_tree_vs
content LSM Tree
children better_compression, compaction_interference, faster_writes, higher_throughput_lower_wa, misconfigured_compaction
parents btree_vs_lsm_tree

more_mature
content More Mature
parents btree_vs

faster_reads
content faster reads
parents btree_vs

write_amplification
content Write Amplification
children higher_throughput_lower_wa, one_db_write_several_disk_writes (description)
parents btree_vs_lsm_tree

one_db_write_several_disk_writes
content One DB write yields mulitple disk writes over DB lifetime
parents write_amplification

higher_throughput_lower_wa
content Sustain higher throughput, lower write amplification
parents lsm_tree_vs, write_amplification

better_compression
content Better compression
parents lsm_tree_vs

faster_writes
content Faster Writes
parents lsm_tree_vs

more_predictable
content More predictable
children impacts_higer_percentiles, queries (more predictable queries)
parents btree_vs_lsm_tree, compaction_interference

compaction_interference
content Compaction sometimes interferes with read/write performance
children impacts_higer_percentiles, more_predictable (compaction makes r/w performance less predictable)
parents lsm_tree_vs

impacts_higer_percentiles
content Impacts higher percentiles
parents more_predictable, compaction_interference
remarks see previous chapters on percentiles

queries
content queries
parents more_predictable

key_exists_in_one_place
content Key exists in exactly one place in the index
children strong_transactional_semantics
parents btree_vs

strong_transactional_semantics
content Strong Transactional Semantics
parents key_exists_in_one_place

misconfigured_compaction
content Misconfigured compaction causes disks to fill up
parents lsm_tree_vs

primary_key_index
content Primary key index
children secondary_key_index

secondary_key_index
content Secondary key index
children avoids_duplication
parents primary_key_index, b_trees, LSM_Trees

key
content Key
children queries_for
parents index

value
content Value
children row
parents index

queries_for
content Queries For
parents key

row
content Row
children row_ref (or)
parents value

row_ref
content Ref to row, stored elsewhere
children heap_file
parents row

heap_file
content Heap File
children avoids_duplication, indexed_row_in_index
parents row_ref

avoids_duplication
content Avoids duplication when multiple secondary indexes are present
parents heap_file, secondary_key_index

indexed_row_in_index
content If hop from index to heap too expensive, store indexed row in index
children clustered_index, multi_column_index
parents heap_file

clustered_index
content Clustered Index
parents indexed_row_in_index

clustered_nonclustered_compromise
content Compromise between clustered and nonclustered
parents covering_index

multi_column_index
content Multi-Column Index
children concat_index, geospatial_data
parents indexed_row_in_index

concat_index
content Concatenated Index
parents multi_column_index

covering_index
content covering index, or index with included columns
children clustered_nonclustered_compromise

geospatial_data
content geospatial data
parents multi_column_index

fts_fuzzy
content Full-text search and fuzzy indexes
children similar_keys, fuzzy_querying, levenshtein_automation

similar_keys
content similar keys
parents fts_fuzzy

fuzzy_querying
content Fuzzy Querying
parents fts_fuzzy

levenshtein_automation
content levenshtein automation
children efficient_search_words_edit_distance (description)
parents fts_fuzzy

efficient_search_words_edit_distance
content Efficient search of words within edit distance
parents levenshtein_automation

keeping_everything_memory
content Keeping everything in memory
children NVM, anti_caching, in_memory_db

in_memory_db
content In-memory databses
children avoids_encoding_overhead, disk_durablity, implement_models_difficult_for_disks
parents keeping_everything_memory

disk_durablity
content Use disk as append-only log for durability
parents in_memory_db

implement_models_difficult_for_disks
content Can provide models that are difficult to implement with disk-based indexes
children redis_data_structs (examples)
parents in_memory_db

redis_data_structs
content Redis: priority queues and sets
parents implement_models_difficult_for_disks

avoids_encoding_overhead
content Performance advantage comes from avoiding overheads of encoding data in a form that can be written to disk
parents in_memory_db

anti_caching
content Anti-caching
children datasets_larger_than_mem (situation when to use it), evict_least_used
parents keeping_everything_memory

datasets_larger_than_mem
content Datasets larger than memory
parents anti_caching

evict_least_used
content Evict least used data from memory and write to disk
parents anti_caching

NVM
content NVM: non-volatile memory
parents keeping_everything_memory

transaction
content Transaction
children OLTP, commercial_transaction (business origins), data_analytics, rw_group_logical_unit (description)

commercial_transaction
content Commercial transaction
parents transaction

rw_group_logical_unit
content Group of reads/writes that form a logical unit
parents transaction

OLTP
content Online Transaction Processing (OLTP)
children some_oltp_traits, storage_engine_schools_of_thought, OLTP_ops
parents transaction

data_analytics
content Data Analytics
children business_intel
parents transaction

business_intel
content Business intelligence
children OLAP
parents data_analytics

OLAP
content Online analytic processing (OLAP)
children some_olap_traits, specialized_db_olap, column_oriented_storage
parents business_intel

data_warehouse
content Data Warehouse
children specialized_db_olap (description), star_schema, column_oriented_storage, doesnt_affect_user_facing_db, extract_transform_load, materialized_aggregates

specialized_db_olap
content specialized database for OLAP
parents OLAP, data_warehouse

doesnt_affect_user_facing_db
content Doesn't affect performance of user-facing OLTP databse
parents data_warehouse

extract_transform_load
content extract, transform, load
parents data_warehouse

star_schema
content Star Schema
children snowflake_schema, surround_like_star_rays (Where "star schema" gets its name), dimensional_modeling (AKA), fact_table
parents data_warehouse

dimensional_modeling
content Dimensional Modeling
parents star_schema

fact_table
content Fact Table
children dimension_tables
parents star_schema

snowflake_schema
content Snowflake schema
children dimensions_broken_into_subdimensions
parents star_schema

dimension_tables
content Dimension Tables
children surround_like_star_rays
parents fact_table

surround_like_star_rays
content Surround facts table like rays of a star
parents star_schema, dimension_tables

OLTP_ops
content OLTP operations
children slicing, dicing, drill_down
parents OLTP

drill_down
content Drill Down
parents OLTP_ops

slicing
content Slicing
parents OLTP_ops

dicing
content dicing
parents OLTP_ops

column_oriented_storage
content Column-oriented storage
children sort_order, sorted_different_ways, writing, access_few_columns_at_a_time, compresses_well
parents data_warehouse, OLAP

compresses_well
content Tends to compress well
children bitmap_encoding
parents column_oriented_storage

access_few_columns_at_a_time
content Queries tend to access only 3-5 columns at a time
parents column_oriented_storage

bitmap_encoding
content Bitmap Encoding
parents compresses_well

dimensions_broken_into_subdimensions
content Dimensions broken down into subdimensions
parents snowflake_schema

sort_order
content sort order
children helpful_for_col_compr
parents column_oriented_storage

helpful_for_col_compr
content Can help with compression of columns
parents sort_order

writing
content writing
children more_difficult
parents column_oriented_storage

sorted_different_ways
content sorted in different ways
children optimized_diff_queries
parents column_oriented_storage

optimized_diff_queries
content Optimized for different queries
parents sorted_different_ways

more_difficult
content more difficult
children lstm_trees (solution)
parents writing

lstm_trees
content LSTM Trees
parents more_difficult

materialized_aggregates
content Materialized Aggregates
children cache_common_queries, materialized_view
parents data_warehouse

materialized_view
content Materialized View
children datacube
parents materialized_aggregates

cache_common_queries
content Cache common aggregate queries (count, sum, avg, etc)
parents materialized_aggregates

datacube
content datacube/OLAP cube
parents materialized_view

some_olap_traits
content low volume, demanding queries, business analyists
children some_oltp_traits (contrasting)
parents OLAP

some_oltp_traits
content high volume, user-facing
parents OLTP, some_olap_traits

storage_engine_schools_of_thought
content storage engine schools of thought
children update_in_place, log_structured
parents OLTP

update_in_place
content Update in-place
parents storage_engine_schools_of_thought