lec15
Lecture 15: Spark
Node Tree
- spark
  - HDFS
  - driver
  - fault_tolerance
  - mapreduce_successor
  - pagerank
  - spark_exec
Nodes
| spark | |
| content | spark |
| children | HDFS, driver, fault_tolerance, mapreduce_successor, pagerank, spark_exec |
| mapreduce_successor | |
| content | Successor to MapReduce |
| children | generalizes_mapreduce |
| parents | spark |
| generalizes_mapreduce | |
| content | Generalizes map + reduce steps |
| parents | mapreduce_successor |
| pagerank | |
| content | PageRank algorithm |
| children | difficult_in_MR, estimates_page_importance (description), spark_exec (demo: pagerank implemented in Spark) |
| parents | spark |
| estimates_page_importance | |
| content | Estimates the importance of a page |
| parents | pagerank |
| difficult_in_MR | |
| content | difficult to implement in MapReduce |
| parents | pagerank |
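
The PageRank nodes above can be made concrete with a small sketch. This is plain Python, not Spark, and the function name `pagerank`, the `links` dictionary, and the damping value 0.85 are assumptions chosen for illustration. The comments point out which step loosely corresponds to the join, reduce-by-key, and map-values operations listed under spark_exec. The loop also hints at why the notes call the algorithm difficult in MapReduce: each pass depends on the ranks produced by the previous pass.

```python
from collections import defaultdict

def pagerank(links, iterations=10, damping=0.85):
    """Toy PageRank over an adjacency dict {page: [outgoing pages]}."""
    # Every page that appears as a source or a target gets a rank.
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        contribs = defaultdict(float)
        # "join": pair each page's outgoing links with its current rank.
        for src, dsts in links.items():
            for dst in dsts:
                # "reduce by key": sum the contributions arriving at each target.
                contribs[dst] += ranks[src] / len(dsts)
        # "map values": turn each summed contribution into a new rank.
        ranks = {page: (1 - damping) + damping * contribs[page] for page in pages}
    return ranks

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this example graph, `c` receives links from both `a` and `b`, so it ends up with a higher rank than `b`.
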
| driver | |
| content | driver: computer that runs the program |
| parents | spark |
| spark_exec | |
| content | How does Spark execute? |
| children | cache, collect, distinct, exec_looks_like, group_by_key, join, map_values, persist_data, readfile, reduce_by_key |
| parents | pagerank, spark |
| distinct | |
| content | distinct |
| children | info_all_workers (distinct is a wide operation) |
| parents | spark_exec, wide |
| readfile | |
| content | Read File |
| children | doesnt_read, lineage_graph |
| parents | spark_exec |
| lineage_graph | |
| content | lineage graph |
| children | doesnt_read, looks_at_lineage_graph |
| parents | readfile |
| doesnt_read | |
| content | Doesn't initially read; only produces the lineage graph |
| children | doesnt_process_data (AKA) |
| parents | readfile, lineage_graph |
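
The readfile and lineage_graph nodes describe lazy evaluation: reading a file only records a step, and no data is processed until a result is requested. A minimal sketch of that idea follows, in plain Python. `LazyDataset`, `read_file`, and the sample lines are hypothetical names invented for this illustration; they are not Spark's API.

```python
class LazyDataset:
    """Hypothetical stand-in for a Spark dataset: records steps, runs them only on collect()."""

    def __init__(self, source, lineage=()):
        self.source = source      # zero-argument function that produces the input lines
        self.lineage = lineage    # tuple of (name, function) steps recorded so far

    def map(self, fn):
        # Transformation: extend the lineage, touch no data.
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: extend the lineage, touch no data.
        return LazyDataset(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        # Action: only now is the source read and every recorded step applied.
        data = list(self.source())
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

reads = []

def read_file():
    reads.append("read")          # track when the "file" is actually read
    return ["a b", "b c", "c a"]

pairs = (
    LazyDataset(read_file)
    .map(lambda line: tuple(line.split()))
    .filter(lambda pair: pair[0] != "c")
)
reads_before_collect = len(reads)   # still 0: only the lineage exists so far
result = pairs.collect()            # the read and both steps happen here
```

Because the whole chain of steps is known before anything runs, an executor can inspect it first, which is the point the optimization node makes about looking at the lineage graph.
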
| join | |
| content | join |
| parents | spark_exec |
| group_by_key | |
| content | group-by key |
| parents | spark_exec |
| collect | |
| content | collect |
| parents | spark_exec |
| cache | |
| content | cache |
| children | persist_data |
| parents | spark_exec |
| persist_data | |
| content | Persist Data |
| parents | spark_exec, cache |
| reduce_by_key | |
| content | reduce by key |
| parents | spark_exec |
| map_values | |
| content | map values |
| parents | spark_exec |
| doesnt_process_data | |
| content | doesn't process data |
| parents | doesnt_read |
| exec_looks_like | |
| content | What does execution look like? |
| children | optimization, transformations |
| parents | spark_exec |
| transformations | |
| content | Transformations |
| children | narrow, wide |
| parents | exec_looks_like |
| narrow | |
| content | narrow |
| children | individual_workers, wide (vs) |
| parents | transformations |
| wide | |
| content | wide |
| children | distinct (distinct is wide transformation), expensive |
| parents | transformations, narrow |
| individual_workers | |
| content | Individual Workers |
| parents | narrow |
| expensive | |
| content | Expensive |
| parents | wide |
| info_all_workers | |
| content | Needs to know info from all workers |
| parents | distinct |
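
The narrow versus wide distinction above can be sketched with lists standing in for workers. This is a toy model in plain Python under stated assumptions: `partitions`, `narrow_map`, and `wide_distinct` are hypothetical names, and the sum of character codes is a deterministic stand-in for a real hash partitioner.

```python
# One inner list per hypothetical worker.
partitions = [["a", "b", "a"], ["b", "c"], ["c", "a"]]

def narrow_map(parts, fn):
    # Narrow: each worker transforms its own partition; no data crosses partitions.
    return [[fn(x) for x in part] for part in parts]

def wide_distinct(parts):
    # Wide: every record is shuffled to the partition chosen by its value,
    # so that all copies of a value meet on the same worker.
    shuffled = [set() for _ in parts]
    for part in parts:
        for x in part:
            target = sum(map(ord, x)) % len(parts)  # stand-in for a hash partitioner
            shuffled[target].add(x)
    return [sorted(bucket) for bucket in shuffled]

upper = narrow_map(partitions, str.upper)
unique = wide_distinct(partitions)
```

The narrow step never looks outside a single inner list, while the distinct step cannot decide whether a value is a duplicate without seeing records from every partition, which is where the extra cost comes from.
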
| optimization | |
| content | Optimization |
| children | looks_at_lineage_graph |
| parents | exec_looks_like |
| looks_at_lineage_graph | |
| content | Looks at lineage graph |
| parents | optimization, lineage_graph |
| fault_tolerance | |
| content | Fault tolerance |
| children | driveer_not_replicated, failed_worker_wide_deps, input_assumed_ft, not_bulletproof, tolerate_common_errors |
| parents | spark |
| tolerate_common_errors | |
| content | Tolerate common errors |
| parents | fault_tolerance |
| HDFS | |
| content | HDFS |
| children | input_assumed_ft |
| parents | spark |
| input_assumed_ft | |
| content | Input assumed to be fault-tolerant via HDFS |
| parents | HDFS, fault_tolerance |
| not_bulletproof | |
| content | Doesn't have to be bullet-proof |
| children | driveer_not_replicated (for example) |
| parents | fault_tolerance |
| driveer_not_replicated | |
| content | Driver machine not replicated |
| parents | not_bulletproof, fault_tolerance |
| failed_worker_wide_deps | |
| content | Failed worker with wide dependencies |
| children | recompute_days_worth |
| parents | fault_tolerance |
| recompute_days_worth | |
| content | Can end up recomputing a day's worth of computation |
| children | checkpoints (mitigations against) |
| parents | failed_worker_wide_deps |
| checkpoints | |
| content | Checkpoints for specific transformations |
| parents | recompute_days_worth |
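
The recomputation and checkpoint nodes above can be illustrated with a small model of replaying a lineage. This is a sketch in plain Python, not Spark's recovery mechanism: `recompute`, the four-step `lineage`, and the `saved` dictionary are hypothetical and exist only to show how a checkpoint shortens the replay after a result is lost.

```python
def recompute(lineage, source, checkpoints, upto):
    """Rebuild the output of step `upto`, starting from the nearest saved checkpoint."""
    # Find the latest checkpointed step at or before the requested step.
    start = max((i for i in checkpoints if i <= upto), default=-1)
    data = checkpoints[start] if start >= 0 else source
    steps_run = 0
    for fn in lineage[start + 1 : upto + 1]:
        data = fn(data)
        steps_run += 1
    return data, steps_run

# Hypothetical lineage: each function is one transformation over the whole dataset.
lineage = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 3 for x in xs],
    lambda xs: sorted(set(xs)),
]
source = [1, 2, 2]

# Without checkpoints, a lost result means replaying every step from the input.
full, full_steps = recompute(lineage, source, {}, upto=3)

# With the output of step 1 saved, only steps 2 and 3 are replayed.
saved = {1: [4, 6, 6]}
fast, fast_steps = recompute(lineage, source, saved, upto=3)
```

Both calls produce the same final data; the checkpointed run simply replays fewer steps, which matters when the skipped prefix represents hours of work.
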