عجفت الغور

spark

Tags: scala

Overview

  • Similar to MapReduce, processes large volumes of data efficiently
  • However, MapReduce had issues:
    • Iterative data processing was difficult
    • Interactive, ad hoc queries on the same dataset were hard
  • How does Spark solve the issues with iterative data processing?
    • Example: k-means, which is inherently iterative
    • With MapReduce, we would need to run k-means as multiple cascading jobs, and MapReduce keeps no intermediate data between them. It also incurs heavy disk I/O, since every job has to write its output to disk (see the caching sketch after this list)
  • There was a desire for a framework that abstracted over distributed memory, could perform multiple operations, was low latency and fault tolerant, and explicitly let users choose a specific working set to load into memory
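A minimal sketch of the iterative pattern Spark targets, assuming a `SparkContext` named `sc` already exists; the file path and variable names are placeholders, not from the note:

```scala
import org.apache.spark.SparkContext

// The iterative pattern Spark targets: load the working set once, cache it
// in memory, and reuse it across iterations instead of re-reading from disk.
def iterativeSketch(sc: SparkContext): Unit = {
  val points = sc.textFile("hdfs://.../points.txt")   // placeholder path
    .map(_.split(",").map(_.toDouble))
    .cache()

  var result = 0.0
  for (_ <- 1 to 10) {
    // Each pass reuses the in-memory RDD; MapReduce would re-read and
    // re-write the data between every pair of cascaded jobs.
    result = points.map(_.sum).reduce(_ + _)
  }
  println(result)
}
```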

Design

Relies on resilient distributed datasets (RDDs)

  • An abstraction over the objects within the dataset
  • RDDs can be created in two ways: by transforming an existing RDD or by reading data in
  • A lineage graph is kept to record the source of every RDD
  • Implemented with
    • A list of partition objects, each containing its own subset of the data
    • An iterator that traverses the data in the partition
    • A list of preferred worker nodes where the partition can be accessed quickly, used for data-local task scheduling
  • The RDD abstraction resembles cluster computation where each node holds only part of the data: nodes operate on their local subsets, and the results are combined later
  • Creation
    • RDDs can be created from files within a distributed file system, by parallelizing a collection, or from another RDD (see the creation sketch after this list)
    • They can also be built by altering the persistence of an existing RDD, since RDDs are lazy and ephemeral by default
  • Partitioning
    • Spark automatically partitions the data inside RDDs and distributes it across the RAM of available nodes in the cluster. It also gives the user control over how the data is partitioned, as well as whether the data needs to be shuffled
  • Vs distributed memory system
    • In distributed memory systems, reads and writes are done to specific locations within a global address space. RDDs are written only through coarse-grained transformations over the whole dataset, though reads can still be fine grained
    • DSMs rely on checkpointing, which is costly to create and restore from. With RDDs, only lost partitions need to be recomputed from the lineage graph, which recovers big data with much lower storage overhead
    • RDDs are immutable; operations create new RDDs (recorded in the lineage graph), which makes it possible to run backup tasks to mitigate stragglers (speculative execution)
    • The system can exploit data locality when scheduling tasks, since it knows which nodes hold which partitions
    • You can also page/flush to disk for RDDs
    • Spark only offers bulk writes, no random writes like distributed memory systems
  • Similar to a dataframe chunk
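A sketch of the creation and partitioning points above, assuming an existing `SparkContext`; the HDFS path and partition counts are illustrative placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// The three creation routes, plus persistence and partitioning control.
def rddCreation(sc: SparkContext): Unit = {
  // 1. From a file in a distributed file system
  val lines = sc.textFile("hdfs://.../input.txt")   // placeholder path

  // 2. By parallelizing a local collection, with an explicit partition count
  val numbers = sc.parallelize(1 to 1000, numSlices = 8)

  // 3. By transforming an existing RDD (lazy: nothing runs yet)
  val lengths = lines.map(_.length)

  // Altering persistence: keep this RDD around instead of recomputing it,
  // spilling to disk if it does not fit in memory
  lengths.persist(StorageLevel.MEMORY_AND_DISK)

  // Repartitioning gives the user control over layout (forces a shuffle)
  val repartitioned = lengths.repartition(16)
}
```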

Dependencies

  • When an RDD is created, its relationship with the parent data forms a DAG, so we have either narrow or wide dependencies
    • Narrow dependencies allow pipelined execution: parent and child partitions can be computed together on the same set of machines. Operations on the parent RDD can be applied on an element-by-element basis. Partition recovery is also easy, since you only need to rerun the lost partition's parent (see the lineage sketch after this list)
    • Wide dependencies are more complicated, since they also shuffle the data. Recovering a lost partition may mean recomputing partitions across all parents
    • A side benefit is that you can compact DAGs if they get too big via snapshots, and then replicate those across the cluster
  • Clustering can be done with Mesos, YARN, or all sorts of other stuff.
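A small sketch of narrow vs. wide dependencies, assuming an existing `SparkContext` and a placeholder input path; `toDebugString` prints the lineage, with stage boundaries wherever a shuffle (wide dependency) occurs:

```scala
import org.apache.spark.SparkContext

def showDependencies(sc: SparkContext): Unit = {
  val words = sc.textFile("hdfs://.../corpus.txt")   // placeholder path
    .flatMap(_.split("\\s+"))

  // map: narrow dependency, each output partition comes from one input partition
  val pairs = words.map(w => (w, 1))

  // reduceByKey: wide dependency, child partitions need data from many parents
  val counts = pairs.reduceByKey(_ + _)

  // Print the lineage graph the driver has recorded for this RDD
  println(counts.toDebugString)
}
```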

Scheduler

  • The scheduler is responsible for turning actions on RDDs into a DAG of stages and tasks to execute
  • Fault tolerance retries are handled by the scheduler, which reschedules failed tasks

Actions

  • Spark builds a logical plan for execution; actions trigger the execution of that logical plan
  • Actions are anything that returns a non-RDD value, such as count(), reduce(), collect(), and lookup() (examples below)
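A sketch of the transformation/action split, assuming an existing `SparkContext`; the data and key scheme are made up for illustration:

```scala
import org.apache.spark.SparkContext

// Transformations only extend the logical plan; the actions at the end
// are what actually trigger execution and return values to the driver.
def actionsExample(sc: SparkContext): Unit = {
  val nums  = sc.parallelize(1 to 100)
  val evens = nums.filter(_ % 2 == 0)             // lazy, nothing runs yet

  println(evens.count())                          // action: returns a Long
  println(evens.reduce(_ + _))                    // action: returns an Int
  println(evens.collect().take(5).mkString(","))  // action: materializes results on the driver

  val pairs = evens.map(n => (n % 10, n))
  println(pairs.lookup(0))                        // action on a key-value RDD
}
```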

Driver - manager that orchestrates data sharding

  • Launches clusters of workers
  • Defines RDDs
  • Keeps lineage graph
  • Creates execution pipelines
  • Creates further tasks in each stage and sends them to workers running in parallel
  • Schedules tasks
  • Shares some variables, called shared variables, with all the worker nodes (see the sketch below)
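A minimal driver program sketch; the app name and `local[*]` master are stand-ins, a real deployment would point at a standalone, Mesos, or YARN master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver owns the SparkContext, the lineage graph, and the scheduler
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Defining RDDs and transformations only grows the lineage graph on the driver
    val doubled = sc.parallelize(1 to 1000).map(_ * 2)

    // The action makes the driver cut the graph into stages and ship tasks to workers
    println(doubled.sum())

    sc.stop()
  }
}
```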

Worker nodes

  • Receives tasks from the driver
  • Performs assigned tasks
  • Stores partitions of the memory abstraction
  • Note that usually, within datacenters, network topology is taken into account by scheduling algorithms.

Transformations

  • Lazy operations by default, split into two forms: narrow and wide
  • Narrow transformations have each input partition map to a single output partition
    • map, filter, union, join (on co-partitioned key-value RDDs only, so no shuffle is needed)
  • Wide transformations have each child partition depend on data from multiple parent partitions, so the data must be shuffled (see the join sketch after this list)
    • reduceByKey, groupByKey, join (arbitrary, non-co-partitioned)
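A sketch of the join distinction above, assuming an existing `SparkContext`; the sample data and partition count are illustrative:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

def joinExample(sc: SparkContext): Unit = {
  val partitioner = new HashPartitioner(8)

  // Both sides share the same partitioner, so join can match keys
  // partition-by-partition without another shuffle (narrow)
  val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(partitioner)
  val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(partitioner)
  val narrowJoin = users.join(orders)

  // The unpartitioned side must be shuffled across the cluster (wide)
  val rawOrders = sc.parallelize(Seq((1, 9.99), (2, 4.50)))
  val wideJoin = users.join(rawOrders)

  // reduceByKey / groupByKey are wide: child partitions draw from many parents
  val counts = orders.mapValues(_ => 1).reduceByKey(_ + _)
}
```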

Shared Variables

  • Two types of shared variables:
    • broadcast - a read-only variable kept on the driver, which is then cached on the workers
    • accumulators - a variable kept on the driver that workers can only add to; the driver reads back the accumulated result (see the sketch below)
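A sketch of both shared-variable kinds, assuming an existing `SparkContext`; the lookup table and accumulator name are made up for illustration:

```scala
import org.apache.spark.SparkContext

def sharedVariables(sc: SparkContext): Unit = {
  val lookupTable = sc.broadcast(Map("a" -> 1, "b" -> 2))
  val badRecords  = sc.longAccumulator("badRecords")

  val data = sc.parallelize(Seq("a", "b", "z", "a"))
  val resolved = data.map { key =>
    // Tasks read the broadcast's cached copy on their own worker
    lookupTable.value.get(key) match {
      case Some(v) => v
      case None =>
        badRecords.add(1)   // workers accumulate; only the driver sees the total
        0
    }
  }

  resolved.count()          // run a job so the accumulator is actually populated
  println(badRecords.value) // read back on the driver
}
```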

Refinements

  • When memory is limited, an LRU policy evicts old RDD partitions from memory
  • RDDs can also be flushed to disk; users specify a persistence priority
  • Checkpointing RDDs compacts the lineage graph by replacing long chains with saved data (see the sketch below)
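A sketch of persistence and checkpointing, assuming an existing `SparkContext`; the checkpoint directory is a placeholder, and the persistence-priority knob from the paper is approximated here by the storage-level choice:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def refinements(sc: SparkContext): Unit = {
  sc.setCheckpointDir("hdfs://.../checkpoints")   // placeholder path

  var rdd = sc.parallelize(1 to 1000000)

  // MEMORY_AND_DISK lets evicted partitions spill to disk instead of being dropped
  rdd.persist(StorageLevel.MEMORY_AND_DISK)

  // A long chain of transformations means a long lineage graph...
  for (_ <- 1 to 100) {
    rdd = rdd.map(_ + 1)
  }

  // ...so checkpointing writes the data out and truncates the graph there,
  // bounding how much must be recomputed after a failure
  rdd.checkpoint()
  rdd.count()   // the checkpoint materializes when an action runs
}
```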

Design wisdom

  • RDD abstraction: immutable, shardable, lazy datastructures that can be recomputed

Links to this note