عجفت الغور

spark

Tags: scala

Overview

  • Similar to MapReduce, processes large volumes of data efficiently
  • However, MapReduce had issues:
    • Iterative data processing was difficult
    • Interactive, ad hoc queries on the same dataset were hard
  • How does Spark solve the issues with iterative data processing?
    • Example: k-means, which is inherently iterative
    • With MapReduce, we would need to run k-means as multiple cascading jobs, and MapReduce keeps no intermediate data between them. It also incurs heavy disk I/O, since every job has to write its output to disk (see the caching sketch after this list)
  • There was a desire for a framework that abstracted over distributed memory, could perform multiple operations, was low latency and fault tolerant, and explicitly let users choose a specific working set to load into memory
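A minimal sketch of the iterative pattern Spark targets, assuming a `SparkContext` named `sc` already exists; the file path and variable names are placeholders, not from the note:

```scala
import org.apache.spark.SparkContext

// The iterative pattern Spark targets: load the working set once, cache it
// in memory, and reuse it across iterations instead of re-reading from disk.
def iterativeSketch(sc: SparkContext): Unit = {
  val points = sc.textFile("hdfs://.../points.txt")   // placeholder path
    .map(_.split(",").map(_.toDouble))
    .cache()

  var result = 0.0
  for (_ <- 1 to 10) {
    // Each pass reuses the in-memory RDD; MapReduce would re-read and
    // re-write the data between every pair of cascaded jobs.
    result = points.map(_.sum).reduce(_ + _)
  }
  println(result)
}
```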

Design

Relies on resilient distributed datasets (RDDs)

  • An abstraction over the objects within the dataset
  • RDDs can be created in two ways: by transforming an existing RDD or by reading data in
  • A lineage graph is kept to record the source of every RDD
  • Implemented with
    • A list of partition objects, each containing its own subset of the data
    • An iterator that traverses the data in the partition
    • A list of preferred worker nodes where the partition can be accessed quickly, used for data-local task scheduling
  • The RDD abstraction resembles cluster computation where each node holds only part of the data: nodes operate on their local subsets, and the results are combined later
  • Creation
    • RDDs can be created from files within a distributed file system, by parallelizing a collection, or from another RDD (see the creation sketch after this list)
    • They can also be built by altering the persistence of an existing RDD, since RDDs are lazy and ephemeral by default
  • Partitioning
    • Spark automatically partitions the data inside RDDs and distributes it across the RAM of available nodes in the cluster. It also gives the user control over how the data is partitioned, as well as whether the data needs to be shuffled
  • Vs distributed memory system
    • In distributed memory systems, reads and writes are done to specific locations within a global address space. RDDs are written only through coarse-grained transformations over the whole dataset, though reads can still be fine grained
    • DSMs rely on checkpointing, which is costly to create and restore from. With RDDs, only lost partitions need to be recomputed from the lineage graph, which recovers big data with much lower storage overhead
    • RDDs are immutable; operations create new RDDs (recorded in the lineage graph), which makes it possible to run backup tasks to mitigate stragglers (speculative execution)
    • The system can exploit data locality when scheduling tasks, since it knows which nodes hold which partitions
    • You can also page/flush to disk for RDDs
    • Spark only offers bulk writes, no random writes like distributed memory systems
  • Similar to a dataframe chunk
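A sketch of the creation and partitioning points above, assuming an existing `SparkContext`; the HDFS path and partition counts are illustrative placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// The three creation routes, plus persistence and partitioning control.
def rddCreation(sc: SparkContext): Unit = {
  // 1. From a file in a distributed file system
  val lines = sc.textFile("hdfs://.../input.txt")   // placeholder path

  // 2. By parallelizing a local collection, with an explicit partition count
  val numbers = sc.parallelize(1 to 1000, numSlices = 8)

  // 3. By transforming an existing RDD (lazy: nothing runs yet)
  val lengths = lines.map(_.length)

  // Altering persistence: keep this RDD around instead of recomputing it,
  // spilling to disk if it does not fit in memory
  lengths.persist(StorageLevel.MEMORY_AND_DISK)

  // Repartitioning gives the user control over layout (forces a shuffle)
  val repartitioned = lengths.repartition(16)
}
```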

Dependencies

  • When an RDD is created, its relationship with the parent data forms a DAG, so we have either narrow or wide dependencies
    • Narrow dependencies allow pipelined execution: parent and child partitions can be computed together on the same set of machines. Operations on the parent RDD can be applied on an element-by-element basis. Partition recovery is also easy, since you only need to rerun the lost partition's parent (see the lineage sketch after this list)
    • Wide dependencies are more complicated, since they also shuffle the data. Recovering a lost partition may mean recomputing partitions across all parents
    • A side benefit is that you can compact DAGs if they get too big via snapshots, and then replicate those across the cluster
  • Clustering can be done with Mesos, YARN, or all sorts of other stuff.
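A small sketch of narrow vs. wide dependencies, assuming an existing `SparkContext` and a placeholder input path; `toDebugString` prints the lineage, with stage boundaries wherever a shuffle (wide dependency) occurs:

```scala
import org.apache.spark.SparkContext

def showDependencies(sc: SparkContext): Unit = {
  val words = sc.textFile("hdfs://.../corpus.txt")   // placeholder path
    .flatMap(_.split("\\s+"))

  // map: narrow dependency, each output partition comes from one input partition
  val pairs = words.map(w => (w, 1))

  // reduceByKey: wide dependency, child partitions need data from many parents
  val counts = pairs.reduceByKey(_ + _)

  // Print the lineage graph the driver has recorded for this RDD
  println(counts.toDebugString)
}
```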

Scheduler

  • The scheduler is responsible for turning actions on RDDs into a DAG of stages and tasks to execute
  • Fault tolerance retries are handled by the scheduler, which reschedules failed tasks

Actions

  • Spark builds a logical plan for execution; actions trigger the execution of that logical plan
  • Actions are anything that returns a non-RDD value, such as count(), reduce(), collect(), and lookup() (examples below)
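A sketch of the transformation/action split, assuming an existing `SparkContext`; the data and key scheme are made up for illustration:

```scala
import org.apache.spark.SparkContext

// Transformations only extend the logical plan; the actions at the end
// are what actually trigger execution and return values to the driver.
def actionsExample(sc: SparkContext): Unit = {
  val nums  = sc.parallelize(1 to 100)
  val evens = nums.filter(_ % 2 == 0)             // lazy, nothing runs yet

  println(evens.count())                          // action: returns a Long
  println(evens.reduce(_ + _))                    // action: returns an Int
  println(evens.collect().take(5).mkString(","))  // action: materializes results on the driver

  val pairs = evens.map(n => (n % 10, n))
  println(pairs.lookup(0))                        // action on a key-value RDD
}
```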

Driver - manager that orchestrates data sharding

  • Launches clusters of workers
  • Defines RDDs
  • Keeps lineage graph
  • Creates execution pipelines
  • Creates further tasks in each stage and sends them to workers running in parallel
  • Schedules tasks
  • Shares some variables, called shared variables, with all the worker nodes (see the sketch below)
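A minimal driver program sketch; the app name and `local[*]` master are stand-ins, a real deployment would point at a standalone, Mesos, or YARN master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver owns the SparkContext, the lineage graph, and the scheduler
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Defining RDDs and transformations only grows the lineage graph on the driver
    val doubled = sc.parallelize(1 to 1000).map(_ * 2)

    // The action makes the driver cut the graph into stages and ship tasks to workers
    println(doubled.sum())

    sc.stop()
  }
}
```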

Worker nodes

  • Receives tasks from the driver
  • Performs assigned tasks
  • Stores partitions of the memory abstraction
  • Note that usually, within datacenters, network topology is taken into account by scheduling algorithms.

Transformations

  • Lazy operations by default, split into two forms: narrow and wide
  • Narrow transformations have each input partition map to a single output partition
    • map, filter, union, join (on co-partitioned key-value RDDs only, so no shuffle is needed)
  • Wide transformations have each child partition depend on data from multiple parent partitions, so the data must be shuffled (see the join sketch after this list)
    • reduceByKey, groupByKey, join (arbitrary, non-co-partitioned)
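A sketch of the join distinction above, assuming an existing `SparkContext`; the sample data and partition count are illustrative:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

def joinExample(sc: SparkContext): Unit = {
  val partitioner = new HashPartitioner(8)

  // Both sides share the same partitioner, so join can match keys
  // partition-by-partition without another shuffle (narrow)
  val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(partitioner)
  val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(partitioner)
  val narrowJoin = users.join(orders)

  // The unpartitioned side must be shuffled across the cluster (wide)
  val rawOrders = sc.parallelize(Seq((1, 9.99), (2, 4.50)))
  val wideJoin = users.join(rawOrders)

  // reduceByKey / groupByKey are wide: child partitions draw from many parents
  val counts = orders.mapValues(_ => 1).reduceByKey(_ + _)
}
```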

Shared Variables

  • Two types of shared variables:
    • broadcast - a read-only variable kept on the driver, which is then cached on the workers
    • accumulators - a variable kept on the driver that workers can only add to; the driver reads back the accumulated result (see the sketch below)
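A sketch of both shared-variable kinds, assuming an existing `SparkContext`; the lookup table and accumulator name are made up for illustration:

```scala
import org.apache.spark.SparkContext

def sharedVariables(sc: SparkContext): Unit = {
  val lookupTable = sc.broadcast(Map("a" -> 1, "b" -> 2))
  val badRecords  = sc.longAccumulator("badRecords")

  val data = sc.parallelize(Seq("a", "b", "z", "a"))
  val resolved = data.map { key =>
    // Tasks read the broadcast's cached copy on their own worker
    lookupTable.value.get(key) match {
      case Some(v) => v
      case None =>
        badRecords.add(1)   // workers accumulate; only the driver sees the total
        0
    }
  }

  resolved.count()          // run a job so the accumulator is actually populated
  println(badRecords.value) // read back on the driver
}
```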

Refinements

  • When memory is limited, an LRU policy evicts old RDD partitions from memory
  • RDDs can also be flushed to disk; users specify a persistence priority
  • Checkpointing RDDs compacts the lineage graph by replacing long chains with saved data (see the sketch below)
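A sketch of persistence and checkpointing, assuming an existing `SparkContext`; the checkpoint directory is a placeholder, and the persistence-priority knob from the paper is approximated here by the storage-level choice:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def refinements(sc: SparkContext): Unit = {
  sc.setCheckpointDir("hdfs://.../checkpoints")   // placeholder path

  var rdd = sc.parallelize(1 to 1000000)

  // MEMORY_AND_DISK lets evicted partitions spill to disk instead of being dropped
  rdd.persist(StorageLevel.MEMORY_AND_DISK)

  // A long chain of transformations means a long lineage graph...
  for (_ <- 1 to 100) {
    rdd = rdd.map(_ + 1)
  }

  // ...so checkpointing writes the data out and truncates the graph there,
  // bounding how much must be recomputed after a failure
  rdd.checkpoint()
  rdd.count()   // the checkpoint materializes when an action runs
}
```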

Design wisdom

  • RDD abstraction: immutable, shardable, lazy datastructures that can be recomputed

Links to this note