Apache Spark Architecture

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark's performance is up to 100 times faster in memory and up to 10 times faster on disk when compared to Hadoop MapReduce. It is a general-purpose distributed computing engine used for processing and analyzing large amounts of data. Spark uses a master/slave architecture, i.e. one central coordinator and many distributed workers. Here, the central coordinator is called the driver.

Features of Spark

In-Memory Computing

Spark speeds up processing by keeping intermediate data in memory, spilling to disk only when the data no longer fits.

Speed
Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing.

Reusability
Spark code can be reused for batch processing, for joining streams against historical data, or for running ad-hoc queries on stream state.

Advanced Analytics
Spark is not limited to 'map' and 'reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
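
For instance, the same engine that runs batch jobs can also answer ad-hoc SQL queries. Here is a minimal sketch in Scala; the table name, sample rows, and the local master setting are illustrative, not part of any real deployment:

    import org.apache.spark.sql.SparkSession

    object SqlExample {
      def main(args: Array[String]): Unit = {
        // Local session for demonstration only
        val spark = SparkSession.builder()
          .appName("SqlExample")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Register a small DataFrame as a temporary SQL view
        val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
        people.createOrReplaceTempView("people")

        // Run an ad-hoc SQL query on the same engine used for batch work
        spark.sql("SELECT name FROM people WHERE age > 40").show()

        spark.stop()
      }
    }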

Power Caching
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. You can mark an RDD to be persisted using the persist() or cache() methods. The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY.
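
A minimal caching sketch in Scala; the input path and the "ERROR" filter are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo").setMaster("local[*]"))

    val lines  = sc.textFile("data/input.txt")
    val errors = lines.filter(_.contains("ERROR"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    errors.cache()
    // equivalent explicit form: errors.persist(StorageLevel.MEMORY_ONLY)

    // The first action computes and caches the RDD; later actions reuse it
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())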

Real Time Stream Processing
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
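
A small sketch along the lines of the word-count example in the Spark Streaming programming guide; the host and port are illustrative (you could feed it text with `nc -lk 9999` while it runs):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Receive lines of text from a TCP socket
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the stream is stopped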

Fault Tolerance
Spark provides fault tolerance at several levels, such as the data level, the node level, and the RDD level.
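
One concrete RDD-level mechanism is checkpointing, which writes an RDD to reliable storage so it can be restored without replaying its whole lineage. A small sketch; the checkpoint directory is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // use reliable storage (e.g. HDFS) on a real cluster

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    rdd.checkpoint() // materialized to the checkpoint dir, truncating the lineage

    rdd.count() // the action triggers both the computation and the checkpoint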

Lazy Evaluation
All transformations on Spark RDDs are lazy: they do not compute a result right away, but instead record the operation by creating a new RDD from the existing one. The work is only performed when an action is called, which lets Spark optimize the whole chain and increases the efficiency of the system.
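
A short sketch of this behaviour; the names and values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)
    val doubled = numbers.map(_ * 2)         // transformation: nothing runs yet
    val evens   = doubled.filter(_ % 4 == 0) // still nothing runs

    // Only the action below triggers actual computation of the whole chain
    println(evens.collect().mkString(", "))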


Spark Architecture

Apache Spark has a well-defined and layered architecture in which all the Spark components and layers are loosely coupled and integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions:

  • Resilient Distributed Datasets (RDD)
  • Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.

RDD stands for:

Resilient: fault tolerant, capable of rebuilding data on failure

Distributed: data is distributed among multiple nodes in a cluster

Dataset: a collection of partitioned data with values
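
Putting those three properties together, here is a minimal sketch of creating and operating on an RDD; the numbers and partition count are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("RddDemo").setMaster("local[*]"))

    // Distribute a local collection across 4 partitions of the cluster
    val data = sc.parallelize(1 to 100, 4)

    // Each partition is transformed in parallel; reduce combines the results
    val sum = data.map(_ + 1).reduce(_ + _)
    println(sum)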

Apache Spark DAG: Directed Acyclic Graph

In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges (also called arcs), with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

In Spark, a DAG is the graph that records the sequence of operations applied to RDDs.

Directed – each edge points from one node to another, so every node is linked from an earlier step to a later one in the sequence.

Acyclic – there are no cycles or loops; once a transformation has taken place, execution cannot return to an earlier position.

Graph – from graph theory, a combination of vertices and edges; the pattern of these connections, taken in sequence, forms the graph.
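
You can inspect the DAG Spark has recorded for an RDD by printing its lineage with toDebugString; the input path in this sketch is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("DagDemo").setMaster("local[*]"))

    val words  = sc.textFile("data/input.txt").flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Prints the chain of parent RDDs, i.e. the recorded DAG/lineage
    println(counts.toDebugString)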

Go to my next article, Understanding SparkContext and the Application's Driver Process.
