Spark

Apache Spark™ is a unified analytics engine for large-scale data processing.

Overview

Features

Speed: Runs workloads up to 100x faster than the original Hadoop MapReduce.
Compatibility: Written in Scala, with APIs for Java, Python, R, and SQL as well.
Generality: Combine SQL, streaming, and complex analytics.
Runs Everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
- You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.
- Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
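
Each deployment option listed above is selected through Spark's master URL (the `--master` option of `spark-submit`, or the `spark.master` setting). The sketch below shows the URL shape for each cluster manager. The helper `master_url` and its parameters are hypothetical names used only for illustration and are not part of Spark. The default ports (7077 for standalone, 5050 for Mesos) are the commonly documented defaults and may differ on a given cluster.

```python
# Hypothetical helper, not part of Spark: builds the string passed as --master.
# Prefix and default port for each cluster manager that is addressed by host:port.
_SCHEMES = {
    "standalone": ("spark://", 7077),
    "mesos": ("mesos://", 5050),
    "kubernetes": ("k8s://https://", None),  # no default: port must be given
}


def master_url(mode, host=None, port=None, cores=None):
    """Return a Spark master URL for the given deployment mode."""
    if mode == "local":
        # local[*] uses one worker thread per logical core; local[K] uses K threads.
        return "local[*]" if cores is None else f"local[{cores}]"
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if mode not in _SCHEMES:
        raise ValueError(f"unknown mode: {mode!r}")
    if host is None:
        raise ValueError(f"{mode} mode needs a host")
    prefix, default_port = _SCHEMES[mode]
    port = default_port if port is None else port
    if port is None:
        raise ValueError(f"{mode} mode needs an explicit port")
    return f"{prefix}{host}:{port}"
```

The result would be used as, for example, `spark-submit --master spark://master.example.com:7077 app.py`, where the host name and `app.py` are placeholders. EC2 is not a separate cluster manager, so it has no URL form of its own here.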

Libraries

SQL and DataFrames
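
A minimal PySpark sketch of this library, assuming PySpark is installed and a local master is acceptable. It runs the same filter once through the DataFrame API and once through SQL. The function name, the sample rows, and the `people` view name are made up for illustration.

```python
def demo_sql_and_dataframes():
    """Return the names of people over 30, computed two equivalent ways."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")            # assumption: run locally on all cores
        .appName("sql-dataframes-demo")
        .getOrCreate()
    )
    try:
        # Illustrative in-memory data; a real job would read from a data source.
        df = spark.createDataFrame(
            [("alice", 34), ("bob", 45), ("carol", 29)],
            ["name", "age"],
        )

        # 1) DataFrame API
        via_api = df.filter(df.age > 30).select("name")

        # 2) SQL over the same data, exposed as a temporary view
        df.createOrReplaceTempView("people")
        via_sql = spark.sql("SELECT name FROM people WHERE age > 30")

        return (
            sorted(row.name for row in via_api.collect()),
            sorted(row.name for row in via_sql.collect()),
        )
    finally:
        spark.stop()
```

Both queries go through the same Spark SQL engine, so choosing between the DataFrame API and SQL is mostly a matter of style.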

MLlib for machine learning

GraphX

Spark Streaming

Spark Modes

Links