Thursday, July 10, 2014

DataStax Spark


Apache Spark is a project designed to accelerate Hadoop and other big data applications through the use of an in-memory, clustered data engine. It is a paradigm shift from the disk-based MapReduce process.

Spark is a fast and powerful engine for processing Hadoop data. It runs in Hadoop clusters through Hadoop YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Speed is key! Leveraging an efficient in-memory storage format, an optimized execution engine, and a cache-conscious memory layout, Apache Spark can run up to 100x faster than Hadoop MapReduce when the data is held in memory. The performance comparison here is based on a word count program run on both platforms.
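For reference, the logic of a word count program can be sketched in plain Python as a map step followed by a reduce-by-key step. This is only an illustration of the computation, not Spark or Hadoop code and not the benchmark program itself; the function names `map_words`, `reduce_by_key`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def reduce_by_key(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    """Chain the map and reduce steps over an iterable of text lines."""
    return reduce_by_key(map_words(lines))


if __name__ == "__main__":
    sample = ["spark is fast", "hadoop is batch", "spark is in memory"]
    print(word_count(sample))
```

In a disk-based MapReduce job the pairs produced by the map step are written to disk before the reduce step reads them, whereas an in-memory engine can keep that intermediate data in memory between the two steps.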

DataStax Apache Spark support means certified Spark software now ships with DSE 4.5, and it is supported by DataStax. DSE 4.5 (released on July 3) provides high-availability features for Spark that ensure resilience and failover.
