Apache Drill - provides low latency ad-hoc queries to many different data sources,
including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000
servers and query petabytes of data in seconds.
Apache Hama - is a pure BSP (Bulk Synchronous Parallel) computing framework on top
of HDFS for massive scientific computations such as matrix, graph and network algorithms.
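The BSP model Hama implements can be sketched in a few lines: computation proceeds in supersteps, each with a local-compute phase followed by a global message exchange at a synchronization barrier. The following is a plain-Python illustration of that scheme, not Hama's actual Java API; the three-peer averaging job and its function names are invented for the example.

```python
# Toy illustration of the BSP (Bulk Synchronous Parallel) model:
# each superstep runs a compute phase on every peer, then delivers
# all messages at the barrier before the next superstep begins.

def bsp_run(peers, compute, supersteps):
    """peers: {peer_id: local_state}.
    compute(peer_id, state, inbox) -> (new_state, {dest_peer: [messages]})."""
    inboxes = {p: [] for p in peers}
    for _ in range(supersteps):
        outboxes = {p: [] for p in peers}
        for p, state in peers.items():
            new_state, outgoing = compute(p, state, inboxes[p])
            peers[p] = new_state
            for dest, msgs in outgoing.items():
                outboxes[dest].extend(msgs)
        inboxes = outboxes  # barrier: messages are delivered between supersteps

# Example job: each peer repeatedly averages the values it received from
# the others -- a miniature of the iterative workloads Hama targets.
def average_step(peer, value, inbox):
    if inbox:
        value = sum(inbox) / len(inbox)
    # broadcast the new value to every other peer (ids 0-2 are hardcoded
    # for this three-peer example)
    return value, {dest: [value] for dest in (0, 1, 2) if dest != peer}

peers = {0: 0.0, 1: 3.0, 2: 6.0}
bsp_run(peers, average_step, supersteps=5)
print(peers)  # all peers converge toward the global mean (3.0)
```

The barrier is what makes the model "bulk synchronous": no peer sees a message until every peer has finished the current superstep.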
Akka - a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
ML-Hadoop - a Hadoop implementation of machine-learning algorithms.
Shark - is a large-scale data warehouse system for Spark designed to be compatible with
Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any
modification to the existing data or queries. Shark supports Hive's query language,
metastore, serialization formats, and user-defined functions, providing seamless integration
with existing Hive deployments and a familiar, more powerful option for new ones.
Apache Crunch - a Java library that provides a framework for writing, testing, and running
MapReduce pipelines. Its goal is to make pipelines composed of many user-defined
functions simple to write, easy to test, and efficient to run.
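Crunch's core idea can be sketched without its Java API: a pipeline is a chain of small user-defined functions applied to collections, so each stage can be exercised on ordinary in-memory data. The sketch below is a plain-Python analogue; `parallel_do` loosely mirrors the role of Crunch's `parallelDo`, and the UDF names are invented for the example.

```python
# A pipeline as a composition of user-defined functions (UDFs),
# in the spirit of Crunch but in plain Python.

def parallel_do(collection, fn):
    """Apply a UDF to every element; a UDF may emit zero or more records."""
    out = []
    for element in collection:
        out.extend(fn(element))
    return out

# Two UDFs composed into a small word-filtering pipeline.
def tokenize(line):
    return line.lower().split()

def keep_long_words(word):
    return [word] if len(word) > 3 else []

lines = ["Crunch makes pipelines simple", "easy to test and run"]
words = parallel_do(lines, tokenize)
long_words = parallel_do(words, keep_long_words)
print(long_words)
# -> ['crunch', 'makes', 'pipelines', 'simple', 'easy', 'test']
```

Because each stage is an ordinary function, it can be unit-tested in isolation before the pipeline ever touches a cluster, which is exactly the "simple to write, easy to test" goal the entry describes.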
Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
Apache Mesos - is a cluster manager that provides efficient resource isolation and sharing
across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark,
and other applications on a dynamically shared pool of nodes.
Druid - is an open-source infrastructure for real-time exploratory analytics on large
datasets. The system uses an always-on, distributed, shared-nothing architecture designed
for real-time querying and data ingestion. It leverages column orientation and advanced
indexing structures to allow for cost-effective, arbitrary exploration of multi-billion-row
tables with sub-second latencies.
Apache MRUnit - a Java library that helps developers unit test Apache Hadoop
MapReduce jobs.
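MRUnit's approach is to drive a map or reduce function in isolation and assert on the exact key/value pairs it emits, with no cluster involved. The sketch below shows that testing style in plain Python rather than MRUnit's Java `MapDriver`/`ReduceDriver` API; the word-count functions are invented for the example.

```python
# Unit-testing map and reduce functions in isolation, in the style of
# MRUnit: feed one input, assert on the emitted (key, value) pairs.

def wordcount_map(_, line):
    """Mapper: emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

def wordcount_reduce(key, values):
    """Reducer: sum the counts for one key."""
    return [(key, sum(values))]

# Driver-style tests, analogous to MapDriver/ReduceDriver runs:
assert wordcount_map(0, "to be or not to be") == [
    ("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
assert wordcount_reduce("to", [1, 1]) == [("to", 2)]
print("map/reduce unit tests passed")
```

The point is that the map and reduce logic never needs a running Hadoop cluster to be verified.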
hiho - Hadoop data integration with various databases, FTP servers, and Salesforce.
Incrementally update, dedup, append, and merge your data on Hadoop.
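The dedup/merge operations described above can be sketched as plain keyed-record logic: merge an incremental batch into a base dataset so that duplicates collapse and newer records win. This is an illustrative sketch of the idea, not hiho's API; the `id`/`city` field names are invented.

```python
# Incremental merge with dedup: records from the delta batch override
# records in the base dataset that share the same key.

def merge_incremental(base, delta, key="id"):
    merged = {}
    for record in list(base) + list(delta):
        merged[record[key]] = record  # last occurrence per key wins
    return sorted(merged.values(), key=lambda r: r[key])

base = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]
delta = [{"id": 2, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}]
print(merge_incremental(base, delta))
# -> id 1 kept, id 2 updated to Mumbai, id 3 appended
```

On Hadoop the same effect is typically achieved with a keyed shuffle so that all versions of a record meet in one reducer.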
white-elephant - a Hadoop log aggregator and dashboard which enables visualization of
Hadoop cluster utilization across users.
Tachyon - a fault-tolerant distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce.
HIPI - is a library for Hadoop's MapReduce framework that provides an API for
performing image processing tasks in a distributed computing environment
Cassovary - a simple big graph processing library for the JVM.
Apache Helix - is a generic cluster management framework used for the automatic
management of partitioned, replicated and distributed resources hosted on a cluster of nodes
Summingbird - streaming MapReduce with Scalding and Storm.
Created by: Samarjit Mahapatra
Hadoop Alternatives
Apache Spark - an open-source cluster computing system that aims to make data
analytics fast, both fast to run and fast to write.
GraphLab - a graph-based distributed computation framework offering a fully distributed API, HDFS
integration, and a wide range of machine-learning toolkits.
HPCC Systems (High Performance Computing Cluster) - is a massively parallel-processing
computing platform that solves Big Data problems.
Dryad - a Microsoft Research project investigating programming models for writing parallel and
distributed programs that scale from a small cluster to a large data center.
Stratosphere - a platform for big data analytics "above the clouds."
Storm - is a free and open source distributed realtime computation system. Storm makes
it easy to reliably process unbounded streams of data, doing for realtime processing what
Hadoop did for batch processing. Storm is simple, can be used with any programming
language, and is a lot of fun to use!
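Storm's model of spouts (sources) and bolts (transformations) over unbounded streams can be sketched with Python generators. This is only a single-process analogue of the idea, not Storm's actual API; the spout/bolt names below are invented for the example.

```python
# Stream processing in the spirit of Storm: a spout emits an unbounded
# stream of tuples, and bolts transform it one tuple at a time.
import itertools

def sentence_spout():
    """An endless source, like a spout reading from a message queue."""
    for i in itertools.count():
        yield "storm processes streams %d" % i

def split_bolt(stream):
    """Bolt: split each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream, counts):
    """Bolt: maintain a running count per word."""
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

counts = {}
topology = count_bolt(split_bolt(sentence_spout()), counts)
# Consume only a prefix of the (conceptually infinite) stream.
for word, running_count in itertools.islice(topology, 8):
    print(word, running_count)
```

The laziness of the generators mirrors what the entry says: the data is unbounded, and each tuple flows through the topology as it arrives rather than in batches.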
R3 - is a MapReduce engine written in Python using a Redis backend.
Disco - is a lightweight, open-source framework for distributed computing based on
the MapReduce paradigm.
Phoenix - is a shared-memory implementation of Google's MapReduce model for data-intensive processing tasks.
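Several of the engines in this list (R3, Disco, Phoenix) reimplement the same MapReduce scheme. A minimal in-memory sketch of its three phases, written here as a generic illustration rather than any one engine's API:

```python
# The MapReduce scheme in miniature: map, shuffle (group by key), reduce.
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    # Map phase: each input record yields intermediate (key, value) pairs.
    intermediate = [pair for record in inputs for pair in mapper(record)]
    # Shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: fold each group down to a final result.
    return {key: reducer(key, values) for key, values in groups.items()}

result = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values))
print(result)  # -> {'a': 2, 'b': 2, 'c': 1}
```

What distinguishes the engines above is not this scheme but where the phases run: Redis queues for R3, Erlang-coordinated workers for Disco, shared-memory threads for Phoenix.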
Plasma - PlasmaFS is a distributed filesystem for large files, implemented in user space.
Plasma Map/Reduce runs the well-known MapReduce scheme for mapping and rearranging
large files. Plasma KV is a key/value database on top of PlasmaFS.
Peregrine - is a map reduce framework designed for running iterative jobs across
partitions of data.
httpmr - A scalable data processing framework for people with web clusters.
sector/sphere - Sector is a high-performance, scalable, and secure distributed file system.
Sphere is a high-performance parallel data processing engine that can process Sector data
files on the storage nodes with very simple programming interfaces.
Filemap - is a lightweight system for applying Unix-style file processing tools to large
amounts of data stored in files.
misco - is a distributed computing framework designed for mobile devices
MR-MPI - an open-source library that implements MapReduce for distributed-memory
parallel machines on top of standard MPI message passing.
GridGain - an in-memory computing platform.
MapReduce Alternatives
Hadoop Eco-System
Apache Hadoop - is designed to detect and handle failures at the application layer, delivering a highly-available service on top of a cluster of computers, each of which may be prone to failure.
Bigtop - is a project for the development of packaging and tests of the Apache Hadoop
ecosystem.