
Day 1-2

Big Data
Introduction
Big data is a broad term for data sets so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualisation, and information privacy. The term often refers simply to the use of
predictive analytics or certain other advanced methods to extract value from data, and seldom to a
particular size of data set. Accuracy in big data may lead to more confident decision making, and
better decisions can mean greater operational efficiency, cost reductions and reduced risk.
Analysis of data sets can find new correlations, to "spot business trends, prevent diseases, combat
crime and so on." Scientists, practitioners of media and advertising and governments alike regularly
meet difficulties with large data sets in areas including Internet search, finance and business
informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,
connectomics, complex physics simulations, and biological and environmental research.
Data sets grow in size in part because they are increasingly being gathered by cheap and numerous
information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones,
radio-frequency identification (RFID) readers, and wireless sensor networks. The world's
technological per-capita capacity to store information has roughly doubled every 40 months since
the 1980s; as of 2012, 2.5 exabytes of data were created every day. The challenge for large
enterprises is determining who should own big data initiatives that straddle the entire organisation.
Work with big data is necessarily uncommon; most analysis is of "PC size" data, on a desktop PC or
notebook that can handle the available data set.
Relational database management systems and desktop statistics and visualisation packages often
have difficulty handling big data. The work instead requires "massively parallel software running on
tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on
the capabilities of the users and their tools, and expanding capabilities make Big Data a moving
target. Thus, what is considered to be "Big" in one year will become ordinary in later years. "For
some organisations, facing hundreds of gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may take tens or hundreds of terabytes before
data size becomes a significant consideration."

Definition
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to
capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a
constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
Big data is a set of techniques and technologies that require new forms of integration to uncover
large hidden values from large datasets that are diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney
defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume
(amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In
2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high
variety information assets that require new forms of processing to enable enhanced decision
making, insight discovery and process optimisation." Additionally, some organisations add a
fourth V, "Veracity", to describe it.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a
sounder distinction between big data and Business Intelligence, regarding data and their use:
A. Business Intelligence uses descriptive statistics with data of high information density to
measure things, detect trends, etc.;
B. Big data uses inductive statistics and concepts from nonlinear system identification to infer
laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low
information density, in order to reveal relationships and dependencies and to predict outcomes
and behaviours.
A more recent, consensual definition states that "Big Data represents the Information assets
characterised by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value".

Characteristics
Big data can be described by the following characteristics:

Volume
The quantity of data that is generated is very important in this context. The size of the data
determines its value and potential, and whether it can actually be considered Big Data or not. The
name "Big Data" itself contains a term related to size, and hence the characteristic.

Variety
The next aspect of Big Data is its variety. The category to which the data belongs is an essential
fact that data analysts need to know. Knowing the category helps those who analyse the data
closely to use it effectively to their advantage, and thus upholds the importance of Big Data.

Velocity
The term velocity in this context refers to the speed at which data is generated and processed to
meet the demands and challenges that lie ahead in the path of growth and development.

Variability
This is a factor which can be a problem for those who analyse the data. It refers to the
inconsistency the data can show at times, which hampers the process of handling and managing
the data effectively.

Veracity
The quality of the data being captured can vary greatly. Accuracy of analysis depends on the
veracity of the source data.

Complexity
Data management can become a very complex process, especially when large volumes of data come
from multiple sources. These data need to be linked, connected and correlated in order to grasp the
information they are supposed to convey. This situation is therefore termed the complexity of Big Data.
Factory work and Cyber-physical systems may have a 6C system:

1. Connection (sensor and networks),

2. Cloud (computing and data on demand),

3. Cyber (model and memory),

4. Content/context (meaning and correlation),

5. Community (sharing and collaboration), and

6. Customisation (personalisation and value).

In this scenario, and in order to provide useful insight to factory management and gain correct
content, data has to be processed with advanced tools (analytics and algorithms) to generate
meaningful information. Considering the presence of visible and invisible issues in an industrial
factory, the information generation algorithm has to be capable of detecting and addressing invisible
issues such as machine degradation and component wear on the factory floor.
Day 3 - 5
Hadoop
Introduction
Apache Hadoop was born out of a need to process an avalanche of big data. The web was
generating more and more information on a daily basis, and it was becoming very difficult to index
over one billion pages of content. In order to cope, Google invented a new style of data processing
known as MapReduce. A year after Google published a white paper describing the MapReduce
framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply
these concepts to an open-source software framework to support distribution for the Nutch search
engine project. Given the original case, Hadoop was designed with a simple write-once storage
infrastructure.

Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries
for a huge variety of tasks that all share the common theme of high variety, volume and velocity
of data, both structured and unstructured. It is now widely used across industries, including
finance, media and entertainment, government, healthcare, information services, retail, and other
industries with big data requirements, but the limitations of the original storage infrastructure
remain.

Hadoop is increasingly becoming the go-to framework for large-scale, data-intensive deployments.
Hadoop is built to process large amounts of data from terabytes to petabytes and beyond. With this
much data, it's unlikely that it would fit on a single computer's hard drive, much less in memory.
The beauty of Hadoop is that it is designed to efficiently process huge amounts of data by
connecting many commodity computers together to work in parallel. Using the MapReduce model,
Hadoop can take a query over a dataset, divide it, and run it in parallel over multiple nodes.
Distributing the computation solves the problem of having data that's too large to fit onto a single
machine.

Hadoop Software
The Hadoop software stack introduces entirely new economics for storing and processing data at
scale. It allows organizations unparalleled flexibility in how they're able to leverage data of all
shapes and sizes to uncover insights about their business. Users can now deploy the complete
hardware and software stack including the OS and Hadoop software across the entire cluster and
manage the full cluster through a single management interface.

Apache Hadoop includes a Distributed File System (HDFS), which breaks up input data and stores
data on the compute nodes. This makes it possible for data to be processed in parallel using all of
the machines in the cluster. The Apache Hadoop Distributed File System is written in Java and runs
on different operating systems.

Hadoop was designed from the beginning to accommodate multiple file system implementations
and there are a number available. HDFS and the S3 file system are probably the most widely used,
but many others are available, including the MapR File System.
How is Hadoop Different from Past Techniques?
A. Hadoop can handle data in a very fluid way. Hadoop is more than just a faster, cheaper
database and analytics tool. Unlike databases, Hadoop doesn't insist that you structure your
data. Data may be unstructured and schemaless. Users can dump their data into the framework
without needing to reformat it. By contrast, relational databases require that data be structured
and schemas be defined before storing the data.
B. Hadoop has a simplified programming model. Hadoop's simplified programming model
allows users to quickly write and test software in distributed systems. Performing computation
on large volumes of data has been done before, usually in a distributed setting, but writing
software for distributed systems is notoriously hard. By trading away some programming
flexibility, Hadoop makes it much easier to write distributed programs.
C. Because Hadoop accepts practically any kind of data, it stores information in far more diverse
formats than what is typically found in the tidy rows and columns of a traditional database.
Some good examples are machine-generated data and log data, written out in storage formats
including JSON, Avro and ORC.
D. The majority of data preparation work in Hadoop is currently being done by writing code in
scripting languages like Hive, Pig or Python.
E. Hadoop is easy to administer. Alternative high performance computing (HPC) systems allow
programs to run on large collections of computers, but they typically require rigid program
configuration and generally require that data be stored on a separate storage area network
(SAN) system. Schedulers on HPC clusters require careful administration, and program
execution is sensitive to node failure; by contrast, administration of a Hadoop cluster is much simpler.
F. Hadoop invisibly handles job control issues such as node failure. If a node fails, Hadoop makes
sure the computations are run on other nodes and that data stored on that node are recovered
from other nodes.
G. Hadoop is agile. Relational databases are good at storing and processing data sets with
predefined and rigid data models. For unstructured data, relational databases lack the agility and
scalability that is needed. Apache Hadoop makes it possible to cheaply process and analyze
huge amounts of both structured and unstructured data together, and to process data without
defining all structure ahead of time.

Hadoop Architecture
The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules.
These libraries provide filesystem and OS-level abstractions and contain the necessary
Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-
throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data
sets.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS relaxes a few
POSIX requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project.

NameNode and DataNodes


HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data
to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a
set of DataNodes. The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system's clients. The
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.

The File System Namespace


HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar to
most other existing file systems; one can create and remove files, move a file from one directory to
another, or rename a file. HDFS does not yet implement user quotas. HDFS does not support hard
links or soft links. However, the HDFS architecture does not preclude implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its
properties is recorded by the NameNode. An application can specify the number of replicas of a file
that should be maintained by HDFS. The number of copies of a file is called the replication factor
of that file. This information is stored by the NameNode.
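
For reference, a few common HDFS shell commands illustrate these namespace and replication
operations; the paths below are placeholders, not part of the course material:

hdfs dfs -mkdir -p /user/demo/input          # create a directory in the namespace
hdfs dfs -put access.log /user/demo/input/   # copy a local file into HDFS
hdfs dfs -ls /user/demo/input                # list the directory contents
hdfs dfs -setrep -w 3 /user/demo/input/access.log          # change the replication factor to 3
hdfs dfs -mv /user/demo/input/access.log /user/demo/input/web.log   # rename within the namespace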

Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of
a file are replicated for fault tolerance. The block size and replication factor are configurable per
file. An application can specify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in HDFS are write-once and have
strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. The term MapReduce actually refers to the following two different
tasks that Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data,
where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource consumption/
availability and scheduling the jobs' component tasks on the slaves, monitoring them and re-
executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service which means if
JobTracker goes down, all running jobs are halted.

MapReduce Algorithm
1. Map function: Splitting and Mapping
2. Shuffle function: Merging and Sorting
3. Reduce function: Reduction
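
To make the map and reduce tasks above concrete, here is a minimal word-count sketch using the
standard Hadoop MapReduce Java API; the class name, job name and file paths are illustrative,
not taken from the course material.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: break each input line into (word, 1) key/value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: combine the tuples for each word into a single count.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}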

How does Hadoop work?

Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required process by
specifying the following items:

1. The location of the input and output files in the distributed file system.

2. The Java classes in the form of a jar file containing the implementation of the map and reduce
functions.

3. The job configuration by setting different parameters specific to the job.

Stage 2
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker,
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and
the output of the reduce function is stored in output files on the file system.
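
Assuming the word-count job sketched earlier has been packaged into a jar named wordcount.jar
(an illustrative name), a typical submission and result check might look like:

hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output
hdfs dfs -cat /user/demo/output/part-r-00000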

Why use Hadoop


A. It's cost-effective. Apache Hadoop controls costs by storing data more affordably per terabyte
than other platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop
delivers compute and storage for hundreds of dollars per terabyte.
B. It's fault-tolerant. Fault tolerance is one of the most important advantages of using Hadoop.
Even if individual nodes experience high rates of failure when running jobs on a large cluster,
data is replicated across the cluster so that it can be recovered easily in the face of disk, node or
rack failures.
C. It's flexible. The flexible way that data is stored in Apache Hadoop is one of its biggest assets,
enabling businesses to generate value from data that was previously considered too expensive to
store and process in traditional databases. With Hadoop, you can use all types of data, both
structured and unstructured, to extract more meaningful business insights from more of your data.
D. It's scalable. Hadoop is a highly scalable storage platform, because it can store and distribute
very large data sets across clusters of hundreds of inexpensive servers operating in parallel. The
problem with traditional relational database management systems (RDBMS) is that they can't
scale to process massive volumes of data.
Day 6-8
Pig

Introduction
Pig was initially developed at Yahoo Research around 2006 but moved into the Apache Software
Foundation in 2007. Apache Pig is a platform for analysing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelisation, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations
are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimisation opportunities. The way in which tasks are encoded permits the system to
optimise their execution automatically, allowing the user to focus on semantics rather than
efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.

Components of Pig
There are two major components of Pig:

PigLatin script language


A PigLatin program is a collection of statements. A statement can either be an operation or a
command. For example, to load data from a file, issue the LOAD operation with a file name as an
argument. A command could be an HDFS command used directly within PigLatin, such as the ls
command to list, say, all files with an extension of txt in the current directory. The execution of a
statement does not necessarily immediately result in a job running on the Hadoop cluster. All
commands and certain operators, such as DUMP, will start up a job, but other operations simply get
added to a logical plan. This plan gets compiled to a physical plan and executed once a data flow
has been fully constructed, such as when a DUMP or STORE statement is issued.

Here are some of the kinds of PigLatin statements. There are UDF statements that can be used to
REGISTER a user-defined function into PigLatin and DEFINE a short form for referring to it. It has
been mentioned that HDFS commands are a form of PigLatin statement. Other commands include
MapReduce commands and utility commands. There are also diagnostic operators, such as
DESCRIBE, that work much like an SQL DESCRIBE to show you the schema of the data.

The largest number of operators fall under the relational operators category. These operators are
used to LOAD data into the program, to DUMP data to the screen, and to STORE data to disk.
There are special operators for filtering, grouping, sorting, combining and splitting data. The
relational operators produce relations as their output. A relation is a collection of tuples. Relations
can be named using an alias. For example, the LOAD operator produces a relation based on what it
reads from the file you pass it. You can assign this relation to an alias, say, x. If you DUMP the
relation aliased by x by issuing DUMP x, you get the collection of tuples, one for each row of input data.
PigLatin supports a standard set of simple data types, such as integers and character arrays. It also
supports three complex types: the tuple, the bag and the map. PigLatin also supports functions. These
include eval functions, which take in multiple expressions and produce a single expression. You can
write your own eval, filter, load and store functions using PigLatin's UDF mechanism. You write the
function in Java, package it into a jar, and register the jar using the REGISTER statement. You also
have the option to give the function a shorter alias for referring to it by using the DEFINE
statement.
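
To illustrate these relational operators, here is a small PigLatin sketch; the file name, schema and
aliases are hypothetical:

visits  = LOAD 'weblog.txt' USING PigStorage(',') AS (ip:chararray, url:chararray, hits:int);
by_url  = GROUP visits BY url;
counts  = FOREACH by_url GENERATE group AS url, SUM(visits.hits) AS total_hits;
ordered = ORDER counts BY total_hits DESC;
DUMP ordered;                                           -- triggers execution of the logical plan
STORE ordered INTO 'url_counts' USING PigStorage(',');  -- writes the relation to disk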

Execution Environment
There are two choices of execution environment: a local environment and a distributed environment.
A local environment is good for testing when you do not have a full distributed Hadoop
environment deployed. You tell Pig to run in the local environment when you start Pig's command
line interpreter by passing it the -x local option. You tell Pig to run in a distributed environment by
passing -x mapreduce instead. Alternatively, you can start the Pig command line interpreter without
any arguments and it will start in the distributed environment. There are three different ways to run
Pig. You can run your PigLatin code as a script, just by passing the name of your script file to the pig
command. You can run it interactively through the grunt command line, launched by using Pig with
no script argument. Finally, you can call into Pig from within Java using Pig's embedded form.

Execution Modes

Local Mode
You invoke pig from your terminal as follows:
pig -x local
It will bring you to the grunt shell as seen below:
grunt>
Local mode means that Pig will work on files available on your local file system and store your
results in your local file system once the analysis is done.

MapReduce Mode
You invoke pig from your terminal as follows:
pig OR pig -x mapreduce
It will bring you to the grunt shell as seen below:
grunt>
MapReduce mode means that Pig will work on files available on your HDFS and store your
results in your HDFS once the analysis is done.
Day 9-11

Hive

Introduction
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarisation, query, and analysis. Hive gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop. Traditional SQL queries must be
implemented in the MapReduce Java API to execute SQL applications and queries over
distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries
(HiveQL) into the underlying Java API without the need to implement queries in the low-level Java
API. Since most data warehousing applications work with SQL-based query languages, Hive
supports easy portability of SQL-based applications to Hadoop. While initially developed by
Facebook, Apache Hive is now used and developed by other companies such as Netflix and the
Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache
Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.

Features
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema-on-read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All
three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes,
including bitmap indexes. Other features of Hive include:
Indexing to provide acceleration; index types include compaction and bitmap indexes as of 0.10,
and more index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks
during query execution.
Operating on compressed data stored in the Hadoop ecosystem using algorithms including
DEFLATE, BWT, snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools.
Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark
jobs.
By default, Hive stores metadata in an embedded Apache Derby database, and other client/server
databases like MySQL can optionally be used.
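
A short HiveQL sketch illustrates the SQL-like interface; the table name, columns and input data
layout are hypothetical:

CREATE TABLE IF NOT EXISTS web_log (
  ip STRING,
  url STRING,
  hits INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- queries like this are implicitly converted into MapReduce, Tez or Spark jobs
SELECT url, SUM(hits) AS total_hits
FROM web_log
GROUP BY url
ORDER BY total_hits DESC
LIMIT 10;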

Architecture
Major components of the Hive architecture are:

Metastore
Stores metadata for each of the tables such as their schema and location. It also includes the
partition metadata which helps the driver to track the progress of various data sets distributed over
the cluster. The data is stored in a traditional RDBMS format. The metadata helps the driver keep
track of the data, and it is highly crucial. Hence, a backup server regularly replicates the data,
which can be retrieved in case of data loss.

Driver
Acts like a controller which receives the HiveQL statements. It starts the execution of the statement
by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary
metadata generated during the execution of a HiveQL statement. The driver also acts as a
collection point of data or query results obtained after the Reduce operation.

Compiler
Performs compilation of the HiveQL query, which converts the query to an execution plan. This
plan contains the tasks and steps needed to be performed by the Hadoop MapReduce to get the
output as translated by the query. The compiler converts the query to an abstract syntax tree (AST).
After checking for compatibility and compile-time errors, it converts the AST to a directed acyclic
graph (DAG). The DAG divides operators into MapReduce stages and tasks based on the input
query and data.

Optimiser
Performs various transformations on the execution plan to get an optimised DAG. Transformations
can be aggregated together, such as converting a pipeline of joins into a single join, for better
performance. It can also split the tasks, such as applying a transformation on data before a
reduce operation, to provide better performance and scalability. However, the transformation logic
used for optimisation can be modified or pipelined using another optimiser.

Executor
After compilation and optimisation, the Executor executes the tasks according to the DAG. It
interacts with the job tracker of Hadoop to schedule tasks to be run. It takes care of pipelining the
tasks by making sure that a task with dependencies is executed only after all its prerequisites have run.

CLI, UI, and Thrift Server


Command Line Interface and UI (User Interface) allow an external user to interact with Hive by
submitting queries, instructions and monitoring the process status. Thrift server allows external
clients to interact with Hive, much as JDBC/ODBC servers do.
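
As an example of these access paths, Hive can be reached through its command line interface or
through a JDBC client talking to the Thrift-based HiveServer2; the host, port, user and query below
are placeholders:

hive -e "SHOW TABLES;"                           # classic CLI, runs a single HiveQL statement
beeline -u jdbc:hive2://localhost:10000 -n hive  # JDBC/Thrift client connecting to HiveServer2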

How Hive works


Hive on LLAP (Live Long and Process) makes use of persistent query servers with intelligent in-
memory caching to avoid Hadoop's batch-oriented latency and provide sub-second query
response times against smaller data volumes, while Hive on Tez continues to provide excellent
batch query performance against petabyte-scale data sets.
The tables in Hive are similar to tables in a relational database, and data units are organised in a
taxonomy from larger to more granular units. Databases are composed of tables, which are made up
of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or
appending data.
Within a particular database, data in the tables is serialized and each table has a corresponding
Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions
that determine how data is distributed within sub-directories of the table directory. Data within
partitions can be further broken down into buckets.
Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN,
CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT.
In addition, analysts can combine primitive data types to form complex data types, such as structs,
maps and arrays.
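
As a sketch of how partitions, buckets and complex types appear in a table definition, consider the
following hypothetical example (all names are illustrative):

CREATE TABLE page_views (
  ip STRING,
  url STRING,
  view_time TIMESTAMP,
  details MAP<STRING, STRING>
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (ip) INTO 16 BUCKETS
STORED AS ORC;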
Day 12-14

Flume

Introduction
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data. It has a simple and flexible architecture based on streaming
data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and
recovery mechanisms. It uses a simple extensible data model that allows for online analytic
application.

Features
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically,
Flume allows users to:

Stream data
Ingest streaming data from multiple sources into Hadoop for storage and analysis.

Insulate systems
Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at
which data can be written to the destination.

Guarantee data delivery


Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message
moves from one agent to another, two transactions are started, one on the agent that delivers the
event and the other on the agent that receives the event. This ensures guaranteed delivery semantics.

Scale horizontally
To ingest new data streams and additional volume as needed.

Anatomy of a Flume agent


Flume deploys as one or more agents, each contained within its own instance of the Java Virtual
Machine (JVM). Agents consist of three pluggable components: sources, sinks, and channels. An
agent must have at least one of each in order to run. Sources collect incoming data as events. Sinks
write events out, and channels provide a queue to connect the source and sink.

Sources
Put simply, Flume sources listen for and consume events. Events can range from newline-
terminated strings in stdout to HTTP POSTs and RPC calls; it all depends on what sources the
agent is configured to use. Flume agents may have more than one source, but must have at least
one. Sources require a name and a type; the type then dictates additional configuration parameters.
On consuming an event, Flume sources write the event to a channel. Importantly, sources write to
their channels as transactions. By dealing in events and transactions, Flume agents maintain end-to-
end flow reliability. Events are not dropped inside a Flume agent unless the channel is explicitly
allowed to discard them due to a full queue.

Channels
Channels are the mechanism by which Flume agents transfer events from their sources to their
sinks. Events written to the channel by a source are not removed from the channel until a sink
removes that event in a transaction. This allows Flume sinks to retry writes in the event of a failure
in the external repository (such as HDFS or an outgoing network connection). For example, if the
network between a Flume agent and a Hadoop cluster goes down, the channel will keep all events
queued until the sink can correctly write to the cluster and close its transactions with the channel.
Channels are typically of two types: in-memory queues and durable disk-backed queues. In-
memory channels provide high throughput but no recovery if an agent fails. File or database-backed
channels, on the other hand, are durable. They support full recovery and event
replay in the case of agent failure.
Sinks
Sinks provide Flume agents with pluggable output capability: if you need to write to a new type of
storage, just write a Java class that implements the necessary interfaces. Like sources, sinks
correspond to a type of output: writes to HDFS or HBase, remote procedure calls to other agents, or
any number of other external repositories. Sinks remove events from the channel in transactions and
write them to output. Transactions close when the event is successfully written, ensuring that all
events are committed to their final destination.
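
As a sketch of how these three components are wired together, a minimal agent definition in a
Flume configuration file might look like the following; the agent name, component names, log path
and HDFS path are hypothetical:

# flume.conf: one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# source: consume lines appended to a log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/web/access.log
agent1.sources.src1.channels = ch1

# channel: durable, file-backed queue between source and sink
agent1.channels.ch1.type = file

# sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1

Such an agent would typically be started with a command along the lines of:
flume-ng agent --conf conf --conf-file flume.conf --name agent1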

How Flume works


In one specific example, Flume is used to log manufacturing operations. When one run of product
comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands
of times per day, the large volume of log file data can stream through Flume into a tool for same-day
analysis with Apache Storm, or months or years of production runs can be stored in HDFS and
analysed by a quality assurance engineer using Apache Hive.

Flume components interact in the following way:


A flow in Flume starts from the Client.
The Client transmits the Event to a Source operating within the Agent.
The Source receiving this Event then delivers it to one or more Channels.
One or more Sinks operating within the same Agent drains these Channels.
Channels decouple the ingestion rate from drain rate using the familiar producer-consumer model
of data exchange.
When spikes in client-side activity cause data to be generated faster than the provisioned
destination capacity can handle, the Channel size increases. This allows sources to
continue normal operation for the duration of the spike.
The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the
creation of complex data flow topologies.
Because Flume's distributed architecture requires no central coordination point, each agent runs
independently of the others with no inherent single point of failure, and Flume can easily scale
horizontally.

Data ingestion through Flume


Day 15-18

Sqoop

Introduction
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores
such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the
EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data
from Hadoop and export it into external structured datastores. Sqoop works with relational
databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

Features

Import sequential datasets from mainframe


Satisfies the growing need to move data from mainframe to HDFS.

Import direct to ORC Files


Improved compression and light-weight indexing for improved query performance.

Data imports
Moves certain data from external stores and EDWs into Hadoop to optimise cost-effectiveness of
combined data storage and processing.

Parallel data transfer


For faster performance and optimal system utilisation.

Fast data copies


From external systems into Hadoop.

Efficient data analysis


Improves efficiency of data analysis by combining structured data with unstructured data in a
schema-on-read data lake.

Load balancing
Mitigates excessive storage and processing loads on other systems.

How Sqoop works


Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop
extension API provides a convenient framework for building new connectors which can be dropped
into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled
with various connectors that can be used for popular database and data warehousing systems.
Sqoop introspects the database to gather the necessary metadata for the data being imported.
A map-only Hadoop job is submitted to the cluster by Sqoop.
The map-only job performs the data transfer using the metadata captured.
The imported data is saved in a directory on HDFS based on the table being imported. The user can
specify an alternative directory where the files should be populated.
By default these files contain comma-delimited fields, with new lines separating records. The user
can override the format in which data is copied over by explicitly specifying the field separator
and record terminator characters.
Sqoop supports different data formats for importing data. Sqoop also provides several other
options for tuning the import operation.
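
For example, a single-table import from MySQL into HDFS might be expressed as follows; the
JDBC URL, credentials, table name and target directory are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/webvda \
  --username analyst -P \
  --table web_log \
  --target-dir /user/analyst/web_log \
  --fields-terminated-by ',' \
  --num-mappers 4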

Day 19

Hadoop Configuration
Hadoop is installed and configured in pseudo-distributed mode. This is a distributed simulation on a
single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate
Java process. This mode is useful for development.
You can find all the Hadoop configuration files in the location $HADOOP_HOME/etc/hadoop. It
is required to make changes in those configuration files according to your Hadoop infrastructure.
In order to develop Hadoop programs in Java, you have to reset the Java environment variables in
the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance,
the memory allocated for the file system, the memory limit for storing data, and the size of the
Read/Write buffers.

Open core-site.xml and add the required properties between the <configuration> and
</configuration> tags.
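The exact properties depend on your environment; a typical minimal core-site.xml for
pseudo-distributed mode is sketched below, with the NameNode address and port as assumptions:

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>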
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication factor, the namenode path, and
the datanode paths on your local file systems, that is, the locations where you want to store the
Hadoop infrastructure.
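A minimal sketch for a single-node setup follows; the replication factor of 1 and the local storage
paths are assumptions to be adjusted for your infrastructure:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>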
yarn-site.xml
This file is used to configure YARN for Hadoop. Open the yarn-site.xml file and add the required
properties between the <configuration> and </configuration> tags in this file.
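A minimal sketch, assuming the standard MapReduce shuffle auxiliary service, is:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>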
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop
contains a template of mapred-site.xml. First of all, it is required to copy the file from
mapred-site.xml.template to mapred-site.xml using the following command.
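The copy command, followed by a minimal configuration that selects YARN as the MapReduce
framework (an assumption consistent with the yarn-site.xml sketch above), might look like this:

cp mapred-site.xml.template mapred-site.xml

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>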
Day 20-28

Project

Title
Web VDA [Web Visitor Data Analytics Using Big Data Ecosystem]

Project Description
The Web VDA (Visitor Data Analytics) project is initiated to demonstrate capability in Big Data
Processing & Analytics to prospective clients who are looking for Big Data Analytics solutions to
understand their customers' behaviour, which will help them in acquiring new customers and
retaining existing ones. This project involves analysing log data of the web visitors.
The analytics team is interested in understanding the web activities of the site, such as the referrer
URLs used to access the website. They will get lakhs (hundreds of thousands) of records with
columns mentioning IP, date, timestamp, URL, page, browser used & other details related to users
who have accessed web pages.
This project involves analysing semi-structured/unstructured data of the web visitors, which is
mainly in the form of log files. Big log data is first ingested into the Hadoop Distributed File System
using scripting & the Apache Flume utility. After ingesting, the data will be cleaned & transformed
using Apache Pig/Hive utilities.
Cleansed, structured data is then transferred to a relational database system like MySQL.

Technologies Used

Operating System
Ubuntu 12/14.04 Server

Data Ingestion and Transfer


Apache Sqoop and Apache Flume

Data Storage
HDFS (Hadoop Distributed File System), MySQL

Data Analysis and Transformation


Apache Pig and Apache Hive

Weblog Server Location


http://bizmap.in/data/MMC_web.log
Analytics Required
The analytics team is targeting the following:
The team will analyse all main variables of the data collected, through data transformation,
filtering & summarisation, to get an understanding of the dataset and to prepare it for further
analysis. Below are some of the data analytics requested in this project (a sample query for the
first requirement is sketched after the list).
1. Which is the most viewed page on the web portal?

2. Which are the most viewed products on the portal?

3. Which is the most frequently used web browser?

4. Generate a report with the top 3 viewed products of years 2012 & 2011.

5. Generate a report with the top 3 IP addresses accessing the portal in years 2012 & 2011.

6. Generate 3 different reports showing the products accessed by the top 3 IP addresses; the reports
should have product names & their view counts in descending order.

7. Generate a report containing all products & their view counts in descending order.

8. Generate a report containing all user IPs & their hit counts in descending order.
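
As referenced above, here is a sketch of how the first requirement might be expressed in HiveQL,
assuming the cleansed log data has already been loaded into a hypothetical Hive table named
web_log with a page column:

-- hypothetical table and column names
SELECT page, COUNT(*) AS views
FROM web_log
GROUP BY page
ORDER BY views DESC
LIMIT 1;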
