Monday, 15 June 2015

Summary of Hadoop concepts

Basics of Hadoop

  • Apache Hadoop is a software framework.
  • Hadoop is based on the concept of divide and conquer.
  • It uses distributed processing for computation and storage.
Users of Hadoop

Organizations use Hadoop for many purposes; to name a few:
  • E-Commerce industry - for analyzing customer purchase history and recommending products to customers.
  • Forecasting, pattern recognition, and deep historical analytics techniques can be used for weather forecasting, for identifying patterns of fraudulent activity, and for deep analysis of customer purchasing trends.
  • In image processing, satellite images can be stored and analyzed.
  • Hadoop can be used for predictive analytics, such as finding the cheapest airline tickets and nearby hotels.
  • Search engines analyze individual browsing history and provide results customized to each user.
  • Google Earth uses Hadoop.
  • Google uses Hadoop to store browser history, run analysis, and make recommendations.
  • Amazon uses Hadoop to store the purchase history of all of its users. It also uses Hadoop to analyze page history, which indicates which products are popular so they can be suggested to customers.
  • PayPal uses Hadoop to help secure customer data. Large numbers of sales records can be saved and analyzed; Hadoop finds patterns, and sometimes anomalies, in the data. These anomalies help find and prevent fraud, so security is increased through the use of Hadoop.
  • Hadoop is also used throughout IT infrastructure to recognize cyberattack and malware patterns.
  • Hadoop has created new disciplines of jobs in data analytics, data mining, and predictive modeling.
The four V's of Big Data
  • Volume refers to the size of the data
  • Text files are small, perhaps a few kilobytes.
  • Sound files are a few megabytes.
  • Videos and full-length movies are a few gigabytes.
  • Over 72 hours of video is uploaded to YouTube every minute.
  • Velocity is the speed at which data flows.
  • Data velocity used to be slow as it was human generated.
  • Velocity has increased with machine data.
  • Machine data and data automation have accelerated velocity.
  • Data velocity is expected to keep increasing rapidly.
  • Variety describes the different formats and sources of data.
  • Smartphones, tablets, etc. produce different types of data.
  • This data does not fit into traditional databases.
  • A variety of data can exist in many forms (streaming, etc.)
  • Machine-produced data adds to the variety.
  • Veracity defines the authenticity of data.
  • Data is not always clean; in fact, the opposite is usually true.
  • Data often needs to be validated and cleaned.
  • Incomplete data needs to be normalized.
  • Noise and abnormalities must be removed.
  • The four V's have changed the rules.
  • Old rules of data capture, storage, and analysis are obsolete.
  • Whole new technologies have to be used (or invented).
  • The traditional database will never go away.
  • Legacy solutions will be augmented with new technology.

Sources of Big Data
  • The sources of data (of Big Data) have changed recently.
  • Traditional data is text based and stored in a database.
  • Traditional data is slow and human generated.
  • Traditional data is still "Big Data" - just a type of it.
  • New sources of data have joined traditional data.
  • Transactional data is the most common legacy data.
  • Transactional data is kept in traditional databases.
  • This data is stored for traditional OLTP systems.
  • Transactional data can be transformed for OLAP systems.
  • OLAP systems are the source of predictive models.
  • New sources of data are not text based.
  • Sources include photos, movies, streams, etc.
  • Data can be crowdsourced. For example, Stack Overflow uses Redis and a NoSQL store to manage data and display questions to users.
  • New tools are used for new data and data sources.
  • Social networking sites are a major source of crowdsourcing.
  • Social data needs to be analyzed.
  • It always needs to be cleaned up.
  • Social data has very low data authenticity.
  • Most social data is noise and has little business value.
  • Only after cleanup does social data provide business value.
  • Search data is at the heart of e-commerce.
  • Amazon, eBay, etc. are extensive producers of search data.
  • Search data is collected in real-time.
  • Search data is used for features like typeahead searching.
  • The volume of search data is HUGE.
  • Machine data is generated from nonhuman sources.
  • Medical devices and traffic devices are machine sources.
  • Machine data is typically very accurate and rarely wrong.
  • The velocity of machine data is huge.
  • Machine data is collected and stored by other machines.
Clustering

  • Hadoop runs on a cluster.
  • Traditional computing is nonclustered.
  • Processing and data storage are usually on a single machine.
  • Big Data needs the processing power of many machines.
  • Hadoop is a natural choice for Big Data processing.
  • A single machine is limited in its horsepower.
  • A cluster of many computers solves this problem.
  • A cluster combines the CPU power of all machines.
  • This combined CPU power is focused on one task.
  • The largest Hadoop cluster has over 100,000 CPUs.
Core Hadoop contains three main components 
  • The Hadoop Distributed File System (HDFS)
  • YARN, cluster management software.
  • MapReduce, a system for processing Big Data.
These three components work together to make Hadoop.

Server Farms

  • Hadoop clusters have two different kinds of nodes.
  • The name node is the boss. It controls the cluster.
  • The data nodes contain all the data.
  • The data nodes also do all the work.
  • Data nodes are independent of each other.
Testing Hadoop Installation

  • Hadoop is fairly straightforward to test. A small amount of initial configuration is required. Hadoop logs most of its activity, and knowing how to read the log files is helpful in debugging (a short Java smoke test is sketched at the end of this section).
  • Hadoop installations need Java.
  • Use the latest Java version from Oracle (formerly Sun Microsystems).
  •   Hadoop keeps its config files in the "conf" folder.
  •   Edit "conf/hadoop-env.sh" to define JAVA_HOME.
  • Most installation commands are in the "bin" folder, along with the setup scripts and their documentation.
  • Hadoop can run in one of three modes: (1) local (standalone) mode, the easiest one to learn; (2) pseudo-distributed mode; (3) fully distributed mode.
  • Before starting Hadoop for the first time, there are a couple of "configure once" steps.
  •   Issue the command: $bin/hadoop namenode -format.  
  • The script "$bin/start-all.sh" starts all Hadoop daemons.
  • On a single-node cluster, all daemons run on a single machine.
  • The startup script only takes a minute to run.
  • Monitor the console and log files for error messages.
  • Hadoop also has a web interface. This interface contains information about Hadoop execution.
  •   The name node: http://localhost:50070  
  •   The job tracker: http://localhost:50030. It keeps track of jobs running on each of the data nodes.
  • Ports are configurable depending on the installation needs.
  • Hadoop shutdown is straightforward.
  •   Shutdown with the $bin/stop-all.sh command.  
  • All daemons will shutdown.
  • Data nodes will be notified of the name node shutdown.
  • The task tracker in each data node will release all jobs.
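A quick way to go beyond the web interfaces is a small Java smoke test against HDFS. This is a minimal sketch, not part of the original notes: it assumes the daemons above are running and that the "conf" folder (with core-site.xml pointing at the name node) is on the classpath; the class name HdfsSmokeTest is made up for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsSmokeTest {
      public static void main(String[] args) throws Exception {
          // Reads core-site.xml / hdfs-site.xml from the classpath
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // List the root of HDFS; a fresh installation may show little or nothing
          for (FileStatus status : fs.listStatus(new Path("/"))) {
              System.out.println(status.getPath() + "\t" + status.getOwner());
          }
          fs.close();
      }
  }

If this prints the same entries as "bin/hadoop dfs -ls /", the name node and data nodes are reachable.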
First Hadoop Cluster
  • Hadoop clusters integrate MapReduce and HDFS.
  • A Hadoop cluster runs on a typical master/slave architecture.
  • The master controls all of the data management, communication, and processing. In Hadoop, the master is called the name node and the slaves are called data nodes; the data nodes do all the work.
  • The name node controls the whole cluster.
  • The name node does not contain any files itself. 
  • The name node contains the job tracker.
  • The job tracker assigns the processing for MapReduce jobs to data nodes across the cluster.


  • Slave (Data) nodes hold all the data from HDFS.
  • Each data node has a task tracker.
  • The task tracker monitors all MapReduce processing.
  • Data nodes do not communicate with each other.

  • Hadoop by default is configured to run in the local standalone mode.
  • Hadoop may be configured in the pseudo-distributed mode.
  • Hadoop may be configured in the fully distributed mode.
  • When running in the fully distributed mode, data nodes need to be configured to communicate with the name node.
  • Pseudo-distributed mode: Hadoop runs each of its daemons in its own Java process.
  • Data nodes and their associated task trackers are virtual; they exist on the same physical machine as the name node.
  • The pseudo-distributed mode also differs from the local standalone mode.
  • In the pseudo-distributed mode, HDFS is used rather than the native file system.
  • In the pseudo-distributed mode, there will be a single master node.
  • The job tracker also monitors the progress of slave nodes.
  • For a single slave node, there will be a single task tracker.

  • When setting up Hadoop in the pseudo-distributed mode, three configuration changes need to be made, as sketched below.
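The exact edits depend on the Hadoop version, but in a classic Hadoop 1.x single-node setup the three changes boil down to three properties, normally placed in conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml. Purely as an illustration of which settings are involved (not how a real cluster is configured), here they are set programmatically on a Configuration object:

  // Illustration only: these properties normally live in the conf/*.xml files.
  import org.apache.hadoop.conf.Configuration;

  public class PseudoDistributedSettings {
      public static Configuration build() {
          Configuration conf = new Configuration();
          // core-site.xml: the default file system points at the local name node
          conf.set("fs.default.name", "hdfs://localhost:9000");
          // hdfs-site.xml: keep only one copy of each block on a single-node cluster
          conf.set("dfs.replication", "1");
          // mapred-site.xml: where the job tracker listens
          conf.set("mapred.job.tracker", "localhost:9001");
          return conf;
      }
  }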
Hadoop Ecosystem
  • MapReduce has many supplemental tools.
  • Apache Pig is an engine used for analyzing large data flows on Hadoop. (Separately, the MapR distribution is free up to its M3 edition.)
  • Pig Latin gives developers the ability to develop their own functions.
  • Pig Latin also includes familiar operators for legacy data operations.
  • Pig Latin runs over Hadoop.
  • There are also SQL-like tools.
  • Apache Hive is a data warehousing infrastructure developed by Facebook.
  • Hive is an OLAP tool used for data aggregation.
  • Hive has a SQL-like language called HiveQL
  • Facebook has recently open-sourced Presto.
  • Presto is a SQL engine that is up to five times faster than Hive.
  • Scheduling software can also be found in the Hadoop Ecosystem.
  • Apache Oozie is a workflow scheduling system for MapReduce jobs.
  • LinkedIn also has a workflow manager called Azkaban.
  • Azkaban is a batch scheduler.
  • Azkaban has a user-friendly graphical user interface.
The Role of YARN
  • Hadoop YARN stands for "Yet Another Resource Negotiator"
  • YARN is the next generation of the MapReduce computing framework.
  • Yahoo and Facebook run two notable large clusters that use YARN.
  • The first version of Hadoop was built for large Big Data batch applications.
  • Basically, in Hadoop 1 all usage patterns had to use the same infrastructure.
  • This forced the creation of silos to manage the mixed workload.
  • The job tracker managed both cluster resources and job scheduling.
  • This design had limitations: everything had to look like a MapReduce job, and there was a theoretical limit to the size of a cluster.


  • Hadoop 2 solved many of the problems of the original Hadoop release.
  • Previously, MapReduce was responsible for data processing as well as cluster resource management.
  • Now MapReduce is only responsible for data processing.
  • The YARN architecture includes a resource manager (RM), per-node node managers, and per-application application masters.


  • YARN takes Hadoop beyond batch.
  • YARN allows all of the data to be stored in one place.
  • All types of applications (batch, interactive, online, streaming, etc.) can now run natively in Hadoop.
  • YARN is the layer that sits between the running applications and the HDFS2's redundant, reliable storage.
  • The benefits of YARN are many:
  • Cluster utilization is greatly improved compared with Hadoop 1.
  • There is no theoretical limit to the size a cluster can grow.
  • YARN also makes the cluster more agile and able to share services.
  • The Hadoop environment is much more suitable for applications that can't wait for other batch jobs to finish.
Overview of Hadoop Storage, MapReduce, Pig, and Hive.
  • Core Hadoop keeps its data in what is called the Hadoop Distributed File System.
  • Data is stored as large blocks across different physical systems or nodes.
  • Data is also replicated in HDFS.
  • Data is redundantly distributed across several nodes.
  • HDFS can be standalone.
  • MapReduce is a programming model and software framework.
  • Inspired by the map and reduce functions commonly used in functional programming.
  • Computational processing occurs on both structured and unstructured data.
  • It is a core component of the Hadoop platform
  • Hive, an Apache project, is an OLAP framework built over Hadoop HDFS.
  • Hive is not part of core Hadoop
  • Its data structure is similar to traditional relational database with tables.
  • HiveQL is a query language similar to SQL
  • HiveQL supports DDL operations (such as creating tables) as well as loading and inserting records.
  • In Hive, the basic structural unit is a table, which has an associated HDFS directory.
  • A table can be divided into partitions, and each partition is further divided into buckets.
  • Hive is popularly used by organizations for data mining.
  • Hive is known for scalability, extensibility, and fault tolerance.
  • Apache Pig is a tool used for data extraction.
  • Pig is not part of core Hadoop and must be installed separately.
  • Multiple versions of Pig exist.
  • Users like Pig because it is simple and has the look and feel of legacy tools.
  • Pig and Hive are the two most popular data management tools in the Apache Hadoop Ecosystem.

Hadoop Distributed File System
  • One of the core pieces of Hadoop is the Hadoop Distributed File System.
  • File storage in HDFS is redundant.
  • By default, each file block is stored in three places.
  • HDFS is designed to operate even if data nodes go down.
  • HDFS breaks its files down into blocks.
  • Each of these blocks is stored in multiple places across the Hadoop cluster.
  • This distribution of blocks means that there are multiple ways to reconstruct a file.
  • HDFS is a good fit for Big Data applications.
  • More machines mean more data storage capacity.
  • The name node manages the file system metadata; it only needs to know a few things about each file, such as its name, its permissions, and the location of its blocks.
  • Name nodes control data nodes.
  • If an application needs to open a file, a request is made to the name node.
  • The name node returns the locations of the file's blocks, and the blocks are then read, possibly in parallel, and assembled for the client.
  • Client applications deal with the name node for this file metadata (see the block-location sketch after this list).
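To make the block idea concrete, here is a hedged Java sketch that asks the name node where the blocks of a file live, using the public FileSystem API; the path /user/person/data.txt is only a placeholder.

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path file = new Path("/user/person/data.txt");   // placeholder path
          FileStatus status = fs.getFileStatus(file);

          // Each block is reported along with the data nodes holding a replica of it
          for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("offset " + block.getOffset()
                      + ", length " + block.getLength()
                      + ", hosts " + Arrays.toString(block.getHosts()));
          }
          fs.close();
      }
  }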
Interacting with HDFS
  • First step is to start HDFS 
  user@namenode:hadoop$ bin/start-dfs.sh  
  • First the name node is started.
  • Next the data nodes are started.
  someone@anynode:hadoop$ bin/hadoop dfs -ls  
  • If this is a new installation, nothing should be there.
  • No files are yet present in your Hadoop cluster
  •   HDFS does not have the concept of a current working directory.  
  • There may be a few initial directory contents.
  • These are system objects created in the installation process.
  • Each of these objects is created by a username.
  • The name "hadoop" is the username under which the Hadoop daemons were started.
  • The name "supergroup" is a group whose members include the username that started the HDFS instance.
  • If it does not exist, create your home directory (a Java equivalent is sketched at the end of this section):
  Step 1 : someone@anynode:hadoop$ bin/hadoop dfs -mkdir  /user  
  Step 2 : someone@anynode:hadoop$ bin/hadoop dfs -mkdir  /user/person  

  • Repeat this process with other user IDs as needed.
  • Output may look similar to below:
 someone@anynode:hadoop$ bin/hadoop dfs -ls  
 Found 2 items
 drwxr-xr-x   -hadoop  supergroup  0 2014-09-20  19:40  /hadoop 
 drwxr-xr-x   -hadoop  supergroup  0 2014-09-20  20:08  /tmp 

  • Execute bin/hadoop dfs with no parameters to see all commands.
  • If you forget what a command does, just type:
  bin/hadoop dfs -help commandName  

  • The system will display a short form of explanation on what that command does.
  • For system status, a report on the Hadoop Distributed File System can be obtained with the dfsadmin -report command.
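The same home-directory setup can be done from Java through the FileSystem API. A minimal sketch, assuming the /user/person path from the example above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CreateHomeDir {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());

          // Equivalent of: bin/hadoop dfs -mkdir /user  and  -mkdir /user/person
          Path home = new Path("/user/person");
          if (!fs.exists(home)) {
              fs.mkdirs(home);   // also creates /user if it is missing
          }
          System.out.println("Home directory present: " + fs.exists(home));
          fs.close();
      }
  }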
File Operations within the HDFS
  • The file structure of HDFS is different than other file systems.
  • HDFS assumes a default home directory for each user.
  • Relative paths are resolved against this base directory.
  • Data management within HDFS is simple.
  • Somewhat similar to DOS or FTP (a Java FileSystem sketch of these operations follows this list).
  • Use the put command to upload files.
  • The put command inserts files into HDFS.
  • Use the ls command to list files in HDFS.
  • The put command can also be used to upload more than a single file.
  •   Entire folders can be uploaded with put  
  • HDFS keeps track of file replicas.
  • HDFS keeps very little metadata about its files.
  • The command copyFromLocal can be used in place of put.
  • The put command on a file is either fully successful or fully failed.
  •   The file is first copied to the data nodes.
  • The data nodes monitor the transfer.
  • Files are never partially uploaded.
  • Data nodes indicate when transfer is successful.
  • Data can also be downloaded from HDFS.
  • An easy way is to use stdout
  • The cat command can also be used to pipe data.
  • Data can also be displayed with cat.
  • The get command is the opposite of put
  • The copyToLocal command is the same as get
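The shell commands above map fairly directly onto the Java FileSystem API. A hedged sketch, with placeholder file names:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class FileOps {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());

          // put / copyFromLocal: upload a local file into HDFS
          fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/person/input.txt"));

          // cat: stream an HDFS file to stdout
          FSDataInputStream in = fs.open(new Path("/user/person/input.txt"));
          IOUtils.copyBytes(in, System.out, 4096, true);   // true = close the stream when done

          // get / copyToLocal: download the file back to the local file system
          fs.copyToLocalFile(new Path("/user/person/input.txt"), new Path("copy-of-input.txt"));

          fs.close();
      }
  }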
Using MapReduce with Pig and Hive
  • Pig was created to build queries for semistructured data.
  • System logs or machine-generated reports are an example of semistructured data.
  • MapReduce can be complicated and troublesome.
  • The language of Pig "Pig Latin" was written precisely for this purpose.
  • Pig compiles the job into Mappers and Reducers.
  • Pig simplifies the MapReduce process.
  • Hive leverages the SQL skills a developer may already have; a developer with SQL skills can learn Hive much faster than Pig (a small HiveQL-over-JDBC sketch follows this list).
  • Hive has its own query language called HiveQL
  • HiveQL can be mastered very quickly.
  • Hive is designed to have the look and feel of traditional SQL.
  • Traditional database management systems tend to be very fast.
  • Note that MapReduce and Hadoop, by contrast, are batch-oriented and have relatively high latency.
  • Hive is optimized for read-mostly workloads.
  • It does not, by default, support record-level updates.
  • Pig and Hive have the same purpose.
  • Hive is the tool for the SQL developer.
  • Both tools are important and will grow.
  • Hive and Pig will not replace each other.
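As a rough illustration of the SQL-like feel, here is a minimal HiveQL-over-JDBC sketch. It assumes a running HiveServer2 on its default port and a hypothetical "sales" table; adjust the connection string and table name to your own setup.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveQuery {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hive.jdbc.HiveDriver");
          Connection conn = DriverManager.getConnection(
                  "jdbc:hive2://localhost:10000/default", "", "");
          Statement stmt = conn.createStatement();

          // A familiar SQL-style aggregation; Hive compiles it into MapReduce jobs
          ResultSet rs = stmt.executeQuery(
                  "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category");
          while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
          conn.close();
      }
  }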


MapReduce Life Cycle

  • Job Client is a Java class that submits a MapReduce job
  • The Job Client class is the primary interface for the MapReduce job.
  • Job Client provides methods to submit MapReduce jobs and track their progress.
  • There are many steps in MapReduce job submission.
    • Validating input & output specifications of MapReduce job
    • Computing the input splits for the MapReduce job and, if needed, initializing the accounting information for the job's distributed cache.
    • Copying the MapReduce jar and related configuration into HDFS.
    • Submitting the MapReduce job to the job tracker and monitoring its status (a minimal driver sketch follows).
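As a concrete, hedged example of the submission step, here is a minimal word-count driver using the classic org.apache.hadoop.mapred API described in these notes. WordCountMapper and WordCountReducer are illustrative names, sketched in the mapping and reducing sections below.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCountDriver.class);
          conf.setJobName("wordcount");

          // Key/value types produced by the mapper and reducer
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          conf.setMapperClass(WordCountMapper.class);
          conf.setCombinerClass(WordCountReducer.class);   // optional local aggregation on the map side
          conf.setReducerClass(WordCountReducer.class);

          // Input and output paths in HDFS, passed on the command line
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          // Submits the job to the job tracker and waits for completion
          JobClient.runJob(conf);
      }
  }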

  • The job tracker is a Hadoop MapReduce service that assigns MapReduce tasks to data nodes in the Hadoop cluster. A job is initiated on it by the Job Client.
  • The job tracker runs on the name node and contains the algorithm that determines which data node receives which processing portion of a MapReduce job.
  • The job tracker communicates with the task tracker on each data node.
  • Each task tracker reports the health (or failure) of its data node so the job tracker may reassign MapReduce tasks.
  • The job tracker assigns tasks to where the data lives. This is one of the pillars of Hadoop: move the processing near the data rather than moving the data to the processing.

  • Task trackers run on the data nodes and are responsible for communicating with the job tracker. The job tracker is a single point of failure for the Hadoop MapReduce service; if the job tracker fails, all running MapReduce jobs fail.
  • A task tracker monitors MapReduce processing within its own node.
  • Task trackers are not aware of other data nodes
  • Data nodes only communicate with the name node
  • Data nodes do not communicate with each other.

Mapping Life Cycle

  • Inputs from HDFS are fetched into a local copy according to the required input format.
  • Key-Value Pairs are created.
  • The map function defined for the job is applied to all the key-value pairs.
  • Data will be sorted, aggregated and combined locally.
  • Results will be saved to the local file system.
  • Once the task is done, the task tracker will be informed.
  • The task tracker informs the job tracker.
  • Results will then be passed on to the Reduce phase by the job tracker (a sample map function is sketched below).
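A hedged sketch of what the map step can look like with the classic API, continuing the word-count example (the class name WordCountMapper is the one assumed by the driver above):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      // key = byte offset of the line in the input split, value = the line itself
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              output.collect(word, ONE);   // emit (word, 1) for every word seen
          }
      }
  }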


Reducing Life Cycle
  • Three main steps - copy, sort and merge.
  • A reduce-phase task initially fetches the map outputs from the local file systems of the map tasks.
  • All of the map-phase results assigned to the reducer are fetched and copied.
  • The copied results are sorted by key.
  • A single, merged set of sorted key-value pairs is created.
  • The reduce function is applied to the consolidated key-value pairs.
  • The final results, typically smaller than the input data, are stored on the HDFS file system (a sample reduce function is sketched below).
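And the matching reduce step, again as a hedged sketch using the classic API (WordCountReducer is the illustrative name used by the driver above):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

      // Receives one key (a word) and an iterator over every count emitted for it
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
          int sum = 0;
          while (values.hasNext()) {
              sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));   // emit (word, total count)
      }
  }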


Understanding the MapReduce Data Flow

  • Inputs to a MapReduce job generally come from files that have been uploaded on HDFS.
  • Files in HDFS are split into blocks that are replicated across the data nodes.
  • Running a MapReduce job may involve running mapping tasks on many data nodes in the Hadoop cluster.
  • Each mapping task running under the task tracker of a data node is identical; mappers are not assigned "special" tasks.
  • Each mapper is generic and can process any file.
  • Each mapper takes the files to be mapped, saves them locally, and processes them.
  • After the mapping phase is completed, the mapped key-value pairs must be exchanged between data nodes so that all values for a given key are sent to a single reducer.
  • The reduce tasks are allocated across the same data nodes. This exchange is the only time data nodes communicate with each other.
  • Mappers perform the first real step.
  • Mappers start with a key-value pair
  • Each Mapper runs in its own instance known as a map task
  • Mappers do not share mapping tasks
  • When mapping tasks are complete, the job is sent to the Reduce phase.

  • Individual mapping tasks do not communicate with each other.
  • The client never passes data from one machine to another. All data movement is handled by the Hadoop framework through the job tracker on the name node.
  • The status of the job can be reliably monitored.
  • Hadoop is always monitoring the status of the cluster as it relates to the running MapReduce job.
  • Hadoop is able to route processing away from data nodes if they become unreachable.


  • In the Reduce phase, a single instance of each reducer is instantiated for each reduce task.
  • The key value pairs are assigned to a reducer.
  • The reduce method receives a key and an iterator over all the values associated with that key.
  • Each reducer also receives OutputCollector and Reporter objects.



Contributor: Venkat Mogili











