
Sunday, February 12, 2017

Big Data Hadoop

Hadoop is open-source software for data management, overseen by the Apache Software Foundation (commercial distributions such as Cloudera's are built on top of it). It is meant for managing huge data sets, not small ones. Why not small data sets? Well, if there is a 10 minute project and you distribute it to 10 people, it will actually take more time because of the coordination overhead. So this is meant for big data, as the name itself suggests.
It is an open-source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on a cluster (or clusters) of commodity hardware.
Hadoop has 2 core components:
1. HDFS - for storing data
2. MapReduce - for processing that data


HDFS
HDFS (Hadoop Distributed File System) is a file system designed specially for storing huge data sets with a Streaming Access Pattern (SAP), mostly on commodity hardware. One thing to remember is that it is "write once, read any number of times, but do not try to change it once it is written to HDFS". Now why HDFS and not any other file system? HDFS has a default block size of 64 MB, and if you store a file of just 10 MB, the other 54 MB does not get wasted, unlike in many other file systems; the remaining disk space stays available for other data. You can change this block size to 128 MB if you like.
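As a quick sketch of how that block size can be set from a client program (using the standard org.apache.hadoop.fs.FileSystem API; the path and sizes below are illustrative assumptions, not anything from a real cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // create() has an overload that takes replication and block size explicitly:
            // here we ask for a 128 MB block size instead of the old 64 MB default.
            FSDataOutputStream out = fs.create(new Path("/demo/sample.txt"),
                    true,                    // overwrite if it exists
                    4096,                    // buffer size in bytes
                    (short) 3,               // replication factor
                    128L * 1024 * 1024);     // block size in bytes
            out.writeBytes("hello hdfs\n");
            out.close();
            fs.close();
        }
    }

The cluster-wide default comes from dfs.blocksize (dfs.block.size in older releases) in hdfs-site.xml; the per-file override above is only to show that the block size is a property of the file, not of the disk.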
Hadoop has 5 major services, or daemons, or nodes:
1. name node
2. secondary name node
3. job tracker
4. data node
5. task tracker
Here 1, 2, 3 are master services and 4, 5 are slave services. All master daemons can talk to each other. All slave daemons can talk to each other, but master-to-slave communication happens only between name node - data node and job tracker - task tracker.

Which daemon can talk to which (NA = itself):

                      | name node | secondary name node | job tracker | data node | task tracker
name node             | NA        | yes                 | yes         | yes       | no
secondary name node   | yes       | NA                  | yes         | no        | no
job tracker           | yes       | yes                 | NA          | no        | yes
data node             | yes       | no                  | no          | NA        | yes
task tracker          | no        | no                  | yes         | yes       | NA

When a client wants to store data on HDFS, it contacts the name node and requests a place to store this data. The name node creates metadata (an inventory of the data, or data about data) for it and tells the client on which systems of the Hadoop cluster it can store the data. The client then contacts those data nodes directly and distributes the data across them in 64 MB blocks.
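From the client side, that whole exchange looks like a couple of lines of Java; a minimal sketch, assuming the client machine has the cluster's configuration files on its classpath and with made-up local and HDFS paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutFileDemo {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS in core-site.xml points the client at the name node
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The client asks the name node where to write, then streams the blocks
            // to the chosen data nodes directly.
            fs.copyFromLocalFile(new Path("/tmp/local-input.txt"),
                                 new Path("/user/demo/input.txt"));
            fs.close();
        }
    }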
What about redundancy?
Well, HDFS by default keeps 3 copies of each block (the original plus 2 replicas), so 3 copies of your data are available for you to access. The name node maintains the metadata about where all these blocks are stored, and all the data nodes report their stored blocks to the name node at a regular interval, just like a heartbeat. If the name node has not received a block report for a certain block, a new copy of that data block is created on one of the data nodes.
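You can see that metadata for yourself with a small Java sketch like the one below (the file path is hypothetical), which asks the name node for the replication factor and for the data nodes holding each block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicaDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/input.txt");   // hypothetical file

            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication factor: " + status.getReplication());

            // For each block, the name node's metadata tells us which data nodes hold a copy
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }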

What if the metadata is lost?
Well, your data is lost forever. It is important to make sure that your name node runs on highly reliable hardware and is as failproof as it can get. The name node is a single point of failure (SPOF), so maintaining its fault tolerance is very important; something like VMware's FT comes to my mind here.

MapReduce
Now MapReduce takes care of the data processing. You create a job (a script or program) to access your data, and the job tracker tracks it. The job tracker takes this request to the name node, and the name node returns the relevant information from its metadata: which blocks the job is trying to access and where they are stored. The job tracker then assigns the task to a task tracker. The task tracker does the actual work of accessing the data as per the program you have written. This process is called a map, because you are mapping the data rather than pulling it all to one place to work on it. There is also something called an input split: if you input 100 MB of data, it is split into 2 (because a block is 64 MB), so you have 2 input splits. If your program accesses this data, there will be exactly 2 maps: number of input splits = number of maps.
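To make the map side concrete, here is a minimal word-count style Mapper sketch using Hadoop's Java MapReduce API (class and variable names are my own, not from this post); the framework starts one Mapper per input split and calls map() once per line:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One Mapper runs per input split; map() is called once per record (here, per line).
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1) for the reducer
                }
            }
        }
    }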
What happens if there is a lost block?
The task tracker will inform the job tracker that it is unable to access the block, and the job tracker will assign the work to another task tracker, which accesses the data from another data node (remember that there are 3 copies of the data by default), and then we have a new map. All the task trackers send a heartbeat (every 3 seconds) back to the job tracker to let it know that they are alive and working. If a task tracker hasn't sent its heartbeat to the job tracker for 30 seconds, the same task is handed over to another task tracker on another data node. Please note that the job tracker is smart enough to balance the load of these jobs across the task trackers; whichever task tracker delivers faster gets more work. In the case of our 100 MB input, all these maps are finally combined and reduced to one output, and thus the name MapReduce. The process or service which does this reduction of maps (mappers) into one is called the reducer; number of outputs = number of reducers. The reducer can run on any of the nodes. Once the reducer generates the final output, it is written back to HDFS and the name node updates its metadata with the output file and its location. The client which requested this output reads that metadata to learn the whereabouts of the final output file, contacts that data node directly, and gets the output.
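The reduce side, plus the driver that ties the whole job together, might look like the sketch below (it reuses the hypothetical WordCountMapper from above, and the input/output paths are made up); the reducer's output is what gets written back to HDFS as the final result:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountJob {

        // The reducer combines all the values the mappers emitted for one key.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));  // final output, written back to HDFS
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);      // mapper from the earlier sketch
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/input.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }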
What if the job tracker is lost or dead?
Even though it is a SPOF (single point of failure), you don't have to worry about data loss, since the job will be disrupted but not the data. It is important to have a redundant server for it, but not mandatory.