Wednesday, July 03, 2013

Notes Hadoop

Notes Hadoop
=============

References
Famous Websites and their Big Data
Hadoop Ecosystem
MapReduce
Pig Latin
Hive
Twitter Case Study




References
===========
http://searchcloudcomputing.techtarget.com/definition/Hadoop
http://radar.oreilly.com/2011/01/what-is-hadoop.html
http://www.computerworld.com/s/article/9224180/What_s_the_big_deal_about_Hadoop_?taxonomyId=9&pageNumber=3


Famous Websites and their Big Data
===================================
Facebook - data analytics build around Hive
LinkedIN - infrastructure build around Hadoop


Hadoop Ecosystem (See NotesHadoop)
===================================
Big Data - 3Vs (Volume, Velocity, Variety)

Hadoop

Hadoop Streaming - enables user to write Map function and Reduce function, in any language they want. This middleware component will make these functions work under the Hadoop ecosystem.

Sqoop - JDBC-based Hadoop to DB data movement facility. Can transfer from RDBMS to HDFS. Can transfer from HDFS to RDBMS.
- Use Case - Archiving Old Data. Using Sqoop, data from RDBMS can be easily push to Hadoop clusters. Storing data in Hadoop instead of using Tape archives is more cost effective, provide fast access when needed, use one single technology for old and new data hence single know-how.

Hive - Enables users to use SQL to operate on Hadoop data. Hive contains only a subset of standard SQL. May also be used to perform SQL joins with tables from different DB systems, eg on table from MySQL with another table from DB2 or even Spreadsheets.

Pig - "Apache Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the MapReduce Platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset." "instead of writing a separate MapReduce application, you can write a single script in Pig Latin that is automatically parallelized and distributed across a cluster. "

Fuse - a middleware that allow users to access HDFS using standard file system commands (in Linux).

Flume-ng (next generation) - enable a load ready file to be prepared and then transferred to RDBMS using the RDBMS high speed loaders. The functionality is covered by Sqoop.

Oozie - chaing together multiple Hadoop jobs.

HBASE - high performance key-value store.

All Open Source. Most of the components are Java based, does not mean users need to program in Java.


MapReduce (See NotesHadoop)
=============
- Message passing, data parallel, pipelined work. Higher level compared to traditional Shared Memory or Distributed Message Passing paradigms.
- programmer need to specify only Mapper and Reducer. Message passing handled by the implementation itself.


Pig Latin
==========
http://www.ibm.com/developerworks/library/l-apachepigdataquery/#list1
http://pig.apache.org/docs/r0.7.0/tutorial.html#Pig+Tutorial+File


Hive
=====
Ref: [1] http://hive.apache.org/docs/r0.9.0/

What is Hive?
- Is a data warehouse infrastructure built on top of Hadoop.
- Provides tools to enable easy data ETL, (Extract, Transform, Load)
- put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files.
- HiveQL easy for people familiar with SQL.
- Enable MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
- Hive does not mandate read or written data be in the "Hive format"---there is no such thing. Hive works equally well on Thrift, control delimited, or your specialized data formats.

What Hive is NOT?
- Based on Hadoop, which is a batch processing system, Hive does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real-time queries. In contrast to the systems such as Oracle where analysis is run on a significantly smaller amount of data, but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes, Hive queries response times for even the smallest jobs can be of the order of several minutes. However for larger jobs (e.g., jobs processing terabytes of data) in general they may run into hours.

In summary, low latency performance is not the top-priority of Hive's design principles. What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats."


Twitter Case Study
===================
Ref: "Large-Scale Machine Learning at Twitter"; Jimmy Lin and Alek Kolcz

Hadoop - at the core of the infrastructure.
Hadoop Distributed File System (HDFS) - data from other DBs, application logs, etc are written real time or batch processed into HDFS.
Pig - analytics done using Pig, which is a high-level dataflow language. It compiles the Pig script into physical plans and executed on Hadoop.




No comments: