Thursday, September 19, 2013

What are the domains we do projects on (BFSI)?

BFSI: 

       Banking, Financial Services and Insurance (BFSI) is an industry term for companies that provide a range of financial products and services, such as universal banks.
       Banking may include core banking, retail, private, corporate, investment, cards and the like. Financial services may include stock broking, payment gateways, mutual funds, etc. Insurance covers both life and non-life products.
       The term is commonly used by information technology (IT), IT-enabled services (ITES) and business process outsourcing (BPO) companies, and by technical/professional services firms that manage data processing, application testing and software development activities in this domain.

UDF in Hive..


Writing a UDF:

      To illustrate the process of writing and using a UDF, we'll write a simple UDF to trim
characters from the ends of strings. Hive already has a built-in function called trim, so
we'll call ours strip. The code for the Strip Java class is shown below.

A UDF for stripping characters from the ends of strings

package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {

  // Reused output object, so we don't create a new Text for every call.
  private Text result = new Text();

  // Strips whitespace from both ends of the input string.
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // Strips any of the characters in stripChars from both ends of the input string.
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

A UDF must satisfy the following two properties:

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.

To use the UDF in Hive, we need to package the compiled Java class in a JAR file and register the file with Hive:

ADD JAR /path/to/hive-examples.jar;

We also need to create an alias for the Java classname:

CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';

The TEMPORARY keyword here highlights the fact that UDFs are defined only for the duration of the Hive session (they are not persisted in the metastore). In practice, this means you need to add the JAR file and define the function at the beginning of each script or session.
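
One convenient way to do this is to put both statements in a $HOME/.hiverc file, which the Hive CLI runs at the start of every session. A minimal sketch (the JAR path is illustrative):

ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';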

Note:

As an alternative to calling ADD JAR, you can specify, at launch time, a path where Hive looks for auxiliary JAR files to put on its classpath (including the MapReduce classpath). This technique is useful for automatically adding your own library of UDFs every time you run Hive.

There are two ways of specifying the path. Either pass the --auxpath option to the hive command:

% hive --auxpath /path/to/hive-examples.jar

or set the HIVE_AUX_JARS_PATH environment variable before invoking Hive. The auxiliary path may be a comma-separated list of JAR file paths or a directory containing JAR files.

Alternatively, you can set the hive.aux.jars.path property in $HIVE_HOME/conf/hive-site.xml. Either way, you need to do this before starting Hive.
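
A sketch of the corresponding hive-site.xml entry (the path is illustrative):

<property>
    <name>hive.aux.jars.path</name>
    <value>file:///path/to/hive-examples.jar</value>
</property>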


The UDF is now ready to be used, just like a built-in function:

hive> SELECT strip(' bee ') FROM dummy;
bee

hive> SELECT strip('banana', 'ab') FROM dummy;
nan

Notice that the UDF’s name is not case-sensitive:

hive> SELECT STRIP(' bee ') FROM dummy;
bee

Analysing JSON (JavaScript Object Notation) Document in Hive


Analysing JSON (JavaScript Object Notation) Document:

       You can use TEXTFILE as the input and output format, and then parse each JSON document as a record, either with a JSON SerDe or, as in the examples below, with the built-in get_json_object() function.

Example 1:

1: Create a test file in Json format

$ cat> jsont.txt

{"a" :10, "b" :11, "c" :15}
{"a" :20, "b" :21, "c" :25}
{"a" :30, "b" :31, "c" :35}
{"a" :40, "b" :41, "c" :45}
{"a" :50, "b" :51, "c" :55}
{"a" :60, "b" :61, "c" :65}
^d

2: Create a hive table

hive> create table jsont1(str string);

3: Load data into the Hive table from the local JSON file

hive> load data local inpath 'jsont.txt' into table jsont1;

4: Create another table to extract the json data

hive>create table jsont2(a int, b int, c int);

5: Insert jsont1 table data into jsont2 table

hive>insert overwrite table jsont2 select get_json_object(str, '$.a'), get_json_object(str, '$.b'), get_json_object(str, '$.c') from jsont1;
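
To verify the extraction, run a quick query against the new table; the rows should mirror the a, b and c values in jsont.txt:

hive> select * from jsont2;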



Example2 :

  1. $ cat > jsonex.txt

{ "top" : [
{"table":"user",
"data":{
"name":"John Doe","userid":"2036586","age":"74","code":"297994","status":1}},
{"table":"user",
"data":{
"name":"Mary Ann","userid":"14294734","age":"64","code":"142798","status":1}},
{"table":"user",
"data":{
"name":"Carl Smith","userid":"13998600","age":"36","code":"32866","status":1}},
{"table":"user",
"data":{
"name":"Anil Kumar":"2614012","age":"69","code":"208672","status":1}},
{"table":"user",
"data":{
"name":"Kim Lee","userid":"10471190","age":"53","code":"79365","status":1}}
]}
^d

  2. CREATE TABLE user (line string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;

  3. LOAD DATA LOCAL INPATH 'jsonex.txt' OVERWRITE INTO TABLE user;

  4. SELECT get_json_object(col0, '$.name') as name, get_json_object(col0, '$.userid') as uid,
get_json_object(col0, '$.age') as age, get_json_object(col0, '$.code') as code,
get_json_object(col0, '$.status') as status
FROM
(SELECT get_json_object(user.line, '$.data') as col0
FROM user
WHERE get_json_object(user.line, '$.data') is not null) temp;

Note that get_json_object parses one record (one line) at a time, so for this query to work each {"table":...,"data":{...}} object must sit on a single line of the input file.



Note: A path like $.user.id means: take each record (represented by $), find the user key, which is assumed to be a JSON map, and extract the value of the id key inside it. That value can then be used to populate, say, a user_id column.
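
A minimal sketch of such a nested lookup, assuming a hypothetical table tweets(line string) whose rows hold JSON like {"user":{"id":2036586,"name":"John Doe"}}:

hive> select get_json_object(line, '$.user.id') as user_id from tweets;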

Analyzing XML Data in Hive.. 9/10


Analyzing XML Data: 

  1. Create the XML data file in the local directory

$cat>xmltestfile.txt
<emp><ename>Kiran</ename><sal>10000</sal></emp>
<emp><ename>Seshu</ename><sal>20000</sal></emp>
<emp><ename>Ramu</ename><sal>30000</sal></emp>
<emp><ename>Rama</ename><sal>40000</sal></emp>
<emp><ename>Srinu</ename><sal>50000</sal></emp>
<emp><ename>Ravi</ename><sal>60000</sal></emp>
<emp><ename>Sandhya</ename><sal>70000</sal></emp>
^d

  2. Create a Hive table

hive> create table xmldata(str string);

  3. Load data into the Hive table from the local XML file

hive> load data local inpath 'xmltestfile.txt' into table xmldata;

  4. Create another table to extract the XML data

hive> create table xmld1(ename array<string>, sal array<string>);

  5. Insert data from the xmldata table into the xmld1 table

hive> insert overwrite table xmld1 select xpath(str, 'emp/ename/text()'), xpath(str, 'emp/sal/text()') from xmldata;

  6. Create another table to convert the data from array type to scalar columns

hive> create table xmld2(ename string, sal int);

  7. Insert the data into xmld2 from xmld1

hive> insert overwrite table xmld2 select ename[0], sal[0] from xmld1;
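
A quick check of the final table; the rows should mirror the names and salaries in xmltestfile.txt:

hive> select * from xmld2;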


XPath UDFs:

Hive ships with a family of XPath UDFs for pulling values out of XML stored in strings: xpath (which returns an array of strings), xpath_string, xpath_boolean, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double and xpath_number. Each takes an XML string as its first argument and an XPath expression as its second.

XPath expressions:

The expression argument uses standard XPath syntax, for example 'a/b/text()' to select text nodes, '//@id' to select id attributes, and predicates such as 'a/*[@class="bb"]'.

Ex:


hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>', '//@id')
    > FROM src LIMIT 1;

["foo","bar"]

hive> SELECT xpath('<a><b class="bb">b1</b><b>b2</b><b>b3</b><c class="bb">c1</c><c>c2</c></a>', 'a/*[@class="bb"]/text()')
    > FROM src LIMIT 1;

["b1","c1"]

hive> SELECT xpath_double('<a><b>2</b><c>4</c></a>', 'a/b + a/c')
    > FROM src LIMIT 1;

6.0


Weblog Data Analysis..9/10

Weblog data analysis: 


Input text: hivelog.txt

89.151.85.133 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/127Hours HTTP/1.1" 200 766
212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/BlackSwan HTTP/1.1" 200 766
74.125.113.104 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TheFighter HTTP/1.1" 200 766
212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/Inception HTTP/1.1" 200 766
127.0.0.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TrueGrit HTTP/1.1" 200 766
10.0.12.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/WintersBone HTTP/1.1" 200 766


hive> CREATE TABLE hive_log (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
) STORED AS TEXTFILE;



Then load the data from the log file into the Hive table:

hive> load data local inpath 'hivelog.txt' into table hive_log;


A quick test will tell you if the data’s being correctly handled by the SerDe. Since the RegexSerDe class is part of the Hive contrib, you’ll need to register the JAR so that it’s copied into the distributed cache and can be loaded by the MapReduce tasks:

hive> add jar $HIVE_HOME/lib/hive-contrib-0.7.1-cdh3u2.jar;
hive> SELECT host, request FROM hive_log LIMIT 10;

89.151.85.133 "GET /movie/127Hours HTTP/1.1"
212.76.137.2 "GET /movie/BlackSwan HTTP/1.1"
74.125.113.104 "GET /movie/TheFighter HTTP/1.1"
212.76.137.2 "GET /movie/Inception HTTP/1.1"
127.0.0.1 "GET /movie/TrueGrit HTTP/1.1"
10.0.12.1 "GET /movie/WintersBone HTTP/1.1"

If you’re seeing nothing but NULL values in the output, it’s probably because you have
a missing space in your regular expression.





Hive Storage formats . 9/10

Storage Formats:
------------------------

There are two dimensions that govern table storage in Hive: the row format and the file format. The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe (a portmanteau of Serializer-Deserializer).

When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data.

The default storage format: Delimited text

When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text, with a row per line.

The default field delimiter is Control-A (the octal form of the delimiter character can also be used: \001 for Control-A).
The default collection item delimiter is Control-B.
The default map key delimiter is Control-C.
Rows in a table are delimited by a newline character.


Example:

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601
Todd Jones^A70000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700
Bill King^A60000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100

(Each record is one line; the lines above may wrap on screen.)

Here is what the first record looks like in JavaScript Object Notation (JSON), with the field names from the table schema inserted:
{
  "name": "John Doe",
  "salary": 100000.0,
  "subordinates": ["Mary Smith", "Todd Jones"],
  "deductions": {
    "Federal Taxes": .2,
    "State Taxes": .05,
    "Insurance": .1
  },
  "address": {
    "street": "1 Michigan Ave.",
    "city": "Chicago",
    "state": "IL",
    "zip": 60600
  }
}
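
A table declaration matching this layout, with the default delimiters spelled out explicitly, might look like the following sketch (the employees table name and column names are illustrative, taken from the JSON above):

hive> CREATE TABLE employees (
        name         STRING,
        salary       FLOAT,
        subordinates ARRAY<STRING>,
        deductions   MAP<STRING, FLOAT>,
        address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;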


Note:
Binary SerDes should not be used with the default TEXTFILE format (whether implicit or explicitly requested with a STORED AS TEXTFILE clause). There is always the possibility that a binary row will contain a newline character, which would cause Hive to truncate the row and fail at deserialization time.

What data types does Hive support?

Data Types:

Hive supports the usual set of primitive data types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY and TIMESTAMP.

Collection Data Types:

Hive also supports three collection data types: ARRAY (an ordered list of values of one type), MAP (key-value pairs) and STRUCT (a record with named fields), as used by the subordinates, deductions and address columns in the example above.




Sunday, September 15, 2013

hadoop-tutorial-hbase-part-6-key-design

hadoop-tutorial-hbase-part-5-java-client-api-advanced-topics

hadoop-tutorial-hbase-part-4-java-admin-api

hadoop-tutorial-hbase-part-3-java-client-api

hadoop-tutorial-hbase-part-2-installation-and-shell

The presentation below is very good for learning HBase (part 2).



hadoop-tutorial-hbase-part-1-overview

The presentation below is very good for understanding HBase.



Thursday, September 12, 2013

Installation of Hive in Single Node Hadoop Cluster Machine..

Hive Installation: 

Hi all,

Here, I am going to show you how to install Hive on a single-node Hadoop cluster using a tarball (an offline installation).


Prerequisites:

- Hadoop must be installed. To check, type " $ echo $HADOOP_HOME ".
- If HADOOP_HOME is not set, set it now, because Hive and other Hadoop-related applications look for this variable on the machine.

Note: Commands and file names in Linux are case sensitive, so be careful while typing and while adding environment variables to .bashrc or .profile.

Here I am using 'hduser' as the default user to run the Hadoop cluster, so I install Hive as this user. Don't be confused by the different user names (hduser, nagarjuna, sudo).

First, download a stable version of Hive from the Apache website, and make sure it matches the installed Hadoop version; otherwise you may run into errors.

Installation Steps: 

Here I am using Hive version 0.9.0 with a Hadoop 1.0.4 cluster.
Download hive-0.9.0-bin.tar.gz (the binary tarball), not hive-0.9.0.tar.gz (the source release).

- I have copied the hive-0.9.0-bin.tar.gz file to /usr/local/. Observe that, by default, no user other than root/sudo has permission to access it.
- So give permission to this file using the chmod command with sudo, as shown below.



Then extract the tar file using the tar command. hduser may not have permission to extract into that folder, so use sudo to extract, as shown below.
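
A sketch of these two steps, assuming the tarball is at /usr/local/hive-0.9.0-bin.tar.gz (the exact chmod mode is illustrative):

$ cd /usr/local
$ sudo chmod 644 hive-0.9.0-bin.tar.gz
$ sudo tar -xzf hive-0.9.0-bin.tar.gz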

Now you will see a hive-0.9.0-bin folder under /usr/local.

So far we have only extracted the Hive tar.gz file to a location with the required permissions.
Now we have to set environment variables to run Hive.

- We need the HIVE_HOME and PATH environment variables.
- Here we use user-level variables, by placing a few lines of bash in the .bashrc or .profile file under hduser's home directory (these files are hidden).

- You can edit these files with whichever editor you prefer; I use nano or gedit.

For example, add a small script to the end of the .bashrc or .profile file.
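
A sketch of what that script might look like, assuming Hive was extracted to /usr/local/hive-0.9.0-bin:

# Hive environment variables
export HIVE_HOME=/usr/local/hive-0.9.0-bin
export PATH=$PATH:$HIVE_HOME/bin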


Now log out and log back in as hduser (re-login). Then check $HIVE_HOME (for example with echo $HIVE_HOME); if it shows the Hive home directory, Hive is ready to use.



Note:
   The hive shell will start even when Hadoop is not running, but to run any SQL queries in the hive shell you must have Hadoop running; otherwise you will get connection errors.



Please let me know if there are any mistakes in this post; your feedback is welcome.
contact at nagarjuna.lingala@gmail.com

You can also find me at javaojava.blogspot.com.

Wednesday, September 11, 2013

What does the Secondary NameNode do? And what is it used for?


Secondary NameNode :


- The Secondary NameNode is not a hot backup of the NameNode, and it cannot take over in the event of a NameNode failure.

- This daemon periodically checkpoints the NameNode's filesystem metadata. During the synchronization process, the Secondary NameNode retrieves the current namespace image and edit logs from the NameNode, merges them together, and then sends the merged image back to the NameNode.


What is Data Locality?

Data Locality :

- Applications using HDFS can achieve high throughput because the Hadoop framework was designed to move computation to the data.

- Applications can run on the nodes where the data resides instead  of moving the data to the applications.

Tuesday, September 10, 2013

Important things about TaskTracker and mapred-site.xml configuration....

 
Tasks in TaskTracker: 
 
    For each input split, a map task is created that runs the user-supplied map function on
each record in the split. Map tasks are executed in parallel. This means each chunk of
the input dataset is being processed at the same time by various machines that make
up the cluster. It’s fine if there are more map tasks to execute than the cluster can handle.
They’re simply queued and executed in whatever order the framework deems best.
The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs.
The input format is responsible for turning each record into its key-value pair representation.
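
As a minimal sketch (not taken from this post; the class name and tokenization are illustrative), a map function written against the org.apache.hadoop.mapreduce API that emits one intermediate (word, 1) pair per word of each input line might look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself
// (produced by the input format). Output: zero or more (word, 1) pairs.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // zero or more intermediate pairs per input record
      }
    }
  }
}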
 
 
 
There is always a single tasktracker on each worker node.
Both tasktrackers and datanodes run on the same machines, which makes each node
both a compute node and a storage node. Each tasktracker is configured
with a specific number of map and reduce task slots that indicate how many of each
type of task it is capable of executing in parallel. A task slot is exactly what it sounds
like; it is an allocation of available resources on a worker node to which a task may be
assigned, in which case it is executed. A tasktracker executes some number of map
tasks and reduce tasks in parallel, so there is concurrency both within a worker where
many tasks run, and at the cluster level where many workers exist. Map and reduce
slots are configured separately because they consume resources differently.
It is common for tasktrackers to allow more map tasks than reduce tasks to execute in parallel.
Upon receiving a task assignment from the jobtracker, the tasktracker executes an
attempt of the task in a separate process.
 
 
Difference between Task and Task attempt (Task instance):
 
- A task is the logical unit of work, while a task attempt is a specific, physical instance 
of that task being executed.
 


mapred-site.xml
 
<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>