HADOOP: Weblog Data Analysis..9/10

Weblog data analysis:

Input text: hivelog.txt

89.151.85.133 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/127Hours HTTP/1.1" 200 766

212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/BlackSwan HTTP/1.1" 200 766

74.125.113.104 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TheFighter HTTP/1.1" 200 766

212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/Inception HTTP/1.1" 200 766

127.0.0.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TrueGrit HTTP/1.1" 200 766

10.0.12.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/WintersBone HTTP/1.1" 200 766

hive> CREATE TABLE hive_log (

host STRING,

identity STRING,

user STRING,

time STRING,

request STRING,

status STRING,

size STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" =

"([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",

"output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s"

) STORED AS TEXTFILE;

Then you load the data from the log file to hive table:

Hive>load data local inpath ‘hivelog.txt ‘ into table hive_log;

A quick test will tell you if the data’s being correctly handled by the SerDe. Since the RegexSerDe class is part of the Hive contrib, you’ll need to register the JAR so that it’s copied into the distributed cache and can be loaded by the MapReduce tasks:

hive> add jar $HIVE_HOME/lib/hive-contrib-0.7.1-cdh3u2.jar;

hive> SELECT host, request FROM hive_logs LIMIT 10;

89.151.85.133 "GET /movie/127Hours HTTP/1.1"

212.76.137.2 "GET /movie/BlackSwan HTTP/1.1"

74.125.113.104 "GET /movie/TheFighter HTTP/1.1"

212.76.137.2 "GET /movie/Inception HTTP/1.1"

127.0.0.1 "GET /movie/TrueGrit HTTP/1.1"

10.0.12.1 "GET /movie/WintersBone HTTP/1.1"

If you’re seeing nothing but NULL values in the output, it’s probably because you have

a missing space in your regular expression.

HADOOP

Thursday, September 19, 2013

Weblog Data Analysis..9/10

No comments:

Post a Comment