Thursday, September 19, 2013

Weblog Data Analysis..9/10

Weblog data analysis: 


Input text: hivelog.txt

89.151.85.133 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/127Hours HTTP/1.1" 200 766
212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/BlackSwan HTTP/1.1" 200 766
74.125.113.104 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TheFighter HTTP/1.1" 200 766
212.76.137.2 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/Inception HTTP/1.1" 200 766
127.0.0.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/TrueGrit HTTP/1.1" 200 766
10.0.12.1 - - [23/Jun/2009:10:39:11 +0300] "GET /movie/WintersBone HTTP/1.1" 200 766


hive> CREATE TABLE hive_log (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" =
"([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
) STORED AS TEXTFILE;



Then you load the data from the log file to hive table:

Hive>load data local inpath ‘hivelog.txt ‘ into table hive_log;


A quick test will tell you if the data’s being correctly handled by the SerDe. Since the RegexSerDe class is part of the Hive contrib, you’ll need to register the JAR so that it’s copied into the distributed cache and can be loaded by the MapReduce tasks:

hive> add jar $HIVE_HOME/lib/hive-contrib-0.7.1-cdh3u2.jar;
hive> SELECT host, request FROM hive_logs LIMIT 10;

89.151.85.133 "GET /movie/127Hours HTTP/1.1"
212.76.137.2 "GET /movie/BlackSwan HTTP/1.1"
74.125.113.104 "GET /movie/TheFighter HTTP/1.1"
212.76.137.2 "GET /movie/Inception HTTP/1.1"
127.0.0.1 "GET /movie/TrueGrit HTTP/1.1"
10.0.12.1 "GET /movie/WintersBone HTTP/1.1"

If you’re seeing nothing but NULL values in the output, it’s probably because you have
a missing space in your regular expression.





No comments:

Post a Comment