Thursday, September 19, 2013

UDF in Hive


Writing a UDF:

      To illustrate the process of writing and using a UDF (user-defined function), we’ll write
a simple UDF to trim characters from the ends of strings. Hive already has a built-in function
called trim, so we’ll call ours strip. The code for the Strip Java class is shown in the listing below.

A UDF for stripping characters from the ends of strings

package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  // Reused across calls so a new Text object is not allocated for every row.
  private Text result = new Text();

  // Strips leading and trailing whitespace.
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // Strips any of the characters in stripChars from both ends of the string.
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

A UDF must satisfy the following two properties:

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.
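
The evaluate() method is not declared by any interface, because it may take any number of arguments of any type; Hive finds a matching evaluate() method by introspecting the class. The following is only a rough sketch of that idea in plain Java, not Hive's actual resolver. The class names EvaluateLookupSketch and TwoOverloads are hypothetical, the stand-in class does not extend the real UDF base class, and the lookup matches parameter types exactly, without the type conversions Hive can apply.

```java
import java.lang.reflect.Method;

class EvaluateLookupSketch {
    // Looks up a public evaluate() method whose parameter types exactly match
    // the runtime types of the arguments, then invokes it.
    // Assumption: all arguments are non-null.
    static Object call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not call evaluate()", e);
        }
    }

    public static void main(String[] args) {
        TwoOverloads udf = new TwoOverloads();
        System.out.println(call(udf, " bee "));         // one-arg: bee
        System.out.println(call(udf, "banana", "ab"));  // two-arg: banana / ab
    }
}

// Hypothetical stand-in for a UDF class: plain Java, no Hive dependency.
class TwoOverloads {
    public String evaluate(String str) {
        return "one-arg: " + str.trim();
    }

    public String evaluate(String str, String stripChars) {
        return "two-arg: " + str + " / " + stripChars;
    }
}
```

Calling the sketch with an argument list that matches no overload, such as a single Integer, raises an exception, which loosely mirrors Hive rejecting a function call whose arguments fit none of the evaluate() signatures.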

To use the UDF in Hive, we need to package the compiled Java class in a JAR file and register the file with Hive:

ADD JAR /path/to/hive-examples.jar;

We also need to create an alias for the Java classname:

CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';

The TEMPORARY keyword here highlights the fact that UDFs are defined only for the duration of the Hive session (they are not persisted in the metastore). In practice, this means you need to add the JAR file and define the function at the beginning of each script or session.

Note:

As an alternative to calling ADD JAR, you can specify, at launch time, a path where Hive looks for auxiliary JAR files to put on its classpath (including the MapReduce classpath). This technique is useful for automatically adding your own library of UDFs every time you run Hive.

There are two ways of specifying the path: either by passing the --auxpath option to the hive command:

% hive --auxpath /path/to/hive-examples.jar

or by setting the HIVE_AUX_JARS_PATH environment variable before invoking Hive. The auxiliary path may be a comma-separated list of JAR file paths or a directory containing JAR files.

Alternatively, you can set the hive.aux.jars.path property in $HIVE_HOME/conf/hive-site.xml. Whichever approach you choose, you need to do this before starting Hive.


The UDF is now ready to be used, just like a built-in function:

hive> SELECT strip(' bee ') FROM dummy;
bee

hive> SELECT strip('banana', 'ab') FROM dummy;
nan
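
The result nan follows from how StringUtils.strip treats its second argument: as a set of characters to remove from both ends, not as a literal prefix or suffix. The following plain-Java sketch imitates that behavior for illustration only. The class name StripSemanticsSketch is hypothetical, and the sketch assumes stripChars is non-null (the library itself treats a null stripChars as "strip whitespace").

```java
class StripSemanticsSketch {
    // Removes, from both ends of str, every character that appears in stripChars.
    // Assumption: stripChars is non-null.
    static String strip(String str, String stripChars) {
        if (str == null) {
            return null;
        }
        int start = 0;
        int end = str.length();
        // Advance past leading characters that are in the strip set.
        while (start < end && stripChars.indexOf(str.charAt(start)) >= 0) {
            start++;
        }
        // Back up over trailing characters that are in the strip set.
        while (end > start && stripChars.indexOf(str.charAt(end - 1)) >= 0) {
            end--;
        }
        return str.substring(start, end);
    }

    public static void main(String[] args) {
        // 'b' and 'a' are removed from the front, 'a' from the back.
        System.out.println(strip("banana", "ab")); // nan
    }
}
```

Stripping stops at the first character on each side that is not in the set, so the a in the middle of banana is left alone.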

Notice that the UDF’s name is not case-sensitive:

hive> SELECT STRIP(' bee ') FROM dummy;
bee
