Writing a UDF:
To illustrate the process of writing and using a UDF, we’ll write a simple UDF to trim characters from the ends of strings. Hive already has a built-in function called trim, so we’ll call ours strip. The code for the Strip Java class is shown below.
A UDF for stripping characters from the ends of strings:
package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  // Reused across calls so a new Text is not allocated for every row
  private Text result = new Text();

  // strip(str): remove leading and trailing whitespace
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // strip(str, stripChars): remove any of the characters in stripChars
  // from both ends of str
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
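To experiment with the stripping behavior without a Hive installation, the following is a minimal plain-Java sketch of what the two evaluate() methods are expected to do. It assumes the usual Commons Lang semantics for StringUtils.strip (whitespace when no strip characters are given, otherwise any character from the given set, removed from both ends). The class name StripSketch and its helper shouldStrip are hypothetical and are not part of Hive or Commons Lang.

```java
// Plain-Java sketch of the stripping semantics used by the Strip UDF.
// StripSketch is a hypothetical name; it depends on neither Hive nor Commons Lang.
final class StripSketch {

    // Strip leading and trailing whitespace, like evaluate(Text).
    static String strip(String s) {
        return strip(s, null);
    }

    // Strip any of the characters in stripChars from both ends of s,
    // like evaluate(Text, String). A null stripChars means "whitespace".
    static String strip(String s, String stripChars) {
        if (s == null) {
            return null;  // mirror the UDF: null in, null out
        }
        int start = 0;
        int end = s.length();
        while (start < end && shouldStrip(s.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && shouldStrip(s.charAt(end - 1), stripChars)) {
            end--;
        }
        return s.substring(start, end);
    }

    // Decide whether a single character belongs to the set being stripped.
    private static boolean shouldStrip(char c, String stripChars) {
        return stripChars == null
            ? Character.isWhitespace(c)
            : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("  bee  ") + "]");      // prints [bee]
        System.out.println("[" + strip("banana", "ab") + "]"); // prints [nan]
    }
}
```

Note that the second argument is treated as a set of characters, not as a prefix or suffix: stripping 'ab' from 'banana' removes the leading "ba" and the trailing "a", leaving "nan".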
A UDF must satisfy the following two properties:
1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.
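The second property is a naming convention rather than an interface contract: the UDF base class does not declare evaluate(), so the overload matching a given call has to be found by introspection. The sketch below illustrates that idea with standard Java reflection only. It is not Hive's actual resolver (which also has to deal with Hive type conversions); EvaluateLookupSketch, FakeStrip, evaluateArities and call are hypothetical names used purely for illustration.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: locating evaluate() overloads by reflection.
// All names here are hypothetical; this is not Hive's resolver.
final class EvaluateLookupSketch {

    // Stand-in for a UDF class: two evaluate() overloads, no shared interface.
    static class FakeStrip {
        public String evaluate(String str) {
            return "one argument";
        }

        public String evaluate(String str, String stripChars) {
            return "two arguments";
        }
    }

    // Collect the parameter counts of every public method named "evaluate".
    static List<Integer> evaluateArities(Class<?> udfClass) {
        List<Integer> arities = new ArrayList<>();
        for (Method m : udfClass.getMethods()) {
            if (m.getName().equals("evaluate")) {
                arities.add(m.getParameterCount());
            }
        }
        Collections.sort(arities);
        return arities;
    }

    // Pick the overload whose parameter types match the (non-null) arguments
    // exactly, then invoke it reflectively.
    static Object call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate() overload", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluateArities(FakeStrip.class));        // prints [1, 2]
        System.out.println(call(new FakeStrip(), "banana"));         // prints one argument
        System.out.println(call(new FakeStrip(), "banana", "ab"));   // prints two arguments
    }
}
```

This is why a single UDF class such as Strip can expose both a one-argument and a two-argument form under the same function name.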
To use the UDF in Hive, we need to package the compiled Java class in a JAR file and register the file with Hive:
ADD JAR /path/to/hive-examples.jar;
We also need to create an alias for the Java classname:
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
The TEMPORARY keyword here highlights the fact that UDFs are only defined for the duration of the Hive session (they are not persisted in the metastore). In practice, this means you need to add the JAR file and define the function at the beginning of each script or session.
Note:
As an alternative to calling ADD JAR, you can specify, at launch time, a path where Hive looks for auxiliary JAR files to put on its classpath (including the MapReduce classpath). This technique is useful for automatically adding your own library of UDFs every time you run Hive.
There are two ways of specifying the path. You can pass the --auxpath option to the hive command:
% hive --auxpath /path/to/hive-examples.jar
or you can set the HIVE_AUX_JARS_PATH environment variable before invoking Hive. The auxiliary path may be a comma-separated list of JAR file paths or a directory containing JAR files.
Alternatively, you can set the hive.aux.jars.path property in $HIVE_HOME/conf/hive-site.xml. Whichever method you choose, it must be in place before you start Hive.
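As a rough sketch, a hive-site.xml entry for this property might look like the fragment below. The JAR path is the same placeholder used in the ADD JAR example; the exact value format is an assumption, since some Hive versions expect file:// URIs here rather than plain paths, so check the documentation for your release.

```xml
<!-- Sketch only: goes inside the <configuration> element of hive-site.xml. -->
<!-- The path is a placeholder; some versions may require a file:// URI. -->
<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/hive-examples.jar</value>
</property>
```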
The UDF is now ready to be used, just like a built-in function:
hive> SELECT strip(' bee ') FROM dummy;
bee
hive> SELECT strip('banana', 'ab') FROM dummy;
nan
Notice that the UDF’s name is not case-sensitive:
hive> SELECT STRIP(' bee ') FROM dummy;
bee