Based on this post, Hive 0.12 - collect_list, I am trying to locate Java code to implement a UDAF that will accomplish this or similar functionality, but without a repeating sequence.
For instance, collect_all() returns the sequence a, a, a, b, b, a, c, c, and I would like to have the sequence a, b, a, c returned. Sequentially repeated items would be removed.
Does anyone know of a function in Hive 0.12 that will accomplish this, or has anyone written their own UDAF?
As always, thanks for the help.
I ran into a similar problem a while back. I didn't want to have to write a full-on UDAF, so I did a combo of the Brickhouse collect and my own UDF. Say you have this data:
    id    value
    1     a
    1     a
    1     a
    1     b
    1     b
    1     a
    1     c
    1     c
    1     d
    2     d
    2     d
    2     d
    2     d
    2     f
    2     f
    2     f
    2     a
    2     w
    2     a

My UDF was:
    package com.something;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class RemoveSequentialDuplicates extends UDF {
        public ArrayList<Text> evaluate(ArrayList<Text> arr) {
            // Nothing to compare in a null or empty array; return it unchanged.
            if (arr == null || arr.isEmpty()) {
                return arr;
            }
            ArrayList<Text> newList = new ArrayList<Text>();
            // The first element is always kept.
            newList.add(arr.get(0));
            for (int i = 1; i < arr.size(); i++) {
                String front = arr.get(i).toString();
                String back = arr.get(i - 1).toString();
                // Keep an element only if it differs from the one just before it.
                if (!back.equals(front)) {
                    newList.add(arr.get(i));
                }
            }
            return newList;
        }
    }

and the query was:
    add jar /path/to/jar/brickhouse-0.7.1.jar;
    add jar /path/to/other/jar/duplicates.jar;

    create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
    create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

    select id
         , remove_seq_dups(value_array) as no_dups
    from (
      select id
           , collect(value) as value_array
      from db.table
      group by id ) x

Output:
    1    ["a","b","a","c","d"]
    2    ["d","f","a","w","a"]

As an aside, the built-in collect_list will not necessarily keep the elements of the list in the order they were grouped in; the Brickhouse collect will. Hope this helps.
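If you want to check the de-duplication logic on its own before packaging a jar, something like the following plain-Java sketch should do. It is not the UDF itself: the class name SequentialDedup and the method removeSequentialDuplicates are made up for illustration, and it works on plain String lists rather than Hive's Text type, so no Hive or Hadoop classes are needed on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class SequentialDedup {

    // Hypothetical helper (not part of Hive or Brickhouse): returns a new list
    // in which every item equal to its immediate predecessor is dropped.
    static List<String> removeSequentialDuplicates(List<String> items) {
        List<String> result = new ArrayList<String>();
        if (items == null) {
            return result; // treat a missing list as empty
        }
        String previous = null;
        boolean first = true;
        for (String item : items) {
            // Objects.equals is null-safe, so null elements do not throw.
            if (first || !Objects.equals(previous, item)) {
                result.add(item);
            }
            previous = item;
            first = false;
        }
        return result;
    }

    public static void main(String[] args) {
        // Same values as id 1 in the sample data above.
        List<String> input = Arrays.asList("a", "a", "a", "b", "b", "a", "c", "c", "d");
        System.out.println(removeSequentialDuplicates(input)); // prints [a, b, a, c, d]
    }
}
```

Only neighbouring items are compared, so a value that comes back later (the second "a" here) is kept, which is the behaviour asked for in the question.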