Hadoop/Hive collect_list without repeating items


Based on the post "Hive 0.12 - collect_list", I am trying to locate Java code that implements a UDAF with this or similar functionality, but without repeating sequences.

For instance, collect_all() returns the sequence a, a, a, b, b, a, c, c, and I would like to have the sequence a, b, a, c returned instead. Sequentially repeated items would be removed.

Does anyone know of a function in Hive 0.12 that will accomplish this, or has anyone written their own UDAF?

As always, thanks for the help.

I ran into a similar problem a while back. I didn't want to have to write a full-on UDAF, so I did a combo of the Brickhouse collect and my own UDF. Say you have this data:

    id  value
    1   a
    1   a
    1   a
    1   b
    1   b
    1   a
    1   c
    1   c
    1   d
    2   d
    2   d
    2   d
    2   d
    2   f
    2   f
    2   f
    2   a
    2   w
    2   a

My UDF was

    package com.something;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class RemoveSequentialDuplicates extends UDF {
        public ArrayList<Text> evaluate(ArrayList<Text> arr) {
            ArrayList<Text> newList = new ArrayList<Text>();
            // Guard against a NULL or empty array so arr.get(0) cannot throw.
            if (arr == null || arr.isEmpty()) {
                return newList;
            }
            newList.add(arr.get(0));
            for (int i = 1; i < arr.size(); i++) {
                String front = arr.get(i).toString();
                String back = arr.get(i - 1).toString();

                // Keep an element only when it differs from its predecessor.
                if (!back.equals(front)) {
                    newList.add(arr.get(i));
                }
            }
            return newList;
        }
    }

and my query was

    add jar /path/to/jar/brickhouse-0.7.1.jar;
    add jar /path/to/other/jar/duplicates.jar;

    create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
    create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

    select id
      , remove_seq_dups(value_array) as no_dups
    from (
      select id
        , collect(value) as value_array
      from db.table
      group by id ) x

Output:

    1   ["a","b","a","c","d"]
    2   ["d","f","a","w","a"]
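If you want to sanity-check the collapse step without a Hive session, something like the following should do. It is only a minimal sketch on plain Strings: the class name `CollapseRuns` and the method `collapse` are made up for illustration and are not part of Hive or Brickhouse. It compares each element against the last one kept, which for consecutive duplicates gives the same result as comparing against the previous input element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration only; not part of Hive or Brickhouse.
class CollapseRuns {
    // Keeps the first element of every run of equal neighbours.
    static List<String> collapse(List<String> values) {
        List<String> result = new ArrayList<>();
        if (values == null) {
            return result;
        }
        for (String value : values) {
            // Append only when the value differs from the last one kept.
            if (result.isEmpty()
                    || !Objects.equals(result.get(result.size() - 1), value)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The sequence from the question.
        List<String> input = Arrays.asList("a", "a", "a", "b", "b", "a", "c", "c");
        System.out.println(collapse(input)); // prints [a, b, a, c]
    }
}
```

Note that this only exercises the loop logic; it says nothing about the order in which Hive hands the values to the UDF.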

As an aside, the built-in collect_list will not necessarily keep the elements of the list in the order they were grouped in; the Brickhouse collect will. Hope this helps.

