In Apache Pig, select DISTINCT rows based on a single column
Suppose I have a table like the following, which may or may not contain duplicates for a given field:
    ID   URL
    ---  ------------------
    001  http://example.com/adam
    002  http://example.com/beth
    002  http://example.com/beth?extra=blah
    003  http://example.com/charlie
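I would like to write a Pig script that finds only DISTINCT rows based on the value of a single field. For example, deduplicating the table above by ID should return something like the following: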
    ID   URL
    ---  ------------------
    001  http://example.com/adam
    002  http://example.com/beth
    003  http://example.com/charlie
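For context, Pig's built-in DISTINCT operator compares entire tuples, so it does not help here on its own. A minimal sketch, assuming the table is stored in a tab-delimited file named 'my_table_file' and that both fields are loaded as chararray (the file name and schema are illustrative):

```pig
-- Assumed file name and schema, for illustration only.
my_table = LOAD 'my_table_file' AS (ID:chararray, URL:chararray);

-- DISTINCT removes a row only when every field matches another row,
-- so both 002 rows survive because their URLs differ.
whole_row_distinct = DISTINCT my_table;

DUMP whole_row_distinct;
```

Deduplicating on ID alone therefore needs a grouping step on that field.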
For my purposes, I don't care which of the rows with a duplicated ID (here, 002) is returned.
I found one way to do this, using GROUP BY together with the TOP function:
    my_table = LOAD 'my_table_file' AS (A, B);

    my_table_grouped = GROUP my_table BY A;

    my_table_distinct = FOREACH my_table_grouped {

      -- For each group $0 refers to the group name, (A)
      -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
      -- Here, we take only the first (top 1) row in the bag:

      result = TOP(1, 0, $1);
      GENERATE FLATTEN(result);

    }

    DUMP my_table_distinct;
This yields exactly one row per distinct ID:
    (001,http://example.com/adam)
    (002,http://example.com/beth?extra=blah)
    (003,http://example.com/charlie)
I don't know whether there is a better way, but this works for me. I hope it helps others who are getting started with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
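One detail worth noting about the script above: TOP compares column 0, which is the grouping key and therefore identical for every row in a bag, so the row that comes back is effectively arbitrary. A hedged sketch, assuming you would rather keep the row with the largest value in the second column, is to compare column 1 instead (the relation names are the same illustrative ones as above):

```pig
my_table = LOAD 'my_table_file' AS (A, B);

my_table_by_b = FOREACH (GROUP my_table BY A) {
  -- Assumption: "largest B" is the row we want to keep.
  -- Column index 1 is B; TOP keeps the n tuples ranking highest on it.
  result = TOP(1, 1, my_table);
  GENERATE FLATTEN(result);
}

DUMP my_table_by_b;
```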
You can use FirstTupleFromBag from Apache DataFu (incubating):
    register datafu-pig-incubating-1.3.1.jar
    define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
    my_table_grouped = GROUP my_table BY A;
    my_table_grouped_first_tuple = foreach my_table_grouped generate flatten(FirstTupleFromBag(my_table,null));
I found that you can also do this with a nested grouping and LIMIT:
    my_table = LOAD 'my_table_file' AS (A, B);

    -- Nested foreach grouping generates bags with same A,
    -- limit bags to 1
    my_table_distinct = FOREACH (GROUP my_table BY A) {
      result = LIMIT my_table 1;
      GENERATE FLATTEN(result);
    }

    DUMP my_table_distinct;
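LIMIT on its own keeps whichever row happens to come first in the bag. If it ever matters which duplicate survives, one possible variation is to sort inside the nested block before limiting. This is only a sketch: the chararray types and the choice of ascending order on B are assumptions made for illustration.

```pig
-- Assumed schema: both fields treated as strings.
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_distinct = FOREACH (GROUP my_table BY A) {
  -- Sort each bag by B so the row kept by LIMIT is well defined.
  sorted = ORDER my_table BY B ASC;
  result = LIMIT sorted 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
```

The extra sort costs some work per group, so the plain LIMIT version is likely the better fit when any row per ID will do.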