In Apache Pig, select DISTINCT rows based on a single column

Suppose I have a table like the following, which may or may not contain duplicates for a given field:

ID   URL
---  ------------------------------------
001  http://example.com/adam
002  http://example.com/beth
002  http://example.com/beth?extra=blah
003  http://example.com/charlie

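I want to write a Pig script that keeps only DISTINCT rows based on the value of a single field. For example, filtering the table above by ID should return something like this: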

ID   URL
---  ------------------------------------
001  http://example.com/adam
002  http://example.com/beth
003  http://example.com/charlie

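Pig's GROUP BY operator returns a bag of tuples for each ID, which would work if I knew how to take only the first tuple from each bag (possibly a separate question).

Pig's DISTINCT operator works on whole rows, so in this case all four rows would be considered unique, which is not what I want.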
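
To make that limitation concrete, here is a minimal sketch. The file name and the field names A (the ID) and B (the URL) are assumptions chosen for illustration. Projecting the ID column before DISTINCT does give unique IDs, but the URL is no longer available:

```pig
-- Sketch only: 'my_table_file' and the field names A (ID) and B (URL) are assumed.
my_table = LOAD 'my_table_file' AS (A, B);

-- DISTINCT on the full relation compares whole tuples,
-- so both 002 rows survive because their URLs differ.
all_rows_distinct = DISTINCT my_table;

-- Projecting A first yields one tuple per ID, but the URL column is gone.
ids_only = FOREACH my_table GENERATE A;
unique_ids = DISTINCT ids_only;

DUMP unique_ids;
```

So DISTINCT alone can answer "which IDs exist" but not "one full row per ID".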

For my purposes, I don't care which of the rows with ID 002 is returned.

I found one way to do this, using the GROUP BY and TOP operators:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {
    -- For each group, $0 is the group key (A) and $1 is a bag
    -- holding the group's complete rows: {(A, B), (A, B), ...}.
    -- Here we take only the first (top 1) row in the bag, ranked by column 0.
    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);
}

DUMP my_table_distinct;

This yields one row per distinct ID:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know whether there is a better way, but this works for me. I hope it helps others who are getting started with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
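
A note on the arguments, going by the TOP documentation linked above: they are the number of tuples to return, the index of the column to compare on, and the bag. Column 0 is the grouping key here, so all rows in a bag tie and the row kept for ID 002 is arbitrary. If a predictable choice matters, one option is to compare on column 1 instead. A sketch, assuming both fields are loaded as chararray (the type declarations and the alias by_url are additions for illustration, not part of the script above):

```pig
-- Sketch only: the chararray types are an assumption about the data.
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_grouped = GROUP my_table BY A;

by_url = FOREACH my_table_grouped {
    -- Compare on column 1 (B, the URL) rather than column 0 (A),
    -- so the row kept per group is the one whose URL compares highest.
    result = TOP(1, 1, my_table);
    GENERATE FLATTEN(result);
}

DUMP by_url;
```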

You can use FirstTupleFromBag from Apache DataFu (incubating):

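register datafu-pig-incubating-1.3.1.jar;
define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

-- my_table is assumed to be loaded already with fields (A, B).
my_table_grouped = GROUP my_table BY A;
-- The second argument is the default value returned when a bag is empty.
my_table_grouped_first_tuple = FOREACH my_table_grouped GENERATE FLATTEN(FirstTupleFromBag(my_table, null));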

I found that you can do this with a nested grouping and LIMIT. Using Arel's example:

my_table = LOAD 'my_table_file' AS (A, B);

-- Nested foreach grouping generates bags with same A,
-- limit bags to 1
my_table_distinct = FOREACH (GROUP my_table BY A) {
    result = LIMIT my_table 1;
    GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
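
The nested LIMIT makes no promise about which row of each bag is kept. If that matters, a nested ORDER before the LIMIT is one way to pin it down. A sketch, assuming chararray fields (the type declarations and the aliases by_url and first_row are illustrative, not from the answers above):

```pig
-- Sketch only: the chararray types are an assumption about the data.
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_distinct = FOREACH (GROUP my_table BY A) {
    -- Sort each group's rows by URL, then keep the first one.
    by_url = ORDER my_table BY B ASC;
    first_row = LIMIT by_url 1;
    GENERATE FLATTEN(first_row);
}

DUMP my_table_distinct;
```

With the sample data, this should keep http://example.com/beth for ID 002, since that string sorts before the ?extra=blah variant. The extra sort costs a little per group, so the plain LIMIT remains the simpler choice when any row will do.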