Scala: Recursive method call in Apache Spark

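I am building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (that is, the person at the top of the family tree) of each person in the database.

It is assumed that the first person returned when searching by ID is the correct parent.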

val peopleById = peopleRDD.keyBy(f => f.id)

def findUltimateParentId(personId: String): String = {
  if ((personId == null) || (personId.length() == 0))
    return "-1"

  val personSeq = peopleById.lookup(personId)
  val person = personSeq(0)
  if (person.personId == "0" || person.id == person.parentId) {
    return person.id
  } else {
    return findUltimateParentId(person.parentId)
  }
}

val ultimateParentIds = peopleRDD.foreach(f => f.findUltimateParentId(f.parentId))

It gives the following error:

"Caused by: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063."

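From reading other similar questions, I understand that the problem is that I am calling findUltimateParentId from inside the foreach loop; if I call the method from the shell with a person's ID, it returns the correct ultimate parent id.

However, none of the other suggested solutions work for me, or at least I cannot see how to implement them in my program. Can anyone help?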
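
The error message points at the `lookup` call: it is an RDD action, and here it runs inside the closure passed to `foreach`. One way around this, sketched below under the assumption that the id-to-parent mapping is small enough to fit in driver memory, is to follow the parent links in a plain `Map` so that the recursion never touches an RDD. The object name `DriverSideLookup` is hypothetical, and treating a parent id of "0", an empty parent id, or a self-reference as the top of the tree is an interpretation of the question's code, not something it states outright.

```scala
import scala.annotation.tailrec

object DriverSideLookup {
  // Assumption: a person is at the top of the tree when the parent id is
  // missing, empty, "0", or equal to the person's own id.
  private def isRootMarker(personId: String, parentId: String): Boolean =
    parentId == null || parentId.isEmpty || parentId == "0" || parentId == personId

  // Follows parent links in an ordinary Map. No RDD is involved, so this can
  // be called from the driver or from inside a closure.
  // Assumes the data contains no cycles other than self-references.
  @tailrec
  def findUltimateParentId(parentOf: Map[String, String], personId: String): String =
    if (personId == null || personId.isEmpty) "-1"
    else parentOf.get(personId) match {
      case Some(parentId) if !isRootMarker(personId, parentId) =>
        findUltimateParentId(parentOf, parentId)
      // Unknown ids and root markers both end the climb at the current id.
      case _ => personId
    }
}
```

With Spark, such a map could be built on the driver with `collectAsMap()` on a pair RDD of (id, parentId) and shared with the executors through `sc.broadcast`. For data too large to collect, an approach based on joins is needed instead.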

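Related discussion

If I understand you correctly, here is a solution that works for input of any size (although performance may not be great). It performs N iterations over the RDD, where N is the depth of the "deepest family" (the largest distance from an ancestor to a child) in the input: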

import org.apache.spark.rdd.RDD

// representation of input: each person has an ID and an optional parent ID
case class Person(id: Int, parentId: Option[Int])

// representation of result: each person is optionally attached to its "ultimate" ancestor,
// or None if it had no parent id in the first place
case class WithAncestor(person: Person, ancestor: Option[Person]) {
  // true while the current ancestor still has a parent of its own, i.e. we can climb further
  def hasGrandparent: Boolean = ancestor.exists(_.parentId.isDefined)
}

object RecursiveParentLookup {
  // requested method
  def findUltimateParent(rdd: RDD[Person]): RDD[WithAncestor] = {

    // all persons keyed by id; a val (not a def) so the cached RDD is reused in every iteration
    val byId = rdd.keyBy(_.id).cache()

    // recursive function that "climbs" one generation at each iteration
    def climbOneGeneration(persons: RDD[WithAncestor]): RDD[WithAncestor] = {
      val cached = persons.cache()
      // find which persons can climb further up the family tree
      val haveGrandparents = cached.filter(_.hasGrandparent)

      if (haveGrandparents.isEmpty()) {
        cached // we're done, return result
      } else {
        val done = cached.filter(!_.hasGrandparent) // these are done, we'll return them as-is
        // for those who can: join with persons to find the grandparent and attach it instead of the parent
        val withGrandparents = haveGrandparents
          .keyBy(_.ancestor.get.parentId.get) // grandparent id
          .join(byId)
          .values
          .map { case (withAncestor, grandparent) => WithAncestor(withAncestor.person, Some(grandparent)) }
        // call this method recursively on the result
        done ++ climbOneGeneration(withGrandparents)
      }
    }

    // call recursive method: start by treating each person as its own ancestor, if it has a parent
    climbOneGeneration(rdd.map(p => WithAncestor(p, p.parentId.map(_ => p))))
  }
}
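
To make the "one generation per iteration" idea easier to follow, here is a minimal sketch of the same loop on plain Scala collections, with no Spark involved. It is a model for illustration only, not the Spark implementation: `ClimbModel` and `ultimateAncestors` are hypothetical names, `Person` is redeclared so the block stands alone, and the data is assumed to contain no cycles.

```scala
import scala.annotation.tailrec

final case class Person(id: Int, parentId: Option[Int])

object ClimbModel {
  // Returns (person id -> ultimate ancestor, number of climbing rounds performed).
  def ultimateAncestors(people: Seq[Person]): (Map[Int, Option[Person]], Int) = {
    val byId: Map[Int, Person] = people.map(p => p.id -> p).toMap

    // The parent of a person, if it has a parent id that exists in the input.
    def parentOf(p: Person): Option[Person] = p.parentId.flatMap(byId.get)

    @tailrec
    def climb(current: Map[Int, Option[Person]], rounds: Int): (Map[Int, Option[Person]], Int) = {
      // Stop once no recorded ancestor has a parent of its own.
      val canClimb = current.values.exists(_.exists(a => parentOf(a).isDefined))
      if (!canClimb) (current, rounds)
      else {
        // Replace every ancestor by its parent where one exists (one generation up).
        val next = current.map { case (id, ancestor) =>
          id -> ancestor.map(a => parentOf(a).getOrElse(a))
        }
        climb(next, rounds + 1)
      }
    }

    // Start by treating each person as its own ancestor, if it has a parent id.
    climb(people.map(p => p.id -> p.parentId.map(_ => p)).toMap, 0)
  }
}
```

For a chain such as 3 -> 2 -> 1 the loop performs two climbing rounds, which matches the "N is the deepest family" statement above. In the Spark version each round costs a join, so deep trees are where the performance concern comes from.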

Here is a test to show more clearly how it works:

/**
  * Example input tree:
  *
  *            1             5
  *            |             |
  *      ----- 2 -----       6
  *      |           |
  *      3           4
  *
  */

// Assumes this runs inside a ScalaTest suite (with Matchers mixed in) that provides a SparkContext named sc
val person1 = Person(1, None)
val person2 = Person(2, Some(1))
val person3 = Person(3, Some(2))
val person4 = Person(4, Some(2))
val person5 = Person(5, None)
val person6 = Person(6, Some(5))

test("find ultimate parent") {
  val input = sc.parallelize(Seq(person1, person2, person3, person4, person5, person6))
  val result = RecursiveParentLookup.findUltimateParent(input).collect()
  result should contain theSameElementsAs Seq(
    WithAncestor(person1, None),
    WithAncestor(person2, Some(person1)),
    WithAncestor(person3, Some(person1)),
    WithAncestor(person4, Some(person1)),
    WithAncestor(person5, None),
    WithAncestor(person6, Some(person5))
  )
}

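Mapping your input to these Person objects, and mapping the output WithAncestor objects to whatever you need, should be easy. Note that this code assumes that if a person has parentId X, another person with that id actually exists in the input.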
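
The mapping step and the assumption about parent ids can both be sketched in plain Scala. Everything here is illustrative: `RawPerson` is a hypothetical stand-in for the question's records (string ids, with a parent id of "0", an empty value, or a self-reference read as "no parent"), ids are assumed to be numeric strings, and `Person` is redeclared so the block stands alone. The second helper lists parent ids that do not appear in the input. That check matters because the solution uses an inner join, which drops any person whose next ancestor is missing.

```scala
// Hypothetical raw record shape, modeled on the fields used in the question.
final case class RawPerson(id: String, parentId: String)

final case class Person(id: Int, parentId: Option[Int])

object PersonConversion {
  // Assumption: ids are numeric strings; toInt throws NumberFormatException otherwise.
  def toPerson(raw: RawPerson): Person = {
    val id = raw.id.toInt
    val parent = Option(raw.parentId)          // null becomes None
      .filter(_.nonEmpty)                      // empty string means no parent
      .map(_.toInt)
      .filter(p => p != 0 && p != id)          // "0" or a self-reference means no parent
    Person(id, parent)
  }

  // Parent ids that are referenced but not present as a person in the input.
  def danglingParentIds(people: Seq[Person]): Set[Int] = {
    val known = people.map(_.id).toSet
    people.flatMap(_.parentId).filterNot(known).toSet
  }
}
```

On an RDD the conversion would be a plain `map(PersonConversion.toPerson)`. The dangling-id check is shown on a local `Seq`; for a large RDD, something like a `leftOuterJoin` against the persons keyed by id would be needed instead.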