Recursive method call in Apache Spark
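I am building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) for each person in the database.
It is assumed that the first person returned when searching by ID is the correct parent.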

    val peopleById = peopleRDD.keyBy(f => f.id)

    def findUltimateParentId(personId: String): String = {
      if ((personId == null) || (personId.length() == 0))
        return "-1"

      val personSeq = peopleById.lookup(personId)
      val person = personSeq(0)
      if (person.personId == "0" || person.id == person.parentId) {
        return person.id
      } else {
        return findUltimateParentId(person.parentId)
      }
    }

    val ultimateParentIds = peopleRDD.foreach(f => f.findUltimateParentId(f.parentId))

It gives the following error:
"Caused by: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example,
rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063."
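From reading other similar questions, I understand that the problem is that I'm calling findUltimateParentId, which performs an RDD lookup, from inside the foreach loop.
However, none of the other suggested solutions worked for me, or at least I can't see how to implement them in my program. Can anyone help?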
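If I understood you correctly, here's a solution that works for input of any size (although performance might not be great): it performs N iterations over the RDD, where N is the depth of the "deepest family" (the largest distance from an ancestor to a child) in the input.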

    import org.apache.spark.rdd.RDD

    // representation of input: each person has an ID and an optional parent ID
    case class Person(id: Int, parentId: Option[Int])

    // representation of result: each person is optionally attached to its "ultimate" ancestor,
    // or None if it had no parent ID in the first place
    case class WithAncestor(person: Person, ancestor: Option[Person]) {
      def hasGrandparent: Boolean = ancestor.exists(_.parentId.isDefined)
    }

    object RecursiveParentLookup {
      // requested method
      def findUltimateParent(rdd: RDD[Person]): RDD[WithAncestor] = {

        // all persons keyed by ID; a val, so the keyed RDD is built and cached only once
        val byId = rdd.keyBy(_.id).cache()

        // recursive function that "climbs" one generation at each iteration
        def climbOneGeneration(persons: RDD[WithAncestor]): RDD[WithAncestor] = {
          val cached = persons.cache()
          // find which persons can climb further up the family tree
          val haveGrandparents = cached.filter(_.hasGrandparent)

          if (haveGrandparents.isEmpty()) {
            cached // we're done, return result
          } else {
            val done = cached.filter(!_.hasGrandparent) // these are done, we'll return them as-is
            // for those who can, join with persons to find the grandparent and attach it instead of the parent
            val withGrandparents = haveGrandparents
              .keyBy(_.ancestor.get.parentId.get) // grandparent ID
              .join(byId)
              .values
              .map({ case (withAncestor, grandparent) => WithAncestor(withAncestor.person, Some(grandparent)) })
            // call this method recursively on the result
            done ++ climbOneGeneration(withGrandparents)
          }
        }

        // call the recursive method, starting by treating each person as its own ancestor if it has a parent
        climbOneGeneration(rdd.map(p => WithAncestor(p, p.parentId.map(i => p))))
      }
    }

Here's a test to better understand how it works:

    /**
      * Example input tree:
      *
      *          1             5
      *          |             |
      *    ----- 2 -----       6
      *    |           |
      *    3           4
      *
      */

    val person1 = Person(1, None)
    val person2 = Person(2, Some(1))
    val person3 = Person(3, Some(2))
    val person4 = Person(4, Some(2))
    val person5 = Person(5, None)
    val person6 = Person(6, Some(5))

    test("find ultimate parent") {
      val input = sc.parallelize(Seq(person1, person2, person3, person4, person5, person6))
      val result = RecursiveParentLookup.findUltimateParent(input).collect()
      result should contain theSameElementsAs Seq(
        WithAncestor(person1, None),
        WithAncestor(person2, Some(person1)),
        WithAncestor(person3, Some(person1)),
        WithAncestor(person4, Some(person1)),
        WithAncestor(person5, None),
        WithAncestor(person6, Some(person5))
      )
    }

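To see why the number of iterations equals the depth of the deepest family, it may help to look at the same "climb one generation per pass" idea on plain Scala collections, with no Spark involved. This is only a rough model of the RDD version, not a replacement for it: the names ClimbSketch and ultimateAncestors are made up for illustration, and the sketch assumes the parent links contain no cycles.

```scala
// Plain-Scala model of the "climb one generation per pass" idea (no Spark needed).
final case class Person(id: Int, parentId: Option[Int])

object ClimbSketch {
  /** Returns (person id -> ultimate ancestor id, if any) and the number of passes taken. */
  def ultimateAncestors(people: Seq[Person]): (Map[Int, Option[Int]], Int) = {
    val byId: Map[Int, Person] = people.map(p => p.id -> p).toMap

    // Start like the RDD version: a person with a parent is its own provisional ancestor.
    var current: Map[Int, Option[Int]] =
      people.map(p => p.id -> p.parentId.map(_ => p.id)).toMap
    var passes = 0

    // A provisional ancestor can still climb while it has a known parent of its own.
    def canClimb(ancestor: Option[Int]): Boolean =
      ancestor.flatMap(byId.get).exists(_.parentId.isDefined)

    // Assumption: no cycles in the parent links, otherwise this loop would not terminate.
    while (current.values.exists(canClimb)) {
      current = current.map { case (id, ancestor) =>
        if (canClimb(ancestor)) id -> ancestor.flatMap(byId.get).flatMap(_.parentId)
        else id -> ancestor
      }
      passes += 1
    }
    (current, passes)
  }
}
```

On the example tree from the test, persons 3 and 4 need two passes to reach person 1, which is the largest ancestor-to-child distance in that input.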
Map your input to these case classes (Person for the input, WithAncestor for the result).
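As a rough sketch of that mapping step: the question's records appear to use String IDs, with "0" or a self-reference marking the top of a tree. Assuming a hypothetical RawPerson record with just those two fields (the real class is not shown in the question), the conversion to Person might look like this. The root markers and the decision to drop records with non-numeric IDs are assumptions.

```scala
import scala.util.Try

// Hypothetical raw record with String ids, loosely modelled on the question's code.
final case class RawPerson(id: String, parentId: String)
final case class Person(id: Int, parentId: Option[Int])

object PersonMapping {
  /** Returns None when the id is not an integer; root persons get parentId = None. */
  def toPerson(raw: RawPerson): Option[Person] =
    Try(raw.id.trim.toInt).toOption.map { id =>
      val parent = Option(raw.parentId)        // guards against a null parent id
        .map(_.trim)
        .filter(_.nonEmpty)
        .flatMap(s => Try(s.toInt).toOption)
        .filter(p => p != 0 && p != id)        // assumed root markers: "0" or a self-reference
      Person(id, parent)
    }
}
```

In Spark this could presumably be applied with something like rawRdd.flatMap(r => PersonMapping.toPerson(r)) before calling findUltimateParent, where rawRdd is a hypothetical RDD[RawPerson].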
I fixed this by using SparkContext.broadcast:

    val peopleById = peopleRDD.keyBy(f => f.id)
    val broadcastedPeople = sc.broadcast(peopleById.collectAsMap())

    def findUltimateParentId(personId: String): String = {
      if ((personId == null) || (personId.length() == 0))
        return "-1"

      // look the person up in the broadcast map instead of in an RDD
      val personOption = broadcastedPeople.value.get(personId)
      if (personOption.isEmpty) {
        return "0"
      }
      val person = personOption.get
      if (person.personId == 0 || person.orgId == person.personId) {
        return person.id
      } else {
        return findUltimateParentId(person.parentId)
      }
    }

    // map keeps the computed IDs; foreach would discard them
    val ultimateParentIds = peopleRDD.map(f => findUltimateParentId(f.parentId))

Works great now!
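The broadcast approach works because the lookup runs against an ordinary in-memory map rather than an RDD, so nothing Spark-specific is invoked inside the closure. The trade-off is that collectAsMap pulls the whole dataset to the driver and ships a copy to every executor, so it only suits data that fits in memory. Below is a minimal sketch of the same lookup on a local map, written tail-recursively. PersonRecord, the root markers (null, empty, "0" or a self-reference) and the cycle guard are assumptions added for illustration; they are not taken from the original code.

```scala
import scala.annotation.tailrec

// Hypothetical minimal record; the real class in the question has more fields.
final case class PersonRecord(id: String, parentId: String)

object BroadcastLookupSketch {
  // Assumed root markers: no parent id, "0", or a self-reference.
  private def isRoot(p: PersonRecord): Boolean =
    p.parentId == null || p.parentId.isEmpty || p.parentId == "0" || p.parentId == p.id

  /** Walks parent links in a local map; in Spark the map would be `broadcastedPeople.value`. */
  def findUltimateParentId(people: collection.Map[String, PersonRecord], startId: String): String = {
    @tailrec
    def climb(id: String, seen: Set[String]): String =
      if (id == null || id.isEmpty) "-1"                    // same sentinel as the original code
      else people.get(id) match {
        case None                                 => "0"    // unknown id, as in the original code
        case Some(p) if isRoot(p)                 => p.id
        case Some(p) if seen.contains(p.parentId) => p.id   // cycle guard (an addition)
        case Some(p)                              => climb(p.parentId, seen + id)
      }
    climb(startId, Set.empty)
  }
}
```

The parameter type collection.Map matches what collectAsMap returns, so the broadcast value could be passed in directly.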