How to match a tree against a large set of patterns?
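I have a potentially infinite set of symbols:
Consider non-empty finite trees in which every node has a symbol attached to it and 0 or more non-empty subtrees. The order of a node's subtrees is significant (for example, if a node has two subtrees, we can tell which one is the left and which one is the right). Any given symbol can appear in a tree 0 or more times, attached to different nodes. The placeholder symbol
The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth, and the number of nodes in every subtree are all finite.
Trees are given in functional notation: a node is represented by the symbol attached to it, followed, if it has any subtrees, by parentheses containing a comma-separated list of those subtrees, each represented recursively in the same way. For example, the tree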
         A
        / \
       ?   B
          / \
         A   C
        /|\
       A C Q
          \
           ?
is represented as A(?,B(A(A,C(?),Q),C)).
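To make the notation concrete, here is a minimal sketch of a parser for it. The names `Node`, `parse` and `show` are made up for this illustration, and it assumes that symbols never contain parentheses or commas and that whitespace is insignificant.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A tree node: an attached symbol plus an ordered tuple of subtrees."""
    symbol: str
    children: Tuple["Node", ...] = ()


def parse(text: str) -> Node:
    """Parse functional notation such as 'A(?,B(C,D))' into a Node."""
    text = "".join(text.split())  # assumption: whitespace carries no meaning
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        symbol = text[start:pos]
        if not symbol:
            raise ValueError(f"expected a symbol at offset {start}")
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ')'
        return Node(symbol, tuple(children))

    root = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected trailing input at offset {pos}")
    return root


def show(node: Node) -> str:
    """Render a Node back into functional notation."""
    if not node.children:
        return node.symbol
    return node.symbol + "(" + ",".join(show(c) for c in node.children) + ")"
```

For example, `parse("A(?,B(C,D))")` yields a root `A` with two subtrees, and `show` turns it back into the same string. An empty argument list such as `A()` is rejected, in line with the requirement that subtrees are non-empty.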
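I have a precomputed, unchanging set of trees S that will be used as patterns to match against. The set will typically contain ~10^5 trees, and each of its elements will typically have ~10-30 nodes. I can spend plenty of time up front building whatever representation of S best suits the problem stated below.
I need to write a function that accepts a tree T (typically with ~10^2 nodes) and checks as quickly as possible whether T contains any element of S as a subtree, provided that a node with the placeholder symbol
Please suggest a data structure for storing the set S and an algorithm for checking for a match. Any programming language or pseudocode is fine.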
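For reference, a brute-force baseline might look like the sketch below. It rests on assumptions that go beyond what is stated above: the placeholder is written `?`, a `?` leaf in a pattern matches any non-empty subtree of T, any other pattern node must agree with the tree node in both symbol and number of subtrees, and a `?` inside T gets no special treatment. The helper names (`t`, `matches_at`, `subtrees`, `contains_any`) are made up for illustration.

```python
WILDCARD = "?"  # assumption: the placeholder symbol is written '?'


def t(symbol, *children):
    """Build a tree as a (symbol, children) tuple."""
    return (symbol, children)


def matches_at(pattern, node):
    """Check whether `pattern` matches the subtree rooted at `node`."""
    p_symbol, p_children = pattern
    n_symbol, n_children = node
    if p_symbol == WILDCARD and not p_children:
        return True  # assumed semantics: placeholder matches any subtree
    if p_symbol != n_symbol or len(p_children) != len(n_children):
        return False
    return all(matches_at(p, n) for p, n in zip(p_children, n_children))


def subtrees(node):
    """Yield every subtree of `node`, including `node` itself."""
    yield node
    for child in node[1]:
        yield from subtrees(child)


def contains_any(tree, patterns):
    """Try every pattern at every node of the tree."""
    return any(matches_at(p, n) for n in subtrees(tree) for p in patterns)
```

This tries every pattern at every node, which is on the order of |S| x |T| match attempts per query (about 10^7 with the sizes given above). That is the cost a better representation of S should avoid, and the baseline is mainly useful for cross-checking a faster implementation.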
This paper describes a variant of the Aho-Corasick algorithm that uses a pushdown automaton for subtree matching, in place of the finite state machine that the standard Aho-Corasick algorithm uses for string matching. Like the Aho-Corasick string-matching algorithm, the variant needs only one pass through the input tree to match against the entire dictionary S.
The paper is quite complex, so it may be worth contacting the author to see whether he has any source code available.
What you need is a finite state machine that tracks the set of potential matches you might have.
In essence, such a machine is the result of matching the patterns against each other and determining which parts of the individual matches they share. This is analogous to how lexer generators take the set of regular expressions for the tokens and compose them into one large FSA that can match any of the regular expressions by processing characters one at a time.
You can find references to methods for doing this under term rewriting systems.
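As a rough illustration of the idea, here is a sketch of a bottom-up matcher whose states are sets of pattern subterms. It is not the fully precomputed automaton from the term rewriting literature: states are built lazily and memoized, so repeated shapes in the input are resolved by a table lookup. The names (`PatternSet`, `t`, `contains_match`) are made up, and the semantics are assumptions: the placeholder is written `?`, a `?` leaf in a pattern matches any subtree, other nodes must agree in symbol and number of subtrees, and a `?` inside T is treated as an ordinary leaf.

```python
WILDCARD = "?"  # assumption: the placeholder symbol is written '?'


def t(symbol, *children):
    """Build a tree as a hashable (symbol, children) tuple."""
    return (symbol, children)


class PatternSet:
    def __init__(self, patterns):
        self.ids = {}        # pattern subterm -> integer id (shared subterms get one id)
        self.by_head = {}    # (symbol, arity) -> [(subterm id, child subterm ids)]
        self.roots = set()   # ids of the complete patterns
        self.wild_id = None  # id of the bare placeholder, if any pattern uses it
        self.states = {}     # memo: (symbol, child states) -> frozenset of matching ids
        for pattern in patterns:
            self.roots.add(self._intern(pattern))

    def _intern(self, term):
        if term in self.ids:
            return self.ids[term]
        symbol, children = term
        child_ids = tuple(self._intern(c) for c in children)
        new_id = self.ids[term] = len(self.ids)
        if symbol == WILDCARD and not children:
            self.wild_id = new_id
        else:
            key = (symbol, len(children))
            self.by_head.setdefault(key, []).append((new_id, child_ids))
        return new_id

    def _state(self, symbol, child_states):
        """Return the set of pattern subterms matching a node with this shape."""
        key = (symbol, child_states)
        state = self.states.get(key)
        if state is None:
            matched = set() if self.wild_id is None else {self.wild_id}
            candidates = self.by_head.get((symbol, len(child_states)), ())
            for term_id, child_ids in candidates:
                if all(c in s for c, s in zip(child_ids, child_states)):
                    matched.add(term_id)
            state = self.states[key] = frozenset(matched)
        return state

    def contains_match(self, tree):
        """Check whether any complete pattern matches at some node of `tree`."""
        found = False

        def visit(node):
            nonlocal found
            symbol, children = node
            state = self._state(symbol, tuple(visit(c) for c in children))
            if not self.roots.isdisjoint(state):
                found = True
            return state

        visit(tree)
        return found
```

A query would then look like `PatternSet([t("B", t("?"), t("C"))]).contains_match(t("A", t("B", t("X"), t("C"))))`. Each node of T is visited once, and the per-node work depends on how many pattern subterms share that node's symbol and arity rather than on the size of S as a whole. A production version would precompute the state table from S ahead of time instead of filling it in on demand.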