Checking collision in filename search patterns with wildcards
我需要通过仅检查/比较表达式来比较文件系统通配符表达式以查看它们的结果是否会重叠。
例如,我们正在构建一个实用程序,该实用程序可以根据文件系统通配符表达式将文件从一个(或多个位置)分类到单独的文件夹中。例如:*.txt 进入文件夹 a,*.doc 进入文件夹 b,依此类推。我们支持的通配符是 * 和 ?
我希望能够仅通过分析通配符表达式来确定它们是否会冲突/重叠。
例如,如果我有以下表达式:
1 2 | *.x.y *.y |
它们会发生冲突(重叠),因为第二个表达式 *.y 将包含 *.x.y 结果。 (例如 A.x.y 将匹配两个表达式)
我正在通过使用所有表达式构建树结构来解决此问题,我认为如果表达式冲突,构建树的行为将失败。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | For example: *.x a.b a.c b.d might create a tree like +-*-.-x | start +--+ | +-b | | +-a-.-+-c | | +-b-.-d |
如果我尝试添加模式 b.x,树将成功地遵循 *.x 路径,从而说模式已经存在。
我是否朝着正确的方向前进?或者有没有已知的攻击算法?
要检查两个通配符模式是否可以匹配相同的文件名,您可以将此问题视为创建字符对之间的比较网格,然后检查是否存在对角路径。下图显示了如何检查通配符模式
当找到两个相同的文字字符之间的匹配时,您会沿对角线移动到下一个字符进行检查。 (用绿色箭头表示)
当遇到文字字符和单字符通配符
当遇到多字符通配符
代码示例1(迭代)
下面是一个简单的 javascript 实现,它沿对角线遍历网格,标记可以从当前单元格到达的单元格,然后检查右下角的单元格是否可以到达。运行代码片段以查看一些示例。 (更新:从上到下从左到右会很好,而不是对角线)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | function wildConflict(wild1, wild2) { var grid = [[true]], width = wild1.length, height = wild2.length; for (var x = 1; x <= width; x++) grid[x] = []; for (var y = 0; y < height; y++) { for (var x = 0; x < width; x++) { if (grid[x][y]) { var a = wild1.charAt(x), b = wild2.charAt(y); if (a == '*' || b == '*' || a == '?' || b == '?' || a == b) grid[x + 1][y + 1] = true; if (a == '*' || b == '*' || a == '?') grid[x + 1][y] = true; if (a == '*' || b == '*' || b == '?') grid[x][y + 1] = true; } } } return grid[width][height] == true; } var a = ["a","a","a*","abc","a*","*.x.y","*.x.y","a*b*","a*bc.*","a?c.de"]; var b = ["a","b","abc","a?","*b","*.y","*.x","a*c*","ab?.c??","ac.d??"]; for (var i in a) document.write(""" + a[i] +"" "" + b[i] +"" " + wildConflict(a[i], b[i]) +""); |
代码示例2(递归)
一个简单的递归实现的缺点是可能会多次检查某些字符对。它不需要二维数组,但递归显然也使用内存。
请注意,当遇到多字符通配符
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | function wildConflict(wild1, wild2) { var w1 = wild1.split(''), w2 = wild2.split(''); return conflict(0, 0); function conflict(p1, p2) { if (p1 == w1.length || p2 == w2.length) { if ((p1 == w1.length && p2 == w2.length) || (p1 == w1.length - 1 && (w1[p1] == '*' || w1[p1] == '?')) || (p2 == w2.length - 1 && (w2[p2] == '*' || w2[p2] == '?'))) { return true; } else return false; // premature end } else if (w1[p1] == '*' || w2[p2] == '*' || (w1[p1] == '?' && w2[p2] == '?')) { return conflict(p1 + 1, p2) || conflict(p1, p2 + 1); } else if (w1[p1] == '?') { return conflict(p1 + 1, p2) || conflict(p1 + 1, p2 + 1); } else if (w2[p2] == '?') { return conflict(p1, p2 + 1) || conflict(p1 + 1, p2 + 1); } else if (w1[p1] == w2[p2]) { return conflict(p1 + 1, p2 + 1); } else return false; // unequal literals } } var x = ["a","a","a*","abc","a*","*.x.y","*.x.y","a*b*","a*bc.*","a?c.de"]; var y = ["a","b","abc","a?","*b","*.y","*.x","a*c*","ab?.c??","ac.d??"]; for (var i in x) document.write(""" + x[i] +"" "" + y[i] +"" " + wildConflict(x[i], y[i]) +""); |
将每个通配符表达式转换为与之匹配的有限自动机。
计算有限自动机的交集。
使用动态规划来查看交集是否可以匹配。
如果您不认识这些概念,请参阅排除数字的算法,以便我在几年前尝试对其进行解释。 (此时用于计算匹配正则表达式集合的事物,但原理是相同的。)
我想你也许可以把这些模式转换成正则表达式,然后看看它们是否相互匹配?此处的解决方案基于 MSDN 上 Directory.GetFiles 的规则——我觉得它有些问题,但我不确定是什么。
这是一个基本的实现
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | private bool Equivalent(string patternOne, string patternTwo) { // convert both patterns to regexes based on rules for Directory.GetFiles var expressionOne = FilePatternToRegex(patternOne); var expressionTwo = FilePatternToRegex(patternTwo); // if either regex matches the opposite pattern, we've got a conflict return expressionTwo.IsMatch(patternOne) || expressionOne.IsMatch(patternTwo); } Regex FilePatternToRegex(string pattern) { // separate extension and filename var extension = Path.GetExtension(pattern); var filename = Path.GetFileNameWithoutExtension(pattern); // escape filename filename = EscapeFilePattern(filename); // 3 character extensions are a special case -- should be greedy eg xls matches xlsx // extension.Length == 4 bc its dot AND 3 characters if (extension.Length == 4 && !extension.Contains("*") && !extension.Contains("?")) { extension = extension +".*"; } else { // all other extension lengths just get escaped like normal regexes extension = EscapeFilePattern(extension); } // our final pattern should also only match at the string start/end var finalPattern ="\\\\A" + filename + extension +"\\\\z"; return new Regex(finalPattern); } string EscapeFilePattern(string pattern) { // escape star and question mark bc they are filepattern significant pattern = pattern.Replace("*","%S%").Replace("?","%Q%"); // escape all other special regex characters pattern = Regex.Escape(pattern); // turn star and question mark into their regex equivalents pattern = pattern.Replace("%S%",".+").Replace("%Q%","."); return pattern; } |
编辑:根据评论中的进一步讨论,这被打破了。使用代码示例证明它已损坏:
1 2 3 4 5 6 7 8 9 10 11 12 | var dir = new DirectoryInfo(Environment.CurrentDirectory).CreateSubdirectory(Guid.NewGuid().ToString()); var path = Path.Combine(dir.FullName,"abc"); File.WriteAllText(path,"*"); // verify both patterns match our file Assert.AreEqual(path, dir.GetFiles("a*c*")[0].FullName); Assert.AreEqual(path, dir.GetFiles("a*b*")[0].FullName); // current regex based solution thinks they are NOT equivalent // when they are Assert.IsFalse(Equivalent("a*c*","a*b*")); |