\d is less efficient than [0-9]
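Yesterday I commented on an answer in which someone had used [0123456789] in a regular expression rather than [0-9] or \d.
Today I decided to test this and found, to my surprise, that (at least in the C# regex engine) \d appears to be less efficient than either of the other two, which barely differ from each other. Here is the test output: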
    Regular expression \d           took 00:00:00.2141226 result: 5077/10000
    Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
    Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first
I'm surprised for two reasons:
The test code is as follows:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    namespace SO_RegexPerformance
    {
        class Program
        {
            static void Main(string[] args)
            {
                var rand = new Random(1234);
                var strings = new List<string>();
                //10K random strings
                for (var i = 0; i < 10000; i++)
                {
                    //Generate random string
                    var sb = new StringBuilder();
                    for (var c = 0; c < 1000; c++)
                    {
                        //Add a-z randomly
                        sb.Append((char)('a' + rand.Next(26)));
                    }
                    //In roughly 50% of them, put a digit
                    if (rand.Next(2) == 0)
                    {
                        //Replace one character with a digit, 0-9
                        sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                    }
                    strings.Add(sb.ToString());
                }

                var baseTime = testPerfomance(strings, @"\d");
                Console.WriteLine();
                var testTime = testPerfomance(strings, "[0-9]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
                testTime = testPerfomance(strings, "[0123456789]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            }

            private static TimeSpan testPerfomance(List<string> strings, string regex)
            {
                var sw = new Stopwatch();

                int successes = 0;

                var rex = new Regex(regex);

                sw.Start();
                foreach (var str in strings)
                {
                    if (rex.Match(str).Success)
                    {
                        successes++;
                    }
                }
                sw.Stop();

                Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

                return sw.Elapsed;
            }
        }
    }
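\d checks all Unicode digits, while [0-9] is limited to these 10 characters.
You can generate a list of all such characters using the following code: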
    var sb = new StringBuilder();
    for (UInt16 i = 0; i < UInt16.MaxValue; i++)
    {
        string str = Convert.ToChar(i).ToString();
        if (Regex.IsMatch(str, @"\d"))
            sb.Append(str);
    }
    Console.WriteLine(sb.ToString());
Which generates:
0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹…０１２３４５６７８９
Credit to ByteBlast for noticing this in the docs. Changing only the Regex constructor gives these new timings:
    Regex \d           took 00:00:00.1355787 result: 5077/10000
    Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
    Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
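Timings like these are what one would expect if the constructor is given RegexOptions.ECMAScript, which is documented to make \d equivalent to [0-9]. The following is a minimal sketch under that assumption; the class and method names are made up for illustration and are not part of the benchmark.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDigits
{
    // Returns true if the pattern matches anywhere in the input under the given options.
    public static bool Matches(string input, string pattern, RegexOptions options)
    {
        return new Regex(pattern, options).IsMatch(input);
    }

    public static void Main()
    {
        const string arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

        // Default options: \d is Unicode-aware and matches the non-ASCII digit.
        Console.WriteLine(Matches(arabicIndicThree, @"\d", RegexOptions.None));       // True

        // ECMAScript mode (assumed to be the option in question): \d behaves like [0-9].
        Console.WriteLine(Matches(arabicIndicThree, @"\d", RegexOptions.ECMAScript)); // False

        // ASCII digits match in both modes.
        Console.WriteLine(Matches("7", @"\d", RegexOptions.ECMAScript));              // True
    }
}
```

In the benchmark, the equivalent change would be passing the option as a second argument where rex is constructed.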
From "Does \d in regex mean a digit?":
[0-9] isn't equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩
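The difference is easy to see by counting matches in a string that mixes digit scripts. This is only an illustrative sketch: the sample string and the CountMatches helper are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassComparison
{
    // Counts how many times the pattern matches in the input (default Regex options).
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // One ASCII digit, one Eastern Arabic digit (U+0662), one fullwidth digit (U+FF13).
        const string input = "a1b\u0662c\uFF13";

        Console.WriteLine(CountMatches(input, @"\d"));   // 3
        Console.WriteLine(CountMatches(input, "[0-9]")); // 1
    }
}
```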
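As an addition to the top answer from Sina Iravianian, here is a .NET 4.5 version of his code (only that version supports UTF-16 output, cf. the first three lines) that uses the full range of Unicode code points. Because support for the higher Unicode planes is often lacking, many people are not aware that they should always check for and include the upper planes. Nevertheless, those planes do sometimes contain important characters.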
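To illustrate why the upper planes matter here: .NET regular expressions work on UTF-16 code units, so a digit outside the BMP reaches the engine as a surrogate pair and \d does not match it, while a category lookup on the whole code point does. This is a minimal sketch; the class and method names are illustrative, and U+1D7CE is just one example of such a digit.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UpperPlaneDigits
{
    // Looks up the Unicode category of the full code point (surrogate pairs included).
    public static bool IsDecimalDigitCodePoint(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return CharUnicodeInfo.GetUnicodeCategory(s, 0) == UnicodeCategory.DecimalDigitNumber;
    }

    // Asks the regex engine whether \d matches anywhere in the code point's UTF-16 form.
    public static bool RegexDigitMatches(int codePoint)
    {
        return Regex.IsMatch(char.ConvertFromUtf32(codePoint), @"\d");
    }

    public static void Main()
    {
        const int mathBoldZero = 0x1D7CE; // MATHEMATICAL BOLD DIGIT ZERO, outside the BMP

        Console.WriteLine(IsDecimalDigitCodePoint(mathBoldZero)); // True
        Console.WriteLine(RegexDigitMatches(mathBoldZero));       // False
    }
}
```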
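Update
Since \d in a .NET regex does not match characters outside the BMP, here is a modified version that groups characters by Unicode category instead: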
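    // Requires: using System; using System.Collections.Generic;
    //           using System.Globalization; using System.Text;
    public static void Main()
    {
        // Switch the console to UTF-16 so that non-ASCII characters can be printed.
        var unicodeEncoding = new UnicodeEncoding(!BitConverter.IsLittleEndian, false);
        Console.InputEncoding = unicodeEncoding;
        Console.OutputEncoding = unicodeEncoding;

        var categories = new[]
        {
            UnicodeCategory.DecimalDigitNumber,
            UnicodeCategory.LetterNumber,
            UnicodeCategory.OtherNumber
        };

        // Characters found so far, grouped by Unicode category.
        var charInfo = new Dictionary<UnicodeCategory, StringBuilder>();
        foreach (var category in categories)
        {
            charInfo[category] = new StringBuilder();
        }

        for (var codePoint = 0; codePoint <= 0x10ffff; codePoint++)
        {
            // Surrogate code points are not characters on their own; skip them.
            var isSurrogateCodePoint = codePoint <= UInt16.MaxValue
                && (char.IsLowSurrogate((char) codePoint)
                    || char.IsHighSurrogate((char) codePoint));

            if (isSurrogateCodePoint)
                continue;

            var codePointString = char.ConvertFromUtf32(codePoint);

            // Keep the character only if it falls into one of the number categories.
            var codePointCategory = CharUnicodeInfo.GetUnicodeCategory(codePointString, 0);
            if (charInfo.ContainsKey(codePointCategory))
                charInfo[codePointCategory].Append(codePointString);
        }

        // Print each category name followed by its characters.
        var sb = new StringBuilder();
        foreach (var category in categories)
        {
            sb.AppendLine($"{category}");
            sb.AppendLine(charInfo[category].ToString());
        }
        Console.WriteLine(sb.ToString());

        Console.ReadKey();
    }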
Yielding the following output:
DecimalDigitNumber
LetterNumber
???ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫ????ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ?????????????〇〡〢〣〤〥〦〧〨〩???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
OtherNumber
231??????????????????????????????????????????????????????????????????????????????????????????????????????????①②③④⑤⑥⑦⑧⑨⑩??????????⑴⑵⑶⑷⑸⑹⑺⑻⑼⑽⑾⑿⒀⒁⒂⒃⒄⒅⒆⒇⒈⒉⒊⒋⒌⒍⒎⒏⒐⒑⒒⒓⒔⒕⒖⒗⒘⒙⒚⒛?????????????????????????????????????????????????????一二三四㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩???????????????????????一二三四五六七八九十???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
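\d checks all Unicode digits, while [0-9] is limited to these 10 characters. If you only want those 10 digits, you should use [0-9]. Otherwise I recommend \d, because it is less to write.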
For example, if I wanted a regular expression to find IP addresses, I would
Generally speaking, in my use of regular expressions, functionality is more important than speed.