How do you match only valid roman numerals with a regular expression?
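Thinking about another question of mine, I concluded that I can't even write a regular expression that matches Roman numerals (let alone a context-free grammar that generates them).
The problem is matching only valid Roman numerals. For example, 990 is not "XM", it's "CMXC".
My difficulty in writing a regex for this is that, in order to allow or disallow certain characters, I need to look back. Take the thousands and hundreds, for example.
I can allow M{0,2}C?M (to cover 900, 1000, 1900, 2000, 2900 and 3000). However, if the match is on CM, I can't allow the following character to be C or D (because I'm already at 900).
How can I express this in a regex? If it can't be expressed in a regex, can it be expressed in a context-free grammar?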
Try:

    ^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

Breaking it down:
M{0,4}
This specifies the thousands section and basically restricts it to between 0 and 4000:

       0: (empty) matched by M{0}
    1000: M       matched by M{1}
    2000: MM      matched by M{2}
    3000: MMM     matched by M{3}
    4000: MMMM    matched by M{4}
(CM|CD|D?C{0,3})
Slightly more complex, this is for the hundreds section and covers all the possibilities:

      0: (empty) matched by D?C{0} (with D not there)
    100: C       matched by D?C{1} (with D not there)
    200: CC      matched by D?C{2} (with D not there)
    300: CCC     matched by D?C{3} (with D not there)
    400: CD      matched by CD
    500: D       matched by D?C{0} (with D there)
    600: DC      matched by D?C{1} (with D there)
    700: DCC     matched by D?C{2} (with D there)
    800: DCCC    matched by D?C{3} (with D there)
    900: CM      matched by CM
(XC|XL|L?X{0,3})
Same rules as the previous section, but for the tens place:

     0: (empty) matched by L?X{0} (with L not there)
    10: X       matched by L?X{1} (with L not there)
    20: XX      matched by L?X{2} (with L not there)
    30: XXX     matched by L?X{3} (with L not there)
    40: XL      matched by XL
    50: L       matched by L?X{0} (with L there)
    60: LX      matched by L?X{1} (with L there)
    70: LXX     matched by L?X{2} (with L there)
    80: LXXX    matched by L?X{3} (with L there)
    90: XC      matched by XC
(IX|IV|V?I{0,3})
This is the units section, handling 0 through 9:

    0: (empty) matched by V?I{0} (with V not there)
    1: I       matched by V?I{1} (with V not there)
    2: II      matched by V?I{2} (with V not there)
    3: III     matched by V?I{3} (with V not there)
    4: IV      matched by IV
    5: V       matched by V?I{0} (with V there)
    6: VI      matched by V?I{1} (with V there)
    7: VII     matched by V?I{2} (with V there)
    8: VIII    matched by V?I{3} (with V there)
    9: IX      matched by IX
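One way to sanity-check the breakdown above is to generate the canonical numeral for every value the pattern is meant to cover and confirm that each one matches. The following is a minimal Python sketch; the helper `to_roman` and the digit tables are illustrative names, not part of the answer.

```python
import re

# Pattern from the answer above (thousands capped at MMMM, so up to 4999).
ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

# Hypothetical helper tables: the canonical spelling of each decimal digit.
ONES = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]


def to_roman(n: int) -> str:
    """Build the canonical numeral for 0 <= n <= 4999 (0 gives '')."""
    return ("M" * (n // 1000) + HUNDREDS[n // 100 % 10]
            + TENS[n // 10 % 10] + ONES[n % 10])


# Every canonical numeral from 1 to 4999 should match.
assert all(ROMAN.match(to_roman(n)) for n in range(1, 5000))

# Malformed numerals should be rejected.
for bad in ("XM", "IIII", "CMC", "CMD", "VX", "MMMMM"):
    assert ROMAN.match(bad) is None

# Caveat: the empty string matches too, because every group is optional.
assert ROMAN.match("") is not None
```

The last assertion shows that the pattern accepts the empty string, so callers may want to reject empty input separately.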
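Actually, your premise is flawed. 990 IS "XM", as well as "CMXC".
The Romans were far less concerned about the "rules" than your third-grade teacher. As long as it added up, it was OK. Hence "IIII" was just as good as "IV" for 4, and "IIM" was completely cool for 998.
(If you have trouble dealing with that, remember that English spelling was not formalized until the 17th century. Until then, as long as the reader could figure it out, it was good enough.)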
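To illustrate the "as long as it adds up" reading, here is a small Python sketch of one possible interpretation: scan from right to left and subtract any symbol that is smaller than the largest symbol already seen. The function name `loose_value` and this particular rule are assumptions made for the example, not a historical standard.

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def loose_value(numeral: str) -> int:
    """Sum a numeral, subtracting any symbol smaller than one to its right."""
    total = 0
    largest = 0  # largest symbol value seen so far, scanning right to left
    for ch in reversed(numeral.upper()):
        value = VALUES[ch]
        if value < largest:
            total -= value      # a smaller symbol before a larger one subtracts
        else:
            total += value
            largest = value
    return total


print(loose_value("CMXC"), loose_value("XM"))  # 990 990
print(loose_value("IIII"), loose_value("IV"))  # 4 4
print(loose_value("IIM"))                      # 998
```

Under this reading the strict and the relaxed spellings evaluate to the same number, which is the point the answer makes.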
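To avoid matching the empty string, you need to repeat the pattern four times, each time changing one group's lower bound from 0 to 1 (and accounting for V, L and D):

    (M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))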
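In this case (because this pattern uses ...
In my own particular (real-world) case, I needed to match numerals at word endings and found no other way to do it. I needed to scrub the footnote numbers from my plain-text document, where text such as "the Red Seacl and the Great Barrier Reefcli" had been converted to ...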
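Just to save it here:

    (^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$)

Matches all Roman numerals. It does not accept the empty string (at least one Roman numeral letter is required). It should work in PCRE, Perl, Python and Ruby.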
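As a quick check of how the lookahead behaves, here is a short Python sketch that runs the pattern over a few sample strings. The sample list and the name `STRICT` are just illustrative choices.

```python
import re

# Same pattern as above, without the outer capturing parentheses.
STRICT = re.compile(
    r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"
)

# Print whether each sample is accepted by the pattern.
for sample in ("", "I", "MMXIV", "CMXC", "XM", "IIII", "MMMMMM"):
    print(repr(sample), bool(STRICT.match(sample)))
```

The empty string, "XM" and "IIII" are rejected, while "MMMMMM" is accepted because `M*` places no upper limit on the thousands.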
Online Ruby demo: http://rubular.com/r/klpr1zq3hj
Online conversion: http://www.onlineconversion.com/roman_numerals_advanced.htm
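Fortunately, the range of numbers is limited to 1..3999 or thereabouts. Therefore, you can build up the regex piecemeal.

    <opt-thousands-part><opt-hundreds-part><opt-tens-part><opt-units-part>

Each of those parts deals with the vagaries of Roman notation. For example, using Perl notation:

    <opt-hundreds-part> = m/(CM|DC{0,3}|CD|C{1,3})?/;

Repeat and assemble.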
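Added: the hundreds part can be compressed further:

    <opt-hundreds-part> = m/(C[MD]|D?C{0,3})/;

Since the D?C{0,3} clause can match nothing, there is no need for the question mark. Most likely, the parentheses should also be the non-capturing type. In Perl:

    <opt-hundreds-part> = m/(?:C[MD]|D?C{0,3})/;

Of course, it should all be case-insensitive, too.
You can also extend this to deal with the options mentioned by James Curran (allowing XM or IM for 990 or 999, CCCC for 400, and so on).

    <opt-hundreds-part> = m/(?:[IXC][MD]|D?C{0,4})/;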
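The "repeat and assemble" step could look roughly like this in Python. This is a sketch for the strict form only. The tens and units parts are written by analogy with the hundreds part, the thousands part assumes the 1..3999 range, and the names `THOUSANDS`, `HUNDREDS`, `TENS`, `UNITS` and `is_roman` are hypothetical.

```python
import re

# Each part may match the empty string, so no trailing "?" is needed.
THOUSANDS = r"M{0,3}"                   # assumes the 1..3999 range
HUNDREDS = r"(?:C[MD]|D?C{0,3})"        # compressed form from the answer
TENS = r"(?:X[CL]|L?X{0,3})"            # assumed, by analogy with HUNDREDS
UNITS = r"(?:I[XV]|V?I{0,3})"           # assumed, by analogy with HUNDREDS

# Assemble the parts; IGNORECASE covers the case-insensitivity remark.
ROMAN = re.compile(THOUSANDS + HUNDREDS + TENS + UNITS, re.IGNORECASE)


def is_roman(text: str) -> bool:
    """True if the whole of text is a non-empty strict Roman numeral."""
    # All four parts are optional, so the empty string is rejected explicitly.
    return bool(text) and ROMAN.fullmatch(text) is not None


print(is_roman("mcmxciv"))  # True
print(is_roman("XM"))       # False
print(is_roman(""))         # False
```

Using `fullmatch` instead of `^...$` keeps the parts free of anchors, so they can be reused inside larger patterns.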
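    import re

    # Thousands are capped at MMM here, so the pattern covers 1..3999.
    pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

    # 'XCCMCI' is not a valid numeral, so this prints 'Not valid Roman'.
    if re.search(pattern, 'XCCMCI'):
        print('Valid Roman')
    else:
        print('Not valid Roman')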
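For those who really want to understand the logic, take a look at the step-by-step explanation spread over 3 pages of Dive Into Python.
Compared with the original solution (which had M{0,4}), this pattern uses M{0,3}.
As Jeremy and Pax pointed out above, '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$' should be the solution you're after.
The specific URL that should have been attached (IMHO) is http://thehazeltree.org/diveintopython/7.html
Example 7.8 is the short form using {n,m}.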
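For readers unfamiliar with the `{n,m}` quantifier, the sketch below compares it with the longer spelling that repeats an optional item. The two patterns are illustrative and are not quoted from the linked chapter.

```python
import re
from itertools import product

long_form = re.compile(r"^M?M?M?$")    # three optional M characters
short_form = re.compile(r"^M{0,3}$")   # the same thing written with {n,m}

# Compare the two patterns on every string of up to 5 letters drawn from "MC".
for length in range(6):
    for letters in product("MC", repeat=length):
        candidate = "".join(letters)
        assert bool(long_form.match(candidate)) == bool(short_form.match(candidate))
```

Both forms accept zero to three M characters and nothing else, so the short form is purely a readability gain.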
In my case, I was trying to find and replace all occurrences of Roman numerals in a text with a single word, so I couldn't use the start-of-line and end-of-line anchors. As a result, @paxdiablo's solution found many zero-length matches. I ended up with the following expression:

    (?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})

My final Python code looked like this:

    import re

    text = "RULES OF LIFE: I. STAY CURIOUS; II. NEVER STOP LEARNING"
    text = re.sub(r'(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', 'ROMAN', text)
    print(text)

Output:

    RULES OF LIFE: ROMAN. STAY CURIOUS; ROMAN. NEVER STOP LEARNING
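To make the zero-length-match problem concrete, this Python sketch compares the unanchored pattern with the lookahead-guarded one on the same text. The variable names are illustrative.

```python
import re

text = "RULES OF LIFE: I. STAY CURIOUS; II. NEVER STOP LEARNING"

# The numeral body without anchors: every group is optional.
BODY = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Without anchors the body matches the empty string at most positions.
plain = re.findall(BODY, text)
print(plain.count(""))  # number of zero-length matches

# The lookahead only lets a match start at a word made of numeral letters.
guarded = re.findall(r"(?=\b[MCDXLVI]{1,6}\b)" + BODY, text)
print(guarded)  # ['I', 'II']
```

One caveat: the lookahead checks the whole word, but the body can still stop early, so a word such as "DIM" is matched only as "DI". Appending a guard like `(?!\w)` to the pattern is one possible way to avoid that, at the cost of a slightly longer expression.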
Steven Levithan uses this regex in his post; it validates Roman numerals before "deromanizing" the value:

    /^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/
The problem with Jeremy's and Pax's solution is that it also matches "nothing".
The following regex expects at least one Roman numeral:

    ^(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|[IDCXMLV])$
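It is worth testing that claim before relying on it. In the Python sketch below, the first alternative of the pattern can still match nothing, so the empty string appears to be accepted. A leading lookahead is shown as one possible alternative. The names `ALTERNATION` and `NON_EMPTY` are illustrative, and the lookahead variant is a suggestion rather than part of the answer.

```python
import re

# The pattern from the answer above.
ALTERNATION = re.compile(
    r"^(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|[IDCXMLV])$"
)
# The first alternative can still match nothing, so "" is accepted.
print(bool(ALTERNATION.match("")))    # True

# One alternative: a lookahead that demands at least one character.
NON_EMPTY = re.compile(
    r"^(?=.)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
print(bool(NON_EMPTY.match("")))      # False
print(bool(NON_EMPTY.match("XIV")))   # True
```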
I would write functions to do the work for me. Here are two Roman numeral functions in PowerShell.

    function ConvertFrom-RomanNumeral
    {
      <#
      .SYNOPSIS
          Converts a Roman numeral to a number.
      .DESCRIPTION
          Converts a Roman numeral - in the range of I..MMMCMXCIX - to a number.
      .EXAMPLE
          ConvertFrom-RomanNumeral -Numeral MMXIV
      .EXAMPLE
          "MMXIV" | ConvertFrom-RomanNumeral
      #>
        [CmdletBinding()]
        [OutputType([int])]
        Param
        (
            [Parameter(Mandatory=$true,
                       HelpMessage="Enter a roman numeral in the range I..MMMCMXCIX",
                       ValueFromPipeline=$true,
                       Position=0)]
            [ValidatePattern("^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")]
            [string]
            $Numeral
        )

        Begin
        {
            # Ordered so that each two-letter pair is tried before its first letter alone.
            $RomanToDecimal = [ordered]@{
                M  = 1000
                CM =  900
                D  =  500
                CD =  400
                C  =  100
                XC =   90
                L  =   50
                XL =   40   # added: without this entry XL would be read as X + L = 60
                X  =   10
                IX =    9
                V  =    5
                IV =    4
                I  =    1
            }
        }
        Process
        {
            $roman = $Numeral + ""
            $value = 0

            do
            {
                foreach ($key in $RomanToDecimal.Keys)
                {
                    if ($key.Length -eq 1)
                    {
                        if ($key -match $roman.Substring(0,1))
                        {
                            $value += $RomanToDecimal.$key
                            $roman  = $roman.Substring(1)
                            break
                        }
                    }
                    else
                    {
                        # Length guard added so Substring(0,2) is never called on a single character.
                        if ($roman.Length -ge 2 -and $key -match $roman.Substring(0,2))
                        {
                            $value += $RomanToDecimal.$key
                            $roman  = $roman.Substring(2)
                            break
                        }
                    }
                }
            }
            until ($roman -eq "")

            $value
        }
        End
        {
        }
    }

    function ConvertTo-RomanNumeral
    {
      <#
      .SYNOPSIS
          Converts a number to a Roman numeral.
      .DESCRIPTION
          Converts a number - in the range of 1 to 3,999 - to a Roman numeral.
      .EXAMPLE
          ConvertTo-RomanNumeral -Number (Get-Date).Year
      .EXAMPLE
          (Get-Date).Year | ConvertTo-RomanNumeral
      #>
        [CmdletBinding()]
        [OutputType([string])]
        Param
        (
            [Parameter(Mandatory=$true,
                       HelpMessage="Enter an integer in the range 1 to 3,999",
                       ValueFromPipeline=$true,
                       Position=0)]
            [ValidateRange(1,3999)]
            [int]
            $Number
        )

        Begin
        {
            # One lookup table per decimal column, indexed by the digit in that column.
            $DecimalToRoman = @{
                Ones      = "","I","II","III","IV","V","VI","VII","VIII","IX";
                Tens      = "","X","XX","XXX","XL","L","LX","LXX","LXXX","XC";
                Hundreds  = "","C","CC","CCC","CD","D","DC","DCC","DCCC","CM";
                Thousands = "","M","MM","MMM"
            }

            $column = @{Thousands = 0; Hundreds = 1; Tens = 2; Ones = 3}
        }
        Process
        {
            # Pad to four digits so every column is present, then split into digits.
            [int[]]$digits = $Number.ToString().PadLeft(4,"0").ToCharArray() |
                                ForEach-Object { [Char]::GetNumericValue($_) }

            $RomanNumeral  = ""
            $RomanNumeral += $DecimalToRoman.Thousands[$digits[$column.Thousands]]
            $RomanNumeral += $DecimalToRoman.Hundreds[$digits[$column.Hundreds]]
            $RomanNumeral += $DecimalToRoman.Tens[$digits[$column.Tens]]
            $RomanNumeral += $DecimalToRoman.Ones[$digits[$column.Ones]]

            $RomanNumeral
        }
        End
        {
        }
    }
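For comparison, here is a rough Python sketch of the same validate-then-parse idea used by `ConvertFrom-RomanNumeral`: check the input against the strict pattern first, then consume tokens greedily from an ordered table. The names `VALID`, `TOKENS` and `from_roman` are hypothetical, and this is not a line-by-line port of the PowerShell code.

```python
import re

# Same range as the PowerShell ValidatePattern: I..MMMCMXCIX.
VALID = re.compile(r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})")

# Ordered so that each two-letter pair is tried before its first letter alone.
TOKENS = [
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400), ("C", 100),
    ("XC", 90), ("L", 50), ("XL", 40), ("X", 10),
    ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
]


def from_roman(numeral: str) -> int:
    """Convert a strict Roman numeral (case-insensitive) to an integer."""
    numeral = numeral.upper()
    if not numeral or VALID.fullmatch(numeral) is None:
        raise ValueError(f"not a Roman numeral in I..MMMCMXCIX: {numeral!r}")
    value = 0
    while numeral:
        # Validation guarantees that one of the tokens matches at each step.
        for token, amount in TOKENS:
            if numeral.startswith(token):
                value += amount
                numeral = numeral[len(token):]
                break
    return value


print(from_roman("MCMXCIV"))  # 1994
print(from_roman("xl"))       # 40
```

Validating up front means the parsing loop never sees a character it cannot consume, which keeps the loop itself simple.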