Is there a regular expression to detect a valid regular expression?
是否可以用另一个正则表达式检测有效的正则表达式?如果是,请给出下面的示例代码。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | / ^ # start of string ( # first group start (?: (?:[^?+*{}()[\]\\|]+ # literals and ^, $ | \\. # escaped characters | \[ (?: \^?\\. | \^[^\\] | [^\\^] ) # character classes (?: [^\]\\]+ | \\. )* \] | \( (?:\?[:=!]|\?<[=!]|\?>)? (?1)?? \) # parenthesis, with recursive content | \(\? (?:R|[+-]?\d+) \) # recursive matching ) (?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )? # quantifiers | \| # alternative )* # repeat content ) # end first group $ # end of string / |
这是一个递归regex,许多regex引擎不支持它。基于PCRE的应该支持它。
没有空格和注释:
1 | /^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/ |
.NET不直接支持递归。(
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | ^ # start of string (?: (?: [^?+*{}()[\]\\|]+ # literals and ^, $ | \\. # escaped characters | \[ (?: \^?\\. | \^[^\\] | [^\\^] ) # character classes (?: [^\]\\]+ | \\. )* \] | \( (?:\?[:=!] | \?<[=!] | \?> | \?<[^\W\d]\w*> | \?'[^\W\d]\w*' )? # opening of group (?<N>) # increment counter | \) # closing of group (?<-N>) # decrement counter ) (?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )? # quantifiers | \| # alternative )* # repeat content $ # end of string (?(N)(?!)) # fail if counter is non-zero. |
压实的:
1 | ^(?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>|\?<[^\W\d]\w*>|\?'[^\W\d]\w*')?(?<N>)|\)(?<-N>))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*$(?(N)(?!)) |
不太可能。
用
不,如果您严格地谈论正则表达式,而不包括一些实际上是上下文无关语法的正则表达式实现。
正则表达式有一个限制,这使得编写一个只匹配所有正则表达式的正则表达式是不可能的。不能匹配成对的大括号等实现。正则表达式使用许多这样的构造,让我们以[]为例。只要有[必须有匹配的]。对于regex"[.*]"来说足够简单。
使正则表达式不可能的是它们可以嵌套。如何编写与嵌套括号匹配的regex?答案是没有无限长的正则表达式是不可能的。您可以通过蛮力匹配任意数量的嵌套paren,但是您永远无法匹配任意长的嵌套括号集。
这种能力通常被称为计数(您正在计算嵌套的深度)。定义的regex不具有计数功能。
编辑:最后写了一篇关于这个的博客文章:正则表达式限制
问得好。真正的正则语言不能决定任意的嵌套得很好的括号。也就是说,如果您的字母表中包含"("和')",那么目标是确定其中的字符串是否具有格式良好的匹配括号。因为这是正则表达式的必要条件,所以答案是否定的。
但是:如果您放宽需求并添加递归,那么您可能可以这样做。原因是递归可以充当一个"堆栈",让您通过推到此堆栈来"计算"当前嵌套深度。
RussCox写了一篇关于regex引擎实现的精彩论文:正则表达式匹配可以简单快速
您可以将regex提交到preg_match,如果regex无效,则返回false。不要忘记使用"@"来禁止显示错误消息:
1 | @preg_match($regexToTest, ''); |
- 如果regex为"//",则返回1。
- 如果regex正常,则返回0。
- 否则将返回false。
尽管使用Mizardx发布的递归regex是完全可能的,但对于这种情况,解析器更有用。regex最初是用于常规语言的,递归或拥有平衡组只是一个补丁。
定义有效regex的语言实际上是一种上下文无关的语法,您应该使用适当的解析器来处理它。下面是一个大学项目的示例,用于解析简单的正则表达式(不包含大多数构造)。它使用javacc。是的,注释是西班牙语的,尽管方法名是很容易解释的。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | SKIP : { "" | " " | "\t" | " " } TOKEN : { < DIGITO: ["0" -"9"] > | < MAYUSCULA: ["A" -"Z"] > | < MINUSCULA: ["a" -"z"] > | < LAMBDA:"LAMBDA"> | < VACIO:"VACIO"> } IRegularExpression Expression() : { IRegularExpression r; } { r=Alternation() { return r; } } // Matchea disyunciones: ER | ER IRegularExpression Alternation() : { IRegularExpression r1 = null, r2 = null; } { r1=Concatenation() ("|" r2=Alternation() )? { if (r2 == null) { return r1; } else { return createAlternation(r1,r2); } } } // Matchea concatenaciones: ER.ER IRegularExpression Concatenation() : { IRegularExpression r1 = null, r2 = null; } { r1=Repetition() ("." r2=Repetition() { r1 = createConcatenation(r1,r2); } )* { return r1; } } // Matchea repeticiones: ER* IRegularExpression Repetition() : { IRegularExpression r; } { r=Atom() ("*" { r = createRepetition(r); } )* { return r; } } // Matchea regex atomicas: (ER), Terminal, Vacio, Lambda IRegularExpression Atom() : { String t; IRegularExpression r; } { ("(" r=Expression()")" {return r;}) | t=Terminal() { return createTerminal(t); } | <LAMBDA> { return createLambda(); } | <VACIO> { return createEmpty(); } } // Matchea un terminal (digito o minuscula) y devuelve su valor String Terminal() : { Token t; } { ( t=<DIGITO> | t=<MINUSCULA> ) { return t.image; } } |
以下例子由paul mcguire编写,最初来自pyparsing wiki,但现在只通过wayback机器提供,它给出了一种语法,用于解析某些regex,以便返回匹配字符串集。因此,它拒绝那些包含无界重复项的re's,比如"+"和"*"。但它应该给您一个关于如何构造将处理re的解析器的概念。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 | # # invRegex.py # # Copyright 2008, Paul McGuire # # pyparsing script to expand a regular expression into all possible matching strings # Supports: # - {n} and {m,n} repetition, but not unbounded + or * repetition # - ? optional elements # - [] character ranges # - () grouping # - | alternation # __all__ = ["count","invert"] from pyparsing import (Literal, oneOf, printables, ParserElement, Combine, SkipTo, operatorPrecedence, ParseFatalException, Word, nums, opAssoc, Suppress, ParseResults, srange) class CharacterRangeEmitter(object): def __init__(self,chars): # remove duplicate chars in character range, but preserve original order seen = set() self.charset ="".join( seen.add(c) or c for c in chars if c not in seen ) def __str__(self): return '['+self.charset+']' def __repr__(self): return '['+self.charset+']' def makeGenerator(self): def genChars(): for s in self.charset: yield s return genChars class OptionalEmitter(object): def __init__(self,expr): self.expr = expr def makeGenerator(self): def optionalGen(): yield"" for s in self.expr.makeGenerator()(): yield s return optionalGen class DotEmitter(object): def makeGenerator(self): def dotGen(): for c in printables: yield c return dotGen class GroupEmitter(object): def __init__(self,exprs): self.exprs = ParseResults(exprs) def makeGenerator(self): def groupGen(): def recurseList(elist): if len(elist)==1: for s in elist[0].makeGenerator()(): yield s else: for s in elist[0].makeGenerator()(): for s2 in recurseList(elist[1:]): yield s + s2 if self.exprs: for s in recurseList(self.exprs): yield s return groupGen class AlternativeEmitter(object): def __init__(self,exprs): self.exprs = exprs def makeGenerator(self): def altGen(): for e in self.exprs: for s in e.makeGenerator()(): yield s return altGen class LiteralEmitter(object): def __init__(self,lit): self.lit = lit def __str__(self): return"Lit:"+self.lit def __repr__(self): return"Lit:"+self.lit def makeGenerator(self): def litGen(): yield self.lit return litGen def handleRange(toks): return CharacterRangeEmitter(srange(toks[0])) def handleRepetition(toks): toks=toks[0] if toks[1] in"*+": raise ParseFatalException("",0,"unbounded repetition operators not supported") if toks[1] =="?": return OptionalEmitter(toks[0]) if"count" in toks: return GroupEmitter([toks[0]] * int(toks.count)) if"minCount" in toks: mincount = int(toks.minCount) maxcount = int(toks.maxCount) optcount = maxcount - mincount if optcount: opt = OptionalEmitter(toks[0]) for i in range(1,optcount): opt = OptionalEmitter(GroupEmitter([toks[0],opt])) return GroupEmitter([toks[0]] * mincount + [opt]) else: return [toks[0]] * mincount def handleLiteral(toks): lit ="" for t in toks: if t[0] =="\": if t[1] =="t": lit += '\t' else: lit += t[1] else: lit += t return LiteralEmitter(lit) def handleMacro(toks): macroChar = toks[0][1] if macroChar =="d": return CharacterRangeEmitter("0123456789") elif macroChar =="w": return CharacterRangeEmitter(srange("[A-Za-z0-9_]")) elif macroChar =="s": return LiteralEmitter("") else: raise ParseFatalException("",0,"unsupported macro character (" + macroChar +")") def handleSequence(toks): return GroupEmitter(toks[0]) def handleDot(): return CharacterRangeEmitter(printables) def handleAlternative(toks): return AlternativeEmitter(toks[0]) _parser = None def parser(): global _parser if _parser is None: ParserElement.setDefaultWhitespaceChars("") lbrack,rbrack,lbrace,rbrace,lparen,rparen = map(Literal,"[]{}()") reMacro = Combine("\" + oneOf(list("dws"))) escapedChar = ~reMacro + Combine("\" + oneOf(list(printables))) reLiteralChar ="".join(c for c in printables if c not in r"\[]{}().*?+|") +" \t" reRange = Combine(lbrack + SkipTo(rbrack,ignore=escapedChar) + rbrack) reLiteral = ( escapedChar | oneOf(list(reLiteralChar)) ) reDot = Literal(".") repetition = ( ( lbrace + Word(nums).setResultsName("count") + rbrace ) | ( lbrace + Word(nums).setResultsName("minCount")+","+ Word(nums).setResultsName("maxCount") + rbrace ) | oneOf(list("*+?")) ) reRange.setParseAction(handleRange) reLiteral.setParseAction(handleLiteral) reMacro.setParseAction(handleMacro) reDot.setParseAction(handleDot) reTerm = ( reLiteral | reRange | reMacro | reDot ) reExpr = operatorPrecedence( reTerm, [ (repetition, 1, opAssoc.LEFT, handleRepetition), (None, 2, opAssoc.LEFT, handleSequence), (Suppress('|'), 2, opAssoc.LEFT, handleAlternative), ] ) _parser = reExpr return _parser def count(gen): """Simple function to count the number of elements returned by a generator.""" i = 0 for s in gen: i += 1 return i def invert(regex): """Call this routine as a generator to return all the strings that match the input regular expression. for s in invert("[A-Z]{3}\d{3}"): print s """ invReGenerator = GroupEmitter(parser().parseString(regex)).makeGenerator() return invReGenerator() def main(): tests = r""" [A-EA] [A-D]* [A-D]{3} X[A-C]{3}Y X[A-C]{3}\( X\d foobar\d\d foobar{2} foobar{2,9} fooba[rz]{2} (foobar){2} ([01]\d)|(2[0-5]) ([01]\d\d)|(2[0-4]\d)|(25[0-5]) [A-C]{1,2} [A-C]{0,3} [A-C]\s[A-C]\s[A-C] [A-C]\s?[A-C][A-C] [A-C]\s([A-C][A-C]) [A-C]\s([A-C][A-C])? [A-C]{2}\d{2} @|TH[12] @(@|TH[12])? @(@|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))? @(@|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))? (([ECMP]|HA|AK)[SD]|HS)T [A-CV]{2} A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr] (a|b)|(x|y) (a|b) (x|y) """.split(' ') for t in tests: t = t.strip() if not t: continue print '-'*50 print t try: print count(invert(t)) for s in invert(t): print s except ParseFatalException,pfe: print pfe.msg continue if __name__ =="__main__": main() |
不,如果使用标准正则表达式。
原因是您不能满足常规语言的泵送引理。抽运引理表明,如果存在一个数字n,那么属于语言l的字符串是规则的,这样,在将字符串划分为3个子字符串xyz之后,x>=1&;&;x y<=n,您可以重复y任意多次,并且整个字符串仍然属于l。
抽运引理的一个结果是,不能有形式为