Python Regex Engine - “look-behind requires fixed-width pattern” Error

I am trying to handle unmatched double quotes within a string in CSV format.

To be precise,

"It"does"not"make"sense", Well,"Does"it"

should be corrected to

"It""does""not""make""sense", Well,"Does""it"

So basically, what I am trying to do is

replace every '"' that is

  • not preceded by the beginning of a line or a comma, and
  • not followed by a comma or the end of a line

with '""'.

For this, I use the regex below

    (?<!^|,)"(?!,|$)

The problem is that the Ruby regex engine (http://www.rubular.com/) is able to parse this regex, whereas the Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:

    Invalid regular expression: look-behind requires fixed-width pattern

and running it with Python 2.7.3 raises

    sre_constants.error: look-behind requires fixed-width pattern

Can anyone tell me what vexes Python here?

    ================================================== ===============================

Edit:

Following Tim's response, I got the output below for a multi-line string

    >>> import re
    >>> str = """"It"does"not"make"sense", Well,"Does"it"
    ... "It"does"not"make"sense", Well,"Does"it"
    ... "It"does"not"make"sense", Well,"Does"it"
    ... "It"does"not"make"sense", Well,"Does"it" """
    >>> re.sub(r'\b\s*"(?!,|$)', '""', str)
    '"It""does""not""make""sense", Well,"Does""it""\n"It""does""not""make""sense", Well,"Does""it""\n"It""does""not""make""sense", Well,"Does""it""\n"It""does""not""make""sense", Well,"Does""it"" '

At the end of every line, two double quotes were added next to "it".

So I made a very small change to the regex to handle the new lines.

    re.sub(r'\b\s*"(?!,|$)', '""', str, flags=re.MULTILINE)

But this gave the output

    >>> re.sub(r'\b\s*"(?!,|$)', '""', str, flags=re.MULTILINE)
    '"It""does""not""make""sense", Well,"Does""it"\n"It""does""not""make""sense", Well,"Does""it"\n"It""does""not""make""sense", Well,"Does""it"\n"It""does""not""make""sense", Well,"Does""it"" '

Now only the last "it" has two double quotes.

But I wonder why the '$' end-of-line anchor does not recognize that the line has ended there.

    ================================================== ===============================

The final answer is

    re.sub(r'\b\s*"(?!,|[ \t]*$)', '""', str, flags=re.MULTILINE)
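
To see why the plain `$` version left the last quote doubled, the three variants can be compared side by side. This is a minimal sketch, assuming the trailing space visible at the end of the outputs above is what keeps `$` from matching right after the last quote. The names `line`, `text` and `fixed` are illustrative, and `text` is used instead of `str` so the built-in is not shadowed.

```python
import re

line = '"It"does"not"make"sense", Well,"Does"it"'
# Four copies joined by newlines, plus the trailing space from the session above.
text = "\n".join([line] * 4) + " "

# What a correctly repaired single line should look like.
fixed = '"It""does""not""make""sense", Well,"Does""it"'

# Without re.MULTILINE, $ only matches at the very end of the string
# (or just before a final newline), so the quote ending each inner line is doubled.
no_flag = re.sub(r'\b\s*"(?!,|$)', '""', text)

# With re.MULTILINE, $ also matches before each newline, but the last
# quote is followed by a space rather than by an end of line.
multiline = re.sub(r'\b\s*"(?!,|$)', '""', text, flags=re.MULTILINE)

# Allowing spaces or tabs before $ covers the trailing-whitespace case.
final = re.sub(r'\b\s*"(?!,|[ \t]*$)', '""', text, flags=re.MULTILINE)

print(repr(no_flag.splitlines()[0]))
print(repr(multiline.splitlines()[-1]))
print(repr(final.splitlines()[-1]))
```

Printing with `repr` makes the trailing space visible, which is easy to miss in plain output.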


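Python re lookbehinds really do need to be fixed-width. When a lookbehind contains alternatives of different lengths, there are several ways to handle the situation:

• Rewrite the pattern so that no alternation is needed. For example, Tim's answer uses a word boundary; you could also use (?<=[^,])"(?!,|$), an exact equivalent of your current pattern that requires a character other than a comma before the double quote; and the common pattern for matching words surrounded by whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S).
• Split the lookbehinds:

  • Positive lookbehinds need to be alternated in a group (e.g. (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc))).
  • Negative lookbehinds can simply be concatenated (e.g. (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$)).

Alternatively, just install the PyPI regex module with pip install regex (or pip3 install regex) and enjoy unlimited-width lookbehinds.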
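
The splitting approach can be tried with the standard `re` module alone. The following is a rough sketch rather than a definitive recipe: the sample string comes from the question, while the test string `'ax bcx cx dx'` and the variable names are made up for illustration.

```python
import re

s = '"It"does"not"make"sense", Well,"Does"it"'

# The original pattern mixes a zero-width ^ with a one-character comma
# inside a single lookbehind, which the standard re module rejects.
try:
    re.compile(r'(?<!^|,)"(?!,|$)')
    rejected = False
except re.error:
    rejected = True

# Negative lookbehind split into two concatenated assertions.
split_negative = re.compile(r'(?<!^)(?<!,)"(?!,|$)')
result = split_negative.sub('""', s)

# Positive lookbehind split into an alternation of single-width lookbehinds.
split_positive = re.compile(r'(?:(?<=a)|(?<=bc))x')
starts = [m.start() for m in split_positive.finditer('ax bcx cx dx')]

print(rejected)
print(result)
print(starts)
```

Concatenating negative lookbehinds works because both conditions must hold at the same position, whereas positive alternatives need a group so that either one may succeed.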


Python lookbehind assertions need to be fixed width, but you can try this:

    >>> import re
    >>> s = '"It"does"not"make"sense", Well,"Does"it"'
    >>> re.sub(r'\b\s*"(?!,|$)', '""', s)
    '"It""does""not""make""sense", Well,"Does""it"'

Explanation:

    \b      # Start the match at the end of a "word"
    \s*     # Match optional whitespace
    "       # Match a quote
    (?!,|$) # unless it's followed by a comma or end of string
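
The commented breakdown above maps almost directly onto a pattern compiled with `re.VERBOSE`, which lets the comments live inside the regex itself. A small sketch, where the name `QUOTE_FIX` is only an illustrative choice:

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace
# and # comments inside the pattern are ignored by the engine.
QUOTE_FIX = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or the end of the string
''', re.VERBOSE)

s = '"It"does"not"make"sense", Well,"Does"it"'
print(QUOTE_FIX.sub('""', s))
```

For multi-line input, the flags can be combined as `re.VERBOSE | re.MULTILINE`.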