Python: Regex include line breaks


I have the following XML file:


A




B
C


D



Picture number 3?

I only want the text between one pair of tags, so I tried this code:


import os, re

html = open("2.xml", "r")
text = html.read()
lon = re.compile(r'\n(.+)\n', re.MULTILINE)
lon = lon.search(text).group(1)
print(lon)

But it doesn't seem to work.

Related discussion

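1) Don't parse XML with regex. It just doesn't work reliably. Use an XML parser.

2) If you do use a regex, you don't want re.MULTILINE, which controls how ^ and $ behave in a multi-line string. You want re.DOTALL, which controls whether . matches \n.

3) You probably also want the pattern to return the shortest possible match, using the non-greedy +? operator.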


lon = re.compile(r'\n(.+?)\n', re.DOTALL)
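
The three points above can be sketched side by side. This is only an illustration under assumptions: the sample document and its `<root>` and `<p>` tag names are hypothetical (the tags of the original file are not shown here), and `first_match` is a helper invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document: the <root> and <p> tag names are assumptions.
SAMPLE = """<root>
<p>first line
second line</p>
<p>third line
fourth line</p>
</root>"""


def first_match(pattern, text, flags=0):
    """Return group(1) of the first match, or None if nothing matches."""
    m = re.search(pattern, text, flags)
    return m.group(1) if m else None


# re.MULTILINE only changes ^ and $, so "." still stops at each newline.
with_multiline = first_match(r"<p>(.+)</p>", SAMPLE, re.MULTILINE)

# re.DOTALL lets "." match newlines, but greedy ".+" runs to the last </p>.
with_dotall_greedy = first_match(r"<p>(.+)</p>", SAMPLE, re.DOTALL)

# Non-greedy ".+?" stops at the first </p>.
with_dotall_lazy = first_match(r"<p>(.+?)</p>", SAMPLE, re.DOTALL)

# An XML parser avoids the regex entirely.
parsed = [p.text for p in ET.fromstring(SAMPLE).iter("p")]

print(with_multiline)
print(with_dotall_lazy)
print(parsed)
```

With this sample, the re.MULTILINE attempt finds nothing because every paragraph spans two lines, the greedy re.DOTALL attempt swallows both paragraphs, and the non-greedy version and the parser both return the paragraphs separately.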

You could try splitting on the div and then matching the list items within each piece. This also works well for regex on large data.


import re

html ="""
A




B
C


D



Picture number 3?


"""

for div in html.split('<div'):
    m = re.search(r'xml:lang="unknown">.+(<p[^<]+)', div, re.DOTALL)
    if m:
        print(m.group(1))
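
To make the split-then-match idea easy to try, here is a small self-contained sketch. The markup in `HTML` is a hypothetical stand-in (only the `<div`, `xml:lang="unknown"` and `<p` pieces come from the pattern above), and `last_paragraph_per_div` is a name made up for this example.

```python
import re

# Hypothetical markup; only the pieces the pattern relies on are taken
# from the answer above.
HTML = """<div xml:lang="unknown">
<p>A</p>
<p>B
C</p>
</div>
<div xml:lang="en">
<p>D</p>
</div>"""


def last_paragraph_per_div(html):
    """Split on '<div' and apply the answer's pattern to each chunk."""
    results = []
    for chunk in html.split("<div"):
        m = re.search(r'xml:lang="unknown">.+(<p[^<]+)', chunk, re.DOTALL)
        if m:
            results.append(m.group(1))
    return results


print(last_paragraph_per_div(HTML))
```

Because `.+` is greedy and re.DOTALL is set, the captured group is the last `<p` in each matching chunk, and it includes the opening `<p>` text itself. Chunks whose div does not carry `xml:lang="unknown"` are skipped.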

You can parse a block like this: set a flag to true when you enter the block, and when you leave it, set the flag to false and break.


def get_infobox(self):
    """Return the Infobox wikitext from the raw text blob.

    Learned from https://github.com/siznax/wptools/blob/master/wp_infobox.py
    """
    if self._rawtext:
        text = self._rawtext
    else:
        text = self.get_rawtext()
    output = []
    region = False  # True while we are inside the infobox block
    braces = 0      # running count of unmatched "{{"
    lines = text.split("\n")
    if len(lines) < 3:
        raise RuntimeError("too few lines!")

    for line in lines:
        # A line with "{{" followed by text ending in "box" opens the block.
        match = re.search(r'(?im){{[^{]*box$', line)
        braces += len(re.findall(r'{{', line))
        braces -= len(re.findall(r'}}', line))
        if match:
            region = True
        if region:
            output.append(line.lstrip())
            if braces <= 0:
                # Braces are balanced again, so the block has ended.
                region = False
                break
    self._infobox = "\n".join(output)
    assert self._infobox
    return self._infobox
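
The method above depends on a surrounding class, so here is a standalone sketch of the same flag-and-counter idea that can be run directly. The function name `extract_infobox` and the `WIKITEXT` sample are assumptions made for illustration and are not part of wptools.

```python
import re


def extract_infobox(text):
    """Collect lines from the '{{...box' opener until braces balance."""
    output = []
    in_region = False  # flag: are we inside the block?
    braces = 0         # running count of unmatched "{{"
    for line in text.split("\n"):
        opener = re.search(r"(?i)\{\{[^{]*box$", line)
        braces += line.count("{{")
        braces -= line.count("}}")
        if opener:
            in_region = True
        if in_region:
            output.append(line.lstrip())
            if braces <= 0:
                break  # block closed, stop reading
    return "\n".join(output)


# Hypothetical wikitext used only to exercise the function.
WIKITEXT = """Intro text
{{Infobox
| name = Example
| kind = {{small|demo}}
}}
Body text"""

print(extract_infobox(WIKITEXT))
```

As in the method above, braces are counted on every line, including lines before the block starts, so unbalanced braces earlier in the text would throw the count off.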