Python: parsing and modifying HTML with BeautifulSoup or lxml

Parsing and modifying HTML with BeautifulSoup or lxml: surround text that sits directly under the <body> tag with an HTML tag

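I am a beginner working in Python 2.7. I want to parse and modify some HTML files. For this I am using BeautifulSoup; lxml is also an option. My question: can I modify the HTML so that a piece of text is surrounded by some HTML tag? The text sits directly under the <body> tag, so for whatever text is directly under <body>, I want to modify the HTML so that the text ends up inside a tag of my choosing. That way I can parse it easily and find out where the text is.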

<html><body>
List Price:
<strike>$150.00</strike> 
Price
$117.80 
You Save:
$32.20(21%) 
In Stock
 
Free Shipping
 
Ships from and sold by Amazon.com 
Gift-wrap available. </body></html>

So in this example I want to surround the text "$117.80" and "$32.20" with some user-defined HTML tag. How can I do this with BeautifulSoup or lxml?

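I think you need to wrap the tail text, and I would choose lxml because it handles tails better. The script below looks for any element that has tail text, creates a new tag (pick whichever tag you like), moves the tail text into it and inserts it right after that element. It uses a regular expression to check whether the text contains a price, which skips trailing text such as Ships from and sold by Amazon.com or Gift-wrap available.: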

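from lxml import etree
import re

# Parse the file; 'htmlfile' is the path to the HTML document.
tree = etree.parse('htmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    # Only handle elements whose tail text contains a price such as "$117.80".
    if elem.tail is not None and elem.tail.strip() and re.search(r'\$\d+', elem.tail):
        # Move the tail text into a new <div> placed right after the element.
        e = etree.Element('div')
        e.text = elem.tail
        elem.tail = ''
        elem.addnext(e)

print(etree.tostring(root))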

It yields:

<html><body>
List Price:
<strike>$150.00</strike> 
Price
$117.80 
You Save:
$32.20(21%) 
In Stock
 
Free Shipping
 
Ships from and sold by Amazon.com 
Gift-wrap available. </body></html>

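
Since the question also mentions BeautifulSoup, here is a rough equivalent of the same idea for comparison. It is only a sketch and assumes the bs4 package with the built-in html.parser; the sample string is a shortened, made-up variant of the HTML in the question, and the div tag is an arbitrary choice. BeautifulSoup exposes text directly under <body> as NavigableString children, so there is no separate tail concept to deal with.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Same simple price check as in the lxml script.
PRICE = re.compile(r'\$\d+')

# Shortened sample for illustration; not the full page from the question.
html = ('<html><body>List Price: <strike>$150.00</strike> Price $117.80 '
        '<br/>In Stock</body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Copy the children into a list first, because wrap() modifies the tree.
for node in list(soup.body.children):
    # Only bare text nodes directly under <body> that contain a price.
    if isinstance(node, NavigableString) and PRICE.search(node):
        node.wrap(soup.new_tag('div'))

result = str(soup)
print(result)
```

As in the lxml script, the whole text node is wrapped, so the word Price ends up inside the new tag together with the amount.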