Parsing and modifying HTML with BeautifulSoup or lxml. Surround text that is directly under the <body> tag with some HTML tag
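I am a beginner working in Python 2.7. I want to parse and modify some HTML files. For this I am using BeautifulSoup; lxml is also an option. The question is: can I modify the HTML so that a piece of text is surrounded by some HTML tag? The text sits directly under the <body> tag, so whatever text is directly under <body>, I want to modify the HTML so that the text ends up inside a tag of my choosing. Then I can easily parse it and find out where the text is.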
```html
<html><body>
List Price: <strike>$150.00</strike><br />
Price $117.80<br />
You Save: $32.20(21%)<br />
<font size="-1" color="#009900">In Stock</font> <br />
Free Shipping <br/>
Ships from and sold by Amazon.com<br />
Gift-wrap available.<br /></body></html>
```
So in this example, I want to surround the text "$117.80" and "$32.20" with some user-defined HTML tag. How can I achieve this with BeautifulSoup or lxml?
I think you need to surround the text like this:
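```python
from lxml import etree
import re

tree = etree.parse('htmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    # Text that follows an element's closing tag is stored in elem.tail.
    # Only handle non-blank tails that contain a dollar amount.
    if elem.tail is not None and elem.tail.strip() and re.search(r'\$\d+', elem.tail):
        e = etree.Element('div')
        e.text = elem.tail   # move the tail text into the new <div>
        elem.tail = ''       # and remove it from the original element
        elem.addnext(e)      # insert the <div> right after elem

print(etree.tostring(root))
```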
It produces:
```html
<html><body>
List Price: <strike>$150.00</strike><br/><div>
Price $117.80</div><br/><div>
You Save: $32.20(21%)</div><br/>
<font size="-1" color="#009900">In Stock</font> <br/>
Free Shipping <br/>
Ships from and sold by Amazon.com<br/>
Gift-wrap available.<br/></body></html>
```
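The lxml code above moves the whole tail (for example "Price $117.80") into the new element. If only the price itself should be wrapped, as the question asks, one possible variation is to split the tail around the regex match. This is only a sketch under some assumptions: the names `PRICE` and `wrap_price_in_tails`, the choice of `<span>`, and the shortened inline sample are made up for illustration, and at most one price per tail is handled.

```python
import re
from lxml import etree

# Hypothetical pattern: a dollar sign, digits, and an optional decimal part.
PRICE = re.compile(r'\$\d+(?:\.\d+)?')

def wrap_price_in_tails(root, tag='span'):
    """Wrap the first price found in each tail directly under <body>."""
    body = root.find('body')
    for elem in list(body):              # snapshot, because the loop inserts siblings
        tail = elem.tail or ''
        match = PRICE.search(tail)
        if match is None:
            continue
        wrapper = etree.Element(tag)
        wrapper.text = match.group(0)            # the price itself
        wrapper.tail = tail[match.end():]        # text after the price
        elem.tail = tail[:match.start()]         # text before the price
        body.insert(body.index(elem) + 1, wrapper)
    return root

# Shortened version of the sample from the question (assumed well-formed).
root = etree.fromstring(
    '<html><body>List Price: <strike>$150.00</strike><br/>'
    'Price $117.80<br/>You Save: $32.20(21%)<br/></body></html>')
wrap_price_in_tails(root)
print(etree.tostring(root).decode())
```

Text that comes before the first child of `<body>` lives in `body.text` rather than in a tail, so this sketch does not touch it; the `<strike>` price is also left alone because it is element text, not a tail.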
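Since the question also mentions BeautifulSoup, the same kind of wrapping can probably be done there too. The following is a minimal sketch, assuming the `bs4` package with the built-in `html.parser`; the helper name `wrap_prices`, the `PRICE` pattern, and the `<span>` tag are illustrative choices, not something from the original answer.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# The capture group makes re.split keep the matched prices in its result.
PRICE = re.compile(r'(\$\d+(?:\.\d+)?)')

def wrap_prices(html, tag_name='span'):
    """Wrap prices found in text nodes that are direct children of <body>."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.body
    for node in list(body.children):     # snapshot, because the loop edits the tree
        if not isinstance(node, NavigableString) or not PRICE.search(node):
            continue
        # Even indices hold plain text, odd indices hold the captured prices.
        for i, piece in enumerate(PRICE.split(node)):
            if not piece:
                continue
            if i % 2:
                new = soup.new_tag(tag_name)
                new.string = piece
            else:
                new = NavigableString(piece)
            node.insert_before(new)
        node.extract()                   # drop the original, unsplit text node
    return soup

sample = ('<html><body>List Price: <strike>$150.00</strike><br />'
          'Price $117.80<br />You Save: $32.20(21%)<br /></body></html>')
print(wrap_prices(sample))
```

Afterwards the wrapped prices can be looked up with `soup.body.find_all('span', recursive=False)`. Text inside nested tags such as `<strike>` is skipped on purpose, since only the direct children of `<body>` are inspected.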