Python: finding elements by attribute with lxml

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here is a sample document:

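<root>
    <articles>
        <article type="news">
            <content>some text</content>
        </article>
        <article type="info">
            <content>some text</content>
        </article>
        <article type="news">
            <content>some text</content>
        </article>
    </articles>
</root>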

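Here I only want the articles whose type is "news". What is the most efficient and elegant way to do this with lxml?

I tried the find method, but the result is not very nice: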

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

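You can use XPath, e.g. root.xpath("//article[@type='news']").

This XPath expression returns a list of all article elements whose "type" attribute has the value "news". You can then iterate over the list to do what you want, or pass it on to other code.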
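
As a rough sketch of what iterating over that list could look like, assuming a small inline document with a few article elements (the "info" type and the text values here are placeholders chosen for illustration, not taken from the real file):

```python
from lxml import etree

# Hypothetical sample document; the "info" type and the text values are
# placeholders chosen for illustration.
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# xpath() returns a plain Python list of the matching elements.
news_articles = root.xpath("//article[@type='news']")

# Collect the text of each matching article's <content> child.
news_texts = [article.findtext("content") for article in news_articles]
print(news_texts)  # ['first', 'third']
```

Because the result is an ordinary list, it can be sliced, counted with len(), or handed to another function like any other list.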

To get just the text content, you can extend the XPath like this:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
            <content>some text</content>
        </article>
        <article type="info">
            <content>some text</content>
        </article>
        <article type="news">
            <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

This will output ['some text', 'some text']. Or, if you want the content elements themselves rather than their text, the expression would be "//article[@type='news']/content", and so on.
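
To make the difference between the two expressions concrete, here is a minimal sketch, assuming a two-article document (the "info" type and the text values are placeholders). Ending the path at content should give you element objects, while adding text() gives you strings:

```python
from lxml import etree

# Hypothetical sample; "info" stands in for any non-news type value.
root = etree.fromstring(
    "<root><articles>"
    "<article type='news'><content>a</content></article>"
    "<article type='info'><content>b</content></article>"
    "</articles></root>"
)

# Ending the path at /content yields element objects...
content_elements = root.xpath("//article[@type='news']/content")
# ...while appending /text() yields the text nodes as strings.
content_strings = root.xpath("//article[@type='news']/content/text()")

print([el.tag for el in content_elements])  # ['content']
print(content_strings)                      # ['a']
```

The element form is the one to pick if you still need attributes, children, or the parent of each match; the text() form is handy when the strings are all you want.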

Just for reference, you can get the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
            <content>some text</content>
        </article>
        <article type="info">
            <content>some text</content>
        </article>
        <article type="news">
            <content>some text</content>
        </article>
    </articles>
</root>
""")

articles = root.find("articles")
article_list = articles.findall("article[@type='news']/content")
for a in article_list:
    print(a.text)
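
A possible shortcut, sketched under the same assumptions about the document shape (the "info" type and the text values are placeholders): findall accepts a ".//" prefix that searches all descendants, so the separate find("articles") step can be skipped. Keep in mind that findall only understands a subset of XPath, so the xpath() method is still the one to reach for with more complex expressions.

```python
from lxml import etree

# Hypothetical sample; "info" and the text values are illustrative only.
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# ".//" makes findall search every descendant of root, not just its
# direct children, so there is no need to locate <articles> first.
contents = root.findall(".//article[@type='news']/content")

texts = [c.text for c in contents]
print(texts)  # ['first', 'third']
```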
