How to parse an XML string with deep structures using Python
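A similar question has been asked here (Python XML parsing), but I still cannot get at the content I am interested in.
I want the contents of each patent-classification element whose classification-scheme child has scheme="CPC".
In the example given below, there are three such values: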
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
  <ops:meta name="elapsed-time" value="21"/>
  <exchange-documents>
    <exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
      <bibliographic-data>
        <publication-reference>
          <document-id document-id-type="docdb">
            <country>US</country>
            <doc-number>2009234106</doc-number>
            <kind>A1</kind>
            <date>20090917</date>
          </document-id>
          <document-id document-id-type="epodoc">
            <doc-number>US2009234106</doc-number>
            <date>20090917</date>
          </document-id>
        </publication-reference>
        <classifications-ipcr>
          <classification-ipcr sequence="1">
            <text>C07K 16/ 44 A I </text>
          </classification-ipcr>
        </classifications-ipcr>
        <patent-classifications>
          <patent-classification sequence="1">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>16</main-group>
            <subgroup>22</subgroup>
            <classification-value>I</classification-value>
          </patent-classification>
          <patent-classification sequence="2">
            <classification-scheme office="" scheme="CPC"/>
            <section>A</section>
            <class>61</class>
            <subclass>K</subclass>
            <main-group>2039</main-group>
            <subgroup>505</subgroup>
            <classification-value>A</classification-value>
          </patent-classification>
          <patent-classification sequence="7">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>2317</main-group>
            <subgroup>92</subgroup>
            <classification-value>A</classification-value>
          </patent-classification>
          <patent-classification sequence="1">
            <classification-scheme office="US" scheme="UC"/>
            <classification-symbol>530/387.9</classification-symbol>
          </patent-classification>
        </patent-classifications>
      </bibliographic-data>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>
```
If you do not have it yet, install Beautiful Soup (for example with `pip install beautifulsoup4`), then try this:
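```python
from bs4 import BeautifulSoup

# Read raw bytes so Beautiful Soup can work out the encoding itself
with open('example.xml', 'rb') as f:
    xml = f.read()

# html.parser ships with Python; pass 'xml' instead if lxml is installed
bs = BeautifulSoup(xml, 'html.parser')

# find every patent-classification element
patents = bs.find_all('patent-classification')

# keep only the ones whose classification-scheme is CPC
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        print(pa.get_text(' ', strip=True))
```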
You can use Python's standard xml.etree.ElementTree module:
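```python
import xml.etree.ElementTree as ET

root = ET.parse('a.xml').getroot()

# The document's default namespace has to be spelled out in the path.
# The trailing '/..' steps from each matching classification-scheme
# up to its parent patent-classification element.
path = ".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."
for node in root.iterfind(path):
    # collect the text of every child that has any
    # (getchildren() was removed in Python 3.9; iterate the element directly)
    data = [d.text for d in node if d.text]
    print(' '.join(data))
```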
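Since the title asks about an XML string rather than a file, the same ElementTree idea can be sketched with `ET.fromstring` and a prefix-to-URI mapping, which avoids repeating the namespace in braces. This is only an illustration: `SAMPLE` is a trimmed copy of the example document, and the `ex` prefix and the `cpc_classifications` helper are names invented here, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example document: one CPC entry and one UC entry
SAMPLE = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document country="US" doc-number="2009234106" kind="A1">
      <bibliographic-data>
        <patent-classifications>
          <patent-classification sequence="1">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>16</main-group>
            <subgroup>22</subgroup>
            <classification-value>I</classification-value>
          </patent-classification>
          <patent-classification sequence="1">
            <classification-scheme office="US" scheme="UC"/>
            <classification-symbol>530/387.9</classification-symbol>
          </patent-classification>
        </patent-classifications>
      </bibliographic-data>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""

# Hypothetical prefix chosen for this sketch; only the URI has to match the document
NS = {"ex": "http://www.epo.org/exchange"}


def cpc_classifications(xml_text):
    """Return one dict per patent-classification whose scheme is CPC."""
    root = ET.fromstring(xml_text)
    results = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        if scheme is None or scheme.get("scheme") != "CPC":
            continue
        # Drop the "{namespace}" part of each tag to get a plain key,
        # and skip children without text (such as classification-scheme)
        fields = {child.tag.split("}", 1)[-1]: child.text
                  for child in pc if child.text}
        results.append(fields)
    return results


for fields in cpc_classifications(SAMPLE):
    print(" ".join(fields.values()))  # prints: C 07 K 16 22 I
```

Returning dictionaries instead of printing directly keeps the field names available, so a caller can pick out `section` or `main-group` individually rather than splitting a joined string.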