Parsing compressed xml feed into ElementTree
I'm trying to parse the following feed into an ElementTree in Python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning: large file).
Here is what I've tried so far:
```python
feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")

# feed is compressed
compressed_data = feed.read()

import StringIO
compressedstream = StringIO.StringIO(compressed_data)

import gzip
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()

# Parse XML
tree = ET.parse(data)
```
But it just seems to get stuck there.
Next, I tried requests:
```python
url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)
```
But now

```python
tree = ET.parse(r.content)
```

or

```python
tree = ET.parse(r.text)
```

both raise exceptions.
What is the correct way to do this?
You could stream the response through `GzipFile` and parse it incrementally:
```python
#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)
```
where `getelements()` parses the file incrementally:
```python
def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory
```
To conserve memory, the constructed XML tree is cleared after each *tag* element is yielded.
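A minimal self-contained check of this pattern, using a small gzipped file written on the spot instead of the live feed (the filename and `<feed>`/`<event>` tags are made up for the demo):

```python
import gzip
import xml.etree.ElementTree as etree

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # grab the root from the first 'start' event
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # drop already-parsed children to free memory

# Build a small gzip-compressed XML file as a stand-in for the real feed.
with gzip.open("sample.xml.gz", "wb") as f:
    f.write(b"<feed><event id='1'/><event id='2'/><event id='3'/></feed>")

# gzip.open() returns a file object, so it can go straight into iterparse.
with gzip.open("sample.xml.gz", "rb") as xml_file:
    ids = [elem.get("id") for elem in getelements(xml_file, "event")]

print(ids)  # ['1', '2', '3']
```

Because each `<event>` is discarded after it is yielded, peak memory stays roughly constant no matter how large the feed is.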
You need to give `ET.parse()` a filename or a file object, not the document contents. Or, if you prefer, you already have a file object, `gzipper`, and you can pass that to `ET.parse()` directly.
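A quick sketch of that fix against the question's first attempt, with in-memory sample bytes standing in for the downloaded feed (the `<odds>`/`<market>` content here is made up):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Stand-in for the compressed bytes downloaded from the feed.
compressed_data = gzip.compress(b"<odds><market name='demo'/></odds>")

# Wrap the bytes in a file-like object, as in the question...
compressedstream = io.BytesIO(compressed_data)
gzipper = gzip.GzipFile(fileobj=compressedstream)

# ...but hand ET.parse() the file object itself, not gzipper.read().
tree = ET.parse(gzipper)
root = tree.getroot()
print(root.tag, root[0].get("name"))  # odds demo
```

The decompression happens lazily as the parser reads from `gzipper`, so the uncompressed document is never held as one big string.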
This is all covered in the short tutorial in the docs:
We can import this data by reading from a file:
```python
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
```
Or directly from a string:
```python
root = ET.fromstring(country_data_as_string)
```
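Tying that back to the question: `data = gzipper.read()` returns the XML as a byte string, so it belongs to `fromstring()`, not `parse()`, which tried to treat those bytes as a filename. A sketch with made-up in-memory data:

```python
import gzip
import xml.etree.ElementTree as ET

# Stand-in for the compressed feed contents.
compressed_data = gzip.compress(b"<odds><event sport='football'/></odds>")

# Decompressing yields bytes, which is fromstring() territory.
data = gzip.decompress(compressed_data)
root = ET.fromstring(data)
print(root[0].get("sport"))  # football
```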