urlopen trouble while trying to download a gzip file
I want to use a Wiktionary dump for POS tagging. For some reason it fails while downloading. Here is my code:
    import nltk
    from urllib import urlopen
    from collections import Counter
    import gzip

    url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
    fStream = gzip.open(urlopen(url).read(), 'rb')
    dictFile = fStream.read()
    fStream.close()

    text = nltk.Text(word.lower() for word in dictFile())
    tokens = nltk.word_tokenize(text)
This is the error I get:
    Traceback (most recent call last):
      File "~/dir1/dir1/wikt.py", line 15, in <module>
        fStream = gzip.open(urlopen(url).read(), 'rb')
      File "/usr/lib/python2.7/gzip.py", line 34, in open
        return GzipFile(filename, mode, compresslevel)
      File "/usr/lib/python2.7/gzip.py", line 89, in __init__
        fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
    TypeError: file() argument 1 must be encoded string without NULL bytes, not str

    Process finished with exit code 1
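You are passing the downloaded data to gzip.open(), which expects a filename as its first argument.
The code then tries to open a file whose name is the gzip-compressed data itself, and that fails.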
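As a rough illustration of why the error message mentions NULL bytes, the sketch below compresses a small sample and inspects the result. It assumes Python 3 (the question uses Python 2.7), and the sample data is made up to stand in for the real dump, so no network access is needed.

```python
import gzip

# Hypothetical sample standing in for the downloaded dump.
sample = b"apple\nbanana\ncherry\n"
compressed = gzip.compress(sample)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
has_gzip_magic = compressed[:2] == b"\x1f\x8b"

# The binary stream contains NUL bytes, which a filename may not contain.
contains_nul = b"\x00" in compressed

print(has_gzip_magic, contains_nul)
```

Binary data like this can never be a valid filename, which is why the open call rejects it before it even looks for a file.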
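Either save the URL data to a file and then use gzip.open() on it, or keep the data in memory by wrapping it in a StringIO object and passing that to gzip.GzipFile: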
    from StringIO import StringIO
    from urllib import urlopen
    import gzip

    url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
    # Wrap the downloaded bytes in an in-memory file object (Python 2).
    inmemory = StringIO(urlopen(url).read())

    # GzipFile accepts a file object via fileobj, so no filename is needed.
    fStream = gzip.GzipFile(fileobj=inmemory, mode='rb')
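For readers on Python 3, where the StringIO module no longer exists, the same two approaches might look roughly like the sketch below. The helper names gunzip_in_memory and gunzip_via_file are hypothetical, and the sample bytes stand in for the result of the download so the example runs offline.

```python
import gzip
import io
import os
import tempfile


def gunzip_in_memory(data):
    """Decompress gzip bytes by wrapping them in an in-memory file object."""
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as stream:
        return stream.read()


def gunzip_via_file(data):
    """Write gzip bytes to a temporary file, then open that file by name."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        with gzip.open(path, "rb") as stream:
            return stream.read()
    finally:
        os.remove(path)


# Hypothetical stand-in for the downloaded bytes; no network access needed.
sample = gzip.compress(b"apple\nbanana\ncherry\n")
titles = gunzip_in_memory(sample).decode("utf-8").splitlines()
print(titles)
```

With a real download, the bytes would presumably come from urllib.request.urlopen(url).read() instead of the sample. The in-memory variant avoids touching the disk, while the file variant is the one that matches what gzip.open() expects.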