UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 386: character maps to <undefined>
I am trying to read a PDF file with the Slate library, but it throws the following error:
    import slate

    pdf = 'tabla9.pdf'
    with open(pdf, encoding="utf-8") as f:
        doc = slate.PDF(f)

    for page in doc[:2]:
        print(page)
Full traceback:
      File "C:\Users\user\libro5.py", line 7, in <module>
        doc = slate.PDF(f)
      File "C:\Python3\lib\slate\classes.py", line 52, in __init__
        self.parser = PDFParser(file)
      File "C:\Python3\lib\site-packages\pdfminer\pdfparser.py", line 646, in __init__
        PSStackParser.__init__(self, fp)
      File "C:\Python3\lib\site-packages\pdfminer\psparser.py", line 189, in __init__
        PSBaseParser.__init__(self, fp)
      File "C:\Python3\lib\site-packages\pdfminer\psparser.py", line 134, in __init__
        data = fp.read()
      File "C:\Python3\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
    # slate\classes.py, line 52
    class PDF(list):
        def __init__(self, file, password='', just_text=1, check_extractable=True,
                     char_margin=1.0, line_margin=0.1, word_margin=0.1):
            self.parser = PDFParser(file)

    # pdfminer\pdfparser.py, line 646
    def __init__(self, fp):
        PSStackParser.__init__(self, fp)

    # pdfminer\psparser.py, line 189
    class PSStackParser(PSBaseParser):

        def __init__(self, fp):
            PSBaseParser.__init__(self, fp)

    # pdfminer\psparser.py, line 134
    class PSBaseParser:
        """Most basic PostScript parser that performs only tokenization.
        """

        def __init__(self, fp):
            data = fp.read()
File "C:\Python3\lib\codecs.py", line 322, in decode, (result, consumed) = self._buffer_decode(data, self.errors, final), UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte:
    def decode(self, input, final=False):
        # decode input (taking the buffer into account)
        data = self.buffer + input
        (result, consumed) = self._buffer_decode(data, self.errors, final)
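I am using Python 3.7 on Windows 10.
A PDF file is binary, so it should not be opened in text mode with an encoding: in text mode Python tries to decode the raw bytes as text, and that is the step that fails.
Try: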
    with open(pdf, "rb") as f:
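To see the difference without Slate installed, here is a minimal sketch that writes a few bytes resembling the start of a PDF to a temporary file and reads them back in both modes. The `SAMPLE` bytes, the file name `sample.pdf`, and the two helper functions are assumptions made up for illustration; they are not taken from `tabla9.pdf` or from Slate.

```python
import os
import tempfile

# Illustrative bytes only: a PDF version header followed by a comment line of
# high-bit bytes, which many PDFs carry to mark the file as binary.
SAMPLE = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def read_text_mode(path):
    """Read the file as UTF-8 text; return the decode error, or None on success."""
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as exc:
        return exc
    return None


def read_binary_mode(path):
    """Read the file as raw bytes; no decoding is attempted."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(SAMPLE)

    error = read_text_mode(path)    # text mode: decoding fails
    data = read_binary_mode(path)   # binary mode: bytes come back unchanged

print("text mode:  ", error)
print("binary mode:", data[:8])
```

Text mode should fail with a `UnicodeDecodeError` much like the one in the traceback, while binary mode returns the bytes untouched, which is what a parser that calls `fp.read()` on a PDF needs.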