Python: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 386: character maps to <undefined>

I am trying to read a PDF file with the Slate library, but it throws the following error:

import slate

pdf = 'tabla9.pdf'

with open(pdf, encoding="utf-8") as f:
    doc = slate.PDF(f)

for page in doc[:2]:
    print(page)

Full traceback:

  File "C:\Users\user\libro5.py", line 7, in <module>
    doc = slate.PDF(f)
  File "C:\Python3\lib\slate\classes.py", line 52, in __init__
    self.parser = PDFParser(file)
  File "C:\Python3\lib\site-packages\pdfminer\pdfparser.py", line 646, in __init__
    PSStackParser.__init__(self, fp)
  File "C:\Python3\lib\site-packages\pdfminer\psparser.py", line 189, in __init__
    PSBaseParser.__init__(self, fp)
  File "C:\Python3\lib\site-packages\pdfminer\psparser.py", line 134, in __init__
    data = fp.read()
  File "C:\Python3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

classes.py, line 52:

class PDF(list):
    def __init__(self, file, password='', just_text=1, check_extractable=True, char_margin=1.0, line_margin=0.1, word_margin=0.1):
        self.parser = PDFParser(file)

pdfparser.py, line 646:

    def __init__(self, fp):
        PSStackParser.__init__(self, fp)

psparser.py, line 189:

class PSStackParser(PSBaseParser):

    def __init__(self, fp):
        PSBaseParser.__init__(self, fp)

psparser.py, line 134:

class PSBaseParser:

    """Most basic PostScript parser that performs only tokenization.
    """

    def __init__(self, fp):
        data = fp.read()

codecs.py, line 322, in decode, where the error is raised (UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte):

def decode(self, input, final=False):
    # decode input (taking the buffer into account)
    data = self.buffer + input
    (result, consumed) = self._buffer_decode(data, self.errors, final)

I am using Python 3.7 on Windows 10.


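PDF files are binary, so they should not be opened in text mode with an encoding. In text mode Python tries to decode every byte as characters, and that fails as soon as it reaches a byte sequence that is not valid in the chosen encoding.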

Try:

with open(pdf, "rb") as f:
    doc = slate.PDF(f)
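
To see the difference without installing Slate, here is a minimal sketch that only uses the standard library. The bytes in SAMPLE_BYTES are an assumption: a made-up stand-in for the start of a PDF (a version header, a comment line of high bytes, and one 0x8d byte), not the contents of tabla9.pdf. The helper try_read is likewise hypothetical and exists only for this illustration.

```python
import os
import tempfile

# Assumed sample data: a stand-in for the first bytes of a PDF file.
# It is NOT the real content of tabla9.pdf.
SAMPLE_BYTES = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n\x8d"


def try_read(path, **open_kwargs):
    """Read the whole file; return the data, or the decode error if one is raised."""
    try:
        with open(path, **open_kwargs) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return exc


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as f:
        f.write(SAMPLE_BYTES)

    # Text mode: Python has to decode the bytes, which fails on binary data.
    utf8_result = try_read(path, mode="r", encoding="utf-8")
    cp1252_result = try_read(path, mode="r", encoding="cp1252")

    # Binary mode: the bytes are returned untouched, no codec is involved.
    binary_result = try_read(path, mode="rb")

print(type(utf8_result).__name__, utf8_result)
print(type(cp1252_result).__name__, cp1252_result)
print(type(binary_result).__name__, binary_result)
```

With these assumed bytes, both text-mode attempts should come back as a UnicodeDecodeError, while the "rb" read returns the original bytes. On Windows the default text encoding is often cp1252, which Python reports as the 'charmap' codec; that would explain the error in the title when no encoding is passed, and the 'utf-8' error once encoding="utf-8" is added.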