Python can't encode with surrogateescape

I'm having a problem with Unicode surrogate encoding in Python (3.4):

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

If I'm not mistaken, according to the Python documentation:

'surrogateescape': On decoding, replace byte with individual surrogate
code ranging from U+DC80 to U+DCFF. This code will then be turned back
into the same byte when the 'surrogateescape' error handler is used
when encoding the data.

the code should simply reproduce the source sequence (b'\xCC'). So why is the exception raised?

This may be related to my second question:

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.

(From https://docs.python.org/3/library/codecs.html, "Standard Encodings" section)

As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what is the reasoning behind this change?

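The change was made because the Unicode standard explicitly forbids such encodings. See issue 12892. Apparently, though, the surrogateescape error handler cannot be made to work with UTF-16 or UTF-32, because these codecs are not ASCII-compatible.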
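On the second question, it may help to separate surrogate pairs from lone surrogate code points. The sketch below assumes CPython 3.4 or later and picks U+1F600 as an arbitrary character outside the Basic Multilingual Plane. It suggests that the encoder still emits surrogate pairs for such characters, and only rejects a surrogate code point that appears in the string by itself:

```python
# A character above U+FFFF is still encoded as a surrogate pair (D83D DE00).
pair = '\U0001F600'.encode('utf-16-be')
pair_is_surrogate_pair = (pair == bytes.fromhex('d83dde00'))

def can_encode(text, errors):
    """Return True if text encodes to UTF-16-BE with the given error handler."""
    try:
        text.encode('utf-16-be', errors)
        return True
    except UnicodeEncodeError:
        return False

# A lone surrogate code point in the str is rejected, with or without
# the surrogateescape handler (this is the case from the question).
lone_strict = can_encode('\udccc', 'strict')
lone_escape = can_encode('\udccc', 'surrogateescape')

print(pair_is_surrogate_pair, lone_strict, lone_escape)
```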

Specifically:

I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder
does not work as expected.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[?]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'

=> I expected '[\udc80\udcdc]'.

To which the response was:

Yes, surrogateescape doesn't work with ASCII incompatible encodings and can't. First, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth separated issue (or even PEP) and discussion on Python-Dev.
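One possible reading of the first objection in that reply: the handler only has escape values for bytes 0x80 to 0xFF (U+DC80 to U+DCFF), so a byte below 0x80, such as the 0x00 in b'\x00\xd8', has no escaped form. The following is a minimal sketch of that range limit. It uses the ascii codec purely for illustration, and the variable names are hypothetical:

```python
# Each byte 0x80-0xFF maps to exactly one lone surrogate U+DC80-U+DCFF,
# and that surrogate maps back to the same byte on encoding.
mapping_ok = all(
    bytes([b]).decode('ascii', 'surrogateescape') == chr(0xDC00 + b)
    and chr(0xDC00 + b).encode('ascii', 'surrogateescape') == bytes([b])
    for b in range(0x80, 0x100)
)

# U+DC00 would have to stand for byte 0x00, but it lies outside the
# handler's range, so encoding it fails.
try:
    '\udc00'.encode('ascii', 'surrogateescape')
    low_byte_escapable = True
except UnicodeEncodeError:
    low_byte_escapable = False

print(mapping_ok, low_byte_escapable)
```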

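I believe the surrogateescape handler is better suited to UTF-8 data. Being able to use it when decoding UTF-16 or UTF-32 is now a nice extra, but apparently it cannot work in the other direction.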
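For comparison, here is a small sketch of the round trip the documentation describes, using UTF-8, where it does work. The input bytes are an arbitrary example containing two bytes that are not valid UTF-8:

```python
# 0xCC and 0xFF are not valid UTF-8 here, so each is escaped to a lone surrogate.
raw = b'abc\xcc\xff'
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler turns the surrogates back into the original bytes.
restored = text.encode('utf-8', 'surrogateescape')

print(ascii(text))      # 'abc\udccc\udcff'
print(restored == raw)  # True
```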