C#: How to decode surrogate characters encoded as UTF8?

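My C# program receives some UTF-8 encoded data and decodes it with Encoding.UTF8.GetString(data). When the program that produces the data encounters a character outside the BMP, it encodes it as 2 surrogate characters, each of which is encoded separately as UTF-8. In that case my program fails to decode them correctly.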

How can I decode such data in C#?

Example:


static void Main(string[] args)
{
    string orig = "\U0001F30E"; // 🌎 (U+1F30E)
    byte[] correctUTF8 = Encoding.UTF8.GetBytes(orig); // Simulate correct conversion using std::codecvt_utf8_utf16<wchar_t>
    Console.WriteLine("correctUTF8: " + BitConverter.ToString(correctUTF8)); // F0-9F-8C-8E - that's what the C++ program should've produced

    // Simulate bad conversion using std::codecvt_utf8<wchar_t> - that's what I get from the program
    byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };
    string badString = Encoding.UTF8.GetString(badUTF8); // decodes to 4 * U+FFFD 'REPLACEMENT CHARACTER'
    // How can I convert this?
}

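Note: the encoding program is written in C++ and converts the data using std::codecvt_utf8<wchar_t> (code below). As @PeterDuniho's answer correctly points out, it should have used std::codecvt_utf8_utf16<wchar_t>. Unfortunately, I don't control that program and can't change its behavior; I can only handle the malformed input it gives me.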

std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

Related discussion

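Without a good Minimal, Complete, and Verifiable code example it's impossible to know for sure. But it looks to me as though you are using the wrong converter in C++.

With a 16-bit wchar_t, the std::codecvt_utf8 facet converts from UCS-2, not UTF-16. The two are very similar, but UCS-2 has no notion of the surrogate pairs needed to encode the character you want to encode, so each surrogate is treated as a character in its own right.

Instead, you should use std::codecvt_utf8_utf16: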

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

When I use that converter, I get the UTF-8 bytes needed: F0 9F 8C 8E. And of course, these decode correctly in .NET when interpreted as UTF-8.
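To make the difference concrete, here is a minimal C++ sketch (not the standard library's implementation) that reproduces both byte sequences by hand. The names encodeUtf8Unchecked, encodePerCodeUnit and encodePerCodePoint are made up for illustration. The first conversion mimics a UCS-2 style converter that treats every 16-bit unit as a whole character; the second combines a surrogate pair into one code point first, as a UTF-16 aware converter would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: applies the generic UTF-8 bit layout to a value of up
// to 21 bits. It deliberately does NOT reject surrogates (U+D800..U+DFFF).
Bytes encodeUtf8Unchecked(std::uint32_t v) {
    assert(v <= 0x1FFFFF);
    auto b = [](std::uint32_t x) { return static_cast<std::uint8_t>(x); };
    if (v < 0x80)    return { b(v) };
    if (v < 0x800)   return { b(0xC0 | (v >> 6)), b(0x80 | (v & 0x3F)) };
    if (v < 0x10000) return { b(0xE0 | (v >> 12)), b(0x80 | ((v >> 6) & 0x3F)),
                              b(0x80 | (v & 0x3F)) };
    return { b(0xF0 | (v >> 18)), b(0x80 | ((v >> 12) & 0x3F)),
             b(0x80 | ((v >> 6) & 0x3F)), b(0x80 | (v & 0x3F)) };
}

// UCS-2 style: every 16-bit unit is encoded as if it were a complete character.
Bytes encodePerCodeUnit(const std::u16string& s) {
    Bytes out;
    for (char16_t unit : s) {
        Bytes bytes = encodeUtf8Unchecked(unit);
        out.insert(out.end(), bytes.begin(), bytes.end());
    }
    return out;
}

// UTF-16 style: a high/low surrogate pair is combined into one code point
// before it is encoded.
Bytes encodePerCodePoint(const std::u16string& s) {
    Bytes out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size()) {
            std::uint32_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        Bytes bytes = encodeUtf8Unchecked(cp);
        out.insert(out.end(), bytes.begin(), bytes.end());
    }
    return out;
}
```

For the globe character (UTF-16 units D83C DF0E), encodePerCodeUnit yields ED A0 BC ED BC 8E, which is what the question shows, while encodePerCodePoint yields F0 9F 8C 8E.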

Addendum:

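The question has been updated to indicate that the encoding code cannot be changed. You are stuck with UCS-2 that has been encoded into invalid UTF8. Because the UTF8 is invalid, you will have to decode the text yourself.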

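I see a couple of reasonable ways to do this. First, write a decoder that doesn't care whether the UTF8 contains invalid byte sequences. Second, use the C++ std::wstring_convert<std::codecvt_utf8<wchar_t>> converter to decode the bytes for you (e.g. write your receiving code in C++, or write a C++ DLL that you can call from your C# code to do the work).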

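The second option is in some sense the more reliable one, in that you'd be using exactly the decoder that created the bad data in the first place. On the other hand, even creating a DLL might be overkill, never mind writing the whole client in C++. And when making a DLL, even with C++/CLI, you'll still have some headaches getting the interop to work right unless you're already an expert.

I'm familiar with, but hardly an expert in, C++/CLI. I know C# much better, so here's some code for the first option: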


// Requires: using System; using System.Collections.Generic;

// Offset such that (_khighOffset + (codePoint >> 10)) yields the high surrogate
private const int _khighOffset = 0xD800 - (0x10000 >> 10);

/// <summary>
/// Decodes a nominally UTF8 byte sequence as UTF16. Ignores all data errors
/// except those which prevent coherent interpretation of the input data.
/// Input with invalid-but-decodable UTF8 sequences will be decoded without
/// error, and may lead to invalid UTF16.
/// </summary>
/// <param name="bytes">The UTF8 byte sequence to decode</param>
/// <returns>A string value representing the decoded UTF8</returns>
/// <remarks>
/// This method has not been thoroughly validated. It should be tested
/// carefully with a broad range of inputs (the entire UTF16 code point
/// range would not be unreasonable) before being used in any sort of
/// production environment.
/// </remarks>
private static string DecodeUtf8WithOverlong(byte[] bytes)
{
    List<char> result = new List<char>();
    int continuationCount = 0, continuationAccumulator = 0, highBase = 0;
    char continuationBase = '\0';

    for (int i = 0; i < bytes.Length; i++)
    {
        byte b = bytes[i];

        if (b < 0x80)
        {
            // Single-byte (ASCII) character
            result.Add((char)b);
            continue;
        }

        if (b < 0xC0)
        {
            // Byte values in this range are used only as continuation bytes.
            // If we aren't expecting any continuation bytes, then the input
            // is invalid beyond repair.
            if (continuationCount == 0)
            {
                throw new ArgumentException("invalid encoding");
            }

            // Each continuation byte represents 6 bits of the actual
            // character value
            continuationAccumulator <<= 6;
            continuationAccumulator |= (b - 0x80);
            if (--continuationCount == 0)
            {
                continuationAccumulator += highBase;

                if (continuationAccumulator > 0xffff)
                {
                    // Code point requires more than 16 bits, so split into surrogate pair
                    char highSurrogate = (char)(_khighOffset + (continuationAccumulator >> 10)),
                        lowSurrogate = (char)(0xDC00 + (continuationAccumulator & 0x3FF));

                    result.Add(highSurrogate);
                    result.Add(lowSurrogate);
                }
                else
                {
                    // Fits in one UTF16 code unit. Surrogates that were encoded
                    // individually land here and are emitted as-is.
                    result.Add((char)(continuationBase | continuationAccumulator));
                }
                continuationAccumulator = 0;
                continuationBase = '\0';
                highBase = 0;
            }
            continue;
        }

        if (b < 0xE0)
        {
            // Lead byte of a 2-byte sequence: carries the top 5 bits
            continuationCount = 1;
            continuationBase = (char)((b - 0xC0) * 0x0040);
            continue;
        }

        if (b < 0xF0)
        {
            // Lead byte of a 3-byte sequence: carries the top 4 bits.
            // Overlong forms (lead byte 0xE0) decode to their nominal value.
            continuationCount = 2;
            continuationBase = (char)((b - 0xE0) * 0x1000);
            continue;
        }

        if (b < 0xF8)
        {
            // Lead byte of a 4-byte sequence
            continuationCount = 3;
            highBase = (b - 0xF0) * 0x00040000;
            continue;
        }

        if (b < 0xFC)
        {
            // Lead byte of an (obsolete) 5-byte sequence
            continuationCount = 4;
            highBase = (b - 0xF8) * 0x01000000;
            continue;
        }

        if (b < 0xFE)
        {
            // Lead byte of an (obsolete) 6-byte sequence
            continuationCount = 5;
            highBase = (b - 0xFC) * 0x40000000;
            continue;
        }

        // byte values of 0xFE and 0xFF are invalid
        throw new ArgumentException("invalid encoding");
    }

    return new string(result.ToArray());
}

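I tested it with your globe character and it works fine. It also correctly decodes the proper UTF8 for that character (i.e. F0 9F 8C 8E). Of course, if you intend to use this code to decode all of your UTF8 input, you'll want to test it with the full range of data.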
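If the only defect in the input is that surrogate pairs arrive as two 3-byte sequences, a narrower alternative to a full custom decoder is to repair the bytes first and then hand them to a standard decoder such as Encoding.UTF8.GetString. This is a different technique from the decoder above. It is sketched here in C++ to match the encoder side, and the same byte logic should port directly to C#. The name repairSurrogatePairs is a hypothetical helper, and the sketch assumes that lone surrogates and all other bytes should be passed through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the answer above): rewrites each surrogate
// pair that was encoded as two 3-byte sequences into a single 4-byte UTF-8
// sequence. Every other byte, including a lone surrogate, is copied as-is.
std::vector<std::uint8_t> repairSurrogatePairs(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        // High surrogate U+D800..U+DBFF encodes as ED A0..AF 80..BF
        // Low surrogate  U+DC00..U+DFFF encodes as ED B0..BF 80..BF
        const bool isPair = i + 5 < in.size()
            && in[i] == 0xED && (in[i + 1] & 0xF0) == 0xA0 && (in[i + 2] & 0xC0) == 0x80
            && in[i + 3] == 0xED && (in[i + 4] & 0xF0) == 0xB0 && (in[i + 5] & 0xC0) == 0x80;

        if (!isPair) {
            out.push_back(in[i]);
            ++i;
            continue;
        }

        // Recover the two UTF-16 code units from the 3-byte sequences
        const std::uint32_t hi = 0xD000u | ((in[i + 1] & 0x3Fu) << 6) | (in[i + 2] & 0x3Fu);
        const std::uint32_t lo = 0xD000u | ((in[i + 4] & 0x3Fu) << 6) | (in[i + 5] & 0x3Fu);

        // Combine them into one supplementary code point
        const std::uint32_t cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
        assert(cp >= 0x10000u && cp <= 0x10FFFFu);

        // Emit the code point as a regular 4-byte UTF-8 sequence
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        i += 6;
    }
    return out;
}
```

Given the bytes from the question, ED A0 BC ED BC 8E, this returns F0 9F 8C 8E. The trade-off is that it fixes only this one kind of malformed input; anything else that a strict decoder rejects stays rejected.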