How to decode surrogate characters encoded as UTF8?
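My C# program receives some UTF-8 encoded data and decodes it with `Encoding.UTF8.GetString`, as in the example below.
How can I decode this data in C#?
Example: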
```csharp
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string orig = "\U0001F30E"; // U+1F30E, the globe character

        // Simulate correct conversion using std::codecvt_utf8_utf16<wchar_t>
        byte[] correctUTF8 = Encoding.UTF8.GetBytes(orig);
        Console.WriteLine("correctUTF8:" + BitConverter.ToString(correctUTF8)); // F0-9F-8C-8E - that's what the C++ program should've produced

        // Simulate bad conversion using std::codecvt_utf8<wchar_t> - that's what I get from the program
        byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };

        string badString = Encoding.UTF8.GetString(badUTF8); // "\uFFFD\uFFFD\uFFFD\uFFFD" (4 * U+FFFD 'REPLACEMENT CHARACTER')

        // How can I convert this?
    }
}
```
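Note: the encoding program is written in C++ and converts the data using `std::codecvt_utf8<wchar_t>` (code below). As @peterduniho's answer correctly points out, it should use `std::codecvt_utf8_utf16<wchar_t>` instead.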
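```cpp
#include <codecvt>
#include <locale>
#include <string>

// wstr is the std::wstring holding the text to encode
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);
```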
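It's impossible to know for sure without a good minimal, complete, and verifiable code example. But it looks to me as though you are using the wrong converter in C++.
Instead, you should use: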
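```cpp
#include <codecvt>
#include <locale>
#include <string>

// Same call as before, but with the facet that treats wchar_t data as UTF-16
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);
```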
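When I use this converter, I get the desired UTF-8 bytes: F0 9F 8C 8E.
Addendum:
The question has been updated to indicate that the encoding code cannot be changed. You are stuck with UCS-2 that has been encoded into invalid UTF8: each UTF-16 surrogate was written as its own three-byte sequence, which UTF-8 does not allow. Because the UTF8 is invalid, you will have to decode the text yourself.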
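To make the problem concrete, here is a minimal C++ sketch of how the two byte sequences from the question can arise. The helper names `encode_code_point` and `encode_per_utf16_unit` are made up for illustration and are not part of any library. The sketch assumes that the faulty encoder treats every 16-bit unit as a standalone character, which is what the bytes in the question suggest; it does not call the real `codecvt` facets.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: plain UTF-8 encoding arithmetic for one value.
// It deliberately does NOT reject surrogates (0xD800..0xDFFF), so it can
// also mimic an encoder that works on individual UTF-16 code units.
std::vector<std::uint8_t> encode_code_point(std::uint32_t cp) {
    auto byte = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(byte(cp));
    } else if (cp < 0x800) {
        out.push_back(byte(0xC0 | (cp >> 6)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(byte(0xE0 | (cp >> 12)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(byte(0xF0 | (cp >> 18)));
        out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(byte(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Hypothetical helper: split a supplementary code point into its UTF-16
// surrogate pair and encode each surrogate on its own (the assumed
// behaviour of the faulty encoder).
std::vector<std::uint8_t> encode_per_utf16_unit(std::uint32_t cp) {
    if (cp < 0x10000) {
        return encode_code_point(cp);
    }
    const std::uint32_t v = cp - 0x10000;
    const std::uint32_t high = 0xD800 + (v >> 10);   // high (lead) surrogate
    const std::uint32_t low = 0xDC00 + (v & 0x3FF);  // low (trail) surrogate
    std::vector<std::uint8_t> out = encode_code_point(high);
    const std::vector<std::uint8_t> tail = encode_code_point(low);
    out.insert(out.end(), tail.begin(), tail.end());
    return out;
}
```

For U+1F30E, `encode_code_point` produces the four bytes the question lists as correct, while `encode_per_utf16_unit` produces the six bytes the question lists as bad, one three-byte sequence per surrogate.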
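I see a couple of reasonable approaches. First, write a decoder that doesn't care whether the UTF8 includes invalid byte sequences. Second, use the C++ `std::codecvt_utf8<wchar_t>` converter to decode the bytes for you (e.g. write your receiving code in C++, or write a C++ DLL that you can call from your C# code to do the work).
The second option is in some sense more reliable, in that you would be using exactly the decoder that created the bad data in the first place. On the other hand, even creating a DLL might be overkill, never mind writing the entire client in C++. When making a DLL, even with C++/CLI, you will still have some headaches getting the interop right unless you are already an expert.
I am familiar with C++/CLI, but hardly an expert. I am much better with C#, so here is some code for the first option: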
```csharp
using System;
using System.Collections.Generic;

// The members below belong inside a class.

private const int _khighOffset = 0xD800 - (0x10000 >> 10);

/// <summary>
/// Decodes a nominally UTF8 byte sequence as UTF16. Ignores all data errors
/// except those which prevent coherent interpretation of the input data.
/// Input with invalid-but-decodable UTF8 sequences will be decoded without
/// error, and may lead to invalid UTF16.
/// </summary>
/// <param name="bytes">The UTF8 byte sequence to decode</param>
/// <returns>A string value representing the decoded UTF8</returns>
/// <remarks>
/// This method has not been thoroughly validated. It should be tested
/// carefully with a broad range of inputs (the entire UTF16 code point
/// range would not be unreasonable) before being used in any sort of
/// production environment.
/// </remarks>
private static string DecodeUtf8WithOverlong(byte[] bytes)
{
    List<char> result = new List<char>();
    int continuationCount = 0, continuationAccumulator = 0, highBase = 0;
    char continuationBase = '\0';

    for (int i = 0; i < bytes.Length; i++)
    {
        byte b = bytes[i];

        if (b < 0x80)
        {
            result.Add((char)b);
            continue;
        }

        if (b < 0xC0)
        {
            // Byte values in this range are used only as continuation bytes.
            // If we aren't expecting any continuation bytes, then the input
            // is invalid beyond repair.
            if (continuationCount == 0)
            {
                throw new ArgumentException("invalid encoding");
            }

            // Each continuation byte represents 6 bits of the actual
            // character value
            continuationAccumulator <<= 6;
            continuationAccumulator |= (b - 0x80);

            if (--continuationCount == 0)
            {
                continuationAccumulator += highBase;

                if (continuationAccumulator > 0xffff)
                {
                    // Code point requires more than 16 bits, so split into surrogate pair
                    char highSurrogate = (char)(_khighOffset + (continuationAccumulator >> 10)),
                        lowSurrogate = (char)(0xDC00 + (continuationAccumulator & 0x3FF));

                    result.Add(highSurrogate);
                    result.Add(lowSurrogate);
                }
                else
                {
                    result.Add((char)(continuationBase | continuationAccumulator));
                }

                continuationAccumulator = 0;
                continuationBase = '\0';
                highBase = 0;
            }

            continue;
        }

        if (b < 0xE0)
        {
            continuationCount = 1;
            continuationBase = (char)((b - 0xC0) * 0x0040);
            continue;
        }

        if (b < 0xF0)
        {
            continuationCount = 2;
            continuationBase = (char)(b == 0xE0 ? 0x0800 : (b - 0xE0) * 0x1000);
            continue;
        }

        if (b < 0xF8)
        {
            continuationCount = 3;
            highBase = (b - 0xF0) * 0x00040000;
            continue;
        }

        if (b < 0xFC)
        {
            continuationCount = 4;
            highBase = (b - 0xF8) * 0x01000000;
            continue;
        }

        if (b < 0xFE)
        {
            continuationCount = 5;
            highBase = (b - 0xFC) * 0x40000000;
            continue;
        }

        // byte values of 0xFE and 0xFF are invalid
        throw new ArgumentException("invalid encoding");
    }

    return new string(result.ToArray());
}
```
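I tested it with your globe character and it works fine. It also correctly decodes the proper UTF8 for that character (i.e. F0 9F 8C 8E).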
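For readers who take the C++ route, a different variant of the first option is to repair the bytes rather than decode them: rewrite every encoded surrogate pair as the proper 4-byte sequence, then hand the result to any ordinary UTF-8 decoder. The sketch below is not the C# method above and has only been reasoned through for the globe example; `repair_surrogate_pairs` and `is_encoded_surrogate` are hypothetical names, and the sketch assumes the only defect in the input is surrogates encoded as separate 3-byte sequences. Unpaired surrogates and all other bytes are copied through unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if in[i..i+2] is a 3-byte sequence ED xx yy whose
// second byte lies in [lo, hi] and whose third byte is a continuation byte.
// ED A0..AF yy encodes a high surrogate, ED B0..BF yy a low surrogate.
static bool is_encoded_surrogate(const std::vector<std::uint8_t>& in,
                                 std::size_t i, std::uint8_t lo, std::uint8_t hi) {
    return i + 2 < in.size() && in[i] == 0xED &&
           in[i + 1] >= lo && in[i + 1] <= hi &&
           (in[i + 2] & 0xC0) == 0x80;
}

// Hypothetical helper: replaces each encoded high+low surrogate pair with the
// 4-byte UTF-8 encoding of the code point the pair represents.
std::vector<std::uint8_t> repair_surrogate_pairs(const std::vector<std::uint8_t>& in) {
    auto byte = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    std::vector<std::uint8_t> out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        if (is_encoded_surrogate(in, i, 0xA0, 0xAF) &&
            is_encoded_surrogate(in, i + 3, 0xB0, 0xBF)) {
            // Recover the two UTF-16 code units from their 3-byte encodings.
            const std::uint32_t high =
                0xD000u | ((in[i + 1] & 0x3Fu) << 6) | (in[i + 2] & 0x3Fu);
            const std::uint32_t low =
                0xD000u | ((in[i + 4] & 0x3Fu) << 6) | (in[i + 5] & 0x3Fu);

            // Combine the surrogate pair into one supplementary code point.
            const std::uint32_t cp =
                0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);

            // Emit the standard 4-byte UTF-8 form.
            out.push_back(byte(0xF0 | (cp >> 18)));
            out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(byte(0x80 | (cp & 0x3F)));
            i += 6;
        } else {
            out.push_back(in[i++]);
        }
    }
    return out;
}
```

Applied to the bad bytes from the question (ED A0 BC ED BC 8E), this yields F0 9F 8C 8E, which `Encoding.UTF8.GetString` can then decode normally. As with the C# method, test it against a broad range of inputs before relying on it.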