Java: Implement a function to check if a string/byte array follows UTF-8 format

I am struggling with this interview question.

After being given a clear definition of the UTF-8 format (e.g. 1 byte:
0b0xxxxxxx, 2 bytes: ....), I was asked to write a function that validates
whether the input is valid UTF-8. The input will be a string/byte array, and
the output should be yes/no.

I have two possible approaches.

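First, if the input is a string: since a UTF-8 sequence is at most 4 bytes long, after removing the first two characters "0b" we can use Integer.parseInt(s, 2) to check whether the rest of the string is in the range 0 to 0x10FFFF. It is also better to check first that the length of the string is a multiple of 8 and that the string contains only 0s and 1s. So I would go through the string twice, and the complexity would be O(n).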

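Second, if the input is a byte array (this method also works if the input is a string), we check whether each one-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check whether each 8-character substring is in range.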
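
Both approaches first need the textual input turned into bytes. Here is a minimal sketch, assuming the input is a run of '0'/'1' characters with an optional "0b" prefix and eight digits per byte. The question does not fix the exact string format, and the class and method names are hypothetical:

```java
import java.util.Arrays;

final class BitStringParser {
    /**
     * Converts a string such as "0b1100001010100010" into bytes.
     * Returns null if the string is not a whole number of 8-bit groups
     * or contains anything other than '0' and '1'.
     */
    static byte[] parse(String s) {
        String bits = s.startsWith("0b") ? s.substring(2) : s;
        if (bits.isEmpty() || bits.length() % 8 != 0) {
            return null;
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c != '0' && c != '1') {
                return null;
            }
            // Shift the bits collected so far and append the new one.
            out[i / 8] = (byte) ((out[i / 8] << 1) | (c - '0'));
        }
        return out;
    }

    public static void main(String[] args) {
        // 0xC2 0xA2 is the two-byte UTF-8 encoding of U+00A2.
        System.out.println(Arrays.toString(parse("0b1100001010100010")));
    }
}
```

This does the length check, the digit check, and the conversion in a single pass. It also avoids Integer.parseInt, which throws NumberFormatException for a 32-digit binary string whose first digit is 1.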

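I know there are a couple of solutions for checking a string with Java libraries, but my question is how I should implement the function based on this question.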

Thanks.

Related discussion

Let's first have a look at a visual representation of the UTF-8 design.

[Image: table of UTF-8 byte sequence layouts]

Now let's go over what we have to do.

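Loop over all characters of the string (each character being a byte).
We need to apply a mask to each byte, depending on the sequence length, because the x bits carry the actual code point. We use the bitwise AND operator (&), which sets a bit in the result only if it is set in both operands.
The goal of the mask is to strip the payload bits so that only the prefix of the lead byte is compared. The mask has a 1 for every prefix bit of the lead byte (a single 1 for a one-byte sequence, n + 1 ones for an n-byte sequence) and 0 for all other bits.
We can then compare the masked value with the expected lead-byte pattern to verify that the byte is valid and to determine how many bytes the sequence has.
If the byte matches none of the cases, it is invalid and we return "no".
If we get out of the loop, every byte was valid, hence the input is valid.
Make sure that the comparison that returned true corresponds to the expected length, i.e. that the lead byte is followed by the right number of continuation bytes.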
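
The masking step can be illustrated in isolation. The following sketch classifies a single lead byte with the same kind of masks, limited here to the 1- to 4-byte forms. The class and helper names are hypothetical:

```java
final class LeadByte {
    /**
     * Returns the total sequence length announced by a lead byte,
     * or -1 if the byte cannot start a 1- to 4-byte sequence
     * (for example a continuation byte of the form 10xxxxxx).
     */
    static int sequenceLength(byte b) {
        if ((b & 0b10000000) == 0b00000000) return 1; // 0xxxxxxx
        if ((b & 0b11100000) == 0b11000000) return 2; // 110xxxxx
        if ((b & 0b11110000) == 0b11100000) return 3; // 1110xxxx
        if ((b & 0b11111000) == 0b11110000) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0b11001111)); // lead byte of a 2-byte sequence
        System.out.println(sequenceLength((byte) 0b10001111)); // continuation byte, not a lead
    }
}
```

Because the byte is promoted to int before the AND, the sign extension of negative bytes does not matter: the mask clears everything above the low eight bits.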

The method would look like this:

public static final boolean isUTF8(final byte[] pText) {

    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        // Determine the sequence length from the high bits of the lead byte.
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            // 5- and 6-byte forms come from the original UTF-8 definition;
            // RFC 3629 no longer allows them.
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            // Stray continuation byte or invalid lead byte.
            return false;
        }

        // Every remaining byte of the sequence must have the form 10xxxxxx.
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false; // sequence is truncated
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}

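Edit: the current method is not the original one (almost, but not quite); it was taken from here. As per @ejp's comment, the original one was not working properly.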

Related discussion

A small solution for real-world UTF-8 compatibility checking:

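import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public static final boolean isUTF8(final byte[] inputBytes) {
    // Malformed input decodes to U+FFFD, so re-encoding yields different bytes.
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}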

You can verify it with these tests:

// Assumes JUnit and AssertJ are on the classpath.
@Test
public void testEnconding() {

    // Both invalid inputs start with a continuation byte (10xxxxxx).
    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

Test cases copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
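
As an alternative to the round-trip comparison above, the decoder can be asked to report malformed input instead of silently replacing it, which avoids building a second byte array. This is a sketch using the standard java.nio.charset API; the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderCheck {
    /** Returns true if the bytes decode as UTF-8 without any error. */
    static boolean isUTF8(byte[] input) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A new decoder is created per call because CharsetDecoder instances carry state and are not safe for concurrent use.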

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading byte order mark (EF BB BF), if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (stray continuation byte or 5-/6-byte lead)
            return false;
        }

        if (end >= j) {
            // Truncated sequence: not enough bytes left
            return false;
        }

        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}

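Well, I am grateful for the comments and the answers. First of all, I have to agree that this is "another stupid interview question". It is true that a Java String is already decoded text, so it is essentially always compatible with UTF-8. One way to check it, given a string, is: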

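import java.io.UnsupportedEncodingException;

public static boolean isUTF8(String s) {
    try {
        // Encoding never fails on the content of the string.
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Effectively unreachable: every Java platform supports UTF-8.
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}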

But since all printable strings are in Unicode form, I never had a chance to get an error.
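
One case where a Java String cannot be encoded as UTF-8 is an unpaired surrogate, since a String holds UTF-16 code units. Here is a sketch of how that could be detected with the standard CharsetEncoder API; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class EncodableCheck {
    /** Returns true if the whole string can be encoded as UTF-8. */
    static boolean isEncodableAsUTF8(String s) {
        // A fresh encoder per call: CharsetEncoder instances are stateful.
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isEncodableAsUTF8("\u24b6"));  // ordinary BMP character
        System.out.println(isEncodableAsUTF8("\uD800"));  // lone high surrogate
    }
}
```

Calling getBytes on such a string does not throw; it substitutes a replacement byte, which is why the method above always returns true.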

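Second, if a byte array is given, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so I assumed it would always be within a valid UTF-8 range (though this only bounds each individual byte; a sequence of bytes can still be malformed).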
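
Java bytes are signed, so the values 0x80 to 0xFF show up as negative numbers, and those are exactly the bytes a UTF-8 validator has to inspect. A small illustration, with hypothetical helper names:

```java
final class SignedBytes {
    /** Returns the unsigned value (0..255) of a Java byte. */
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    /** Returns true if b has the continuation-byte form 10xxxxxx. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        byte b = (byte) 0b10111111;
        System.out.println(b);                 // printed as a negative number
        System.out.println(unsigned(b));       // same bits read as 0..255
        System.out.println(isContinuation(b)); // matches 10xxxxxx
    }
}
```

Some byte values, such as 0xFF, never occur in well-formed UTF-8, so being a valid byte does not imply being valid UTF-8.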

My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.

Moreover, Java provides a constructor to convert a byte array to a String; if you are interested in the decoding method, check here.

One more thing: the answers above would be correct for another language. The only improvement could be:

In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

So 4 bytes are enough.
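
Following the RFC 3629 restriction quoted above, a stricter check could also reject overlong forms, UTF-16 surrogates, and code points above U+10FFFF, none of which a mask-only validator catches. Here is a sketch under those assumptions; the class name StrictUtf8 is hypothetical and the code is not tuned for speed:

```java
final class StrictUtf8 {
    /** Returns true if the input is well-formed UTF-8 per RFC 3629. */
    static boolean isValid(byte[] in) {
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            if (b < 0x80) {          // single-byte (ASCII) sequence
                i++;
                continue;
            }
            int extra;               // continuation bytes expected
            int min;                 // smallest code point for this length
            int cp;                  // code point being assembled
            if ((b & 0xE0) == 0xC0) {
                extra = 1; min = 0x80; cp = b & 0x1F;
            } else if ((b & 0xF0) == 0xE0) {
                extra = 2; min = 0x800; cp = b & 0x0F;
            } else if ((b & 0xF8) == 0xF0) {
                extra = 3; min = 0x10000; cp = b & 0x07;
            } else {
                return false;        // stray continuation or 5-/6-byte lead
            }
            if (i + extra >= in.length) {
                return false;        // sequence is truncated
            }
            for (int k = 1; k <= extra; k++) {
                int c = in[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    return false;    // not of the form 10xxxxxx
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) return false;                     // overlong encoding
            if (cp > 0x10FFFF) return false;                // beyond Unicode range
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
            i += extra + 1;
        }
        return true;
    }
}
```

Assembling the code point is what makes the extra checks possible: the prefix masks alone accept inputs such as C0 80, which encodes U+0000 in two bytes instead of one.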

I am definitely new to this, so please correct me if I am wrong. Thanks.