Java - Reading UTF-8 bytes from a file into a String in a system-independent way

How do I accurately read a UTF-8 encoded file into a String in Java?

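When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), the program works fine when run from Eclipse, but not from the command line. It seems that Eclipse sets the file.encoding parameter when it runs the application.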
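One way to check this suspicion is to print what the JVM actually sees in each environment. The following is only a minimal sketch; the class name `DefaultCharsetCheck` and the method `describe` are made up for illustration.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Reports the file.encoding system property and the charset the JVM
    // falls back to whenever no charset is passed explicitly.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once from Eclipse and once from the command line should show whether the two environments differ. If they do, any reader or writer created without an explicit charset will behave differently between them.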

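Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files with different encodings. Once a file's encoding is known, I must be able to read it into a String regardless of the value of file.encoding.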
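For reference, decoding with an explicitly named charset can be sketched as below. This is an illustration under assumptions, not the code from this question: the class `ExplicitDecode` and its methods are hypothetical names, and the sample text is written with Unicode escapes so that the encoding of the .java file itself cannot affect the result.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitDecode {
    // Decodes raw bytes with the given charset; file.encoding is never consulted.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Reads the whole file as bytes, then decodes with the given charset.
    static String readFile(Path path, Charset charset) throws IOException {
        return decode(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            // "Russian Привет мир." written with escapes
            String original = "Russian \u041F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440.";
            Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));
            String readBack = readFile(tmp, StandardCharsets.UTF_8);
            System.out.println(original.equals(readBack));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because both the encode and the decode step name UTF-8 explicitly, the comparison should not depend on the platform default charset.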

The content of the UTF-8 file is as follows:


English Hello World.
Korean ?????.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi ???? ???????
Gujarati ???? ??????.
Thai ????????????.

-End of file-

The code is below. My observations are in the comments.


import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {
    public static void main(String[] args) {
        String slash = System.getProperty("file.separator");
        File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
        File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
        File outputUtfByteWrittenFile = new File(
                "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
        outputUtfFile.delete();
        outputUtfByteWrittenFile.delete();

        try {

            /*
             * read a utf8 text file with internationalized strings into bytes.
             * there should be no information loss here, when read into raw bytes.
             * We are sure that this file is UTF-8 encoded.
             * Input file created using Notepad++. Text copied from Google translate.
             */
            byte[] fileBytes = readBytes(inputUtfFile);

            /*
             * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
             */
            String str = new String(fileBytes, StandardCharsets.UTF_8);

            /*
             * The console is incapable of displaying this string.
             * So we write into another file. Open in notepad++ to check.
             */
            ArrayList<String> list = new ArrayList<>();
            list.add(str);
            writeLines(list, outputUtfFile);

            /*
             * Works fine when I read bytes and write bytes.
             * Open the other output file in notepad++ and check.
             */
            writeBytes(fileBytes, outputUtfByteWrittenFile);

            /*
             * I am using JDK 8u60.
             * I tried running this on command line instead of eclipse. Does not work.
             * I tried using apache commons io library. Does not work.
             *
             * This means that new String(bytes, charset); does not work correctly.
             * There is no real effect of specifying charset to string.
             */
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public static void writeLines(List<String> lines, File file) throws IOException {
        BufferedWriter writer = null;
        OutputStreamWriter osw = null;
        OutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            // No charset is passed here, so the platform default charset is used.
            osw = new OutputStreamWriter(fos);
            writer = new BufferedWriter(osw);
            String lineSeparator = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                writer.write(line);
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        } finally {
            close(writer);
            close(osw);
            close(fos);
        }
    }

    public static byte[] readBytes(File file) {
        FileInputStream fis = null;
        byte[] b = null;
        try {
            fis = new FileInputStream(file);
            b = readBytesFromStream(fis);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fis);
        }
        return b;
    }

    public static void writeBytes(byte[] inBytes, File file) {
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            writeBytesToStream(inBytes, fos);
            fos.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fos);
        }
    }

    // The null checks guard against a stream that was never opened.
    public static void close(InputStream inStream) {
        if (inStream != null) {
            try {
                inStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void close(OutputStream outStream) {
        if (outStream != null) {
            try {
                outStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void close(Writer writer) {
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
        int bytesread = -1;
        byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
        long count = 0;
        bytesread = readStream.read(b);
        while (bytesread != -1) {
            writeStream.write(b, 0, bytesread);
            count += bytesread;
            bytesread = readStream.read(b);
        }
        return count;
    }

    public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
        ByteArrayOutputStream writeStream = new ByteArrayOutputStream();
        byte[] byteArr = null;
        try {
            copy(readStream, writeStream);
            writeStream.flush();
            byteArr = writeStream.toByteArray();
        } finally {
            close(writeStream);
        }
        return byteArr;
    }

    public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
        ByteArrayInputStream bis = new ByteArrayInputStream(inBytes);
        try {
            copy(bis, writeStream);
        } finally {
            close(bis);
        }
    }
}

Edit: for @JB Nizet and everyone :)


//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

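I need to specify the byte encoding when reading bytes into a String. When I write bytes from a String to a file, I also need to specify the byte encoding.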
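The three-argument writeLines used in the edit is not shown in the post. A minimal sketch of what such an overload might look like follows; the class name `CharsetAwareWriter` and the helper `writeAndReadBack` are hypothetical and exist only to make the example self-contained.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Collections;
import java.util.List;

class CharsetAwareWriter {
    // Hypothetical variant of writeLines that takes the target charset explicitly.
    static void writeLines(List<String> lines, File file, Charset charset) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset))) {
            String lineSeparator = System.lineSeparator();
            for (int i = 0; i < lines.size(); i++) {
                writer.write(lines.get(i));
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        }
    }

    // Writes text to a temporary file and decodes it again with the same charset.
    static String writeAndReadBack(String text, Charset charset) {
        try {
            File tmp = File.createTempFile("utf8text_out", ".txt");
            try {
                writeLines(Collections.singletonList(text), tmp, charset);
                return new String(Files.readAllBytes(tmp.toPath()), charset);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440."; // "Привет мир."
        System.out.println(text.equals(writeAndReadBack(text, StandardCharsets.UTF_8)));
        System.out.println(text.equals(writeAndReadBack(text, StandardCharsets.UTF_16LE)));
    }
}
```

The point of the sketch is that the charset chosen for writing is independent of the one used for reading the input, as long as whoever opens the output file decodes it with the same charset that was used to write it.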

Once I have a String in the JVM, I no longer need to remember the encoding of the source bytes, right?

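When I write to the file, the String should be converted to my machine's default charset (whether that is UTF-8, ASCII, or cp1252). That fails. UTF-16 fails too. Why do some charsets fail?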
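One possible explanation, offered as a sketch and not as a confirmed diagnosis of the setup above: a charset can only write the characters it has a byte representation for, and String.getBytes(Charset) substitutes a replacement byte for everything else. The class `LossyEncodingDemo` and its methods are hypothetical names; US-ASCII stands in here for any limited single-byte charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // True if every character of text has a byte representation in charset.
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Encodes and decodes again; unmappable characters come back as the
    // charset's replacement character.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Привет"
        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16LE
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name()
                    + " canRepresent=" + canRepresent(russian, cs)
                    + " lossless=" + russian.equals(roundTrip(russian, cs)));
        }
    }
}
```

If the machine's default charset turns out to be one of the limited ones, text in other scripts would be written as question marks even though the String in memory is intact. A UTF-16 output file can also look wrong for a different reason: the viewer has to guess the byte order, and the guess may depend on whether a byte order mark is present.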