Java - Reading UTF8 bytes from File into String in a system independent way
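How do I accurately read a UTF-8 encoded file into a String in Java?
When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), it works fine from within Eclipse but not from the command line. It seems Eclipse sets the file.encoding parameter when it runs the app.
Why should the encoding of the source file have any impact on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files with different encodings. Once the encoding of a file is known, I must be able to read it into a String regardless of the value of file.encoding.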
The content of the UTF-8 file is as follows:

    English Hello World.
    Korean ?????.
    Japanese 世界こんにちは。
    Russian Привет мир.
    German Hallo Welt.
    Spanish Hola mundo.
    Hindi ???? ???????
    Gujarati ???? ??????.
    Thai ????????????.

-End of file-
The code is below. My observations are in the comments.
    import java.io.BufferedWriter;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class App {
        public static void main(String[] args) {
            String slash = System.getProperty("file.separator");
            File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
            File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
            File outputUtfByteWrittenFile = new File(
                    "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
            outputUtfFile.delete();
            outputUtfByteWrittenFile.delete();

            try {
                /*
                 * read a utf8 text file with internationalized strings into bytes.
                 * there should be no information loss here, when read into raw bytes.
                 * We are sure that this file is UTF-8 encoded.
                 * Input file created using Notepad++. Text copied from Google translate.
                 */
                byte[] fileBytes = readBytes(inputUtfFile);

                /*
                 * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
                 */
                String str = new String(fileBytes, StandardCharsets.UTF_8);

                /*
                 * The console is incapable of displaying this string.
                 * So we write into another file. Open in notepad++ to check.
                 */
                ArrayList<String> list = new ArrayList<>();
                list.add(str);
                writeLines(list, outputUtfFile);

                /*
                 * Works fine when I read bytes and write bytes.
                 * Open the other output file in notepad++ and check.
                 */
                writeBytes(fileBytes, outputUtfByteWrittenFile);

                /*
                 * I am using JDK 8u60.
                 * I tried running this on command line instead of eclipse. Does not work.
                 * I tried using apache commons io library. Does not work.
                 *
                 * This means that new String(bytes, charset); does not work correctly.
                 * There is no real effect of specifying charset to string.
                 */
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void writeLines(List<String> lines, File file) throws IOException {
            BufferedWriter writer = null;
            OutputStreamWriter osw = null;
            OutputStream fos = null;
            try {
                fos = new FileOutputStream(file);
                osw = new OutputStreamWriter(fos);
                writer = new BufferedWriter(osw);
                String lineSeparator = System.getProperty("line.separator");
                for (int i = 0; i < lines.size(); i++) {
                    String line = lines.get(i);
                    writer.write(line);
                    if (i < lines.size() - 1) {
                        writer.write(lineSeparator);
                    }
                }
            } catch (IOException e) {
                throw e;
            } finally {
                close(writer);
                close(osw);
                close(fos);
            }
        }

        public static byte[] readBytes(File file) {
            FileInputStream fis = null;
            byte[] b = null;
            try {
                fis = new FileInputStream(file);
                b = readBytesFromStream(fis);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                close(fis);
            }
            return b;
        }

        public static void writeBytes(byte[] inBytes, File file) {
            FileOutputStream fos = null;
            try {
                fos = new FileOutputStream(file);
                writeBytesToStream(inBytes, fos);
                fos.flush();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                close(fos);
            }
        }

        public static void close(InputStream inStream) {
            try {
                inStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            inStream = null;
        }

        public static void close(OutputStream outStream) {
            try {
                outStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            outStream = null;
        }

        public static void close(Writer writer) {
            if (writer != null) {
                try {
                    writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                writer = null;
            }
        }

        public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
            int bytesread = -1;
            byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
            long count = 0;
            bytesread = readStream.read(b);
            while (bytesread != -1) {
                writeStream.write(b, 0, bytesread);
                count += bytesread;
                bytesread = readStream.read(b);
            }
            return count;
        }

        public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
            ByteArrayOutputStream writeStream = null;
            byte[] byteArr = null;
            writeStream = new ByteArrayOutputStream();
            try {
                copy(readStream, writeStream);
                writeStream.flush();
                byteArr = writeStream.toByteArray();
            } finally {
                close(writeStream);
            }
            return byteArr;
        }

        public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
            ByteArrayInputStream bis = null;
            bis = new ByteArrayInputStream(inBytes);
            try {
                copy(bis, writeStream);
            } finally {
                close(bis);
            }
        }
    }
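EDIT: In response to @JB Nizet and everyone :)

    //writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
    //writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
    writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

I need to specify the byte encoding when reading bytes into a String. I need to specify the byte encoding when writing bytes from a String to a file.
Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?
When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing. It fails for UTF16 too. Why does it fail for some charsets?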
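The Java source file encoding is indeed irrelevant. The reading part of your code is correct (although inefficient). What is incorrect is the writing part:

    osw = new OutputStreamWriter(fos);

should be changed to

    osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you write with the platform default encoding (which does not seem to be UTF-8 on your system) instead of UTF-8.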
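As a rough illustration of reading and writing with an explicit charset on both sides, here is a minimal sketch built on `java.nio.file.Files`. The class name `Utf8RoundTrip` and its two helper methods are made up for this example and are not part of the question's code. The sample text uses `\u` escapes so that the result does not depend on how the .java file itself is saved.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class Utf8RoundTrip {

    /** Reads the whole file and decodes its bytes as UTF-8, whatever file.encoding is. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Writes the lines as UTF-8, separated (not terminated) by the platform line separator. */
    static void writeUtf8(List<String> lines, Path file) throws IOException {
        // The charset is passed explicitly, so the platform default is never consulted.
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(String.join(System.lineSeparator(), lines));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            // "Привет мир." written with escapes to stay independent of the source encoding.
            String text = "Russian \u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440.";
            writeUtf8(Arrays.asList(text), tmp);
            System.out.println("round trip intact: " + text.equals(readUtf8(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The try-with-resources block also replaces the manual `close(...)` helpers, since the writer is closed automatically even when an exception is thrown.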
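Note that Java accepts forward slashes in file paths, even on Windows. You can simply write

    File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");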
EDIT:
Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?
Yes, you are right.
When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.
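If you don't specify any encoding, Java will indeed use the platform default encoding to convert characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.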
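To see which encoding "the platform default" actually is on a given machine, something like the following sketch can help. The class `DefaultCharsetDemo` and its `encode` method are invented for this example. On JDK 8, as used in the question, the default normally follows the OS locale or the `file.encoding` system property; JDK 18 and later default to UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class DefaultCharsetDemo {

    /** Encodes text through an OutputStreamWriter; a null charset means "use the platform default". */
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = (charset == null)
                ? new OutputStreamWriter(out)            // implicit platform default
                : new OutputStreamWriter(out, charset)) { // explicit charset
            w.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // "café", escaped to stay independent of the source encoding
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("default bytes: " + Arrays.toString(encode(text, null)));
        System.out.println("UTF-8 bytes:   " + Arrays.toString(encode(text, StandardCharsets.UTF_8)));
    }
}
```

If the two byte arrays printed at the end differ, the default is not UTF-8, and any writer created without a charset will produce a file that a UTF-8 reader misinterprets.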
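Not every encoding can represent all Unicode characters the way UTF-8 does, however. ASCII, for example, supports only 128 different characters. CP1252, AFAIK, supports at most 256. So the encoding succeeds, but it replaces each unencodable character with a special one (a question mark, `?`, for most charsets), which means: I can't encode this Thai or Russian character because it is not part of the character set I support.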
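The replacement behaviour can be reproduced in a few lines. In this sketch the class `UnencodableDemo` and its methods are invented for illustration, and ISO-8859-1 stands in for CP1252 because it is one of the charsets every JVM is guaranteed to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class UnencodableDemo {

    /** Returns true if every character of text can be represented in the given charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Encodes and decodes with the same charset; unencodable characters come back replaced. */
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) silently substitutes the charset's replacement bytes.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // "Привет"
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": canEncode=" + canEncode(russian, cs)
                    + ", round trip intact=" + russian.equals(roundTrip(russian, cs)));
        }
    }
}
```

Checking `canEncode` before writing is one way to fail loudly instead of ending up with a file full of question marks.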
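The UTF16 encoding should work fine. But make sure your text editor is also configured to use UTF16 when it reads and displays the file content. If it is configured to use another encoding, the displayed content will be incorrect.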
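One detail that may explain why an editor shows some UTF-16 variants correctly and others not: Java's `UTF_16` charset writes a byte-order mark (BOM) when encoding, while `UTF_16LE` and `UTF_16BE` write none, so a reader has to know or guess the byte order. The sketch below prints the raw bytes for each variant; the class `Utf16Demo` and its `hex` helper are invented for this example, and how a particular editor guesses the byte order is outside what this code can show.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf16Demo {

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Hi";
        byte[] withBom = text.getBytes(StandardCharsets.UTF_16);      // big-endian, BOM FE FF first
        byte[] bigEndian = text.getBytes(StandardCharsets.UTF_16BE);  // no BOM
        byte[] littleEndian = text.getBytes(StandardCharsets.UTF_16LE); // no BOM

        System.out.println("UTF-16:   " + hex(withBom));
        System.out.println("UTF-16BE: " + hex(bigEndian));
        System.out.println("UTF-16LE: " + hex(littleEndian));

        // Decoding with the wrong byte order yields different characters, not an error.
        String wrongOrder = new String(littleEndian, StandardCharsets.UTF_16BE);
        System.out.println("LE bytes read as BE give the original text: " + text.equals(wrongOrder));
    }
}
```

Writing with `StandardCharsets.UTF_16` (with BOM) gives a reader the most information about the byte order; the BOM-less variants rely on the reader being configured correctly.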