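How to convert EBCDIC with Chinese characters to UTF-8 format
I need to convert a file that is encoded in EBCDIC, using the IBM937 code page, to UTF-8 so that I can load it into a multibyte-enabled DB2 database.
I have tried Unix recode and iconv. Neither of them is able to convert IBM937 to UTF-8. I am looking for any utility at all (Java, Perl, Unix) that can run on a UNIX-based system. Can anyone help?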
SL
Take a look at ICU (International Components for Unicode): http://site.icu-project.org/
It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/converxp?conv=ibm-937_p110-1999&s=all
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

Here are a few highlights of the services provided by ICU:

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and is the most complete available anywhere.

Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.

Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs are provided.

Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

Bidi: support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.

Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

And much more. Refer to the ICU User Guide for details.
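It appears that Java can convert the IBM937 code page to UTF-8.
You specify the input format as "Cp937".
Here are two methods from the Oracle page on character and byte streams: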
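Whether "Cp937" is actually usable depends on the JVM at hand, so it may be worth checking before running a conversion. The sketch below is only an illustration: the class name `CharsetCheck` and the helper `canonicalNameOrNull` are hypothetical, and it assumes the usual arrangement in which the extended charsets ship with a full JDK (in the `jdk.charsets` module on Java 9 and later) but may be missing from a trimmed-down runtime.

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    /**
     * Returns the canonical name this JVM uses for the given charset name
     * or alias, or null if the JVM does not provide that charset.
     * (Hypothetical helper, not part of any library.)
     */
    static String canonicalNameOrNull(String name) {
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        // Defaults to the name used in this answer; pass another name to test it.
        String name = args.length > 0 ? args[0] : "Cp937";
        String canonical = canonicalNameOrNull(name);
        System.out.println(canonical == null
                ? name + " is not available in this JVM"
                : name + " resolves to " + canonical);
    }
}
```

If the charset is reported as unavailable, the conversion code would fail with an `UnsupportedEncodingException`, and a full JDK or the ICU converter mentioned above would be the places to look next.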
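```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Reads test.txt, decoding its bytes from the IBM937 code page ("Cp937") into a String.
static String readInput() {
    StringBuilder buffer = new StringBuilder();
    // try-with-resources closes the stream and the reader even if decoding fails
    try (FileInputStream fis = new FileInputStream("test.txt");
         Reader in = new BufferedReader(new InputStreamReader(fis, "Cp937"))) {
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
```
and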
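```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Writes the string to test.txt encoded as UTF-8.
// Note: this is the same file name that readInput() reads, so the source file
// is overwritten; use a different output path to keep the original.
static void writeOutput(String str) {
    // try-with-resources flushes and closes the writer even if write() fails
    try (FileOutputStream fos = new FileOutputStream("test.txt");
         Writer out = new OutputStreamWriter(fos, "UTF8")) {
        out.write(str);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```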
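The two methods above hold the whole file in memory and use the same file name for input and output. For larger extracts, a streaming variant may be more practical. The following is a minimal sketch, not a definitive tool: the class name `EbcdicToUtf8` and the method `convert` are hypothetical, and it assumes the JVM provides the "Cp937" charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EbcdicToUtf8 {
    /**
     * Copies the text in 'in' to 'out', decoding with 'source' and
     * encoding as UTF-8, one buffer at a time.
     */
    static void convert(Path in, Path out, Charset source) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, source);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java EbcdicToUtf8 <input> <output>");
            return;
        }
        // Assumption: "Cp937" is available in this JVM (see the question's code page).
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("Cp937"));
    }
}
```

A reader obtained from `Files.newBufferedReader` throws an `IOException` on malformed or unmappable input instead of silently substituting replacement characters, which can be useful for spotting bad data before it is loaded into the database.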