How to convert EBCDIC with Chinese characters to UTF-8

I need to convert a file that is EBCDIC-encoded with the IBM937 code page to UTF-8, so that I can load it into a multibyte-enabled DB2 database.

I have tried Unix recode and iconv. Neither of them can convert IBM937 to UTF-8. I am looking for any utility (Java, Perl, Unix) that can run on a Unix-based system. Can anyone help?

SL


Take a look at ICU (International Components for Unicode): http://site.icu-project.org/

It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/converxp?conv=ibm-937_p110-1999&s=all

ICU is a mature, widely used set of
C/C++ and Java libraries providing
Unicode and Globalization support for
software applications. ICU is widely
portable and gives applications the
same results on all platforms and
between C/C++ and Java software. ICU
is released under a nonrestrictive
open source license that is suitable
for use with both commercial software
and with other open source or free
software.

Here are a few highlights of the
services provided by ICU:

  • Code Page Conversion: Convert text
    data to or from Unicode and nearly any
    other character set or encoding. ICU's
    conversion tables are based on charset
    data collected by IBM over the course
    of many decades, and is the most
    complete available anywhere.

  • Collation: Compare strings according
    to the conventions and standards of a
    particular language, region or
    country. ICU's collation is based on
    the Unicode Collation Algorithm plus
    locale-specific comparison rules from
    the Common Locale Data Repository, a
    comprehensive source for this type of
    data.

  • Formatting: Format numbers, dates,
    times and currency amounts according
    to the conventions of a chosen locale.
    This includes translating month and
    day names into the selected language,
    choosing appropriate abbreviations,
    ordering fields correctly, etc. This
    data also comes from the Common Locale
    Data Repository.

  • Time Calculations: Multiple types of
    calendars are provided beyond the
    traditional Gregorian calendar. A
    thorough set of timezone calculation
    APIs are provided.

  • Unicode Support: ICU closely tracks
    the Unicode standard, providing easy
    access to all of the many Unicode
    character properties, Unicode
    Normalization, Case Folding and other
    fundamental operations as specified by
    the Unicode Standard.

  • Regular Expression: ICU's regular
    expressions fully support Unicode
    while providing very competitive
    performance.

  • Bidi: support for handling text
    containing a mixture of left to right
    (English) and right to left (Arabic or
    Hebrew) data.

  • Text Boundaries: Locate the positions
    of words, sentences, paragraphs within
    a range of text, or identify locations
    that would be suitable for line
    wrapping when displaying the text.

And much more. Refer to the ICU User Guide for details.


It appears that Java can convert the IBM937 code page to UTF-8.

You specify the input encoding as "Cp937".
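Before relying on that name, it may be worth checking that your Java runtime actually provides the charset, since the IBM EBCDIC code pages are typically part of the JDK's extended charset set and a trimmed-down runtime may not include them. The following is a minimal sketch; the class name CharsetCheck and the helper isAvailable are made up for illustration and are not from the Oracle page.

```java
import java.nio.charset.Charset;

class CharsetCheck {
    /** Returns true if this Java runtime can resolve a charset under the given name. */
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported rejects syntactically illegal names with an exception;
            // treat those as "not available" as well.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the name used in this answer; pass another name to check it instead.
        String name = args.length > 0 ? args[0] : "Cp937";
        if (isAvailable(name)) {
            Charset cs = Charset.forName(name);
            System.out.println(name + " resolves to " + cs.name() + " " + cs.aliases());
        } else {
            System.out.println(name + " is not available in this runtime");
        }
    }
}
```

If the charset is reported as unavailable, the ICU converter mentioned in the other answer is an alternative.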

Here are two methods from the Oracle page on character and byte streams:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

static String readInput() {

    StringBuilder buffer = new StringBuilder();
    // Decode the EBCDIC bytes as Cp937; try-with-resources closes the
    // stream even if reading fails.
    try (Reader in = new BufferedReader(new InputStreamReader(
            new FileInputStream("test.txt"), "Cp937"))) {
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

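import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

static void writeOutput(String str) {

    // Write to a different file than the one read above so the EBCDIC
    // source is not overwritten; the output name is only an illustration.
    try (Writer out = new OutputStreamWriter(
            new FileOutputStream("test-utf8.txt"), StandardCharsets.UTF_8)) {
        out.write(str);
    } catch (IOException e) {
        e.printStackTrace();
    }
}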
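
For a large file you may not want to hold the whole content in one String. The two methods above can be combined into a single streaming copy, sketched here under a few assumptions: the class name EbcdicToUtf8, the convert helper, and the command-line arguments are all hypothetical, and the file is treated as plain text in a single code page.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EbcdicToUtf8 {
    /** Copies source to target, decoding with sourceCharset and encoding as UTF-8. */
    static void convert(Path source, Path target, Charset sourceCharset) throws IOException {
        // Files.newBufferedReader reports malformed or unmappable input as an
        // IOException instead of silently substituting replacement characters.
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            // Copy in fixed-size chunks so memory use does not grow with file size.
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java EbcdicToUtf8 <input file> <output file>
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("Cp937"));
    }
}
```

Failing on undecodable bytes is a deliberate choice in this sketch: for a database load it is usually better to stop than to write substituted characters. If you prefer substitution, the InputStreamReader approach shown above replaces bad input instead.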