iconv any encoding to UTF-8
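I'm trying to point iconv at a directory so that every file in it is converted to UTF-8, whatever its current encoding is.
I'm using this script, but you have to tell it which encoding to convert from. How can I make it detect the current encoding automatically?
dir_iconv.sh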
    #!/bin/bash

    ICONVBIN='/usr/bin/iconv' # path to iconv binary

    if [ $# -lt 3 ]
    then
        echo "$0 dir from_charset to_charset"
        exit
    fi

    for f in "$1"/*
    do
        if test -f "$f"
        then
            echo -e "\nConverting $f"
            /bin/mv "$f" "$f.old"
            "$ICONVBIN" -f "$2" -t "$3" "$f.old" > "$f"
        else
            echo -e "\nSkipping $f - not a regular file";
        fi
    done
Terminal line:
    sudo convert/dir_iconv.sh convert/books CURRENT_ENCODING utf8
Maybe you are looking for enca:
Enca is an Extremely Naive Charset Analyser. It detects character set and encoding of text files and can also convert them to other encodings using either a built-in converter or external libraries and tools like libiconv, librecode, or cstocs.
Currently it supports Belarusian, Bulgarian, Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese, and some multibyte encodings independently on language.
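Note that, in general, autodetecting the current encoding is a difficult process (the same byte sequence can be correct text in more than one encoding).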
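To make the ambiguity concrete, here is a small sketch that needs only iconv. The byte chosen (0xF5, written in octal so the printf is portable) is just an illustrative pick: it is valid in both ISO-8859-1 and ISO-8859-2 but stands for a different character in each, so nothing in the byte itself tells a detector which one was meant.

```shell
#!/bin/bash
# The single byte 0xF5 (octal 365) is legal text in both encodings,
# yet iconv decodes it to a different character depending on -f.
as_latin1=$(printf '\365' | iconv -f ISO-8859-1 -t UTF-8)   # U+00F5, o with tilde
as_latin2=$(printf '\365' | iconv -f ISO-8859-2 -t UTF-8)   # U+0151, o with double acute

echo "read as ISO-8859-1: $as_latin1"
echo "read as ISO-8859-2: $as_latin2"
```

Both conversions succeed without any error, which is why a wrong guess by a detector tends to show up as wrong characters rather than as a failed conversion.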
You can get what you need with the standard utilities file and awk.
I use them in a script like this:
    # "file -bi" prints e.g. "text/plain; charset=iso-8859-1"; awk keeps the part after "="
    CHARSET="$(file -bi "$i" | awk -F "=" '{print $2}')"
    if [ "$CHARSET" != utf-8 ]; then
        iconv -f "$CHARSET" -t utf8 "$i" -o outfile
    fi
Putting it all together: go to the directory and create dir2utf8.sh:
    #!/bin/bash
    # converting all files in a dir to utf8

    for f in *
    do
        if test -f "$f"
        then
            echo -e "\nConverting $f"
            CHARSET="$(file -bi "$f" | awk -F "=" '{print $2}')"
            if [ "$CHARSET" != utf-8 ]; then
                # write to a temporary file first: reading and writing
                # the same file in one iconv call can truncate it
                iconv -f "$CHARSET" -t utf8 "$f" -o "$f.tmp" && mv "$f.tmp" "$f"
            fi
        else
            echo -e "\nSkipping $f - not a regular file";
        fi
    done
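One thing worth guarding against with the file-based approach: file can report values such as binary or unknown-8bit, which are not charset names iconv accepts, and us-ascii, which is already valid UTF-8. Below is a minimal sketch of such a guard, assuming your file supports `--mime-encoding`. The helper name `needs_conversion` is made up for illustration, and the loop is a dry run that only prints what it would do.

```shell
#!/bin/bash
# needs_conversion CHARSET
# Succeeds only when CHARSET looks like something worth handing to iconv.
needs_conversion() {
    case "$1" in
        utf-8|us-ascii)         return 1 ;;  # already valid UTF-8
        binary|unknown-8bit|"") return 1 ;;  # no usable text encoding reported
        *)                      return 0 ;;
    esac
}

# Dry run over the current directory: report, do not convert.
for f in *; do
    [ -f "$f" ] || continue
    charset=$(file -b --mime-encoding "$f")
    if needs_conversion "$charset"; then
        echo "$f: would convert from $charset"
    else
        echo "$f: skipped ($charset)"
    fi
done
```

Running the dry run first gives a quick list of files whose detected encoding looks suspicious before anything is rewritten.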
Here is my solution, which converts all files in place:
    #!/bin/bash

    apt-get -y install recode uchardet > /dev/null
    find "$1" -type f | while IFS= read -r FFN # 'dir' should be changed...
    do
        encoding=$(uchardet "$FFN")
        echo "$FFN: $encoding"
        # recode does not know uchardet's "x-mac-..." names
        enc=$(echo "$encoding" | sed 's#^x-mac-#mac#')
        set +x
        recode "$enc..UTF-8" "$FFN"
    done
https://gist.github.com/demofly/25f856a96c29b89baa32
Put it into convert-dir-to-utf8.sh and run:
    bash convert-dir-to-utf8.sh /pat/to/my/trash/dir
Note that the sed call is a workaround for the mac encodings. Many uncommon encodings need workarounds like this.
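If more of these workarounds pile up, it may be easier to keep them in one small function than in a chain of sed calls. A sketch under stated assumptions: the function name `recode_name` is hypothetical, the `x-mac-` branch mirrors the sed expression from the script, and the strings in the first branch are only a guess at what uchardet prints when it cannot decide, so check your version.

```shell
#!/bin/bash
# recode_name DETECTED
# Prints the charset name to hand to recode, or fails if there is none.
recode_name() {
    case "$1" in
        ""|unknown|ascii/unknown)
            return 1 ;;                        # ASSUMPTION: "could not detect" outputs
        x-mac-*)
            printf '%s\n' "mac${1#x-mac-}" ;;  # same effect as sed 's#^x-mac-#mac#'
        *)
            printf '%s\n' "$1" ;;              # pass everything else through unchanged
    esac
}

# Usage inside the loop (not executed here):
#   enc=$(recode_name "$(uchardet "$FFN")") || continue
```

Further mappings can then be added as extra `case` branches next to the mac one.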
Here is my answer... =D
    #!/bin/bash
    find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        echo "Converting ($CHARSET) $LINE_FILE"

        # NOTE: Convert/reconvert to utf8. By Questor
        iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE"

        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
    done

    # [Refs.: https://justrocketscience.com/post/handle-encodings ,
    # https://stackoverflow.com/a/9612232/3223785 ,
    # https://stackoverflow.com/a/13659891/3223785 ]
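FURTHER QUESTION: I do not know whether my approach is the safest one. I say this because I noticed that some files are not converted correctly (characters are lost) or are "truncated". I suspect this has to do with the iconv tool or with the charset information obtained from the uchardet tool. I was curious about the solution proposed at https://stackoverflow.com/a/22841847/3223785 (@demofly), because it may be safer.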
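On the truncation: one possible cause (an assumption, not something confirmed in this thread) is that `iconv -o` is asked to write to the same file it is still reading. The sketch below sidesteps the question by converting into a temporary file and overwriting the original only when iconv reports success. The function name `convert_in_place` is made up for illustration.

```shell
#!/bin/bash
# convert_in_place FILE FROM_CHARSET
# Converts FILE to UTF-8 via a temporary file; FILE is left untouched on failure.
convert_in_place() {
    local file=$1 from=$2 tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        # Copy the bytes back instead of mv, so the original file keeps
        # its owner and permissions.
        cat "$tmp" > "$file"
        rm -f -- "$tmp"
    else
        rm -f -- "$tmp"
        echo "conversion failed, left untouched: $file" >&2
        return 1
    fi
}

# Usage inside the loop (not executed here):
#   convert_in_place "$LINE_FILE" "$CHARSET"
```

A wrong guess by the detector can still produce wrong characters, but a conversion that iconv itself rejects no longer damages the file.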
Another answer, now based on @demofly's answer...
    #!/bin/bash
    find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""

        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            iconv -f "$CHARSET" -t utf8 "$LINE_FILE" -o "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP
            STDERR_OP=$(cat STDERR_OP)
            rm -f STDERR_OP
        fi

        # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
        # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
        # https://stackoverflow.com/a/45240995/3223785 ]
        sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"

        if [ -n "$STDERR_OP" ] ; then
            echo "ERROR: \"$STDERR_OP\""
        fi
        STDOUT_OP=$(cat STDOUT_OP)
        rm -f STDOUT_OP
        if [ -n "$STDOUT_OP" ] ; then
            echo "RESULT: \"$STDOUT_OP\""
        fi
    done

    # [Refs.: https://justrocketscience.com/post/handle-encodings ,
    # https://stackoverflow.com/a/9612232/3223785 ,
    # https://stackoverflow.com/a/13659891/3223785 ]
Hybrid solution with recode and vim...
    #!/bin/bash
    find <YOUR_FOLDER_PATH> -name '*' -type f -exec grep -Iq . {} \; -print0 |
    while IFS= read -r -d $'\0' LINE_FILE; do
        CHARSET=$(uchardet "$LINE_FILE")
        REENCSED=$(echo "$CHARSET" | sed 's#^x-mac-#mac#')
        echo "\"$CHARSET\" \"$LINE_FILE\""

        # NOTE: Convert/reconvert to utf8. By Questor
        recode "$REENCSED..UTF-8" "$LINE_FILE" 2> STDERR_OP 1> STDOUT_OP

        STDERR_OP=$(cat STDERR_OP)
        rm -f STDERR_OP
        if [ -n "$STDERR_OP" ] ; then
            # NOTE: Convert/reconvert to utf8. By Questor
            bash -c "</dev/tty vim -u NONE +\"set binary | set noeol | set nobomb | set encoding=utf-8 | set fileencoding=utf-8 | wq\" \"$LINE_FILE\""
        else
            # NOTE: Remove "BOM" if exists as it is unnecessary. By Questor
            # [Refs.: https://stackoverflow.com/a/2223926/3223785 ,
            # https://stackoverflow.com/a/45240995/3223785 ]
            sed -i '1s/^\xEF\xBB\xBF//' "$LINE_FILE"
        fi
    done
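NOTE: This was the solution with the highest number of perfect conversions. Additionally, we did not get any truncated files.
WARNING: Back up your files and use a merge tool to check/compare the changes. Problems may appear!
TIP: After a preliminary comparison with a merge tool following the conversion, you can run the command
NOTE: The search with find picks up all non-binary files in <YOUR_FOLDER_PATH> and its subfolders.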
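To see which files that find expression actually selects before converting anything, the filter can be run on its own. A small sketch, assuming GNU grep, where `-I` makes a file containing NUL bytes count as non-matching. The function name `list_text_files` is made up for illustration.

```shell
#!/bin/bash
# list_text_files DIR
# Prints every regular file under DIR that grep does not consider binary
# and that contains at least one character (the "." pattern); empty files
# are therefore skipped as well.
list_text_files() {
    find "$1" -type f -exec grep -Iq . {} \; -print
}

# Usage (not executed here):
#   list_text_files /path/to/folder
```

This uses newline-separated output for readability; the scripts above use `-print0` with `read -d $'\0'` instead, which is the safer choice when file names may contain newlines.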
Thanks!
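For Simplified Chinese text files encoded in GB2312, the enca command does not work.
Instead, I use the following function to convert the text file for me. You can of course redirect the output to a file.
It requires the chardet and iconv commands.
    detection_cat () {
        DET_OUT=$(chardet "$1");
        # keep only the encoding name between ": " and " (confid..."
        ENC=$(echo "$DET_OUT" | sed "s|^.*: \(.*\) (confid.*$|\1|");
        iconv -f "$ENC" "$1"
    }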
See the tools available for data conversion on the Linux CLI: https://www.debian.org/doc/manuals/debian-reference/ch11.en.html
In addition, there is the task of finding out in