Why does Java allow control characters in its identifiers?
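In exploring precisely which characters are permitted in Java identifiers, I stumbled upon something so curious that it seems almost certain to be a bug.
I had expected Java identifiers to conform to the requirement that they start with a character having the Unicode property ID_Start, followed by characters having the property ID_Continue, with exceptions granted for leading underscores and dollar signs. That proved not to be the case, and what I found is at extreme variance with that or with any other notion of a normal identifier I have heard of.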
Short Demo
Consider the following demonstration, which proves that the ASCII ESC character (octal 033) is permitted in Java identifiers:
```
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
```
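It is even worse than that, though. Almost infinitely worse, in fact. Even NULs are permitted! So are thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.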
Long Demo
Here is a test program that shows all of the unexpected code points that Java allows as part of a legal identifier name.
```perl
#!/usr/bin/env perl
#
# test-java-idchars - find which bogus code points Java allows in its identifiers
#
#   usage: test-java-idchars [low high]
#    e.g.: test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) {
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) {
            $_ = oct if /^0/;
        }
    }
    else {
        die "usage: $0 [ [start] stop ]\n";
    }
}

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    # skip whitespace and ordinary word characters
    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    }

    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;

    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";

    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

    print TESTPROGRAM <<"End_of_Java_Program";

public class testchar {
    public static void main(String argv[]) {
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char);
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    # compile and run the generated class; $? is 0 only if both succeed
    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    }

    printf "is %s in Java identifiers.\n",
        ($? == 0) ? uc "legal" : "forbidden";
}

END {
    print "\nLegal but evil code points: @legal\n";
}
```
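Here is an example of running that program on the first 33 code points, reporting only on those that are neither whitespace nor identifier characters: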
```
$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.

Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B
```
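As a cross-check that does not involve the compiler, the same range can be probed through `java.lang.Character`. This is only a sketch: the class name `IdentifierPartScan` and its helper `identifierParts` are made up for illustration, and it assumes that javac's notion of an identifier character matches `Character.isJavaIdentifierPart(int)`, which the test script by itself does not establish.

```java
import java.util.ArrayList;
import java.util.List;

public class IdentifierPartScan {
    /** Returns every code point in [low, high] that Character.isJavaIdentifierPart accepts. */
    static List<Integer> identifierParts(int low, int high) {
        List<Integer> accepted = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isJavaIdentifierPart(cp)) {
                accepted.add(cp);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Same range as the run of test-java-idchars 0 0x20
        for (int cp : identifierParts(0x00, 0x20)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

If the assumption holds, the code points printed for 0x00 through 0x20 should line up with the "Legal but evil" list from the Perl run.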
Here is another demo:
```
$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD
```
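The Question
Can anyone explain this seemingly insane behavior? There are many, many, many other inexplicably permitted code points all over the place, starting with U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns emerge, such as allowing ...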
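Section 3.8 of the Java Language Specification defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts any character for which Character.isIdentifierIgnorable() is true, and that method allows the non-whitespace control characters (including the whole C1 range; see the link for the list).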
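To see how these three predicates interact, here is a minimal sketch that prints their verdict for a handful of code points. The class name `IdentifierRules`, the helper `classify`, and the choice of sample characters are illustrative only; the `Character` methods are the standard ones named above.

```java
public class IdentifierRules {
    /** Describes how the three Character predicates classify one code point. */
    static String classify(int cp) {
        return String.format("U+%04X start=%b part=%b ignorable=%b",
                cp,
                Character.isJavaIdentifierStart(cp),
                Character.isJavaIdentifierPart(cp),
                Character.isIdentifierIgnorable(cp));
    }

    public static void main(String[] args) {
        // NUL, ESC, VT, FS, then a few ordinary identifier characters
        int[] samples = { 0x0000, 0x001B, 0x000B, 0x001C, '$', '_', 'a', '1' };
        for (int cp : samples) {
            System.out.println(classify(cp));
        }
    }
}
```

ESC (U+001B) comes out as ignorable and therefore as a valid identifier part but not a valid start, whereas U+000B and U+001C are not ignorable and are rejected, which is consistent with the LEGAL and forbidden lines in the question's output.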
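Another question might be: why shouldn't Java allow control characters in its identifiers?
A good principle when designing a language or any other system is not to forbid anything without a good reason, because you never know how it might be used, and the fewer rules that implementers and users have to contend with, the better.
It is true that you should not take advantage of this by actually embedding escapes in your variable names, and you will not see any popular libraries exposing classes with null characters in their names.
Certainly this could be abused, but it is not the language designer's job to protect programmers from themselves in this way, any more than it is to force proper indentation or well-chosen variable names.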
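I don't see what the big deal is. How does this affect you in any way?
If a developer wants to obfuscate his code, he can do it with plain ASCII.
If a developer wants his code to be understandable, he will use the lingua franca of the industry: English. His identifiers will not only be ASCII-only but also drawn from common English words. Otherwise, no one is going to use or read his code, and he can use whatever crazy characters he likes.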