In what JS engines, specifically, are toLowerCase & toUpperCase locale-sensitive?
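Some libraries (AngularJS, for example; the link points to a specific line in the code) use custom case-conversion functions instead of the standard ones. The stated justification is that the standard functions do not work as expected in browsers running under a Turkish locale:

    console.log("SCRIPT".toLowerCase()); // "scrıpt"
    console.log("script".toUpperCase()); // "SCRİPT"

But is this true, or was it ever? Do browsers really behave this way? If so, which ones? What about node.js? Other JS engines?

Specifically, which browsers is the Angular team keeping this code for?

If your browser (device) uses a Turkish or Azerbaijani locale, please run this snippet and let me know if the problem really exists.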
    if ('i' !== 'I'.toLowerCase()) {
      document.write('Ooops! toLowerCase is locale-sensitive in your browser. ' +
        'Please write your user-agent in the comments to this question: ' +
        navigator.userAgent);
    } else {
      document.write('toLowerCase isn\'t locale-sensitive in your browser. ' +
        'Everything works as expected!');
    }

    <html lang="tr">
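
For illustration, a custom conversion of the kind mentioned above could be restricted to ASCII so that it never depends on a locale. This is only a minimal sketch: the names `asciiLowercase` and `asciiUppercase` are hypothetical and are not taken from AngularJS or any other library.

```javascript
// Hypothetical helpers: ASCII-only case conversion that never consults a locale.
function asciiLowercase(s) {
  // Only A-Z are touched; setting bit 0x20 maps each to its lowercase letter.
  return s.replace(/[A-Z]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) | 32));
}

function asciiUppercase(s) {
  // Only a-z are touched; clearing bit 0x20 maps each to its uppercase letter.
  return s.replace(/[a-z]/g, (ch) => String.fromCharCode(ch.charCodeAt(0) & ~32));
}

console.log(asciiLowercase("SCRIPT")); // "script"
console.log(asciiUppercase("script")); // "SCRIPT"
```

The trade-off is that non-ASCII letters are left untouched, so such helpers only suit identifiers, tag names and similar ASCII data.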
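Any JS implementation that follows the ECMA-262 5.1 standard must implement String.prototype.toLocaleLowerCase and String.prototype.toLocaleUpperCase. According to the standard, toLocaleLowerCase is meant to yield the correct result for the host environment's current locale, whereas toLowerCase yields a locale-independent result derived from the Unicode case mappings. For most languages, the two give the same result.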
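The library or framework you use (jQuery, Angular, Node, etc.) makes no difference. What determines the behavior is the JS implementation that runs those libraries.

For all practical purposes, it is accurate to conclude that Node, Angular, or any other JS library or framework behaves exactly the same when handling strings (as long as it runs on a JS engine that implements ECMA-262 3rd edition or later). That said, I am sure many frameworks extend the String object with extra functionality, but the basic properties and functions defined by ECMA-262 5.1 are always present and behave exactly the same.

For more information: http://www.ecma-international.org/ecma-262/5.1/sec-15.5.4.17

As far as browsers go, all modern browsers implement the ECMA-262 5.1 standard in their JS engines. I am not sure about Node, but from my limited exposure to it, I believe it also uses a JS engine implemented according to ECMA-262 5.1.

Note: I was not able to test this!

According to the ECMAScript specification: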
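
The difference between the two families of methods can be sketched as follows. Passing an explicit locale tag to `toLocaleLowerCase` is an assumption on my part: it relies on the engine supporting the ECMA-402 internationalization API, which is not part of ECMA-262 5.1. The function name `compareLowercase` is hypothetical.

```javascript
// Compare the locale-independent method with the locale-aware one.
// Assumption: the engine accepts a locale tag argument (ECMA-402 / Intl support).
function compareLowercase(s) {
  return {
    plain: s.toLowerCase(),                 // locale-independent
    turkish: s.toLocaleLowerCase("tr"),     // Turkish rules requested explicitly
    english: s.toLocaleLowerCase("en-US"),  // English rules requested explicitly
  };
}

console.log(compareLowercase("SCRIPT"));
```

In an engine with Intl support, the `turkish` field is expected to contain a dotless ı, while `plain` and `english` should both be "script". An engine that ignores the argument would return the same string for all three.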
String.prototype.toLowerCase ( )
[...]
For the purposes of this operation, the 16-bit code units of the
Strings are treated as code points in the Unicode Basic Multilingual
Plane. Surrogate code points are directly transferred from S to L
without any mapping.

The result must be derived according to the case mappings in the
Unicode character database (this explicitly includes not only the
UnicodeData.txt file, but also the SpecialCasings.txt file that
accompanies it in Unicode 2.1.8 and later).

[...]
String.prototype.toLocaleLowerCase ( )
This function works exactly the same as toLowerCase except that its
result is intended to yield the correct result for the host
environment’s current locale, rather than a locale-independent result.
There will only be a difference in the few cases (such as Turkish)
where the rules for that language conflict with the regular Unicode
case mappings.

[...]
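
As a small check of the first quoted paragraph, the sketch below exercises two locale-independent mappings that involve special casing. The sharp s example is my own addition and is not part of the quoted text; the helper name `hexCodes` is hypothetical.

```javascript
// Locale-independent mappings that involve special casing.
const lowerIDot = "\u0130".toLowerCase();   // LATIN CAPITAL LETTER I WITH DOT ABOVE
const upperSharpS = "\u00DF".toUpperCase(); // LATIN SMALL LETTER SHARP S

// Print hex code points so the combining dot above (U+0307) is visible.
const hexCodes = (s) =>
  Array.from(s, (c) => c.charCodeAt(0).toString(16).padStart(4, "0")).join(" ");

console.log(hexCodes(lowerIDot)); // 0069 0307
console.log(upperSharpS);         // SS
```

Both results are longer than their one-character inputs, which a plain one-to-one mapping table could not produce.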
According to the Unicode Character Database Special Casing file:
[...]
Format
The entries in this file are in the following machine-readable format:
<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>

Unconditional mappings
[…]
Preserve canonical equivalence for I with dot. Turkic is handled
below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

[…]
Language-Sensitive Mappings
These are characters whose full case mappings depend on language and perhaps also
context (which characters come before or after). For more information
see the header of this file and the Unicode Standard.

Lithuanian

Lithuanian retains the dot in a lowercase i when followed by accents.

Remove DOT ABOVE after "i" with upper or titlecase
0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

Introduce an explicit dot above when lowercasing capital I's and J's
whenever there are more accents above.
(of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)
0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri

I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
The following rules handle those cases.
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

When uppercasing, i turns into a dotted capital I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

Note: the following case is already in the UnicodeData.txt file.
0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

EOF
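
To make the row format above easier to read, here is a minimal sketch of a parser for a single data row. The function name `parseSpecialCasingRow` and the shape of the returned object are hypothetical choices for illustration; the sketch assumes rows shaped like the ones quoted above and does no validation.

```javascript
// Parse one row: code; lower; title; upper; (condition_list;)? # comment
function parseSpecialCasingRow(row) {
  const [data, comment = ""] = row.split("#");
  const fields = data.split(";").map((f) => f.trim());
  // The terminating ";" leaves one empty trailing field; drop it.
  if (fields[fields.length - 1] === "") fields.pop();

  // Turn a space-separated list of hex code points into a string ("" stays empty).
  const decode = (hexList) =>
    hexList === ""
      ? ""
      : String.fromCodePoint(...hexList.split(/\s+/).map((h) => parseInt(h, 16)));

  const [code, lower, title, upper, conditions = ""] = fields;
  return {
    char: decode(code),
    lower: decode(lower),
    title: decode(title),
    upper: decode(upper),
    // The first condition may be a language tag such as "tr", "az" or "lt".
    conditions: conditions === "" ? [] : conditions.split(/\s+/),
    comment: comment.trim(),
  };
}

const turkishI = parseSpecialCasingRow(
  "0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I"
);
console.log(turkishI.lower, turkishI.conditions); // ı [ 'tr', 'Not_Before_Dot' ]
```

Rows with an empty condition list, such as the unconditional mapping for U+0130, come back with `conditions` as an empty array.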
In addition, according to JavaScript for Absolute Beginners (by Terry McNavage):

    > "I".toLowerCase()       // "i"
    > "i".toUpperCase()       // "I"
    > "I".toLocaleLowerCase() // "<dotless-i>"
    > "i".toLocaleUpperCase() // "<dotted-I>"

Note: toLocaleLowerCase() and toLocaleUpperCase() convert case based on your OS settings. You'd have to change those settings to Turkish for the previous sample to work. Or just take my word for it!

According to bobince's comment on the "Convert JavaScript String to be all lower case?" question:

Accept-Language and navigator.language are two completely separate
settings. Accept-Language reflects the user's chosen preferences for
what languages they want to receive in web pages (and this setting is
unfortunately inaccessible to JS). navigator.language merely reflects
which localisation of the web browser was installed, and should
generally not be used for anything. Both of these values are unrelated
to the system locale, which is the bit that decides what
toLocaleLowerCase() will do; that's an OS-level setting out of scope
of the browser's prefs.

Therefore, setting lang="tr-TR" on the html element does not reflect a real test case, because reproducing the special-casing example requires an OS-level setting.

I think that when using toLowerCase() or toUpperCase(), only lowercasing dotted-I or uppercasing dotless-i is locale-specific. Based on those credible/official sources, I think you're right: 'i' !== 'I'.toLowerCase() always evaluates to false. But, as I said, I couldn't test it here.
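
For anyone who can switch their system to a Turkish locale, a small report along these lines may help to check the claim. It is a sketch under the assumption that the engine exposes the `Intl` API; the function name `describeDefaultLocale` is hypothetical, and how the default locale is derived from OS settings varies between engines.

```javascript
// Report the engine's default locale and what the case methods do under it.
function describeDefaultLocale() {
  return {
    // Assumption: Intl is available; this is the locale tag the engine resolved as default.
    defaultLocale: new Intl.DateTimeFormat().resolvedOptions().locale,
    toLowerCase: "I".toLowerCase(),             // expected to be locale-independent
    toLocaleLowerCase: "I".toLocaleLowerCase(), // may depend on the default locale
    questionCheckFires: "i" !== "I".toLowerCase(),
  };
}

console.log(describeDefaultLocale());
```

If the reasoning above holds, `questionCheckFires` stays false even when `defaultLocale` is Turkish, and only the `toLocaleLowerCase` field changes.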