com.google.javascript.jscomp.regex
Class CaseCanonicalize

java.lang.Object
  extended by com.google.javascript.jscomp.regex.CaseCanonicalize

public final class CaseCanonicalize
extends Object

Implements the ECMAScript 5 Canonicalize operation, which specifies how case-insensitive regular expressions match.

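From section 15.10.2.8,

The abstract operation Canonicalize takes a character parameter ch and performs the following steps:

1. If IgnoreCase is false, return ch.
2. Let cu be ch with all its characters converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
3. If cu does not consist of a single character, return ch.
4. Let cu be cu's character.
5. If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
6. Return cu.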
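
As a rough illustration of what Canonicalize does, the following Java sketch upper-cases a single code unit and keeps the original whenever the result is not exactly one code unit or a non-ASCII code unit would map into the ASCII range. The class and method names are hypothetical and are not taken from the Closure Compiler source. The sketch assumes IgnoreCase is true, and it relies on the running JVM's Unicode tables rather than Unicode 3.0.0, so a few code units may canonicalize differently than in the table used by this class.

```java
import java.util.Locale;

/** Hypothetical sketch of the ES5 Canonicalize steps; not the Closure Compiler source. */
final class CanonicalizeSketch {
  /** Canonicalizes one UTF-16 code unit, assuming IgnoreCase is true. */
  static char canonicalize(char ch) {
    // Upper-case the one-character string. Locale.ROOT avoids locale-specific
    // mappings such as the Turkish dotted capital I.
    String upper = String.valueOf(ch).toUpperCase(Locale.ROOT);
    // Results that are not a single code unit (for example "SS") are rejected.
    if (upper.length() != 1) {
      return ch;
    }
    char cu = upper.charAt(0);
    // A non-ASCII code unit never canonicalizes into the ASCII range.
    if (ch >= 128 && cu < 128) {
      return ch;
    }
    return cu;
  }

  public static void main(String[] args) {
    System.out.println(canonicalize('a'));                  // A
    System.out.println(canonicalize('\u00DF') == '\u00DF'); // true: sharp s upper-cases to "SS"
    System.out.println(canonicalize('\u017F') == '\u017F'); // true: long s would map to ASCII 'S'
  }
}
```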


Field Summary
static com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
          Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8.
 
Method Summary
static char caseCanonicalize(char ch)
          Returns the case canonical version of the given code unit.
static String caseCanonicalize(String s)
          Returns the case canonical version of the given string.
static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
          Given a character range that may include case-sensitive code units, such as [0-9B-M], returns the character range that includes every code unit in the input plus every code unit that is case-insensitively equivalent to one in the input.
static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
          Given a character range that may include case-sensitive code units, such as [0-9B-M], returns the minimal character range such that every code unit in the input has a case-insensitively equivalent canonical code unit in the output.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CASE_SENSITIVE

public static final com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8. A code unit is case-sensitive if it canonicalizes to a code unit other than itself, or if some other code unit canonicalizes to it. Canonicalize is defined in terms of String.prototype.toUpperCase, which is itself based on Unicode 3.0.0 as specified in UnicodeData-3.0.0 and SpecialCasing-2.txt.

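This table was generated by running the following in Chrome:

 for (var cc = 0; cc < 0x10000; ++cc) {
   var ch = String.fromCharCode(cc);
   var u = ch.toUpperCase();
   // Skip code units that are unchanged or whose upper-case form is not a single code unit.
   if (ch != u && u.length === 1) {
     var cu = u.charCodeAt(0);
     // Canonicalize returns ch unchanged when ch >= 128 and cu < 128, so skip those pairs.
     if (cc < 128 || cu >= 128) {
       console.log('0x' + cc.toString(16) + ', 0x' + cu.toString(16) + ',');
     }
   }
 }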
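
The definition of a case-sensitive code unit given for this field can be sketched in Java as a scan over all 65,536 code units. This is an illustrative sketch only: the class name, the helper canonicalize, and the use of java.util.BitSet in place of CharRanges are assumptions, and the JVM's Unicode tables are newer than Unicode 3.0.0, so the resulting set is not guaranteed to equal CASE_SENSITIVE.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch; BitSet stands in for CharRanges. */
final class CaseSensitiveSetSketch {
  /** ES5-style Canonicalize for one code unit, assuming IgnoreCase is true. */
  static char canonicalize(char ch) {
    String upper = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (upper.length() != 1) {
      return ch;
    }
    char cu = upper.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Collects every code unit that changes under canonicalize, plus the code unit it maps to. */
  static BitSet caseSensitiveUnits() {
    BitSet set = new BitSet(0x10000);
    for (int cc = 0; cc < 0x10000; ++cc) {
      char ch = (char) cc;
      char cu = canonicalize(ch);
      if (cu != ch) {
        set.set(ch); // canonicalizes to something other than itself
        set.set(cu); // has another code unit that canonicalizes to it
      }
    }
    return set;
  }
}
```

Under these assumptions, both 'a' and 'A' are members of the set, while a digit such as '0' is not.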
 

Method Detail

caseCanonicalize

public static String caseCanonicalize(String s)
Returns the case canonical version of the given string.


caseCanonicalize

public static char caseCanonicalize(char ch)
Returns the case canonical version of the given code unit. ECMAScript 5 explicitly says that code units are to be treated as their code-point equivalents, even surrogates.
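
To illustrate the point about surrogates, here is a sketch of string canonicalization applied one code unit at a time. The names are hypothetical and the helper relies on the JVM's own upper-casing rather than this class's table. Because each surrogate is handled on its own, a supplementary character such as U+10428 is left unchanged even though upper-casing the whole string by code point would alter it.

```java
import java.util.Locale;

/** Hypothetical sketch of per-code-unit string canonicalization. */
final class StringCanonicalizeSketch {
  /** ES5-style Canonicalize for one code unit, assuming IgnoreCase is true. */
  static char canonicalize(char ch) {
    String upper = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (upper.length() != 1) {
      return ch;
    }
    char cu = upper.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Canonicalizes each code unit independently; surrogate pairs are never combined. */
  static String canonicalize(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ++i) {
      sb.append(canonicalize(s.charAt(i)));
    }
    return sb.toString();
  }
}
```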


expandToAllMatched

public static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case-sensitive code units, such as [0-9B-M], returns the character range that includes every code unit in the input plus every code unit that is case-insensitively equivalent to one in the input.
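
A minimal sketch of this expansion, assuming a java.util.BitSet stands in for CharRanges (whose API is not shown here) and that two code units are equivalent when they canonicalize to the same code unit. The class and helper names are hypothetical, and the brute-force scan is chosen for clarity rather than speed.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch; BitSet stands in for CharRanges. */
final class ExpandSketch {
  /** ES5-style Canonicalize for one code unit, assuming IgnoreCase is true. */
  static char canonicalize(char ch) {
    String upper = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (upper.length() != 1) {
      return ch;
    }
    char cu = upper.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Returns the input plus every code unit sharing a canonical form with an input member. */
  static BitSet expandToAllMatched(BitSet input) {
    // Canonical forms of everything in the input.
    BitSet canonical = new BitSet(0x10000);
    for (int i = input.nextSetBit(0); i >= 0; i = input.nextSetBit(i + 1)) {
      canonical.set(canonicalize((char) i));
    }
    // Add every code unit whose canonical form appears in that set.
    BitSet out = (BitSet) input.clone();
    for (int cc = 0; cc < 0x10000; ++cc) {
      if (canonical.get(canonicalize((char) cc))) {
        out.set(cc);
      }
    }
    return out;
  }
}
```

With the input [0-9B-M], this sketch yields the digits, B through M, and b through m.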


reduceToMinimum

public static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case-sensitive code units, such as [0-9B-M], returns the minimal character range such that every code unit in the input has a case-insensitively equivalent canonical code unit in the output.
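
One way to read this contract is that the output is simply the set of canonical forms of the input. The sketch below follows that reading; it is an assumption, not the documented implementation, and it uses a java.util.BitSet with hypothetical names in place of CharRanges.

```java
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical sketch; BitSet stands in for CharRanges. */
final class ReduceSketch {
  /** ES5-style Canonicalize for one code unit, assuming IgnoreCase is true. */
  static char canonicalize(char ch) {
    String upper = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (upper.length() != 1) {
      return ch;
    }
    char cu = upper.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Maps every input code unit to its canonical form and drops the duplicates. */
  static BitSet reduceToMinimum(BitSet input) {
    BitSet out = new BitSet(0x10000);
    for (int i = input.nextSetBit(0); i >= 0; i = input.nextSetBit(i + 1)) {
      out.set(canonicalize((char) i));
    }
    return out;
  }
}
```

For example, an input containing a, A, b, and 7 reduces to A, B, and 7 under this reading.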