public abstract class CharMatcher extends java.lang.Object implements Predicate<java.lang.Character>
Determines a true or false value for any char value, just as Predicate does for any Object. Also offers basic text processing methods based on this function. Implementations are strongly encouraged to be side-effect-free and immutable.
Throughout the documentation of this class, the phrase "matching character" is used to mean "any char value c for which this.matches(c) returns true".
Warning: This class deals only with char values, that is, BMP characters. It does not understand supplementary Unicode code points in the range 0x10000 to 0x10FFFF, which includes the majority of assigned characters, including important CJK characters and emoji.
Supplementary characters are encoded into a String using surrogate pairs, and a CharMatcher treats these just as two separate characters. countIn(java.lang.CharSequence) counts each supplementary character as 2 chars.
For up-to-date Unicode character properties (digit, letter, etc.) and support for supplementary code points, use ICU4J UCharacter and UnicodeSet (freeze() after building). For basic text processing based on UnicodeSet use the ICU4J UnicodeSetSpanner.
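The warning above can be illustrated with the JDK alone. The following is a minimal sketch, not Guava code: the class and method names are hypothetical, and the "non-ASCII" test merely stands in for an arbitrary matcher. It contrasts counting per char (UTF-16 code unit), as a char-based matcher does, with counting per code point.

```java
final class SupplementaryCountDemo {
    // Counts non-ASCII chars one UTF-16 code unit at a time,
    // which is how a char-based matcher sees the input.
    static int countNonAsciiChars(CharSequence sequence) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            if (sequence.charAt(i) >= 128) {
                count++;
            }
        }
        return count;
    }

    // Counts non-ASCII code points, so a surrogate pair counts once.
    static int countNonAsciiCodePoints(String s) {
        return (int) s.codePoints().filter(cp -> cp >= 128).count();
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one supplementary code point, two chars), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                 // 4
        System.out.println(countNonAsciiChars(s));      // 2
        System.out.println(countNonAsciiCodePoints(s)); // 1
    }
}
```

The per-char count reports 2 for a single emoji because each half of the surrogate pair is examined separately.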
Example usages:
String trimmed = whitespace().trimFrom(userInput);
if (ascii().matchesAllOf(s)) { ... }
See the Guava User Guide article on CharMatcher.
Modifier and Type | Class and Description |
---|---|
private static class | CharMatcher.And: Implementation of and(CharMatcher). |
private static class | CharMatcher.Any: Implementation of any(). |
private static class | CharMatcher.AnyOf: Implementation of anyOf(CharSequence) for three or more characters. |
private static class | CharMatcher.Ascii: Implementation of ascii(). |
private static class | CharMatcher.BitSetMatcher: Fast matcher using a BitSet table of matching characters. |
private static class | CharMatcher.BreakingWhitespace: Implementation of breakingWhitespace(). |
private static class | CharMatcher.Digit: Implementation of digit(). |
(package private) static class | CharMatcher.FastMatcher: A matcher for which precomputation will not yield any significant benefit. |
private static class | CharMatcher.ForPredicate: Implementation of forPredicate(Predicate). |
private static class | CharMatcher.InRange: Implementation of inRange(char, char). |
private static class | CharMatcher.Invisible: Implementation of invisible(). |
private static class | CharMatcher.Is: Implementation of is(char). |
private static class | CharMatcher.IsEither: Implementation of anyOf(CharSequence) for exactly two characters. |
private static class | CharMatcher.IsNot: Implementation of isNot(char). |
private static class | CharMatcher.JavaDigit: Implementation of javaDigit(). |
private static class | CharMatcher.JavaIsoControl: Implementation of javaIsoControl(). |
private static class | CharMatcher.JavaLetter: Implementation of javaLetter(). |
private static class | CharMatcher.JavaLetterOrDigit: Implementation of javaLetterOrDigit(). |
private static class | CharMatcher.JavaLowerCase: Implementation of javaLowerCase(). |
private static class | CharMatcher.JavaUpperCase: Implementation of javaUpperCase(). |
(package private) static class | CharMatcher.NamedFastMatcher: CharMatcher.FastMatcher which overrides toString() with a custom name. |
private static class | CharMatcher.Negated: Implementation of CharMatcher.Negated.negate(). |
(package private) static class | CharMatcher.NegatedFastMatcher: Negation of a CharMatcher.FastMatcher. |
private static class | CharMatcher.None: Implementation of none(). |
private static class | CharMatcher.Or: Implementation of or(CharMatcher). |
private static class | CharMatcher.RangesMatcher: Implementation that matches characters that fall within multiple ranges. |
private static class | CharMatcher.SingleWidth: Implementation of singleWidth(). |
(package private) static class | CharMatcher.Whitespace: Implementation of whitespace(). |
Modifier and Type | Field and Description |
---|---|
private static int |
DISTINCT_CHARS |
Modifier | Constructor and Description |
---|---|
protected | CharMatcher(): Constructor for use by subclasses. |
Modifier and Type | Method and Description |
---|---|
CharMatcher | and(CharMatcher other): Returns a matcher that matches any character matched by both this matcher and other. |
static CharMatcher | any(): Matches any character. |
static CharMatcher | anyOf(java.lang.CharSequence sequence): Returns a char matcher that matches any BMP character present in the given character sequence. |
boolean | apply(java.lang.Character character): Deprecated. Provided only to satisfy the Predicate interface; use matches(char) instead. |
static CharMatcher | ascii(): Determines whether a character is ASCII, meaning that its code point is less than 128. |
static CharMatcher | breakingWhitespace(): Determines whether a character is a breaking whitespace (that is, a whitespace which can be interpreted as a break between words for formatting purposes). |
java.lang.String | collapseFrom(java.lang.CharSequence sequence, char replacement): Returns a string copy of the input character sequence, with each group of consecutive matching BMP characters replaced by a single replacement character. |
int | countIn(java.lang.CharSequence sequence): Returns the number of matching chars found in a character sequence. |
static CharMatcher | digit(): Deprecated. Many digits are supplementary characters; see the class documentation. |
private java.lang.String | finishCollapseFrom(java.lang.CharSequence sequence, int start, int end, char replacement, java.lang.StringBuilder builder, boolean inMatchingGroup) |
static CharMatcher | forPredicate(Predicate<? super java.lang.Character> predicate): Returns a matcher with identical behavior to the given Character-based predicate, but which operates on primitive char instances instead. |
int | indexIn(java.lang.CharSequence sequence): Returns the index of the first matching BMP character in a character sequence, or -1 if no matching character is present. |
int | indexIn(java.lang.CharSequence sequence, int start): Returns the index of the first matching BMP character in a character sequence, starting from a given position, or -1 if no character matches after that position. |
static CharMatcher | inRange(char startInclusive, char endInclusive): Returns a char matcher that matches any character in a given BMP range (both endpoints are inclusive). |
static CharMatcher | invisible(): Deprecated. Most invisible characters are supplementary characters; see the class documentation. |
static CharMatcher | is(char match): Returns a char matcher that matches only one specified BMP character. |
private static CharMatcher.IsEither | isEither(char c1, char c2) |
static CharMatcher | isNot(char match): Returns a char matcher that matches any character except the BMP character specified. |
private static boolean | isSmall(int totalCharacters, int tableLength) |
static CharMatcher | javaDigit(): Deprecated. Many digits are supplementary characters; see the class documentation. |
static CharMatcher | javaIsoControl(): Determines whether a character is an ISO control character as specified by Character.isISOControl(char). |
static CharMatcher | javaLetter(): Deprecated. Most letters are supplementary characters; see the class documentation. |
static CharMatcher | javaLetterOrDigit(): Deprecated. Most letters and digits are supplementary characters; see the class documentation. |
static CharMatcher | javaLowerCase(): Deprecated. Some lowercase characters are supplementary characters; see the class documentation. |
static CharMatcher | javaUpperCase(): Deprecated. Some uppercase characters are supplementary characters; see the class documentation. |
int | lastIndexIn(java.lang.CharSequence sequence): Returns the index of the last matching BMP character in a character sequence, or -1 if no matching character is present. |
abstract boolean | matches(char c): Determines a true or false value for the given character. |
boolean | matchesAllOf(java.lang.CharSequence sequence): Returns true if a character sequence contains only matching BMP characters. |
boolean | matchesAnyOf(java.lang.CharSequence sequence): Returns true if a character sequence contains at least one matching BMP character. |
boolean | matchesNoneOf(java.lang.CharSequence sequence): Returns true if a character sequence contains no matching BMP characters. |
CharMatcher | negate(): Returns a matcher that matches any character not matched by this matcher. |
static CharMatcher | none(): Matches no characters. |
static CharMatcher | noneOf(java.lang.CharSequence sequence): Returns a char matcher that matches any BMP character not present in the given character sequence. |
CharMatcher | or(CharMatcher other): Returns a matcher that matches any character matched by either this matcher or other. |
CharMatcher | precomputed(): Returns a char matcher functionally equivalent to this one, but which may be faster to query than the original; your mileage may vary. |
(package private) CharMatcher | precomputedInternal(): This is the actual implementation of precomputed(), but we bounce calls through a method on Platform so that we can have different behavior in GWT. |
private static CharMatcher | precomputedPositive(int totalCharacters, java.util.BitSet table, java.lang.String description): Helper method for precomputedInternal() that doesn't test if the negation is cheaper. |
java.lang.String | removeFrom(java.lang.CharSequence sequence): Returns a string containing all non-matching characters of a character sequence, in order. |
java.lang.String | replaceFrom(java.lang.CharSequence sequence, char replacement): Returns a string copy of the input character sequence, with each matching BMP character replaced by a given replacement character. |
java.lang.String | replaceFrom(java.lang.CharSequence sequence, java.lang.CharSequence replacement): Returns a string copy of the input character sequence, with each matching BMP character replaced by a given replacement sequence. |
java.lang.String | retainFrom(java.lang.CharSequence sequence): Returns a string containing all matching BMP characters of a character sequence, in order. |
(package private) void | setBits(java.util.BitSet table): Sets bits in table matched by this matcher. |
private static java.lang.String | showCharacter(char c): Returns the Java Unicode escape sequence for the given char, in the form "\u12AB" where "12AB" is the four hexadecimal digits representing the 16-bit code unit. |
static CharMatcher | singleWidth(): Deprecated. Many such characters are supplementary characters; see the class documentation. |
java.lang.String | toString(): Returns a string representation of this CharMatcher, such as CharMatcher.or(WHITESPACE, JAVA_DIGIT). |
java.lang.String | trimAndCollapseFrom(java.lang.CharSequence sequence, char replacement): Collapses groups of matching characters exactly as collapseFrom(java.lang.CharSequence, char) does, except that groups of matching BMP characters at the start or end of the sequence are removed without replacement. |
java.lang.String | trimFrom(java.lang.CharSequence sequence): Returns a substring of the input character sequence that omits all matching BMP characters from the beginning and from the end of the string. |
java.lang.String | trimLeadingFrom(java.lang.CharSequence sequence): Returns a substring of the input character sequence that omits all matching BMP characters from the beginning of the string. |
java.lang.String | trimTrailingFrom(java.lang.CharSequence sequence): Returns a substring of the input character sequence that omits all matching BMP characters from the end of the string. |
static CharMatcher | whitespace(): Determines whether a character is whitespace according to the latest Unicode standard. |
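private static final int DISTINCT_CHARS
protected CharMatcher()
Constructor for use by subclasses. Subclasses should override toString() to provide a useful description.
public static CharMatcher any()
Matches any character. Previously available as the constant ANY.
public static CharMatcher none()
Matches no characters. Previously available as the constant NONE.
public static CharMatcher whitespace()
Determines whether a character is whitespace according to the latest Unicode standard. All Unicode White_Space characters are on the BMP and thus supported by this API.
Note: as the Unicode definition evolves, we will modify this matcher to keep it up to date.
Previously available as the constant WHITESPACE.
public static CharMatcher breakingWhitespace()
Determines whether a character is a breaking whitespace (that is, a whitespace which can be interpreted as a break between words for formatting purposes). See whitespace() for a discussion of that term.
Previously available as the constant BREAKING_WHITESPACE.
public static CharMatcher ascii()
Determines whether a character is ASCII, meaning that its code point is less than 128.
Previously available as the constant ASCII.
@Deprecated public static CharMatcher digit()
Deprecated. Many digits are supplementary characters; see the class documentation.
To match only ASCII digits, use inRange('0', '9').
Previously available as the constant DIGIT.
@Deprecated public static CharMatcher javaDigit()
Deprecated. Many digits are supplementary characters; see the class documentation.
To match only ASCII digits, use inRange('0', '9').
Previously available as the constant JAVA_DIGIT.
@Deprecated public static CharMatcher javaLetter()
Deprecated. Most letters are supplementary characters; see the class documentation.
To match only ASCII letters, use inRange('a', 'z').or(inRange('A', 'Z')).
Previously available as the constant JAVA_LETTER.
@Deprecated public static CharMatcher javaLetterOrDigit()
Deprecated. Most letters and digits are supplementary characters; see the class documentation.
Previously available as the constant JAVA_LETTER_OR_DIGIT.
@Deprecated public static CharMatcher javaUpperCase()
Deprecated. Some uppercase characters are supplementary characters; see the class documentation.
Previously available as the constant JAVA_UPPER_CASE.
@Deprecated public static CharMatcher javaLowerCase()
Deprecated. Some lowercase characters are supplementary characters; see the class documentation.
Previously available as the constant JAVA_LOWER_CASE.
public static CharMatcher javaIsoControl()
Determines whether a character is an ISO control character as specified by Character.isISOControl(char).
All ISO control codes are on the BMP and thus supported by this API.
Previously available as the constant JAVA_ISO_CONTROL.
@Deprecated public static CharMatcher invisible()
Deprecated. Most invisible characters are supplementary characters; see the class documentation.
See also the Unicode Default_Ignorable_Code_Point property (available via ICU).
Previously available as the constant INVISIBLE.
@Deprecated public static CharMatcher singleWidth()
Deprecated. Many such characters are supplementary characters; see the class documentation.
When in doubt, this matcher returns false (that is, it tends to assume a character is double-width).
Note: as the reference file evolves, we will modify this matcher to keep it up to date.
See also UAX #11 East Asian Width.
Previously available as the constant SINGLE_WIDTH.
public static CharMatcher is(char match)
Returns a char matcher that matches only one specified BMP character.
public static CharMatcher isNot(char match)
Returns a char matcher that matches any character except the BMP character specified.
To negate another CharMatcher, use negate().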
public static CharMatcher anyOf(java.lang.CharSequence sequence)
Returns a char matcher that matches any BMP character present in the given character sequence. Returns a bogus matcher if the sequence contains supplementary characters.
public static CharMatcher noneOf(java.lang.CharSequence sequence)
Returns a char matcher that matches any BMP character not present in the given character sequence. Returns a bogus matcher if the sequence contains supplementary characters.
public static CharMatcher inRange(char startInclusive, char endInclusive)
Returns a char matcher that matches any character in a given BMP range (both endpoints are inclusive). For example, to match any lowercase letter of the English alphabet, use CharMatcher.inRange('a', 'z').
Throws: java.lang.IllegalArgumentException - if endInclusive < startInclusive
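public static CharMatcher forPredicate(Predicate<? super java.lang.Character> predicate)
Returns a matcher with identical behavior to the given Character-based predicate, but which operates on primitive char instances instead.
public abstract boolean matches(char c)
Determines a true or false value for the given character.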
public CharMatcher negate()
Returns a matcher that matches any character not matched by this matcher.
Specified by: negate in interface java.util.function.Predicate<java.lang.Character>
public CharMatcher and(CharMatcher other)
Returns a matcher that matches any character matched by both this matcher and other.
public CharMatcher or(CharMatcher other)
Returns a matcher that matches any character matched by either this matcher or other.
public CharMatcher precomputed()
Returns a char matcher functionally equivalent to this one, but which may be faster to query than the original; your mileage may vary. Precomputation takes time and is likely to be worthwhile only if the precomputed matcher is queried many thousands of times.
This method has no effect (returns this) when called in GWT: it's unclear whether a precomputed matcher is faster, but it certainly consumes more memory, which doesn't seem like a worthwhile tradeoff in a browser.
CharMatcher precomputedInternal()
This is the actual implementation of precomputed(), but we bounce calls through a method on Platform so that we can have different behavior in GWT.
This implementation tries to be smart in a number of ways. It recognizes cases where the negation is cheaper to precompute than the matcher itself; it tries to build small hash tables for matchers that only match a few characters, and so on. In the worst-case scenario, it constructs an eight-kilobyte bit array and queries that. In many situations this produces a matcher which is faster to query than the original.
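The worst-case strategy described above, a bit array with one bit per possible char value, can be sketched with java.util.BitSet. This is an illustration under stated assumptions rather than Guava's implementation: the CharTest interface and the class and method names are hypothetical, and the hash-table and negation optimizations are left out. A table of 65,536 bits occupies 8,192 bytes, which is where the eight-kilobyte figure comes from.

```java
import java.util.BitSet;

final class BitSetTableSketch {
    /** Hypothetical stand-in for a char predicate; not part of Guava. */
    interface CharTest {
        boolean matches(char c);
    }

    /** Builds a table with one bit per possible char value (65,536 bits). */
    static BitSet buildTable(CharTest test) {
        BitSet table = new BitSet(Character.MAX_VALUE + 1);
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (test.matches((char) c)) {
                table.set(c);
            }
        }
        return table;
    }

    /** Returns a test that answers from the table instead of re-running the original logic. */
    static CharTest precompute(CharTest test) {
        BitSet table = buildTable(test);
        return c -> table.get(c);
    }

    public static void main(String[] args) {
        CharTest asciiDigit = c -> c >= '0' && c <= '9';
        CharTest fast = precompute(asciiDigit);
        System.out.println(fast.matches('5')); // true
        System.out.println(fast.matches('a')); // false
    }
}
```

Building the table costs 65,536 calls to the original test, which is consistent with the advice that precomputation pays off only when the result is queried many times.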
private static CharMatcher precomputedPositive(int totalCharacters, java.util.BitSet table, java.lang.String description)
Helper method for precomputedInternal() that doesn't test if the negation is cheaper.
private static boolean isSmall(int totalCharacters, int tableLength)
void setBits(java.util.BitSet table)
Sets bits in table matched by this matcher.
public boolean matchesAnyOf(java.lang.CharSequence sequence)
Returns true if a character sequence contains at least one matching BMP character. Equivalent to !matchesNoneOf(sequence).
The default implementation iterates over the sequence, invoking matches(char) for each character, until this returns true or the end is reached.
Parameters: sequence - the character sequence to examine, possibly empty
Returns: true if this matcher matches at least one character in the sequence
public boolean matchesAllOf(java.lang.CharSequence sequence)
Returns true if a character sequence contains only matching BMP characters.
The default implementation iterates over the sequence, invoking matches(char) for each character, until this returns false or the end is reached.
Parameters: sequence - the character sequence to examine, possibly empty
Returns: true if this matcher matches every character in the sequence, including when the sequence is empty
public boolean matchesNoneOf(java.lang.CharSequence sequence)
Returns true if a character sequence contains no matching BMP characters. Equivalent to !matchesAnyOf(sequence).
The default implementation iterates over the sequence, invoking matches(char) for each character, until this returns true or the end is reached.
Parameters: sequence - the character sequence to examine, possibly empty
Returns: true if this matcher matches no characters in the sequence, including when the sequence is empty
public int indexIn(java.lang.CharSequence sequence)
Returns the index of the first matching BMP character in a character sequence, or -1 if no matching character is present.
The default implementation iterates over the sequence in forward order calling matches(char) for each character.
Parameters: sequence - the character sequence to examine from the beginning
Returns: the index of the first matching character, or -1 if no character matches
public int indexIn(java.lang.CharSequence sequence, int start)
Returns the index of the first matching BMP character in a character sequence, starting from a given position, or -1 if no character matches after that position.
The default implementation iterates over the sequence in forward order, beginning at start, calling matches(char) for each character.
Parameters: sequence - the character sequence to examine
start - the first index to examine; must be nonnegative and no greater than sequence.length()
Returns: the index of the first matching character, which is no less than start, or -1 if no character matches
Throws: java.lang.IndexOutOfBoundsException - if start is negative or greater than sequence.length()
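The contract of indexIn(sequence, start) described above (a forward scan from start, -1 when nothing matches, and an exception for an out-of-range start) can be sketched as follows. The CharTest interface and the class name are hypothetical stand-ins; this is not the library source.

```java
final class IndexInSketch {
    /** Hypothetical stand-in for a char predicate; not part of Guava. */
    interface CharTest {
        boolean matches(char c);
    }

    static int indexIn(CharTest test, CharSequence sequence, int start) {
        int length = sequence.length();
        // start may equal length (nothing left to examine), but not exceed it.
        if (start < 0 || start > length) {
            throw new IndexOutOfBoundsException("start: " + start + ", length: " + length);
        }
        for (int i = start; i < length; i++) {
            if (test.matches(sequence.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        CharTest isA = c -> c == 'a';
        System.out.println(indexIn(isA, "banana", 2)); // 3
        System.out.println(indexIn(isA, "banana", 6)); // -1
    }
}
```

Note that start equal to sequence.length() is legal and simply yields -1.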
public int lastIndexIn(java.lang.CharSequence sequence)
Returns the index of the last matching BMP character in a character sequence, or -1 if no matching character is present.
The default implementation iterates over the sequence in reverse order calling matches(char) for each character.
Parameters: sequence - the character sequence to examine from the end
Returns: the index of the last matching character, or -1 if no character matches
public int countIn(java.lang.CharSequence sequence)
Returns the number of matching chars found in a character sequence.
Counts 2 per supplementary character, such as for whitespace().negate().
public java.lang.String removeFrom(java.lang.CharSequence sequence)
Returns a string containing all non-matching characters of a character sequence, in order. For example:
CharMatcher.is('a').removeFrom("bazaar")
... returns "bzr".
public java.lang.String retainFrom(java.lang.CharSequence sequence)
Returns a string containing all matching BMP characters of a character sequence, in order. For example:
CharMatcher.is('a').retainFrom("bazaar")
... returns "aaa".
public java.lang.String replaceFrom(java.lang.CharSequence sequence, char replacement)
Returns a string copy of the input character sequence, with each matching BMP character replaced by a given replacement character. For example:
CharMatcher.is('a').replaceFrom("radar", 'o')
... returns "rodor".
The default implementation uses indexIn(CharSequence) to find the first matching character, then iterates the remainder of the sequence calling matches(char) for each character.
Parameters: sequence - the character sequence to replace matching characters in
replacement - the character to append to the result string in place of each matching character in sequence
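The two-phase approach described for replaceFrom (locate the first match, then scan only the remainder) might look roughly like this. It is a sketch under assumptions: CharTest and the class name are hypothetical, and the actual library code may differ in detail.

```java
final class ReplaceFromSketch {
    /** Hypothetical stand-in for a char predicate; not part of Guava. */
    interface CharTest {
        boolean matches(char c);
    }

    static String replaceFrom(CharTest test, CharSequence sequence, char replacement) {
        String string = sequence.toString();
        // Phase 1: find the first match; with no match the input is returned unchanged.
        int pos = -1;
        for (int i = 0; i < string.length(); i++) {
            if (test.matches(string.charAt(i))) {
                pos = i;
                break;
            }
        }
        if (pos == -1) {
            return string;
        }
        // Phase 2: copy once, then replace matches from the first hit onward.
        char[] chars = string.toCharArray();
        chars[pos] = replacement;
        for (int i = pos + 1; i < chars.length; i++) {
            if (test.matches(chars[i])) {
                chars[i] = replacement;
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(replaceFrom(c -> c == 'a', "radar", 'o')); // rodor
    }
}
```

Searching for the first match before copying avoids allocating a new array when nothing needs replacing.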
public java.lang.String replaceFrom(java.lang.CharSequence sequence, java.lang.CharSequence replacement)
Returns a string copy of the input character sequence, with each matching BMP character replaced by a given replacement sequence. For example:
CharMatcher.is('a').replaceFrom("yaha", "oo")
... returns "yoohoo".
Note: If the replacement is a fixed string with only one character, you are better off calling replaceFrom(CharSequence, char) directly.
Parameters: sequence - the character sequence to replace matching characters in
replacement - the characters to append to the result string in place of each matching character in sequence
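public java.lang.String trimFrom(java.lang.CharSequence sequence)
Returns a substring of the input character sequence that omits all matching BMP characters from the beginning and from the end of the string. For example:
CharMatcher.anyOf("ab").trimFrom("abacatbab")
... returns "cat".
Note that:
CharMatcher.inRange('\0', ' ').trimFrom(str)
... is equivalent to String.trim().
public java.lang.String trimLeadingFrom(java.lang.CharSequence sequence)
Returns a substring of the input character sequence that omits all matching BMP characters from the beginning of the string. For example:
CharMatcher.anyOf("ab").trimLeadingFrom("abacatbab")
... returns "catbab".
public java.lang.String trimTrailingFrom(java.lang.CharSequence sequence)
Returns a substring of the input character sequence that omits all matching BMP characters from the end of the string. For example:
CharMatcher.anyOf("ab").trimTrailingFrom("abacatbab")
... returns "abacat".
public java.lang.String collapseFrom(java.lang.CharSequence sequence, char replacement)
Returns a string copy of the input character sequence, with each group of consecutive matching BMP characters replaced by a single replacement character. For example:
CharMatcher.anyOf("eko").collapseFrom("bookkeeper", '-')
... returns "b-p-r".
The default implementation uses indexIn(CharSequence) to find the first matching character, then iterates the remainder of the sequence calling matches(char) for each character.
Parameters: sequence - the character sequence to replace matching groups of characters in
replacement - the character to append to the result string in place of each group of matching characters in sequence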
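The collapsing behavior can be sketched with a single pass and a flag that records whether the scan is currently inside a group of matches. This is a simplified illustration: it skips the first-match search that the default implementation is described as using, and CharTest and the class name are hypothetical.

```java
final class CollapseFromSketch {
    /** Hypothetical stand-in for a char predicate; not part of Guava. */
    interface CharTest {
        boolean matches(char c);
    }

    static String collapseFrom(CharTest test, CharSequence sequence, char replacement) {
        StringBuilder out = new StringBuilder(sequence.length());
        boolean inMatchingGroup = false;
        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (test.matches(c)) {
                // Emit the replacement only once per run of consecutive matches.
                if (!inMatchingGroup) {
                    out.append(replacement);
                    inMatchingGroup = true;
                }
            } else {
                out.append(c);
                inMatchingGroup = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CharTest eko = c -> "eko".indexOf(c) >= 0;
        System.out.println(collapseFrom(eko, "bookkeeper", '-')); // b-p-r
    }
}
```

In "bookkeeper" the run "ookkee" collapses to one '-', and the lone 'e' before the final 'r' collapses to another.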
public java.lang.String trimAndCollapseFrom(java.lang.CharSequence sequence, char replacement)
Collapses groups of matching characters exactly as collapseFrom(java.lang.CharSequence, char) does, except that groups of matching BMP characters at the start or end of the sequence are removed without replacement.
private java.lang.String finishCollapseFrom(java.lang.CharSequence sequence, int start, int end, char replacement, java.lang.StringBuilder builder, boolean inMatchingGroup)
@Deprecated public boolean apply(java.lang.Character character)
Deprecated. Provided only to satisfy the Predicate interface; use matches(char) instead.
Description copied from interface Predicate: Returns the result of applying this predicate to input (Java 8 users, see notes in the class documentation above). This method is generally expected, but not absolutely required, to have the following properties:
- Objects.equal(a, b) implies that predicate.apply(a) == predicate.apply(b).
public java.lang.String toString()
Returns a string representation of this CharMatcher, such as CharMatcher.or(WHITESPACE, JAVA_DIGIT).
Overrides: toString in class java.lang.Object
private static java.lang.String showCharacter(char c)
Returns the Java Unicode escape sequence for the given char, in the form "\u12AB" where "12AB" is the four hexadecimal digits representing the 16-bit code unit.
private static CharMatcher.IsEither isEither(char c1, char c2)