Emoji4j is a high-performance, standards-compliant emoji processor supporting Unicode 16.0 for Java 8 or later.
- Comply with the Unicode 16.0 standard for emoji and pictographs
- Provide library support for the most common emoji processing tasks
- Go fast
- Keeps JAR size and dependency footprint small
- Support all emoji processing tasks
- Provide complex emoji building support, e.g. adding and removing modifiers
- Provide structured emoji metadata, e.g. person representations, skin color, etc.
According to the Unicode standard, an emoji is "A colorful pictograph that can be used inline in text."
Using typeset text imagery to convey emotion -- so-called "emoticons," like ":-)" -- date back at least to 1982, when a Carnegie Mellon computer scientist published the following to message to an internal BBS:
19-Sep-82 11:44 Scott E Fahlman :-)
From: Scott E Fahlman <Fahlman at Cmu-20c>
I propose that the following character sequence for joke markers:
:-)
Read it sideways. Actually, it is probably more economical to mark
things that are NOT jokes, given current trends. For this, use
:-(
Emoji are images appearing in text that convey this same information, but more visually.
Proper emoji are a convergence of a couple related technologies:
- Emoticon "ASCII art" mentioned above
- Proto-emoji image characters on Japanese mobile phones in the 1990s
- Pictograph image characters in fonts like Wingdings
The "modern" emoji era started in 2010, when "emoji" were properly standardized into Unicode version 6.0. For more information, see the timeline.
Due to their convergent nature, there are a couple of different "kinds" of emoji. Each type represents a concept in one grapheme, or visually distinct character.
First, there are proper emoji. A good example of an emoji is "263A FE0F
, smiling face. Note that it is colorful, and does not visually match the text around it. When reading the standard, these are emoji characters with the Emoji_Presentation
property, or an emoji sequence qualified with the emoji variation.
Second, there are pictographs, which technically are not emoji, but are still images captured in Unicode text, so the difference is largely pedantic. The corresponding pictograph to the smiling face emoji above is "☺", code points 263A
, smiling face. Note that it is monochrome, and matches the text around it. This is characteristic of pictographs. When reading the standard, these are emoji characters without the Emoji_Presentation
property but with the Extended_Pictographic
property, or an emoji sequence qualified with the text variation.
Not all emoji have a corresponding pictograph. Not all pictograph have a corresponding emoji. They are separate and distinct, but related.
The latest Unicode version defines 3,790 official emoji, plus several hundred additional pictographs. The emoji4j library supports all of these graphemes.
The workhorse of emoji4j is the GraphemeMatcher
class, which is modeled after the Matcher
class. To manually scan a string text
for all emoji, use the find()
method:
GraphemeMatcher m=new GraphemeMatcher(text);
while(m.find()) {
System.out.println("Found grapheme "+m.grapheme().getName());
}
To replace all emoji with their names, one could use this snippet:
text = new GraphemeMatcher(text).replaceAll(mr -> mr.grapheme().getName());
If users care about support only emoji (as opposed to all image graphemes), then they could check the type attribute of the returned Grapheme
:
GraphemeMatcher m=new GraphemeMatcher(text);
while(m.find()) {
if(m.grapheme().getType() == Grapheme.Type.EMOJI) {
System.out.println("Found emoji "+m.grapheme().getName());
}
}
Solutions to common problems are being collected in the cookbook. If you have a solution or request for the cookbook, then please open an issue!
Performance matters! High performance is an explicitly-stated goal of the emoji4j project.
Performance measurement is a complex topic. These (simple) benchmarks are put together in good faith solely to compare the performance of different libraries and text processing methods.
The benchmarks were built using JMH. Measurements were taken on an EC2 m6i.large
instance using openjdk-17-jdk-headless
.
The emoji4j-benchmarks project defines three benchmarks:
EmojiJavaBenchmark.tweets
-- Use emoji-java to extract all emoji from an emoji-rich text snippet.GraphemeMatcherBenchmark.tweets
-- Use emoji4j to extract all emoji and pictographs from an emoji-rich text snippet.NopBenchmark.tweets
-- Use a simple for loop to iterate over the code points in an emoji-rich text snippet.
The "text snippet" is 1MB of tweet text captured from the Twitter Streaming API. As a result, the benchmark results can loosely be interpreted as MB/s.
The benchmark results are as follows:
Benchmark Mode Cnt Score Error Units
EmojiJavaBenchmark.tweets thrpt 15 80.258 ± 11.278 ops/s
GraphemeMatcherBenchmark.tweets thrpt 15 215.996 ± 21.979 ops/s
NopBenchmark.tweets thrpt 15 1358.919 ± 268.917 ops/s
According to the benchmarks, emoji4j runs about 2.7x as fast as emoji-java. However, there is still a lot of performance to gain back versus a simple code point scan.