const name = 'Nick'
console.log(name.length) // 4
name has 4 characters: ‘N’, ‘i’, ‘c’, ‘k’, and its length is also 4.
Everything seems logical.
Let’s go further and add an emoji to my name.
const name = 'Nick 🐃'
console.log(name.length) // 7
name should have 6 characters: ‘N’, ‘i’, ‘c’, ‘k’, ‘ ’ (whitespace), and ‘🐃’.
But its length is 7.
It seems like the bull has 2 characters.
const emoji = '🐃'
console.log(emoji.length) // 2
Let’s figure out why.
Open the ECMAScript specification and scroll to “6.1.4 The String Type.”
There you will find this:
“The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2⁵³ - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.”
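To make the quoted definition concrete, here is a minimal sketch that reads a string as the sequence of 16-bit integers the spec describes. The helper name `codeUnits` is made up for illustration; `charCodeAt` is the standard method that returns the element at a given index as a number.

```javascript
// Collect every 16-bit element of a string as a plain number.
// `codeUnits` is a hypothetical helper, not part of the language.
function codeUnits(str) {
  const units = []
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i)) // always in the range 0..65535
  }
  return units
}

console.log(codeUnits('Nick')) // [ 78, 105, 99, 107 ]
console.log(codeUnits('Nick 🐃').length) // 7
```

Seen this way, `length` is simply the number of elements in that sequence, not the number of symbols on the screen.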
❓What is UTF-16?
💬 A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence. UTF-16 is the format that maps each code point to one or two 16-bit code units.
One UTF-16 code unit value is a number from 0x0000 to 0xFFFF.
❓What are 0x0000 and 0xFFFF?
💬 The 0x prefix marks a number written in the hexadecimal numeral system. Hexadecimal, often shortened to "hex", is a numeral system made up of 16 symbols (base 16). The standard numeral system is called decimal (base 10) and uses ten symbols: 0,1,2,3,4,5,6,7,8,9. Hexadecimal uses the ten decimal digits and six extra symbols: A, B, C, D, E, F. So 0x0000 is 0 and 0xFFFF is 65535, the largest value that fits in 16 bits.
0x004e = N
0x0069 = i
0x0063 = c
0x006b = k
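The table above can be checked directly in the console. This is a small sketch: `toHex` is a hypothetical helper written for this example, while `String.fromCharCode`, `charCodeAt`, `toString(16)`, and `padStart` are standard methods.

```javascript
// Hex literals are ordinary numbers written in base 16.
console.log(0x004e) // 78
console.log(0xffff) // 65535

// Build a string from code units.
console.log(String.fromCharCode(0x004e, 0x0069, 0x0063, 0x006b)) // Nick

// Format a code unit the way the table writes it (hypothetical helper).
function toHex(unit) {
  return '0x' + unit.toString(16).padStart(4, '0')
}

console.log(toHex('N'.charCodeAt(0))) // 0x004e
console.log(toHex('k'.charCodeAt(0))) // 0x006b
```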
In UTF-16, Unicode characters from the Basic Multilingual Plane (which contains characters for almost all modern languages) are encoded with one code unit.
Characters outside the Basic Multilingual Plane (most emojis, musical notation, playing cards, hieroglyphs, etc.) require two code units, known as a surrogate pair.
So UTF-16 represents the 🐃 emoji (code point U+1F403) with two code units (0xd83d 0xdc03).
That is why '🐃'.length gives 2.
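Where do those two code units come from? The sketch below follows the standard UTF-16 rule for code points above 0xFFFF: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each. The function name `toSurrogatePair` is made up for this example.

```javascript
// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
// `toSurrogatePair` is a hypothetical helper for illustration.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000     // a 20-bit value
  const high = 0xd800 + (offset >> 10)   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff)  // bottom 10 bits
  return [high, low]
}

const buffalo = '🐃'
const codePoint = buffalo.codePointAt(0)
console.log(codePoint.toString(16)) // 1f403

const [high, low] = toSurrogatePair(codePoint)
console.log(high.toString(16), low.toString(16)) // d83d dc03
console.log(String.fromCharCode(high, low) === buffalo) // true
```

The built-in `codePointAt` does the reverse: it reads both code units and returns the single code point they encode.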
const name = 'Nick'
const nameInUnicode = '\u004e\u0069\u0063\u006b'
console.log(name === nameInUnicode) // true
console.log(nameInUnicode.length) // 4

const fullName = 'Nick 🐃'
const fullNameInUnicode = '\u004e\u0069\u0063\u006b\u0020\ud83d\udc03'
console.log(fullName === fullNameInUnicode) // true
console.log(fullNameInUnicode.length) // 7
❓ What is \u?
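💬 It starts a Unicode escape sequence. In JavaScript, “\u” followed by exactly four hexadecimal digits represents one UTF-16 code unit, which is why 🐃 needs two escapes. The uppercase “\U” form exists in some other languages but is not a Unicode escape in JavaScript.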
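As a hedged aside, JavaScript (since ES2015) also has tools that work with whole code points instead of code units. The sketch below shows the `\u{...}` escape and string iteration; `countCodePoints` is a hypothetical helper name.

```javascript
// \uXXXX names one code unit; \u{...} names a whole code point.
const fromCodeUnits = '\ud83d\udc03'
const fromCodePoint = '\u{1f403}'
console.log(fromCodeUnits === fromCodePoint) // true
console.log(fromCodePoint.length) // 2, still two code units underneath

// String iteration walks code points, so spreading keeps the pair together.
function countCodePoints(str) {
  return [...str].length
}

const nameWithEmoji = 'Nick 🐃'
console.log(nameWithEmoji.length) // 7
console.log(countCodePoints(nameWithEmoji)) // 6
```

Counting code points gives the expected 6 here, but it is not a general "visible character" count: some emojis are built from several code points joined together.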