The emergence of language made it possible to transmit and preserve knowledge across generations. The development of writing made this transfer even easier.

Most writing systems used today are hundreds if not thousands of years old. In contrast, digital text is relatively new. The digital representation of texts was mainly oriented towards the English language in the early years. But today, a considerable part of human interactions takes place on the Internet on a global scale.

People exchange information across languages and national borders. This change required a new approach: a uniform structure for exchanging text written in different writing systems and with a wide variety of symbols. At the same time, technological advancements have opened up new possibilities for displaying characters.

An example: think of the popular emojis on your phone. These icons can be typed from a special keyboard just like letters, almost as if they were a natural part of the alphabet. How is this possible? The Unicode standard provides the explanation.

Definition and explanations

The name Unicode is short for “universal character encoding”. It is a standard for encoding characters in binary representation. Unicode makes it possible to store and process text in digital systems.

What makes Unicode unique is that this standard is not linked to the formats and encodings of the alphabet of a particular language. Instead, Unicode has been created to serve as a uniform standard for representing all writing systems and all characters that exist worldwide.

Since the release of Unicode 1.0 in 1991, the standard has lived up to its goal. Unicode is used internally by browsers and operating systems as a single format. With version 13.0 released by the Unicode Consortium in 2020, the Unicode standard now includes a directory of 143,859 characters in total.

The Unicode Consortium is a California-based nonprofit organization committed to advancing the standard. Members of the consortium are leading technology companies such as Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, and SAP. The character set covered by the Unicode standard coincides with the UCS (Universal Coded Character Set), which is standardized internationally under the designation ISO/IEC 10646.

Technical basis for character encoding

Writing and text are ubiquitous in modern life. Reading and writing are among the first skills learned in school. So it’s no surprise that many people simply take the existence of digital writing for granted. But how exactly does the technical representation of writing work? Take a journey with us into the world of digital character encoding.

First of all, it is important to understand that all the information present in a digital system is in fact made up of endless strings of 0’s and 1’s. This is also referred to as a “binary representation”. The binary code is itself a kind of alphabet. However, there are only two “letters” in the binary code: 0’s and 1’s. Each digit in a sequence of 0’s and 1’s is called a “bit”.

The basic trick of computer technology is to represent the characters of different alphabets as sequences of 0s and 1s. Numbers and letters can be encoded in this way, and so can all other recognizable characters. In general, we speak of “symbols”. The longer the sequence of 0s and 1s used for a single symbol, the more symbols can be represented. The number of possible symbols doubles with each bit added.

A concrete example: imagine that we have binary “words” that are two bits long. They allow us to encode four digits:

2-bit word | Digit
00 | 0
01 | 1
10 | 2
11 | 3

If we add another bit at the start of the sequence, the number of possible binary words doubles. The new words are made up of already known bit sequences, each preceded by a 0 or a 1. Eight digits can therefore be coded:

3-bit word | Digit
000 | 0
001 | 1
010 | 2
011 | 3
100 | 4
101 | 5
110 | 6
111 | 7
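
The doubling can be checked with a few lines of Python. This is only a sketch: the helper name `binary_words` is our own and not part of any standard.

```python
from itertools import product


def binary_words(bits):
    """Return all binary words of the given length, in counting order."""
    return ["".join(word) for word in product("01", repeat=bits)]


for bits in (2, 3):
    words = binary_words(bits)
    print(bits, "bits:", len(words), "words")
    for word in words:
        # int(word, 2) reads the word as a base-2 number
        print(word, int(word, 2))
```

Each additional bit doubles the length of the list, so eight bits already give 256 different words.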

Note

An 8-bit word is called a byte.

For the sake of simplicity, we’ve shown you the coding of the digits here as an example. However, the same principle is also used in computer systems for encoding letters or any other character. Here is a very simplified example of binary letter encoding:

3-bit word | Letter
000 | A
001 | B
010 | C

Note that our explanations so far have nothing to do with fonts. We are only discussing the internal model on the basis of which characters are represented numerically. The graphical representation of a character is called a glyph. Depending on the font used, there are different glyphs for the same character, and even within the same font there can be more than one variant of a glyph. Think for example of different accents, cases, italics, etc. Here is an extended representation, which includes assigning a character to a glyph:

Binary representation | Decimal number | Coded character | Glyph
1000001 | 65 | Latin capital letter A | A
1100001 | 97 | Latin small letter a | a
0110000 | 48 | Arabic numeral 0 | 0
0111001 | 57 | Arabic numeral 9 | 9
11000100 | 196 | Latin capital letter A with diaeresis | Ä
11000001 | 193 | Latin capital letter A with acute | Á
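
The rows of such a table can be reproduced in Python with the built-in `ord` function and the standard `unicodedata` module. The helper `describe` is a name chosen for this sketch, and the binary form is padded to eight digits for readability.

```python
import unicodedata


def describe(character):
    """Return the binary form, decimal code point, and official name of a character."""
    code_point = ord(character)
    # "08b" formats the number in base 2, padded with leading zeros to 8 digits
    return format(code_point, "08b"), code_point, unicodedata.name(character)


for character in "Aa09ÄÁ":
    print(*describe(character), character)
```

The glyph that finally appears on screen is not part of this data; it is chosen by the font used to display the output.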

Character encoding terminology

Numerical character encoding involves a number of specific concepts. The different terms are sometimes used as synonyms. In order to be able to give a precise definition of Unicode, we define them here:

Term | Definition
Character set | Set of possible characters, e.g. the digits 0 to 9, the letters a to z, etc.
Code point | The numeric value assigned to each specific character in the character encoding
Coded character set | Character set in which each character has exactly one code point
Character encoding | Process of converting a character into a technical structure, e.g. a binary representation

Overview of the encoding of some common characters

Before the creation of Unicode, there was a wide variety of specific encodings. It was the norm to have a separate encoding for each language or language family. This often led to display errors and data inconsistencies. To avoid this, new character encodings were often designed as supersets of an existing standard, with which they remained compatible. The modern Unicode standard is based on the older ISO Latin-1 character encoding, which in turn is based on the ASCII code.

Character encoding | Bits per character | Possible characters | Character set
ASCII | 7 bits | 128 | Letters, digits, and special characters from the US keyboard, as well as control characters for teleprinters
ISO Latin-1 (ISO 8859-1) | 8 bits | 256 | The first 128 characters are those of ASCII; the other 128 include special characters of Western European languages
Universal Coded Character Set 2 (UCS-2) | 16 bits | 65,536 | Characters of the “Basic Multilingual Plane” (BMP); the first 256 characters are those of ISO Latin-1
Universal Coded Character Set 4 (UCS-4) | 32 bits | 1,114,112 | Characters from the BMP and others added to it; 143,859 characters in total in Unicode 13.0; the first 256 characters are those of ISO Latin-1
UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 bits | 1,114,112 | All characters of UCS-2 and UCS-4; the first 128 characters are encoded as single bytes identical to ASCII
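
The differences between these encodings become visible when the same character is encoded several times. The following sketch uses Python's built-in codecs; `encoded_length` is a helper name of our own, and the `utf-32-be` codec stands in for UCS-4 here, which is an assumption made for illustration.

```python
def encoded_length(text, encoding):
    """Return the number of bytes needed, or None if the encoding cannot represent the text."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


ENCODINGS = ("ascii", "latin-1", "utf-8", "utf-32-be")

for character in ("A", "Ä", "€", "\U0001F600"):
    lengths = [encoded_length(character, encoding) for encoding in ENCODINGS]
    print(character, dict(zip(ENCODINGS, lengths)))
```

UTF-8 needs between one and four bytes depending on the character, while the 32-bit form always needs four.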

Structure of the Unicode standard

The Unicode standard defines the characters and code points corresponding to letters, syllables, ideograms, punctuation marks, special characters, and numbers. In addition to the Latin alphabet, the Greek, Cyrillic, Arabic, Hebrew, and Thai scripts are included. Japanese (Katakana, Hiragana), Chinese, and Korean (Hangul) characters are also covered. There are also special mathematical, commercial, and technical characters, as well as historical control characters for teleprinters.

The characters are grouped into a series of character tables. Here we give an overview of the most common character tables.

Writing systems of the Unicode standard

Character tables | Contains, among others, the following scripts
European writing systems | Armenian, Georgian, Greek, Latin
African writing systems | Coptic, Ethiopic, Egyptian hieroglyphs
Middle Eastern writing systems | Arabic, Hebrew, Syriac
Central Asian writing systems | Mongolian, Tibetan, Old Turkic
South Asian writing systems | Brahmi, Tamil, Vedic
Southeast Asian writing systems | Khmer, Rohingya, Thai
Writing systems of Indonesia and Oceania | Balinese, Buginese, Javanese
Far Eastern writing systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese)
Writing systems of the Americas | Cherokee, Canadian Aboriginal Syllabics, Osage

Symbols and Punctuation marks of the Unicode standard

Character tables | Contains, among others, the following characters
Punctuation marks | English punctuation, European language punctuation, CJK punctuation
Alphanumeric characters | Letterlike symbols, enclosed alphanumeric symbols
Technical symbols | APL symbols, Optical Character Recognition (OCR)
Digits and numbers | Mayan numerals, Ottoman Siyaq numbers, cuneiform numbers and punctuation
Mathematical symbols | Arrows, mathematical operators, geometric shapes
Pictograms and emojis | Emoticons, dingbats, other pictograms
Other symbols | Alchemical symbols, currency symbols, game symbols
Notation systems | Braille patterns, musical symbols, Duployan shorthand

What is the Unicode standard for?

The Unicode standard is primarily a universal basis for the processing, storage, and exchange of text in any language. Most modern software components that work with text, such as libraries, protocols, and databases, are based on Unicode. We illustrate the range of possible uses with the following examples.

Operating systems

Unicode is the internal standard for the representation of text in most modern operating systems. Some operating systems, such as Apple’s macOS, allow the use of Unicode characters in file names.

Websites

The Unicode variant UTF-8 has established itself as the standard for encoding HTML documents. As of 2016, over 80% of the world’s most visited websites were using UTF-8 to store and display their HTML documents. The Punycode standard has been established for the use of non-ASCII letters in domain names.
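
Python ships an `idna` codec that applies Punycode to domain names, which makes the conversion easy to try out. The domain below is only an illustrative example, and the built-in codec follows the older IDNA 2003 rules, so treat this as a sketch rather than a complete solution for production use.

```python
# Example domain containing a non-ASCII letter (chosen for illustration only)
domain = "münchen.de"

# The idna codec converts each label to its ASCII-compatible "xn--" form
ascii_domain = domain.encode("idna")
print(ascii_domain)

# Decoding restores the original Unicode spelling
print(ascii_domain.decode("idna"))
```

The ASCII form is what is actually stored in the domain name system; browsers convert it back for display.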

Programming languages

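Many modern programming languages use Unicode as the basis for processing text. In many of them it is now also possible to use Unicode characters in the names of variables and functions. In ECMAScript / JavaScript, identifiers may contain Unicode letters, such as Greek or accented Latin letters; pure symbols such as check marks are not allowed. The following code shows an example:

// Identifiers may contain Unicode letters
const π = Math.PI;
const café_ouvert = true;
if (café_ouvert === true) {
  console.log(π);
}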
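
Python follows similar rules for identifiers, and the string method `isidentifier` can be used to test a candidate name. The variable names below are invented for this sketch.

```python
import math

# Letters from any script are allowed in Python identifiers
π = math.pi
länge = 2
umfang = 2 * π * länge
print(umfang)

# isidentifier() reports whether a string would be accepted as a name
for name in ("π", "länge", "✔", "2x"):
    print(repr(name), name.isidentifier())
```

Letters are accepted, whereas a symbol such as the check mark or a name starting with a digit is rejected.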

Databases

The popular and widely used MySQL database supports the full Unicode character set with the “utf8mb4” character encoding. However, the “utf8” character encoding stores at most 3 bytes per character, so characters whose UTF-8 encoding needs 4 bytes, such as most emojis, cannot be stored with it.
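
Whether a text would fit into a 3-byte encoding can be estimated before it reaches the database. The following sketch only measures UTF-8 lengths in Python; it does not talk to MySQL, and the function name `fits_in_three_bytes` is our own.

```python
def fits_in_three_bytes(text):
    """Return True if every character needs at most 3 bytes in UTF-8."""
    return all(len(character.encode("utf-8")) <= 3 for character in text)


print(fits_in_three_bytes("Grüße, 10 €"))
print(fits_in_three_bytes("Hello \U0001F600"))
```

Characters that need four bytes are exactly those with code points above U+FFFF, that is, outside the Basic Multilingual Plane.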

Fonts

Fonts contain the glyphs used to graphically represent text. Due to the large number of characters contained in the Unicode standard, no font contains all characters. Even the BMP subset is completely covered by only a few fonts. Here are some examples:

Unicode font | Glyphs | License
Noto | approx. 65,000 | Open Font License
Sun-ExtA/B | approx. 50,000 | Freeware
Unifont | approx. 63,000 | GNU GPL
Code2000 | approx. 63,000 | Shareware

How to use the Unicode standard?

People often use Unicode without knowing it. In most documents and applications, digital text is available in Unicode format and can be copied, inserted, and edited as needed. Sometimes, however, a specific Unicode character has to be inserted into a text. There are different ways to do this, which we introduce below.

Special character keyboards

Using special keyboards is probably the most common way to insert Unicode characters into text. Ubiquitous on mobile devices, special keyboards allow you to switch between different languages and alphabets. Because all the characters come from the Unicode repertoire, the same key can produce different characters depending on the selected layout, and the characters can be freely mixed and combined in a text.

Emojis are a good example. In the Unicode standard, emojis are characters just like letters, numbers, and special characters. As with alphanumeric characters, the display of emojis is independent of their internal representation. Each operating system presents the same emoji slightly differently.

These useful special keyboards aren’t just found on mobile devices. They are also available on desktop computers. They can be easily opened in Windows, macOS, and many Linux distributions and show a different set of characters depending on the language selected. As the number of keys is limited, not all Unicode characters are displayed. Rather, it is a selection of the most common characters specific to a language.

Unicode character tables

In addition to special character keyboards, Unicode character tables are arguably the most useful way to access various Unicode characters. As a reminder, a coded character set is the set of all characters with their corresponding unique code points. A table is the ideal layout for such a structure, and the Unicode standard includes precisely such tables, called Code Charts. On the one hand, specific characters can be copied from these tables for use elsewhere; on the other hand, the user can read off the corresponding code point, for example to use it in a numeric character reference.

Many PC operating systems also contain a Unicode character table. It provides an overview of all available Unicode characters, including code point, description, and glyph. A character can be inserted or copied with one click. You can also create a character table yourself with just a few lines of code. We show an example using the Python programming language later in this article.

Numerical references

The Unicode standard centers on the assignment of characters to code points. If you know the code point of a character, you can use it to embed the corresponding character in different contexts. In Windows, Unicode symbols can be inserted from the normal keyboard using a key combination. Note that the code point should normally be entered in hexadecimal notation.

Programmers in particular often need numeric references. Writing the code point in hexadecimal makes it possible to represent any Unicode character using only characters from the ASCII character set. We show the process here in HTML. In principle, the operation is the same in Python, C++, etc.

The general scheme of a numeric character reference consists of the reference itself plus an opening and a closing delimiter: in HTML documents, a hexadecimal reference opens with “&#x” and ends with “;”. In between, the hexadecimal code point (up to six digits) is entered without spaces. The result is the pattern “&#xNNNN;”.

For example, to insert the copyright symbol “©” in an HTML document, we proceed as follows:

  1. Find the character in a Unicode table.
  2. Read the code point corresponding to the character. In our example, the code point is “U+00A9”, which is the hexadecimal representation.
  3. Compose the character reference and enter it in the HTML source text or Markdown file. In our case, we enter “&#xA9;”, which gives us the result “©”.

Another, less common approach uses the code point in decimal instead of hexadecimal notation. In this case, the numeric reference begins with “&#” (without the “x”) and ends with “;”, as with the first method. In between, the code point is written in decimal notation. In our example, the numeric reference “&#169;” corresponds to the copyright symbol.
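
Both forms of numeric reference can be generated and checked with Python's standard `html` module. The helper names `hex_reference` and `decimal_reference` are our own; only `html.unescape` comes from the standard library.

```python
import html


def hex_reference(character):
    """Build a hexadecimal numeric character reference such as &#xA9;"""
    return f"&#x{ord(character):X};"


def decimal_reference(character):
    """Build a decimal numeric character reference such as &#169;"""
    return f"&#{ord(character)};"


for reference in (hex_reference("©"), decimal_reference("©")):
    # html.unescape resolves the reference back to the character
    print(reference, html.unescape(reference))
```

Both references consist of ASCII characters only, yet both resolve to the same copyright symbol.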

Named references

Since writing Unicode characters with numeric references is not really intuitive, there is another way: using named references. These are defined for frequently used characters and give the character a short name that is easy to remember. A named reference begins with the ampersand “&” and ends with a semicolon “;”. The defined name is placed between the two without spaces. To insert the copyright symbol “©” in HTML, simply write “&copy;”.
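
Python's `html.entities` module contains a table of such names, which allows a quick lookup. The function `named_reference` below is a hypothetical helper for this sketch, and the `codepoint2name` table it uses covers only the classic HTML 4 names.

```python
import html
from html.entities import codepoint2name


def named_reference(character):
    """Return the named reference for a character, or None if no name is defined."""
    name = codepoint2name.get(ord(character))
    return f"&{name};" if name else None


print(named_reference("©"))
print(named_reference("A"))

# html.unescape resolves named references as well
print(html.unescape("&copy;"))
```

Characters without a defined name, such as plain letters, simply have to be written directly or as numeric references.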

Programming languages

Most programming languages provide basic functions for converting between characters and code points. These functions are often called “ord(character)” and “chr(code point)”. The following applies:

“chr(ord(character)) == character”

Note that it is always possible to determine the code point corresponding to a character. Conversely, the assignment only works for numbers that are actually defined as code points of the encoded character set. We show the basic schema here using a short Python example:

# Determine the decimal code point of a character
ord('A')  # 65
# Determine the hexadecimal code point of a character
hex(ord('A'))  # '0x41'
# Determine the character that belongs to a code point
chr(65)  # 'A'
chr(0x41)  # 'A'
# chr() only accepts code points up to 0x10FFFF
try:
    chr(0x110001)
except ValueError as error:
    print(error)  # error, since the code point is greater than 0x10FFFF

Using these functions, it is easy to create a character table for the code points of the Unicode character set. To do this, you iterate over the code points and output the corresponding characters. With Python, this is done in a few lines of code:


# To list only the ASCII characters, use range(32, 128) instead
# List the characters of ISO Latin-1
for code_point in range(32, 256):
    # Print the code point in decimal and hexadecimal notation with its character
    print(code_point, hex(code_point), chr(code_point))

ICU Library

The International Components for Unicode (ICU) are bundled in a software library provided by the Unicode Consortium. The library is released under an open-source license and can be used on many operating systems. The software is used for internationalization, often shortened to “i18n”. Its fields of application include:

  • Unicode text processing
  • Support for regular expressions in Unicode
  • Parsing and formatting of dates, times, numbers, currencies, and messages

The ICU library is available in two versions:

  • “ICU4C” is written in C/C++ and provides an API for these languages.
  • “ICU4J” is written in Java and provides an API for this language.

Using the ICU components provides consistent results regardless of the platform.

Charset in the metadata at the head of an HTML document

Most current HTML documents are encoded in UTF-8. To ensure that the document is displayed to visitors without incorrect characters, a “charset” declaration should be placed in the head of the HTML document. It tells the browser to interpret the retrieved document as UTF-8. Here is an example:

<head>
 <meta charset="utf-8">
 <!-- other head elements -->
</head>
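
What happens when the declared encoding is missing or wrong can be simulated in Python by decoding the same bytes in two ways. The sample word is arbitrary, and the Latin-1 fallback is an assumption for this sketch; real browsers apply their own detection rules.

```python
text = "café"

# The document is stored as UTF-8 bytes
data = text.encode("utf-8")

# Interpreted as UTF-8, the text is restored correctly
print(data.decode("utf-8"))

# Interpreted as ISO Latin-1, each byte becomes its own character
print(data.decode("latin-1"))
```

The two bytes that encode “é” in UTF-8 turn into two separate Latin-1 characters, which is the typical garbled text seen on pages with a wrong charset.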

Twitter fonts

The Twitter short message service does not allow text formatting for tweets, profiles, or usernames. The creative possibilities of the users are thus limited. However, some ingenious developers have found a trick. Twitter uses Unicode everywhere, so it is possible to compose formatted text from special characters. In particular, characters resembling Latin letters are used. The easiest way to generate such text is to use a Twitter Font Generator.
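
The principle behind such generators can be sketched in a few lines of Python. We assume here that the bold look is produced with the “Mathematical Bold” letters, which start at U+1D400 for capitals and U+1D41A for small letters; the function name `to_math_bold` is our own, and actual generators may work differently.

```python
BOLD_CAPITAL_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_SMALL_A = 0x1D41A    # MATHEMATICAL BOLD SMALL A


def to_math_bold(text):
    """Replace the letters A-Z and a-z with their Mathematical Bold look-alikes."""
    result = []
    for character in text:
        if "A" <= character <= "Z":
            result.append(chr(BOLD_CAPITAL_A + ord(character) - ord("A")))
        elif "a" <= character <= "z":
            result.append(chr(BOLD_SMALL_A + ord(character) - ord("a")))
        else:
            # Digits, spaces, and punctuation are left unchanged
            result.append(character)
    return "".join(result)


print(to_math_bold("Hello Unicode!"))
```

The result is not formatted text but a sequence of different characters, which is why screen readers and search functions may not treat it as ordinary letters.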
