The Unicode Blog: September 2023

Friday, September 22, 2023

Unicode Version 15.1 – Tips for Implementers

The Unicode Version 15.1 release includes the UCD (Unicode Character Database), Code Charts, and Annexes, but the Core Specification is unchanged from Unicode Version 15.0. In addition to new characters, a small number of errata were fixed, along with improved representative glyphs.

Implementers should also take careful note of important changes that were made to the following UAXes:

For UAX #9 (Unicode Bidirectional Algorithm), the text for BD16, the interaction of control flow between W4 through W6, the use of sos, and the treatment of AN/EN with brackets in N0 were clarified, and a reference to UTS #55 was added.
For UAX #14 (Unicode Line Breaking Algorithm), line breaking at orthographic syllable boundaries was added, the handling of French-style quotation marks was improved, and allowed tailorings were more clearly characterized.
For UAX #29 (Unicode Text Segmentation), explicit conformance rules were added, support for ConjunctLinker clusters was added, the definition of “crlf” was updated, and multiple changes were made to the table of Word_Break Property Values.
For UAX #31 (Unicode Identifiers and Syntax), multiple changes were made to Section 2, Section 4 was completely rewritten, Section 7 was added, the limited contexts for joining controls were moved to UTS #39, and a reference to UTS #55 was added.
For UAX #38 (Unicode Han Database), 6 new provisional properties were added, 7 provisional properties were removed, the syntax of several properties was updated, and the description of several properties was improved.
For UAX #45 (U-Source Ideographs), records for 39 new ideographs were added to its data file, Section 3 was added, “ExtI” was added as a new status, two obsolete status values were removed, and four status values were improved.

🌻🌻🌻🌻🌻 SUPPORT UNICODE 🌻🌻🌻🌻🌻

Finally, if you are already a contributor or member of Unicode (or your company or organization is), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we accomplish is only possible because of supporters like you.

To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation.

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)(3) organization, your contribution may be eligible for a tax deduction.

Please consult with a tax advisor for details.

Make your adoption today!

Thursday, September 14, 2023

Unicode CLDR v44 Alpha available for testing

The Unicode CLDR v44 Alpha is now available for integration testing.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:

Sep 27 — Beta (data)
Oct 04 — Beta2 (spec)
Nov 01 — Release

In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

There are many other changes: to find out more, see the draft CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g., in spreadsheets
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g., choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

We are currently planning for CLDR version 45 to be a closed release with no submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.

Background

Because Unicode encompasses letters and symbols (over 149,000 in Unicode 15.1) from across the world’s writing systems, it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫
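To make the “pаypal.com” case concrete, here is a minimal sketch in Python that inspects the code points of two look-alike strings. The helper names `describe` and `non_ascii_positions` are hypothetical, chosen for this illustration; this is only a way to see the difference, not the confusable-detection mechanism defined by the Unicode specifications.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

def non_ascii_positions(text):
    """Return the indexes of characters outside the ASCII range."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0x7F]

latin = "paypal.com"
spoof = "p\u0430ypal.com"  # second character is U+0430, a Cyrillic letter

print(latin == spoof)              # the strings differ despite looking alike
print(describe(latin)[1])          # name of the Latin letter
print(describe(spoof)[1])          # name of the Cyrillic letter
print(non_ascii_positions(spoof))  # where the non-ASCII character sits
```

A plain ASCII check like this would flag every legitimate non-Latin identifier as well, which is why real detection relies on confusable data rather than on character ranges.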

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that looks secure to human reviewers but actually contains hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.

Examples

Line-break spoofs can cause what appears to be a line of code to be actually commented out, as far as the compiler is concerned. This can happen with C11, for example:

To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler will interpret this as one line consisting only of a comment!
The “pаypal.com” above is an example of a confusable spoof.
As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
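The line-break and bidirectional cases can be sketched in Python as follows. This is an illustration under stated assumptions, not the checks specified in UTS #55: the `SUSPICIOUS_BREAKS` set, the helper names, and the C-like `snippet` are all hypothetical. The sketch relies on the fact that Python's `str.splitlines()` treats U+2028 as a line boundary, which stands in for what an editor might display, while splitting only on LF stands in for a compiler that does not.

```python
import unicodedata

# Characters that some tools display as a line break even though a compiler
# may not treat them as one. This set is an assumption for illustration.
SUSPICIOUS_BREAKS = {"\u000b", "\u000c", "\u0085", "\u2028", "\u2029"}

def find_suspicious_breaks(source):
    """Return (offset, code point) for each suspicious break character."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(source) if ch in SUSPICIOUS_BREAKS]

def logical_order(identifier):
    """Return the Unicode names of the characters in stored (logical) order."""
    return [unicodedata.name(ch) for ch in identifier]

# Hypothetical snippet: U+2028 sits between the comment and the call.
snippet = "// check credentials\u2028authorize(user);\n"

as_displayed = snippet.splitlines()     # U+2028 counts as a line boundary here
as_compiled = snippet.split("\n")[:-1]  # only LF ends a line here

print(len(as_displayed), len(as_compiled))
print(find_suspicious_breaks(snippet))

# The two identifiers from the text, written with escapes to avoid ambiguity.
first = "A\u05D0" + "1"   # A, HEBREW LETTER ALEF, 1
second = "A1" + "\u05D0"  # A, 1, HEBREW LETTER ALEF

print(first == second)
print(logical_order(first))
print(logical_order(second))
```

Under the LF-only view, the call to `authorize` is part of the comment, which is the effect described for the line-break spoof. The two identifiers contain the same characters in a different stored order, so they are distinct names even if they render alike.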

Such code might not even be malicious: it is all too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs, or simply to produce code that is very hard to understand. Here’s an example:

The text “Error: {0} {1}", message” becomes RTL in translation.

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.

Process

In response to this problem, the Consortium started a project in early 2022 to assemble a cross-functional group of experts in Unicode processing, programming languages, and software development tooling. That project resulted in the Source Code Working Group (SCWG), which worked through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of the use of the important higher-level protocol HL4 and emphasized its use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. UAX #31 provided important guidance on profiles for default identifiers and clarified that the requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages and is relevant to issues of bidirectional ordering and potential spoofing attacks.

Impact

The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one new useful API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!

Tuesday, September 12, 2023

Announcing The Unicode® Standard, Version 15.1

Version 15.1 of the Unicode Standard is now available. This minor version update includes updated code charts, data files and annexes. The core specification is unchanged from Unicode Version 15.0.

This version adds 627 characters, bringing the total number of characters to 149,813. The additions include 622 CJK unified ideographs in a new block, CJK Unified Ideographs Extension I. These new ideographs are urgently needed in China for use in public service databases, and are expected to be included in a forthcoming amendment to China’s GB 18030-2022 standard. The other new characters are five ideographic description characters that enhance the ability to describe rare or not-yet-encoded CJK ideographs.

There are six completely new emoji, including a phoenix, a lime, and (finally) an edible mushroom. For 108 people emoji, you can now switch the direction that they are facing (for example, person walking facing right versus facing left).

Security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms. These updates complement the release of a new Unicode Technical Standard, UTS #55, Unicode Source Code Handling.

The new characters are limited to three blocks, and the code charts for several other blocks have changed. The most significant change to charts is for the CJK Unified Ideographs, CJK Unified Ideographs Extension A and CJK Unified Ideographs Extension B blocks with the addition of representative glyphs and source references for over 24,000 KP-source (North Korea) ideographs. There are also many other glyph corrections and improvements—see the 15.1 delta code charts for details.

Significant updates have been made to UAX #14, Unicode Line Breaking Algorithm and UAX #29, Unicode Text Segmentation adding better support for scripts of South and Southeast Asia, including grapheme cluster support for aksaras and consonant conjuncts, and line breaking at orthographic syllable boundaries.

For complete details on Unicode Version 15.1, see https://www.unicode.org/versions/Unicode15.1.0/.

Friday, September 1, 2023

NEW Virtual Event - Open House on Script and Character Encoding

Registration is Now Open!

The Unicode Standard aims to make the scripts used to write the languages of the world accessible on computers and devices. However, the process of getting characters and scripts into the Unicode Standard has often been puzzling. How does one successfully propose a script or a handful of characters? How are decisions made?

Join us for a virtual Open House event, where you will be able to put these (and other) script and character encoding questions to seasoned Unicode experts.

When: Tuesday, Oct 17, 2023 at 11am-12pm Pacific Time (California)

Supporting Resources

Documenting and Preserving Languages: A Talk on Character Encoding, Keyboards, and Fonts by Deborah Anderson and Andrew Glass
Scripts and Character Encoding by Deborah Anderson, Script Ad Hoc Group Chair
Other Script and Character Encoding-related talks on the Unicode YouTube Channel
