By: Mark Davis, Cofounder and CTO
The Unicode Consortium is providing a new resource to help
programming tooling developers, programming language developers, and
programming language users to deal with Unicode spoofing.
Background
With Unicode encompassing letters and symbols (over 149,000 in Unicode 15.1)
across the world’s writing systems, it was inevitable that many of them would
look similar, and sometimes identical. And of course, there are those who would
take advantage of that to swindle. An example of this is “pаypal.com”, where the
first ‘а’ is actually a Cyrillic character that is confusable with the Latin
alphabet ‘a’. 😵‍💫
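To make the confusable concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names `describe` and `name_prefixes` are hypothetical, and taking the first word of a character's Unicode name as a stand-in for its script is a rough assumption for illustration, not the detection mechanism defined in the Unicode specifications.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def name_prefixes(text):
    """Rough script guess (an assumption): first word of each letter's name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


spoofed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A
genuine = "paypal"       # all Latin letters

print(spoofed == genuine)              # False: the strings differ
print(describe(spoofed)[1])            # the Cyrillic letter, with its name
print(sorted(name_prefixes(spoofed)))  # ['CYRILLIC', 'LATIN']
print(sorted(name_prefixes(genuine)))  # ['LATIN']
```

A label that mixes name prefixes in this way is only a hint that something may be off; plenty of legitimate text mixes scripts.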
In 2004, the Unicode Consortium began working to address this
issue, focusing on URLs and other identifiers that could be spoofed, and
produced a specification and technical report with best practices for detecting
such cases. Implementations using those specifications have been widely deployed
in operating systems.
In November of 2021, another class of problems was documented. It
was demonstrated that malicious agents could write source code that would look
to human reviewers as if it were secure, but would actually contain hidden traps. There
are three main categories of these spoofs: line-break spoofs, confusable spoofs,
and bidirectional ordering spoofs.
Examples
- Line-break spoofs can cause what appears to
be a line of code to actually be commented out, as far as the compiler is
concerned. This can happen with C11, for example, when a line comment is
followed by what looks like a separate line of code. To a reviewer, the second
line is an active line of code. But when U+2028 Line Separator is at the end of
the first line, the C11 compiler will interpret the two as one line consisting
only of a comment!
- The “pаypal.com” above is an example of a
confusable spoof.
- As for a bidirectional spoof, take a pair of
variables named Aא1 and A1א; these look identical, but the former consists
of the letters A and א followed by the digit 1, whereas the latter consists
of the letter A, the digit 1, and the letter א, in that order.
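The line-break and bidirectional cases above can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the method of any Unicode specification: the C-like snippet is hypothetical, the set in `UNUSUAL_LINE_BREAKS` is chosen for illustration, and the helper names `find_unusual_line_breaks` and `bidi_classes` are made up for this sketch.

```python
import unicodedata

# Characters other than LF/CR that Unicode-aware tools may treat as line
# breaks. This particular set is an assumption chosen for illustration.
UNUSUAL_LINE_BREAKS = "\u000b\u000c\u0085\u2028\u2029"


def find_unusual_line_breaks(source):
    """Return (index, 'U+XXXX') for each unusual line-break character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(source)
        if ch in UNUSUAL_LINE_BREAKS
    ]


def bidi_classes(identifier):
    """Return the Bidi_Class of each character in logical (stored) order."""
    return [unicodedata.bidirectional(ch) for ch in identifier]


# Hypothetical snippet: a comment, U+2028, then what looks like a statement.
source = "// check the argument\u2028if (arg == NULL) return;"

print(len(source.splitlines()))   # 2: str.splitlines() breaks at U+2028
print(len(source.split("\n")))    # 1: a newline-only view sees one line
print(find_unusual_line_breaks(source))

# The two identifiers from the text: same characters, different order.
first = "A\u05d01"   # A, HEBREW LETTER ALEF, digit 1
second = "A1\u05d0"  # A, digit 1, HEBREW LETTER ALEF

print(first == second)       # False
print(bidi_classes(first))   # ['L', 'R', 'EN']
print(bidi_classes(second))  # ['L', 'EN', 'R']
```

The two line counts disagree in the same way a reviewer's display and a compiler can disagree, and the bidi classes show why the stored order of the two identifiers differs even though they contain the same characters.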
Such code might not even be malicious, and might just be very hard to
understand: it is all too easy to accidentally give reviewers (or even the
writer!) the wrong impression, leading to hidden software bugs.
The earlier work on spoofing identifiers was relevant to this work,
but did not explicitly deal with the environment surrounding software
development. Moreover, the guidance was aimed at internationalization experts,
not programming language and software tooling developers.
Process
In response to this problem, the Consortium started a project in early 2022
to assemble a cross-functional group of experts in Unicode processing,
programming languages, and software development tooling. That project resulted
in the Source Code Working Group (SCWG), which worked through the possible
problems.
The first results of this group were a number of enhancements to
core Unicode specifications in September of 2022.
- UAX #9 provided an extended example of the use of the important higher-level
protocol HL4, and emphasized its use to mitigate misleading bidirectional
ordering of source code, including potential spoofing attacks.
- UAX #31 provided important guidance on profiles for default identifiers, and
clarified that the requirement on Pattern_White_Space and Pattern_Syntax
characters applies to programming languages and is relevant to issues of
bidirectional ordering and potential spoofing attacks.
Impact
The final output of the group is Unicode Technical Standard #55, Source Code
Handling. This new specification brings
together in one place a description of the problems specific to source code,
together with guidance and best practices for programming language and software
tooling developers. Many of the APIs necessary for supporting those best
practices were already specified and implemented in ICU, Unicode’s software
library that is already in all modern operating systems. However, one new useful
API has been added to ICU, and will be released in October 2023. This is the new
bidiSkeleton function, used to detect identifiers such as Aא1 above.
Coordinated security-related updates have been made to UAX #9, Unicode
Bidirectional Algorithm, and UAX #31, Unicode Identifiers and Syntax, along
with updates to UTS #39, Unicode Security Mechanisms.
This work would not have been possible without the set of dedicated
and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice
chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine
“whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed
Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler,
Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning,
Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam,
Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more
details on their contributions, see Acknowledgements.
Having completed its main task, the SCWG is formally being retired
— but we are keeping the list of participants in case we need to call on their
expertise in the future!