Why you should use wcwidth() to calculate character width

TL;DR

  • Character width depends on the font, and the Unicode Consortium does not provide explicit width definitions for all characters.
  • Characters with ambiguous widths are not limited to those defined as "Ambiguous (A)" in EastAsianWidth.txt. For example, "☀ (U+2600)" is defined as "Neutral (N)" in EastAsianWidth.txt, but it may be rendered full-width in a CJK font or an emoji font. This means that non-East Asian characters can have ambiguous widths too.
  • The character width tables in default locales are problematic for both CJK and non-CJK users. You can create a custom locale to define a better character width table.
  • wcwidth() respects the character widths defined by the locale. All TUI applications should consistently use wcwidth() to calculate character width instead of an embedded character width table. If character widths do not match between applications, the screen will break.

Introduction

Modern TUI applications provide a rich UI using Unicode characters beyond the ASCII range, such as Box Drawing (U+2500..U+257F). However, the difference between the character width expected by the application and the width actually rendered is often a problem. "East Asian Ambiguous Width" is well known as one cause of this problem, but it is not the only one. The problem can affect every user who uses Unicode (emoji are another cause).

Currently, an application has no way to tell the terminal which width to use when rendering a character, so the only option is to respect the character widths defined by the locale. Under that approach, users can create custom locales and define character widths that suit their environment: character widths are controlled by the user, not the application.

In this article, I explain why applications should use the locale-dependent character widths returned by wcwidth(), and how users can create custom locales.

Calculating character widths

In glibc, the character width table of the locale is generated automatically by utf8_gen.py from data defined by the Unicode Consortium.

It is well known that this character width table can cause problems in the environments of East Asian users, because characters whose "East Asian Width" is "Ambiguous" are treated as half-width. Therefore, some applications, such as Vim and iTerm2, have an option to treat these characters as full-width.

In fact, not all "Ambiguous" characters are full-width, even in CJK fonts. Even so, displaying half-width characters in full-width cells often works well (although it leaves extra white space). One exception is the Box Drawing characters (U+2500..U+257F). When these characters are drawn as full-width, the screen may break.

In addition, ambiguous-width characters also exist among the non-East Asian characters defined as "Neutral (N)" in EastAsianWidth.txt. For example, the following characters can be either full-width or half-width depending on the font (note that "-" represents the absence of a glyph):

Character  Code Point  Width (Consolas)  Width (Menlo)  Width (Meiryo)  Width (Hiragino Sans)
☀          U+2600      -                 1              2               2
☁          U+2601      -                 1              2               2
☂          U+2602      -                 1              2               2
☃          U+2603      -                 1              2               2
☄          U+2604      -                 1              2               -
Ambiguous width characters (Here is a list of all)
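
The East Asian Width property of these code points can be checked with Python's standard unicodedata module. The following is a minimal sketch: it only reports the property from the Unicode data bundled with the interpreter, and says nothing about how a particular font renders the glyph. The helper name east_asian_width_class is introduced here for illustration.

```python
import unicodedata

def east_asian_width_class(code_point: int) -> str:
    """Return the East Asian Width property ("N", "A", "W", ...) of a code point."""
    return unicodedata.east_asian_width(chr(code_point))

# The symbols from the table above, plus one Box Drawing character.
for cp in [0x2600, 0x2601, 0x2602, 0x2603, 0x2604, 0x2500]:
    print(f"U+{cp:04X} {east_asian_width_class(cp)}")
```

U+2600 is reported as "N" even though the table shows that two of the four fonts draw it full-width, while the Box Drawing character U+2500 is reported as "A".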

Therefore, to draw characters correctly on the screen, we need a more flexible way to set character widths. We already have one: the locale.

Using a custom locale

On Linux and macOS (BSD), an application can get the width of a character as defined in the locale by calling wcwidth(). You can therefore adjust the behavior of applications that call wcwidth() by using a custom locale.
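
As a rough illustration of what "calling wcwidth()" means, the sketch below reaches the C library's wcwidth() from Python through ctypes. It assumes a POSIX system where ctypes.CDLL(None) exposes libc; the helper names char_width and string_width are introduced here and are not part of any library. The values printed for non-ASCII characters depend entirely on the active locale.

```python
import ctypes
import locale

# Use the locale from the environment (LANG, LC_ALL, ...), as a C program
# would with setlocale(LC_ALL, ""). Keep the current locale if that fails.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass

# CDLL(None) exposes the symbols already loaded into the process,
# which includes the C library on Linux and macOS.
_libc = ctypes.CDLL(None)
_libc.wcwidth.argtypes = [ctypes.c_wchar]
_libc.wcwidth.restype = ctypes.c_int

def char_width(ch: str) -> int:
    """Width of one character according to the current locale (-1 if non-printable)."""
    return _libc.wcwidth(ch)

def string_width(text: str) -> int:
    """Sum of the widths of the printable characters in text."""
    return sum(max(char_width(ch), 0) for ch in text)

if __name__ == "__main__":
    # "a", BOX DRAWINGS LIGHT HORIZONTAL, BLACK SUN WITH RAYS
    for ch in "a\u2500\u2600":
        print(f"U+{ord(ch):04X} wcwidth={char_width(ch)}")
```

Running this under different locales (for example the default en_US.UTF-8 and a patched one) is a quick way to see which widths the locale actually defines.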

However, many applications use embedded character width tables that are independent of the locale. As a result, applications may disagree with one another about character widths, and the inconsistency often causes problems. Therefore, all applications should consistently use wcwidth() to calculate character width.
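
To make the failure mode concrete, here is a toy sketch (not taken from any of the applications mentioned in this article) in which an application and a terminal disagree about a single character. The tables APP_WIDTHS and TERMINAL_WIDTHS and the helpers end_column and pad_to are made up for illustration.

```python
# Hypothetical width tables: the application believes U+2600 is half-width,
# while the terminal renders it full-width.
APP_WIDTHS = {"\u2600": 1}
TERMINAL_WIDTHS = {"\u2600": 2}

def end_column(text: str, widths: dict) -> int:
    """Column reached after writing text, assuming width 1 for unlisted characters."""
    return sum(widths.get(ch, 1) for ch in text)

def pad_to(text: str, columns: int, widths: dict) -> str:
    """Pad text with spaces so that it fills `columns` cells according to `widths`."""
    return text + " " * max(columns - end_column(text, widths), 0)

# The application pads a cell to 10 columns using its own table ...
cell = pad_to("\u2600 sunny", 10, APP_WIDTHS)
# ... but the terminal advances the cursor according to its own widths.
print(end_column(cell, APP_WIDTHS))       # columns the application expects
print(end_column(cell, TERMINAL_WIDTHS))  # columns the terminal actually uses
```

The cell that the application believes is 10 columns wide occupies 11 columns on the terminal, so everything drawn after it on that line, such as a border, is shifted by one column.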

The way to create a custom locale differs between Linux and BSD. Scripts to do this are available in the locale-patchers repository (https://github.com/emonkak/locale-patchers).

You can create custom locales for Linux and macOS with the default config using those scripts as follows:

git clone https://github.com/emonkak/locale-patchers.git
cd locale-patchers
# Download the latest "UTF-8" charmap and "en_US.UTF-8.src" ctype,
# and then create "UTF-8-PATCHED" and "en_US.UTF-8-PATCHED.src".
make

To use the custom locale on Linux, run the following command:

sudo localedef -i en_US -c -f UTF-8-PATCHED en_US.UTF-8

Or overwrite the existing locale:

gzip -c UTF-8-PATCHED | sudo dd of=/usr/share/i18n/charmaps/UTF-8.gz
sudo sh -c 'cd / && locale-gen'

If you are using macOS, follow these steps:

mkdir -p ~/.locale/UTF-8
mklocale -o ~/.locale/UTF-8/LC_CTYPE en_US.UTF-8-PATCHED.src
cat > ~/Library/LaunchAgents/setup-locale.plist <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>setup-locale</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/launchctl</string>
    <string>setenv</string>
    <string>PATH_LOCALE</string>
    <string>/Users/YOUR_USER_NAME/.locale</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
launchctl load -w ~/Library/LaunchAgents/setup-locale.plist

I recommend you follow those steps to unify the character width table of the locale on all systems you SSH into. Mismatched character widths can cause the screen to break.

The default config makes the following changes to the character width table (of course, you can change them). In this config, some blocks are set to full-width regardless of their East Asian Width property:

Code Start  Code End  Width  Note
U+2010      U+2027    2      General Punctuation (HYPHEN ... HYPHENATION POINT)
U+2030      U+205E    2      General Punctuation (PER MILLE SIGN ... VERTICAL FOUR DOTS)
U+25A0      U+25FF    2      Geometric Shapes
U+2600      U+26FF    2      Miscellaneous Symbols
U+2700      U+27BF    2      Dingbats
U+FFFC      U+FFFD    2      Specials (OBJECT REPLACEMENT CHARACTER, REPLACEMENT CHARACTER)
U+1F000     U+1F02F   2      Mahjong Tiles
U+1F030     U+1F09F   2      Domino Tiles
U+1F0A0     U+1F0FF   2      Playing Cards
U+1F100     U+1F1FF   2      Enclosed Alphanumeric Supplement
U+1F200     U+1F2FF   2      Enclosed Ideographic Supplement
U+1F300     U+1F5FF   2      Miscellaneous Symbols and Pictographs
U+1F600     U+1F64F   2      Emoticons
U+1F650     U+1F67F   2      Ornamental Dingbats
U+1F680     U+1F6FF   2      Transport and Map Symbols
U+1F700     U+1F77F   2      Alchemical Symbols
U+1F780     U+1F7FF   2      Geometric Shapes Extended
U+1F800     U+1F8FF   2      Supplemental Arrows-C
U+1F900     U+1F9FF   2      Supplemental Symbols and Pictographs
U+1FA00     U+1FA6F   2      Chess Symbols
U+1FA70     U+1FAFF   2      Symbols and Pictographs Extended-A
U+1FB00     U+1FBFF   2      Symbols for Legacy Computing

This config solves some problems with the traditional approach of treating all East Asian ambiguous characters as full-width:

  • Box Drawing characters are kept half-width, so the screen does not break in TUI applications.
  • Some symbols (emoji), such as "☀ (U+2600)", which are non-East Asian characters, are changed to full-width. These symbols are usually rendered half-width on macOS, but it is better to make them uniformly full-width for interoperability with Linux (consider connecting to Linux from macOS via SSH).
  • Some Latin and Cyrillic characters are kept half-width, so the config is also suitable for non-CJK users.
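
The effect of this config can be sketched as a range lookup layered over a base width function. This is only an illustration of the table above, not the code of the patch scripts. The names FULL_WIDTH_RANGES, base_width and patched_width are introduced here, and base_width is a deliberately simplified stand-in for the locale's default table (it ignores zero-width and non-printable characters).

```python
import bisect
import unicodedata

# Ranges forced to full-width by the default config (copied from the table above).
# They are sorted by start and do not overlap.
FULL_WIDTH_RANGES = [
    (0x2010, 0x2027), (0x2030, 0x205E), (0x25A0, 0x25FF), (0x2600, 0x26FF),
    (0x2700, 0x27BF), (0xFFFC, 0xFFFD), (0x1F000, 0x1F02F), (0x1F030, 0x1F09F),
    (0x1F0A0, 0x1F0FF), (0x1F100, 0x1F1FF), (0x1F200, 0x1F2FF), (0x1F300, 0x1F5FF),
    (0x1F600, 0x1F64F), (0x1F650, 0x1F67F), (0x1F680, 0x1F6FF), (0x1F700, 0x1F77F),
    (0x1F780, 0x1F7FF), (0x1F800, 0x1F8FF), (0x1F900, 0x1F9FF), (0x1FA00, 0x1FA6F),
    (0x1FA70, 0x1FAFF), (0x1FB00, 0x1FBFF),
]
_STARTS = [start for start, _ in FULL_WIDTH_RANGES]

def base_width(ch: str) -> int:
    """Simplified stand-in for the unpatched table: Wide/Fullwidth -> 2, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def patched_width(ch: str) -> int:
    """Width of ch after applying the full-width overrides."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= FULL_WIDTH_RANGES[i][1]:
        return 2
    return base_width(ch)

print(patched_width("\u2600"))  # BLACK SUN WITH RAYS, inside an overridden block
print(patched_width("\u2500"))  # Box Drawing, not overridden
```

Box Drawing characters and "Ambiguous" Latin or Cyrillic letters fall through to the base table and stay half-width, while every code point in the listed blocks becomes full-width.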

Patches to use wcwidth()

Applications calculate character widths in different ways. I would like all applications to provide a way to use the character width table in the locale, but in reality, many applications use embedded character width tables. I wrote patches to force those applications to use wcwidth() (on Gentoo Linux, there is a useful way to automatically apply patches in a specific directory when building packages).

Application  Use wcwidth()  Patch  Related Issues
Alacritty    No             Link   #1295
iTerm2       No             Link
Neovim       No             Link
Vim          No [1]         Link   #4380
Zsh          No [2]         Link
tmux         Yes [3]

Conclusion

We can adjust the character widths drawn in the terminal by using a custom locale. However, for this to work correctly, applications must consistently use wcwidth() to get character widths from the locale. I believe that all TUI applications should provide a way to use wcwidth().

If an application has an embedded character width table, that table must also be updated every time Unicode is updated. This should be the responsibility of the locale, not the application. It makes no sense for TUI application developers to deal with symbols and emoji not being rendered correctly. Leave it to the locale!