[00/11] vt: implement proper Unicode handling

Message ID 20250410011839.64418-1-nico@fluxnic.net

Message

Nicolas Pitre April 10, 2025, 1:13 a.m. UTC
The Linux VT console has many problems with regard to proper Unicode
handling:

- Double-width Unicode code points introduced after Unicode 5.0 are not
  recognized as such (we're at Unicode 16.0 now).

- Zero-width code points are not recognized at all. If you try to edit files
  containing a lot of emojis, you will see rendering issues. When there are
  many zero-width characters (like "variation selectors"), long lines get
  wrapped, but any Unicode-aware editor assumes that the content was
  rendered properly, and its rendering logic then goes badly wrong. Combine
  this with tmux or screen, and the terminal ends up a huge mess.

- Text that uses combining diacritics suffers from the same problem as text
  with zero-width characters: programs expect the characters to take fewer
  columns than they actually do.
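
To make the mismatch concrete, here is a small Python sketch (not part of
this series). It compares the number of code points in a string with the
number of columns a Unicode-aware program would expect it to occupy. The
helper name expected_columns is hypothetical, and the width rule is a
simplified wcwidth-style approximation based on general category and East
Asian Width, not the exact rule any particular editor uses.

```python
import unicodedata


def expected_columns(text: str) -> int:
    """Approximate the column count a Unicode-aware program expects.

    Assumption: combining marks (Mn, Me) and format characters (Cf)
    take zero columns, East Asian Wide/Fullwidth take two, and
    everything else takes one.
    """
    cols = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # zero width: attaches to the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cols += 2  # double width
        else:
            cols += 1
    return cols


# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one column.
sample = "e\u0301"
print(len(sample), expected_columns(sample))  # prints "2 1"
```

A console that advances the cursor once per code point would use two
columns for that sample, while the program driving it accounts for one.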

Some may argue that the Linux VT console is unmaintained and/or no longer
used much, and that one should consider a user space terminal alternative
instead. But every such alternative that is at least as well maintained as
the Linux VT console requires a full, heavy graphical environment, which is
the exact antithesis of what the Linux console is meant to be.

Furthermore, there is a significant Linux console user base made up of
blind users (of which I'm a member), for whom the alternatives are far more
cumbersome to use, reducing our productivity. So the console has to stay
and be maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
  and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
  to properly use it.

- A table with base character + combining mark pairs is created to convert
  them into their precomposed equivalents when they're encountered.
  By default the generated table contains most commonly used Latin, Greek,
  and Cyrillic recomposition pairs only, but one can execute the provided
  script with the --full argument to create a table that covers all
  possibilities. Combining marks that are not listed in the table are simply
  treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
  bytes of text, which is not very significant these days. Still, one can
  set CONFIG_CONSOLE_TRANSLATIONS=n to configure all of this out if need
  be.
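
As a rough illustration of the approach summarized above, the following
Python sketch derives range tables and recomposition pairs from the
unicodedata module. It is not the code from gen_ucs_width.py or
gen_ucs_recompose.py. All function names, the struct name ucs_range, and
the choice of categories treated as zero-width are assumptions made for
the example. It also ignores Unicode's composition exclusions, which a
real generator would likely need to consider.

```python
import unicodedata


def collect_ranges(predicate, limit=0x110000):
    """Merge consecutive matching code points into (first, last) ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        if predicate(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def is_double_width(ch):
    # East Asian Wide and Fullwidth characters occupy two columns.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def is_zero_width(ch):
    # Assumption: combining marks and format characters are zero width.
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")


def recomposition_pairs(limit=0x110000):
    """Map (base, combining mark) to the precomposed code point."""
    pairs = {}
    for cp in range(limit):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = [int(p, 16) for p in decomp.split()]
        if len(parts) == 2 and unicodedata.combining(chr(parts[1])):
            pairs[(parts[0], parts[1])] = cp
    return pairs


def emit_c_ranges(name, ranges):
    """Render ranges as a C array; struct ucs_range is hypothetical."""
    lines = [f"static const struct ucs_range {name}[] = {{"]
    lines += [f"\t{{ 0x{a:04X}, 0x{b:04X} }}," for a, b in ranges]
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Small limits keep the demonstration fast.
    print(emit_c_ranges("zero_width", collect_ranges(is_zero_width, 0x400)))
    print(len(recomposition_pairs(0x250)), "Latin recomposition pairs")
```

Storing ranges rather than individual code points keeps the tables small,
since most wide and zero-width characters come in contiguous blocks.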

Note: The generated C code makes scripts/checkpatch.pl complain about
      "... exceeds 100 columns" because the inserted comments with code
      point names, well, make some lines exceed 100 columns. Please make
      an exception for those files and disregard those warnings. When
      checkpatch.pl is run directly on those files with -f, it doesn't
      complain.

This series was tested on top of v6.15-rc1.

diffstat:

 drivers/tty/vt/Makefile             |   3 +-
 drivers/tty/vt/gen_ucs_recompose.py | 321 ++++++++++++++++++
 drivers/tty/vt/gen_ucs_width.py     | 336 +++++++++++++++++++
 drivers/tty/vt/ucs_recompose.c      | 170 ++++++++++
 drivers/tty/vt/ucs_width.c          | 536 ++++++++++++++++++++++++++++++
 drivers/tty/vt/vt.c                 | 111 ++++---
 include/linux/consolemap.h          |  18 +
 7 files changed, 1448 insertions(+), 47 deletions(-)

Comments

Greg KH April 11, 2025, 2:49 p.m. UTC | #1
On Wed, Apr 09, 2025 at 09:13:52PM -0400, Nicolas Pitre wrote:
> The Linux VT console has many problems with regards to proper Unicode
> handling:

Wow, very nice work, thanks for doing all of this.  I'll go queue it up
now; the kernel test robot warnings for comments can be fixed up later
if you want to.

greg k-h