unicode: introduce UTF-8 character database
author Gabriel Krisman Bertazi <krisman@collabora.com>
Thu, 25 Apr 2019 17:38:44 +0000 (13:38 -0400)
committer Theodore Ts'o <tytso@mit.edu>
Thu, 25 Apr 2019 17:38:44 +0000 (13:38 -0400)
commit 955405d1174eebcd1b89ab335f720adc27d52b67
tree df420b2703e110c3ac1cc51f508918500f30715f
parent 310a997fd74de778b9a4848a64be9cda9f18764a

The decomposition and casefolding of UTF-8 characters are described in a
prefix tree in utf8data.h, which is generated from the Unicode
Character Database (UCD), published by the Unicode Consortium, and
should not be edited by hand.  The structures in utf8data.h are meant to
be used for lookup operations by the unicode subsystem when decoding a
UTF-8 string.
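
As a rough illustration of the idea only, a prefix tree keyed on UTF-8
bytes can be sketched in Python.  This is not the layout of utf8data.h;
the names build_trie and lookup, the nested-dict representation, and the
choice to store NFD of the casefolded character at each leaf are all
assumptions made for this example.

```python
import unicodedata


def build_trie(codepoints):
    """Build a toy nested-dict trie keyed on UTF-8 bytes.

    Each leaf (stored under the key None) holds the UTF-8 encoding of
    the casefolded, NFD-decomposed character.  Hypothetical layout,
    unrelated to the generated tables in utf8data.h.
    """
    root = {}
    for cp in codepoints:
        ch = chr(cp)
        node = root
        for byte in ch.encode("utf-8"):
            node = node.setdefault(byte, {})
        folded = unicodedata.normalize("NFD", ch.casefold())
        node[None] = folded.encode("utf-8")
    return root


def lookup(trie, data, pos=0):
    """Walk the trie starting at data[pos].

    Returns (replacement_bytes, bytes_consumed) when a stored character
    is found, or None when the bytes at pos are not in the trie.
    """
    node = trie
    i = pos
    while i < len(data) and data[i] in node:
        node = node[data[i]]
        i += 1
        if None in node:
            return node[None], i - pos
    return None
```

Because no valid UTF-8 sequence is a prefix of another, the walk can
stop at the first leaf it reaches; a miss simply means the character
needs no special handling in this toy table.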

mkutf8data.c is the source for a program that generates utf8data.h. It
was written by Olaf Weber from SGI and originally proposed to be merged
into Linux in 2014.  The original proposal performed the compatibility
decomposition, NFKD, but the current version was modified by me to do
canonical decomposition, NFD, as suggested by the community.  The
changes from the original submission are:

  * Rebase to mainline.
  * Fix out-of-tree build.
  * Update Makefile to build 11.0.0 UCD files.
  * Drop references to xfs.
  * Convert NFKD to NFD.
  * Merge back robustness fixes from original patch. Requested by
    Dave Chinner.
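
The difference between the two decompositions mentioned above can be
observed with Python's standard unicodedata module.  This is only a
sketch of the NFD/NFKD distinction, not of how mkutf8data.c implements
it; the helper names nfd and nfkd are made up for this example.

```python
import unicodedata


def nfd(s):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)


def nfkd(s):
    """Compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", s)


# U+00E9 LATIN SMALL LETTER E WITH ACUTE has a canonical decomposition,
# so both forms split it into "e" followed by U+0301 COMBINING ACUTE ACCENT.
e_acute = "\u00e9"

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition,
# so NFD leaves it alone while NFKD rewrites it as the two letters "fi".
fi_ligature = "\ufb01"

for label, s in (("e-acute", e_acute), ("fi ligature", fi_ligature)):
    print(label, [hex(ord(c)) for c in nfd(s)], [hex(ord(c)) for c in nfkd(s)])
```

With NFD, names that differ only in compatibility characters such as
ligatures remain distinct, whereas NFKD would make them compare equal.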

The original submission is archived at:

<https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>

The utf8data.h file can be regenerated using the instructions in
fs/unicode/README.utf8data.

- Notes on the update from 8.0.0 to 11.0.0:

The structure of the UCD files and special cases have not experienced
any changes between versions 8.0.0 and 11.0.0.  8.0.0 saw the addition
of Cherokee lowercase (LC) characters, which is an interesting case for
case-folding.  The update is accompanied by new tests in the test_ucd
module to catch specific cases.  No changes to the mkutf8data script
were required for the updates.
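
One way to see why Cherokee stands out: in the Unicode case folding
data, the lowercase Cherokee letters added in 8.0.0 fold to their
uppercase counterparts, the reverse of most scripts.  The sketch below
uses Python's built-in str.casefold and assumes an interpreter whose
Unicode tables are at version 8.0.0 or later; the helper names fold and
same_name are hypothetical and unrelated to the test_ucd module.

```python
import unicodedata


def fold(s):
    """Casefold a string and return it in NFD form."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_name(a, b):
    """Case-insensitive, normalization-insensitive comparison."""
    return fold(a) == fold(b)


cherokee_capital_a = "\u13a0"  # CHEROKEE LETTER A
cherokee_small_a = "\uab70"    # CHEROKEE SMALL LETTER A

# Both letters fold to the same code point, so the names match.
print(hex(ord(fold(cherokee_small_a))), hex(ord(fold(cherokee_capital_a))))
print(same_name(cherokee_small_a, cherokee_capital_a))
```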

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fs/Kconfig
fs/Makefile
fs/unicode/Kconfig [new file with mode: 0644]
fs/unicode/Makefile [new file with mode: 0644]
fs/unicode/README.utf8data [new file with mode: 0644]
fs/unicode/utf8data.h [new file with mode: 0644]
scripts/Makefile
scripts/mkutf8data.c [new file with mode: 0644]