Feature Wiki

Information about planned and released features

Tabs

Database: UTF8MB4 as default Character Set

1 Initial Problem

As discussed in several Mantis-tickets (#13294, #21888, #12860) using Characters with more than three bytes cannot be stored in the database using MySQL (see https://stackoverflow.com/questions/30074492/what-is-the-difference-between-utf8mb4-and-utf8-charsets-in-mysql ) with utf8_general_ci (current standard in ILIAS).

As ILIAS is often used with Smartphones and Smartphone-SUers often use Emojis, problems while storing those strings to database will occur (see tickets).

This can be "fixed" by implementing a filter as discussed in #12860, but therefore ILIAS won't support Emojis.

According to the MySQL 8.0 documentation, utf8mb3 is deprecated and is expected to be removed in future MySQL versions. The upgrade to utf8mb4 should be realized early in order to be prepared for new DBMS versions and to be able to fix possible errors at an early stage when they occur.

2 Conceptual Summary

This is Work in Progress

Using utf8mb4_general_ci as default-collation would support storing Emojis, multiple asian character sets and other characters with more than three bytes. Filtering those would then be obsolete.

Changing this in a stable release is not recommended. There are several issues concerning this (thanks to Alex and Thomas, moved to Discussion).

Some of these issues are not relevant with current versions of DBMS anymore but should still be discussed.

3 User Interface Modifications

3.1 List of Affected Views

No views need to be changed since it affects the backend.

3.2 User Interface Details

-

3.3 New User Interface Concepts

-

4 Technical Information

install.md recommends MariaDB 10.3 or later or MySQL 8.0.x for ILIAS 9 which uses InnoDB as default storage engine. According to the MariaDB documentation of InnoDB Limitations: Until MariaDB 10.3.1, large prefixes could be enabled with the innodb_large_prefix system variable. From MariaDB 10.3.1, InnoDB always uses large index key prefixes. This allows index keys up to 3072 Bytes for DYNAMIC and COMPRESSED row formats - enough for most of core ILIAS tables.

Discussion is needed: Should other DBMS like MyISAM still be supported. DIfferent handling is necessary then.

Regarding the number of characters: Columns with text have character_maximum_length and character_octet_length. During my testing of conversion from utf8mb3 to utf8mb4, the maximum_length (which is the amount of characters) stays the same, just the octet_length increases. Might require further testing.
One problem arised though in two existing tables because the total number of bytes for a row exceeds the maximum of 65.535 bytes after converting to utf8mb4: "crs_settings" and "event". Both contain 4 VARCHARs with 4000 characters, leading to 64.000 bytes - with more columns adding up to the limit. One solution might be the conversion of all or some of these columns to type TEXT since these are stored separately, but requires the feedback of the responsible maintainer.

After applying the conversion of the database alone, some errors occur that needs further modification in ILIAS, especially string handling and serialisation and deserialisation of existing data in the DB.

Further work is needed by maintainers. Modules and Services should be checked for new  issues with the database and updated accordingly.

5 Contact

6 Funding

If you are interest in funding this feature, please add your name and institution to this list and get in contact with Schmid, Fabian [fschmid].

  • ILIAS Society could contribute (some) funds

7 Discussion

// moved from Conceptual

I'd like to add that using utf8mb instead of utf8 imposes different limits on MyISAM and InnoDB indexes without making further changes [*1]. 

Indexes in MyISAM are limited to 1000 bytes [*2], which fits 332 three-byte (utf8) characters, but only 250 four-byte (utf8mb) characters, 
while InnoDB indexes are limited to 767 bytes [*3] meaning 255 three-byte (utf8) characters or 191 (utf8mb) chracters.

This means we would need to change character limits for (VAR)CHAR indexes that are longer then 250 when still supporting MyISAM.
For InnoDB we are in more luck, since larger index support can be enabled, see [*1].

Switching from utf8 to utf8mb also implicitly changes the amount of characters than can be stored inside TINYTEXT (!), TEXT, MEDIUMTEXT and LONGTEXT, which are limited in bytes not characters [*4].


[*1]: InnoDB has support for indexes up to 3072 bytes by enabling 'innodb_large_prefix', 
which was introcuded in version 5.5.14 (default OFF) and changed to default ON with 5.7.7, see:
- https://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix [^]
- https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_large_prefix [^]
[*2]: Source - https://dev.mysql.com/doc/refman/5.5/en/myisam-storage-engine.html [^]
[*3]: Source - https://dev.mysql.com/doc/refman/5.7/en/innodb-restrictions.html [^]
[*4]: Source - https://stackoverflow.com/a/23169977/3012851 [^]


Further reading and a more in-depth guide can be found here:
https://mathiasbynens.be/notes/mysql-utf8mb4 [^]


******

Since utf8mb4 support seems to be more a feature request, the question is how to tackle the current MySQL behaviour. Two things:

1. There are ways to filter out 4byte characters, see https://stackoverflow.com/questions/8491431/how-to-replace-remove-4-byte-characters-from-a-utf-8-string-in-php?noredirect=1&lq=1 [^]

We could try do this within our database service or the consuming components, like in this case the page editor component could do this. It will cost performance anyway. 

2. I noticed that my MariaDB 10.2.9 does not simply truncate at the first 4byte character like MySQL (which is a disaster for the page editor, since it breaks the pages). It replaces these 4byte characters with ????. I wonder whether this is some configuration, but did not find it. But I think this is exactly the behaviour we want (if still using "3byte" utf8). In the case of the page editor it keeps the XML structure intact.

Kruse, Fabian [Fabian], 27.04.2022: Desite the mentioned bugs having been fixed in the meantime thorugh work-arounds, the problem persists in ILIAS. Among other things, it is preventing usage of emojis in our system. From my point of view, UTF8MB4 should be supported as soon as possible.

Roeser, Nico [nicoroeser], 2023-03-20: very loosely related: Abandon Emoticons in Repository Chat and OnScreen-Chat. That other possibility to use emoji is scheduled for removal, so users may be happy to be able to use “regular” Unicode emoji.

Schmid, Fabian [fschmid], 2024-05-03: As the maintainer of the "database" components, I cannot support the feature request as it is currently drafted. I very much welcome the general idea and am also of the opinion that we must continue to pursue the idea of delivering ILIAS with MB4 as the default, but also offer an option for migration. Introducing the new methods as proposed in https://github.com/ILIAS-eLearning/ILIAS/pull/5943 is not enough. As I have already mentioned in various places (Teammeeting bw.ILIAS, PR, discussion in Discord, direct contact), the database is only the first layer. PHP follows after that. For example, it is about the use of string functions, which must be checked whether the mb_* alternatives must be used here. In the course of this, such string operations would have to be centralised (Refinery/Data) in order, for example, to replace old - now outdated - helpers (e.g. ilStr). There should also be a strategy for how native string functions are no longer used in future (e.g. via PHPStan rules?).
It is also about field lengths in the database (for example, you yourself have identified fields where the primray key no longer works). In addition, there is no general strategy on how to proceed with MB3 in the future (will all installations with ILIAS 10 simply be converted to MB4? can MB3 still be used? ...) and how do we deal with errors related to this changeover in the future? I assume that there will be a great many in the various components. I simply switched ILIAS to MB4 as a test, and relatively little works... The maintainers need to know what to expect here. "Further work is needed by maintainers. Modules and Services should be checked for new issues with the database and updated accordingly." is a bit short here.

I think you should consider registering the topic as a BigProject. In my opinion, practically all maintainers will be needed here, the procedure must be coordinated and discussed and this feature request alone is not enough.

In my opinion, the feature request does not fulfil the requirements for the JF. "This is Work in Progress" is an indication of this, the "Technical Information" section is very general in the sense of "one would have to".

One more note: the trunk has been open for ILIAS 10 since 30 Oct 2023, this step would be preferable to take early in development. Now there are only 5 months left for development, many maintainers are already working at full capacity. I had already given feedback on the initial PR on 23 June 2023, unfortunately I never received any feedback (https://github.com/ILIAS-eLearning/ILIAS/pull/5943#issuecomment-1603833453). I simply think that the time is too far advanced for the potential size of the project.

8 Implementation

https://github.com/ILIAS-eLearning/ILIAS/pull/5943

Changes the character set and collation to UTF8MB4 if the DB does not already use this.

Changes the columns of the tables mentioned in the "Technical information" that require special attention ("event" and "crs_settings") by converting them to the TEXT type

Iterates over every table in the database and changes the character set and collation to UTF8MB4 if the table does not already use this.

Test Cases

Test cases completed at {date} by {user}

  • {Test case number linked to Testrail} : {test case title}

Approval

Approved at {date} by {user}.

Last edited: 3. May 2024, 11:02, Schmid, Fabian [fschmid]