Feature Wiki

Information about planned and released features

Tabs

Generalized Management Functions for Referential Integrity

1 Initial Problem

This feature request is part of FR Project: Establish Referential Integrity.
In order to enable maintainers of modules and services to establish Referential Integrity (RI) in the Database Management System (DBMS), supporting functions MUST be implemented, provided by the database abstraction layer.

In addition, a base strategy how RI can be provided to an "old grown ILIAS instance" is discussed here to depict the context of the RI management functions.

2 Conceptual Summary

2.1 Updating strategy

The basic strategy for implementing 'a portion of RI for a Component or a Service' MAY consist of these parts:
(a 'portion of RI' consist of at least one new or altered FK, plus the neccessary adaptations of PHP code)

in resposibility of the maintainer

(1)  providing a diagnostic tool to check existing data for potential violations of the upcoming RI for a component - see: Database: Detecting Violations for Referential Integrity

(2) providing an ILIAS\Setup\MigrationObjective to implement all pending FKs for this component
Each FK gets an own numerated step in the objective.  The step number of the last succesfull creation of a FK SHOULD be stored in a RI-dedicated new table to persist the 'RI-Version' for each Component or Service - see "Discussion".
The objective SHOULD take use of the component own metrics - see (1) - to ensure the mandatory data integrity.
In the case the data integrity check fails, the setup 'migrate' command SHOULD be extended to accept one or more options to be transferred to the objective - see "Discussion". This option(s) are used to control the fixing of violating data.
If the data integrity check succeeds, the FKs can be established using the General Management Functions described below.

(3) enabling the corresponding PHP code of the component to work with AND without this 'portion of RI' having established in DBMS.
The PHP code MUST be able to work consistently for both scenarios:  there are pending FKs or all FKs (from part (2)) have been established

So, for a while, different code sequences for potentially every data manipulating action MAY be neccessary.  The decision on runtime which sequence MUST be performed is made by using the current 'RI-version' of the component mentionend in (2).

in responsibilty of the adminstrator of an ILIAS instance

(4) ensuring the data integrity concerning the Component or Service to be receiving FKs
For checking the integrity the admin SHOULD use (1) from above:  performing command 'php setup/cli.php status --show_RI'
('show_RI' MAY be a new option for the status command to force the evaluation of data integrity for pending FKs).

If the check shows up no problems concerning the data integrity, this part is fullfilled (for this very moment).

If there are problems shown, the existing data must be fixed!
Which actions are possible or which actions are preferred MUST be proposed by the maintainer per FK in the referenced documentation,  see (1) above.
Possible actions CAN be special updates per SQL to fix some or all problematic entries,
or the maintainer nominates some migrate-command option(s) from section "Migrate-Command Options" below.

(5) establish the complete 'portion of RI'
This CAN be done by performing the migrate command for the MigrationObjective in (2).

Sequences of implementation and use
The steps (1) to (5) above MAY be available (respectively executed) in two alternative timelines:

1.   providing diagnostics AND foreign keys at once
The maintaner provides parts (1) to (3) in one major/minor release, then the administrator(s) can execute parts (4) and (5) at any time after checking out the release.

Sequence: 
(1) > (2) > (3),   then multiple: (4) > (5)

2.  providing diagnostics AND foreign keys in separate releases
The maintainer provides part (1) - the diagnostics - to 'quickly' enable all administrators to check their databases for violations.
He, the maintainer, MAY use this also to gather informations "from the wild" regarding the intended FKs...

Sequence:
"first" release: (1) > multiple (4)
"later" release: (2) > (3),   then multiple: (4) > (5)

2.1.1 Migrate-CommandOptions

The setup's migrate command SHOULD be extended to transfer additional information to an MigrationObjective - see "Discussion".
This additional information MAY enable the admin to easily control what shall be done with the violating entries in behalf of the next pending FK.

Option A  (create csv dump):
A CSV dump of the violating entries is generated as a file in folder error_log_path, the filename SHOULD contain at least:

  • the name or type of the Component or Service
  • the FK step number
  • a timestamp
Option B  (copy to violation table):
The violating entries are copied to a to be created or reused 'violation table'. The table name MAY be  ri_<component-or-service-nickname>_<step-nr>_violation.
The table COULD be created using a CREATE TABLE ri_..._violation LIKE ... directive.  If the source table already has existing foreign keys, those foreign keys MUST NOT be established for the violation table!
If the violation table exists already (from a former run of this certain step), it should be emptied (e.g. with TRUNCATE) before inserting the actual identified violations.

Option C  (remove violating entries):
The violating entries are deleted from the source table. This puts the table in the required state to perform the corresponding pending FK.

2.2 Required Management Functions

Like existing abstract functionality to manipulate the database structure in scope of a table, its indexes or its fields without knowing the particular kind of DBMS being used,
equivalent functions for RI are REQUIRED to:

  1. check the existance of a certain FK
  2. check table content for potential violation of an intended FK
  3. create a new FK
  4. drop a FK

Altering a foreign key is limited to changes of ON DELETE and ON UPDATE rules - this kind of change is very unlikely to happen. A separate function for altering an existing key is NOT REQUIRED therefore - should this situation arise nevertheless, a change can be achieved by dropping and recreating the foreign key.

2.2.1 Detecting Violations Upfront

To detect possible issues with an existing ILIAS databsase in advance Database: Detecting Violations for Referential Integrity MUST be implemented by the demanding component.

2.3 'Mail' - A Sample Scenario

The pure technical goals to be achieved by this scenario are:

  • etsablish RI in data context 'mail'
  • consolidate data structures and field attributes of involved tables (if necessary or appropriate)
  • ensure proper deletion of:
    • all owned mails and associates (folder, attachments, settings, ..) when the owning user is about to be deleted
    • a user's mail folder including all contained mails
    • (one ore more) selected mails including its associates (attachments)
  • removement of former (normal) indexes which functionality are now redundant due to new indexes created to facilitate a certain FK
The scenario covers:
  • foreign keys (FK) concerning ONE and MORE THAN ONE column(s) in a certain FK
  • the necessity of various 'ON DELETE' rules
  • a self referencing FK
  • discovery of a suspected data redundancy
  • discovery of prerequisite data manipulation (implying corresponding code adaptions)
  • a scenario in which our productive ILIAS installation (NIC 1525) has in fact violating datarows - discovered while developing this scenario
The following sceanario steps are structured alike in a maintainers job later to establish RI in a module / a service.
So this scenario may become a part of the documentation for the maintainers.

In this step the semantic correlation of data in the considered context is to be transformed in a technical representation - in foreign keys.
The knowledge of 'semantic correlations' usually should be present in the maintainers mind - in this case (writing this scenario by ME, NOT the maintainer) the correlations are extracted by analysis of table structures, interpreting field names und comparing the value-sets of guessed correlations.
Therefore 'my' collection of foreign keys for 'Mail' may not be accurate, of course.

In abscence of the "Generalized Management Funktions to establish RI" currently, pure SQL is used to document the desired database changes.

The collection of FK's revealed as:

Table [mail]:
(A)  ALTER TABLE mail ADD CONSTRAINT IF NOT EXISTS fk_mail_user_id FOREIGN KEY (user_id) REFERENCES usr_data(usr_id) ON DELETE CASCADE ON UPDATE RESTRICT;
(B)  ALTER TABLE mail ADD CONSTRAINT IF NOT EXISTS fk_mail_sender_id FOREIGN KEY (sender_id) REFERENCES usr_data(usr_id) ON DELETE SET NULL ON UPDATE RESTRICT;
(C)  ALTER TABLE mail ADD CONSTRAINT IF NOT EXISTS fk_mail_folder_id FOREIGN KEY (folder_id, user_id) REFERENCES mail_obj_data(obj_id, user_id) ON DELETE CASCADE ON UPDATE RESTRICT;

Table [mail_obj_data]:
(D)  ALTER TABLE mail_obj_data ADD CONSTRAINT IF NOT EXISTS fk_mail_obj_data_user_id FOREIGN KEY (user_id) REFERENCES usr_data(usr_id) ON DELETE CASCADE ON UPDATE RESTRICT;

Table [mail_attachment]
(E)  ALTER TABLE mail_attachment ADD CONSTRAINT IF NOT EXISTS fk_mail_attachment_mail_id FOREIGN KEY (mail_id) REFERENCES mail(mail_id) ON DELETE CASCADE ON UPDATE RESTRICT; 

Table [mail_cron_orphaned]
(F)  ALTER TABLE mail_cron_orphaned ADD CONSTRAINT IF NOT EXISTS fk_mail_cron_mail_id FOREIGN KEY (mail_id) REFERENCES mail(mail_id) ON DELETE CASCADE ON UPDATE RESTRICT;
(G)  ALTER TABLE mail_cron_orphaned ADD CONSTRAINT IF NOT EXISTS fk_mail_cron_folder_id FOREIGN KEY (folder_id) REFERENCES mail_obj_data(obj_id) ON DELETE CASCADE ON UPDATE RESTRICT;

Table [mail_options]
(H)  ALTER TABLE mail_options ADD CONSTRAINT IF NOT EXISTS fk_mail_options_user_id FOREIGN KEY (user_id) REFERENCES usr_data(usr_id) ON DELETE CASCADE ON UPDATE RESTRICT;

Table [mail_saved]
(I)  ALTER TABLE mail_saved ADD CONSTRAINT IF NOT EXISTS fk_mail_saved_user_id FOREIGN KEY (user_id) REFERENCES usr_data(usr_id) ON DELETE CASCADE ON UPDATE RESTRICT;

Table [mail_tree]
(J)  ALTER TABLE mail_tree ADD CONSTRAINT IF NOT EXISTS fk_mail_tree_tree FOREIGN KEY (tree) REFERENCES usr_data(usr_id) ON DELETE CASCADE ON UPDATE RESTRICT;
(K)  ALTER TABLE mail_tree ADD CONSTRAINT IF NOT EXISTS fk_mail_tree_child FOREIGN KEY (child, tree) REFERENCES mail_obj_data(obj_id, user_id) ON DELETE CASCADE ON UPDATE RESTRICT;
(L)  ALTER TABLE mail_tree ADD CONSTRAINT IF NOT EXISTS fk_mail_tree_parent FOREIGN KEY (parent) REFERENCES mail_tree(child) ON DELETE CASCADE ON UPDATE RESTRICT;

As mentioned before, I didn't inspect PHP-code for this collection.  The table name [mail_cron_orphaned] lets assume it contains 'orphaned mails found by a cron job' - so possibly (F) and (G) are fairly contra productive  OR  the cron job and the table become obsolete after implementing RI in 'mail' ..?

FKs in context 'mail' (FK (L) is not drawn in this picture, it's a FK "inside" table mail_tree)

Update-Rules
The universal decision for an UPDATE rule in ILIAS SHOULD be: RESTRICT, which means: if the value in the superior (foreign) table is about to be changed, this very change would be rejected by the FK.
I assume that in no condition a certain value of the primary key (!) of a foreign table is ever altered, so no UPDATE event SHOULD happen at all on a table defining a FK.

Delete-Rules:
Usually a datarow in a subordinate table SHOULD be deleted too if the corresponding datarow in the foreign table is about to be deleted. As an Example: when a user is about to be deleted, all own mails have to be deleted (first). Consequently the default decision for DELETE rules is CASCADE.

Nicely (for the scenario) 'mail' has one case where this default is not desired:  the FK (B) ensures data integrity for the sender's ID in table [mail].
If the sender is to be deleted, no mail in a mail basket of another user SHOULD be deleted implicitly, however the reference to the erstwhile sender of this mails MUST be invalidated (set to NULL)!
Consequently the decision for the DELETE rule of (B) is SET NULL.

For some desired FK's modifications to the table structure or data may be necessary.

(1)
In this scenario column parent in table [tree] is such a candidate.
To be able to establish the FK of directive (L), all current values in column parent MUST be existing values in column child.  Obviously this isn't the case actually.
All 'childs' having a 'depth' of '1' (which means: this is the one and only 'root'-entry of the users mail folder tree) currently get a 'parent' value of '0' always.  This value '0' in parent is no valid child value and directive (L) would be rejected with error message:
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (CONSTRAINT `fk_mail_tree_parent` FOREIGN KEY (`parent`) REFERENCES `mail_tree` (`child`) ON DELETE CASCADE)

To establish FK (L) nonetheless:

  • existing data must be fixed prior to (L) by:  UPDATE mail_tree SET parent = NULL WHERE parent = 0;
  • all INSERTs or SELECTs into/from column parent in PHP code MUST be revised accordingly - see Step six
Fortunately column parent is declared as 'NULL' already (means: MAY be NULL), but this ability wasn't used yet.
Otherwise the column would have to be modified to accept a NULL value now.

This brings up an additional directive for the main job prior to (L):
(L-0) UPDATE mail_tree SET parent = NULL WHERE parent = 0;

(2)
The same correlation as in (1) matches on directive (B): invalidated sender_ids MUST be NULL, not '0',
(column sender_id is declared as NULL-able already).
... we have an additional directive prior to (B):
(B-0) UPDATE mail SET sender_id = NULL WHERE sender_id = 0;

(3)
btw.:  I see a general data redundancy in design of table [mail] concerning column user_id.
A certain mail is identified by its mail_id, it's "location" is identified by folder_id, and each folder_id in [mail_obj_data] (stored there as coulmn obj_id) gets a mandatory user_id associated.  There (in table [mail_obj_data]) the column user_id is essential for not to have a folder assigned 'by accident' to different users.
Conclusion: column user_id in table [mail] is the redundant one and SHOULD be removed.

This requirement is not considered further in this scenario, last but not least to keep directive (C) as an example for a 2-column FK  ;-).
When context 'mail' is processed in respect of RI sometime later by its maintainer "in reality", THIS topic (3) SHOULD be considered!

Every table getting one or more FK SHOULD be inspected for superflous indexes.
This are indexes having same column(s) like the index of a new FK or the "old" index have fewer columns but in same order like the new FK.  Follwing cases matches this criterias:

  1. both, an old index and a FK index, includes the same one column
  2. both, an old index and a FK index, includes the same multiple columns in the same order
  3. an old index includes just one column and the FK index has more then one column, but its first column is the same as the only column of the old index
In fact, using DBMS MariaDB cases 1 and case 2 are not relevant, because the DBMS reuses an existing key if it exactly matches the criterias for case 1 or 2.
Additionally (old) indexes with multiple columns will be used for a one-column-FK if the first columns match.

As an example let's have a look on the indexes of table [mail] after creating the foreign keys:

Indexes used/created for FK in table [main]:

  • FK (A) did not require an extra index, the existing index i1_idx is being used for (A)
  • FK (B) did not require an extra index, the existing index i4_idx is being used for (B) although it has a second column
  • FK (C) needs an extra index because no former index exists with matching criteria "first column: folder_id , second column: user_id". The new index inherits its name from the FK directive: fk_mail_folder_id.
Redundant indexes after establishing FK in table [mail]:
First we have to look at newly created FK-indexes, here its just one: indexfk_mail_folder_id.

Now we must check for "old" indexes having the same first column - we have a match on index i2_idx.  This index was (presumably) intended as a performance index for searching / filtering using column folder_id.  This intention is also guaranteed by new index fk_mail_folder_id - and the old index is to be treated as redundant and can be deleted savely.
This brings up another update step:
DROP INDEX i2_idx ON TABLE mail;

(This review for redundant indexes on table [mail] is intended as an example only - no further table reviews are discussed here, nor was the identified DROP directive be placed in ri-update.php.)

For a single data check step following abstract information ar required:

  1. the table name of table to be checked
  2. the column name(s) of the FK to be created (on table above), comma separated
  3. the table name of the foreign table
  4. the column name(s) in the foreign table for the FK to be created, comma separated
  5. a recommendation from the maintainer how to handle occuring data violations
The informations 1. to 4. can be extracted out from the directives (A) to (L). A call of the 'generalized data check step' MUST accepted these informations as function arguments.
A water-proofed automatic parser from 'a FK directive' to a 'data check select' seems to be too expensive, so the extraction SHOULD be left to the callers duties.
The resulting query inside the called function could be:
SELECT * FROM <1> WHERE (<2>) NOT IN (SELECT <4> FROM <3>);

The result of this query SHOULD be empty, meaning: no violating data was found in <1>.
If the result is not empty, this same query can be used as subquery for processing data check options A, B and C.

All directives (A) to (L) get similar recommendation:  "Please use Option C (delete violating entries) to clean up your data in table [<1>] for step %s." .

See also: Database: Detecting Violations for Referential Integrity.

At this point the maintainer should successfully have established the foreign keys from 'Step one' above in his/her development instance.
Now all classes of service 'Mail' MUST be reviewed whether these foreign keys may require code changes in following use cases:

  • using values which NOW can contain a NULL value - see: (1) and (2) in Step two
  • using values which NOW are located at another location - see: (3) in Step two (abandoned column user_id in [mail])
  • updating to or inserting a 'dummy' value in a column which NOW should contain a NULL value for dummy purposes - see: (1) and (2) in Step two
  • 'sequences of insert actions' which constitutes the insertion of an entity which parts are distributed over more than ONE table. NOW the sequence MUST respect FK's - e.g.: FIRST insert into [mail],  THEN insert into [mail_attachment]. BEFORE RI the sequence could have been reversed (and was functional OK anyhow...)
  • every deletion of data needs to be checked for simplification:
    • profiting from ON DELETE CASCADE rules: deletes from subordinated tables may be removed
    • profiting from ON DELETE SET NULL rules: partial invalidating of still existing data may be removed
The most benfit by this 'portion of RI' is gained when deleting a user:
  1. the call to ilMailbox::delete() in ilObjUser::delete() can be removed
  2. the whole function ilMailbox::delete() can be removed
The maintainer of 'Mail' now must handle point 1. (a pending change of code 'outside' his/her own responsibility), in this case I see these options for the maintainer of 'Mail':
  • he/she can provide an appropriate PR for the maintainer of 'User'
  • or he/she can communicate the change requirement to the maintainer of 'User' to prepare the change
  • or he/she can suspend the adaption by splitting it in 3 (commit) phases:
    • commit 1: maintainer of 'Mail' do NOT remove function ilMailbox::delete(), but degrade it by removing all former actions and return immediately to the caller
    • commit 2: maintainer of 'User' removes call to ilMailbox::delete()
    • commit 3: maintainer of 'Mail' removes function ilMailbox::delete()
If NOT using the 3-phase commit, the merge of the PR and the commit of 'RI for mail' MUST happen as simultaneously as possible to avoid an inconsistent git fetch/checkout by others.

Allthough hard wired code like described in 1. and 2. can be removed, dependent child entities need to be informed when a parent entity gets deleted. Required mechanisms are discussed here: Database: Announcing Entity Deletions with Established RI

Assuming Step five is done and everything is pushed to github, others may process this changes in their installations.  Here is a fictious protocol of such an update:
(needs adaption after RI-workshop on 2021-12-02 ...) !

## 01 ## fetching an checking out the release from the repository, do the normal duties for an update ..
> cd /<path_to_instance_codebase>
> git pull
> composer update --no-dev
> sudo -u www-data php setup/cli.php update --no-plugins --yes


## 02 ## run the 'status'-command 
...

## 03 ## executing pending update steps of ri-update.php, FIRST attempt
> php setup/setup.php ri-update <name_of_instance_config_to_be_used>.json

If this ILIAS instance does not have any data violating the 'new foreign keys', this 'portion of RI' is completely set up in DBMS.  Great!
The output of 'setup' on CLI may be:
The processing of ri-update was succeful. The ri-version for this ILIAS instance is: 32.

The admin can continue with the usual update at Step ## 06 ##.

In our installation we DO HAVE data violations for FK (D):  table [mail_obj_data] contains abondened entries of 29 users deleted sometime in the past - 174 violating entries in total.
The reaction of 'setup' on CLI may be:
... the processing of data check step #14 in setup/ri-update.php was unsuccessful.
A CSV dump of the violating entries was exported to: <path_to_error_logs>/ri-update_#14_<a_current_timestamp_with_seconds>.csv
The recommendation to handle this violations is:
Please use Option C (delete violating entries) to clean up your data in table [mail_obj_data] for step #14.


## 04 ## re-run ri-update.php applying option A and C
> php setup/setup.php ri-update <name_of_instance_config_to_be_used>.json AC

In variation to the maintainers recommendation option A+C are applied here when rerun the setup to have an exact dump of the deleted entries by option C. Theoretically this second dump may differ to the dump from step ## 03 ##!
The output of 'setup' on CLI may be:
... the processing of data check step #14 in setup/ri-update.php was unsuccessful.
A CSV dump of the violating entries was exported to: <path_to_error_logs>/ri-update_#14_<a_current_timestamp_with_seconds>.csv
Option C was applied and the violating entries are removed from the table now.
You may rerun the ri-update now without applying any extra option.


## 05 ## re-run ri-update.php applying NO extra options to the next data check step
> php setup/setup.php ri-update <name_of_instance_config_to_be_used>.json

The data check step #14 is still pending at this point! Step #14 stays the very next step to be processed for this ILIAS instance.
Due to cleaning up the violating entries in step ## 04 ## the data check step should be successful now and the processing of ri-update.php continues.
Assuming no further violating entries are found by other data checks, this run will be successful overall.
The output of 'setup' on CLI may be:
The processing of ri-update was successful. The ri-version for this ILIAS instance is: 32.

(currently 32 is the number of the very last step in ri-update.php)


## 06 ## continue with the 'normal' update process
> git checkout
> php setup/setup.php update <name_of_instance_config_to_be_used>.json
...

2.4 Documentation

A guideline and/or documentation is REQUIRED to qualify maintainers to:

  • decide, where foreign keys should be established (in the scope of one Module/Service)
  • decide, which kinds of DELETE- and UPDATE-rules are appropriate for a certain key
  • provide RI update steps using above mentioned functions to establish foreign keys
  • reflect the upcoming foreign keys to the PHP coding of their Module/Service
  • provide a recommendation for each data check step, which action should/may be taken in case of a failing data check step

3 User Interface Modifications

3.1 List of Affected Views

{Please list all views (screens) of ILIAS that should be modified, newly introduced or removed.}

3.2 User Interface Details

{For each of these views please list all user interface elements that should be modified, added or removed. Please provide the textual appearance of the UI elements and their interactive behaviour.}

3.3 New User Interface Concepts

{If the proposal introduces any completely new user interface elements, you might consult UI Kitchen Sink in order to find the necessary information to propose new UI-Concepts. Note that any maintainer might gladly assist you with this.}

4 Technical Information

{The maintainer has to provide necessary technical information, e.g. dependencies on other ILIAS components, necessary modifications in general services/architecture, potential security or performance issues.}

5 Privacy Information

{ Please list all personal data that will need to be stored or processed to implement this feature. For each date give a short explanation why it is necessary to use that date. }

6 Contact

  • Author of the Request: Tesche, Uwe [utesche]
  • Maintainer: {Please add your name before applying for an initial workshop or a Jour Fixe meeting.}
  • Implementation of the feature is done by: {The maintainer must add the name of the implementing developer.}

7 Funding

If you are interest in funding this feature, please add your name and institution to this list.

  • Helmut-Schmidt-Universität / Universität der Bundeswehr Hamburg

8 Discussion

Jansen, Michael [mjansen] To be discussed:

  • We will need a centralized table (and a repository for that) for the purpose of storing a RI state per component/service.
  • Do we need the possibility to dynamically interact with the 'RI Migrate Command' when violations exist to handle them (to choose or select one or more of the options: a. dump / b. copy / c. remove)? Or are these just further arguments when executing the 'RI Migration Command', e.g. '--option=a  or --option=b   or --option=ac)

Tesche, Uwe [utesche] To be discussed:

  • Should a nominated option (as argument, not interactively) be used exclusively for the very next pending FK of the objective?  And in consequence - if the option was used, the migration stops further processing:  just the requested option(s) are executed, but no FK is being established at all for this run.  If the option was not used (= unnesseccarly specified), the normal processing of the migration continues.

9 Implementation

{The maintainer has to give a description of the final implementation and add screenshots if possible.}

Test Cases

Test cases completed at {date} by {user}

  • {Test case number linked to Testrail} : {test case title}

Approval

Approved at {date} by {user}.

Last edited: 13. Oct 2023, 15:50, Kunkel, Matthias [mkunkel]