Improve Search Index with Tika

1 Initial Problem

The current Lucene indexer implementation can analyze and index file content.

Supported file types are:

  • Plain text files
  • PDF files
  • HTML files
  • Office files such as .odt, .doc, .docx, .rtf, ...

Because text extraction relies on several libraries in different versions, none of which is optimized for use with the Lucene indexer, the following problems occur:

  • Certain files (for example, files with corrupted content) can crash indexer threads
  • The text extraction performance is not optimized
  • The Lucene indexer uses a simple charset detection algorithm that is not suitable for the purpose of content extraction

These issues could be addressed by replacing the existing libraries with Apache Tika.

Apache Tika is a widely used Java content detection framework licensed under the Apache License 2.0. See Wikipedia for further information and usages.
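
The first problem above, a corrupted file ending an indexer thread, is commonly handled by isolating each extraction call. The following is a minimal Java sketch of that idea and not the actual code of the ILIAS Lucene server. `GuardedExtraction`, `TextExtractor`, `Result` and `extractAll` are hypothetical names, and the extractor is only a stand-in for whatever backend (for example a Tika-based one) performs the extraction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GuardedExtraction {

    /** Hypothetical stand-in for the extraction backend; not a Tika or ILIAS API. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(String fileName, byte[] content) throws Exception;
    }

    /** Outcome of one run: extracted texts plus the names of skipped files. */
    static final class Result {
        final Map<String, String> texts = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Extracts text from every file. A failure in one file is recorded and
     * the loop continues, so a single corrupted file does not end the thread.
     */
    static Result extractAll(Map<String, byte[]> files, TextExtractor extractor) {
        Result result = new Result();
        for (Map.Entry<String, byte[]> file : files.entrySet()) {
            try {
                String text = extractor.extract(file.getKey(), file.getValue());
                result.texts.put(file.getKey(), text);
            } catch (Exception e) {
                // Checked and runtime exceptions from the extractor end up here.
                result.skipped.add(file.getKey());
            }
        }
        return result;
    }
}
```

In such a design the skipped file names could be logged and reviewed later, while all remaining files are still indexed.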

2 Conceptual Summary

The existing content extraction libraries will be removed from the codebase. These are:

  • PDFBox (.pdf)
  • JTidy (.html)
  • RTFEditorKit (.rtf)
  • POI (Office files)

Instead, Apache Tika 2.4 will be introduced, preferably as a Maven dependency (this depends on the status of "Using Maven to manage dependencies").

3 User Interface Modifications

3.1 List of Affected Views

No views affected

3.2 User Interface Details

No new user interface elements are introduced.

3.3 New User Interface Concepts

None

3.4 Accessibility Implications

None

4 Technical Information

The Tika framework will be introduced as a Maven dependency. Thus no new libraries will be committed to the codebase.

Deprecated / unused libraries will be removed from the codebase.

Backward compatibility with older ILIAS versions is a main goal of the implementation, so that the new Lucene server build can be used with all supported ILIAS releases.

5 Privacy

No additional personal data is stored in the Lucene index besides the content of files.

6 Security

New Tika versions are handled via Maven dependency updates.

7 Contact

8 Funding

9 Discussion

JourFixe, ILIAS [jourfixe], 17 OCT 2022: We highly appreciate this suggestion and schedule the feature for ILIAS 9.

10 Implementation

No changes to the interface, thus no screenshot.

Test Cases

Test cases checked on 2023-07-21 by Meyer, Stefan [smeyer]

  • No changes in community test cases

Approval

Approved on 2023-10-13 by Brauns, Johanna [jbrauns].

Last edited: 13. Oct 2023, 14:19, Brauns, Johanna [jbrauns]