Feature Wiki
Improve Search Index with Tika
1 Initial Problem
The current Lucene indexer implementation is able to analyze and index file content.
Supported file types are:
- (Plain)Text-Files
- PDF-Files
- HTML-Files
- Office-Files (.odt, .doc, .docx, .rtf, ...)
Text extraction currently relies on several different libraries in different versions, none of which are optimized for use by the Lucene indexer. As a result, the following problems occur:
- Certain files (for example, files with corrupted content) can crash indexer threads
- The text extraction performance is not optimized
- The Lucene indexer uses a simple charset detection algorithm which is not suitable for the purpose of content extraction
These issues could be tackled by replacing the existing libraries with Apache Tika.
Apache Tika is a widely used Java content detection framework licensed under the Apache License 2.0. See Wikipedia for further information and usages.
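The crash problem described above can be illustrated with a minimal Java sketch. It is not taken from the ILIAS code base: the `TextExtractor` interface and the `GuardedExtraction` class are hypothetical names, and the lambdas stand in for whatever library performs the extraction (Apache Tika in this proposal). The sketch only shows how a single extraction entry point could turn a failure on one file into an empty result instead of ending the indexer thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical stand-in for the component that extracts text from a file. */
interface TextExtractor {
    String extract(Path file) throws Exception;
}

final class GuardedExtraction {
    private GuardedExtraction() {}

    /**
     * Runs the extractor and converts any exception into an empty result,
     * so that one unreadable file does not terminate the calling thread.
     */
    static Optional<String> tryExtract(TextExtractor extractor, Path file) {
        try {
            return Optional.ofNullable(extractor.extract(file));
        } catch (Exception e) {
            System.err.println("Skipping " + file + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in extractors; a real setup would delegate to the extraction library.
        TextExtractor healthy = file -> "extracted text of " + file.getFileName();
        TextExtractor broken = file -> {
            throw new IllegalStateException("corrupted content");
        };

        System.out.println(tryExtract(healthy, Paths.get("manual.pdf")).isPresent());
        System.out.println(tryExtract(broken, Paths.get("damaged.pdf")).isPresent());
    }
}
```

A file that cannot be parsed is then simply left without content in the index, while the remaining files continue to be processed.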
2 Conceptual Summary
The existing libraries for content extraction will be removed from the codebase. These are:
- PDFbox (.pdf)
- JTidy (.html)
- RTFEditorKit (.rtf)
- POI (Office-Files)
Instead, Apache Tika 2.4 will be introduced, preferably as a Maven dependency (this depends on the status of "Using Maven to manage dependencies").
3 User Interface Modifications
3.1 List of Affected Views
No views affected
3.2 User Interface Details
No new user interface elements are introduced.
3.3 New User Interface Concepts
None
3.4 Accessibility Implications
None
4 Technical Information
The Tika framework will be introduced as a Maven dependency. Thus, no new libraries will be committed to the codebase.
Deprecated or unused libraries will be removed from the codebase.
Backward compatibility with older ILIAS versions is a main goal of the implementation, so that the new Lucene server build can be used for all supported ILIAS releases.
5 Privacy
No additional personal data is stored in the Lucene index besides the content of files.
6 Security
New Tika versions are handled via Maven dependency updates.
7 Contact
- Author of the Request: Meyer, Stefan [smeyer]
- Maintainer: Meyer, Stefan [smeyer]
- Implemented by: Meyer, Stefan [smeyer]
8 Funding
9 Discussion
JourFixe, ILIAS [jourfixe], 17 OCT 2022: We highly appreciate this suggestion and schedule the feature for ILIAS 9.
10 Implementation
No changes in interface, thus no screenshot.
Test Cases
Test cases checked at 2023-07-21 by Meyer, Stefan [smeyer]
- No changes in community test cases
Approval
Approved at 2023-10-13 by Brauns, Johanna [jbrauns].
Last edited: 13. Oct 2023, 14:19, Brauns, Johanna [jbrauns]