Feature Wiki

Information about planned and released features

Extended Test and Item Statistics

  • Table overview for questions > separate article > [[https://docu.ilias.de/goto_docu_wiki_wpage_5876_1357.html]]
  • The PDF export of individual questions moves from "Results of Individual Questions" to the manual rescoring (Nachkorrektur)

1 Initial Problem

The basic test statistics currently provided within test objects do not provide enough information to allow test authors to increase the quality of assessment items and tests in general.

2 Conceptual Summary

3 Aggregated Test Results

The "Aggregated Test Results" screen should be extended by figures that are commonly calculated in other test systems as well.

For the whole test (calculated from the scored passes of the participants):

  • Average Points of Passed Test p of n (Mittelwert)
  • Median Points p of n (Median)
  • Average Mark as Short Form
  • Median Mark as Short Form or [Short Form; Short Form]
  • Standard Deviation (Standardabweichung)
  • Cronbach's Alpha as 0,x (also known as Internal Consistency)
There should be a toolbar offering
  • View > to choose whether Table and Chart should be displayed or just one of the two
  • Export > calling a modal to export the data as Excel or CSV
  • Print > preparing a print view
These figures should be automatically calculated for all types of tests (fixed question, random questions and continuous testing mode). Formulas can be found on this page. This page also gives advice on how to handle incomplete passes or random tests.
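For illustration, the per-test figures listed above can be sketched with Python's standard library. The sample points are invented, not taken from any real test, and whether the population or the sample standard deviation is meant would have to be fixed by the formulas on the linked page:

```python
from statistics import mean, median, pstdev

# Invented sample: total points of the scored pass per participant.
scored_points = [12.0, 15.5, 9.0, 18.0, 14.0, 11.5]
max_points = 20.0  # "p of n" -> points of maximum points

average = mean(scored_points)    # Average Points (Mittelwert)
med = median(scored_points)      # Median Points (Median)
std_dev = pstdev(scored_points)  # Standard Deviation (population variant)

print(f"Average: {average:.2f} of {max_points}")
print(f"Median:  {med:.2f} of {max_points}")
print(f"Std dev: {std_dev:.2f}")
```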

Internal consistency (Cronbach's alpha)
Indicates how well items in the test measure the same concept. Values range from 0 to 1.
If your test assesses highly related topics, then Cronbach's alpha is an important key figure and higher values are desirable. If your test assesses a broad range of different topics, then you must expect low and meaningless figures.

  • 0 to 0,7 Arbitrary: This test should not be the only source of input to assign a fair grade to a student. The quality of the test is too low; the grade assigned to a student will be somewhat arbitrary. This arbitrariness will be balanced out if the result of this test contributes to the overall grade along with other assessments, i.e. this test + presentation + oral exam combined would be more likely to produce a fair grade. To improve this test, compute the correlation of each test item with the total test score, delete items with low correlations and add questions.
  • 0,7 to 0,8 Relevant Feedback: This test should not be used as the single source of input to assign a fair grade to a student. The quality of the test is good enough to inform students within a semester about their progress. It cannot be used for high-stakes or low-stakes exams. To improve this test, compute the correlation of each test item with the total test score, delete items with low correlations and add questions.
  • 0,8 to 0,85 Good: This test is well developed and can be used for low-stakes exams that will assign fair grades to students even if the test is the only source of grading. It is not fit to be used in a high-stakes scenario.
  • 0,85 to 0,9 Excellent: This is a professionally developed standardized test that can be deployed for high-stakes exams: it is fit to be administered only once to a student and will produce a grade so sound that decisions about an entire career can be taken on the basis of this grade.
  • 0,9 to 1,0 Redundant test questions: Shorten your test; it contains questions that are measuring identical things. These 'doubles' should be removed from the test.

For all test questions (calculated from the scored passes of the participants):

  • Item Difficulty / Facility Index (Schwierigkeitsgrad)
  • Standard Deviation
  • Item discrimination / Discrimination Index (Trennschärfe)
There should be a toolbar offering
  • Figures > to choose whether Absolute Values and Percentages should be displayed or just one of the two
  • View > to choose whether Table and Chart should be displayed or just one of the two
  • Facility Index > to select between ranges
  • Discrimination Index > to select between ranges
  • Export > calling a modal to export the data as Excel or CSV
  • Print > preparing a print view
 

Item Difficulty / Facility Index
The item difficulty is the proportion of persons who answer a particular item correctly to the total attempting the item.  The item difficulty index ranges from 0 to 100. The higher the value, the easier the question.
The value of desirable difficulty changes with the number of answer options a question has. This is not taken into account by the following classification.

  • 0 to 20 Hard question, few students could answer that question correctly
  • 20 to 80 Medium difficulty, many students could answer that question correctly
  • 80 to 100 Easy question, almost all could answer that question correctly
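Since ILIAS questions are scored with points, the facility index is commonly generalized from "proportion answering correctly" to the mean points reached relative to the item's maximum points. A sketch with invented numbers, not the mandated formula from the linked page:

```python
def facility_index(points_reached, max_points):
    """Facility index in percent: average points reached in the item
    divided by the item's maximum points, over all scored passes."""
    return 100 * sum(points_reached) / (len(points_reached) * max_points)

# Invented example: points reached by 6 participants in a 4-point item.
print(f"{facility_index([2, 4, 4, 0, 4, 2], 4):.1f}")  # medium difficulty
```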
Item Discrimination / Discrimination Index
Measures how well a question differentiates among students on the basis of how well they know the material being tested. Item discrimination index has no fixed range.
The values of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests.
  • Below 0,1 Poor: Delete the question from the test; do not bother to improve it, it is beyond repair and not worth the effort
  • 0,1 to 0,3 In need of revision Revise question for ambiguous wording
  • Above 0,3 Good Keep question
  • Not computable
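One common way to compute the discrimination index is the corrected item-total correlation: the Pearson correlation between the points in one item and the points in the rest of the test. Whether ILIAS uses exactly this variant is defined on the linked formula page; the sketch below also covers the "not computable" case of zero variance:

```python
from statistics import mean, pstdev

def item_discrimination(item_points, total_points):
    """Corrected item-total correlation: Pearson correlation between the
    points in one item and the points in the rest of the test.
    Returns None ("not computable") if either series has zero variance."""
    rest = [t - i for i, t in zip(item_points, total_points)]
    sx, sy = pstdev(item_points), pstdev(rest)
    if sx == 0 or sy == 0:
        return None
    mx, my = mean(item_points), mean(rest)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item_points, rest)])
    return cov / (sx * sy)

# Invented example: one dichotomous item and the participants' total points.
print(item_discrimination([1, 1, 0, 0, 1, 0], [3, 2, 1, 0, 3, 2]))
```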

The table "Average Points" should be renamed to "Aggregated Item Results". It should have a column selector to choose which figures are shown.

The added figures should be included in the export of aggregated results, too.

4 User Interface Modifications

4.1 List of Affected Views

  • Test > Statistics > Aggregated Test Results
  • Test > Statistics > Aggregated Question Results

4.2 User Interface Details

4.3 New User Interface Concepts

none

5 Technical Information

In addition to this feature request there is the feature request for a Plugin Slot for Test and Item Statistics that will allow implementing further calculations for experimental purposes as a plugin.

6 Contact

7 Funding

If you are interested in funding this feature, please add your name and institution to this list.

8 Discussion

Fred Neumann, March 30, 2015: This feature was discussed and appreciated at the SIG E-Assessment Meeting on March 25.

BH 02 June 2015: The maintainer supports this request. Since we decided to rename the statistics tab to "Quality of Test", this feature request is a good step to supplement the existing statistical calculations made within this tab, so we get a more exact statement about the test's quality.

Neumann, Fred [fneumann], 24 June 2015: This feature request has a "sister", Plugin Slot for Test and Item Statistics. I propose to implement both together. The core statistics provide only the basic figures which we could quickly agree on at the SIG EA meeting. But one or more plugins are able to do advanced and experimental analyses. Those can be tested in pilots from 5.1 on, and the proven ones are candidates for additional statistics in the core. I think it is important to provide a "playground" for gaining practical experience before further extensions of the core statistics are discussed.

JourFixe, ILIAS [jourfixe], July 13, 2015: We generally appreciate the introduction of item statistics for test questions but we still see several open conceptual questions before being able to schedule the feature.

  • We need a clear distinction between test statistic and test item statistic. Test item statistics focus on the use and the results of one question in several tests. But this requires a feasible solution for the lifecycle of a question (at the time being every question in a test is a distinct copy and could be changed completely, which makes every statistic useless).
  • Only a small number of question types like single choice, KPrim or multiple choice (restricted) can be supported.
  • Clear use cases are needed to show which statistics can help to improve the quality of a question, e.g. distribution of distractors (and which statistics have no impact).
  • We see the need of a workshop to create a concept and plan a possible implementation for version 5.2.

AT 2015-07-25:  In this post I would like to raise issues to point out why I think the feature needs more conceptual work.

Presentation

The aim of this feature is to enable teachers to improve question and test quality. 
Merely listing key figures in a table will not provide a good starting point for teachers to improve individual questions or the whole test.

Teachers need information they can act upon. The presentation has to convey information in such a fashion that measures can be taken to improve questions and tests. The following suggestions are supposed to be a starting point for the discussion concerning the presentation, which should include the careful selection of classes of desirability and respective suggested actions.

The following boxes put forward my ideas about presentation:

Internal consistency (Cronbach's alpha)

Indicates how well items in the test measure the same concept. Values range from 0 to 1. Higher values are desirable.

  • 0 to 0,7, Arbitrary: This test cannot be used as the single source of input to assign a fair grade to a student. The quality of the test is too low; the grade assigned to a student will be somewhat arbitrary. This arbitrariness will be balanced out if the result of this test contributes to the overall grade along with other assessments, i.e. this test + presentation + oral exam combined would be more likely to produce a fair grade. Measures: compute the correlation of each test item with the total test score and delete items with low correlations; add questions.
  • 0,7 to 0,8, Relevant Feedback: This test cannot be used as the single source of input to assign a fair grade to a student. The quality of the test is good enough to inform students within a semester about their progress. It cannot be used for high-stakes or low-stakes exams. Measures: compute the correlation of each test item with the total test score and delete items with low correlations; add questions.
  • 0,8 to 0,85, Good: This test is well developed and can be used for low-stakes exams that will assign fair grades to students even if the test is the only source of grading. It is not fit to be used in a high-stakes scenario.
  • 0,85 to 0,9, Excellent: This is a professionally developed standardized test that can be deployed for high-stakes exams: it is fit to be administered only once to a student and will produce a grade so sound that decisions about an entire career can be taken on the basis of this grade.
  • 0,9 to 1,0, Redundant test questions: Shorten your test; it contains questions that are measuring identical things. These 'doubles' should be removed from the test.

Item Difficulty / Facility Index

The item difficulty is the proportion of persons who answer a particular item correctly to the total attempting the item. The item difficulty index ranges from 0 to 100. The higher the value, the easier the question.
The value of desirable difficulty changes with the number of answer options a question has. This is not taken into account by the following classification.

  • 0 to 20, Too hard: Few students could answer the question correctly. Measures: delete the question from the test or revise it.
  • 20 to 80, Desirable difficulty level: Keep as is.
  • 80 to 100, Too easy: Almost all students could answer the question correctly. Measures: delete the question from the test or revise it.

Item Discrimination / Discrimination Index

Measures how well a question differentiates among students on the basis of how well they know the material being tested. The item discrimination index has no fixed range: it can be negative; values above 0,5 are excellent but rare.
The values of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests.

  • Below 0,1, Poor: Delete the question from the test; do not bother to improve it, it is beyond repair and not worth the effort.
  • 0,1 to 0,3, In need of revision: Revise the question for ambiguous wording.
  • 0,3 to 0,5, Good: Keep the question.
  • Above 0,5, Excellent: Keep the question.

Handling of Test Passes 

The issue of multiple test passes has to be settled: the linked document is quite clear that using more than one test pass for the item statistics is bad practice, but it states that Moodle will offer the option anyway.

  • Will it be possible to compute the statistics for more than one test pass?
  • How are the statistics computed for the CTM mode that by definition never finishes? At what point in time will the statistics be presented if we never arrive at a finished test pass?

Negative Points

The Moodle paper clearly states that it does not have to deal with negative points. Has anybody checked whether the calculation is affected by the use of negative points?

Handling Missing Values 

How will missing values be handled?

  • Good practice would be to take the data sets with at least one missing value out of the computation altogether. This would spell trouble for the never-ending CTM and can thus be ruled out.
  • Only compute question-related statistics and only use the answers actually provided.
  • Reject the concept of missing values altogether: set all unanswered questions (never shown and flipped over) to 0 and compute anyway (the Moodle solution).
  • Alternative suggestion: never-shown questions are set to 0 and flipped-over questions are categorized as missing values.
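The four options above can be made concrete with a small sketch. The data representation is an assumption for illustration only: a participant's answers as a dict of points per question id, with None for a flipped-over (skipped) question and no entry for a never-shown question:

```python
def apply_policy(answers, question_ids, policy):
    """Return the per-question values entering the computation for one
    participant, or None to drop the whole data set."""
    if policy == "drop_incomplete":
        # Option 1: exclude data sets with at least one missing value.
        if any(answers.get(q) is None for q in question_ids):
            return None
        return [answers[q] for q in question_ids]
    if policy == "answered_only":
        # Option 2: use only the answers actually provided.
        return [answers[q] for q in question_ids if answers.get(q) is not None]
    if policy == "all_zero":
        # Option 3 (the Moodle solution): every unanswered question counts as 0.
        return [answers.get(q) or 0 for q in question_ids]
    if policy == "skipped_missing":
        # Option 4: never shown -> 0, flipped over -> stays missing.
        return [0 if q not in answers else answers[q]
                for q in question_ids
                if q not in answers or answers[q] is not None]
    raise ValueError(policy)

# Invented example: q1 answered, q2 flipped over, q3 never shown.
answers = {"q1": 3, "q2": None}
for p in ("drop_incomplete", "answered_only", "all_zero", "skipped_missing"):
    print(p, apply_policy(answers, ["q1", "q2", "q3"], p))
```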

JourFixe, ILIAS [jourfixe], May 09, 2016: Postponed to the next JF when Fred Neumann and Björn Heyser both can attend the meeting. We ask Fred to contact Björn before this meeting and also to give us a report about first experiences with the existing plugin slot.

Neumann, Fred [fneumann], June 1, 2016:
We are currently developing a plugin for ILIAS 5.1 and higher that realizes the functionality of the related feature request Plugin Slot for Test and Item Statistics. Please look at that feature request for details. It addresses most of the issues mentioned here by separating the presentation of test statistics from the actual calculation of statistical values (called "evaluations"):

  • Evaluations are small classes that get filtered test data as input and produce statistical data as output. The produced data format allows different kinds of use: display on screen, in Excel tables, diagrams etc.
  • Our solution already shows test and question evaluations on different screens. Evaluations have no GUI of their own and can easily be re-used in future functionality, e.g. in the lifecycle of questions.
  • Evaluations can have a description. Calculated values can be annotated with a comment and an alert status. This allows showing detailed advice like Alexandra proposed.
  • Core properties of an evaluation specify for which test type it can be applied, e.g. fixed question set, random question set, CTM. It can also be specified for which question types an evaluation is available (question evaluations that only analyze the points reached in a question can be calculated for every question type).
  • Furthermore we can define in the ILIAS administration which evaluations are available to all users and which to admins only.
  • It is already foreseen to filter the source data by test passes, e.g. first pass, last pass, best pass. The initial implementation will have a fixed filter to the scored pass.
The implementation of an evaluation class as a plugin or in the core will not differ. This way we can start with evaluations as plugins and successively decide which of them is ready for the core.
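The "evaluation" concept described above can be illustrated with a minimal sketch. All names here are hypothetical; the actual plugin slot is implemented in PHP and its API may differ:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvaluationResult:
    value: float
    comment: str = ""      # detailed advice shown next to the figure
    alert: bool = False    # highlight the figure on screen

class AveragePointsEvaluation:
    """A minimal evaluation: filtered test data in, statistical value out."""
    test_types = ("fixed", "random", "dynamic")   # where it is applicable

    def evaluate(self, scored_points, max_points):
        avg = mean(scored_points)
        if avg < 0.5 * max_points:
            return EvaluationResult(avg, "Average below half of the maximum.", True)
        return EvaluationResult(avg)

result = AveragePointsEvaluation().evaluate([12.0, 15.5, 9.0, 18.0, 14.0, 11.5], 20.0)
print(result.value, result.alert)
```

Because the evaluation has no GUI of its own, the same result object could feed a screen table, an Excel export or a diagram, as described in the bullet points above.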

I propose to first decide on the implementation of Plugin Slot for Test and Item Statistics and then decide which evaluations will be added initially to the core.

Heyser, Björn [bheyser], 04. January 2019:

The Jour Fixe raised on 13 July 2015 the following issues, which I want to address here.

  • We need a clear distinction between test statistic and test item statistic. Test item statistics focus on the use and the results of one question in several tests. But this requires a feasible solution for the lifecycle of a question (at the time being every question in a test is a distinct copy and could be changed completely, which makes every statistic useless).
Test and item statistics are now clearly distinguished. The item statistic comes from the future assessment question service and is also available within question pools. It is to be discussed whether the aggregation should be filtered to participant data from the single test itself. Test statistics are of course calculated only for the test and within the test.
  • Only a small number of question types like single choice, KPrim or multiple choice (restricted) can be supported.
As far as I understood the statistical calculations, the current concept relies on a calculation using point values only. I guess this restriction to certain question types becomes necessary when we come across a detailed distractor analysis.
  • Clear use cases are needed to show which statistics can help to improve the quality of a question, e.g. distribution of distractors (and which statistics have no impact).
Alexandra provided detailed suggestions for a kind of statement that could be shown with each statistical calculation. Would these statements be helpful?
  • We see the need of a workshop to create a concept and plan a possible implementation for version 5.2.
Several workshops were held in the past years. The problems of missing versioning of assessment items as well as the handling of item statistic data were solved with the feature requests for Versioning in Pool, Question Versioning in Test Object and Item Statistic in Pool.

Alexandra raised on 25 July 2015 the following issues that I want to address here.
  • Will it be possible to compute the statistics for more than one test pass?
I guess yes. I do not see any problem with using all participation data for calculating the item statistics. Of course it would be possible to restrict the calculations to the data of scored passes only. This can be discussed.
  • How are the statistics computed for the CTM mode that by definition never finishes? At what point in time will the statistics be presented if we never arrive at a finished test pass?
The data for the calculation is pushed to the assessment question service manually or optionally by a cron job. Indeed, the never-finished passes are no problem at all. A tutor can decide to push the data after the last participant has worked through the pool completely to get a consistent statistics presentation.
  • How will missing values be handled?
    • Good practice would be to take the data sets with at least one missing value out of the computation altogether. This would spell trouble for the never-ending CTM and can thus be ruled out.
    • Only compute question-related statistics and only use the answers actually provided.
    • Reject the concept of missing values altogether: set all unanswered questions (never shown and flipped over) to 0 and compute anyway (the Moodle solution).
    • Alternative suggestion: never-shown questions are set to 0 and flipped-over questions are categorized as missing values.
To my understanding, the test statistics are indeed consistent for finished passes only. In case of infinite CTM passes, we could leave out the data of participants who did not finish working through the pool, as Alexandra suggested.

For the item statistics, I guess we have overcome any problems since we now know whether a question has been answered or not. With today's test player navigation concept this information is given.

Jobst, Christoph [cjobst], 07. January 2019:

"Will it be possible to compute the statistics for more than one test pass?"

I'd vote for a selection of the pass to be analyzed/pushed, as is present in the current implementation of ExtendedTestStatistics (etstat).

Reasoning:

  • The most likely case to be analyzed is cross-sectional data. Inherent to this type of data is the loss of interpretability when it is thrown into one pot with other time points, e.g. you can't combine the data from the fifth pass of participant A and the first pass of participant B because they might not be comparable due to varying existing knowledge of the test. This differing knowledge was the reason to add another option for the pass selection in etstat 1.1.2: in addition to "best" and "last" it now also explicitly features "first", because the data generated after the first pass is flawed by prior knowledge in achievement/performance tests.
  • Longitudinal data may use multiple passes to compute the participants' progression over time. This kind of analysis is easy to implement as a sub-plugin of etstat.
"How will missing values be handled?"

I'd vote for missing = wrong in achievement/performance tests. Nevertheless the numbers for missing/not seen might be shown inside the test.

Reasoning:
  • The ILIAS T&A module is conceptualized for achievement/performance tests, not for medical/psychological assessment. Therefore we don't have to worry about assigning wrong psychological variables to participants when assuming things for unanswered or unseen questions in the test analysis.
  • If a participant can't answer a question, no points will be granted and a lower sum score will be achieved. No examiner will base the grading of an examination only on the individually answered/seen questions, so it is only consistent to interpret missing values as wrong answers for the test statistics as well. To keep this psychometrically sane, one of three conditions needs to be fulfilled:
    • The time for the test is generally sufficient to begin work (Inangriffnahme) on all items.
    • When using a tight timeframe, the questions have to be presented in a random order, so every question has the same chance to remain unseen.
    • The next item can be selected by the participant in a negligible amount of time.
  • It seems to me that those conditions are mostly met in usual scenarios within T&A. Specific use cases where missing/not seen has to be handled differently might be added via sub-plugins for etstat when needed.

JourFixe, ILIAS [jourfixe], 07 JAN 2019 : We accept the key figures suggested above in chap 2 for the test item statistic. But we need a clear presentation of the result screen, including textual information about the different levels (number of levels per key figure and text information). It is important for us that this information is helpful for test authors to improve their tests. These are not just numbers but important information that needs an appropriate visual representation. Please consider using the reporting panel instead of a simple table to visualise this information.

JourFixe, ILIAS [jourfixe], 07 MAY 2019 : We highly appreciate this suggestion and schedule the feature for 6.0. Please modify the KS description for the reporting panel to allow cards in every sub-panel.

Kunkel, Matthias [mkunkel] @ SIG EA, 13 NOV 2019 : The separation of the Test&Assessment has been postponed to ILIAS 7 because the project cannot be finished satisfactorily until 'Coding Completed' at NOV 29. Therefore, this feature won't make it into ILIAS 6 but is now suggested for ILIAS 7. Please add it to the JF agenda to schedule it again.

9 Implementation

{The maintainer has to give a description of the final implementation and add screenshots if possible.}

Test Cases

Test cases completed at {date} by {user}

  • {Test case number linked to Testrail} : {test case title}

Approval

Approved at {date} by {user}.

10 Requirements

...

Last edited: 13. Nov 2019, 09:27, Kunkel, Matthias [mkunkel]