Rule of thumb: 40 questions in a 4-choice multiple-choice test – a short follow-up…

In my previous post I presented the rule of thumb (for the Netherlands) of choosing 40 4-choice multiple-choice test items for a typical end-of-course test in higher education. A discussion about this rule of thumb developed on Twitter. There was some criticism (of course, as it concerns a rule of thumb). A short impression of that discussion seems appropriate.


CITO (the national educational measurement institution) promotes using 60 score points as a rule of thumb. Obviously, the more test items a classic achievement test contains, the more reliable the measurement will be.
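The link between test length and reliability is usually expressed with the Spearman-Brown formula. The sketch below is only an illustration: the starting reliability of 0.70 for a 40-item test is a hypothetical value, and the formula assumes that the added items are comparable in quality to the existing ones. Score points are treated here as one point per item, which is also an assumption.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is made `length_factor` times as long
    with comparable (parallel) items (Spearman-Brown formula)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical starting point: a 40-item test with a reliability of 0.70.
base_items = 40
base_reliability = 0.70

for n_items in (40, 60, 80):
    projected = spearman_brown(base_reliability, n_items / base_items)
    print(f"{n_items} items: projected reliability {projected:.2f}")
# prints 0.70, 0.78 and 0.82 for 40, 60 and 80 items respectively
```

Under these assumptions, going from 40 to 60 items raises the projected reliability from 0.70 to about 0.78; the gain per added item becomes smaller as the test grows.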


Review studies have shown that the third distractor of 4-option multiple-choice test items generally performs poorly: it attracts neither competent nor incompetent students. The third distractor (on which a teacher has done his or her utmost best) is usually easily spotted by students as an incorrect option. So, in effect, it is better to develop 3-option multiple-choice test items. That allows a teacher to administer more test items in the same testing time and hence to increase the representativeness and reliability of the test in one sweep.

As was also noted, the use of the 4-option multiple-choice test item in the Netherlands actually goes back to A.D. de Groot, who brought multiple-choice testing to the Netherlands and who took a personal stance in promoting this type of test item, effectively setting the ‘norm’.

Rodriguez, M. C. (2005). Three options are optimal for multiple‐choice items: a meta‐analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3–13. doi:10.1111/j.1745-3992.2005.00006.x

Personally, I also think that both teachers and students regard 3-option multiple-choice test items as inferior to 4-option items. They think that 3-option items would be inherently (too) easy (forgetting that the level of difficulty does not follow from the form of the test item, but from the content …).
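One part of the "too easy" worry is probably guessing: with three options the chance of a lucky hit per item is 1/3 instead of 1/4. The sketch below puts a number on that, using a plain binomial model of blind guessing. Both test designs and both cut scores are hypothetical values chosen for illustration (40 four-option items versus 53 three-option items, so roughly the same number of options to read, with a cut score of about 60% correct); they are not taken from the discussion.

```python
from math import comb


def chance_score(n_items: int, n_options: int) -> float:
    """Expected number of correct answers under blind guessing."""
    return n_items / n_options


def prob_guess_at_least(n_items: int, n_options: int, cut_score: int) -> float:
    """Probability of reaching `cut_score` or more by blind guessing (binomial tail)."""
    p = 1.0 / n_options
    return sum(
        comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
        for k in range(cut_score, n_items + 1)
    )


# Hypothetical designs: (label, number of items, options per item, cut score).
designs = [
    ("4-option", 40, 4, 24),
    ("3-option", 53, 3, 32),
]

for label, n_items, n_options, cut in designs:
    print(
        f"{label}: chance score {chance_score(n_items, n_options):.1f} of {n_items}, "
        f"P(guessing >= {cut}) = {prob_guess_at_least(n_items, n_options, cut):.2e}"
    )
```

The chance score is higher for the 3-option design, but in both hypothetical designs the probability of reaching the cut score by guessing alone stays below one in a thousand. Whether a test is too easy for students who did study is a matter of item content, which this calculation does not touch.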


Tests in higher education should not be regarded as psychological tests whose goal is to spread people out as much as possible on a scale according to their degree of knowledge or skill. Instead, tests in higher education should only discriminate between students who did not study the materials at all and students who did. If all students did study, then the spread in scores might, and perhaps should, actually be low. And if the spread is low, the reliability of a test (as measured with, for example, Cronbach's alpha) will be low. But that would not be a problem then.

Actually, end-of-course tests should be regarded as achievement tests that are criterion-referenced rather than norm-referenced.

Coscarelli, W., & Shrock, S. (2002). The two most useful approaches to estimating criterion-referenced test reliability in a single test administration. Performance Improvement Quarterly, 15(4), 74–85. doi:10.1111/j.1937-8327.2002.tb00266.x
Shrock, S. A., & Coscarelli, W. C. (2008). Criterion-referenced test development: Technical and legal guidelines for corporate training. John Wiley & Sons.


The field of educational testing and assessment has taken large strides since the 1960s: in psychometric and analytic methods (for example IRT), in development and validation methods (for example Evidence-Centered Assessment Design), and in conceptualizing the function of assessment (for example Constructive Alignment and Assessment of, for and as Learning).

Methods and philosophies for developing, analysing and thinking about end-of-course tests in higher education, however, seem to have stood still since the 1960s. I recently ran into a paper by Smith dating from 1978 and observed that not much seems to have changed with regard to end-of-course testing in higher education:

Smith, L. S. (1978). Decisions and Dilemmas in Constructing Criterion-referenced Tests: Some Questions and Issues. Center for the Study of Evaluation, UCLA Graduate School of Education. Retrieved from

I ask myself: why is this and what should we learn from this? Maybe a next post … maybe referring to Borsboom and Mellenbergh …

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. doi:10.1037/0033-295X.111.4.1061
