Revolution in Assessment
The motivations and technology behind the new bksb initial and diagnostic assessments
The new initial and diagnostic assessments from bksb have been created to ensure that each candidate is given a unique set of questions and is identified at an accurate level. During early development, the phrase “random questions” was frequently touted, but it quickly became apparent that this would be a hazardous route to take. That’s because there are two very simple truths which should immediately deter anyone from using an assessment based on “random questions” or a “randomised question bank”, namely:

no two questions are equal in terms of difficulty;

and where multiple-choice and free-text questions are present within the same assessment, candidates answering the former will receive an advantage.
As an example, if the questions 5×6 and 7×8 were given to candidates for whom multiplication was a challenge, they would not have equal pass rates. One would be more difficult than the other. Consequently, the random selection of questions would introduce an unacceptable element of “luck” into any assessment. And whilst the difference in difficulty between one question and another might well be small, when the sum total of those differences was multiplied over 20 questions or more, the overall effect would undoubtedly be significant.
Moreover, where an assessment blends multiple-choice questions, which can be guessed, with free-text questions, which cannot, controls must be introduced to balance the value of all correct answers.
For instance, if candidates were offered one of the three questions below, it is clear that the increased chance of guessing question 2 would give those who received it a distinct advantage over those presented with questions 1 or 3.
Question 1: 8 × 9 = ?
Answer: a) 72 b) 71 c) 73 d) 67
Question 2: 6 × 5 = ?
Answer: a) 32 b) 31 c) 30
Question 3: 7 × 4 = ?
Answer: (free-text entry, no options given)
Furthermore, even where the mixture of question types within a selection is equal, the presence of such mixes still creates problems when calculating results. Consider, for example, that the previous three questions form a sequence. If a mark is given for each, and 2 marks is considered a pass, then a number of problems will be encountered.
First, even though these questions are clearly unequal in terms of their difficulties, in the final calculation they are nonetheless awarded identical values. There is also a chance of roughly 8% (1/4 × 1/3 = 1/12) that a candidate can pass simply by guessing the first two questions. Secondly, the third question is surely a better indicator of ability than the other two. Should we therefore allot it 2 marks and change the pass mark to 3? We may be able to agree that question 3 is more significant, but is it exactly twice as significant as the other two? Moreover, if we accept that question 3 is more significant, due to there being no opportunity to guess the answer, then to a lesser extent question 1 is more significant than question 2, because with more answer options there is a reduced chance of someone guessing it. Beware, therefore, any assessment which features “randomised questions”.
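The 8% figure can be checked with a few lines of arithmetic. The sketch below assumes that a guessing candidate picks uniformly among the options offered, so the chance of a lucky answer is one divided by the number of options; the variable names are illustrative only.

```python
from fractions import Fraction

# Assumption: a pure guess is uniform over the options shown.
p_guess_q1 = Fraction(1, 4)  # Question 1 offers four options
p_guess_q2 = Fraction(1, 3)  # Question 2 offers three options

# Two marks are enough to pass, so guessing questions 1 and 2 suffices.
p_pass_by_guessing = p_guess_q1 * p_guess_q2

print(p_pass_by_guessing)                    # 1/12
print(f"{float(p_pass_by_guessing):.1%}")    # 8.3%
```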
How, then, can question variables be balanced in order to produce an accurate and valid result?
The solution lies in candidates being given different questions depending upon how they perform within the assessment. Following several years of intensive research, bksb concluded that the only way that an adaptive assessment could be delivered was through the implementation of Item Response Theory (IRT).
IRT has been in existence long enough to be critically appraised by leading psychometricians, statisticians and mathematicians, and it is widely regarded as one of the most reliable and accurate methods for testing ability. Indeed, it underpins the great majority of adaptive assessments (assessments where candidates are not all presented with the same questions).
Much of the intelligence behind this lies in the handling of multiple-choice questions. Whereas some questions have no guessing element, others might have 3, 4, 6 or even 8 options. As a result, there is a variable “chance” or “likelihood” of the answer being guessed, which in mathematical terms is referred to as probability.
If we consider that the probability of a high-ability candidate successfully answering a low-difficulty question is high, whilst the probability of a low-ability candidate answering a high-difficulty question is low, we should be able to calculate the probability of a candidate of any given ability answering a question of any given difficulty. Likewise, these probability calculations can be combined with the “guessing” probability so that the issues around multiple-choice questions are themselves resolved. From this, the responses that a candidate makes can be repeatedly tested against the “likelihood” that he or she fits a certain ability group until a “best fit” is determined. This in turn provides a sophisticated system for calculating ability that is immune to the problems caused by “randomisation” and by the inclusion of questions with differing likelihoods of “guessing”. Given the type of content, this is the perfect approach for measuring ability as it produces the fairest, most accurate, and most valid results.
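As a rough illustration of the idea, the sketch below uses the three-parameter logistic (3PL) model, a standard IRT form that combines ability, difficulty and a guessing probability. The article does not say which IRT model bksb uses, so the model choice, the fixed discrimination of 1, the ability scale of -4 to 4 and all function names are assumptions. The “best fit” is found here by a simple grid search for the ability that makes the observed responses most likely.

```python
import math

def p_correct(ability, difficulty, guess=0.0, discrimination=1.0):
    """3PL probability of a correct answer (assumed model, not bksb's stated one)."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guess + (1.0 - guess) * logistic

def log_likelihood(ability, responses):
    """responses: list of (difficulty, guess, correct) tuples."""
    total = 0.0
    for difficulty, guess, correct in responses:
        p = p_correct(ability, difficulty, guess)
        total += math.log(p if correct else 1.0 - p)
    return total

def estimate_ability(responses, lo=-4.0, hi=4.0, steps=161):
    """Return the grid value of ability that best fits the responses."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda a: log_likelihood(a, responses))

# Hypothetical candidate: two multiple-choice answers right, one free-text wrong.
responses = [
    (-1.0, 1 / 4, True),   # four options, easy
    (0.0, 1 / 3, True),    # three options, medium
    (1.0, 0.0, False),     # free text, hard
]
print(estimate_ability(responses))
```

With no guessing and ability equal to difficulty, `p_correct` returns 0.5; adding a guessing probability of 0.25 lifts that to 0.625, which is how a correct multiple-choice answer ends up counting for less than a correct free-text one.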
The necessary algorithms are hence deployed throughout the assessment, with constant adjustments made to a candidate’s “ability estimate” following every question. Rather than being random, the “best” question for each candidate is actually chosen after their previous responses have been analysed. Therefore, an assessment will contain 20 to 30 questions specifically selected for each candidate, with artificial intelligence used to create the most suitable experience for them.
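One common way to choose the “best” next question in adaptive testing is to pick the unused question that carries the most statistical information at the candidate’s current ability estimate. The article does not state bksb’s actual selection rule, so the following is only a sketch under that assumption, again using the 3PL model with a discrimination of 1; the question bank and its values are invented for illustration.

```python
import math

def p_correct(ability, difficulty, guess=0.0):
    """3PL probability of a correct answer, discrimination fixed at 1 (assumption)."""
    return guess + (1.0 - guess) / (1.0 + math.exp(-(ability - difficulty)))

def information(ability, difficulty, guess=0.0):
    """Fisher information of a 3PL item at the given ability."""
    p = p_correct(ability, difficulty, guess)
    return ((p - guess) / (1.0 - guess)) ** 2 * (1.0 - p) / p

def next_question(ability, bank, asked):
    """Pick the not-yet-asked question that is most informative at this ability."""
    candidates = [q for q in bank if q["id"] not in asked]
    return max(candidates,
               key=lambda q: information(ability, q["difficulty"], q["guess"]))

# Hypothetical three-question bank.
bank = [
    {"id": "easy", "difficulty": -2.0, "guess": 0.25},
    {"id": "medium", "difficulty": 0.0, "guess": 0.0},
    {"id": "hard", "difficulty": 2.0, "guess": 0.0},
]

print(next_question(0.0, bank, asked=set())["id"])       # medium
print(next_question(0.0, bank, asked={"medium"})["id"])  # hard
```

In a full assessment this step would alternate with re-estimating ability after every response, until the 20 to 30 questions have been delivered.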
How do we know how difficult a question is?
To tackle this, we have used the responses that have been submitted for each question to calculate the pass rates for each ability group. We then compare the latter to the probabilities for every possible difficulty, with a “best fit” algorithm thereafter suggesting which value most closely resembles the behaviour pattern taken from the database. In contrast to competing systems which merely total students’ scores and use boundaries to determine their results, this use of artificial intelligence is a massive step forward. Learners and teaching staff can therefore be assured that the values attributed to each question are entirely valid, having been obtained through the manipulation of large datasets containing hundreds of millions of student answers.
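The calibration step described above can be sketched in the same way: take the observed pass rate for each ability group, compare it with the probabilities predicted for every candidate difficulty, and keep the difficulty with the smallest mismatch. The least-squares criterion, the 3PL model, the numeric ability scale and the function names are all assumptions made for illustration, and the pass rates are synthetic rather than real bksb data.

```python
import math

def p_correct(ability, difficulty, guess=0.0):
    """3PL probability of a correct answer, discrimination fixed at 1 (assumption)."""
    return guess + (1.0 - guess) / (1.0 + math.exp(-(ability - difficulty)))

def fit_difficulty(pass_rates, guess=0.0, lo=-4.0, hi=4.0, steps=161):
    """pass_rates maps an ability-group value to its observed pass rate.

    Returns the grid difficulty whose predicted pass rates are closest
    (in squared error) to the observed ones.
    """
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]

    def squared_error(difficulty):
        return sum((p_correct(ability, difficulty, guess) - rate) ** 2
                   for ability, rate in pass_rates.items())

    return min(grid, key=squared_error)

# Synthetic check: generate pass rates from a known difficulty, then recover it.
true_difficulty = 1.0
groups = [-2.0, -1.0, 0.0, 1.0, 2.0]
observed = {a: p_correct(a, true_difficulty, guess=0.25) for a in groups}

print(fit_difficulty(observed, guess=0.25))
```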