The (Disputed) Key Assessment Papers in 2014

By Jonathan Sherbino (@sherbino)

Using a very scientific method (we emailed journal editors and friends in the HPE community… if you didn’t respond… thanks for nothing 🙂 ), Eric Holmboe (@boedudley) and I compiled a list of key assessment papers from the last 12 months. We shared our analyses of the papers at a session during ICRE. Papers were selected to provide a balance of interesting methodologies, important theoretical insights or practice-changing results. We don’t claim that the selection below represents a “best of…” list. Rather, the collection is intended as an interesting sample of the important scholarship being conducted in education assessment.

Without further ado…

1. Seeing the same thing differently: Mechanisms that contribute to assessor differences in directly-observed performance assessments.

Yeates P, O’Neill P, Mann K, Eva K. Advances in Health Sciences Education, 18 (3): 325-41.

Study Design

The purpose of this study was to explore the cognitive processes of assessors that lead to inter-rater variation in judging authentic performance of medical trainees.

This was a qualitative study using principles of grounded theory. Participants (practicing physicians with >2 years consultant-level practice and experience in use of mini-CEX assessments) scored standardized videos of PGY 1 physicians completing a patient interaction using a UK mini-CEX template with a quasi-criterion scale (i.e. meets expectations). “Good,” “borderline,” and “poor” performances were viewed.

Think aloud protocols captured participants’ “thought” processes in real time. After completing the assessments, an interviewer further explored participants’ thinking via a semi-structured interview.

Think aloud protocols should be interpreted with caution. Biases arise for many reasons. Within conscious processes, social desirability may influence the response of participants. With respect to unconscious processes, the automaticity of categorization of behaviour, by definition, cannot be articulated by the participant.

Main Points

Twelve participants and 13.5 hours of audio–recording were included in the study to reach saturation. There was significant variation in scoring the simulations. Three themes that describe the idiosyncrasy of assessor judgments were discovered.

  1. Differential salience
  • The behaviour that influences a trainee’s score varies between assessors
  • The point of focus among multiple behaviours during a mini CEX is unique to an individual
  2. Criterion Uncertainty
  • There is ambiguity among assessors regarding the criterion against which trainees are assessed
  • “Meets expectations for a PGY 1” is not universally clear
  • Assessors use a personal criterion based on past experience
  3. Information Integration
  • During the process of observation, interim qualitative judgments are made that do not map to the instrument. These qualitative “scores” must then be converted backwards to the quantitative scoring template.

Caution is warranted about the generalizability of the study results, which are based on observations from 10 (of 12) male internists from the UK.


The authors conclude… “Our results (whilst not precluding the operation of established biases) describe mechanisms by which assessors’ judgments become meaningfully-different or unique… They give insight relevant to assessor training, assessors’ ability to be observationally ‘‘objective’’ and to the educational value of narrative comments (in contrast to numerical ratings).”

This study suggests that variability between assessors may not reflect error but unique perspectives, which globally reflect truth. As educators, how do we triangulate various assessments to effectively provide feedback on performance to a learner?

Recommended by

Javier Benitez (@jvrbntz)

See a companion paper:

Kogan JR, Conforti LN, Iobst WF, Holmboe ES. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 2014 May;89(5):721-7.

2. Can I leave the theatre? A key to more reliable workplace-based assessment

M. Weller, M. Misur, S. Nicolson, J. Morris, S. Ure, J. Crossley and B. Jolly. British Journal of Anaesthesia. 2014; 112: 1083-91.

Study Design

Comparative study of two different approaches to scoring the mini-CEX: a “conventional” system and one that utilized a scoring system based on levels of supervision.

Main Points:

The authors tested new versions of a scoring system that targeted developing autonomy, overall independence with the anesthetic case, and overall performance against expected stage of training. Participation was voluntary (which could signal some selection effects/bias). In all, 84 assessors completed 338 assessments from 80 trainees.

  1. Supervisor scores were more reliable when scoring the trainee’s independence in terms of the need for direct, indirect or distant supervision. The authors found, in this study, that using this scoring system produced a reliability of 0.7 with just 9 assessments.
  2. None of the trainees were appropriately identified as performing below standards when the conventional scoring system was used.
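
To give a feel for what “a reliability of 0.7 with just 9 assessments” implies, here is a rough back-of-the-envelope sketch. It assumes the Spearman-Brown prophecy formula applies (i.e. assessments behave like parallel measurements), which is a simplification and not necessarily the analysis the authors used. The function names are hypothetical, chosen only for this illustration.

```python
def single_assessment_reliability(composite: float, n: int) -> float:
    """Invert Spearman-Brown: reliability of ONE assessment, given the
    reliability of the average of n assessments (assumes parallel measures)."""
    return composite / (n - (n - 1) * composite)


def composite_reliability(single: float, n: int) -> float:
    """Spearman-Brown: reliability of the average of n assessments."""
    return n * single / (1 + (n - 1) * single)


# Figures quoted in the summary above: reliability 0.7 with 9 assessments.
r1 = single_assessment_reliability(0.7, 9)

# Projected reliability for different numbers of assessments per trainee.
for n in (1, 5, 9, 15):
    print(n, round(composite_reliability(r1, n), 2))
```

Under this assumption a single assessment is far from reliable on its own, which is why the number of assessments needed to reach an acceptable threshold is the practical question.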


This study adds further evidence, with caveats, that using more developmental-type scales that are more “construct aligned” can enhance the reliability of rating scales like the mini-CEX. These types of scales are also better aligned with the developmental philosophy of CBME. This work builds on early work by Crossley and colleagues when studying the Foundation programme tools in the UK.

Note: “independence” is defined in terms of level of supervision for this study. Increasingly, because of the high need for interprofessional teamwork in almost all aspects of care (including the OR), the concepts of independence and autonomy are seen as antiquated.

Recommended by:

Olle ten Cate

Carol Carraccio

3. Key-feature questions for assessment of clinical reasoning: a literature review.

Hrynchak P, Glover Takahashi S, Nayer M. Med Educ. 2014 Sep;48(9):870-83.

Study Design:

Literature review (however, the type of review is not stated, which is a weakness of the study. It appears to be mostly a best evidence synthesis, not a meta-analysis or realist review).

Main Points:

This is a nice review of the literature on key feature type exams. As the authors note, modest correlation exists with general knowledge exams, and experts out-perform novices. However, as the authors also note, the role and place of Key Features examinations in a programme of assessment is unclear. There is also growing concern as to how best to assess knowledge in a world where knowledge grows exponentially and clinical decision support methods and tools are increasingly available and improving in quality.

The majority of work has been accomplished in conjunction with the Medical Council of Canada licensing exam and in Canada.


When, whether, and for what purpose a program or testing organization should use KFQs over other methods to assess clinical reasoning needs additional clarity to guide future research.

Recommended by:

Kevin Eva

4. Programmatic assessment of competency-based workplace learning: when theory meets practice.

Bok HG, Teunissen PW, Favier RP, Rietbroek NJ, Theyse LF, Brommer H, Haarhuis JC, van Beukelen P, van der Vleuten CP, Jaarsma DA.  BMC Med Educ. 2013 Sep 11;13:123.

Study Design:

Implementation study of the van der Vleuten programmatic assessment model, using surveys and interviews to measure the experience of faculty and learners with the implementation in a single VETERINARY training programme.

Main Points:

More of this type of implementation study is needed. Main points from this study:

  1. Implementing a more holistic programme of assessment is just hard work, and is a major culture shift for faculty and learners.
  2. Like many have noted in implementation science (see McGaghie), preparation for the implementation is key.
  3. In a CBME-based system, greater use of lower stakes “formative assessment” is essential and special attention should be paid to the nature and quality of feedback.
  4. Faculty development is crucial: many of the required and needed skills are in areas of assessment and feedback that challenge most faculty.
  5. Implementation is messy and programmes have to be prepared for that. Thinking in either continuous quality improvement or realist program assessment frameworks could be helpful, but were not used in this study.


Implementation of “holistic” models of programmes of assessment is messy and hard work. Any training programme seeking to change to this type of assessment programme (and I believe personally we should) should use system science, quality improvement and realist tools and methods during the implementation and treat the implementation as an iterative process.

Recommended by:

Robert Englander

5. Developing the role of big data and analytics in health professional education.

Ellaway R, Pusic M, Galbraith R, Cameron T. Medical Teacher, 2014, 36 (3): 216-22.

Study Design

The purpose of this commentary is to highlight the opportunities (and pitfalls) that big data offer health professions education.

Main Points

Big Data sources can include:

  • Learning management systems (e.g. logins, downloads etc.)
  • Encounters (e.g. logs)
  • Exam scores

Comparisons can be:

  • Within person over time
  • Cross sectional against peers, against all users
  • Correlational with other clinical (e.g. from EMR) or education data

Analysis can include:

  • Predictive: identify patterns to anticipate future behavior. Useful for student selection; issues of professionalism
  • Outlier identification: identify early dyscompetence
  • Decision support: guide the design/adaptation of curricula to meet learners’ needs
  • Knowledge discovery: data mining to identify unpredictable associations
  • Alerting: surveillance to rapidly identify critical events
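
As a toy illustration of the “outlier identification” idea in the list above, the sketch below flags learners whose score sits far below the cohort mean. Everything here is an assumption made for illustration: the learner IDs and scores are invented, the function name is hypothetical, and a simple z-score cut-off is only one of many possible approaches, not something the commentary prescribes.

```python
from statistics import mean, stdev


def flag_low_outliers(scores: dict[str, float], threshold: float = 2.0) -> list[str]:
    """Return learner IDs scoring more than `threshold` sample standard
    deviations below the cohort mean (hypothetical rule for illustration)."""
    values = list(scores.values())
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        # Everyone scored the same, so nobody stands out.
        return []
    return [learner for learner, s in scores.items() if (s - mu) / sd < -threshold]


# Invented cross-sectional exam scores for one cohort.
cohort = {
    "L01": 78, "L02": 82, "L03": 75, "L04": 80,
    "L05": 79, "L06": 81, "L07": 77, "L08": 40,
}

print(flag_low_outliers(cohort))
```

A flag like this is a prompt for a human conversation with the learner, not a summative judgment, particularly given the data-quality pitfalls the authors raise.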


Pitfalls can include:

  • Learner and patient privacy
  • Quality of data sets
  • Infrastructure costs
  • Competencies without a dataset are devalued
  • Digital Hawthorne effect (learners “game” technology to improve metrics)
  • Processes to adjudicate large data sets for summative assessments
  • Assuming correlation = causation


The authors conclude… “Education analytics and Big Data techniques have the potential to revolutionize evidence-based practice, through the standardization of how these core components of health professional education are modeled and through real time aggregate data analyses.”

Recommended by

Teresa Chan (@tchanmd)

Other Recommended Assessment Papers

David B. Swanson & Cees P.M. van der Vleuten Assessment of Clinical Skills With Standardized Patients: State of the Art Revisited. Teaching and Learning in Medicine: An International Journal. 2013. 25:sup1, S17-S25.

Guerrasio J, Garrity MJ, Aagaard EM. Learner deficits and academic outcomes of medical students, residents, fellows, and attending physicians referred to a remediation program, 2006-2012. Acad Med. 2014 Feb;89(2):352-8.

Driessen E, Scheele F. What is wrong with assessment in postgraduate training? Lessons from clinical practice and educational research. Med Teach. 2013 Jul;35(7):569-74. Epub 2013 May 23.

Pangaro L, ten Cate O. Frameworks for learner assessment in medicine: AMEE Guide No. 78. Med Teach. 2013 Jun;35(6):e1197-210.

Cook DA, West CP. Perspective: Reconsidering the focus on “outcomes research” in medical education: a cautionary note. Acad Med. 2013 Feb;88(2):162-7.

Marjan Govaerts & Cees PM van der Vleuten. Validity in work-based assessment: expanding our horizons. Medical Education. 2013; 47: 1164–1174.

Kogan JR and Holmboe ES. Realizing the Promise and Importance of Performance-based Assessment. Teach Learn Med 2013;25 Suppl 1:S68-74.

Aagaard E, Kane GC, Conforti L, Hood S, Caverzagie KJ, Smith C, Chick DA, Holmboe ES, Iobst WF. Early feedback on the use of the internal medicine reporting milestones in assessment of resident performance. J Grad Med Educ. 2013 Sep;5(3):433-8.

Lineberry M, Kreiter CD, Bordage G. Threats to validity in the use and interpretation of script concordance test scores. Med Educ. 2013 Dec;47(12):1175-83.

Kogan JR, Conforti LN, Iobst WF, Holmboe ES. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 2014 May;89(5):721-7.

Byrne A, Tweed N, Halligan C. A pilot study of the mental workload of objective structured clinical examination examiners. Med Educ. 2014 Mar;48(3):262-7.

Gierl MJ, Lai H. Evaluating the quality of medical multiple-choice items created with automated processes. Med Educ. 2013 Jul;47(7):726-33.

So… what paper from the last 12 months isn’t on the list?  Flag your nomination below.

– Jonathan

Image 1 courtesy of stockimages/

Image 2 courtesy of Supertrooper/

Image 3 courtesy of digitalart/

Image 4 courtesy of adamr,/

Image 5 courtesy of tigger11th/