Assessment Literature (from the last 12 months) that every CE should read: Part 2

By Jonathan Sherbino (@sherbino)


So, continuing from where we left off on Tuesday, here are papers #2 and #3 from the last year that every CE should read.

2.  Do In-Training Evaluation Reports Deserve Their Bad Reputations? A Study of the Reliability and Predictive Ability of ITER Scores and Narrative Comments

Ginsburg S, Eva K, Regehr G. 2013. Academic Medicine. 88(10):1-6

This study used a single-site cohort of PGY-1 to PGY-3 internal medicine residents' in-training evaluation reports (ITERs). (For individuals outside of Canada, an ITER is a summative in-training assessment combining global rating scales and narrative comments.) The study included 63 residents and 903 ITERs.

Shiphra Ginsburg and colleagues identified two factors (knowledge/clinical skills and interpersonal skills) that accounted for most of the variance in scores (~66%). This is despite the fact that the ITER consisted of 19 elements mapped to the seven CanMEDS Roles. One skeptical interpretation of this finding is that assessors ignore or de-emphasize Roles such as Health Advocate, Manager, and Professional.

Over 9 rotations, an ITER had an average reliability of ~0.5. The predictive correlation of an individual's PGY-1/2 ITER with their PGY-3 ITER was ~0.4. When a PGY-1/2 ITER was compared to a rank ordering of PGY-3s from best to worst, the correlation improved to ~0.6.

The authors conclude: “…this systematic analysis of ITER scores and narrative comments does point to the predictive value of ITER scores—despite their bad reputation—as well as offering interesting insights into the potential for the structured use of narrative comments. Although in this study the narrative comments did not offer additional predictive value, our results certainly indicated an impressive amount of “signal” in this data source…”

My take home point is that an ITER still has a role to play in an assessment system. The key element, though, is that an ITER is only one part of a system. Its reliability improves with repeated sampling (as does every instrument). And, while the narrative comments did not improve predictive validity, their formative effect still has merit.
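The "reliability improves with repeated sampling" point can be made concrete with the Spearman–Brown prophecy formula, which predicts the reliability of the average of k parallel assessments from the reliability of a single one. Note the single-ITER value of 0.10 below is my own back-calculation from the ~0.5-over-9-rotations figure, used purely for illustration; it is not reported in the paper.

```python
def spearman_brown(single_reliability, k):
    """Predict the reliability of the mean of k parallel assessments
    from the reliability of a single assessment."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)

# Illustrative assumption (not a figure from Ginsburg et al.):
# if one ITER has reliability ~0.10, nine aggregated ITERs reach ~0.50.
for k in (1, 9, 18, 36):
    print(k, round(spearman_brown(0.10, k), 2))
```

Under this assumed starting point, doubling the sampling from 9 to 18 rotations lifts the predicted reliability from 0.5 to about 0.67: the repeated-sampling effect in numbers.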

3.  Assessment in the post-psychometric era: Learning to love the subjective and collective

Hodges B. 2013. Medical Teacher. 35(7):564–568

This editorial by Brian Hodges returns CEs to the history of assessment in medical education, where the mentor provided the novice a global, subjective assessment of their performance.

The rise of classic psychometric theory in the middle of the 20th century profoundly influenced the process of contemporary assessment in medical education. Suddenly, reliability became the most influential criterion for judging the appropriateness of any approach to assessment, and non-standardized, qualitative (!!) assessment fell out of fashion. As Brian Hodges articulates, the psychometric era did introduce positive additions to assessment theory, such as appropriate sampling and improved rater training. Yet these improvements were balanced by negative outcomes, such as:

  • excessive reduction of competencies into sub-sub-elements as a means to improve exam reliability;
  • standardization of scoring that stripped away authenticity; and
  • loss of exam feedback in order to preserve exam security.

Hodges concludes: “the challenge before us then is to build rigor into our assessment programs, and to recognize that competence is contextual, constructed, and changeable and, at least in part, also subjective and collective.”

My take home point is that the qualitative "wisdom of crowds" improves with the number of judgments and the diversity of perspectives. An emerging assessment model, endorsed by the ACGME, is the competence committee, where quantitative and qualitative data can be aggregated and interpreted. With increased sampling and scaffolding frameworks for raters, 'subjective' assessments can be incorporated into a rigorous assessment program. As a learner, I want a judgment of my ability that is:

  • based on observation of my performance in authentic practice;
  • provided by an expert who can appreciate nuance and complexity; and
  • rich in detail (i.e., comments rather than a "3" on a 5-point Likert scale).
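One way to see why more judgments help: the spread of the mean of n independent ratings shrinks roughly as 1/sqrt(n). The Monte Carlo sketch below illustrates that statistical intuition only; it is not an analysis from either paper, and the rating-noise model (independent Gaussian error, sd = 1) is an assumption.

```python
import random

def se_of_mean_rating(n_raters, sd=1.0, trials=20000, seed=1):
    """Monte Carlo estimate of the spread (standard deviation) of the
    mean of n_raters independent ratings, each with Gaussian noise."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0, sd) for _ in range(n_raters)) / n_raters
             for _ in range(trials)]
    mu = sum(means) / trials
    var = sum((m - mu) ** 2 for m in means) / trials
    return var ** 0.5

for n in (1, 4, 16):
    print(n, round(se_of_mean_rating(n), 2))
```

Going from 1 rater to 16 cuts the spread of the pooled judgment by roughly a factor of four. Diversity of perspectives helps further by reducing shared (correlated) error among raters, which this independent-noise sketch does not model.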


If you are interested in keeping up to date on key literature in medical education, check out the KeyLIME Podcast!

Images courtesy of Creative Commons