#KeyLIMEPodcast 268: Maybe machines can solve the bias of raters?

Rater variance is a common theme on the KeyLIME podcast (see episode 78 for a fuller discussion). In today’s episode, the co-hosts discuss an article that looks at the possibility of removing human raters and replacing them with AI, thus introducing a new age of assessment in HPE.

Listen to the latest KeyLIME episode here!


KeyLIME Session 268




Winkler-Schwartz et al., Machine Learning Identification of Surgical and Operative Factors Associated With Surgical Expertise in Virtual Reality Simulation. JAMA Netw Open. 2019;2(8):e198363.


Jon Sherbino (@sherbino)


A review of the issues of rater variance by Gingerich et al. suggests that there are three ways to understand the poor reliability of direct observation:

  • There is a lack of a shared mental model between raters, which may be addressed by frame-of-reference training.
  • Raters are biased by social and environmental contexts, which may be addressed by structuring the environment and screening for social pressures.
  • Each assessor is “meaningfully idiosyncratic”, focusing on a different element of performance that marks the complexity of any clinical task and presenting a legitimate subjective interpretation.

With the rise of the machines, perhaps there is a new way to understand assessment of performance… remove the human rater. Machine learning can find correlations and signals without being trained in a single “correct” frame of reference. In principle, its judgements are unhindered by social bias and not idiosyncratic to a single machine; instead, they depend on the statistics of large data sets.

This is a proof-of-concept paper that attempts to get past the commentaries and hype around AI and suggests a possible new age of assessment in HPE.


“To identify surgical and operative factors selected by a machine learning algorithm to accurately classify participants by level of expertise in a virtual reality surgical procedure.”

Key Points on the Methods

This was a single centre, case series study. There was IRB approval.

Data was collected prospectively over 15 months, including medical students (rotating on the neurosurgery service), neurosurgery residents & fellows and neurosurgery attending physicians.

A visual and haptic partial task trainer mimicking resection of a brain tumour through a microscope was used. Each participant removed a simulated brain tumour five times, and average scores were calculated from instrument position, force applied, contact with tumour, blood vessel or healthy tissue, and blood loss. In total, 270 metrics were captured.

Non-differentiating metrics were removed from analysis. Algorithms were optimized by randomly adding or subtracting metrics in a systematic, iterative fashion.  Four classifier algorithms were used: K-nearest neighbor, naive Bayes, discriminant analysis, and support vector machine.
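For readers who have not seen metric selection in action, here is a minimal sketch of the general idea: start with no metrics, add whichever one most improves classification, and stop when nothing helps. It is not the authors' pipeline. The greedy forward-only search, the leave-one-out scoring, k=3, the function names and the synthetic data are all assumptions made for illustration, and only a simple K-nearest neighbor classifier is shown.

```python
import math
import random

def knn_predict(train, labels, query, features, k=3):
    """Majority vote among the k nearest training rows, using only `features`."""
    dists = sorted(
        (math.dist([row[f] for f in features], [query[f] for f in features]), lab)
        for row, lab in zip(train, labels)
    )
    votes = {}
    for _, lab in dists[:k]:
        votes[lab] = votes.get(lab, 0) + 1
    # Ties in the vote go to the alphabetically first label.
    return max(sorted(votes), key=votes.get)

def loo_accuracy(rows, labels, features, k=3):
    """Leave-one-out accuracy: each participant is classified by all the others."""
    hits = 0
    for i, (row, lab) in enumerate(zip(rows, labels)):
        rest = rows[:i] + rows[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest, rest_labels, row, features, k) == lab
    return hits / len(rows)

def forward_select(rows, labels, n_features, k=3):
    """Greedily add the metric that most improves accuracy; stop when none does."""
    chosen, best = [], 0.0
    while len(chosen) < n_features:
        candidates = [f for f in range(n_features) if f not in chosen]
        scored = [(loo_accuracy(rows, labels, chosen + [f], k), f) for f in candidates]
        top_acc, top_f = max(scored, key=lambda s: (s[0], -s[1]))
        if top_acc <= best:
            break
        chosen.append(top_f)
        best = top_acc
    return chosen, best

# Synthetic, made-up data: metric 0 tracks skill level, metrics 1-3 are pure noise.
random.seed(0)
levels = ["student", "junior", "senior", "attending"]
rows, labels = [], []
for skill, level in enumerate(levels):
    for _ in range(10):
        rows.append([skill + random.gauss(0, 0.2)] + [random.gauss(0, 1) for _ in range(3)])
        labels.append(level)

chosen, acc = forward_select(rows, labels, n_features=4)
print("metrics kept:", chosen, "leave-one-out accuracy:", round(acc, 2))
```

The point of the toy data is that the search should keep the informative metric and ignore the noise. This is the same reason a model built on 6 to 9 metrics can remain explainable when 270 were recorded.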

Key Outcomes

n=50 (14 neurosurgeons, 4 fellows, 10 senior residents, 10 junior residents, 12 medical students). Approximately half of the group had previously used a partial task trainer.

The machine learning algorithms (MLAs) were able to classify operators into attending, senior, junior, or medical student categories, with the following accuracy:

  • K-nearest neighbor algorithm had an accuracy of 90% (45 of 50), using 6 performance metrics.
  • Naive Bayes algorithm had an accuracy of 84% (42 of 50), using 9 performance metrics.
  • Discriminant analysis algorithm had an accuracy of 78% (39 of 50), using 8 performance metrics.
  • Support vector machine algorithm had an accuracy of 76% (38 of 50), using 8 performance metrics.
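As a quick sanity check on the figures above, accuracy here is simply the number of correctly classified participants divided by the total. The short sketch below recomputes each percentage from the reported counts and shows how many participants each model misclassified. The dictionary layout and function name are illustrative only.

```python
# (correctly classified, total participants, metrics used), as reported above
results = {
    "K-nearest neighbor": (45, 50, 6),
    "Naive Bayes": (42, 50, 9),
    "Discriminant analysis": (39, 50, 8),
    "Support vector machine": (38, 50, 8),
}

def accuracy_pct(correct, total):
    """Accuracy as a percentage of participants classified correctly."""
    return 100 * correct / total

for name, (correct, total, n_metrics) in results.items():
    missed = total - correct
    print(f"{name}: {accuracy_pct(correct, total):.0f}% "
          f"({missed} of {total} misclassified, {n_metrics} metrics)")
```

Put another way, even the best-performing model placed 5 of the 50 participants in the wrong category.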

Theoretically, accuracy could have been improved by including more metrics, and more abstruse ones, but this would have limited the transparency and explainability of this AI.

Interestingly, 3 of the 4 models misclassified an attending neurosurgeon as a medical student!

Key Conclusions

The authors conclude…

“Our study demonstrates the ability of machine learning algorithms to classify surgical expertise with greater granularity and precision than has been previously demonstrated. Although the task involved a complex neurosurgical tumor resection task, the protocol outlined can be applied to any digitized platform to assess performance in a setting in which technical skill is paramount.”

Spare Keys – other take home points for clinician educators

The peer review process is becoming even more challenging, as cross-disciplinary expertise in computer science and advanced statistics is required to review the increasing number of submissions that use machine learning algorithms. Math geeks unite!

Access KeyLIME podcast archives here


The views and opinions expressed in this post and podcast episode are those of the host(s) and do not necessarily reflect the official policy or position of The Royal College of Physicians and Surgeons of Canada. For more details on our site disclaimers, please see our ‘About’ page