#KeyLIMEPodcast 195: #MedEd Assessment – Judgement Day

Can machine learning (ML) techniques be used to automate physician competence assessment? Read on, and check out the podcast here.


KeyLIME Session 195:

Listen to the podcast.



Dias et al. Using Machine Learning to Assess Physician Competence: A Systematic Review. Acad Med. 2018 Aug 14.



Reviewer: Jonathan Sherbino (@sherbino)


Apparently, the current answer to everything in medicine is artificial intelligence, and health professions education is no different.  It reminds me of my entrance to medical school, when the “internets” were invented and email was going to TRANSFORM!! learning.  The outcome two decades later was pretty tepid.

Enter Skynet, the Terminator, and AI.  Nonetheless, you have to admit, the promise (even past the hype) of artificial intelligence to change assessment in #meded is intriguing.  In an era of so-called big data, the ability to make connections among multiple variables to predict future performance exceeds the ability of traditional computational processes. So… ready player one?  Let’s talk machine learning (an automated process that extracts patterns from data) and whether assessment in HPE is about to change.


“To identify the different machine learning (ML) techniques that have been applied to automate physician competence assessment and evaluate how these techniques can be used to assess different competence domains in several medical specialties.”

Key Points on Method

  • A systematic review was performed, adhering to the PRISMA guidelines.
  • Eight databases were searched, including the big ones you know and two digital libraries relevant to computing machinery (that I have never heard of before), from inception to April 2017.
  • For consideration, an article must have been published in a peer-reviewed journal, addressed a machine learning technique, and assessed physician/resident/medical student competence.
  • Two authors independently conducted the selection process.  Disagreement was resolved by consensus. Data extraction was also independent.  Study quality was assessed using the MERSQI.


Key Outcomes

The initial search revealed ~5000 articles.  Sixty-nine studies were included, with more than 50% published in the last six years.

The types of studies were: cross-sectional studies, retrospective cohorts, prospective cohorts and RCTs.  More than 40% of the studies used simulation. Nearly all of the studies focused on an individual, with only ~3% assessing a team. General surgery and radiology were the most commonly studied disciplines.  The Medical Expert Role was the overwhelming domain assessed.

The machine learning techniques used included:

  • Supervised (input data – descriptive features from a data set – PLUS output data – target features from a data set) (i.e. regression or classification tasks to determine the relationship between input and output data sets)
    • Bayesian algorithms
    • Random forest
    • Support vector machines
    • Neural networks
  • Unsupervised (only input data – descriptive features from a data set) (i.e. clustering tasks to extract patterns)
    • Hidden Markov models
    • Gaussian mixture models
    • Principal component analysis
    • Hierarchical clustering
  • Both (natural language processing)

Examples of input data included: length of training, gesture tracking for a procedure, free-text clinical notes, clinical data, facial recognition, etc.  Examples of output data included: OSATS scores, year of training, adherence to guidelines, test scores, free-text analysis, etc.
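To make the supervised/unsupervised distinction concrete, here is a minimal sketch in plain Python. The numbers and the "instrument path length" metric are hypothetical illustrations, not data from the review: the supervised function uses input features PLUS known output labels (akin to OSATS-rated performances), while the unsupervised function extracts two clusters from input features alone.

```python
# Hypothetical toy contrast between supervised and unsupervised learning.
# Scores represent a made-up "instrument path length" metric (shorter = more expert).

def nearest_neighbor_classify(train, labels, x):
    """Supervised: input features plus known output labels (e.g. expert/novice)."""
    dists = [abs(t - x) for t in train]
    return labels[dists.index(min(dists))]

def two_means_cluster(data, iters=10):
    """Unsupervised: input features only; extract two clusters (2-means)."""
    c0, c1 = min(data), max(data)  # initialize centroids at the extremes
    for _ in range(iters):
        g0 = [d for d in data if abs(d - c0) <= abs(d - c1)]
        g1 = [d for d in data if abs(d - c0) > abs(d - c1)]
        c0 = sum(g0) / len(g0)  # move each centroid to its group's mean
        c1 = sum(g1) / len(g1)
    return c0, c1

train_scores = [12.0, 14.0, 30.0, 33.0]
train_labels = ["expert", "expert", "novice", "novice"]

nearest_neighbor_classify(train_scores, train_labels, 13.0)  # → "expert"
two_means_cluster(train_scores)                              # → (13.0, 31.5)
```

The contrast is the point: the classifier cannot run without rater-supplied labels, while the clustering step finds the expert/novice split on its own but leaves a human to name the clusters.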

Overall study quality was good.  However, there was little, if any, validity evidence included in the studies.

Not surprisingly, the quality of the automated assessments was highly dependent on the quality of the data, which was typically a function of the context and the competency being assessed.

Key Conclusions

The authors conclude…

“A growing number of studies have attempted to apply ML techniques to physician competence assessment. Although many studies have investigated the feasibility of certain techniques, more validation research is needed.”

Spare Keys – other take home points for clinician educators

Like all discussion of AI, this study raises some philosophical questions.

  1. What happens to our learning environments when there is an increasing emphasis on data collection to support AI-facilitated assessment?
  2. How does a physician competency framework skew when the complex constructs that inform the Intrinsic Roles are underrepresented?
  3. Does a professional become a technician under the influence of automated assessment?
  4. What are the unintended consequences of an assessment model, where the correlation between variables does not have a natural explanation?


Access KeyLIME podcast archives here