Quantifying Surgeon Intuition Using a Judgment Analysis Model: Surgeon Accuracy of Predicting Patient-Reported Outcomes in Patients Undergoing Hip Arthroscopy for Femoroacetabular Impingement Is Moderate at Best
Address correspondence to Douglas A. Zaruta, M.D., Department of Orthopaedics, University of Rochester Medical Center, 601 Elmwood Ave., PO Box 665, Rochester, NY 14642.
To quantify surgeon intuition, determine whether a surgeon’s prediction of outcomes after hip arthroscopy correlates with actual patient-reported outcomes (PRO), and identify differences in clinical judgment between expert and novice examiners.
Methods
This prospective, longitudinal study was conducted at an academic medical center on adults undergoing primary hip arthroscopy for treatment of femoroacetabular impingement. A Surgeon Intuition and Prediction (SIP) score was completed preoperatively by an attending surgeon (expert) and physician assistant (novice). Baseline and postoperative outcome measures included legacy hip scores (e.g., Modified Harris Hip Score) and Patient-Reported Outcomes Measurement Information System tools. Mean differences were assessed using t-tests. Generalized estimating equations assessed longitudinal changes. Pearson correlation coefficients (r) evaluated associations between SIP score and PRO scores.
Results
Data from 98 patients (mean age 36 years, 67% female) with complete data sets at 12-month follow-up were analyzed. Weak-to-moderate strength correlations were seen between SIP score and PRO scores (r = 0.36 to r = 0.53) for pain, activity and physical function. Significant improvements were seen in all primary outcome measures at 6 and 12 months postoperatively when compared to baseline scores (P < .05), with about 50% to 80% of patients achieving the minimum clinically important difference and patient acceptable symptomatic state thresholds postoperatively.
Conclusions
An experienced, high-volume hip arthroscopist had only weak-to-moderate ability to intuitively predict PRO. Surgical intuition and judgment were not superior in an expert examiner compared to a novice.
Level of Evidence
Level III, retrospective comparative prognostic trial.
Predicting outcomes after surgery has powerful clinical implications for improving patient care.
Accurately forecasting which patients will have the best outcomes is difficult, however, because of multiple risk factors and other patient-specific variables. It is not surprising that human cognition is regarded conceptually as a “black box,” a term describing a system whose internal algorithms and processes between known inputs and outputs are not clearly understood.
Judgment is thought to develop over time based on past experiences and knowledge of risk factors.
Numerous studies have identified individual risk factors for positive and negative surgical outcomes after hip arthroscopy in the treatment of femoroacetabular impingement (FAI).
Predictors of positive outcomes include younger age, male sex, body mass index less than 25, Tönnis grade 0, pain relief from preoperative intra-articular hip injections, and lower preoperative baseline patient-reported outcomes (PRO) scores.
Similarly, studies have identified predictors of negative outcomes, including symptom duration greater than 8 months, age greater than 45 years, chondral defects, decreased joint space greater than 2 mm, and increased lateral center edge angle.
While individual risk factors provide good information, these types of studies are limited by the fact that actual patient outcomes are not predicated on a singular risk factor but rather are complex and multifactorial in nature.
Use of judgment analysis (JA) models can provide a valid means to quantitatively study surgical judgment and decision-making, thus overcoming the black box concerns of human cognition.
Jacklin et al. used a JA model to assess the ability of trainee surgeons to predict the likelihood that a patient undergoing a laparoscopic cholecystectomy would need to be converted to an open approach. They reported that the mean correlation of prediction was 0.48 compared with the gold standard epidemiologic model; however, there was large variation among individual surgeons. Woodfield et al. similarly used a JA model to show that surgeons were able to make meaningful preoperative predictions of major complications following abdominal surgery.
PRO measures are powerful tools commonly used in orthopaedics to track changes to a patient’s physical, social, and mental health following treatment but also can be used as a prediction tool.
Patient-Reported Outcomes Measurement Information System (PROMIS) tools have been previously validated against legacy outcome measures for hip arthroscopy, which include the Modified Harris Hip Score (mHHS), Non-Arthritic Hip Score (NAHS), Hip Outcome Score (HOS), and visual analog scale (VAS) pain score.
Use of PRO scores in a JA model to determine whether a surgeon’s intuition and judgment are actually predictive of patient outcomes after hip arthroscopy would provide valuable insight into the accuracy of a surgeon’s prediction. The purposes of this study were to quantify surgeon intuition, determine whether a surgeon’s prediction of outcomes after hip arthroscopy correlates with actual PRO, and identify differences in clinical judgment between expert and novice examiners. It was hypothesized that an experienced hip arthroscopist would have strong surgical judgment to intuitively predict patient outcomes. It was also hypothesized that an experienced examiner would have overall stronger clinical judgment compared to a novice examiner.
Methods
After institutional review board approval, patients 18 years of age and older who elected to undergo hip arthroscopy for FAI were recruited to participate in this prospective, longitudinal cohort study. All surgeries were performed at a single academic medical center by a single surgeon with an active enrollment period between November 2017 and April 2019. All patients initially did not respond to conservative treatment options and met standard indications for undergoing hip arthroscopy. Patients were excluded if they had any evidence of osteoarthritis (e.g., >2 mm joint space narrowing anywhere along the sourcil), previous hip surgery, or if they were undergoing hip arthroscopy as a staged procedure for another procedure (e.g., periacetabular osteotomy).
Quantification of Surgeon Intuition
A Surgeon Intuition and Prediction (SIP) questionnaire was created to assess the surgeon’s prediction of patient outcomes based on perceived patient response to treatment. Unlike traditional scales, the SIP score maximizes bias by incorporating one’s “gut reaction” to sensory perceptions and objective findings identified during the patient encounter that are thought to positively or negatively influence patient outcomes. Key areas in which these cognitive transactions take place are during the patient history, physical examination, and review of imaging studies. In addition, initial and final impressions during the patient encounter can provide further information. The SIP questionnaire was developed to incorporate all 5 of these domains and is detailed in Figure 1.
Fig 1. Surgeon Intuition and Prediction (SIP) questionnaire.
The SIP questionnaire was completed electronically within 1 to 2 weeks of surgery date at the preoperative office visit. One attending surgeon (expert) and one physician assistant (novice) with 10 years’ difference of training and experience both completed the questionnaire. The senior author (B.G.) is an experienced hip arthroscopist with more than 12 years of posttraining surgical experience, performs more than 400 arthroscopic hip procedures annually, teaches on an international level, and maintains a practice that is committed to comprehensive hip-preservation medicine. Both examiners were given the same instructions to place a mark on a VAS in each of the 5 domains according to the perceived effect on patient outcome following surgery. The measured distance across the standardized 100-mm line on the VAS was converted into a calibrated 20-point score for each domain, giving a possible total SIP score of 100. Greater SIP scores indicate better outcomes.
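The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the authors' instrument: the function and domain names are hypothetical, and the linear conversion from millimeters to points is an assumption, since the text states only that each 100-mm mark was converted into a calibrated 20-point score.

```python
from typing import Mapping

# Hypothetical identifiers for the 5 domains named in the text.
DOMAINS = (
    "history",
    "physical_exam",
    "imaging",
    "initial_impression",
    "final_impression",
)
VAS_LENGTH_MM = 100.0      # standardized line length stated in the text
POINTS_PER_DOMAIN = 20.0   # per-domain maximum stated in the text


def domain_score(mark_mm: float) -> float:
    """Convert a VAS mark (mm along the line) to a 0-20 domain score.

    Assumption: the conversion is linear; the text does not specify it.
    """
    if not 0.0 <= mark_mm <= VAS_LENGTH_MM:
        raise ValueError("VAS mark must lie between 0 and 100 mm")
    return mark_mm * POINTS_PER_DOMAIN / VAS_LENGTH_MM


def sip_score(marks_mm: Mapping[str, float]) -> float:
    """Sum the 5 domain scores into a total SIP score out of 100."""
    missing = [d for d in DOMAINS if d not in marks_mm]
    if missing:
        raise ValueError(f"missing domains: {missing}")
    return sum(domain_score(marks_mm[d]) for d in DOMAINS)
```

With illustrative marks of 80, 60, 70, 50, and 90 mm across the 5 domains, the domain scores are 16, 12, 14, 10, and 18, for a total SIP score of 70.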
Importantly, no deviations to the standard of care occurred in the treatment of any patient. All patients underwent an exhaustive trial of nonoperative treatment including rest, modified activity, pharmacologic therapy, and physiotherapy. Only after a patient was properly indicated for surgery were they then approached for enrollment into this study. Any identifiable risk factors were addressed with the patient by the surgeon and attempts were made to modify them before surgery.
Patient-Reported Outcomes
Patients completed a battery of 7 PRO questionnaires electronically at various time points in the preoperative and postoperative periods. Legacy hip outcome scores included HOS-Sport, HOS-ADL, mHHS, and NAHS. In addition, PROMIS-Physical Function (PF), PROMIS-Pain Interference (PI), and PROMIS-Depression (D) tools were used. The questionnaires were completed either on an electronic tablet in the office or via e-mail immediately following the office appointment. Results were collected in an electronic database (REDCap, v11.0.3; Vanderbilt University, Nashville, TN). In total, the surveys took approximately 10 to 20 minutes to complete. Baseline assessments were completed 1 to 2 weeks before surgery at the preoperative visit. Postoperatively, the questionnaires were readministered at routine 6-month and 12-month follow-up appointments. The surgical team was blinded to patient outcome scores during the study duration.
Data Analysis
Descriptive statistics were used to characterize the study sample. Due to the longitudinal nature of our study, there were missing data points for relevant outcome measures at various time points. Descriptive statistics were calculated on the entire sample and for subjects with complete data. A complete case analysis was conducted for subjects with complete data for all PRO measures at baseline, 6 months, and 12 months. Means for all 7 outcome measures were calculated, and paired t-tests were used to assess changes from baseline for all outcome measures. To expand upon this complete case analysis, we used generalized estimating equation regression models as a secondary analysis on all patients. This method is well suited to analyzing longitudinal data that have missing observations and is often used when the population-averaged effects are of primary interest, rather than individual changes. Seven separate generalized estimating equation regression models were constructed, one for each measure. The number and percentage of patients achieving minimal clinically important difference (MCID) and patient acceptable symptomatic state (PASS) were calculated using standard thresholds and definitions as previously described in the literature.
If a PRO did not have an established value for PASS, as reported in the literature, one-half the standard deviation of the mean at baseline in the sample population was used to assess whether subjects achieved this threshold. An absolute value for r between 0.4 and 0.6 indicated a moderate strength of correlation between variables. All data analysis was conducted in SAS, version 9.4 (SAS Institute, Cary, NC). Statistical significance was set a priori at P < .05.
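As an illustration of the correlation step, the sketch below computes a Pearson coefficient and labels its strength. It is not the SAS procedure used in the study. The function names are hypothetical, and only the moderate band (absolute r between 0.4 and 0.6) comes from the text; the "weak" and "strong" labels outside that band are assumptions added to make the example complete.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (n >= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r: float) -> str:
    """Label |r|: 0.4 to 0.6 is 'moderate' per the text.

    Assumption: values below 0.4 are called 'weak' and values above 0.6
    'strong'; the text defines only the moderate band.
    """
    magnitude = abs(r)
    if magnitude < 0.4:
        return "weak"
    if magnitude <= 0.6:
        return "moderate"
    return "strong"
```

Taking the absolute value before labeling matters here because instruments such as PROMIS-PI and PROMIS-D are scored so that lower values are better, which yields negative coefficients against the SIP score.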
Results
A total of 188 patients with minimum 1-year follow-up met inclusion criteria and were enrolled during the study time period. Thirty-three patients voluntarily withdrew, leaving 155 patients who completed the study. Due to the longitudinal nature of the study, there were 57 patients with missing follow-up data. The primary data analysis was performed on 98 patients (mean age 36.1 years, 67% female) who had complete data sets at all time points. A secondary longitudinal analysis was performed on 155 patients, which included patients who were missing any data points. All patients underwent primary hip arthroscopy to treat symptomatic mixed-type FAI with associated labral, chondral, and synovial pathology. Mean traction time was 34 minutes. Table 1 shows demographic data for both sample populations.
The results from the primary analysis of PRO scores at baseline, 6 months, and 12 months postoperatively are shown in Table 2 and Figure 2. The difference of means between each time point for this same population is reported in Table 3 and Figure 3. There were significant improvements in mean PRO scores from baseline to 6 months for all 7 PRO instruments (P < .05). Similarly, there were significant improvements in difference of means from baseline to 12 months for all PRO instruments (P < .05). Between 6 months and 12 months postoperatively, there were significant improvements in mean HOS-Sport and PROMIS-PF scores (P < .05). The remaining outcome scores also had continued improvements from 6 months to 12 months postoperatively but these did not reach statistical significance.
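The paired comparisons against baseline can be sketched as follows. This is a minimal illustration with a hypothetical function name, not the study's SAS analysis, and it stops at the test statistic: obtaining a P value requires the Student t distribution with the returned degrees of freedom, which is not computed here.

```python
import math
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], follow_up: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for a paired t-test.

    Each position in the two sequences must belong to the same patient.
    """
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired samples of equal length (n >= 2)")
    # Work on within-patient differences (follow-up minus baseline).
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when all differences are equal")
    standard_error = math.sqrt(variance / n)
    return mean_diff / standard_error, n - 1
```

Because the test operates on within-patient differences, it applies only to patients with scores at both time points, which is consistent with restricting the primary analysis to complete data sets.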
Table 2. Mean PRO Scores at Baseline, 6 Months, and 12 Months

| Measure | Baseline Mean | Baseline SD | 6 Months Mean | 6 Months SD | 12 Months Mean | 12 Months SD |
|---|---|---|---|---|---|---|
| HOS-ADL | 71.25 | 13.82 | 88.37 | 11.17 | 89.88 | 13.69 |
| HOS-Sport | 56.98 | 20.24 | 70.50 | 23.25 | 78.88 | 21.83 |
| NAHS | 61.20 | 16.35 | 83.42 | 14.33 | 84.78 | 17.42 |
| mHHS | 58.16 | 14.36 | 81.57 | 16.57 | 81.85 | 20.04 |
| PROMIS-PF | 41.53 | 5.96 | 48.68 | 8.12 | 51.42 | 10.23 |
| PROMIS-PI | 59.71 | 5.90 | 50.16 | 8.71 | 48.87 | 8.83 |
| PROMIS-D | 44.62 | 9.73 | 42.87 | 9.33 | 42.00 | 9.16 |
HOS-ADL, Hip Outcome Score – Activities of Daily Living; HOS-Sport, Hip Outcome Score –Sport-Specific Scale; mHHS, Modified Harris Hip Score; NAHS, Non-Arthritic Hip Score; PRO, patient-reported outcome; PROMIS-D, Patient-Reported Outcomes Measurement Information System, Depression; PROMIS-PF, Patient-Reported Outcomes Measurement Information System, Physical Function; PROMIS-PI, Patient-Reported Outcomes Measurement Information System, Pain Interference; SD, standard deviation.
Fig 2. Mean scores on all patient-reported outcome measures at baseline and postoperatively. (HOS-ADL, Hip Outcome Score – Activities of Daily Living; HOS-Sport, Hip Outcome Score – Sport-Specific Scale; mHHS, Modified Harris Hip Score; NAHS, Non-Arthritic Hip Score; PROMIS, Patient-Reported Outcomes Measurement Information System.)
Fig 3. Difference of mean scores on all patient-reported outcome measures at baseline and postoperatively. (HOS-ADL, Hip Outcome Score – Activities of Daily Living; HOS-Sport, Hip Outcome Score – Sport-Specific Scale; mHHS, Modified Harris Hip Score; NAHS, Non-Arthritic Hip Score; PROMIS, Patient-Reported Outcomes Measurement Information System.)
Rates of achieving MCID and PASS thresholds are shown in Tables 4 and 5, respectively. MCID threshold was reached in the majority of patients by 6 months postoperatively for mHHS (82%), NAHS (81%), PROMIS-PI (77%), PROMIS-PF (65%), and HOS-Sport (60%). At 12 months postoperatively, MCID threshold was reached in the majority of patients for PROMIS-PI (83%), NAHS (81%), mHHS (80%), HOS-Sport (76%), and PROMIS-PF (73%). There was an increase in the percentage of patients who met MCID threshold from 6 months to 12 months postoperatively for HOS-Sport, PROMIS-PF, and PROMIS-PI. More than 70% of patients achieved PASS by 6 months postoperatively for HOS-ADL and PROMIS-PF scores.
Table 4. Rates of Achieving MCID Threshold Postoperatively
Correlations between SIP score and PRO scores comparing novice and expert examiners at both postoperative time points are shown in Table 6. Negative Pearson correlation coefficients (r) for PROMIS-PI and PROMIS-D reflect the directionality of these PRO instruments in comparison with the other PRO tools (i.e., lower pain and depression scores indicate a better score). P values reached statistical significance for all measures except PROMIS-D in the expert group at the 6-month follow-up.
Table 6. Correlations Between SIP Score and PRO Scores at 12 Months Postoperatively
At 6 months postoperatively, a moderate-strength correlation was seen between the expert SIP score and PRO scores for mHHS (r = 0.50), HOS-ADL (r = 0.50), NAHS (r = 0.45), and PROMIS-PF (r = 0.41). Novice SIP scores also had moderate-strength correlation with mHHS (r = 0.48), HOS-ADL (r = 0.50), NAHS (r = 0.47), PROMIS-PF (r = 0.40), and PROMIS-PI (r = 0.42) at 6 months postoperatively. Weak correlations were seen for HOS-Sport, HOS-ADL, PROMIS-D, and PROMIS-PI for novice and expert examiners at various time points as seen in Table 6. The strength of correlations remained similar between the 2 postoperative time intervals for combined examiners with a mean overall correlation of r = 0.40 and r = 0.39 at 6 and 12 months, respectively (Table 7). Comparison between examiner skill levels showed that the expert examiner had decreasing overall mean correlation strength of combined outcome measures from 6 months (r = 0.39) to 12 months (r = 0.34) postoperatively, whereas the overall mean correlation for the novice examiner improved marginally from r = 0.42 to r = 0.43.
Table 7. Overall Mean Correlation Strength (r) Across Examiners and Time Points
Figure 4 presents the results of the secondary analysis (full sample including patients with missing data points, n = 155) showing longitudinal changes in all PROs from baseline using generalized estimating equation regression analyses. All PROs similarly demonstrated a statistically significant improvement from baseline. Scores for HOS-ADL, HOS-Sport, NAHS, and PROMIS-PF significantly increased from baseline to 6 months and 12 months. Scores for PROMIS-D and PROMIS-PI significantly decreased over time.
Fig 4. Longitudinal changes from baseline for all outcome measures using a generalized estimating equation regression model (N = 155).
The main finding was that reported patient outcomes following hip arthroscopy had overall only weak-to-moderate strength of correlation with a surgeon’s prediction of those outcomes. The authors’ hypotheses were largely refuted by the results of this study. Although our results support that an experienced hip arthroscopist can reasonably predict outcomes for some patients, there was not overwhelmingly strong correlation between predicted and actual outcomes as was hypothesized.
In addition, our data refute the hypothesis that an expert examiner should have better clinical judgment of surgical outcomes than a novice examiner. In fact, our data show that the novice examiner maintained or even improved predictive accuracy from 6 months to 12 months postoperatively, whereas the expert examiner showed a trend toward decreased strength of correlation over time.
Surgeons use cognitive shortcuts, called heuristics, on conscious and subconscious levels in everyday practice to predict which patients they believe may have the best or worst surgical outcomes. Surgeons estimate risk intuitively through a complex cognitive process that weighs risk factors and draws on past experiences.
However, the exact mechanism by which surgeons produce a risk estimate is unknown; it is therefore often regarded as a “black box” phenomenon in which the inputs and outputs are known but the internal algorithms are not well understood.
Experienced surgeons often cite their clinical acumen and gestalt as a guide for decision-making. Yet, the results from our study suggest that there are limitations to even an experienced surgeon’s ability to predict outcomes. The lead surgeon (B.G.), an experienced hip arthroscopist with more than 12 years of surgical experience who performs more than 400 cases per year, had no better predictive ability than a more novice physician assistant with 10 years less experience in hip arthroscopy, and in some cases showed worse judgment. At 6 months postoperatively, the attending surgeon and physician assistant had similar scores of clinical judgment, but interestingly, at time of final follow-up, the novice examiner had stronger clinical judgment of patient outcomes and the expert examiner had worse judgment.
Woodfield et al. reported that surgeons made meaningful preoperative predictions of major complications after abdominal surgery using a 100-mm VAS similar to the one used in the present study. They concluded that the unique contribution of a surgeon’s clinical assessment should be considered in predictive models for estimating surgical risks. Jacklin et al. also used a JA model to assess the ability of trainee surgeons to predict the likelihood that a patient undergoing a laparoscopic cholecystectomy would need to be converted to an open approach. In that study, the authors found the mean correlation to be 0.48 ± 0.14 compared with a gold standard model. By comparison, the mean overall correlation across all examiners in the present study was 0.40.
The results of this study also showed that most patients achieved clinically significant improvements by 6 months from the date of surgery, which is consistent with other reports in the literature.
Overall, the patients in this study met MCID and PASS thresholds at similar levels compared to other studies of patients undergoing hip arthroscopy for FAI. Ishoi et al.
found that less than one-half of patients (46%) undergoing hip arthroscopy for FAI had achieved PASS. Our data show similar values with 49% of patients in the present study achieving PASS thresholds for PROMIS-PF, although 73% achieved PASS thresholds for mHHS. Mullins and Carton
found that 86% of competitive athletes undergoing hip arthroscopy for FAI achieved MCID for mHHS at the 2-year follow-up. In our study, only 73% of our patients achieved MCID for mHHS. However, it should be noted that our sample was a mixed population, and competitive athletes have been previously shown to achieve MCID at greater rates compared with nonathletes.
This study was not without limitations. First, we used an unsophisticated prediction model as a means to quantify surgeon intuition of patient outcomes. We accept that the SIP questionnaire is not a validated instrument. However, our scale and methodology were modeled after studies with similar methodologic approaches of using a JA model to quantify surgeon assessment of risk and complications. Woodfield et al.
reported that surgeons’ risk estimates using a 100-mm VAS, although subjective, were still more accurate predictors of postoperative complications than objective data. It should be noted that, because of this limitation, the limited correlation between SIP score and PRO score may be due to the SIP tool itself rather than the surgeon’s ability to predict differences. Future studies should look to validate this instrument to assess whether it can truly detect differences in outcome scores. In addition, inter-rater reliability between surgeons and procedures also should be explored with further study in this area. A second limitation is the potential for performance bias if the attending surgeon were to consciously or subconsciously adjust surgical technique based on his preoperative risk estimates. The attending surgeon in this study performs more than 400 hip arthroscopies per year and limited variability by using a systematic approach to guide the surgical technique. In addition, the surgeon and novice examiner were blinded to their own predictions before surgery and did not have access to patient outcome scores during the course of the study. Third, due to the prospective and longitudinal nature of the study, we experienced patient withdrawal and loss to follow-up, which decreased our sample size. However, we believed it was important to maintain the integrity of our statistical analysis by including only patients with complete data sets in our primary analysis. We also performed a secondary analysis of our larger sample, which had incomplete data sets. This secondary analysis demonstrated similar demographic and PRO scores compared with our primary data set. Next, we recognize that the results of a single surgeon and physician assistant using this predictive model do not necessarily allow generalizability to other surgeons and other procedures.
Lastly, future studies also may look to use more sophisticated artificial intelligence models for predicting patient outcomes. Recent published work using machine learning models has shown accurate prediction of MCID achievement after hip arthroscopy.
An experienced, high-volume hip arthroscopist had only weak-to-moderate ability to intuitively predict PRO. Surgical intuition and judgment were not superior in an expert examiner compared to a novice.
The authors report that they have no conflicts of interest in the authorship and publication of this article. Full ICMJE author disclosure forms are available for this article online, as supplementary material.