Editor’s  Note: PPP sponsors educational programming for its members and invites all interested parties to take part in our periodic Journal Club.  For those new to the format, a journal club is a gathering of medical professionals to discuss a scientific paper.  The paper under discussion is distributed prior to the meeting date, so that all participants can read it.  The club meeting begins with a member presenting a summary of the chosen paper.  Then the discussion begins, including questions about the study design, methods, results, and conclusions.  Finally, a consensus is rendered regarding the overall value of the paper. 

This blog entry introduces and summarizes our next Journal Club paper, so you’ll be ready for the discussion!

Join us on Twitter to participate in this Journal Club
July 29, 2020
9p EST
@pppforpatients #PPPJC

Association Between Resident Physician Training Experience and Program-Level Performance on Board Exams
Ryan J. Ellis MD, MS, Yue-Yung Hu MD, MPH, Andrew T. Jones PhD, Jason P. Kopp PhD, Nathanial J. Soper MD, David B. Hoyt MD, and Jo Buyske MD


Is there an association between resident physician perceptions of their education and program-level performance on the American Board of Surgery (ABS) board examination? 


The American Board of Surgery (ABS) has a two-step process for becoming board-certified. The first step is a qualifying examination and the second is the certifying examination. In the past, success on the first step, the qualifying examination, could be predicted by performance on the ABS in-training examination, a yearly assessment of a surgical resident’s progression through training. This in-training examination is a 5-hour, multiple-choice examination that assesses residents’ knowledge and management of clinical issues related to general surgery.

The in-training examination also predicts success on the certifying examination. Other factors previously shown to predict success are Alpha Omega Alpha status, high class rank in medical school, and scores on the United States Medical Licensing Examination.

The Bradford-Hill criteria are the most widely accepted criteria for assessing causation. They include establishing the following: strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, and analogy. More information on how to apply these can be found here


The authors administered a multiple-choice survey immediately following the in-training examination in January of 2019. Inclusion criteria were any resident in years 2 through 5 and availability of program-level board examination pass rates. Survey questions covered topics such as time in the operating room, operative autonomy, clinical autonomy, and overall satisfaction with their education. These items were then aggregated at the program level, as were the in-training examination scores.


In total, 6,269 residents from 248 programs answered the survey (response rate of 85.3%). Passing the certifying examination at the program level was associated with larger program size (OR 3.49) and program in-training exam performance (OR 2.45). Passing the certifying examination was also associated with residents reporting an appropriate amount of time in the operating room (OR 4.19). However, examination performance was not associated with operative autonomy, clinical autonomy, or satisfaction with overall education.

PPP Initial Discussion/Critique

Level of Evidence:
This sort of study (a retrospective multiple-choice questionnaire), even when done well, represents low-level evidence (level 4).

Questionnaire Bias Discussion:
In particular, questionnaires like this can be subject to “forced choice format bias” and many other priming and recall issues. The authors appear to have avoided Yes/No questions by using a Likert scale instead, which is helpful. We would need to read the questions themselves to look for other problems such as priming bias.

Sampling Bias Discussion:
For the most part, sampling bias was avoided because of the large number of programs surveyed. The response rate (85%) was excellent, which is unusual for a survey. Still, the 15% who did not respond represent an opportunity for nonresponse bias: if non-respondents differ systematically from respondents, this could significantly affect the reported data. Perhaps those who did not respond were particularly disaffected and unhappy about their programs and the test they had just taken? Such a bias could skew the data in either direction.

Another methodological problem is that individual scores are not tied to individual responses. It is understandable why this was done, but it is a limitation. It compounds the potential nonresponse bias because the scores of all residents (good and bad) are represented in the programs’ rolling averages of test performance, but these may not match up with those who chose to respond. Did only happy, good residents who had just done well on the test respond? If so, poor performers and their opinions would be undersampled relative to the test scores. Did only angry, disaffected residents choose to respond? Then good test scores would be overrepresented in the rolling averages relative to the survey responses. These sound like minor issues, but such undersampling or oversampling, even by a few percentage points, can make a huge difference in marginally statistically significant findings.

Statistical Analysis Discussion:
The choice to divide responses into tertiles is interesting. This may be a place where some statistical manipulation occurred. Using tertiles in place of quartiles or quintiles is an unusual choice. Note that the program scores were divided into quartiles, so quartiles would have been the more natural and statistically consistent choice here as well. Often, after data collection and analysis, data are re-analyzed with various changes until statistically significant results are found. A good study specifies its statistical choices, tools, and outcomes a priori (along with the reasoning for those choices), and the analyses are done by a third party (triple blinding). This was not done in this case. If we had the dataset, we might find that the choice of tertiles as opposed to quartiles was necessary to find a significant P-value. This is called “P-hacking” and it is very common. Triple blinding and a priori statistical decisions are the only way to really avoid it. So-called “researcher degrees of freedom” are manipulated until the data deliver the desired result, and such P-hacking is easy to do, even unintentionally. The choice of tertiles, we believe, is the most important researcher degree of freedom in this study.
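To make the “researcher degrees of freedom” point concrete, here is a minimal simulation sketch. It uses purely synthetic data, not the study’s dataset: survey scores and a pass/fail outcome are generated independently, so any “significant” result is a false positive. The 248 programs match the number reported in the paper; the 50% pass rate, the pooled two-proportion z-test, and the function names are our own assumptions for illustration. It compares a single prespecified split (quartiles) against picking the best P-value from tertiles, quartiles, or quintiles.

```python
import math
import random


def two_proportion_p_value(pass_a, n_a, pass_b, n_b):
    """Two-sided P-value from a pooled two-proportion z-test."""
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (pass_a / n_a - pass_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def top_vs_bottom_p_value(programs, n_groups):
    """Compare the top and bottom groups of programs sorted by survey score."""
    size = len(programs) // n_groups
    bottom, top = programs[:size], programs[-size:]
    return two_proportion_p_value(
        sum(passed for _, passed in top), size,
        sum(passed for _, passed in bottom), size,
    )


def simulate(n_studies=2000, n_programs=248, pass_rate=0.5, seed=0):
    """Return (prespecified, flexible) false positive rates under the null."""
    rng = random.Random(seed)
    prespecified_hits = flexible_hits = 0
    for _ in range(n_studies):
        # Survey score and outcome are independent: there is no true effect.
        programs = sorted(
            (rng.random(), rng.random() < pass_rate) for _ in range(n_programs)
        )
        p_values = {k: top_vs_bottom_p_value(programs, k) for k in (3, 4, 5)}
        prespecified_hits += p_values[4] < 0.05          # quartiles only
        flexible_hits += min(p_values.values()) < 0.05   # best of three splits
    return prespecified_hits / n_studies, flexible_hits / n_studies


prespecified, flexible = simulate()
print(f"Prespecified quartiles: {prespecified:.1%} false positives")
print(f"Best of 3 splits:       {flexible:.1%} false positives")
```

By construction, the flexible rate can never be lower than the prespecified one; how much higher it is depends on the simulation settings.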

The choice of a two-tailed P-value is appropriate since the compared factors could have effects in both directions. Declaring a P-value of less than 0.05 to be significant is always a choice. Strictly speaking, one must remember that this is not the true error rate, since it only accounts for Type I errors and not Type II errors in the Neyman-Pearson framework of false findings in a study. The Type II errors cannot be accounted for because we do not know the power of the study, though with such a large number of subjects our intuition is that the power is excellent. Since this is akin to a retrospective study, no power analysis was performed, which always makes understanding P-values more difficult. This is all very technical, but it goes to the point that the P-value alone is not always that meaningful.

More important is the “multiple comparators problem.” Essentially, the authors are testing several hypotheses at once with their questionnaire. They are looking for relationships between program type, size, and ABSITE performance and performance on the qualifying and certifying exams, as well as relationships between time in the OR, OR autonomy, clinical autonomy, and satisfaction with education and performance on those two exams. In effect, they are testing 14 different hypotheses. An example of one such hypothesis would be: “We believe that top tertile satisfaction with time in operating room among residents is associated with improved performance on the certifying examination (but not the qualifying examination).”

The multiple comparators problem is epidemic among these sorts of studies. It is essentially data-mining and it is a real problem. There are ways to adjust for this, and the easiest method (and perhaps the one they should have used) is a Bonferroni correction. You simply divide the nominal significance threshold (0.05) by the number of comparisons (14) to arrive at a new threshold for statistical significance: 0.0036. There are other, more complex methods, but they do not apply to this sort of study. Another way of thinking about this is that if you test 14 different hypotheses, the accepted 5% risk of a false positive finding is repeated 14 times. You get 14 chances to make that error. This gives a 51.2% chance of at least one false positive finding in the study (1 − 0.95^14, assuming independent tests), even if everything else was perfect. Using the Bonferroni-corrected threshold brings that 51.2% back down to about 5%.
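The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the numbers quoted in this discussion (a threshold of 0.05 and 14 hypotheses); the function names are ours, and the family-wise calculation assumes the tests are independent, which is a simplification.

```python
ALPHA = 0.05   # nominal significance threshold
N_TESTS = 14   # number of hypotheses counted in this critique


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def familywise_error_rate(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTS)
uncorrected = familywise_error_rate(ALPHA, N_TESTS)
corrected = familywise_error_rate(threshold, N_TESTS)

print(f"Bonferroni threshold:           {threshold:.4f}")
print(f"Family-wise error, uncorrected: {uncorrected:.1%}")
print(f"Family-wise error, corrected:   {corrected:.1%}")
```

The corrected family-wise rate comes out slightly under 5% rather than exactly 5%, which is why Bonferroni is often described as a conservative correction.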

From what is presented in the paper, the tests of significance and multivariate analyses were appropriate. 

The authors report 28 P-values, so one might argue that they had 28 chances to get a good P-value, not just 14. They claimed three findings to be significant. The first two: qualifying exam scores were better for large programs compared to small ones, and better for programs with top ABSITE scores compared to poor ones. These P-values were only 0.04, which is not significant after correction. The prestudy probability of these hypotheses is higher since these findings have been previously observed and several causal inferences can be made (big programs attract better students, people who do better on the ABSITE are probably going to do better on the qualifying exam, etc.). At the same time, it may contradict these observations that performance on the certifying exam was not also better. In any event, the findings would not be statistically significant if even two hypotheses were being tested, let alone 14 or 28.

The third finding, which they call “novel,” is that high performance on the certifying exam was associated with resident-assessed time in the OR. This produced the best P-value in the paper (0.009), but this number is still not statistically significant when appropriately corrected (another way of thinking about it is to multiply the P-value by 14, which gives 0.126, if you want to compare it in your mind to 0.05).
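The “multiply the P-value” shortcut can be applied to all three claimed findings at once. The sketch below uses the P-values as quoted in this discussion (0.04, 0.04, and 0.009); the labels and the function name are ours, and the choice of 2, 14, or 28 tests simply mirrors the counts discussed above.

```python
def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# P-values as quoted in this critique; the labels are our own shorthand.
reported = {
    "program size vs. qualifying exam": 0.04,
    "ABSITE performance vs. qualifying exam": 0.04,
    "time in OR vs. certifying exam": 0.009,
}

for n_tests in (2, 14, 28):
    for label, p_value in reported.items():
        adjusted = bonferroni_adjust(p_value, n_tests)
        verdict = "significant" if adjusted < 0.05 else "not significant"
        print(f"{n_tests:>2} tests | {label}: adjusted P = {adjusted:.3f} ({verdict})")
```

The adjusted value is compared against the usual 0.05, which is equivalent to comparing the raw P-value against the corrected threshold.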

Even if this finding were significant, the relationship certainly does not lend much support to a hypothesis of causality, a jump the authors quickly make when they hypothesize reasons for this discovery. The reality is that it is almost certainly a false positive and at the least requires replication before any attempt at explaining it. A good way to discuss this sort of finding is to apply the Bradford-Hill criteria for causality.

Lastly, I will point out that it is a sign of a poor paper that they used odds ratios here instead of relative risk, which shows a lack of statistical expertise. In this case, the odds ratios are going to be almost identical to the relative risk, but it is inappropriate to say, for example, that a program in the top tertile for operating room time is 4.19 times as likely to score well on the certifying exam; that statement is how one would use relative risk, which is expressed in terms of probabilities. Odds and probabilities are different. Authors tend to report odds ratios rather than relative risks because odds ratios tend to produce an inflated number, and most people just read them as relative risks. In other words, it makes the outcome look better than it is.
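To see the difference between odds and probabilities in numbers, the sketch below converts an odds ratio into a relative risk using the standard identity RR = OR / (1 − p0 + p0 × OR), where p0 is the probability of the outcome in the reference group. The odds ratio of 4.19 is the one quoted above; the baseline probabilities are hypothetical, since this summary does not give the reference-group rate, and the function names are ours.

```python
def odds(probability: float) -> float:
    """Convert a probability into odds."""
    return probability / (1 - probability)


def relative_risk_from_odds_ratio(odds_ratio: float, baseline_probability: float) -> float:
    """Relative risk implied by an odds ratio and a reference-group probability."""
    return odds_ratio / (1 - baseline_probability + baseline_probability * odds_ratio)


ODDS_RATIO = 4.19                           # as quoted for time in the operating room
HYPOTHETICAL_BASELINES = (0.01, 0.10, 0.50)  # assumed values, not from the paper

for baseline in HYPOTHETICAL_BASELINES:
    rr = relative_risk_from_odds_ratio(ODDS_RATIO, baseline)
    print(f"baseline probability {baseline:.0%}: OR = {ODDS_RATIO}, RR = {rr:.2f}")
```

How close the two measures are depends on the baseline probability: the rarer the outcome in the reference group, the closer the odds ratio is to the relative risk, and for any odds ratio above 1 the relative risk is the smaller number.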

Discussion Questions

1) What potential biases are present in this study? Other limitations? 

2) Why do we think the authors chose to divide the programs into tertiles instead of quartiles as convention dictates? Could this be evidence of “p-hacking”?

3) The authors cast a wide net with their study, proposing several hypotheses. Is this a form of data mining? What problems are associated with this type of method? Could doing a Bonferroni correction alleviate this? 

4) The authors made 3 conclusions. The first two were that qualifying exam scores were better for large programs compared to small and scores were better for top ABSITE scores compared to poor ones. What was the pre-test probability of this finding? Why? 

5) The final conclusion had the best p-value of 0.009 and was that performing well on the certifying exam was associated with resident-assessed time in the OR. Is this an accurate p-value? 

6) Does the conclusion meet Bradford-Hill criteria for causation? More info here

7) What are some other study ideas to establish that increased training time and experience leads to improved board scores? Are board scores a good surrogate marker for clinical performance?