Should we be concerned about multiple comparisons in hierarchical Bayesian models?

Published by Stephanie Mayer on

Ecologists increasingly use hierarchical Bayesian (HB) models to estimate group-level parameters that vary by, for example, species, treatment level, habitat type or other factors. Group-level parameters may be compared to infer differences among levels: for each pair of levels, we conclude a non-zero pairwise difference when the 95% credible interval for that difference excludes zero. Classical procedures suggest that this rejection rule should be adjusted to control the family-wise error rate (FWER) for a family of differences. Such adjustments have been considered unnecessary in HB models because of partial pooling: as pooling strength increases – group-level parameters become more alike – rejection rates (Type I error rate, FWER and power) should decrease and false acceptance rates (Type II error rate and its family-wise analogue) should increase.

To address this, we conducted a simulation experiment with factors of sample size, group size, balance (missingness), overall mean and the ratio of within- to between-group variances, resulting in 2016 factor-level combinations (scenarios), each replicated 100 times, producing 201,600 pseudo datasets analysed in a Bayesian framework. We evaluated the results in the context of a new partial pooling index (PPI), which we show is also applicable to more complex model structures, based on four real-data examples.

Simulation results confirm the intuition that rejection rates (false and true) decrease and false acceptance rates increase with increasing PPI, i.e. pooling strength (scenario-level R² = 0.81-0.97). The relationship with PPI differed greatly for balanced versus unbalanced designs and was affected by group size, especially for family-wise errors. Critically, an HB model does not guarantee that the FWER will stay at or below a set significance level (α); for example, even minor imbalance can lead to FWER > α for weak to moderate pooling.
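The unadjusted decision rule described above can be sketched as follows. The group count, posterior draws and numerical values here are hypothetical stand-ins; in a real analysis the draws would come from a fitted HB model's posterior:

```python
import numpy as np
from itertools import combinations

def pairwise_nonzero(draws, level=0.95):
    """Flag each pair (i, j) whose posterior difference has a credible
    interval excluding zero -- the unadjusted rule from the abstract.
    `draws` has shape (n_mcmc_draws, n_groups)."""
    alpha = 1 - level
    flags = {}
    for i, j in combinations(range(draws.shape[1]), 2):
        diff = draws[:, i] - draws[:, j]
        lo, hi = np.percentile(diff, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        flags[(i, j)] = bool(lo > 0 or hi < 0)
    return flags

# Toy "posterior": 4000 draws for 5 group-level means (hypothetical numbers).
rng = np.random.default_rng(1)
true_means = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
draws = rng.normal(true_means, 0.2, size=(4000, 5))
flags = pairwise_nonzero(draws)  # 10 pairwise decisions, no FWER adjustment
```

With 10 unadjusted pairwise tests, each at its own 5% level, the family-wise chance of at least one false rejection can clearly exceed 5% unless pooling shrinks the differences enough, which is the question the simulation experiment addresses.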
These results are confirmed by the real-data examples, suggesting that ecologists need to consider the FWER when applying HB models, especially for large group sizes or incomplete datasets. Contrary to current thought, HB models are not immune to issues of multiplicity, and our proposed PPI offers a way to evaluate whether a particular HB analysis is likely to produce FWER ≤ α (in which case no adjustment or alternative solution is required).
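The abstract does not define the PPI itself, so as a loose illustration of what a pooling-strength index can look like, here is a simple shrinkage-based measure (our own sketch, not the paper's PPI): it is 0 when partially pooled group estimates equal their no-pooling counterparts and 1 when they collapse to the grand mean:

```python
import numpy as np

def pooling_strength(no_pool_est, partial_pool_est):
    """Illustrative shrinkage-based pooling index (NOT the paper's PPI):
    the average fraction of the distance from each no-pooling group
    estimate to the grand mean that partial pooling has traversed.
    0 = no pooling, 1 = complete pooling."""
    no_pool = np.asarray(no_pool_est, dtype=float)
    partial = np.asarray(partial_pool_est, dtype=float)
    grand = no_pool.mean()
    moved = np.abs(no_pool - partial).sum()      # how far estimates shrank
    total = np.abs(no_pool - grand).sum()        # maximum possible shrinkage
    return moved / total

# Hypothetical example: each group mean moved 40% of the way to the grand mean.
no_pool = np.array([1.0, 2.0, 4.0, 5.0])        # grand mean = 3.0
partial = no_pool + 0.4 * (3.0 - no_pool)
print(pooling_strength(no_pool, partial))        # ~0.4
```

Under the abstract's findings, larger values of such an index would coincide with lower rejection rates (including FWER) and higher false acceptance rates, with the exact relationship depending on balance and group size.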