Power calculation for two-sample Welch's t test

The R function power.t.test does power calculations (it outputs power, sample size, effect size, or whichever parameter you leave out) for t-tests, but it only has a single parameter for sample size. The pwr package has a function pwr.t2n.test that performs calculations for a two-sample t-test with different sample sizes (n1, n2). Finally, the MKmisc package includes a function for Welch's t test (used for samples with different variances), but it too only has a single parameter for the sample size. I have not been able to find a formula for calculating the power of Welch's t test. Could someone help me out with the formula or an R function for this?

asked Jun 19, 2019 at 18:11

$\begingroup$ The package MKmisc has a function called power.welch.t.test. $\endgroup$ Commented Jun 19, 2019 at 19:19

$\begingroup$ @COOLSerdash I link to that package in my question. It does not have options for different sample sizes. $\endgroup$

Commented Jun 19, 2019 at 21:44

3 Answers

$\begingroup$

Comment: First, I would suggest you consider carefully whether you have a really good reason to use different sample sizes. Especially if the smaller sample is used for the group with the larger population variance, this is not an efficient design.

Second, you can use simulation to get the power for various scenarios. For example, if you use $n_1 = 20,\, \sigma_1 = 15,\,$ $n_2 = 50, \sigma_2 = 10,$ then you have about 75% power for detecting a difference $\delta = 10$ in population means with a Welch test at level 5%.

n1 = 20; n2 = 50; sg1 = 15; sg2 = 10; dlt = 10
set.seed(619)
pv = replicate(10^6, t.test(rnorm(n1, 0, sg1), rnorm(n2, dlt, sg2))$p.val)
mean(pv <= 0.05)   # proportion of rejections at the 5% level = estimated power

Because the P-value is taken directly from the procedure t.test in R, results should be accurate to 2 or 3 places, but this style of simulation runs slowly (maybe 2 or 3 min.) with a million iterations.

You might want to use 10,000 iterations if you are doing repeated runs for various sample sizes, and then use a larger number of iterations to verify the power of the final design.
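For example, a small wrapper along these lines (welch.power.sim is just an illustrative name, not a packaged function) makes such repeated runs convenient; with 10,000 replications each call only takes a few seconds:

# Simulated power of a two-sided Welch test at level alpha.
welch.power.sim <- function(n1, n2, sigma1, sigma2, delta, alpha = 0.05, reps = 10^4) {
  pv <- replicate(reps,
                  t.test(rnorm(n1, 0, sigma1), rnorm(n2, delta, sigma2))$p.value)
  mean(pv <= alpha)   # proportion of significant results = estimated power
}

set.seed(619)
welch.power.sim(20, 50, 15, 10, 10)   # about 0.75, as in the run above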

Changing to $n_2 = 20$ gives power 67%, so the extra 30 observations in Group 2 are not 'buying' you as much as you might hope. By contrast, a balanced design with $n_1 = n_2 = 35$ gives about 90% power (with everything else the same).

answered Jun 19, 2019 at 20:35

$\begingroup$ thanks for the ideas. For my situation, the experiment has been done. We have 40 microbiota and a multitude of covariates which we test in various subgroups. After multiple testing, we might select the biota having the most significance. Sometimes the risk category for a disease might have 10 patients but the non-risk only 3. Power analysis will help me know if I can trust the p-value. In this situation, simulation would be very inefficient given that the calculation can be done directly--assuming one has the formula or a function. $\endgroup$

Commented Jun 19, 2019 at 21:51

$\begingroup$ The reason I showed you a simulation is that I am not sure it is possible to give a simple formula for the exact power of the Welch t test. The given information has to be used to estimate the adjusted degrees of freedom, which in turn is used to approximate the relevant noncentral t distribution. // Of course, the best time to do a power/sample-size computation is before you do the experiment. I suppose the main reason it is hard to find programmed power procedures for unbalanced Welch tests is that most investigators try to avoid unbalanced designs. $\endgroup$

Commented Jun 19, 2019 at 22:03
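To make that approximation concrete, here is a sketch (my own code, not from a package; power.welch.approx is a made-up name): compute the Welch-Satterthwaite degrees of freedom from the assumed variances and sample sizes, then evaluate the rejection probability under the corresponding noncentral t distribution.

# Approximate power of a two-sided Welch test: Welch-Satterthwaite df from the
# assumed variances, then a noncentral t distribution for the rejection probability.
power.welch.approx <- function(n1, n2, sigma1, sigma2, delta, alpha = 0.05) {
  v1  <- sigma1^2 / n1
  v2  <- sigma2^2 / n2
  df  <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # Welch-Satterthwaite df
  ncp <- delta / sqrt(v1 + v2)                               # noncentrality parameter
  tcrit <- qt(1 - alpha/2, df)
  pt(tcrit, df, ncp, lower.tail = FALSE) + pt(-tcrit, df, ncp)
}

power.welch.approx(20, 50, 15, 10, 10)   # about 0.75, in line with the simulation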

$\begingroup$ For us, what can happen is that we get a load of patients, some with the disease and some without. We can try to make sure we get enough balance in some groups (case/control, gender, age, etc.), but we will still end up with unbalanced samples. For example, after we genotype 100 patients, we may find that in the case subgroup, for one risk factor, we end up with 10 homozygous risk, 3 homozygous non-risk, and the rest heterozygous. If we had the funding and time to enroll 1000 people in the study, we wouldn't have a sample size problem. But that's just how it can shake out. $\endgroup$

Commented Jun 20, 2019 at 1:02

$\begingroup$

The article "Optimal sample sizes for Welch’s test under various allocation and cost considerations" from Show-Li Jan & Gwowen Shieh published in Behavior Research Methods in December 2011 has the following code in supplementary material A, slightly modified here for my own ergonomy.

ssize.welch = function(alpha=0.05, power=0.90, mu1, mu2, sigma1, sigma2, n2n1r, use_exact=FALSE) {
  # (function body garbled in transcription; it computed the mean difference,
  #  presumably mud = mu1 - mu2, and ended by returning c(n1 = n1, n2 = n2))
}

ssize.welch(0.05, 0.9, 85, 105, 10, 20, 3)
ssize.welch(0.05, 0.9, 85, 105, 10, 20, 3, TRUE)
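Since the body above did not survive, here is a minimal sketch of what I call the "Z method" below (the use_exact=FALSE branch), i.e. the usual normal-quantile sample-size formula adapted to two variances and an allocation ratio r = n2/n1. The function name ssize.welch.z and the ceiling() rounding are my own choices, not the paper's code, and the exact (noncentral-t) variant is not reproduced here.

# Normal-approximation ("Z method") sample sizes for Welch's t test:
# two-sided level alpha, target power, allocation ratio n2n1r = n2/n1.
ssize.welch.z <- function(alpha = 0.05, power = 0.90, mu1, mu2, sigma1, sigma2, n2n1r = 1) {
  mud <- mu1 - mu2                    # mean difference to detect
  za  <- qnorm(1 - alpha/2)           # two-sided critical value
  zb  <- qnorm(power)                 # quantile for the target power
  n1  <- ceiling((sigma1^2 + sigma2^2 / n2n1r) * (za + zb)^2 / mud^2)
  c(n1 = n1, n2 = ceiling(n2n1r * n1))
}

ssize.welch.z(0.05, 0.9, 85, 105, 10, 20, 3)   # n1 = 7, n2 = 21 under this approximation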

The Z method is also the one used in https://clincalc.com/stats/samplesize.aspx, which cites Rosner B., Fundamentals of Biostatistics, 7th ed., Boston, MA: Brooks/Cole; 2011. Weirdly, that site forces you to input only one variance, but the formula it gives can use two (and it is the same as in the paper above). It is in the spirit of the usual way of computing sample sizes, but I'm not sure at what point the underlying approximations start to break down.

After some fiddling around, both methods give similar results unless you have very unbalanced groups (but in that case, at some point you might want to just approximate the super large group as a known population).

Hope this helps.

Edit: I just realized this does not exactly answer your question, which asks for power rather than sample size, but you can easily invert the Z-method formula to compute power (the exact method seems hairier; worst case, numerical trial and error should work since the relationship is monotonic).
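Concretely, inverting the Z-method formula gives the approximate power below (again a sketch of the normal approximation with a made-up function name; it tends to run a bit above an exact or simulated Welch power because it ignores the degrees of freedom):

# Approximate power of a two-sided Welch test via the normal (Z) approximation.
power.welch.z <- function(n1, n2, mu1, mu2, sigma1, sigma2, alpha = 0.05) {
  delta <- abs(mu1 - mu2)
  se    <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)   # SE of the mean difference
  pnorm(delta / se - qnorm(1 - alpha/2))
}

power.welch.z(20, 50, 0, 10, 15, 10)   # about 0.78 for the design simulated in the first answer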