# Sample Size Calculations Using epiR

### Prevalence estimation

The expected seroprevalence of brucellosis in a population of cattle is thought to be in the order of 15%. How many cattle need to be sampled and tested to be 95% certain that our seroprevalence estimate is within 20% (i.e. 0.20 $$\times$$ 0.15 = 0.03, 3%) of the true population value, assuming use of a test with perfect sensitivity and specificity? This formula requires the population size to be specified so we set N to a large number, 1,000,000:

library(epiR)
epi.sssimpleestb(N = 1E+06, Py = 0.15, epsilon.r = 0.20, se = 1, sp = 1, nfractional = FALSE, conf.level = 0.95)
#> [1] 545

A total of 545 cows are required to meet the specifications of the study.
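
As a rough cross-check, the figure can be reproduced by hand for the case of perfect test accuracy and an effectively infinite population. The sketch below uses the standard normal-approximation formula for estimating a proportion to within a given relative error. It is an assumption that this matches what `epi.sssimpleestb` does internally, and the variable names are illustrative.

```r
# Hand calculation, assuming perfect sensitivity and specificity and a
# population large enough that no finite population correction is needed.
Py <- 0.15          # expected prevalence
epsilon.r <- 0.20   # acceptable relative error
conf.level <- 0.95

z <- qnorm(1 - (1 - conf.level) / 2)   # two-sided critical value
epsilon.a <- epsilon.r * Py            # absolute error, 0.03

n.hand <- ceiling(z^2 * Py * (1 - Py) / epsilon.a^2)
n.hand
```

The hand calculation gives 545, in agreement with the package output.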

### Prospective cohort study

A prospective cohort study of dry food diets and feline lower urinary tract disease (FLUTD) in mature male cats is planned. A sample of cats will be selected at random from the population and owners who agree to participate in the study will be asked to complete a questionnaire at the time of enrolment. Cats enrolled into the study will be followed for at least 5 years to identify incident cases of FLUTD. The investigators would like to be 0.80 certain of being able to detect when the risk ratio of FLUTD is 1.4 for cats habitually fed a dry food diet, using a 0.05 significance test. Previous evidence suggests that the incidence risk of FLUTD in cats not on a dry food (i.e. ‘other’) diet is around 50 per 1000 per year. Assuming equal numbers of cats on dry food and other diets are sampled, how many cats should be sampled overall?
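
The rates implied by this scenario can be worked out before any sample size function is called. The sketch below derives the incidence in dry food cats from the stated risk ratio and, assuming a constant rate over follow-up (an assumption not stated in the text), the expected proportion of cats developing FLUTD in each group over 5 years. Variable names are illustrative.

```r
ir.other <- 50 / 1000   # incidence in cats on other diets, per cat-year
rr <- 1.4               # risk ratio the study should be able to detect
FT <- 5                 # follow-up time in years

ir.dry <- rr * ir.other # implied incidence in dry food cats: 70 per 1000 per year

# Expected cumulative incidence over follow-up under a constant rate
# (exponential model), an assumption made here for illustration only.
ci.other <- 1 - exp(-ir.other * FT)
ci.dry <- 1 - exp(-ir.dry * FT)
round(c(other = ci.other, dry = ci.dry), 3)
```

Under these assumptions about 22% of cats on other diets and about 30% of dry food cats would be expected to develop FLUTD during the study.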

epi.sscohortt(irexp1 = 50/1000, irexp0 = 70/1000, FT = 5, n = NA, power = 0.80, r = 1,
   design = 1, sided.test = 2, nfractional = FALSE, conf.level = 0.95)$n.total
#> [1] 2080

A total of 2080 subjects are required (1040 exposed and 1040 unexposed).

### Case-control study

A case-control study of the relationship between white pigmentation around the eyes and ocular squamous cell carcinoma in Hereford cattle is planned. A sample of cattle with newly diagnosed squamous cell carcinoma will be compared for white pigmentation around the eyes with a sample of controls. Assuming an equal number of cases and controls, how many study subjects are required to detect an odds ratio of 2.0 with 0.80 power using a two-sided 0.05 test? Previous surveys have shown that around 0.30 of Hereford cattle without squamous cell carcinoma have white pigmentation around the eyes.

epi.sscc(OR = 2.0, p0 = 0.30, n = NA, power = 0.80, r = 1, rho = 0,
   design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched",
   nfractional = FALSE, fleiss = FALSE)$n.total
#> [1] 282

If the true odds ratio for squamous cell carcinoma in exposed subjects relative to unexposed subjects is 2.0, we will need to enrol 141 cases and 141 controls (282 cattle in total) to reject the null hypothesis that the odds ratio equals one with probability (power) 0.80. The Type I error probability associated with the test of this null hypothesis is 0.05.
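
For readers who want to see where a figure of this size comes from, the following sketch applies a common textbook formula for comparing two independent proportions without continuity correction. Whether `epi.sscc` uses exactly this expression is an assumption, and the variable names are illustrative.

```r
OR <- 2.0; p0 <- 0.30          # odds ratio to detect, exposure prevalence in controls
alpha <- 0.05; power <- 0.80

# Exposure prevalence among cases implied by the odds ratio
p1 <- (OR * p0) / (1 + p0 * (OR - 1))

z.alpha <- qnorm(1 - alpha / 2)  # two-sided test
z.beta <- qnorm(power)
p.bar <- (p0 + p1) / 2

n.group <- ceiling(
  (z.alpha * sqrt(2 * p.bar * (1 - p.bar)) +
   z.beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0)))^2 / (p1 - p0)^2
)
c(p1 = round(p1, 3), cases = n.group, controls = n.group, total = 2 * n.group)
```

With an implied exposure prevalence of about 0.46 among cases, this formula gives 141 cattle per group and 282 in total.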

### Non-inferiority trial

Suppose a pharmaceutical company would like to conduct a clinical trial to compare the efficacy of two antimicrobial agents when administered orally to patients with skin infections. Assume the true mean cure rate of the treatment is 0.85 and the true mean cure rate of the control is 0.65. We consider a difference of less than 0.10 in cure rate to be of no clinical importance (i.e. delta = -0.10). Assuming a one-sided test size of 5% and a power of 80% how many subjects should be included in the trial?
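
The arithmetic behind a calculation of this kind can be sketched with the usual normal-approximation formula for a non-inferiority comparison of two proportions with equal allocation. It is an assumption that `epi.ssninfb` follows the same expression, and the variable names are illustrative.

```r
treat <- 0.85; control <- 0.65   # assumed true cure rates
delta <- -0.10                   # non-inferiority margin
alpha <- 0.05; power <- 0.80     # one-sided test size and power

z.alpha <- qnorm(1 - alpha)      # one-sided critical value
z.beta <- qnorm(power)

n.group <- ceiling(
  (z.alpha + z.beta)^2 *
    (treat * (1 - treat) + control * (1 - control)) /
    (treat - control - delta)^2
)
c(per.group = n.group, total = 2 * n.group)
```

Under these inputs the formula gives 25 subjects per group, 50 in total.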

epi.ssninfb(treat = 0.85, control = 0.65, delta = -0.10, n = NA,
   r = 1, power = 0.80, nfractional = FALSE, alpha = 0.05)$n.total
#> [1] 50

A total of 50 subjects need to be enrolled in the trial, 25 in the treatment group and 25 in the control group.

### Population sensitivity using a diagnostic test with imperfect specificity

We’ll continue with the brucellosis example introduced above. Imagine the test we’re using has a diagnostic sensitivity of 0.95 (as before) but this time it has a specificity of 0.98. How many herds need to be sampled to be 95% certain that the prevalence of brucellosis in dairy herds is less than the design prevalence if less than a specified number of tests return a positive result?

rsu.sssep.rsfreecalc(N = 5000, pstar = 0.05, mse.p = 0.95, msp.p = 0.95,
   se.u = 0.95, sp.u = 0.98, method = "hypergeometric", max.ss = 32000)$summary
#>     n    N c pstar         p1     se.p      sp.p
#> 1 194 5000 7  0.05 0.04898102 0.951019 0.9573939

A population sensitivity of 95% is achieved with a total sample size of 194 herds, assuming that a cut-point of 7 or more positive herds is required to return a positive survey result.
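
To see why imperfect specificity matters, it helps to compare the proportion of herds expected to test positive when disease is present at the design prevalence with the proportion expected when the population is free of disease. The sketch below does this for design prevalences of 0.05 and 0.10. It is plain arithmetic rather than a call to epiR, and the variable names are illustrative.

```r
se.u <- 0.95; sp.u <- 0.98
pstar <- c(0.05, 0.10)

# Expected test-positive proportion if disease is present at pstar:
# true positives plus false positives.
ap.diseased <- pstar * se.u + (1 - pstar) * (1 - sp.u)

# Expected test-positive proportion if the population is disease free:
# false positives only.
ap.free <- 1 - sp.u

data.frame(pstar, ap.diseased, ap.free, difference = ap.diseased - ap.free)
```

At a design prevalence of 0.05 the two proportions differ by 0.0465. At 0.10 they differ by 0.093, which makes the diseased and disease-free situations easier to tell apart with a given number of herds.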

Note the substantial increase in sample size when diagnostic specificity is imperfect (194 herds when specificity is 0.98 compared with 63 when specificity is 1.00). The relatively low design prevalence in combination with imperfect specificity means that false positives are more likely to be a problem in this population, so the number tested needs to be (substantially) increased. Increase the design prevalence to 0.10 to see its effect on estimated sample size.

rsu.sssep.rsfreecalc(N = 5000, pstar = 0.10, mse.p = 0.95,
   msp.p = 0.95, se.u = 0.95, sp.u = 0.98, method = "hypergeometric",
   max.ss = 32000)$summary
#>    n    N c pstar         p1      se.p      sp.p
#> 1 66 5000 3   0.1 0.04992274 0.9500773 0.9566218

The required sample size decreases to 66 and the cut-point to 3 positives due to: (1) the expected reduction in the number of false positives; and (2) the greater difference between true and false positive rates in the second example compared with the first.

### One-stage cluster sampling

An aid project has distributed cook stoves in a single province in a resource-poor country. At the end of three years, the donors would like to know what proportion of households are still using their donated stove. A cross-sectional study is planned where villages in a province will be sampled and all households (approximately 75 per village) will be visited to determine if the donated stove is still in use. A pilot study of the prevalence of stove usage in five villages showed that 0.46 of householders were still using their stove and the intracluster correlation coefficient (ICC) for stove use within villages is in the order of 0.20. If the donor wanted to be 95% confident that the survey estimate of stove usage was within 10% of the true population value, how many villages (clusters) need to be sampled?

epi.ssclus1estb(b = 75, Py = 0.46, epsilon.r = 0.10, rho = 0.20, conf.level = 0.95)$n.psu
#> [1] 96

A total of 96 villages need to be sampled to meet the requirements of the study.
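
The size of this estimate is driven largely by the intracluster correlation. As a hedged illustration, the sketch below inflates a simple random sample size by the usual design effect for equal-sized clusters, 1 + (b - 1) × rho, and converts the result to whole villages. That `epi.ssclus1estb` works this way internally is an assumption, and the variable names are illustrative.

```r
b <- 75; Py <- 0.46; epsilon.r <- 0.10; rho <- 0.20
z <- qnorm(0.975)

# Simple random sample size for the stated relative error
n.srs <- z^2 * (1 - Py) / (epsilon.r^2 * Py)

# Design effect for equal-sized clusters
D <- 1 + (b - 1) * rho

n.villages <- ceiling(n.srs * D / b)
c(n.srs = ceiling(n.srs), D = D, n.villages = n.villages)
```

With a design effect of 15.8, roughly 451 independent households translate into 96 villages of 75 households each.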

### One-stage cluster sampling (continued)

Continuing the example above, we are now told that the number of households per village varies. The average number of households per village is 75 with a 0.025 quantile of 40 households and a 0.975 quantile of 180. Assuming the number of households per village follows a normal distribution, the central 95% of values spans roughly four standard deviations, so the expected standard deviation of the number of households per village is in the order of (180 - 40) $$\div$$ 4 = 35. How many villages need to be sampled?
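
One common way to allow for variable cluster sizes is to inflate the design effect using the coefficient of variation (cv) of cluster size, D = 1 + ((cv^2 + 1) × b - 1) × rho. The sketch below applies that adjustment to the stove example. It is an assumption that this is the adjustment `epi.ssclus1estb` uses, and the variable names are illustrative.

```r
b.mean <- 75; b.sd <- 35
Py <- 0.46; epsilon.r <- 0.10; rho <- 0.20
z <- qnorm(0.975)

# Simple random sample size for the stated relative error
n.srs <- z^2 * (1 - Py) / (epsilon.r^2 * Py)

# Design effect allowing for variation in cluster size
cv <- b.sd / b.mean
D <- 1 + ((cv^2 + 1) * b.mean - 1) * rho

n.villages <- ceiling(n.srs * D / b.mean)
c(cv = round(cv, 3), D = round(D, 2), n.villages = n.villages)
```

Under this adjustment the design effect rises from 15.8 (equal-sized villages) to about 19.1, and the number of villages required rises to 115.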

epi.ssclus1estb(b = c(75,35), Py = 0.46, epsilon.r = 0.10, rho = 0.20, conf.level = 0.95)$n.psu
#> [1] 115

A total of 115 villages need to be sampled to meet the requirements of the study.

### Two-stage cluster sampling

This example is adapted from Bennett et al. (1991). We intend to conduct a cross-sectional study to determine the prevalence of disease X in a given country. The expected prevalence of disease is thought to be around 20%. Previous studies report an intracluster correlation coefficient for this disease to be 0.02. Suppose that we want to be 95% certain that our estimate of the prevalence of disease is within 5% of the true population value and that we intend to sample 20 individuals per cluster. How many clusters should be sampled to meet the requirements of the study?

From first principles:

n.crude <- epi.sssimpleestb(N = 1E+06, Py = 0.20, epsilon.r = 0.05 / 0.20,
   se = 1, sp = 1, nfractional = FALSE, conf.level = 0.95)
n.crude
#> [1] 246

A total of 246 subjects need to be enrolled into the study. Calculate the design effect:

rho <- 0.02; b <- 20
D <- rho * (b - 1) + 1; D
#> [1] 1.38

The design effect is 1.38. Our crude sample size estimate needs to be increased by a factor of 1.38.

n.adj <- ceiling(n.crude * D)
n.adj
#> [1] 340

After accounting for lack of independence in the data a total of 340 subjects need to be enrolled into the study. How many clusters are required?

ceiling(n.adj / b)
#> [1] 17

Do all of the above using epi.ssclus2estb:

epi.ssclus2estb(b = 20, Py = 0.20, epsilon.r = 0.05 / 0.20, rho = 0.02,
   nfractional = FALSE, conf.level = 0.95)
#> Warning: The calculated number of primary sampling units (n.psu) is 17. At
#> least 25 primary sampling units are recommended for two-stage cluster sampling
#> designs.
#> $n.psu
#> [1] 17
#>
#> $n.ssu
#> [1] 340
#>
#> $DEF
#> [1] 1.38
#>
#> $rho
#> [1] 0.02

A total of 17 clusters need to be sampled to meet the specifications of this study. epi.ssclus2estb returns a warning message that the number of clusters is less than 25.

### Two-stage cluster sampling

Continuing the brucellosis prevalence example (above), being seropositive to brucellosis is likely to cluster within herds. Otte and Gumm (1997) cite the intracluster correlation coefficient for Brucella abortus in cattle to be in the order of 0.09. Adjust your sample size estimate of 545 to account for lack of independence in the data, i.e. clustering at the herd level. Assume that b = 10 animals will be sampled per herd:

n.crude <- epi.sssimpleestb(N = 1E+06, Py = 0.15, epsilon.r = 0.20,
   se = 1, sp = 1, nfractional = FALSE, conf.level = 0.95)
n.crude
#> [1] 545

rho <- 0.09; b <- 10
D <- rho * (b - 1) + 1; D
#> [1] 1.81

n.adj <- ceiling(n.crude * D)
n.adj
#> [1] 987

Similar to the example above, we can do all of these calculations using epi.ssclus2estb:

epi.ssclus2estb(b = 10, Py = 0.15, epsilon.r = 0.20, rho = 0.09,
   nfractional = FALSE, conf.level = 0.95)
#> $n.psu
#> [1] 99
#>
#> $n.ssu
#> [1] 986
#>
#> $DEF
#> [1] 1.81
#>
#> $rho
#> [1] 0.09

After accounting for clustering at the herd level we estimate that a total of 545 $$\times$$ 1.81 = 986.45, or approximately 986, cattle need to be sampled to meet the requirements of the survey (987 if rounded up, as in the first-principles calculation). If 10 cows are sampled per herd this means that a total of 987 $$\div$$ 10 = 98.7, rounded up to 99, herds are required.
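
The first-principles steps used in the two-stage examples can be wrapped in a small helper. The function name `clusters.needed` and its arguments are hypothetical and are not part of epiR. The sketch rounds up at each step, as the hand calculations do.

```r
# Hypothetical helper, not part of epiR.
clusters.needed <- function(n.crude, rho, b) {
  D <- rho * (b - 1) + 1            # design effect
  n.adj <- ceiling(n.crude * D)     # individuals after adjusting for clustering
  n.psu <- ceiling(n.adj / b)       # clusters, rounded up
  list(DEF = D, n.ssu = n.adj, n.psu = n.psu)
}

clusters.needed(n.crude = 246, rho = 0.02, b = 20)   # disease X example
clusters.needed(n.crude = 545, rho = 0.09, b = 10)   # brucellosis example
```

Because the helper rounds up after applying the design effect, it returns 987 animals for the brucellosis example, one more than the 986 reported by `epi.ssclus2estb`. The number of herds is 99 in both cases.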