The data comes from the Harvard Project Implicit and includes a sample of scores on the Disability Implicit Association Test (IAT) for 124265 participants. I should give a decent starting point for some power calculations. I focus on this outcome because it is the most risky. For other outcomes (games in experiments), effect sizes are generally quite large.
First, I plot the histograms for the IAT score of participants below 18. The red line indicates the mean. A good chunk of the distribution lies above the severe bias threshold.
df %<>%
mutate(
iat_all = D_biep.Abled_Good_all,
age = year - birthyear
) %>%
filter(!is.na(age))
# Cutoff points to classify the degree of bias
thresholds <- tibble(points = c(-0.15, 0.15, 0.33))
text <- tibble(
points = c(-0.5, -0.15, 0.15, 0.5),
text = c("Pro bias", "No bias", "Some bias", "Severe bias")
)
mean_u18 <- df %>% filter(age <= 18) %$% mean(iat_all, na.rm = T)
df %>% filter(age <= 18) %>%
ggplot(aes(iat_all)) +
geom_histogram(alpha = 0.2, position="identity") +
geom_vline(data = thresholds, aes(xintercept = points)) +
geom_text(
data = text,
mapping = aes(x = points + 0.1, label = text, y = 0),
angle = 90, hjust = 0) +
geom_vline(
xintercept = mean_u18, linetype="dotted",
color = "red", size=1.5) +
scale_x_continuous(breaks = seq.int(-1.95, 1.95, 0.3))
The mean and standard deviation of the IAT score by age group (below and above 18) are reported in the table bellow.
df_sum <- df %>% mutate(`Age under 18` = ifelse(age <= 18, 1, 0)) %>%
group_by(`Age under 18`) %>%
summarise(
Mean = mean(iat_all, na.rm = T),
SD = sd(iat_all, na.rm = T)
)
# Mean and SD for the under 18 groups
mean <- df_sum[['Mean']][[2]]
sd <- df_sum[['SD']][[2]]
df_sum %>% mutate_all(format, digits = 3,nsmall = 0) %>%
kable(format = "html",
caption = "Implicit Association Score Summary statistics by group") %>%
kable_styling()
| Age under 18 | Mean | SD |
|---|---|---|
| 0 | 0.538 | 0.458 |
| 1 | 0.421 | 0.468 |
The bias increases slightly with age. Would it be more interesting to focus on older children (higher bias), or younger? For short term effects, I guess older would be preferable. The variance instead appears to be decreasing with age.
df %>% filter(age < 50) %>%
ggplot(aes(age, iat_all)) +
geom_point(alpha = 0.01) +
geom_smooth()
The table below reports the minimum detectable effect (MDE), and the MDE size (MDES) for sample sizes from 100 to 200 (per treatment group), for two-tail hypothesis test with significance level 5% and 80% power.
n <- seq(100, 200, 25)
# MDES for different sample sizes
MDES <- lapply(n, pwr.t.test, sig.level = 0.05, power = 0.8, type = c("two.sample")) %>%
sapply(., `[[`, "d")
# MDE
MDE <- MDES * sd
# Share of the mean
sh_mean <- MDE/mean
# Create table
tibble(n, MDE, MDES, sh_mean) %>%
rename(`N per group` = n, `Share of mean` = sh_mean) %>%
mutate_all(format, digits = 3,nsmall = 0) %>%
kable() %>% kable_styling()
| N per group | MDE | MDES | Share of mean |
|---|---|---|---|
| 100 | 0.186 | 0.398 | 0.443 |
| 125 | 0.167 | 0.356 | 0.395 |
| 150 | 0.152 | 0.325 | 0.361 |
| 175 | 0.141 | 0.300 | 0.334 |
| 200 | 0.131 | 0.281 | 0.312 |
One-tail test
# MDES for different sample sizes
MDES <- lapply(n, pwr.t.test, sig.level = 0.05, power = 0.8, type = c("two.sample"), alternative = "less") %>%
sapply(., `[[`, "d")
# MDE
MDE <- MDES * sd
# Share of the mean
sh_mean <- MDE/mean
# Create table
tibble(n, MDE, MDES, sh_mean) %>%
rename(`N per group` = n, `Share of mean` = sh_mean) %>%
mutate_all(format, digits = 3,nsmall = 0) %>%
kable() %>% kable_styling()
| N per group | MDE | MDES | Share of mean |
|---|---|---|---|
| 100 | -0.165 | -0.353 | -0.392 |
| 125 | -0.148 | -0.315 | -0.351 |
| 150 | -0.135 | -0.288 | -0.320 |
| 175 | -0.125 | -0.266 | -0.296 |
| 200 | -0.117 | -0.249 | -0.277 |