STATS SESSION 2: INFERENTIAL STATISTICS (PART 1)

Changelog

2025 Version for R, including long-format data and ANOVA extension (Frans van der Sluis)
2019 Original version for SPSS (Lorna Wildgaard, Haakon Lund, and Toine Bogers)

Topics

Paired-samples t-test
Independent samples t-test
Tidyverse/dplyr’s group_by and group-wise summaries
One-way and repeated-measures ANOVA

Introduction

The goal of this lab session is to learn how to perform Student’s t-tests using R. In this session, we will analyze the data from our experiment on typing speed (words per minute, WPM) using two different keyboards: the Opti keyboard and the QWERTY keyboard. The data are stored in long format in the file keyboard_data_R_2026.csv, which means that each row corresponds to one trial. Key variables include:

ParticipantID (unique identifier)
Keyboard (with values “OPTI” or “QWERTY”)
Trial_order (e.g., 1 to 5)
Reaction_Time
WPM

Before beginning, ensure that you have already completed Lab 1 (Descriptive Statistics) and loaded your dataset into R.

1. Paired-Samples t-Test

The data set contains data from a controlled experiment on typing speed using the Opti keyboard vs. the QWERTY keyboard. There are many different research questions we can investigate, e.g.:

RQ1: Is there a difference in words per minute (wpm) between typing on the Opti vs. QWERTY keyboard?
RQ2: Is there a difference in wpm between males and females?
RQ3: Is there a difference in typing speed between our older and younger student groups?

The experiment used a within-group design for the different keyboards. All participants were asked to complete typing tasks on both the Opti and the QWERTY keyboard. That means that we have paired observations: typing speed in words/minute on Opti and typing speed in words/minute on QWERTY. In addition, the participants’ gender (Sex) and age (Age) were also recorded.

We start with our first research question (RQ1):

Formulate the null and alternative hypotheses for answering RQ1.

Let’s first describe our data for each keyboard. We can summarize (aggregate) our data per keyboard using the ‘group_by’ and ‘summarise’ functions. The code below calculates the mean and standard deviation, together with their non-parametric alternatives (the median and the median absolute deviation, MAD), for each group (Keyboard):

keyboard_data %>%
  group_by(Keyboard) %>%
  summarise(
    Mean = mean(WPM, na.rm = TRUE),
    Std = sd(WPM, na.rm = TRUE),
    Median = median(WPM, na.rm = TRUE),
    MAD = mad(WPM, na.rm = TRUE)
  )

Take a look at the output.

Do you think we will be able to confirm Ha?
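
A side note on the MAD column in the summary above: R's mad() does not return the plain median absolute deviation. By default it multiplies it by a constant (1.4826) so that the result is comparable to a standard deviation when the data are normally distributed. The sketch below uses a small made-up vector (not the lab data) to show the difference.

```r
# Made-up WPM values for illustration only (not the lab data)
wpm_example <- c(20, 22, 25, 27, 60)

raw_mad    <- mad(wpm_example, constant = 1)  # plain median absolute deviation
scaled_mad <- mad(wpm_example)                # default: multiplied by 1.4826

raw_mad
scaled_mad
sd(wpm_example)  # for comparison: the SD is pulled up by the outlier (60)
```

Either version is fine for describing spread, as long as you report which one you used.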

Since every participant completes trials on both keyboards, we have repeated measures for each participant. This means each participant contributes paired observations (one for the OPTI condition and one for the QWERTY condition).

To perform a paired t-test, we first rearrange the data to ensure that each row represents a matched pair of trials for the same participant and trial order. Then, we pivot the data into a wide format, so that each participant’s WPM on OPTI and QWERTY appear in separate columns:

library(tidyverse)
wpm_trial <- keyboard_data %>%
  select(ParticipantID, Trial_order, Keyboard, WPM) %>%
  arrange(ParticipantID, Trial_order, Keyboard) %>% # Sort data so pairs match correctly
  pivot_wider(names_from = Keyboard, values_from = WPM) # Pivot to wide format to match trial pairs

t.test(wpm_trial$OPTI, wpm_trial$QWERTY, paired = TRUE)

The reshaped data frame has two data columns, OPTI and QWERTY (both WPM measures). Open it in RStudio to inspect. The t.test() function compares these columns, and paired = TRUE tells R that each pair of values (OPTI & QWERTY) belongs to the same participant-trial combination, so differences within pairs are analyzed rather than differences between two independent groups of values.
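
To see what "analyzing differences within pairs" means in practice, the sketch below shows that a paired t-test gives the same result as a one-sample t-test on the within-pair differences. The vectors opti and qwerty are made-up values for illustration, not the lab data.

```r
# Made-up paired WPM values for illustration only (not the lab data)
opti   <- c(31, 28, 35, 40, 26, 33)
qwerty <- c(33, 27, 41, 44, 25, 38)

# Paired t-test on the two columns
paired_result <- t.test(opti, qwerty, paired = TRUE)

# Equivalent: one-sample t-test on the within-pair differences against 0
diff_result <- t.test(opti - qwerty, mu = 0)

paired_result$statistic
diff_result$statistic   # same t-value
paired_result$parameter # df = number of pairs - 1
```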

A note on sample normality and t-tests: The t-test doesn’t assume that the individual data points (our sample) are normally distributed. It assumes that the means of samples taken from the population are normally distributed (for a paired test, the means of the within-pair differences). This is known as the sampling distribution of the mean. Even though our sample shows positive skew, if our sample size is large enough (a common rule of thumb is 20–30 or more), the central limit theorem tells us that the distribution of the sample means will be approximately normal. Thus, the t-test remains approximately valid despite our sample data’s skewness.
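
The central limit theorem can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses an exponential distribution as a stand-in for a positively skewed population (this is not the lab data), a sample size of 30, and a hand-written skewness helper, since base R does not provide one.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Sample skewness, written out by hand because base R has no skewness() function
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# A right-skewed stand-in "population": exponential with mean 1 (an assumption)
individual_values <- rexp(5000, rate = 1)

# Draw 5000 samples of size 30 and keep the mean of each sample
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))

skewness(individual_values)  # clearly positive: raw values are skewed
skewness(sample_means)       # much closer to 0: means are roughly symmetric

# Optional visual check
par(mfrow = c(1, 2))
hist(individual_values, main = "Individual values", xlab = "Value")
hist(sample_means, main = "Sample means (n = 30)", xlab = "Mean")
```

Try lowering the sample size (e.g., to 5) to see the skew reappear in the distribution of the means.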

Report the results of a paired-samples t-test for the other OPTI vs QWERTY variables in APA style.

When reporting your t-test in APA style, include:

Means and standard deviations for each condition.
The t-value with its degrees of freedom, the p-value, and optionally an effect size.

For example: “A paired-samples t-test indicated that WPM scores were significantly higher in the OPTI condition (M = 80.1, SD = 12.3) than in the QWERTY condition (M = 75.4, SD = 10.8), t(29) = 2.35, p = .026.” (Note: these numbers are made up for illustration.)
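
The numbers needed for an APA-style report can be pulled directly from the object that t.test() returns. The sketch below uses made-up paired values (not the lab data). The effect size shown, the mean difference divided by the standard deviation of the differences, is one common choice for paired data; other variants of Cohen's d exist.

```r
# Made-up paired WPM values for illustration only (not the lab data)
opti   <- c(31, 28, 35, 40, 26, 33)
qwerty <- c(33, 27, 41, 44, 25, 38)

res <- t.test(opti, qwerty, paired = TRUE)

t_value <- unname(res$statistic)  # t statistic
df      <- unname(res$parameter)  # degrees of freedom
p_value <- res$p.value            # two-sided p-value

# One common effect size for paired data: mean difference / SD of differences
d <- mean(opti - qwerty) / sd(opti - qwerty)

# Means and SDs per condition, for the descriptive part of the report
c(M_opti = mean(opti), SD_opti = sd(opti),
  M_qwerty = mean(qwerty), SD_qwerty = sd(qwerty))

sprintf("t(%d) = %.2f, p = %.3f, d = %.2f", as.integer(df), t_value, p_value, d)
```

Note that APA style drops the leading zero of the p-value (p = .026 rather than p = 0.026) and reports very small values as p < .001, so the formatted string still needs a manual touch.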

Tip: Negative t-values: The sign of a t-value tells us the direction of the difference in sample means, which can be difficult to interpret without further explanation: Does a negative t-value indicate that Opti’s sample mean was greater or smaller than QWERTY’s? Therefore, it is common to indicate the direction of the mean difference (even if nonsignificant) in some other way, such as by mentioning the sample means in the text, or by showing the sample means graphically, as in a bar chart.
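
The sign of t depends only on the order in which the two conditions are passed to t.test(): the second is subtracted from the first. A short sketch with made-up values (not the lab data):

```r
# Made-up paired WPM values for illustration only (not the lab data)
opti   <- c(31, 28, 35, 40, 26, 33)
qwerty <- c(33, 27, 41, 44, 25, 38)

t_opti_first   <- unname(t.test(opti, qwerty, paired = TRUE)$statistic)
t_qwerty_first <- unname(t.test(qwerty, opti, paired = TRUE)$statistic)

t_opti_first    # negative: the OPTI mean is below the QWERTY mean here
t_qwerty_first  # same size, opposite sign

c(mean_opti = mean(opti), mean_qwerty = mean(qwerty))
```

The p-value is identical in both cases, which is why reporting the means alongside t removes the ambiguity.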

(OPTIONAL / BONUS) We have now run a paired t-test using trial-level data. Now, let’s perform a similar test by summarizing the data at the participant level. This involves using tidyverse/dplyr’s group_by (grouping variables: ParticipantID, Keyboard) and summarize functions (to extract the mean of WPM). For a tutorial on group_by, see here

2. Paired-Samples t-Test on Subgroups

What we’ve done in answering RQ1 is group all participants together. But the fact that we found a statistically significant difference does not mean that this difference also exists within our subgroups, e.g., is there also a significant difference between typing on the OPTI keyboard vs. QWERTY for our two age groups, younger and older students? To be able to answer this question, we need to temporarily de-select all older participants in our data set.

The first task is to recode your Age variable into two groups that fit your data and create an Age_group variable, dividing the respondents into a “younger” and “older” group:

Check the frequency distribution of your Age variable (see Stats1, “Frequency distribution table”) to understand its distribution.

Use R to calculate the median of Age:

age_median <- median(keyboard_data$Age, na.rm = TRUE)
print(age_median)

Based on the frequency table and the median value, decide on a cutoff that divides your sample into two roughly equal groups. In many cases, the median is a good choice.
Using the method shown in Stats1 (“Re-coding variables”), create a new variable (Age_group) that assigns participants to “younger” (if Age is less than or equal to the median) or “older” (if Age is above the median).
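
A median split of this kind can be sketched as follows. The ages below are made up, and the use of ifelse() is an assumption for illustration; the recoding method from Stats1 may look different and is equally valid.

```r
# Made-up ages for illustration only (not the lab data)
ages <- c(19, 21, 22, 22, 24, 27, 31, 35)

age_cutoff <- median(ages)

# "younger" if at or below the cutoff, "older" otherwise
age_group <- ifelse(ages <= age_cutoff, "younger", "older")

age_cutoff
table(age_group)  # check that the two groups are roughly equal in size
```

If many participants share the median age, the groups can end up unbalanced; the frequency table tells you whether a slightly different cutoff works better.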

What level of measurement (nominal, ordinal, interval, ratio) is the Age_group variable measured at?

Second, we need to filter our dataframe to the subgroup of younger students. In Stats1 we filtered twice: on a specific participant (ID: 5193237) and on a specific keyboard (OPTI). Use the same filter command to now filter on younger students.

Is there a statistically significant difference in wpm when typing on the OPTI vs QWERTY keyboard among our younger group of students (at α = 0.05)? Formulate the null/alternative hypotheses, perform the paired-samples t-test, and report the results APA-style.

3. Independent-Samples t-Test

Now that we have answered our first research question, we can move on to our second:

RQ2: Is there a difference in wpm between males and females?

If we want to investigate differences between the genders, then we suddenly have a between-group design on our hands: you cannot be both female and male at the same time. That means that the genders form two independent groups, which in turn means that we have to use an independent-samples t-test.

To compare participants, we first need to aggregate our data at the participant level by calculating the mean WPM for each participant:

mean_wpm <- keyboard_data %>%
  group_by(ParticipantID,Sex) %>%
  summarize(Mean_WPM = mean(WPM, na.rm = TRUE), .groups = "drop")
head(mean_wpm)

Using this aggregated dataframe we can perform an independent-samples t-test. You can use the same function as before (t.test), though without the ‘paired = TRUE’ argument. You will also need to change how the data are passed: instead of passing the OPTI and QWERTY columns, use a formula that compares Mean_WPM by Sex (Mean_WPM ~ Sex, with data = mean_wpm).

Investigate RQ2. Formulate the null/alternative hypotheses, use the appropriate chart to visualize the differences between the female and male participants, perform the independent-samples t-test, and report the results for RQ2 (with α = 0.05).

Tip: For an appropriate graph, consult “Population Pyramids (Alternative: Faceted Bar Chart in R)” from Stats1. Use the mean_wpm dataframe, take Mean_WPM as x-axis (in the aes definition), ~ Sex for facet_wrap.

Levene’s test

The standard (Student’s) independent-samples t-test assumes that the variances within both groups are equal. This is called the ‘equality of variances assumption’. To test this, you can run Levene’s test. If the p-value is above 0.05, you can assume equal variances and use the standard t-test; if it is below 0.05, use Welch’s t-test instead, which does not assume equal variances.

library(car)
leveneTest(Mean_WPM ~ Sex, data = mean_wpm)

Run the code. Can we assume equal variances between males and females?

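If Levene’s test is significant, revisit the previous question and make sure your t.test call uses var.equal = FALSE (Welch’s t-test). Note that this is already the default in R’s t.test; the standard Student’s t-test, which assumes equal variances, requires var.equal = TRUE.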
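
The effect of the var.equal argument can be seen in a small sketch. The data frame example_means and its values are made up for illustration (not the lab data); the column names simply mirror those of mean_wpm.

```r
# Made-up participant-level means for illustration only (not the lab data)
example_means <- data.frame(
  Sex      = rep(c("Female", "Male"), each = 5),
  Mean_WPM = c(34, 38, 41, 36, 39, 30, 45, 28, 50, 33)
)

# Default call: Welch's t-test (does not assume equal variances)
welch <- t.test(Mean_WPM ~ Sex, data = example_means)

# Student's t-test: assumes equal variances
student <- t.test(Mean_WPM ~ Sex, data = example_means, var.equal = TRUE)

welch$method
student$method
welch$parameter    # Welch's df is adjusted downward and usually fractional
student$parameter  # Student's df is n1 + n2 - 2
```

When the group variances are similar, both tests give nearly the same result, so using the Welch default costs very little.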

Differences Between Age Groups

We also still need to answer our third and last research question:

RQ3: Is there a difference in typing speed between the older and younger student groups?

Answer RQ3. Formulate the null/alternative hypotheses, explain whether they involve between-group or within-group designs, use the appropriate chart to visualize the differences between the old and young participants, perform the appropriate t-tests, and report the results.

Beyond t-tests: When to use ANOVA

So far, we have used t-tests to compare two groups (e.g., WPM between OPTI and QWERTY keyboards). However, when we have more than two categories, a t-test is no longer suitable. Instead, we use Analysis of Variance (ANOVA):

A t-test compares the means between two groups (e.g., Keyboard: OPTI vs. QWERTY).
ANOVA compares means across three or more groups (e.g., MessagesCategory: “10 or less”, “11 to 50”, “More than 50”).
ANOVA tests whether at least one group mean is different but does not tell us which groups differ; for that, we need post-hoc tests (e.g., Tukey’s HSD).

Case 1: Repeating a t-test as an ANOVA

Previously, we used a t-test to compare WPM between OPTI and QWERTY keyboards. Here, we perform a one-way ANOVA to achieve a similar result, using again our full data set (keyboard_data):

anova_keyboard <- aov(WPM ~ Keyboard, data = keyboard_data)
summary(anova_keyboard)

If the ANOVA is significant (p < 0.05), follow up with a post-hoc test (e.g., Tukey’s HSD or Bonferroni) to find which Keyboard is faster:

TukeyHSD(anova_keyboard)

Question:

Compare the p-value of the ANOVA test to the t-test: Which one is more conservative?

Case 2: Repeated measures ANOVA

In Case 1, the one-way ANOVA was more conservative than the paired t-test because it treated Keyboard as a between-subjects variable, assuming that each WPM value came from different participants. However, our study is within-subjects: each participant types on both keyboards across multiple trials.
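
That a one-way ANOVA treats the two keyboards as independent groups can be verified directly: with two groups, its F-value equals the square of the t-value from an independent-samples (Student’s) t-test on the same data. The sketch below uses a made-up long-format data frame (not the lab data).

```r
# Made-up long-format data for illustration only (not the lab data)
example_long <- data.frame(
  Keyboard = rep(c("OPTI", "QWERTY"), each = 6),
  WPM      = c(31, 28, 35, 40, 26, 33, 33, 27, 41, 44, 25, 38)
)

# One-way ANOVA, as in Case 1
anova_fit <- aov(WPM ~ Keyboard, data = example_long)
f_value   <- summary(anova_fit)[[1]][["F value"]][1]

# Independent-samples (Student's) t-test, which ignores the pairing
student_fit <- t.test(WPM ~ Keyboard, data = example_long, var.equal = TRUE)
t_value     <- unname(student_fit$statistic)

c(F = f_value, t_squared = t_value^2)  # the two values match
```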

To correctly account for this, we use a repeated-measures ANOVA, which separates between-participant variance from within-participant variance, just like a paired t-test does.

# Convert the design variables to factors (categoricals)
keyboard_data <- keyboard_data %>%
  mutate(
    ParticipantID = factor(ParticipantID),
    Keyboard = factor(Keyboard),
    Trial_order = factor(Trial_order)  # factor, so each trial gets its own mean (see note below)
  )

# Fit the repeated-measures ANOVA model
anova_model <- aov(WPM ~ Keyboard * Trial_order + Error(ParticipantID/(Keyboard*Trial_order)), data = keyboard_data)
summary(anova_model)

Explanation of the Formula Notation:

Part 1: WPM ~ Keyboard * Trial_order. This follows standard formula notation in R: WPM is the dependent variable, Keyboard * Trial_order specifies that we want to test for main effects of Keyboard and Trial_order, as well as their interaction (Keyboard:Trial_order).
Part 2: Error(ParticipantID/(Keyboard * Trial_order)) specifies ParticipantID as the subject identifier, with repeated measures across the combinations of Keyboard and Trial_order. It tells R how to account for within-subject variability (i.e., repeated measures).
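
To see what the Error() term does, consider a simplified case with a single within-subject factor (Keyboard) and one observation per participant and keyboard. In that case the repeated-measures ANOVA reproduces the paired t-test: its F-value equals the squared t-value. The sketch below uses made-up data (not the lab data) and a simpler formula than the one above.

```r
# Made-up long-format data for illustration only (not the lab data);
# rows are ordered so that participants 1-6 line up across keyboards
example_long <- data.frame(
  ParticipantID = factor(rep(1:6, times = 2)),
  Keyboard      = factor(rep(c("OPTI", "QWERTY"), each = 6)),
  WPM           = c(31, 28, 35, 40, 26, 33, 33, 27, 41, 44, 25, 38)
)

# Repeated-measures ANOVA with Keyboard as the only within-subject factor
rm_fit     <- aov(WPM ~ Keyboard + Error(ParticipantID / Keyboard),
                  data = example_long)
rm_summary <- summary(rm_fit)
f_value    <- rm_summary[["Error: ParticipantID:Keyboard"]][[1]][["F value"]][1]

# Paired t-test on the same values
opti    <- example_long$WPM[example_long$Keyboard == "OPTI"]
qwerty  <- example_long$WPM[example_long$Keyboard == "QWERTY"]
t_value <- unname(t.test(opti, qwerty, paired = TRUE)$statistic)

c(F = f_value, t_squared = t_value^2)  # the two values match
```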

Note: The conversion to factors is important here! If Trial_order is treated as an interval (continuous) variable, the ANOVA evaluates whether WPM changes linearly across trials. It assumes that the effect follows a consistent upward/downward trend. As a factor, the ANOVA tests for WPM differences between the individual trials. This also captures non-linear patterns (e.g., rapid learning, then plateau).
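
The difference between the numeric and the factor coding shows up in the degrees of freedom of the Trial_order effect. The sketch below uses made-up values with a "rapid learning, then plateau" pattern (not the lab data) and, to keep it short, a plain one-way model without the Error() term.

```r
# Made-up trial data for illustration only (not the lab data):
# WPM rises quickly over the first trials and then levels off
example_trials <- data.frame(
  Trial_order = rep(1:5, each = 4),
  WPM = c(20, 22, 21, 23,  28, 30, 29, 31,  32, 33, 31, 34,
          33, 34, 32, 35,  33, 35, 34, 34)
)

# Numeric coding: a single straight-line trend across trials
fit_numeric <- aov(WPM ~ Trial_order, data = example_trials)

# Factor coding: a separate mean for each trial
fit_factor <- aov(WPM ~ factor(Trial_order), data = example_trials)

df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]  # 1 (one slope)
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]   # 4 (five trials - 1)

c(numeric = df_numeric, factor = df_factor)
```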

The resulting ANOVA table provides three tests:

Keyboard Effect: Tests if WPM differs between OPTI and QWERTY keyboards.
Trial_order Effect: Tests if WPM changes across trials (learning effects).
Keyboard × Trial_order Interaction: Tests if the effect of Keyboard depends on Trial_order (e.g., does one keyboard improve more with practice?).

If the interaction is significant, it suggests that one keyboard may have benefited more from repeated trials than the other, indicating a more pronounced learning (or fatigue) effect for one of the keyboards.

Question:

Compare the p-value of the repeated-measures ANOVA test to the t-test: Which one is more conservative now? And which one is more detailed?
Interpret / try to understand the results in your own words. What do all these main and interaction effects mean?

Case 3: Comparing WPM Across Three Message Categories

As indicated, an ANOVA can also be used to compare more than two categories. We can test this using our MessagesCategory variable (you created this variable in Stats1). Re-run your recoding code if needed:

library(dplyr)
keyboard_data <- keyboard_data %>%
  mutate(MessagesCategory = case_when(
    Messages_per_day <= 10 ~ "10 or less",
    Messages_per_day >= 11 & Messages_per_day <= 50 ~ "11 to 50",
    Messages_per_day > 50 ~ "More than 50"
  ))

Question:

Run a one-way ANOVA to test whether WPM differs across MessagesCategory. If significant, conduct a post-hoc test (Tukey’s HSD) to determine which categories differ. Interpret the results: Is there a difference in typing speed based on how frequently people send messages?

Work through the exercises, compare your results with the examples provided, and discuss any discrepancies with your peers and instructors.

Happy analyzing!

Lab & Stats Weeks

Frans van der Sluis (2026 version)

2026-03-20