STATS SESSION 1: DESCRIPTIVE STATISTICS

License: CC BY-SA 4.0. Available on GitHub.

Changelog

  • 2025 Version for R (Mara Günther, Frans van der Sluis)
  • 2019 Original version for SPSS (Lorna Wildgaard, Haakon Lund, and Toine Bogers)

Topics

  • Frequency tables
  • Contingency tables
  • Visualisations
  • Descriptive statistics

Getting started

You can use R with RStudio either locally or online via ERDA. Instructions for both follow:

Getting started with ERDA

  1. Open and log in to ERDA with your KU mail
  2. Click on “jupyter”, then “Start DAG”
  3. JupyterLab will now launch.
  4. When JupyterLab has launched, use the file browser to upload our data files: questionnaire_data_R_2026.xlsx and keyboard_data_R_2026.csv. One of the icons there is ‘Upload files’.
  5. Now, in JupyterLab, find and click on RStudio.
  6. Start a new project with a new working directory
  7. Start a new RScript and save it in the directory

Note: Remember to work in an RScript, not in the console!

Getting ready in RStudio

First, install R and RStudio following this link.

Once R and RStudio have been installed, the next step is to install and load the libraries we will be using this session. You can read more about libraries here. The install.packages command installs a package; you only need to run this once (until you re-install or upgrade R). The library command needs to run every time you start a new R session.

  • Here is an example for readxl, a library needed to read Excel files:
install.packages("readxl")
library(readxl)
  1. After installing and loading this library, try to install and load the following libraries:
  • tidyverse -> a collection of R packages designed for data science and useful for quantitative data analysis
  • moments -> for additional descriptive statistics such as skewness and kurtosis

Questionnaire data

Now, let’s take a first look at our data. Place the file “questionnaire_data_R_2026.xlsx” in the directory you are currently working in:

questionnaire_data <- read_excel("questionnaire_data_R_2026.xlsx")
glimpse(questionnaire_data)

The function glimpse is designed to give you an overview of the data set. It gives basic info on the data set and allows a peek at the various columns, the types of variables, and the values they hold.

You should now also see the data in the Environment tab. In RStudio, click on it and see what happens.

Also, from here on: Open a new file (File -> New File -> R Script) and put the working (!) code in there. You can run code line-by-line from this file by typing Cmd+Enter or Ctrl+Enter depending on your platform. Consider this like a notebook where you keep track of all the commands that worked, allowing you to easily rerun them whenever you need to.

Missing data

Missing values in R are represented by NA. We can use the is.na() function to find missing values:

is.na(questionnaire_data)

Try it out: Do you spot any missing values in our data set?

The is.na() function returns a matrix of booleans (TRUE or FALSE). For each value in our dataframe, it tells us whether that value is an NA. In combination with the sum() function, which counts each TRUE as 1, you can count the missing values using the following code.

sum(is.na(questionnaire_data))
  1. How many missing values are there in total in our data set?
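To see how sum() and the related colSums() treat the logical matrix returned by is.na(), here is a self-contained sketch on a toy data frame (made up for illustration, not our questionnaire data):

```r
# Toy data frame with two missing values
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

sum(is.na(df))      # total number of NAs in the whole data frame: 2
colSums(is.na(df))  # NAs per column: a = 1, b = 1
```

colSums is handy when you want to know which columns the missing values are hiding in, not just how many there are.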

Frequency distribution tables

One way of displaying how the values of a particular variable are distributed is to construct a frequency distribution.

  • In R, we can use count whenever we want to count the number of observations in a data set.
  • Let’s take a look at how many native English speakers there are in our data set:
questionnaire_data %>% count(English_as_first_language)
  1. Try the same for the ‘Device’ variable in our data.
  • The %>% (pipe) operator comes with the tidyverse, a very powerful collection of packages for summarizing and manipulating data. TODO: add link for more information.
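To see what count actually returns, here is a toy sketch on made-up data (the column name Device and its values are only for illustration):

```r
library(dplyr)

# Toy data: four made-up observations
df <- data.frame(Device = c("Phone", "Laptop", "Phone", "Phone"))

df %>% count(Device)               # frequency table, alphabetical order
df %>% count(Device, sort = TRUE)  # largest group first
```

The result is itself a small data frame with one row per distinct value and a column n holding the frequencies.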

Bar charts

We have already seen how we can let R count and generate a frequency table, but some information can be grasped more quickly in graphic form than in a table of numbers. So now, we will try to make a basic bar chart for the Sex variable (labelled ‘Gender’ in the chart).

For this we can use a graphics package that is already part of the tidyverse: ggplot2.

This package has a distinct grammar that needs to be used to create graphs:

  • data: the dataset containing the variables of interest.

  • geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.

  • aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.

With this in mind, let’s create our very first bar chart:

ggplot(data = questionnaire_data, aes(x = Sex, fill = Sex)) +
  geom_bar() + 
  ggtitle("Gender Distribution") +
  labs(x = "Gender")
Gender Distribution

As you can see in the bar chart, our code also adds a title to the chart and a label to the x-axis.

Visualisations like these appear in the Plots pane at the bottom right of RStudio. From there you can easily save images to your computer by clicking on Export.

If you want to play around with the bar chart and change its design, look here.

Re-coding variables

  1. Try to make a frequency table for Messages_per_day using count.

This does not really reveal any trends in our data. Instead, let’s re-code and re-group this variable.

For this we can use the tidyverse/dplyr function mutate. This function creates a new column in our data set called MessagesCategory.

questionnaire_data <- questionnaire_data %>%
  mutate(MessagesCategory = case_when(
    Messages_per_day <= 10 ~ "10 or less",
    Messages_per_day >= 11 & Messages_per_day <= 50 ~ "11 to 50",
    Messages_per_day > 50 ~ "More than 50"
  ))
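One detail worth knowing about case_when: any row that matches none of the conditions (including rows where the input itself is NA) gets NA in the new column. A minimal sketch with made-up values:

```r
library(dplyr)

x <- c(5, 30, 80, NA)
case_when(
  x <= 10 ~ "10 or less",
  x <= 50 ~ "11 to 50",
  x > 50  ~ "More than 50"
)
# "10 or less" "11 to 50" "More than 50" NA
```

The conditions are checked in order, so the first matching rule wins; that is why `x <= 50` can safely follow `x <= 10`.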

With this new variable we can even make a bar chart:

ggplot(data = questionnaire_data, aes(x = MessagesCategory)) +
  geom_bar(fill = "steelblue") + 
  labs(x = "Messages per day")
Messages per day

But what if you only want to look at one participant?

For this R has a simple solution: the filter function. filter keeps or discards rows in our data set based on a condition.

Let’s take a look at our first participant with the unique ID 5193237.

ggplot(data = questionnaire_data %>% filter(ParticipantID == 5193237), aes(x = MessagesCategory)) +
  geom_bar(fill = "steelblue") + 
  labs(x = "Messages per day")
Messages per day for Participant 5193237

  1. Can you repeat this step with another participant?

Line chart

You can use R to create various other charts and data visualisations. We can, for example, make a line chart for messages per day in relation to the number of respondents.

For this we need to count the messages per day and then create a graph using ggplot. Pay attention to the new part, geom_line, which is used for line charts.

questionnaire_data %>%
  count(Messages_per_day) %>%
  ggplot(aes(x = Messages_per_day, y = n)) +
  geom_line(color = "black") +
  labs(x = "Messages per day", y = "Number of respondents") +
  ggtitle("Messages per day")
Line Chart: Messages per day

  1. Do you think line graphs are better at visualizing some types of variables over others?

Pie charts

Pie charts are another common way of presenting nominal and ordinal data graphically.

ggplot2 does not have a specific geometric function to build pie charts. The key is to go back to geom_bar() and add the polar coordinate function coord_polar() to make the graph circular.

You can read more about this here.

ggplot(data = questionnaire_data, aes(x = "", fill = MessagesCategory)) + 
  geom_bar(stat = "count", width = 1) + 
  coord_polar("y") +
  theme_void()
Pie Chart: Messages Category

Histogram

To draw a histogram in ggplot2, we use the geometric function geom_histogram. With this we can, for example, create a histogram for the age distribution in our data.

ggplot(data = questionnaire_data, aes(x = Age))+
  geom_histogram(binwidth = 5, color = "black", fill = "lightgreen") +
  ggtitle("Age Distribution")
Age Distribution

Population Pyramids (Alternative: Faceted Bar Chart in R)

A population pyramid is a specialized type of histogram that splits a distribution into two groups and mirrors them. While creating an exact population pyramid in R is complex (see here for an explanation), we can achieve a similar effect by using faceted histograms to compare distributions side by side.

In the previous exercise, you created a histogram for Age. Now, we will split the distribution by gender by adding facet_wrap(~ Sex) to our plot definition. This function creates separate plots for each category of Sex, aligning their x and y axes for easy comparison:

ggplot(data = questionnaire_data, aes(x = Age))+
  facet_wrap(~ Sex) +
  geom_histogram(binwidth = 5, color = "black", fill = "lightgreen") +
  ggtitle("Faceted Age Distribution")
Faceted Age Distribution

Let’s expand our plot definition a bit further:
  1. Modify the plot and observe the differences. Try one-by-one:
  • Add fill = Sex as second parameter inside aes() and remove fill as geom_histogram parameter.
  • Set scales = "free_y" as second parameter in facet_wrap(), after the formula (~ Sex).
  • Change the bin width (e.g., 2 or 10).
  1. Next, examine whether there is 1) a gender difference in the number of messages sent per day, and 2) a gender difference in the number of hours spent on this in our data. Generate the population pyramids (well, faceted histograms) for these two comparisons. Change the Age variable in the x = Age parameter of aes to your variable of interest. Play around with the binwidth (and other parameters) of the graphs. Is there visual evidence for a gender gap?

Contingency tables

Contingency tables show frequencies for particular combinations of values of two discrete random variables X and Y. In our dataset that could be Sex and MessagesCategory.

The count function can create this for us.

questionnaire_data %>% count(Sex, MessagesCategory)
  1. Take a look at the output. What can we learn about gender and message category?
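Note that count returns the combinations in long form, with one row per combination. If you prefer a classic cross-tab layout with one variable in rows and one in columns, base R’s table() function does this; a toy sketch on made-up data:

```r
# Made-up data for illustration
df <- data.frame(
  Sex      = c("F", "F", "M", "M", "M"),
  Category = c("Low", "High", "Low", "Low", "High")
)

table(df$Sex, df$Category)
#     High Low
#   F    1   1
#   M    1   2
```

Each cell holds the number of rows with that particular combination of values.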

Measures of central tendency and dispersion

Calculating measures of central tendency and dispersion using R is relatively easy.

With the following code, you can for example calculate the mean, median, mode, standard deviation, minimum and maximum values for messages per day:

# Central tendency: mean, median, and mode
mean(questionnaire_data$Messages_per_day, na.rm = TRUE)
median(questionnaire_data$Messages_per_day, na.rm = TRUE)
# Mode: tabulate the values, sort by frequency, and take the most frequent one
as.numeric(names(sort(table(questionnaire_data$Messages_per_day), decreasing = TRUE)[1]))

# Dispersion: standard deviation, minimum, and maximum
sd(questionnaire_data$Messages_per_day, na.rm = TRUE)
min(questionnaire_data$Messages_per_day, na.rm = TRUE)
max(questionnaire_data$Messages_per_day, na.rm = TRUE)
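R has no built-in function for the statistical mode; the long line above tabulates the values, sorts the counts, and takes the name of the most frequent value. The same idea as a small reusable helper (the function name get_mode is our own, not a standard R function):

```r
# Statistical mode: the most frequent value (first one wins in case of a tie)
get_mode <- function(x) {
  x <- x[!is.na(x)]  # drop missing values first
  counts <- table(x) # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])
}

get_mode(c(2, 5, 5, 7, 5, 2))  # 5
```

Wrapping the logic in a function keeps your script readable when you need the mode of several variables.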

Keyboard data

Let’s now import the keyboard data file. It’s called ‘keyboard_data_R_2026.csv’ and can be imported using the command ‘read.csv(file = “keyboard_data_R_2026.csv”)’. Be sure to assign the output to a new variable, which we’ll call ‘keyboard_data’.
  1. Import keyboard_data_R_2026.csv to a variable named keyboard_data.

The import of this CSV file is similar to how we imported the previous Excel file, so look back at that code for an example.
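If you want to check your understanding of read.csv before touching the real file, here is a self-contained sketch that writes a tiny made-up CSV to a temporary file and reads it back:

```r
# Write a two-row CSV to a temporary file (made-up data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ParticipantID = c(1, 2), WPM = c(20.5, 25.1)),
          tmp, row.names = FALSE)

# Read it back, just as you would read keyboard_data_R_2026.csv
demo_data <- read.csv(file = tmp)
nrow(demo_data)  # 2
```

The pattern is the same for the real file: pass the file name and assign the result to a variable.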

Long format data

The questionnaire data is structured in a ‘wide’ format: each row represents a participant, and each observation is put in a column. Instead, the keyboard_data is in ‘long’ format. Here, each row represents a trial, which is one attempt to write the sentence ‘the quick brown fox jumps over the lazy dog’.

  1. Explore the data using the glimpse function and/or by opening the dataset in RStudio:
  • How many rows are there per participant?
  • What happened to the questionnaire data, which was not measured per trial but only once per participant?
  • Which columns identify a row? ParticipantID is no longer unique, so which other columns are needed in addition to ParticipantID to know which trial we are looking at?
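To make the wide/long distinction concrete, here is a toy sketch using tidyr’s pivot_longer (the column names Trial1 and Trial2 are made up for illustration):

```r
library(tidyr)

# Wide format: one row per participant, one made-up column per trial
wide <- data.frame(ParticipantID = c(1, 2),
                   Trial1 = c(20, 25),
                   Trial2 = c(22, 27))

# Long format: one row per participant-trial combination
long <- pivot_longer(wide, cols = c(Trial1, Trial2),
                     names_to = "Trial", values_to = "WPM")
long  # 4 rows: 2 participants x 2 trials
```

In the long version, ParticipantID alone no longer identifies a row; you also need the Trial column, just like in keyboard_data.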

More on measures of central tendency and dispersion: The OPTI-keyboard

You can use the following code to calculate the measures of central tendency and dispersion for the OPTI-keyboard. Let’s first filter our data to only include trials with the OPTI-keyboard:

opti_data <- keyboard_data %>% filter(Keyboard == "OPTI")

We can use tidyverse/dplyr’s summarize function to derive various summary statistics from our data (skewness and kurtosis come from the moments library):

opti_data %>% summarize(
    Mean = mean(WPM, na.rm = TRUE),
    Median = median(WPM, na.rm = TRUE),
    SD = sd(WPM, na.rm = TRUE),
    Variance = var(WPM, na.rm = TRUE),
    Minimum = min(WPM, na.rm = TRUE),
    Maximum = max(WPM, na.rm = TRUE),
    Skewness = skewness(WPM, na.rm = TRUE),
    Kurtosis = kurtosis(WPM, na.rm = TRUE)
)
##       Mean   Median       SD Variance  Minimum  Maximum Skewness Kurtosis
## 1 12.27404 10.59483 10.52659  110.809 3.539852 144.6634 7.074001 69.72817
  1. Try to look at this table and interpret its results: What can we learn about our data?

Now that we have calculated the measures of central tendency and dispersion for the OPTI-keyboard, we can also look at the QWERTY-keyboard.

  1. Can you repeat the steps and calculate the measures for the QWERTY-keyboard?

Tests of normality: OPTI keyboard

A good way to investigate skewness, kurtosis, and normality is through the Kolmogorov-Smirnov statistic and the Shapiro-Wilk test.

The Kolmogorov-Smirnov test is run through ks.test() in R. For this we need to specify the reference distribution, here the normal distribution: pnorm.

The Shapiro-Wilk test is available in R via the shapiro.test() function.

We will try these tests in R for OPTI. Try to run the code. You will notice that the data in OPTI does not follow a normal distribution. What do we do now?

# Kolmogorov-Smirnov
ks.test(opti_data$WPM, "pnorm", mean = mean(opti_data$WPM), sd = sd(opti_data$WPM))

# Shapiro-Wilk
shapiro.test(opti_data$WPM)
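To build intuition for what these tests report, here is a sketch on simulated data (generated with rnorm and rexp, not our keyboard data):

```r
set.seed(42)
normal_sample <- rnorm(200)  # drawn from a normal distribution
skewed_sample <- rexp(200)   # strongly right-skewed, clearly non-normal

shapiro.test(normal_sample)$p.value  # typically large: no evidence against normality
shapiro.test(skewed_sample)$p.value  # very small: reject normality
```

A small p-value means the data are unlikely to come from a normal distribution, which is exactly what we see for the OPTI WPM values.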

Taking a look at outliers

ggplot(data = opti_data, aes(x = WPM)) +
  geom_histogram(binwidth = 5, color = "black", fill = "lightgreen") +
  labs(x = "OPTI", y = "Frequency")
Histogram of OPTI

qqnorm(opti_data$WPM, main = "Q-Q Plot of OPTI")
qqline(opti_data$WPM, col = "red")
Q-Q Plot of OPTI

ggplot(opti_data, aes(y = WPM)) +
  geom_boxplot(fill = "lightgreen", color = "black", width = 0.5) +
  labs(title = "Boxplot of WPM", y = "WPM")
Boxplot of WPM

These three visualisations (a histogram, a Q-Q plot, and a boxplot) are a good way to check what is going on in the data.

  1. Take a look at these three visualisations: Can you see why the data in OPTI is not normally distributed?
  1. Can you repeat these steps for our QWERTY keyboard? Are our data for QWERTY normally distributed, and why (not) so?