Cleaning and Analyzing distBuilder Data in R

Quentin André

Apr 8, 2021

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(scales))

This tutorial is for R users. If you are a Python user, check out this other tutorial instead!

You have just completed a simple distribution builder study on Qualtrics.

Participants have been randomly assigned to one of two conditions (“Low” or “High”), and have learned a distribution of numbers ranging from 11 to 20.
After learning the distribution, they had to predict the next 10 numbers that would appear in the distribution.
To do so, they allocated 10 balls (representing the next 10 draws) to 10 buckets [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] (representing the possible outcomes).

Here is the full dataset (20 participants), followed by a preview of the first five participants:

text <- ("A,Low,'3,1,0,2,1,0,0,0,1,2'
B,High,'0,0,2,0,2,0,1,2,0,3'
C,High,'3,0,0,2,1,0,0,0,3,1'
D,Low,'2,1,0,2,0,1,2,0,1,1'
E,Low,'3,2,0,0,0,1,0,3,0,1'
F,High,'0,0,2,0,0,1,2,0,2,3'
G,Low,'1,3,0,3,0,2,0,1,0,0'
H,High,'0,1,3,0,0,1,2,2,1,0'
I,High,'2,2,0,0,0,2,1,0,0,3'
J,Low,'1,3,1,0,2,2,0,0,0,1'
K,Low,'1,3,1,0,0,1,2,1,1,0'
L,Low,'3,1,2,1,0,1,0,1,0,1'
M,Low,'0,2,2,0,1,3,0,2,0,0'
N,Low,'1,3,0,2,0,1,1,1,1,0'
O,High,'0,0,1,3,1,0,0,3,0,2'
P,High,'1,1,2,0,0,0,2,0,2,2'
Q,High,'1,1,0,0,0,0,3,2,2,1'
R,High,'2,0,0,0,0,2,3,0,3,0'
S,High,'0,1,3,2,0,0,0,0,1,3'
T,Low,'1,3,1,0,1,0,3,0,0,1'")
data <- read.table(
  text = text, sep = ",",
  col.names = c("part_id", "condition", "allocation")
) %>%
  mutate(condition = factor(condition, levels = c("Low", "High")))
head(data, 5)

A data.frame: 5 × 3
	part_id	condition	allocation
	<fct>	<fct>	<fct>
1	A	Low	3,1,0,2,1,0,0,0,1,2
2	B	High	0,0,2,0,2,0,1,2,0,3
3	C	High	3,0,0,2,1,0,0,0,3,1
4	D	Low	2,1,0,2,0,1,2,0,1,1
5	E	Low	3,2,0,0,0,1,0,3,0,1

1. Data Wrangling: Making the raw data usable

If you look at the raw data from Qualtrics, you can see that the column “allocation” has stored the allocation provided by each participant.

However, those allocations are not interpretable right now:

The allocations are stored as a string (and not as a list of numbers).
The numbers in this allocation are not the values of the distribution: they are the number of balls in each bucket. You can see that no value exceeds 10 (the total number of balls), while the buckets range from 11 to 20.
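
To make the distinction concrete, here is a minimal base R sketch using participant A's allocation from the data above. The variable names (counts, values) are only for illustration: each count says how many times the matching bucket should be repeated.

```r
# Participant A's allocation: number of balls placed in each bucket
counts <- c(3, 1, 0, 2, 1, 0, 0, 0, 1, 2)
buckets <- 11:20  # The possible outcomes shown to participants

# Repeat each bucket as many times as it received balls
values <- rep(buckets, times = counts)
print(values)
#> [1] 11 11 11 12 14 14 15 19 20 20
```

The result has exactly as many values as there are balls (10), which is the property checked later in this tutorial.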

Let’s review how you should transform this data, and what you can do with it.

A. Converting the allocations to distributions

The first step is to write a function and convert those allocations to distributions.

convert_allocation_to_distribution <- function(allocation, buckets) {
  # Takes an allocation of balls to buckets (a comma-separated string) and a
  # vector of buckets. Returns the corresponding distribution of values,
  # wrapped in a list so that it can be stored in a list-column.
  allocation <- as.character(allocation)  # Failsafe in case it was stored as a factor
  # Split the string and convert the counts to integers before repeating
  list_alloc_num <- as.integer(unlist(strsplit(allocation, ",")))
  list(rep(buckets, list_alloc_num))
}

Now we apply this function to the column “allocation”, specifying that the buckets ranged from 11 to 20.

buckets <- seq(11, 20)  # Our buckets
data <- data %>%
  rowwise() %>%
  mutate(distribution = convert_allocation_to_distribution(allocation, buckets)) %>%
  ungroup()
head(data, 5)

A tibble: 5 × 4
part_id	condition	allocation	distribution
<fct>	<fct>	<fct>	<list>
A	Low	3,1,0,2,1,0,0,0,1,2	11, 11, 11, 12, 14, 14, 15, 19, 20, 20
B	High	0,0,2,0,2,0,1,2,0,3	13, 13, 15, 15, 17, 18, 18, 20, 20, 20
C	High	3,0,0,2,1,0,0,0,3,1	11, 11, 11, 14, 14, 15, 19, 19, 19, 20
D	Low	2,1,0,2,0,1,2,0,1,1	11, 11, 12, 14, 14, 16, 17, 17, 19, 20
E	Low	3,2,0,0,0,1,0,3,0,1	11, 11, 11, 12, 12, 16, 18, 18, 18, 20

Good! The raw allocation strings have now been converted to the actual distribution provided by participants.

However, the data is not in a nice shape yet: the data for each participant is stored as a list of numbers in a single column. Ideally, we’d like a format that we can use for analysis and graphs.

B. Pivoting the distributions in long form

We are now going to reshape the data in long form, such that one record corresponds to one value entered by one participant. This is achieved using the unnest function from tidyr (loaded as part of the tidyverse).

data_long <- data %>%
  unnest(distribution) %>%
  group_by(part_id) %>%
  mutate(value_index = row_number()) %>%
  ungroup() %>%
  rename(value = distribution)
head(data_long, 10)

A tibble: 10 × 5
part_id	condition	allocation	value	value_index
<fct>	<fct>	<fct>	<int>	<int>
A	Low	3,1,0,2,1,0,0,0,1,2	11	1
A	Low	3,1,0,2,1,0,0,0,1,2	11	2
A	Low	3,1,0,2,1,0,0,0,1,2	11	3
A	Low	3,1,0,2,1,0,0,0,1,2	12	4
A	Low	3,1,0,2,1,0,0,0,1,2	14	5
A	Low	3,1,0,2,1,0,0,0,1,2	14	6
A	Low	3,1,0,2,1,0,0,0,1,2	15	7
A	Low	3,1,0,2,1,0,0,0,1,2	19	8
A	Low	3,1,0,2,1,0,0,0,1,2	20	9
A	Low	3,1,0,2,1,0,0,0,1,2	20	10

Great! We now have the data in long form. At this stage, it is good practice to double-check that all your participants have entered a full distribution (i.e., that there are 10 values per participant).

data_long %>%
  count(part_id) %>%
  head(5)

A tibble: 5 × 2
part_id	n
<fct>	<int>
A	10
B	10
C	10
D	10
E	10
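
Reading the counts by eye works for a handful of participants, but not for a full sample. One possible way to automate the check is sketched below in base R. The toy data frame toy_long and the helper find_incomplete() are hypothetical names, and participant "Z" is deliberately made incomplete to show what a failure looks like.

```r
# Hypothetical long-form data: "Y" has 10 values, "Z" only 8
toy_long <- data.frame(
  part_id = c(rep("Y", 10), rep("Z", 8)),
  value = c(11:20, 11:18)
)

# Return the IDs of participants whose number of values differs from `expected`
find_incomplete <- function(df, expected = 10) {
  n_values <- table(df$part_id)
  names(n_values)[n_values != expected]
}

find_incomplete(toy_long)
#> [1] "Z"
```

With the tidyverse loaded, data_long %>% count(part_id) %>% filter(n != 10) performs the same check on the real data and should return zero rows.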

We are now ready for some dataviz and statistical analysis!

2. Visualizing the data

The first thing you can do is to inspect the distributions provided by the participants:

options(repr.plot.width = 7, repr.plot.height =5)
ggplot(data_long) +
facet_wrap(vars(part_id), ncol = 4, scales = "free") +
geom_bar(aes(value, fill = condition)) +
xlab("Distributions") +
ylab("Count") +
scale_x_continuous(breaks = seq(11, 20, 1)) +
scale_y_continuous(limits = c(0, 4)) +
scale_fill_manual(values = c("#619CFF", "#F8766D")) +
theme_classic() +
theme(
  axis.text.x = element_text(size = rel(0.75)),
  strip.text = element_text(face = "bold", size = 9),
  strip.background = element_rect(fill = NULL, colour = "black", size = 0)
)

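[Figure: bar charts of the distribution reported by each participant, one panel per participant, colored by condition]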

Those distributions look pretty random (because they were randomly generated), but at least we can see how each participant responded!

Note that we could also look at the overall distribution of responses by condition. Don’t forget to plot the frequency (rather than the count) by condition, otherwise you will not be able to compare two conditions that do not have the exact same number of participants:

options(repr.plot.width = 6, repr.plot.height = 3)
ggplot(data_long) +
facet_wrap(vars(condition), ncol = 4, scales = "free") +
geom_bar(aes(value, fill = condition, y = after_stat(prop), group = condition)) +  # after_stat() replaces the older ..prop.. notation
scale_x_continuous(breaks = seq(11, 20, 1)) +
scale_y_continuous(limits = c(0, 0.25), labels = percent) +
scale_fill_manual(values=c("#619CFF", "#F8766D")) +
xlab("Distributions") +
ylab("Frequency") +
theme_classic() +
theme(
  axis.text.x = element_text(size = rel(0.75)),
  strip.text = element_text(face = "bold", size = 9),
  strip.background = element_rect(fill = NULL, colour = "black", size = 0)
)

[Figure: proportion of responses for each value, one panel per condition]
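
To see why counts can mislead, consider a small made-up example in base R with unequal group sizes (the data frame toy is hypothetical and not part of the study). Both groups put the same share of their balls on 11, yet the raw counts differ by a factor of two.

```r
# Hypothetical data: two participants' worth of values in "Low", one in "High"
toy <- data.frame(
  condition = c(rep("Low", 20), rep("High", 10)),
  value = c(rep(c(11, 12), each = 10), rep(c(11, 12), each = 5))
)

counts <- table(toy$condition, toy$value)
props <- prop.table(counts, margin = 1)  # Proportions within each condition

print(counts)  # "Low" has twice as many 11s as "High"...
print(props)   # ...but the share of 11s is 0.5 in both conditions
```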

3. Analyzing properties of the distributions

The analysis of the data always depends on the research question.

So far, we have taken a descriptive approach to the data: We considered that we have collected 10 values from each participant, and have analyzed the data as such.

However, we could have collected the data with the specific goal of testing people’s beliefs about particular properties of the distribution:

We could be interested in the average value that people expect (i.e., the mean).
We could be interested in how dispersed participants expect the distribution to be (e.g., the standard deviation).
We could be interested in how asymmetric they expect it to be (i.e., the skewness).

In that case, the relevant unit of analysis is a statistic computed at the level of each participant’s distribution.

We need to group the data by part_id, and compute the statistics of interest, to obtain one data point per participant.

data_summary <- data_long %>%
  group_by(condition, part_id) %>%
  summarise(Dist_SD = sd(value), Dist_M = mean(value), .groups = "drop")
head(data_summary, 5)

A tibble: 5 × 4
condition	part_id	Dist_SD	Dist_M
<fct>	<fct>	<dbl>	<dbl>
Low	A	3.713339	14.7
Low	D	3.212822	15.1
Low	E	3.622461	14.7
Low	G	2.233582	13.9
Low	J	2.740641	14.2
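
The same pattern extends to other statistics. Base R has no built-in skewness function, so the sketch below defines a small helper, skewness_moment() (a hypothetical name), using the simple moment-based formula; packages that provide skewness may apply slightly different small-sample corrections. It is applied to the distributions of participants A and B from the data above.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness_moment <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Long-form values for participants A and B (taken from the tutorial's data)
toy_long <- data.frame(
  part_id = rep(c("A", "B"), each = 10),
  value = c(11, 11, 11, 12, 14, 14, 15, 19, 20, 20,
            13, 13, 15, 15, 17, 18, 18, 20, 20, 20)
)

# One row per participant, one column per statistic
toy_summary <- data.frame(
  part_id = sort(unique(toy_long$part_id)),
  Dist_M = as.numeric(tapply(toy_long$value, toy_long$part_id, mean)),
  Dist_SD = as.numeric(tapply(toy_long$value, toy_long$part_id, sd)),
  Dist_Skew = as.numeric(tapply(toy_long$value, toy_long$part_id, skewness_moment))
)
print(toy_summary)
```

In the tidyverse pipeline, the equivalent would be to add Dist_Skew = skewness_moment(value) inside the summarise() call, once the helper is defined.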

We can now see if our condition had an impact on people’s perceived mean of the distribution:

options(repr.plot.width = 4, repr.plot.height = 4)
ggplot(data_summary, aes(x = condition, y = Dist_M, color = condition)) +
geom_jitter(width = 0.2) +
xlab("Condition") +
ylab("Mean of Reported Distributions") +
scale_color_manual(values = c("#619CFF", "#F8766D")) +
theme_classic()

[Figure: jittered points showing the mean of each participant’s reported distribution, by condition]

Here, we see that participants in the “Low” condition tended to report a lower mean than participants in the “High” condition (which is exactly how I created the data, so not so surprising…).
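
A plot alone does not establish a difference, so a formal comparison is a natural next step. One option, sketched below in base R, is a Welch two-sample t-test on the participant-level means. The choice of test is an assumption made for illustration, not something prescribed by distBuilder. The allocations are copied from the data above, and the helper allocation_mean() is a hypothetical name.

```r
buckets <- 11:20

# Allocations copied from the tutorial's data, grouped by condition
low <- c("3,1,0,2,1,0,0,0,1,2", "2,1,0,2,0,1,2,0,1,1", "3,2,0,0,0,1,0,3,0,1",
         "1,3,0,3,0,2,0,1,0,0", "1,3,1,0,2,2,0,0,0,1", "1,3,1,0,0,1,2,1,1,0",
         "3,1,2,1,0,1,0,1,0,1", "0,2,2,0,1,3,0,2,0,0", "1,3,0,2,0,1,1,1,1,0",
         "1,3,1,0,1,0,3,0,0,1")
high <- c("0,0,2,0,2,0,1,2,0,3", "3,0,0,2,1,0,0,0,3,1", "0,0,2,0,0,1,2,0,2,3",
          "0,1,3,0,0,1,2,2,1,0", "2,2,0,0,0,2,1,0,0,3", "0,0,1,3,1,0,0,3,0,2",
          "1,1,2,0,0,0,2,0,2,2", "1,1,0,0,0,0,3,2,2,1", "2,0,0,0,0,2,3,0,3,0",
          "0,1,3,2,0,0,0,0,1,3")

# Mean of the distribution implied by one allocation string
allocation_mean <- function(allocation, buckets) {
  counts <- as.integer(strsplit(allocation, ",")[[1]])
  sum(buckets * counts) / sum(counts)
}

means_low <- vapply(low, allocation_mean, numeric(1), buckets = buckets)
means_high <- vapply(high, allocation_mean, numeric(1), buckets = buckets)

# Welch's t-test (the default in t.test) does not assume equal variances
result <- t.test(means_low, means_high)
print(result)
```

Note that the test is run on one mean per participant (20 data points), not on the 200 individual values, because values from the same participant are not independent.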

4. Wrapping Up

We have seen how to convert the raw data generated by distBuilder to a standard long form dataset.

A few words of caution to finish:

Always check that the list of buckets you use in the analysis matches the list of buckets that was shown to participants.
Always check your number of records. If something is off, you might have missing data, or you might not have forced participants to submit a full distribution.
Always ask yourself: what is my unit of analysis? Is it the values provided by participants, or is it a summary statistic of their total reported distribution?
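
The first two checks can be automated before any conversion takes place. The base R sketch below uses a hypothetical helper, validate_allocation(), and assumes the design described in this tutorial: 10 buckets and 10 balls per participant. Adjust both if your study differs.

```r
# Check one allocation string against the study design.
# Returns TRUE when it has one count per bucket and the expected total.
validate_allocation <- function(allocation, buckets, n_balls = 10) {
  counts <- as.integer(strsplit(as.character(allocation), ",")[[1]])
  isTRUE(length(counts) == length(buckets) && sum(counts) == n_balls)
}

buckets <- 11:20
validate_allocation("3,1,0,2,1,0,0,0,1,2", buckets)  # Participant A: TRUE
validate_allocation("3,1,0,2,1,0,0,0,1", buckets)    # Hypothetical, one bucket missing: FALSE
validate_allocation("3,1,0,2,1,0,0,0,1,1", buckets)  # Hypothetical, only 9 balls: FALSE
```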