forked from mortenarendt/dataanalyssisconsumerscience
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08_MSTexercises.Rmd
252 lines (186 loc) · 11.3 KB
/
08_MSTexercises.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
# MST exercises
Below you will find exercises relevant for the data analysis for the project work in the Meal Systems and Technologies. Exercises are meant as a guide on what you can do with data obtained with the iBuffet.
## Exercise 1 - cleaning Compusense data with R
The task of this exercise is to get your data ready for data-analysis i.e. cleaning and merging your data! This can be done in R or excel.
That includes:\
- Give meaningful labels to variables.\
- Merging of "control" and "intervention" spreadsheet.\
- Removal of meaningless rows such as rows without data etc.\
- Removal of unimportant characters.\
- Merging of consumption data with questionnaire. (Remember to have matching columns if using R - see chapter 2.3.1.1).
### Exercise 1 - Solution
Start by importing the data and combine them row wise with the *rbind()* function. These datasets look exactly like the ones that you will receive from Compusense, but for the sake of simplicity, they have been included in the *data4consumerscience*-package.
```{r,warning=FALSE,message=FALSE}
library(data4consumerscience)
data("mstcontrol")
data("mstintervension")
data("mstquestionare")
consumption <- rbind(mstcontrol,mstintervension)
```
Next we clean our data. We want to remove rows containing "--" , "0" and "REFILL". We use the *filter* function from the *tidyverse* package. The "!" is used as a NOT logical operator, meaning it will only select rows that do NOT contain "\--". *if_any* checks all rows in specific columns, to see if conditions are met. *everything* specifies that it should be all columns.
```{r,message=FALSE,warning=FALSE}
library(tidyverse)
consumption <- consumption %>%
filter(!if_any(everything(), ~ .x == "--")) %>%
filter(!if_any(everything(), ~ .x == "0")) %>%
filter(!if_any(everything(), ~ grepl("REFILL", .x)))
```
If we take a look at the *Person* column, then it is rather confusing. We want to have the Person ID only. Which can be found after "A:". This can be done by using gsub with string specifications.
```{r}
consumption$Person <- gsub("^A:\\s*(\\d+).*", "\\1", consumption$Person) # Only extract the "Number" after "A:".
```
Then we merge the questionnaire and consumption data using *left_join* by a common variable "Person"
```{r,warning=FALSE}
mstquestionare$Person <- as.factor(mstquestionare$Person)
consumption$Person <- as.factor(consumption$Person)
Buffet_quest <- consumption %>%
left_join(mstquestionare, by = c("Person"))
Buffet_quest <- na.omit(Buffet_quest) #remove rows with NA values
```
We are however not interested in all variables. Therefore, we select only the relevant variables! In this case only variables in columns 1-7 and 18-29.
```{r,warning=FALSE}
Buffet_quest <- Buffet_quest[,c(1:7,18:29)]
```
The scale of the questionnaire are given on likert scale e.g. 1-5, 1-7 or 1-9. Or male and female are encoded as 1 or 2, if we however want the corresponding text answer. e.g. 1 = "Female" or 2 = "Male". This can be done with the *mutate* and *fct_recode*. Below will only show examples of how this could be done, for some selected variables. The opposite of what is shown, can also be used going from text to numbers! While we are at it, we are also going to rename inside the "Buffet name" variable.
```{r,warning=FALSE}
library(forcats)
Buffet_quest <- Buffet_quest %>%
mutate(
Gender = fct_recode(as.factor(Gender),
"Female" = "1",
"Male" = "2"),
Dietary_lifestyle = fct_recode(as.factor(Dietary_lifestyle),
"Omnivore" = "1",
"Flexitarian" = "2",
"Pescetarian" = "3",
"Vegetarian" = "4",
"Vegan" = "5"),
`Oat/Oatmeal_familiarity` = fct_recode(as.factor(`Oat/Oatmeal_familiarity`),
"I do not recognize it" = "1",
"I recognize it, but i have not tasted it" = "2",
"I have tasted it, but i do not consume it on a regular basis" = "3",
"I consume it on a regular basis" = "4"),
Freq_oat_consumption = fct_recode(as.factor(Freq_oat_consumption),
"Never" = "1",
"Less than once a month" = "2",
"1-3 times a month" = "3",
"1-3 times a week" = "4",
"4-6 times a week" = "5",
"Once a day" = "6",
"2 or more times a day" = "7"),
Freq_ryebread_consumption = fct_recode(as.factor(Freq_ryebread_consumption),
"Never" = "1",
"Less than once a month" = "2",
"1-3 times a month" = "3",
"1-3 times a week" = "4",
"4-6 times a week" = "5",
"Once a day" = "6",
"2 or more times a day" = "7"),
`Buffet name` = recode(`Buffet name`,
"OAT TEST Kontrol" = "control",
"OAT TEST Interventions" = "intervention")
)
```
See how it looks below!
```{r,warning=FALSE}
library(DT)
datatable(Buffet_quest,rownames = FALSE,options = list(pageLength = 5, autoWidth = TRUE))
```
## Exercise 2 - creating a tableone and visualization
Descriptive Statistics are used to present quantitative descriptions in a manageable form e.g. simpler summary (e.g. a number, average etc.). Although you run the risk of distorting the original data or losing important details descriptive statistics provide a powerful summary that may enable comparisons across people or other units.
You should use tableone as a way to present your descriptive statistics! See chapter 10.4 for how to do it!
### Exercise 2 - solution
Lets have a quick overview of the data by the use of *CreateTableOne*. Note: If you want to showcase e.g. "Age" with median and IQR, you need to add the variable to the *nonnormal* argument in the *print* function.
```{r,warning=FALSE, fig.width=7}
library(tableone)
#it seems the variable "Consumption in grams" was a character. We want it to be numeric in our tableone!
Buffet_quest$`Consumption in grams` = as.numeric(Buffet_quest$`Consumption in grams`)
tb1 <- CreateTableOne(data = Buffet_quest, vars = c("Gender", "Age", "Nationality",
"Dietary_lifestyle", "Oat/Oatmeal_familiarity", "Freq_oat_consumption",
"Freq_ryebread_consumption","I_like_oat_/_oatmeal", "I_like_ryebread","Consumption in grams"),strata = "Buffet name")
print(tb1, showAllLevels = TRUE, nonnormal = 'Age')
```
From these results, we can quickly gather information such as:
Be careful with the interpretation as each participant occurs two times each.
- The number of samples is 76 (n = 76). Where the intervention group has 44 samples (n = 44), and the control group has 32 samples (n = 32)
- The vast majority Like oat/oatmeal moderately or even more! No significant difference in oat liking across control and intervention (p = 0.285)
- The average consumption of oat and rye was 100 grams in the control, where the average consumption of oat and ryebread was 85 grams. And significantly different (see t-test chapter 6.4)
It could be interesting to see if there is a difference in the control and intervention. This can be done with the use of the strata argument. - Beware that the function uses ANOVA (chapter 6.6) when comparing more than two groups.
```{r}
tb2 <- CreateTableOne(data = Buffet_quest, vars = c("Gender", "Age", "Nationality",
"Dietary_lifestyle", "Oat/Oatmeal_familiarity", "Freq_oat_consumption",
"Freq_ryebread_consumption","I_like_oat_/_oatmeal", "I_like_ryebread","Consumption in grams"),strata = c("Station name", "Buffet name"))
print(tb2, showAllLevels = TRUE)
```
Lets visualize the difference in consumption based on Station Name and Buffet Name, with a boxplot, violin plot.
```{r}
library(ggplot2)
ggplot(data = Buffet_quest,aes(x = interaction(`Buffet name`,`Station name`),y = `Consumption in grams`, fill = interaction(`Buffet name`,`Station name`)))+
geom_boxplot()+
geom_jitter()+
guides(fill=guide_legend(title = "Buffet & Station")) +
theme_bw()+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
```
```{r}
ggplot(data = Buffet_quest,aes(x = interaction(`Buffet name`,`Station name`),y = `Consumption in grams`, fill = interaction(`Buffet name`,`Station name`)))+
geom_violin()+
guides(fill=guide_legend(title = "Buffet & Station")) +
theme_bw()+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
```
Or use a histogram to show the distribution of consumption in control and intervention.
```{r}
ggplot(data = Buffet_quest, aes(x = `Consumption in grams`, fill = `Station name`))+
geom_histogram()+
facet_wrap(~`Buffet name`)
```
## Exercise 3 - Creating an outcome table
In this exercise you should create and outcome table. An outcome table includes:
- Mean and standard deviation for each treatment (intervention and control in this case. Hint: Variable name "Buffet name" )
- The p-value from a generalized linear model adjusted for relevant parameters (e.g. age, gender, "I_like_oat")
### Exercise 3 - Solution
We can calculate the mean and sd using R basic functions. Or simply use the ones from tableone.
If you want to use variables on the likert scale for your linear model, one should use the numerical form. For this we will simply use "Buffet_quest"
Lets find mean and sd for oat and rye bread.
First we split the data into oat and rye bread based on the station name and calculate mean and sd based on intervention/control and oat/rye
```{r,warning=FALSE}
library(tidyverse)
Oat_stats <- Buffet_quest %>%
filter(`Station name` == "OAT") %>%
group_by(`Station name`,`Buffet name`) %>%
dplyr::summarise(
mean = mean(`Consumption in grams`),
sd = sd(`Consumption in grams`)
)
Rye_stats <- Buffet_quest %>%
filter(`Station name` == "RYE BREAD") %>%
group_by(`Station name`,`Buffet name`) %>%
dplyr::summarise(
mean = mean(`Consumption in grams`),
sd = sd(`Consumption in grams`)
)
outcometable <- rbind(Oat_stats,Rye_stats)
```
Next we use *lm* and *summary* to find the p-values for the linear model. Do a linear m for each type of consumption and control/intervention adjusted for liking, age, gender or other factors you think could be important, or are shown significant in tableone.
```{r,warning=FALSE}
Oat_lm <- Buffet_quest %>%
filter(`Station name` == "OAT")
mdl <- lm(data = Oat_lm, `Consumption in grams` ~ `Buffet name` + Age + factor(Gender) + `I_like_oat_/_oatmeal`)
summary(mdl)
Rye_lm <- Buffet_quest %>%
filter(`Station name` == "RYE BREAD")
mdl2 <- lm(data = Rye_lm, `Consumption in grams` ~ `Buffet name` + Age + factor(Gender) + I_like_ryebread)
summary(mdl2)
```
We combine the models p-values with our outcome table
```{r,warning=FALSE}
outcometable$p.value <- c(0.6122, NA, 0.257, NA)
kableExtra::kable(outcometable)
```
The outcome table should be presented nicer. But that is for you to do!
Further one should explore possible interactions in the linear model.