
betta() and betta_random() p-value between each combination of discrete variable #198

Open · AlexaBennett opened this issue Nov 13, 2024 · 5 comments

Comments

@AlexaBennett

I am hoping you can point me in the right direction. I was following the diversity-hypothesis-testing vignette, and wanted to know the mathematically sound way to test for statistically significant differences between all pairs of groups. In the case of the tutorial, this would be the diversity at day 0 vs day 12, day 0 vs day 82, and day 12 vs day 82.

@adw96
Owner

adw96 commented Nov 13, 2024

Thanks @AlexaBennett, great question. Before I dive into checking for you, do you have any other adjustment variables (in addition to day)? That would help me give you a better answer. Thanks!!

@AlexaBennett
Author

AlexaBennett commented Nov 13, 2024

Right now, I am using a single adjustment variable. I plan to add another discrete variable.

Regarding the single variable, here is what I have tried.

  1. My first attempt was to subset the data so that the previous intercept group was removed, and then use the p.adjust() function from the stats package to correct the p-values. This strategy was meant to avoid re-analyzing the relationship between two groups that had already been compared. However, I noticed that subsetting can change the estimates, errors, and p-values of the remaining groups, and in my data this could change which results are significant.
  2. Do not subset, but instead set the factor levels so that a different group is forced into the intercept (a minimal sketch of what I mean is below). This produces more stable estimates, which in my data begin to vary only around the 1e-06 position, and relatively stable errors for the groups that are not the intercept. However, the intercept's error tends to either increase or decrease. The p-values between two groups (no data removed, only factor levels re-ordered on the discrete variable) can also differ markedly. For example, the p-value for the A-B comparison with A as the intercept was 0.0000002, while for B-A with B as the intercept it was 0.3866848. Striking that example out, since I noticed one of my groups had lost too many members for a valid statistical comparison; with that corrected, the p-value for the A-B comparison with A as the intercept was 0, while for B-A with B as the intercept it was 0.0001250.
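
Here is a minimal sketch of attempt 2, assuming a tibble laid out like combined_richness in the vignette (estimate and error columns plus a discrete Group column; the names here are placeholders for my real data):

library(breakaway)
library(dplyr)

# Force group "B" into the intercept by re-ordering the factor levels;
# no samples are removed, only the reference level changes.
releveled <- combined_richness %>%
  mutate(Group = relevel(factor(Group), ref = "B"))
bt_releveled <- betta(formula = estimate ~ Group,
                      ses = error, data = releveled)
bt_releveled$table  # estimates, standard errors, and p-values relative to group B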

@AlexaBennett
Author

Given the currently available functions and the stability of results for each intercept, this is my planned approach:

  1. Calculate the estimates with each group as the intercept
  2. Plot the estimates and errors using the median value obtained for each group from all iterations of the above
  3. Take the max p-value obtained for each A-B comparison
  4. Apply p-value correction

I am open to a more statistically sound methodology as my background is in Biology and not Statistics.
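
Roughly, I am imagining something like the following sketch (placeholders again: a combined_richness tibble with estimate and error columns and a discrete Group column stored as character; the row and column names of $table, such as "GroupB" and "p-values", are assumed from vignette-style output and would need adjusting for real data):

library(breakaway)
library(dplyr)
library(purrr)

groups <- unique(combined_richness$Group)

# Step 1: refit betta() once with each group forced into the intercept
# and keep the coefficient table from each parameterization.
tables <- map(groups, function(ref) {
  dat <- combined_richness %>%
    mutate(Group = relevel(factor(Group), ref = ref))
  betta(formula = estimate ~ Group, ses = error, data = dat)$table
})
names(tables) <- groups

# Step 3: for each pair A-B, the "GroupB" row of the fit with A as the
# reference tests A vs B; keep the larger of the two p-values obtained
# from the two parameterizations. (Step 2, plotting the median estimates
# and errors across parameterizations, is omitted here.)
pair_pvals <- combn(groups, 2, simplify = FALSE) %>%
  map_dbl(function(pair) {
    p_ab <- tables[[pair[1]]][paste0("Group", pair[2]), "p-values"]
    p_ba <- tables[[pair[2]]][paste0("Group", pair[1]), "p-values"]
    max(p_ab, p_ba)
  })

# Step 4: multiplicity correction across all pairwise comparisons.
p.adjust(pair_pvals, method = "BH")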

@adw96
Owner

adw96 commented Nov 14, 2024

These are great questions and thanks for circling back, @AlexaBennett. Fortunately, there is an easier way than iterating through various model parameterizations. As you suggest, let's suppose we want to test that the average richness is equal for all days, i.e.,
average richness (Day 0) = average richness (Day 12) = average richness (Day 82).

In this case, we can look at the output of global as follows:

library(tidyverse)
library(breakaway)
library(corncob)
library(phyloseq)
data("soil_phylo")
soil_phylo %>% sample_data %>% head
subset_soil <- soil_phylo %>%
  subset_samples(Amdmt == 1) # only biochar
richness_soil <- subset_soil %>% breakaway
meta <- subset_soil %>%
  sample_data %>%
  as_tibble %>%
  mutate("sample_names" = subset_soil %>% sample_names )
combined_richness <- meta %>%
  left_join(summary(richness_soil),
            by = "sample_names")
bt_day_fixed <- betta(formula = estimate ~ Day, 
                      ses = error, data = combined_richness)
bt_day_fixed$table
bt_day_fixed$global[2] ## p-value for strong null

and in this case we would reject the null (p < 0.001).

This approach will test your desired hypothesis as long as you have no other variables in the model. It will take a bit of work on the backend for us to build that for you -- but it's work worth doing if you need it! Before we do that, can you confirm that you want to test "average richness is equal across all days with the same amendment" using something like the following:

subset_soil2 <- soil_phylo %>%
  subset_samples(Amdmt %in% c(0, 1)) # biochar and unmodified
richness_soil2 <- subset_soil2 %>% breakaway
meta2 <- subset_soil2 %>%
  sample_data %>%
  as_tibble %>%
  mutate("sample_names" = subset_soil2 %>% sample_names )
combined_richness2 <- meta2 %>%
  left_join(summary(richness_soil2),
            by = "sample_names")
bt_day_fixed2 <- betta(formula = estimate ~ Day + Amdmt, 
                      ses = error, data = combined_richness2)

If so, I'll see what we can do.

@AlexaBennett
Author

@adw96 This is what I am looking for, with the added ability to see which days are statistically different from one another within each amendment.

While not the original point of this issue, the instability of the standard error of the estimate for the intercept is a higher priority for me at this time. Should I open another issue?

With your example, I was setting factor levels for "Day" to get the final p-value for the relationship between day 12 and day 82. That was when I realized, in my real-life data, that the feature used as the intercept in $table was susceptible to its error decreasing by up to 75% in some cases.
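
As a rough illustration of what I mean (reusing the combined_richness object from your example, and assuming the intercept is the first row of $table):

# Refit with each Day level forced into the intercept and print the
# intercept row (estimate, standard error, p-value) from each fit.
for (ref in unique(combined_richness$Day)) {
  dat <- combined_richness
  dat$Day <- relevel(factor(dat$Day), ref = ref)
  fit <- betta(formula = estimate ~ Day, ses = error, data = dat)
  cat("Reference level:", ref, "\n")
  print(fit$table[1, ])
}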
