Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Computation of genetic diversity stats and comparison between populations #43

Open
wants to merge 32 commits into
base: master
Choose a base branch
from

Conversation

jonbrenas
Copy link
Collaborator

This notebook contains my code for comparing Watterson's theta at various sample sizes for different populations. So far, I have not handled the arabiensis samples (or the intermediate ones if we deem that the ones in the Far West and the ones in Kenya 12 are comparable and that we have enough sample sets with them). The pictures are also quite ugly.

That said, the main issue is that it creates a zarr file with the stats for the various populations and populations sizes that we don't want to have in the repo. So first runs might be long. A solution could be to create a bucket on the cloud where the data would be stored. It wasn't necessary as long as I was working on my lonesome on this but it might be worth it if it becomes more of a collaborative effort.

The same code (modulo the name of the stats when the scikit allel function is called) can probably be reused for a few stats but a few tweeks are likely to be needed (for instance, arabiensis only use 3L for the analysis so I will have to add a parameter).

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@leehart leehart self-requested a review October 15, 2020 13:09
@jonbrenas
Copy link
Collaborator Author

A short comment: The zarr files are used to store the various runs of Watterson's theta computation. As Datalab kept dying on me, even we just plotting stuff, dataframes are also created that contain the data that we want to plot. The "Dataframe" section should be skipped on the first run (because the files that are read do not yet exist). The Dataframes are created by the first population for each species. The Dataframes will need to be reworked to store all the relevant populations and only those.

@leehart
Copy link
Collaborator

leehart commented Oct 21, 2020

  • I recall @hardingnj wants to use notebooks/ag3.py instead of intake.
  • Add cluster.adapt() to this nb, as per @alimanfoo 's recent Discourse.

@hardingnj
Copy link
Contributor

I don't object to intake per se, it's just that I found it insufficient alone.

There is quite a bit of complexity in ag3, in terms of masking sites and dropping samples. My ultimate preference would be to use intake in ag3.py, but I really feel there is still a need for ag3.py because of the organisation of the data.

@leehart
Copy link
Collaborator

leehart commented Oct 21, 2020

Cool, I'd be happy to convert ag3.py to use intake. Or at least I can make a start on it and submit a PR.

@hardingnj
Copy link
Contributor

Sounds good to me!

@hardingnj
Copy link
Contributor

Thanks Jon.

I think what I would like to do with this PR is change the focus from an investigation of sample size effects to a general purpose analysis that generates summary statistics for different populations.

That said- I think the trajectory of theta is informative, so probably should remain part of the work.

I guess it mainly depends on how we want to show the data, let's chat about this this afternoon.

@jonbrenas jonbrenas changed the title Comparison of genetic diversity stats at various samples sizes Computation of genetic diversity stats and comparison between populations Oct 28, 2020
@jonbrenas
Copy link
Collaborator Author

I have made some attempts to show the difference between coluzzii and gambiae through the Watterson's theta lense. It does not show on the notebook because I used the data from my TM study. I thus have 5 coluzzii populations and 6 gambiae populations from the same(ish) locations and same(ish) time. All populations have been downsampled to the smallest ones to remove the sample size effect. I will only show 3L, 3R looks similar.

The first idea was to plot the distribution of WT in each window for coluzzii and gambiae. It works and it shows very clearly that there are some regions with a noticeable difference but in general it is not easy to see much of a difference.

image

The second option to show the difference between the mean value for each species in each window. It looks much clearer but it lacks information on variance.

image

Trying to show the difference between the minimum of gambiae and the maximum of coluzzii shows that there is a lot of overlap.

image

The last thing I tried was to normalize the first plot with the mean value for coluzzii (so that its plot is centered around one). I think it is the one that shows the difference the most clearly but it is also the most abstract measure of the bunch.

image

nicholasharding added 2 commits October 28, 2020 14:36
- fully automated based on distance clustering
- new naming scheme
- includes arabiensis
@hardingnj
Copy link
Contributor

Thanks Jon. This is neat. This clearly shows that the variation due to region exceeds that of species.

I'm not sure why we are particularly interested in theta by window in this case. I think it's only interesting when we are examining the features of the genome.

The fundamental question about whether there is a difference in theta between gambiae and coluzzii could be addressed by simply taking the whole genome as our unit of theta (standardizing on minimum n), ranking each population, and using something like the Mann–Whitney U test.

I think to do otherwise is problematic- by taking windows and looking at their distribution we (falsely) imply independent observations.

An approach which may bring the best of both worlds, is to compute U1 + U2 for each window in the genome, based on gambiae/coluzzii populations and to plot that value over chromosomes. Where the ranks are evenly distributed U1/U2 will be around 15, where all gambiae populations are higher, U1/U2 will be 30/0. The key thing is that the populations are the unit of sampling here, not windows.

@hardingnj
Copy link
Contributor

An approach which may bring the best of both worlds, is to compute U1 + U2 for each window in the genome, based on gambiae/coluzzii populations and to plot that value over chromosomes. Where the ranks are evenly distributed U1/U2 will be around 15, where all gambiae populations are higher, U1/U2 will be 30/0. The key thing is that the populations are the unit of sampling here, not windows.

In fact to do this should be straightforward given the data you have computed already?

@jonbrenas
Copy link
Collaborator Author

jonbrenas commented Oct 29, 2020

I think we will discuss that on Monday, so this will work more as notes. That said my stats knowledge is a little hazy, right now.

The fundamental question about whether there is a difference in theta between gambiae and coluzzii could be addressed by simply taking the whole genome as our unit of theta (standardizing on minimum n), ranking each population, and using something like the Mann–Whitney U test.

I agree with the MWU test being able to tell us whether the distributions for gambiae and coluzzii are likely to be the same, but it is only one value, not a plot (and I like plots).

I'm not sure why we are particularly interested in theta by window in this case. I think it's only interesting when we are examining the features of the genome.

True. Examining the features of the genome kinda was the angle I was trying to take. I don't think it is controversial that theta is higher for gambiae than for coluzzii, and thus so is Ne. What makes Ne_gambiae higher is thus the next question (higher census population size, more migration, no aestivation, ...). But I get that it is not the question we are trying to answer here (and I have no proof that looking at the features of the genome would lead to an answer).

I think to do otherwise is problematic- by taking windows and looking at their distribution we (falsely) imply independent observations.

Yes and no. I would say that the values for a given species/ varying windows are not independent, but that for a given window/varying species the values should be. I would expect the covariance to drop down with distance so the window method (probably) gives us a handful of independent-ish observations instead of just the one of averaging them.

@hardingnj
Copy link
Contributor

From meeting this morning- plan is then to:

  • compute theta on 4 way degenerate sites
  • bootstrap confidence intervals by resampling
  • plot gambiae/coluzzii populations as bar charts with error bars.
  • Perform Mann-Whitney U test of west african gambiae vsa coluzzii.

@jonbrenas
Copy link
Collaborator Author

@hardingnj , @leehart , @alimanfoo , @cclarkson : I was wondering, the population definition gives the list of sample names in each population but not the sample set they come from. I feel like, in particular when we don't want to run something on all populations at the same time, it would be productive to have the list of sample sets they come from (Nick said that each population is extracted from only one dataset right now but that might change in the future). Otherwise, I feel like one either has to look in the metadata where they can be found (which is recompute that piece of information every time) or aggregate all the datasets and metadata and find them (which feels like overkill). Any opinion?

@leehart
Copy link
Collaborator

leehart commented Nov 11, 2020

@hardingnj , @leehart , @alimanfoo , @cclarkson : I was wondering, the population definition gives the list of sample names in each population but not the sample set they come from. I feel like, in particular when we don't want to run something on all populations at the same time, it would be productive to have the list of sample sets they come from (Nick said that each population is extracted from only one dataset right now but that might change in the future). Otherwise, I feel like one either has to look in the metadata where they can be found (which is recompute that piece of information every time) or aggregate all the datasets and metadata and find them (which feels like overkill). Any opinion?

I think I agree, although it would complicate and bloat the YAML format (perhaps "population_id" > {sample_set_id: "sample_set_id", sample_id: "sample_id"} ). CSV might be an option. The location-colours YAML is currently "country" > "location" > "colour". Do we need to account for cases where the same sample_id appears in different sample sets, or is that impossible by design and strictly enforced? My first feeling is that sample ids belong to sample sets, so would normally be considered together as a pair of keys, but I also suspect that we might have relied on unique sample ids elsewhere - although that might be a mistake.

If populations might consist of samples from more than one sample set in the future, then it might be prudent to future-proof this file for defining populations, especially if future sample ids might clash. But I also hear the convenience of the present case.

On the chore of having to look up the sample set for any particular sample id, I'm not sure if that's a big cost, but I still feel doubtful that these sample ids are going to scale as unique identifiers across all sample sets, as we accumulate more sample sets, when it seems natural to carry the sample-set-id and sample-id pair around together. Personally I'd rather see sample-set sample-id pairs than a path to generating long unwieldy frankenstein sample ids that seek to be unique across all sets.

I did wonder whether these population groups might purposely detach/insulate samples from any kind of prior associations they had with sample sets, such as country of origin. For instance we already have a sample set with samples originating from more than one country. In a similar spirit, the colouring of locations had been purposely detached/insulated from prior association with geo-political groupings, such as country, and instead is based on lat-long. But since the population groupings are named according to country-5km-species-year, it feels like a clarification of which samples came from which sample set within population_definitions.yml might provide good provenance (and convenience), at the expense of a bloated file and potential redundancy.

…us sample sizes for every population. For the arabiensis populations, it has been computed using the gamb-colu filters as well as

the arabiensis filters. The last part is still experimental, it is trying to have error bars for different loci. All the data is in the csv files.
…be modified to do some kind of jackknifing or bootstrapping, eventually.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants