-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Computation of genetic diversity stats and comparison between populations #43
base: master
Are you sure you want to change the base?
Conversation
…-data-paper into q13-design-pca-figure
…es. Arabiensis has not been done yet and the pictures can (easily) be improved.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
A short comment: The zarr files are used to store the various runs of Watterson's theta computation. As Datalab kept dying on me, even we just plotting stuff, dataframes are also created that contain the data that we want to plot. The "Dataframe" section should be skipped on the first run (because the files that are read do not yet exist). The Dataframes are created by the first population for each species. The Dataframes will need to be reworked to store all the relevant populations and only those. |
Including plotting and rationale TODO: West Africa
…-data-paper into q13-design-pca-figure
|
I don't object to There is quite a bit of complexity in ag3, in terms of masking sites and dropping samples. My ultimate preference would be to use |
Cool, I'd be happy to convert |
Sounds good to me! |
Thanks Jon. I think what I would like to do with this PR is change the focus from an investigation of sample size effects to a general purpose analysis that generates summary statistics for different populations. That said- I think the trajectory of theta is informative, so probably should remain part of the work. I guess it mainly depends on how we want to show the data, let's chat about this this afternoon. |
I have made some attempts to show the difference between coluzzii and gambiae through the Watterson's theta lense. It does not show on the notebook because I used the data from my TM study. I thus have 5 coluzzii populations and 6 gambiae populations from the same(ish) locations and same(ish) time. All populations have been downsampled to the smallest ones to remove the sample size effect. I will only show 3L, 3R looks similar. The first idea was to plot the distribution of WT in each window for coluzzii and gambiae. It works and it shows very clearly that there are some regions with a noticeable difference but in general it is not easy to see much of a difference. The second option to show the difference between the mean value for each species in each window. It looks much clearer but it lacks information on variance. Trying to show the difference between the minimum of gambiae and the maximum of coluzzii shows that there is a lot of overlap. The last thing I tried was to normalize the first plot with the mean value for coluzzii (so that its plot is centered around one). I think it is the one that shows the difference the most clearly but it is also the most abstract measure of the bunch. |
- fully automated based on distance clustering - new naming scheme - includes arabiensis
Thanks Jon. This is neat. This clearly shows that the variation due to region exceeds that of species. I'm not sure why we are particularly interested in theta by window in this case. I think it's only interesting when we are examining the features of the genome. The fundamental question about whether there is a difference in theta between gambiae and coluzzii could be addressed by simply taking the whole genome as our unit of theta (standardizing on minimum n), ranking each population, and using something like the Mann–Whitney U test. I think to do otherwise is problematic- by taking windows and looking at their distribution we (falsely) imply independent observations. An approach which may bring the best of both worlds, is to compute U1 + U2 for each window in the genome, based on gambiae/coluzzii populations and to plot that value over chromosomes. Where the ranks are evenly distributed U1/U2 will be around 15, where all gambiae populations are higher, U1/U2 will be 30/0. The key thing is that the populations are the unit of sampling here, not windows. |
In fact to do this should be straightforward given the data you have computed already? |
I think we will discuss that on Monday, so this will work more as notes. That said my stats knowledge is a little hazy, right now.
I agree with the MWU test being able to tell us whether the distributions for gambiae and coluzzii are likely to be the same, but it is only one value, not a plot (and I like plots).
True. Examining the features of the genome kinda was the angle I was trying to take. I don't think it is controversial that theta is higher for gambiae than for coluzzii, and thus so is Ne. What makes Ne_gambiae higher is thus the next question (higher census population size, more migration, no aestivation, ...). But I get that it is not the question we are trying to answer here (and I have no proof that looking at the features of the genome would lead to an answer).
Yes and no. I would say that the values for a given species/ varying windows are not independent, but that for a given window/varying species the values should be. I would expect the covariance to drop down with distance so the window method (probably) gives us a handful of independent-ish observations instead of just the one of averaging them. |
From meeting this morning- plan is then to:
|
… q13-design-pca-figure
…ook to create the pictures that we are interested in
…far as Wat theta goes.
…es. Arabiensis has not been done yet and the pictures can (easily) be improved.
…ook to create the pictures that we are interested in
…far as Wat theta goes.
…es. Arabiensis has not been done yet and the pictures can (easily) be improved.
…ook to create the pictures that we are interested in
…far as Wat theta goes.
@hardingnj , @leehart , @alimanfoo , @cclarkson : I was wondering, the population definition gives the list of sample names in each population but not the sample set they come from. I feel like, in particular when we don't want to run something on all populations at the same time, it would be productive to have the list of sample sets they come from (Nick said that each population is extracted from only one dataset right now but that might change in the future). Otherwise, I feel like one either has to look in the metadata where they can be found (which is recompute that piece of information every time) or aggregate all the datasets and metadata and find them (which feels like overkill). Any opinion? |
I think I agree, although it would complicate and bloat the YAML format (perhaps "population_id" > {sample_set_id: "sample_set_id", sample_id: "sample_id"} ). CSV might be an option. The location-colours YAML is currently "country" > "location" > "colour". Do we need to account for cases where the same sample_id appears in different sample sets, or is that impossible by design and strictly enforced? My first feeling is that sample ids belong to sample sets, so would normally be considered together as a pair of keys, but I also suspect that we might have relied on unique sample ids elsewhere - although that might be a mistake. If populations might consist of samples from more than one sample set in the future, then it might be prudent to future-proof this file for defining populations, especially if future sample ids might clash. But I also hear the convenience of the present case. On the chore of having to look up the sample set for any particular sample id, I'm not sure if that's a big cost, but I still feel doubtful that these sample ids are going to scale as unique identifiers across all sample sets, as we accumulate more sample sets, when it seems natural to carry the sample-set-id and sample-id pair around together. Personally I'd rather see sample-set sample-id pairs than a path to generating long unwieldy frankenstein sample ids that seek to be unique across all sets. I did wonder whether these population groups might purposely detach/insulate samples from any kind of prior associations they had with sample sets, such as country of origin. For instance we already have a sample set with samples originating from more than one country. In a similar spirit, the colouring of locations had been purposely detached/insulated from prior association with geo-political groupings, such as country, and instead is based on lat-long. But since the population groupings are named according to country-5km-species-year, it feels like a clarification of which samples came from which sample set within |
…us sample sizes for every population. For the arabiensis populations, it has been computed using the gamb-colu filters as well as the arabiensis filters. The last part is still experimental, it is trying to have error bars for different loci. All the data is in the csv files.
…be modified to do some kind of jackknifing or bootstrapping, eventually.
This notebook contains my code for comparing Watterson's theta at various sample sizes for different populations. So far, I have not handled the arabiensis samples (or the intermediate ones if we deem that the ones in the Far West and the ones in Kenya 12 are comparable and that we have enough sample sets with them). The pictures are also quite ugly.
That said, the main issue is that it creates a zarr file with the stats for the various populations and populations sizes that we don't want to have in the repo. So first runs might be long. A solution could be to create a bucket on the cloud where the data would be stored. It wasn't necessary as long as I was working on my lonesome on this but it might be worth it if it becomes more of a collaborative effort.
The same code (modulo the name of the stats when the scikit allel function is called) can probably be reused for a few stats but a few tweeks are likely to be needed (for instance, arabiensis only use 3L for the analysis so I will have to add a parameter).