Computation of genetic diversity stats and comparison between populations #43

jonbrenas · 2020-10-15T10:49:18Z

This notebook contains my code for comparing Watterson's theta at various sample sizes for different populations. So far, I have not handled the arabiensis samples (or the intermediate ones if we deem that the ones in the Far West and the ones in Kenya 12 are comparable and that we have enough sample sets with them). The pictures are also quite ugly.

That said, the main issue is that it creates a zarr file with the stats for the various populations and population sizes, which we don't want to have in the repo, so first runs might be long. A solution could be to create a bucket on the cloud where the data would be stored. It wasn't necessary as long as I was working on this on my own, but it might be worth it if it becomes more of a collaborative effort.

The same code (modulo the name of the stat when the scikit-allel function is called) can probably be reused for a few other stats, but a few tweaks are likely to be needed (for instance, arabiensis only uses 3L for the analysis, so I will have to add a parameter).
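To make the computation concrete, here is a minimal pure-Python sketch of Watterson's theta evaluated at several sample sizes. It is not the notebook's code (which calls scikit-allel); the function names, the toy haplotypes and the sample sizes are all assumptions made up for illustration.

```python
import random


def harmonic_number(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the normalising constant in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(haplotypes, n_sites_accessible):
    """Per-base Watterson's theta for a list of equal-length 0/1 haplotypes."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    # A site is segregating if both alleles are present among the haplotypes.
    n_segregating = sum(1 for site in zip(*haplotypes) if 0 < sum(site) < n)
    return n_segregating / harmonic_number(n) / n_sites_accessible


def theta_by_sample_size(haplotypes, sizes, n_sites_accessible, seed=0):
    """Downsample (without replacement) to each size and compute theta."""
    rng = random.Random(seed)
    return {
        size: watterson_theta(rng.sample(haplotypes, size), n_sites_accessible)
        for size in sizes
    }


# Toy data (hypothetical): 4 haplotypes, 5 accessible sites.
toy_haplotypes = [
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
]
thetas = theta_by_sample_size(toy_haplotypes, sizes=[2, 3, 4], n_sites_accessible=5)
```

Swapping in another statistic would only require replacing `watterson_theta`, and a chromosome arm parameter could be threaded through in the same way.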


review-notebook-app · 2020-10-15T10:49:22Z

Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.

jonbrenas · 2020-10-16T10:17:51Z

A short comment: the zarr files are used to store the various runs of the Watterson's theta computation. As Datalab kept dying on me, even when just plotting stuff, dataframes are also created that contain the data that we want to plot. The "Dataframe" section should be skipped on the first run (because the files that are read do not yet exist). The dataframes are created by the first population for each species. They will need to be reworked to store all the relevant populations and only those.
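One way to avoid having to skip a section by hand on the first run is a load-or-compute guard. The sketch below is only an illustration using the standard library's `csv` module rather than the notebook's dataframes; `load_or_compute`, its arguments and the file layout are hypothetical.

```python
import csv
from pathlib import Path


def load_or_compute(path, compute, fieldnames):
    """Return cached rows from `path` if it exists; otherwise compute and cache them."""
    path = Path(path)
    if not path.exists():
        # First run: the cache file does not exist yet, so build it.
        rows = compute()
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    # Always read back from disk so both code paths return the same
    # (string-valued) rows.
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

With a guard like this, the plotting cells can call `load_or_compute(...)` unconditionally, and the expensive computation only runs when the cached file is missing.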


leehart · 2020-10-21T07:55:27Z

I recall @hardingnj wants to use notebooks/ag3.py instead of intake.
Add cluster.adapt() to this nb, as per @alimanfoo 's recent Discourse.

hardingnj · 2020-10-21T11:56:38Z

I don't object to intake per se, it's just that I found it insufficient alone.

There is quite a bit of complexity in ag3, in terms of masking sites and dropping samples. My ultimate preference would be to use intake in ag3.py, but I really feel there is still a need for ag3.py because of the organisation of the data.

leehart · 2020-10-21T14:15:31Z

Cool, I'd be happy to convert ag3.py to use intake. Or at least I can make a start on it and submit a PR.

hardingnj · 2020-10-21T14:35:07Z

Sounds good to me!

hardingnj · 2020-10-26T13:02:37Z

Thanks Jon.

I think what I would like to do with this PR is change the focus from an investigation of sample size effects to a general purpose analysis that generates summary statistics for different populations.

That said, I think the trajectory of theta is informative, so it should probably remain part of the work.

I guess it mainly depends on how we want to show the data, let's chat about this this afternoon.

jonbrenas · 2020-10-28T08:29:57Z

I have made some attempts to show the difference between coluzzii and gambiae through the lens of Watterson's theta. It does not show in the notebook because I used the data from my TM study. I thus have 5 coluzzii populations and 6 gambiae populations from the same(ish) locations and same(ish) time. All populations have been downsampled to the size of the smallest one to remove the sample size effect. I will only show 3L; 3R looks similar.

The first idea was to plot the distribution of WT in each window for coluzzii and gambiae. It works and it shows very clearly that there are some regions with a noticeable difference but in general it is not easy to see much of a difference.

The second option was to show the difference between the mean values for each species in each window. It looks much clearer but it lacks information on variance.

Trying to show the difference between the minimum of gambiae and the maximum of coluzzii shows that there is a lot of overlap.

The last thing I tried was to normalize the first plot with the mean value for coluzzii (so that its plot is centered around one). I think it is the one that shows the difference the most clearly but it is also the most abstract measure of the bunch.
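The per-window summaries described above can be sketched as follows. This is an illustrative reconstruction, not the code used for the plots: `window_summaries`, the dictionary keys and the toy values are assumptions, and the input is taken to be one list of windowed theta values per population, all over the same windows.

```python
from statistics import mean


def window_summaries(gambiae, coluzzii):
    """Summarise windowed theta for two species.

    `gambiae` and `coluzzii` are lists with one entry per population; each
    entry is a list of theta values, one per window (same windows everywhere).
    """
    summaries = []
    # zip(*species) regroups the values so that we iterate window by window.
    for g_vals, c_vals in zip(zip(*gambiae), zip(*coluzzii)):
        c_mean = mean(c_vals)
        summaries.append({
            # Difference between the species means in this window.
            "mean_diff": mean(g_vals) - c_mean,
            # Positive only if every gambiae population exceeds every coluzzii one.
            "min_gamb_minus_max_colu": min(g_vals) - max(c_vals),
            # Values rescaled so that coluzzii is centred on one.
            "gamb_over_colu_mean": [g / c_mean for g in g_vals],
            "colu_over_colu_mean": [c / c_mean for c in c_vals],
        })
    return summaries


# Toy values (hypothetical): 2 populations per species, 2 windows.
toy_gambiae = [[0.030, 0.020], [0.034, 0.022]]
toy_coluzzii = [[0.020, 0.021], [0.024, 0.023]]
toy_summaries = window_summaries(toy_gambiae, toy_coluzzii)
```

In the toy data the first window separates the species cleanly, while the second shows the kind of overlap seen in the minimum/maximum plot.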


hardingnj · 2020-10-28T16:41:37Z

Thanks Jon. This is neat. This clearly shows that the variation due to region exceeds that of species.

I'm not sure why we are particularly interested in theta by window in this case. I think it's only interesting when we are examining the features of the genome.

The fundamental question about whether there is a difference in theta between gambiae and coluzzii could be addressed by simply taking the whole genome as our unit of theta (standardizing on minimum n), ranking each population, and using something like the Mann–Whitney U test.

I think to do otherwise is problematic: by taking windows and looking at their distribution we (falsely) imply independent observations.

An approach which may bring the best of both worlds is to compute U1 and U2 for each window in the genome, based on the gambiae/coluzzii populations, and to plot those values over chromosomes. With 6 gambiae and 5 coluzzii populations, U1 + U2 is always 6 × 5 = 30, so where the ranks are evenly distributed U1 and U2 will each be around 15, and where all gambiae populations are higher, U1/U2 will be 30/0. The key thing is that the populations are the unit of sampling here, not windows.
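As a rough sketch of the per-window statistic, assuming one list of windowed theta values per population: the helper names and toy values below are hypothetical, and the U statistics are computed by direct pair counting rather than through any particular library.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples.

    U_x counts the pairs (a in x, b in y) with a > b, with ties counting 0.5;
    U_x + U_y always equals len(x) * len(y).
    """
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    return u_x, len(x) * len(y) - u_x


def windowed_u(gambiae, coluzzii):
    """One (U_gambiae, U_coluzzii) pair per window; populations are the sampling unit."""
    return [mann_whitney_u(g, c) for g, c in zip(zip(*gambiae), zip(*coluzzii))]


# Toy values (hypothetical): 6 gambiae and 5 coluzzii populations, 2 windows.
toy_gambiae = [
    [0.031, 0.021], [0.032, 0.023], [0.033, 0.025],
    [0.034, 0.027], [0.035, 0.029], [0.036, 0.031],
]
toy_coluzzii = [
    [0.020, 0.022], [0.021, 0.024], [0.022, 0.026],
    [0.023, 0.028], [0.024, 0.030],
]
u_by_window = windowed_u(toy_gambiae, toy_coluzzii)
# First window: every gambiae population is higher, giving (30.0, 0.0).
# Second window: the ranks interleave, giving (15.0, 15.0).
```

Plotting the first element of each pair along the chromosome would give a track that sits near 15 where the species are indistinguishable and approaches 30 where gambiae is consistently higher.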

hardingnj · 2020-10-28T16:42:42Z

> An approach which may bring the best of both worlds is to compute U1 and U2 for each window in the genome, based on the gambiae/coluzzii populations, and to plot those values over chromosomes. With 6 gambiae and 5 coluzzii populations, U1 + U2 is always 6 × 5 = 30, so where the ranks are evenly distributed U1 and U2 will each be around 15, and where all gambiae populations are higher, U1/U2 will be 30/0. The key thing is that the populations are the unit of sampling here, not windows.

In fact, doing this should be straightforward given the data you have already computed?

jonbrenas · 2020-10-29T09:38:23Z

I think we will discuss that on Monday, so this will work more as notes. That said, my stats knowledge is a little hazy right now.

> The fundamental question about whether there is a difference in theta between gambiae and coluzzii could be addressed by simply taking the whole genome as our unit of theta (standardizing on minimum n), ranking each population, and using something like the Mann–Whitney U test.

I agree with the MWU test being able to tell us whether the distributions for gambiae and coluzzii are likely to be the same, but it is only one value, not a plot (and I like plots).

> I'm not sure why we are particularly interested in theta by window in this case. I think it's only interesting when we are examining the features of the genome.

True. Examining the features of the genome kinda was the angle I was trying to take. I don't think it is controversial that theta is higher for gambiae than for coluzzii, and thus so is Ne. What makes Ne_gambiae higher is thus the next question (higher census population size, more migration, no aestivation, ...). But I get that it is not the question we are trying to answer here (and I have no proof that looking at the features of the genome would lead to an answer).

> I think to do otherwise is problematic: by taking windows and looking at their distribution we (falsely) imply independent observations.

Yes and no. I would say that the values for a given species across varying windows are not independent, but that the values for a given window across varying species should be. I would expect the covariance to drop with distance, so the window method (probably) gives us a handful of independent-ish observations instead of just the one we get by averaging them.
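One way to check how quickly the dependence between windows falls off would be to look at the autocorrelation of the windowed values at increasing lags. The sketch below is only an illustration of that idea; `lag_autocorrelation` and the toy series are hypothetical and no result from the real data is implied.

```python
from statistics import mean


def lag_autocorrelation(values, lag):
    """Sample autocorrelation of per-window values at a given lag (in windows)."""
    if not 0 < lag < len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0:
        raise ValueError("values are constant, autocorrelation is undefined")
    num = sum(
        (values[i] - mu) * (values[i + lag] - mu)
        for i in range(len(values) - lag)
    )
    return num / denom


# Toy series (hypothetical): smoothly varying values along a chromosome arm.
toy_windows = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5]
r1 = lag_autocorrelation(toy_windows, 1)
r4 = lag_autocorrelation(toy_windows, 4)
```

The lag at which the autocorrelation becomes negligible would give a rough idea of how many effectively independent windows there are.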

hardingnj · 2020-11-02T11:29:52Z

From the meeting this morning, the plan is to:

- compute theta on 4-fold degenerate sites
- bootstrap confidence intervals by resampling
- plot gambiae/coluzzii populations as bar charts with error bars
- perform a Mann-Whitney U test of West African gambiae vs coluzzii
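The bootstrap step could look roughly like the sketch below. The plan does not say what is resampled, so resampling individual sites with replacement is an assumption here (resampling blocks of sites might be preferable if linkage matters); the function names, the percentile interval and the toy data are also hypothetical.

```python
import random


def harmonic_number(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the normalising constant in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def theta_from_sites(is_segregating, n_haplotypes):
    """Per-site Watterson's theta from one boolean per site (True = segregating)."""
    return sum(is_segregating) / harmonic_number(n_haplotypes) / len(is_segregating)


def bootstrap_ci(is_segregating, n_haplotypes, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling sites with replacement (assumption)."""
    rng = random.Random(seed)
    n_sites = len(is_segregating)
    estimates = sorted(
        theta_from_sites(rng.choices(is_segregating, k=n_sites), n_haplotypes)
        for _ in range(n_replicates)
    )
    lower = estimates[int((alpha / 2) * n_replicates)]
    upper = estimates[min(n_replicates - 1, int((1 - alpha / 2) * n_replicates))]
    return lower, upper


# Toy data (hypothetical): 100 sites, 30 of them segregating, 10 haplotypes.
toy_sites = [True] * 30 + [False] * 70
point_estimate = theta_from_sites(toy_sites, 10)
ci_low, ci_high = bootstrap_ci(toy_sites, 10, n_replicates=500)
```

The interval bounds can be passed directly as error bars on the per-population bar chart.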


jonbrenas · 2020-11-11T20:41:52Z

@hardingnj , @leehart , @alimanfoo , @cclarkson : I was wondering, the population definition gives the list of sample names in each population but not the sample set they come from. I feel like, in particular when we don't want to run something on all populations at the same time, it would be productive to have the list of sample sets they come from (Nick said that each population is extracted from only one dataset right now but that might change in the future). Otherwise, I feel like one either has to look in the metadata where they can be found (which means recomputing that piece of information every time) or aggregate all the datasets and metadata and find them (which feels like overkill). Any opinion?

leehart · 2020-11-11T22:58:09Z

> @hardingnj , @leehart , @alimanfoo , @cclarkson : I was wondering, the population definition gives the list of sample names in each population but not the sample set they come from. I feel like, in particular when we don't want to run something on all populations at the same time, it would be productive to have the list of sample sets they come from (Nick said that each population is extracted from only one dataset right now but that might change in the future). Otherwise, I feel like one either has to look in the metadata where they can be found (which means recomputing that piece of information every time) or aggregate all the datasets and metadata and find them (which feels like overkill). Any opinion?

I think I agree, although it would complicate and bloat the YAML format (perhaps "population_id" > {sample_set_id: "sample_set_id", sample_id: "sample_id"} ). CSV might be an option. The location-colours YAML is currently "country" > "location" > "colour". Do we need to account for cases where the same sample_id appears in different sample sets, or is that impossible by design and strictly enforced? My first feeling is that sample ids belong to sample sets, so would normally be considered together as a pair of keys, but I also suspect that we might have relied on unique sample ids elsewhere - although that might be a mistake.

If populations might consist of samples from more than one sample set in the future, then it might be prudent to future-proof this file for defining populations, especially if future sample ids might clash. But I also hear the convenience of the present case.

On the chore of having to look up the sample set for any particular sample id, I'm not sure if that's a big cost, but I still feel doubtful that these sample ids are going to scale as unique identifiers across all sample sets, as we accumulate more sample sets, when it seems natural to carry the sample-set-id and sample-id pair around together. Personally I'd rather see sample-set sample-id pairs than a path to generating long unwieldy frankenstein sample ids that seek to be unique across all sets.

I did wonder whether these population groups might purposely detach/insulate samples from any kind of prior associations they had with sample sets, such as country of origin. For instance we already have a sample set with samples originating from more than one country. In a similar spirit, the colouring of locations had been purposely detached/insulated from prior association with geo-political groupings, such as country, and instead is based on lat-long. But since the population groupings are named according to country-5km-species-year, it feels like a clarification of which samples came from which sample set within population_definitions.yml might provide good provenance (and convenience), at the expense of a bloated file and potential redundancy.
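To make the trade-off concrete, here is a small sketch of what carrying sample-set/sample-id pairs around might look like once the definitions are loaded. The structure, the identifiers and the helper names are all hypothetical and are not the current format of population_definitions.yml.

```python
from collections import defaultdict

# Hypothetical definitions: each sample is a (sample_set_id, sample_id) pair.
population_definitions = {
    "pop_a": [
        {"sample_set_id": "set_1", "sample_id": "S0001"},
        {"sample_set_id": "set_1", "sample_id": "S0002"},
    ],
    "pop_b": [
        {"sample_set_id": "set_1", "sample_id": "S0003"},
        {"sample_set_id": "set_2", "sample_id": "S0001"},  # same id, different set
    ],
}


def sample_sets_for(population_id, definitions):
    """Sorted list of the sample sets that must be loaded for one population."""
    return sorted({s["sample_set_id"] for s in definitions[population_id]})


def samples_by_set(population_id, definitions):
    """Group one population's sample ids by the sample set they come from."""
    grouped = defaultdict(list)
    for s in definitions[population_id]:
        grouped[s["sample_set_id"]].append(s["sample_id"])
    return dict(grouped)


def find_clashing_sample_ids(definitions):
    """Sample ids that appear in more than one sample set."""
    sets_by_id = defaultdict(set)
    for samples in definitions.values():
        for s in samples:
            sets_by_id[s["sample_id"]].add(s["sample_set_id"])
    return {sid: sorted(sets) for sid, sets in sets_by_id.items() if len(sets) > 1}
```

With pairs stored explicitly, running an analysis on a single population only needs `sample_sets_for(...)`, and `find_clashing_sample_ids(...)` gives a quick check on whether sample ids can safely be treated as globally unique.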

…us sample sizes for every population. For the arabiensis populations, it has been computed using the gamb-colu filters as well as the arabiensis filters. The last part is still experimental, it is trying to have error bars for different loci. All the data is in the csv files.


nicholasharding and others added 7 commits October 8, 2020 09:34

Initial port of LH work

1851dcd

WIP using PCA to identify colors

51bb899

Merge branch 'master' into q13-design-pca-figure

4a9f25f

Merge branch 'master' of https://github.com/malariagen/ag1000g-phase3…

7b25d97

…-data-paper into q13-design-pca-figure

WIP with colours + bokeh

8f5a71e

Updated with population definitions for WWA

af6a11d

a notebook for comparison of Watterson s theta at various samples siz…

8977c3c

…es. Arabiensis has not been done yet and the pictures can (easily) be improved.

leehart self-requested a review October 15, 2020 13:09

nicholasharding added 5 commits October 19, 2020 17:09

Add population definitions

0dcc223

Including plotting and rationale TODO: West Africa

Finalize gamb_colu populations

8ce548a

Merge branch 'master' of https://github.com/malariagen/ag1000g-phase3…

f27225d

…-data-paper into q13-design-pca-figure

Complete gamb_colu figure and edits to ag3.py

2950c4b

Remove old code, tidy, add svg

6f3e0a4

nicholasharding added 4 commits October 21, 2020 14:56

Change population > location for clarity

3a70f04

minor edits following AM discussion

96ccfc0

updates

cda1f78

complete merge re ag3.py

322bbf6

jonbrenas changed the title ~~Comparison of genetic diversity stats at various samples sizes~~ Computation of genetic diversity stats and comparison between populations Oct 28, 2020

nicholasharding added 2 commits October 28, 2020 14:36

update approach

e3f6462

- fully automated based on distance clustering - new naming scheme - includes arabiensis

removing notes committed in error

cc27099

WIP: PoC calc location colours w/ CIELAB colour space

7316eb2

nicholasharding and others added 11 commits November 2, 2020 12:31

Merge remote-tracking branch 'origin/LH_recalc_location_colours' into…

28c2f93

… q13-design-pca-figure

Moved the zarr files to the cloud and started to reorganize the noteb…

5320129

…ook to create the pictures that we are interested in

Update with LH new colours

6c39e4b

Checked that arabiensis were significantly different from gambiae as …

4fe91bf

…far as Wat theta goes.

a notebook for comparison of Watterson s theta at various samples siz…

d7ec11e

…es. Arabiensis has not been done yet and the pictures can (easily) be improved.

Moved the zarr files to the cloud and started to reorganize the noteb…

9243c8a

…ook to create the pictures that we are interested in

Checked that arabiensis were significantly different from gambiae as …

d52c618

…far as Wat theta goes.

a notebook for comparison of Watterson s theta at various samples siz…

5aa29e0

…es. Arabiensis has not been done yet and the pictures can (easily) be improved.

Moved the zarr files to the cloud and started to reorganize the noteb…

6e3e851

…ook to create the pictures that we are interested in

Checked that arabiensis were significantly different from gambiae as …

557c452

…far as Wat theta goes.

Rebase didn't go as smoothly as I hoped.

dfcdafc

jonbrenas added 2 commits February 2, 2021 16:32

Here are the notebooks computing Watterson's theta. They may need to …

db19845

…be modified to do some kind of jackknifing or bootstrapping, eventually.
