simulated data vs sequence data #186
@ramesh121412 it seems you don't yet know what you really want to do, so it is best to think about that first and then about the best method and tool for the job ;) Having said this, both 1) fully simulating sequence data and QTL effects and 2) taking true sequence data and simulating QTL effects are useful, but which one fits depends on what you want to do. As to 1), it likely will not match your diversity setting, so 2) seems the more prudent approach, BUT note that current simulations of QTLs and their effects don't really have any connection to reality, so while option 2) might seem more realistic, it is still quite far from reality.
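As a rough illustration of option 2), the sketch below takes a genotype matrix and layers simulated QTL effects on top of it. It is a minimal sketch under stated assumptions, not a recommended pipeline: the function name `simulate_qtl_effects`, the normally distributed effects, the uniform random choice of QTL positions, and the target heritability `h2` are all hypothetical choices made for the example, and the toy genotypes are random stand-ins for real sequence data.

```python
import random
import statistics


def simulate_qtl_effects(genotypes, n_qtl, h2, seed=None):
    """Layer simulated QTL effects on an existing genotype matrix.

    genotypes: list of individuals, each a list of allele dosages (0, 1 or 2).
    n_qtl:     number of markers to treat as QTL (assumption: chosen uniformly).
    h2:        target heritability in (0, 1] (assumption of this sketch).
    """
    rng = random.Random(seed)
    n_markers = len(genotypes[0])

    # Pick QTL positions at random; nothing ties them to real biology.
    qtl_idx = sorted(rng.sample(range(n_markers), n_qtl))
    # Draw additive effects from a standard normal (hypothetical choice).
    effects = [rng.gauss(0.0, 1.0) for _ in qtl_idx]

    # Genetic value = sum of dosage * effect over the chosen QTL.
    genetic_values = [
        sum(ind[j] * a for j, a in zip(qtl_idx, effects)) for ind in genotypes
    ]

    # Scale the residual variance so that var_g / (var_g + var_e) = h2.
    var_g = statistics.pvariance(genetic_values)
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    phenotypes = [g + rng.gauss(0.0, sd_e) for g in genetic_values]
    return qtl_idx, effects, genetic_values, phenotypes


# Toy stand-in for real data: 200 individuals x 50 markers of random dosages.
toy_rng = random.Random(1)
toy_genotypes = [[toy_rng.choice([0, 1, 2]) for _ in range(50)] for _ in range(200)]

qtl_idx, effects, gv, pheno = simulate_qtl_effects(
    toy_genotypes, n_qtl=10, h2=0.5, seed=42
)
print(f"{len(qtl_idx)} QTL placed, {len(pheno)} phenotypes simulated")
```

Because the QTL positions and effects here are drawn without reference to the trait's real architecture, the caveat in the comment above applies in full: the genotypes are real, but the genotype-to-phenotype map is not.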
Hello! I'd like to follow up on this question as I was wondering something similar. In the context of simulating an existing maize breeding program to test alternative breeding scenarios, there are mainly three approaches:
Initially, option 1 seemed most appealing to us because it avoids potential biases or artifacts from existing data and marker effect estimation. However, in discussions with others, we've found that option 3 is frequently used. Could you provide insights on when it might be more advantageous to use one approach over the others? Thanks!!
@minesrebollo each method has its own strengths and weaknesses, so which is best depends on your goals and the data you have available. I've used all three approaches at different times for different reasons.

The strength of method 1 is that it doesn't limit you to the data you have on hand. It is ideal for testing methods in a more generic breeding context. This approach also gives you a lot of flexibility to test assumptions in your simulation (e.g. the number of QTL controlling a trait or the Ne of your founders).

The weakness of method 1 is that your founder genotypes may poorly match your real genotypes. For example, the distributions of allele frequencies and LD could be off. This approach also doesn't tell you much about a specific population. For example, it isn't going to give you specific estimates of genetic gain or GWAS power in a population.

The strength of method 2 is that it can overcome the allele frequency and LD limitations of method 1. This approach is probably best if you want to do a power analysis for a QTL mapping study, because it is more tailored to the germplasm you are working with.

The weaknesses of method 2 come from the quality and quantity of your genotype data. A particular concern is the degree to which ascertainment bias is present in your data. This is expected to be a bigger concern with SNP chip data sets than with whole genome sequencing at a reasonably high depth. You also have to watch out for population structure that could be influencing your results. It is also worth noting that, since the QTL and their effects are placed randomly, the distribution of the phenotype may be very inconsistent with the LD in the population.

The strength of method 3 is that it lets you make very specific predictions in a population. This approach works well for predicting genetic gain over one to maybe a few crossing generations. It is an approach you'd use to help pick crosses in a crossing block.

The weakness of method 3 is that it is very prone to artifacts in your data. There's a lot of uncertainty in estimated marker effects, and that uncertainty grows considerably with each round of crossing, so this approach is going to break down over multiple rounds. It may not even work well for one round of crossing, depending on the accuracy of your model. This approach probably has the most limited scope of inference.

A neat trick is that you can actually use method 1 to explore methods 2 and 3. You can build a simulation with method 1 and then take the simulated data to create additional simulations using the approaches of the other methods. This is an example of how method 1 is good for testing methods, different simulation methods in this case.
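The trick of using method 1 to explore method 3 can be sketched in a few lines. This is a toy illustration under loud assumptions, not the approach used by any particular simulation package: founders are drawn with independent markers (so there is no LD, exactly the kind of mismatch noted for method 1), marker effects are estimated by single-marker regression as a simple stand-in for a genomic prediction model, and every name and parameter value (`simulate_population`, `estimate_marker_effects`, 100 markers, 20 QTL, `h2 = 0.5`) is hypothetical.

```python
import random


def simulate_population(n_ind, freqs, rng):
    """Method 1 stand-in: dosages from independent markers (assumes no LD)."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_ind)
    ]


def genetic_values(genotypes, effects):
    """Sum of dosage * effect across all markers for each individual."""
    return [sum(x * a for x, a in zip(ind, effects)) for ind in genotypes]


def estimate_marker_effects(genotypes, phenotypes):
    """Single-marker regression slopes; a crude stand-in for a real model."""
    n = len(genotypes)
    mean_y = sum(phenotypes) / n
    estimates = []
    for j in range(len(genotypes[0])):
        col = [ind[j] for ind in genotypes]
        mean_x = sum(col) / n
        sxx = sum((x - mean_x) ** 2 for x in col)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, phenotypes))
        estimates.append(sxy / sxx if sxx > 0 else 0.0)
    return estimates


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


rng = random.Random(7)
n_markers, n_qtl, h2 = 100, 20, 0.5  # hypothetical settings

# Known "truth": allele frequencies and a sparse set of QTL effects.
freqs = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
true_effects = [0.0] * n_markers
for j in rng.sample(range(n_markers), n_qtl):
    true_effects[j] = rng.gauss(0.0, 1.0)

# Training set: simulate phenotypes with residual noise scaled to h2.
train = simulate_population(300, freqs, rng)
g_train = genetic_values(train, true_effects)
mean_g = sum(g_train) / len(g_train)
var_g = sum((g - mean_g) ** 2 for g in g_train) / len(g_train)
sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
y_train = [g + rng.gauss(0.0, sd_e) for g in g_train]

# Method 3 step: estimate effects, then predict in a fresh sample.
est_effects = estimate_marker_effects(train, y_train)
test = simulate_population(300, freqs, rng)
accuracy = pearson(
    genetic_values(test, true_effects), genetic_values(test, est_effects)
)
print(f"correlation of predicted vs true genetic values: {accuracy:.2f}")
```

Since the true effects are known in the simulated data, the gap between predicted and true genetic values is directly measurable, which is not possible with real data. Extending the sketch to several rounds of crossing would be one way to watch the uncertainty in estimated effects compound.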
Hi, I am currently working on a project that involves analyzing sequence data comprising 200 diverse samples of maize for a particular trait, alongside corresponding phenotypic data. In the process of my analysis, I am deliberating whether to use the genetic data directly or to employ data simulation techniques. However, as I am relatively new to this area, I am seeking guidance on the most appropriate approach to take.
Could you kindly advise on whether it would be preferable to use the genetic data directly or to simulate the data? Could someone also guide me on what simulations I can do with this data?