Formula modification for the estimation of recombination rate #162
@AprilYUZhang I want to share a few thoughts on the first equation given in the paper that might answer some of your questions. The equation is as follows: … Here, every term of the sum, including … The reason why …

Your algorithm for the recombination rate is correct, but it assumes that we have full knowledge of the genotypes, which is impossible in practice. It describes the data we have well, but it doesn't infer anything more than what we currently have. Depending on what our goal is, whether we want another statistic of the current data or a better recombination rate estimate to improve the algorithm, we may stop here or explore other ways to take into account the uncertainties that lie in the data.
@XingerTang I agree that the true value couldn't be estimated in practice. Following your explanation, although some issues still exist, whether we need to simplify it depends on whether it must be part of the iteration. At the same time, we should add this estimation to "Peeling.py" or "PeelingUpdates.py". I prefer to estimate it directly based on the current segregation probability; in other words, r should be updated in "PeelingUpdates.py", like the genotype error, etc. But that does not mean the calculation of recombination will not consider the information from adjacent loci and the pedigree: it is considered when we calculate the segregation probability. In theory, the more accurate the segregation probability, the more accurate the recombination rate. Or this new formula could look like this: … where k is one of pp, pm, mp, mm. So this formula will have a more complex expression if we continue. In total, I think the segregation probability is a "directly estimated" parameter in the iteration, while the recombination rate should be an "indirectly estimated" parameter that can be estimated from the segregation probability.
@AprilYUZhang … which there are several ways to calculate, but none needs further information on the adjacent loci. A similar calculation can also be applied to the other conditional probability used. From your implementation, it seems that the simple algorithm may overestimate the occurrence of the recombination events. The equation in the paper involves three probability terms to multiply, which intuitively would yield a much smaller recombination rate. So I think this algorithm is still worth implementing.
I am a bit confused here, as your equation for …
It sounds good. My formula is not exactly a new formula; it is just a new way of expressing it. I used the predefined formula in the paper to extend my original one. I just want to show more clearly that, for every individual and locus, when we calculate whether the segregation status is the same between two loci, it is not as simple as 0 if they are the same and 1 otherwise. For example, for your formula that you mentioned …
I agree that three terms will lead to a much smaller recombination rate; however, I don't think the recombination events are overestimated. As my figure shows, it can indicate that recombination events may happen. For example, generations 0 and 1 easily get a wrong peak, but that is due to too little information from their parents rather than the recombination calculation formula. Even if we improve our formula to perfection, this issue will still exist. But the recombination rate must be overestimated, because the probability repeats itself many times around the real recombination event.
```python
import numpy as np

# the method from Xinger
@calculate_time  # decorator defined elsewhere in this script (presumably for timing)
def recom_f3(seg_prob_array, nInd, nLociAll):
    r = 0.0008
    default_array = np.array([0.25, 0.25, 0.25, 0.25])
    # indicator matrix: number of recombinations needed to move between
    # segregation states ordered as (pp, pm, mp, mm)
    indicator = np.array([[0, 1, 1, 2],
                          [1, 0, 2, 1],
                          [1, 2, 0, 1],
                          [2, 1, 1, 0]])
    # expected number of recombination events between each pair of adjacent loci
    recom = np.empty((nInd, nLociAll - 1))
    for k in range(nLociAll - 1):
        for j in range(nInd):
            # avoid the default value distorting the recombination estimate
            if np.all(seg_prob_array[j, k] == default_array) and np.all(seg_prob_array[j, k + 1] == default_array):
                recom[j, k] = 0
            else:
                recom[j, k] = np.dot(np.dot(seg_prob_array[j, k], indicator), seg_prob_array[j, k + 1])

    # weight each interval by the recombination evidence of its neighbours
    recom_added = np.empty((nInd, nLociAll - 1))
    for j in range(recom.shape[0]):
        for k in range(recom.shape[1]):
            if k == 0:
                # seg_i_jplus2 = (r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])) / seg_prob_array[j, k + 2]
                # zeros in the denominator lead to inf/nan, so the division is temporarily removed
                seg_i_jplus2 = r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])
                seg_i_j_1 = 1
            elif k == recom.shape[1] - 1:
                # seg_i_j_1 = (r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])) / seg_prob_array[j, k - 1]
                seg_i_j_1 = r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])
                seg_i_jplus2 = 1
            else:
                # seg_i_j_1 = (r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])) / seg_prob_array[j, k - 1]
                # seg_i_jplus2 = (r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])) / seg_prob_array[j, k + 2]
                seg_i_j_1 = r * recom[j, k - 1] + (1 - r) * (1 - recom[j, k - 1])
                seg_i_jplus2 = r * recom[j, k + 1] + (1 - r) * (1 - recom[j, k + 1])
            recom_added[j, k] = np.nansum(seg_i_jplus2) * recom[j, k] * np.nansum(seg_i_j_1)
    return recom_added
```

I tried to implement your formula; is it correct? If not, let me know. Thank you.
The result does not look good; did I run something wrong?
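For reference, a minimal way to exercise recom_f3 on toy inputs (the shapes here are made up, and calculate_time must already be defined, e.g. as a no-op decorator, before recom_f3 is defined):

```python
import numpy as np

nInd, nLociAll = 3, 5
rng = np.random.default_rng(0)
# random segregation probabilities: each length-4 vector sums to 1
seg_prob_array = rng.dirichlet(np.ones(4), size=(nInd, nLociAll))
recom_added = recom_f3(seg_prob_array, nInd, nLociAll)
print(recom_added.shape)  # (3, 4): one value per individual per adjacent-locus pair
```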
One possible way to calculate … I understand your points, but if the probabilities around the locus where the real recombination event happened sum up to more than …
@AprilYUZhang The problems of the indicator matrix are: …
Both problems are related to the algorithm for calculating recombinations from segregation probabilities. Biologically, up to 2 recombination events can occur between two adjacent loci (one on the paternal and one on the maternal haplotype). So when we calculate the recombination rate, the total number of recombination events is divided by 2 * nIndividual * (nLoci - 1). (Your question reminds me that I made a mistake in the denominator: I used nIndividual * nLoci. I will update the table later.)
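Written out with that denominator (the notation here is mine, just to make the counting explicit):

$$\hat r = \frac{\sum_{i=1}^{n_\mathrm{ind}} \sum_{j=1}^{n_\mathrm{loci}-1} c_{i,j}}{2\, n_\mathrm{ind}\,(n_\mathrm{loci}-1)}$$

where $c_{i,j}$ is the (expected) number of recombination events between loci $j$ and $j+1$ in individual $i$, and the factor 2 accounts for the paternal and the maternal haplotype each being able to recombine in the same interval.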
If the probabilities are truly …
@AprilYUZhang I just noticed that your total recombination events are counted based on the …
@AprilYUZhang … where the matrix counts the number of recombination events occurring between the loci …

I managed to access the real recombination data at each locus for each individual, so I didn't average the recombination rate across individuals; this way we can see more precisely whether the algorithm correctly estimates the occurrence of recombination events at each locus. Same as what you've done, I used two datasets for the accuracy measures, the true segregation data and the estimated segregation data. The true segregation is obtained from the simulation data, while the estimated segregation data is obtained by the multi-locus peeling performed in the AlphaPeel accuracy test. In total there are 1999000 loci pairs, and only 1636 recombination events have occurred, with no two recombination events at the same locus. Thus, for each locus, either no recombination or a single recombination occurred.

Given that we have very few recombination events compared to the total number of loci, I measure accuracy with two metrics, the recall rate and the precision rate. As no two recombinations have occurred at the same locus, the estimated recombination rates generated by (1) are in the range of 0 to 1. So it is natural to set some number between 0 and 1 as the threshold for predicting whether a recombination event occurred at a particular locus. For example, if the threshold is 0.5, then we would predict that a locus with an estimated probability greater than or equal to 0.5 has a recombination event, and that the others have none. I plotted the precision-recall curve for threshold values ranging from 0 to 1.

For the true segregation data input, the result is given by the following. While the recall is always 1, which indicates that all the positives have been successfully predicted, the precision is not very satisfying, as the highest precision that could be achieved is only 0.004. This indicates that many loci with no recombination event were classified as having one, so often that 0.996 of the predicted recombinations are wrong. As you have discussed before, it is probably caused by the …

Now we have the desired result that a precision of 1 is also achieved for some thresholds. However, this is for the true segregation data; we may not perform as well when using the estimated segregation data. So I recalculated the precision-recall curve for the full estimated segregation data, which is the following. This indicates that the highest precision we could possibly achieve is 0.002 across the different threshold values, which means that 0.998 of the estimated recombination events are always wrong. I also did the same calculation for the estimated data excluding the first generation. In this case, we can reach a relatively higher precision rate (but still very low) compared to the previous case, but only when the recall rate is very low, which indicates that we cannot separate the true positives from the other cases very well with this algorithm.
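A minimal sketch of how such a threshold sweep could be computed (the variable names are mine: recom_est is an array of per-interval estimates in [0, 1], recom_true is the 0/1 truth from the simulation):

```python
import numpy as np

def precision_recall_sweep(recom_est, recom_true, thresholds=None):
    """Precision and recall of 'a recombination occurred here' calls over a threshold sweep."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    recom_est = np.asarray(recom_est)
    recom_true = np.asarray(recom_true).astype(bool)
    precision, recall = [], []
    for t in thresholds:
        pred = recom_est >= t                 # predict a recombination at this interval
        tp = np.sum(pred & recom_true)
        fp = np.sum(pred & ~recom_true)
        fn = np.sum(~pred & recom_true)
        precision.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
        recall.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return np.array(precision), np.array(recall), thresholds
```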
Given that the above method cannot find the recombination events without misclassifying loci that have no recombination event as having one, and misclassifying loci where a recombination event did occur as having none, I implemented several regressions on the true segregation data and the estimated segregation data to find a better matrix …, which is counterintuitive at first sight. But the …

For the full true segregation data, the precision-recall curve is given by the following.

For the full estimated segregation data, the precision-recall curve is given by the following. The precision-recall curve for the segregation data excluding the first generation is the same as this one. We can see that it is better than the previous case: for the same recall rate, we can now achieve a higher precision rate, which means we can classify the true recombination events better. So for the implementation, this choice of … However, even for this choice of …
@XingerTang, I think the reason why your first result has low precision is mainly the effect of the default value ([0.25, 0.25, 0.25, 0.25]). For this problem, my solution is to set the recombination probability to 0 if both x1 and x2 == [0.25, 0.25, 0.25, 0.25]. Your method is good, and I also managed to do this before. I think the way to get a high ROC is: 1. the segregation probability may still have room to become more accurate; 2. we can use the recombination probability curve to estimate the recombination events. Now, Gregor and I are trying to use the Viterbi algorithm for iterative peeling to improve accuracy, as in this issue.
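For what it's worth, here is a minimal sketch of what a Viterbi decode over the four segregation states could look like for a single individual; it is only an illustration of the idea, not the AlphaPeel implementation, and it reuses the indicator matrix and per-interval rate r from the code above:

```python
import numpy as np

def viterbi_segregation(seg_prob, r=0.0008):
    """Most likely segregation-state path for one individual (illustrative sketch).

    seg_prob: (nLoci, 4) per-locus segregation probabilities, used here as emissions.
    r: assumed per-interval recombination rate.
    Returns the decoded state indices and the implied recombination count per interval.
    """
    # number of recombinations needed to move between states ordered (pp, pm, mp, mm)
    indicator = np.array([[0, 1, 1, 2],
                          [1, 0, 2, 1],
                          [1, 2, 0, 1],
                          [2, 1, 1, 0]])
    # transition probability: r^(recombinations) * (1 - r)^(2 - recombinations)
    log_trans = indicator * np.log(r) + (2 - indicator) * np.log(1 - r)
    log_emit = np.log(np.clip(seg_prob, 1e-12, None))

    nLoci = seg_prob.shape[0]
    delta = np.zeros((nLoci, 4))          # best log-score ending in each state
    back = np.zeros((nLoci, 4), dtype=int)
    delta[0] = np.log(0.25) + log_emit[0]
    for k in range(1, nLoci):
        scores = delta[k - 1][:, None] + log_trans   # scores[from_state, to_state]
        back[k] = np.argmax(scores, axis=0)
        delta[k] = scores[back[k], np.arange(4)] + log_emit[k]
    path = np.empty(nLoci, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for k in range(nLoci - 2, -1, -1):    # backtrack along the stored pointers
        path[k] = back[k + 1, path[k + 1]]
    recombs = indicator[path[:-1], path[1:]]   # 0, 1 or 2 per adjacent-locus pair
    return path, recombs
```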
@AprilYUZhang …
Which data did you use, the true segregation or the one estimated by AlphaPeel? Sorry, I don't understand your meaning. Do you mean that in principle you agree, but in practice you disagree? Or do you totally disagree, and low precision is one of the reasons?
For the estimated segregation data generated by AlphaPeel without the first generation, which most …
In the paper, the original formula is as follows:
r: recombination rate, i: the i-th individual, j: the j-th locus, seg_i,j: the segregation status of the i-th individual at the j-th locus.
Two issues are ambiguous.
However, if we accept that it should be two loops, we cannot resolve the loss of p(seg_j,j+1). That's the controversy.
Further, is the conditional probability necessary? Can we simplify it? In my view, the recombination rate is similar to the genotype error rate. When the peeling process is done, these parameters are updated according to the output and then fed into the next cycle. So they can be calculated outside of the peeling itself. In other words, once the peeling process has been completed, the calculation of the recombination rate should focus on adjacent loci.
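As a sketch of that cycle (the callables below are placeholders for the actual peeling and update steps; they are not the real AlphaPeel functions):

```python
from typing import Callable, Tuple
import numpy as np

def iterate_parameters(
    run_peeling: Callable[[float, float], Tuple[np.ndarray, np.ndarray]],
    estimate_error: Callable[[np.ndarray], float],
    estimate_r: Callable[[np.ndarray], float],
    r: float = 0.0008,
    error: float = 0.01,
    n_cycles: int = 5,
) -> Tuple[float, float]:
    """Re-estimate the genotype error rate and r after each peeling cycle (placeholder API)."""
    for _ in range(n_cycles):
        seg_prob, geno_prob = run_peeling(r, error)   # one full peeling pass with current parameters
        error = estimate_error(geno_prob)             # existing genotype-error style update
        r = estimate_r(seg_prob)                      # proposed update from adjacent-locus segregation
    return r, error
```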
Another interesting interpretation is that this recombination rate is not the total recombination rate but rather the recombination rate per locus. The segregation statuses of two adjacent loci have 16 possible combinations; this formula sums over those 16 possibilities.
But it still loses p(seg_j,j+1). From this perspective, setting aside the problems of p(seg_j,j+1) and the conditional probability, I came up with an alternative:
Here the segregation status means one of "pp", "pm", "mp", "mm".
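One way to write the per-locus sum over those 16 combinations, matching the indicator-matrix product used in recom_f3 above (the notation is mine):

$$c_{i,j} = \sum_{s \in S}\sum_{s' \in S} p(\mathrm{seg}_{i,j} = s)\; p(\mathrm{seg}_{i,j+1} = s')\; d(s, s'), \qquad S = \{\mathrm{pp}, \mathrm{pm}, \mathrm{mp}, \mathrm{mm}\},$$

where $d(s, s')$ is the number of recombinations (0, 1 or 2) needed to move from state $s$ to state $s'$, i.e. the 4×4 indicator matrix.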