Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture uploaded allele correctly for VCF input #1744

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 5 additions & 3 deletions modules/Bio/EnsEMBL/VEP/InputBuffer.pm
Original file line number Diff line number Diff line change
Expand Up @@ -193,6 +193,7 @@ sub next {

# new chromosome
if($prev_chr && $vf->{chr} ne $prev_chr) {
$self->{minimal}=1 if $vf->{minimised};
$self->split_variants() if $self->{minimal};
return $buffer;
}
Expand Down Expand Up @@ -224,7 +225,7 @@ sub next {
# we can't push the VF back onto the parser, so add it to $pre_buffer
# and it will get picked up on the following next() call
push @$pre_buffer, $vf;

$self->{minimal}=1 if $vf->{minimised};
$self->split_variants() if $self->{minimal};
$prev_start = 0;
return $buffer;
Expand All @@ -245,6 +246,7 @@ sub next {
}
}
$prev_start = $vf->{start};
$self->{minimal}=1 if $vf->{minimised};
}
}
}
Expand All @@ -258,7 +260,6 @@ sub next {
}

$self->split_variants() if $self->{minimal};

return $buffer;
}

Expand Down Expand Up @@ -664,7 +665,8 @@ sub split_variants {
$first->{original_allele_string} = $original_vf->{allele_string};
$first->{original_start} = $original_vf->{start};
$first->{original_end} = $original_vf->{end};
$first->{minimised} = 1
$first->{minimised} = 1;
$first->{first} = 1;
}

push @tmp, $new_vf;
Expand Down
6 changes: 4 additions & 2 deletions modules/Bio/EnsEMBL/VEP/OutputFactory.pm
Original file line number Diff line number Diff line change
Expand Up @@ -1296,7 +1296,8 @@ sub VariationFeatureOverlapAllele_to_output_hash {

# reference allele
$hash->{REF_ALLELE} = $vf->ref_allele_string if $self->{show_ref_allele};


# Capture uploaded allele
$hash->{UPLOADED_ALLELE} = ($vf->{original_allele_string} || $vf->{allele_string} || $vf->{class_SO_term} || "" ) if $self->param('uploaded_allele');

# picked?
Expand Down Expand Up @@ -2193,7 +2194,8 @@ sub rejoin_variants_in_InputBuffer {
foreach my $vf(@{$buffer->buffer}) {

# reset original one
if(defined($vf->{original_allele_string})) {
# check $vf->{first} to get first $vf in split_variants
if(defined($vf->{first})) {
nakib103 marked this conversation as resolved.
Show resolved Hide resolved

# do consequence stuff
$self->get_all_output_hashes_by_VariationFeature($vf);
Expand Down
37 changes: 30 additions & 7 deletions modules/Bio/EnsEMBL/VEP/Parser.pm
Original file line number Diff line number Diff line change
Expand Up @@ -851,13 +851,28 @@ sub post_process_vfs {
$vf->seq_region_start($vf->{start});
$vf->seq_region_end($vf->{end});

# Checks if the allele string is insertion or/and deletion
# Checks if the allele string is non-minimised insertion or/and deletion
my $is_sv = ref($vf) eq 'Bio::EnsEMBL::Variation::StructuralVariationFeature';
if(!$is_sv && defined($vf->{allele_string}) && $vf->{allele_string} =~ /\//){
my $is_indel = 0;
my ($ref_allele_string,$alt_allele_string) = split(/\//, $vf->{allele_string});
$is_indel = 1 unless length($ref_allele_string) == length($alt_allele_string) or $vf->{allele_string} =~ /-/;
$vf = ${$self->minimise_alleles([$vf])}[0] if $is_indel;
my $is_non_minimised_indel = 0;
# For VCF input, the allele_string is trimmed to remove anchor base, so we capture original allele

my $original_allele_string = ($vf->{nontrimmed_allele_string} || $vf->{allele_string});
my @alleles = split(/\//, $original_allele_string);
my $ref_allele_string = shift @alleles;
my $alt_allele_count;

foreach my $alt(@alleles) {
if ($original_allele_string =~ /-/)
{
last;
}
elsif (length($ref_allele_string) != length($alt)) {
$is_non_minimised_indel = 1;
last;
}
}
$vf = ${$self->minimise_alleles([$vf])}[0] if $is_non_minimised_indel;
}
}
return $vfs;
Expand Down Expand Up @@ -950,7 +965,15 @@ sub minimise_alleles {

nakib103 marked this conversation as resolved.
Show resolved Hide resolved
# skip VFs with more than one alt
# they get taken care of later by split_variants/rejoin_variants
if(!$vf->{allele_string} || $vf->{allele_string} =~ /.+\/.+\/.+/ || $vf->{allele_string} !~ /.+\/.+/) {
if(!$vf->{allele_string} || $vf->{allele_string} !~ /.+\/.+/) {
push @return, $vf;
}

elsif($vf->{allele_string} =~ /.+\/.+\/.+/)
{
# Updating a flag to minimise multi-allelic variants in split_variants/rejoin_variants
$vf->{minimised} = 1;
nakib103 marked this conversation as resolved.
Show resolved Hide resolved
Comment on lines +974 to +975
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-allelic is not getting minimised for default format. For example - 1 961320 961324 GCAGG/GCA/GCAG +

But in the output still getting MINIMISED=1, (without the PR they are also not minimised but there is no MINIMISED=1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nakib103 , can you please test this example with the latest commit. The allele is expected to be similar to when running --minimal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @likhitha-surapaneni,

Yes, the allele are same when running with or without --minimal.
But the original problem remains. The output says it is minimised but when it is not -

1_961320_GCAGG/GCA/GCAG	1:961320-961324	-	ENSG00000188976	ENST00000327044	Transcript	upstream_gene_variant	-	-UPLOADED_ALLELE=GCAGG/GCA/GCAG;IMPACT=MODIFIER;DISTANCE=2067;STRAND=-1;MINIMISED=1

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nakib103 , the output is still minimised if you notice the 3rd column (correct me if I got it wrong). We have the allele "-" as the minimised representation for first alternative allele is GG/- and for the second one is G/-. The problem however is that there is no way to differentiate between the alternative alleles as both show "-". Ideally minimal representation should be GG/-/G. This is also an existing problem with the option --minimal and needs to be addressed probably in a future ticket. Please let me know if this makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually looking at the Uploaded_variation column, it does not seem to be minimised as it does for bi-allelic variants (see - 1 961320 961324 GCAGG/GCA +).

But the Allele column shows the alleles are minimised. It seems Uploaded_variation string has different logic, should we address this?

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree @nakib103. The uploaded variation was not expected to minimise alleles but it was minimising the alleles in some cases due to the way we populate the column and it was difficult to correct this as it is also the default option, hence the approach was taken to capture original allele with a different flag (--uploaded_allele). However, I agree it may also be worth investigating why uploaded variation is minimising the allele in some cases and not doing it in other cases.

$vf->{original_allele_string} = $vf->{nontrimmed_allele_string} || $vf->{allele_string};
push @return, $vf;
}

Expand All @@ -976,7 +999,7 @@ sub minimise_alleles {
$new_vf->{end} = $end;
$new_vf->{seq_region_start} = $start;
$new_vf->{seq_region_end} = $end;
$new_vf->{original_allele_string} = $vf->{allele_string};
$new_vf->{original_allele_string} = $vf->{nontrimmed_allele_string} || $vf->{allele_string};
$new_vf->{original_start} = $vf->{start};
$new_vf->{original_end} = $vf->{end};
$new_vf->{minimised} = 1;
Expand Down
1 change: 1 addition & 0 deletions modules/Bio/EnsEMBL/VEP/Parser/VCF.pm
Original file line number Diff line number Diff line change
Expand Up @@ -340,6 +340,7 @@ sub create_VariationFeatures {
start => $start,
end => $end,
allele_string => $non_variant ? $ref : $ref.'/'.join('/', @$alts),
nontrimmed_allele_string => $non_variant ? $original_alleles[0] : join("/",@original_alleles),
strand => 1,
map_weight => 1,
adaptor => $self->get_adaptor('variation', 'VariationFeature'),
Expand Down
21 changes: 6 additions & 15 deletions t/AnnotationSource_Cache_Variation.t
Original file line number Diff line number Diff line change
Expand Up @@ -414,11 +414,11 @@ is_deeply(
$ib = get_ib([qw(21 8987004 . TA C,TAGCG . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 1,
'a_allele' => 'TAGCG',
'a_index' => 0,
'a_allele' => 'GCG',
'b_allele' => 'GCG',
'b_index' => 0
}
Expand All @@ -444,24 +444,15 @@ is_deeply(
$ib = get_ib([qw(21 8987004 . TAT TAGCGT,TAGTGT . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 0,
'a_allele' => 'AGCGT',
'b_allele' => 'GCG',
'b_index' => 0
},
{
'a_index' => 1,
'a_allele' => 'AGTGT',
'a_allele' => 'GTG',
'b_allele' => 'GTG',
'b_index' => 1
}
],
'nastiness 4'
);

],'nastiness 4' );

# test old_maf setting
$p = Bio::EnsEMBL::VEP::Parser::VCF->new({config => $cfg, file => $test_cfg->{test_vcf}, valid_chromosomes => [21]});
Expand Down
39 changes: 14 additions & 25 deletions t/AnnotationSource_Cache_VariationTabix.t
Original file line number Diff line number Diff line change
Expand Up @@ -445,11 +445,11 @@ SKIP: {
$ib = get_ib([qw(21 8987004 . TA C,TAGCG . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 1,
'a_allele' => 'TAGCG',
'a_index' => 0,
'a_allele' => 'GCG',
'b_allele' => 'GCG',
'b_index' => 0
}
Expand All @@ -475,17 +475,11 @@ SKIP: {
$ib = get_ib([qw(21 8987004 . TAT TAGCGT,TAGTGT . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 0,
'a_allele' => 'AGCGT',
'b_allele' => 'GCG',
'b_index' => 0
},
{
'a_index' => 1,
'a_allele' => 'AGTGT',
'a_allele' => 'GTG',
'b_allele' => 'GTG',
'b_index' => 1
}
Expand All @@ -494,16 +488,17 @@ SKIP: {
);

# Test frequency match for 'matched alleles'
my $ib_freq = get_ib([qw(21 25005812 . CA CAAA,CAAA . . .)]);
my $ib_freq = get_ib([qw(21 25005811 . CA CAAA,CAAA . . .)]);
$c->{freq_pop} = '1KG_AMR';
$c->annotate_InputBuffer($ib_freq);
$vf = $ib_freq->buffer->[0];
$c->get_frequency_data($vf);

is_deeply(
$vf->{_freq_check_freqs},
{
'1KG_AMR' => {
'AAA' => '0.4107'
'AA' => '0.4107'
},
},
'get_frequency_data - matched alleles'
Expand Down Expand Up @@ -606,11 +601,11 @@ SKIP: {
$ib = get_ib([qw(21 8987004 . TA C,TAGCG . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 1,
'a_allele' => 'TAGCG',
'a_index' => 0,
'a_allele' => 'GCG',
'b_allele' => 'GCG',
'b_index' => 0
}
Expand All @@ -635,18 +630,12 @@ SKIP: {

$ib = get_ib([qw(21 8987004 . TAT TAGCGT,TAGTGT . . .)]);
$c->annotate_InputBuffer($ib);
is_deeply(
$ib->buffer->[0]->{existing}->[0]->{matched_alleles},
is_deeply(
$ib->buffer->[1]->{existing}->[0]->{matched_alleles},
[
{
'a_index' => 0,
'a_allele' => 'AGCGT',
'b_allele' => 'GCG',
'b_index' => 0
},
{
'a_index' => 1,
'a_allele' => 'AGTGT',
'a_allele' => 'GTG',
'b_allele' => 'GTG',
'b_index' => 1
}
Expand Down
20 changes: 17 additions & 3 deletions t/InputBuffer.t
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@ is_deeply($vfs->[0], bless( {
'variation_name' => 'rs142513484',
'map_weight' => 1,
'allele_string' => 'C/T',
'nontrimmed_allele_string' => 'C/T',
'end' => 25585733,
'start' => 25585733,
'seq_region_end' => 25585733,
Expand All @@ -90,6 +91,7 @@ is_deeply($vfs->[0], bless( {
'variation_name' => 'rs148490508',
'map_weight' => 1,
'allele_string' => 'A/G',
'nontrimmed_allele_string' => 'A/G',
'end' => 25592911,
'start' => 25592911,
'seq_region_end' => 25592911,
Expand Down Expand Up @@ -418,7 +420,9 @@ is_deeply($ib->buffer, [
'chr' => '1',
'minimised' => 1,
'original_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'nontrimmed_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'original_end' => 10,
'first' => 1,
'end' => 1,
'original_start' => 1,
'strand' => 1,
Expand All @@ -439,7 +443,10 @@ is_deeply($ib->buffer, [
'variation_name' => '.',
'alt_allele' => '-',
'map_weight' => 1,
'minimised' => 1,
'allele_string' => 'AGAAGAAAG/-',
'original_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'nontrimmed_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'start' => 2,
'seq_region_start' => 2,
'seq_region_end' => 10,
Expand Down Expand Up @@ -470,11 +477,18 @@ is_deeply(
'strand' => 1,
'variation_name' => '.',
'map_weight' => 1,
'allele_string' => 'CAG/TAG/T',
'end' => 3,
'minimised' => 1,
'allele_string' => 'C/T',
'original_allele_string' => 'CAG/TAG/T',
'nontrimmed_allele_string' => 'CAG/TAG/T',
'alt_allele' => 'T',
'end' => 1,
'start' => 1,
'seq_region_start' => 1,
'seq_region_end' => 3,
'seq_region_end' => 1,
'original_start' => 1,
'original_end' => 3,
'first' => 1
}, 'Bio::EnsEMBL::Variation::VariationFeature' ),
'minimal - doesnt affect non-minimisable'
);
Expand Down
2 changes: 1 addition & 1 deletion t/OutputFactory_JSON.t
Original file line number Diff line number Diff line change
Expand Up @@ -532,7 +532,7 @@ SKIP: {
'cdna_start' => 2347,
'transcript_id' => 'NM_000484.3',
'gene_id' => '351',
'uploaded_allele' => '-/A',
'uploaded_allele' => 'G/GA',
'cds_start' => 2147,
'protein_start' => 716,
'refseq_match' => [
Expand Down
Loading