Design #1
Two suggestions:
Agree. We should also add proper handling of variable widths etc. To keep things simple at the beginning, I will indeed factor out the one-hot encoder from concise. It will not be as fast as genomelake, which uses cython to compile the one-hot-encode function. Do you think one should install all the dependencies for all the dataloaders in the dataloaders package, or should these be installed on the fly?
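For reference, a minimal sketch of what such a pure-python/numpy one-hot encoder could look like; this is an illustration of the approach, not the actual concise or genomelake implementation:

```python
import numpy as np

# map A/C/G/T (upper or lower case) to 0..3; anything else (e.g. N) maps to
# a dummy index 4 and ends up as an all-zero row
_CHAR_TO_IDX = np.full(256, 4, dtype=np.int64)
for _i, _base in enumerate("ACGT"):
    _CHAR_TO_IDX[ord(_base)] = _i
    _CHAR_TO_IDX[ord(_base.lower())] = _i

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (len(seq), 4) float32 array."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    idx = _CHAR_TO_IDX[codes]
    out = np.zeros((len(seq), 5), dtype=np.float32)
    out[np.arange(len(seq)), idx] = 1.0
    return out[:, :4]  # drop the dummy column used for unknown bases
```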
@lzamparo what was your issue once again with genomelake - pysam?
Migrating from pysam to pyfaidx: kundajelab/genomelake#19
@Avsecz my issue was that genomelake - pysam could not handle a fasta file that contained a collection of intervals with the corresponding subsequences. It expected an entire chromosome from which each subsequence could be extracted.
Did you do profiling of cythonized versus pure python one-hot encoding? How much of a speedup did it provide?
I'm not sure. I think for users, it would be better to have dependencies installed on an as-needed basis. However, this might be more work for you, so I understand if you'd want to bundle all dependencies together when they install a kipoi/dataloaders package (via conda, I imagine?)
Yup, I did the benchmarks. For 1kb it's 300us vs 50us, and for 10kb it's 3ms vs 0.5ms, hence a 6x speedup. I really want to make these dataloaders as fast as possible so that they become the default for people training models as well.
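A rough sketch of how the pure-python side of such a timing could be reproduced, assuming the one_hot sketch above is in scope (the cython comparison would use genomelake's compiled encoder, which is not shown here):

```python
import random
import timeit

for length in (1000, 10000):
    # random DNA sequence of the given length
    seq = "".join(random.choice("ACGT") for _ in range(length))
    per_call = timeit.timeit(lambda: one_hot(seq), number=500) / 500
    print(f"{length} bp: {per_call * 1e6:.1f} us per call")
```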
I migrated genomelake to pyfaidx. I need to fix the unit tests and then it should work. Could you please send me a slimmed-down version of your fasta file (containing a single sequence of length ~100) and the expected result so that I can add it to the unit tests. Migrating to pyfaidx will also solve the installation issues - if conda channels were in the wrong order, then (old) pysam from anaconda was chosen.
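A sketch of the kind of unit test being requested here; the file name, record name, and sequence are illustrative placeholders rather than the actual test data, and one_hot refers to the sketch above:

```python
import numpy as np
import pyfaidx

def test_small_fasta_one_hot(tmp_path):
    # write a tiny single-record fasta file (placeholder data)
    fasta_path = tmp_path / "mini.fa"
    fasta_path.write_text(">chr_test\nACGTACGT\n")
    # read it back with pyfaidx and one-hot encode it
    seq = str(pyfaidx.Fasta(str(fasta_path))["chr_test"][:])
    assert seq == "ACGTACGT"
    encoded = one_hot(seq)  # one_hot as sketched above
    assert encoded.shape == (8, 4)
    assert np.array_equal(encoded[:4], np.eye(4, dtype=np.float32))
```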
Ok, I think that's a good idea, especially if the set of dataloaders and their dependencies starts growing. It's actually not more work; we need to annotate each dataloader anyway, as in
thanks for the discussion - I am currently implementing a first draft of the dataloaders repo and I think my suggestion agrees with many of your points. I have opted to:
Using these definitions, I get a neat and compact definition of dataloader classes capable of catering to one-hot encoded models with all sorts of different input shapes and axis orders, string-sequence models, etc. See: https://github.com/kipoi/dataloaders/blob/first_draft/kipoi_dataloaders/sequence_based.py. Please feel free to take a look at the branch https://github.com/kipoi/dataloaders/tree/first_draft and give feedback :)
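To illustrate the axis-order point, a sketch of how one-hot output can be adapted to whatever layout a model expects with a single transpose; this is not the actual kipoi_dataloaders code, and alphabet_axis is an illustrative parameter name:

```python
import numpy as np

def reorder_one_hot(batch: np.ndarray, alphabet_axis: int) -> np.ndarray:
    """Move the 4-letter alphabet axis of a (batch, seq_len, 4) array
    to whatever position the model expects."""
    return np.moveaxis(batch, 2, alphabet_axis)

# e.g. for a model expecting (batch, 4, seq_len) input:
x = np.zeros((8, 100, 4), dtype=np.float32)
print(reorder_one_hot(x, alphabet_axis=1).shape)  # (8, 4, 100)
```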
@lzamparo btw. I am more than happy to use a fasta reading package that has the functionality you requested. pyfaidx does not have that functionality - e.g. requesting the sequence for the genomic interval chr15:98503515-98503517 given the fasta example file in your gist in kipoi/kipoi#330. pyfaidx interprets the entire string after the ">" as a sequence/chromosome name. If you can point me to a fasta reader that can handle this the way you specified and that complies with the fasta standards, then I am more than happy to replace pyfaidx with it. This is what happens when I use pyfaidx:

import pyfaidx
ff = pyfaidx.Fasta("tests/data/subseq_sample.fasta")
ff.get_seq("chr15", 98503515, 98503517)

Traceback (most recent call last):
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 628, in from_file
    i = self.index[rname]
KeyError: 'chr15'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 631, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr15 does not exist! Please check your FASTA file.

on this fasta file:
@krrome I've not used There is one snag, which is that you need to convert from the absolute indexing of the interval to relative indexing of the bases, but that's pretty fast. Have a look at the
@lzamparo thanks for the link. I could add your way of querying in case the default get_seq throws a FetchError.
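A sketch of what that fallback could look like (not kipoi code); the record-name format chrom:start-end and its 1-based inclusive coordinates are assumptions based on the example fasta discussed above:

```python
import pyfaidx

def fetch_seq(fasta: pyfaidx.Fasta, chrom: str, start: int, end: int) -> str:
    """Return the sequence for a 1-based, inclusive interval."""
    try:
        # normal case: the fasta contains whole chromosomes
        return str(fasta.get_seq(chrom, start, end))
    except (pyfaidx.FetchError, KeyError):
        # fallback: records are named by interval, e.g. "chr15:98503515-98503517"
        exact = f"{chrom}:{start}-{end}"
        if exact in fasta.keys():
            return str(fasta[exact][:])
        # more general case: a record covering a larger interval on the same chromosome
        for name in fasta.keys():
            if not name.startswith(chrom + ":"):
                continue
            rec_start, rec_end = map(int, name.split(":")[1].split("-"))
            if rec_start <= start and end <= rec_end:
                rel_start = start - rec_start  # 0-based offset into the record
                return str(fasta[name][rel_start:rel_start + (end - start + 1)])
        raise
```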
True, I ensured there would be correspondence by construction. Though I think it can handle the other common use case, where you have one fasta file per chromosome and the intervals can be taken as a subset without any conversion to relative coordinates. Up to you to decide how to handle the case where an interval has no match in the fasta provided.
Great! Thanks @lzamparo. I'll have a closer look at your implementation tomorrow and add the provided files to the unit tests.

Update:
@Avsecz +1 for both transforms you suggest below; I can think of some handy use-cases for both. A random substring transform might also be useful: I can imagine a scenario where you are working with longer sequences, but the property you're trying to predict is valid over the length of the sequence, and you want to use this for data augmentation.
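A minimal sketch of the random-substring idea, assuming sequences are plain strings; the function name and signature are made up for illustration:

```python
import random

def random_subsequence(seq: str, length: int) -> str:
    """Randomly crop a contiguous substring of `length` bases (data augmentation)."""
    if len(seq) <= length:
        return seq
    start = random.randrange(len(seq) - length + 1)
    return seq[start:start + length]

# e.g. crop 1 kb training windows out of longer input sequences:
# batch = [random_subsequence(s, 1000) for s in sequences]
```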
@lzamparo what's the motivation for storing chromosome subsets in the fasta file instead of using intervals to query the sequence directly?
Which dataloaders to use:

API
- torch.data.Dataset API

Nice to have

Integration with Kipoi
- default_dataloader
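For illustration, a sketch of what the Dataset-style API referred to above amounts to for a sequence dataloader; class and argument names are illustrative, not the actual kipoi/dataloaders implementation:

```python
import pyfaidx

class IntervalSeqDataset:
    """Map-style dataset following the torch Dataset convention:
    __len__ plus index-based __getitem__."""

    def __init__(self, intervals, fasta_file):
        # intervals: list of (chrom, start, end) tuples, 1-based inclusive
        self.intervals = intervals
        self.fasta = pyfaidx.Fasta(fasta_file)

    def __len__(self):
        return len(self.intervals)

    def __getitem__(self, idx):
        chrom, start, end = self.intervals[idx]
        seq = str(self.fasta.get_seq(chrom, start, end))
        return {"inputs": seq, "metadata": {"ranges": (chrom, start, end)}}
```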