Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to save an SgkitSampleData instance, e.g. for running the CLI #925

Open
hyanwong opened this issue May 24, 2024 · 4 comments
Open

How to save an SgkitSampleData instance, e.g. for running the CLI #925

hyanwong opened this issue May 24, 2024 · 4 comments

Comments

@hyanwong
Copy link
Member

It appears as if it's not possible to save an SgkitSampleData instance to a path. I'm not sure, therefore, how I might run the CLI on an zarr file, if I've specified bespoke masks, ancestral alleles, or wherever via numpy arrays (see #923)

I guess the easy way around this is to have a function that saves all the information such as bespoke masks / ancestral alleles into the zarr file (or make a copy of it if the original zarr is read-only?), and allow the CLI to run directly on that modified zarr:

tsinfer infer demo.vcz -O demo.trees

A more complex possibility for the user is to have the CLI accept the same parameters as tsinfer.SgkitSampleData(...), but then we might want to allow either numpy files or names in the .vcz file, which seems a bit icky, e.g.

tsinfer infer demo.vcz --variant_mask my_vmask_file.npy --ancestral_allele variant_AA -O demo.trees
@jeromekelleher
Copy link
Member

I guess another possibility would be to provide an input CSV file (or Zarr) which is formatted with position, ancestral_allele, variant_age etc, which then populates the arrays appropriately.

@hyanwong
Copy link
Member Author

Good point. This could even be a .npz file with the appropriately named variables. I guess that's basically the same as another zarr file. Probably best if @benjeffery weighs in with what he thinks would work best.

@hyanwong
Copy link
Member Author

hyanwong commented May 24, 2024

Here's a snippet from the docs I am trying to write

vcf2zarr explode demo.vcf.gz /tmp/demo.exploded
vcf2zarr encode /tmp/demo.exploded demo.vcz
# Here: how can I specify the ancestral state via a simple CLI command???
tsinfer generate-ancestors demo.vcz  # saves to demo.ancestors
tsinfer match-ancestors demo.vcz  # saves to demo.ancestors.trees
tsinfer match-samples demo.vcz  # saves to demo.trees

To parallelise the vcf2zarr steps you may wish to explore the --worker-processes option, or even split over partitions. To parallelise the tsinfer steps you may wish to investigate the --progress and --num_threads options.

Thoughts about how to specify on the command-line to use variant_AA as ancestral state, given a demo VCF with ancestral alleles embedded in the AA field, would be most welcome.

@jeromekelleher
Copy link
Member

A JSON or yaml config file specifying inference parameters?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants