How to save an SgkitSampleData instance, e.g. for running the CLI #925

hyanwong · 2024-05-24T12:35:40Z

It appears as if it's not possible to save an SgkitSampleData instance to a path. I'm not sure, therefore, how I might run the CLI on an zarr file, if I've specified bespoke masks, ancestral alleles, or wherever via numpy arrays (see #923)

I guess the easy way around this is to have a function that saves all the information such as bespoke masks / ancestral alleles into the zarr file (or make a copy of it if the original zarr is read-only?), and allow the CLI to run directly on that modified zarr:

tsinfer infer demo.vcz -O demo.trees

A more complex possibility for the user is to have the CLI accept the same parameters as tsinfer.SgkitSampleData(...), but then we might want to allow either numpy files or names in the .vcz file, which seems a bit icky, e.g.

tsinfer infer demo.vcz --variant_mask my_vmask_file.npy --ancestral_allele variant_AA -O demo.trees

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2024-05-24T13:59:40Z

I guess another possibility would be to provide an input CSV file (or Zarr) which is formatted with position, ancestral_allele, variant_age etc, which then populates the arrays appropriately.

hyanwong · 2024-05-24T14:24:28Z

Good point. This could even be a .npz file with the appropriately named variables. I guess that's basically the same as another zarr file. Probably best if @benjeffery weighs in with what he thinks would work best.

hyanwong · 2024-05-24T15:21:54Z

Here's a snippet from the docs I am trying to write

vcf2zarr explode demo.vcf.gz /tmp/demo.exploded
vcf2zarr encode /tmp/demo.exploded demo.vcz
# Here: how can I specify the ancestral state via a simple CLI command???
tsinfer generate-ancestors demo.vcz  # saves to demo.ancestors
tsinfer match-ancestors demo.vcz  # saves to demo.ancestors.trees
tsinfer match-samples demo.vcz  # saves to demo.trees
To parallelise the vcf2zarr steps you may wish to explore the --worker-processes option, or even split over partitions. To parallelise the tsinfer steps you may wish to investigate the --progress and --num_threads options.

Thoughts about how to specify on the command-line to use variant_AA as ancestral state, given a demo VCF with ancestral alleles embedded in the AA field, would be most welcome.

jeromekelleher · 2024-05-24T19:05:02Z

A JSON or yaml config file specifying inference parameters?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to save an SgkitSampleData instance, e.g. for running the CLI #925

How to save an SgkitSampleData instance, e.g. for running the CLI #925

hyanwong commented May 24, 2024

jeromekelleher commented May 24, 2024

hyanwong commented May 24, 2024

hyanwong commented May 24, 2024 •

edited

Loading

jeromekelleher commented May 24, 2024

How to save an SgkitSampleData instance, e.g. for running the CLI #925

How to save an SgkitSampleData instance, e.g. for running the CLI #925

Comments

hyanwong commented May 24, 2024

jeromekelleher commented May 24, 2024

hyanwong commented May 24, 2024

hyanwong commented May 24, 2024 • edited Loading

jeromekelleher commented May 24, 2024

hyanwong commented May 24, 2024 •

edited

Loading