fix quickstart and docs
willdumm committed Jul 24, 2024
1 parent 9f858f7 commit 7da2859
Showing 4 changed files with 74 additions and 39 deletions.
2 changes: 0 additions & 2 deletions docs/index.rst
@@ -34,8 +34,6 @@ gctree documentation
:maxdepth: 1
:caption: Notes

CHANGELOG
faq


Indices and tables
78 changes: 56 additions & 22 deletions docs/quickstart.rst
@@ -67,8 +67,8 @@ You now have two new files, ``outfile`` and ``outtree``.
Note: if you want to rerun the above ``dnapars`` command, you must delete these two files first!


gctree
======
Gctree Ranking
==============

We're now ready to run ``gctree infer`` to use abundance data (in ``abundances.csv``) to rank the equally parsimonious trees (in ``outfile``).
We can use the optional argument ``--frame`` to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees.
@@ -106,48 +106,82 @@ This file may be manipulated using ``gctree infer``, instead of providing
a dnapars ``outfile``.

.. note::
Although described below, using context likelihood, mutability parsimony, or isotype parsimony
as ranking criteria is experimental, and has not yet been shown in a careful
validation to improve tree inference. Only the default branching process
likelihood is recommended for tree ranking!
Although described below, using context likelihood log loss, mutability parsimony, or isotype parsimony
as ranking criteria is experimental, and has not yet been shown in a careful
validation to improve tree inference. Only the default branching process
likelihood is recommended for tree ranking!

Criteria other than branching process likelihoods can be used to break ties
between trees. Providing arguments ``--isotype_mapfile`` and
``--idmapfile`` will allow trees to be ranked by isotype parsimony. Providing
arguments ``--mutability`` and ``--substitution`` allows trees to be ranked
according to a context-sensitive mutation model. By default, trees are ranked
lexicographically, first maximizing likelihood, then minimizing isotype
parsimony, and finally maximizing a context-based poisson likelihood, if such information is provided.
Ranking priorities can be adjusted using the argument ``--ranking_coeffs``.

For example, to find the optimal tree
according to a linear combination of likelihood, isotype parsimony,
mutabilities, and alleles:

.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
according to a context-sensitive Poisson model. By default, trees are ranked
lexicographically, first minimizing branching process log loss (negative
log-likelihood), then minimizing isotype
parsimony, minimizing context-based Poisson log loss, and finally minimizing number of alleles,
if required information for all criteria is provided. Criteria for which
required information is not provided will be skipped.

Ranking priorities can be adjusted using the argument ``--ranking_strategy``.
This argument accepts a string describing either an alternative lexicographic
ordering, or a linear combination of criteria to be minimized. The following
identifiers are used to specify available ranking criteria:

* ``B`` - branching process log loss
* ``I`` - isotype parsimony
* ``C`` - context-based Poisson log loss
* ``M`` - old mutability parsimony
* ``A`` - number of alleles
* ``R`` - sitewise reversions to naive sequence

An alternative lexicographic ordering can be specified using a comma-separated
list of identifiers. For example, by passing ``--ranking_strategy "C, B, R"`` to
``gctree infer``, we will minimize context-based Poisson log loss, then
branching process log loss, and finally reversions to the naive sequence. If for some reason
the user wishes to maximize instead of minimizing a criterion, a negative
coefficient can be provided, as in ``"C, B, -R"``. To compute the value of
a criterion without using it for ranking, a coefficient of zero can be
prepended to its identifier.
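
The two kinds of ranking expression can be illustrated with a small, self-contained
Python sketch. This is not gctree's parser or ranking code: the functions
``parse_term``, ``parse_strategy`` and ``best_tree`` and the per-tree criterion
values are hypothetical, chosen only to show how a lexicographic ordering and
a linear combination can select different trees from the same forest.

```python
import re

# Hypothetical criterion values for three trees (not real gctree output).
TREES = {
    "tree1": {"B": 10.0, "I": 3, "C": 5.0, "A": 4, "R": 0},
    "tree2": {"B": 10.0, "I": 2, "C": 7.0, "A": 5, "R": 1},
    "tree3": {"B": 9.5, "I": 4, "C": 9.0, "A": 4, "R": 2},
}


def parse_term(text):
    """Parse one term such as 'B', '-R' or '0.1C' into (coefficient, identifier)."""
    match = re.fullmatch(r"([+-]?)(\d*\.?\d*)([BICMAR])", text)
    if match is None:
        raise ValueError(f"invalid term: {text!r}")
    sign, number, identifier = match.groups()
    coefficient = float(number) if number else 1.0
    return (-coefficient if sign == "-" else coefficient, identifier)


def parse_strategy(expression):
    """Return ('lexicographic' or 'linear', list of (coefficient, identifier))."""
    expression = expression.replace(" ", "")
    if "," in expression:
        # Comma-separated terms: lexicographic. 'B+C,A' fails in parse_term.
        return "lexicographic", [parse_term(p) for p in expression.split(",")]
    parts = re.findall(r"[+-]?[^+-]+", expression)
    if "".join(parts) != expression:
        raise ValueError(f"invalid expression: {expression!r}")
    terms = [parse_term(p) for p in parts]
    # A single identifier is treated as a lexicographic ordering.
    return ("lexicographic" if len(terms) == 1 else "linear"), terms


def best_tree(trees, expression):
    """Name of the tree that minimizes the given ranking expression."""
    kind, terms = parse_strategy(expression)
    if kind == "lexicographic":
        # Compare criteria in order; zero-coefficient criteria do not rank.
        def key(name):
            return tuple(c * trees[name][i] for c, i in terms if c != 0)
    else:
        # Minimize the weighted sum of criteria.
        def key(name):
            return sum(c * trees[name][i] for c, i in terms)
    return min(trees, key=key)


print(best_tree(TREES, "B, I, C, A"))
print(best_tree(TREES, "C, B, R"))
print(best_tree(TREES, "B + I + 0.1C + 0.01A"))
```

With these made-up values the three expressions pick three different trees,
which is why it is worth checking the strategy reported by
``gctree infer --verbose`` against what was intended.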

Alternatively, the ``--ranking_strategy`` option can be used to rank trees to
minimize a linear combination of criteria.
For example, to find the optimal tree according to a linear combination of
branching process log loss, isotype parsimony, context-based Poisson log loss,
and alleles:

.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p \
--frame 1 \
--idmap idmap.txt \
--isotype_mapfile ../example/isotypemap.txt \
--mutability ../HS5F_Mutability.csv \
--substitution ../HS5F_Substitution.csv \
--ranking_strategy "B + I + 0.1C + 0.01A" \
--outbase newranking --summarize_forest \
--tree_stats \
--verbose
:shell:

The files ``HS5F_Mutability.csv`` and ``HS5F_Substitution.csv`` are a context
sensitive mutation model which can be downloaded from the `Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/`>_.
sensitive mutation model which can be downloaded from the `Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/>`_, and are required here to compute context-based Poisson log loss.

By default, only the files listed above will be generated, with the optional argument ``--outbase`` specifying how the output files should be named.

.. image:: newranking.inference.1.svg
:width: 1000

For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees,use the argument ``--tree_stats``.
For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees, use the argument ``--tree_stats``.

.. image:: newranking.tree_stats.pairplot.png
.. image:: newranking.tree_stats.pairplot.svg
:width: 1000

Sometimes ranked trees are too numerous, and generating the output of ``--tree_stats`` would require too many resources. For a summary of the collection of trees used for ranking, the argument ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_coeffs``.
Sometimes ranked trees are too numerous to generate the output of ``--tree_stats`` in a reasonable amount of time. For a summary of the collection of trees used for ranking, the option ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_strategy``.

.. command-output:: cat newranking.forest_summary.log
:shell:


isotype
=======
Isotypes
========

If we would like to add observed isotype data to trees output by gctree
inference, we can now do so.
4 changes: 3 additions & 1 deletion gctree/branching_processes.py
@@ -1173,6 +1173,7 @@ def filter_trees( # noqa: C901
outbase: str = "gctree.out",
summarize_forest: bool = False,
tree_stats: bool = False,
img_type: str = "svg",
) -> CollapsedForest:
"""Filter trees according to specified criteria.
@@ -1196,6 +1197,7 @@ def filter_trees( # noqa: C901
outbase: file name stem for a file with information for each tree in the DAG.
summarize_forest: whether to write a summary of the forest to file `[outbase].forest_summary.log`
tree_stats: whether to write stats for each tree in the forest to file `[outbase].tree_stats.log`
img_type: format for output plots.
Returns:
The trimmed forest, containing all optimal trees according to the specified criteria, and a tuple
@@ -1613,7 +1615,7 @@ def minfunckey(tup):
hue="set",
diag_kind="hist",
)
pplot.savefig(outbase + ".tree_stats.pairplot.pdf")
pplot.savefig(outbase + f".tree_stats.pairplot.{img_type}")

return (ctrees, trimmed_forest, weighttuples)

29 changes: 15 additions & 14 deletions gctree/cli.py
@@ -222,6 +222,7 @@ def isotype_add(forest):
mutability_file=args.mutability,
substitution_file=args.substitution,
chain_split=args.chain_split,
img_type=args.img_type,
)

if args.colormapfile is not None:
@@ -609,9 +610,9 @@ def get_parser():
type=float,
default=None,
help=(
"This argument is deprecated and will throw an error. Use `--ranking_strategy` instead. "
"This argument is deprecated and will throw an error. Use ``--ranking_strategy`` instead. "
"Coefficient used for branching process likelihood, when ranking trees by a linear "
"combination of traits. This value will be ignored if `--ranking_coeffs` argument is not "
"combination of traits. This value will be ignored if ``--ranking_coeffs`` argument is not "
"also provided."
),
)
@@ -621,7 +622,7 @@ def get_parser():
nargs=3,
default=None,
help=(
"This argument is deprecated and will throw an error. Use `--ranking_strategy` instead. "
"This argument is deprecated and will throw an error. Use ``--ranking_strategy`` instead. "
"List of coefficients for ranking trees by a linear combination of traits. "
"Coefficients are in order: isotype parsimony, mutation model parsimony, number of alleles. "
"A coefficient of -1 will be applied to branching process likelihood. "
@@ -635,13 +636,13 @@ def get_parser():
default=None,
help=(
"Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. "
"Two types of expressions are permitted: First are those describing lexicographic orderings, like `B,C,A`, which means "
"Two types of expressions are permitted: First are those describing lexicographic orderings, like ``B,C,A``, which means "
"choose trees to minimize branching process log loss, then minimize context log loss, then minimize number "
"of alleles. Next are expressions describing linear combinations of criteria, like `B+2C-1.1A`, which means choose "
"of alleles. Next are expressions describing linear combinations of criteria, like ``B+2C-1.1A``, which means choose "
"trees to minimize the specified linear combination of criteria. "
"If linear combination expression has leading `-`, use `=` instead of space to separate argument, "
"If linear combination expression has leading ``-``, use ``=`` instead of space to separate argument, "
"e.g. ``--ranking_strategy=-B+R``. "
"These two methods of ranking cannot be combined. For example, `B+C,A` is not a valid ranking strategy expression. "
"These two methods of ranking cannot be combined. For example, ``B+C,A`` is not a valid ranking strategy expression. "
"Ranking criteria are specified using the following identifiers. All are by default minimized:\n"
"B - branching process log loss,\n"
"I - isotype parsimony,\n"
@@ -650,23 +651,23 @@ def get_parser():
"A - number of alleles,\n"
"R - sitewise reversions to naive sequence.\n"
"To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion "
"with a coefficient of zero, as in `B+2C+0A`, or `B,C,0A`.\n"
"with a coefficient of zero, as in ``B+2C+0A``, or ``B,C,0A``.\n"
"To maximize instead of minimizing a criterion in lexicographic ranking, provide a negative coefficient. "
"For example, `B,-A` will first minimiz branching process log loss, then maximize the number of alleles. \n"
"For example, ``B,-A`` will first minimize branching process log loss, then maximize the number of alleles. \n"
"A ranking strategy string containing a single ranking criterion identifier will be interpreted as a lexicographic ordering. "
"`gctree infer --verbose` will describe the ranking strategy used. Examine this output to make sure it's as expected."
"``gctree infer --verbose`` will describe the ranking strategy used. Examine this output to make sure it's as expected."
),
)
parser_infer.add_argument(
"--use_old_mut_parsimony",
action="store_true",
help=(
"This argument is deprecated and will throw an error. Use the identifier 'M' with the "
"argument `--ranking_strategy` instead. "
"argument ``--ranking_strategy`` instead. "
"Use old mutability parsimony instead of poisson context likelihood. Not recommended "
"unless attempting to reproduce results from older versions of gctree. "
"This argument will have no effect unless an S5F model is provided with the arguments "
"`--mutability` and `--substitution`."
"``--mutability`` and ``--substitution``."
),
)
parser_infer.add_argument(
@@ -680,14 +681,14 @@ def get_parser():
"--summarize_forest",
action="store_true",
help=(
"write a file `[outbase].forest_summary.log` with a summary of traits for trees in the forest."
"write a file ``[outbase].forest_summary.log`` with a summary of traits for trees in the forest."
),
)
parser_infer.add_argument(
"--tree_stats",
action="store_true",
help=(
"write a file `[outbase].tree_stats.log` with stats for all trees in the forest. "
"write a file ``[outbase].tree_stats.log`` with stats for all trees in the forest. "
"For large forests, this is slow and memory intensive."
),
)
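
The ``--ranking_strategy`` help text in this file advises using ``=`` when the
expression starts with ``-``. A minimal sketch of why, using a stand-in
``argparse`` parser rather than gctree's actual ``get_parser()`` (the ``demo``
program name and the ``try_parse`` helper are assumptions for illustration):

```python
import argparse

# Stand-in parser with a single string option, mirroring only the option name.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--ranking_strategy", type=str, default=None)


def try_parse(argv):
    """Return the parsed strategy string, or None if argparse rejects argv."""
    try:
        return parser.parse_args(argv).ranking_strategy
    except SystemExit:
        # argparse prints a usage error to stderr and exits on invalid input.
        return None


# Space-separated: '-B+R' looks like another option flag, so parsing fails.
print(try_parse(["--ranking_strategy", "-B+R"]))

# '=' binds the value to the option explicitly, so the leading '-' is accepted.
print(try_parse(["--ranking_strategy=-B+R"]))

# Expressions without a leading '-' work with either form.
print(try_parse(["--ranking_strategy", "B,C,A"]))
```

The first call is rejected because ``argparse`` treats a token beginning with
``-`` (other than a negative number) as an option rather than a value.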