fix quickstart and docs

matsengrp · Jul 24, 2024 · 7da2859
1 parent 9f858f7
commit 7da2859

Showing 4 changed files with 74 additions and 39 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -34,8 +34,6 @@ gctree documentation
    :maxdepth: 1
    :caption: Notes
 
-   CHANGELOG
-   faq
 
 
 Indices and tables

diff --git a/docs/quickstart.rst b/docs/quickstart.rst
@@ -67,8 +67,8 @@ You now have two new files, ``outfile`` and ``outtree``.
 Note: if you want to rerun the above ``dnapars`` command, you must delete these two files first!
 
 
-gctree
-======
+Gctree Ranking
+==============
 
 We're now ready to run ``gctree infer`` to use abundance data (in ``abundances.csv``) to rank the equally parsimonious trees (in ``outfile``).
 We can use the optional argument ``--frame`` to indicate the coding frame of the sequence start, so that amino acid substitutions can be annotated on our trees.
@@ -106,48 +106,82 @@ This file may be manipulated using ``gctree infer``, instead of providing
 a dnapars ``outfile``.
 
 .. note::
-  Although described below, using context likelihood, mutability parsimony, or isotype parsimony
-   as ranking criteria is experimental, and has not yet been shown in a careful
-   validation to improve tree inference. Only the default branching process
-   likelihood is recommended for tree ranking!
+  Although described below, using context likelihood log loss, mutability parsimony, or isotype parsimony
+  as ranking criteria is experimental, and has not yet been shown in a careful
+  validation to improve tree inference. Only the default branching process
+  likelihood is recommended for tree ranking!
 
 Criteria other than branching process likelihoods can be used to break ties
 between trees. Providing arguments ``--isotype_mapfile`` and
 ``--idmapfile`` will allow trees to be ranked by isotype parsimony. Providing
 arguments ``--mutability`` and ``--substitution`` allows trees to be ranked
-according to a context-sensitive mutation model. By default, trees are ranked
-lexicographically, first maximizing likelihood, then minimizing isotype
-parsimony, and finally maximizing a context-based poisson likelihood, if such information is provided.
-Ranking priorities can be adjusted using the argument ``--ranking_coeffs``.
-
-For example, to find the optimal tree
-according to a linear combination of likelihood, isotype parsimony,
-mutabilities, and alleles:
-
-.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
+according to a context-sensitive Poisson model. By default, trees are ranked
+lexicographically, first minimizing branching process log loss (negative
+log-likelihood), then minimizing isotype
+parsimony, then minimizing context-based Poisson log loss, and finally minimizing the number of alleles,
+if required information for all criteria is provided. Criteria for which
+required information is not provided will be skipped.
+
+Ranking priorities can be adjusted using the argument ``--ranking_strategy``.
+This argument accepts a string describing either an alternative lexicographic
+ordering, or a linear combination of criteria to be minimized. The following
+identifiers are used to specify available ranking criteria:
+
+* ``B`` - branching process log loss
+* ``I`` - isotype parsimony
+* ``C`` - context-based Poisson log loss
+* ``M`` - old mutability parsimony
+* ``A`` - number of alleles
+* ``R`` - sitewise reversions to naive sequence
+
+An alternative lexicographic ordering can be specified using a comma-separated
+list of identifiers. For example, by passing ``--ranking_strategy "C, B, R"`` to
+``gctree infer``, we will minimize context-based Poisson log loss, then
+branching process log loss, and finally naive reversions. If for some reason
+the user wishes to maximize instead of minimizing a criterion, a negative
+coefficient can be provided, as in ``"C, B, -R"``. To compute the value of
+a criterion without using it for ranking, a coefficient of zero can be
+prepended to its identifier.
+
+Alternatively, the ``--ranking_strategy`` option can be used to rank trees to
+minimize a linear combination of criteria.
+For example, to find the optimal tree according to a linear combination of
+branching process log loss, isotype parsimony, context-based Poisson log loss,
+and alleles:
+
+.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p \
+                                --frame 1 \
+                                --idmap idmap.txt \
+                                --isotype_mapfile ../example/isotypemap.txt \
+                                --mutability ../HS5F_Mutability.csv \
+                                --substitution ../HS5F_Substitution.csv \
+                                --ranking_strategy "B + I + 0.1C + 0.01A" \
+                                --outbase newranking --summarize_forest \
+                                --tree_stats \
+                                --verbose
    :shell:
 
 The files ``HS5F_Mutability.csv`` and ``HS5F_Substitution.csv`` are a context
-sensitive mutation model which can be downloaded from the `Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/`>_.
+sensitive mutation model which can be downloaded from the `Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/>`_, and are required here to compute context-based Poisson log loss.
 
 By default, only the files listed above will be generated, with the optional argument ``--outbase`` specifying how the output files should be named.
 
 .. image:: newranking.inference.1.svg
    :width: 1000
 
-For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees,use the argument ``--tree_stats``.
+For detailed information about each tree used for ranking, as well as a pairplot like the one below comparing the highest ranked tree to all other ranked trees, use the argument ``--tree_stats``.
 
-.. image:: newranking.tree_stats.pairplot.png
+.. image:: newranking.tree_stats.pairplot.svg
    :width: 1000
 
-Sometimes ranked trees are too numerous, and generating the output of ``--tree_stats`` would require too many resources. For a summary of the collection of trees used for ranking, the argument ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_coeffs``.
+Sometimes ranked trees are too numerous to generate the output of ``--tree_stats`` in a reasonable amount of time. For a summary of the collection of trees used for ranking, the option ``--summarize_forest`` is provided. Most importantly, this option summarizes how much less likely the top ranked tree is, compared to the most likely tree being ranked, for example to validate coefficients passed to ``--ranking_strategy``.
 
 .. command-output:: cat newranking.forest_summary.log
    :shell:
 
 
-isotype
-=======
+Isotypes
+========
 
 If we would like to add observed isotype data to trees output by gctree
 inference, we can now do so.
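
Note on the ``--ranking_strategy`` grammar documented in the quickstart hunk above: the following is a minimal sketch of how such a string could be interpreted, not gctree's actual parser. The names ``parse_strategy`` and ``ranking_key`` and the per-tree criterion values are hypothetical. It assumes only what the docs state: comma-separated identifiers give a lexicographic ordering, signed terms give a linear combination, the two forms cannot be mixed, and a single identifier is treated as lexicographic.

```python
import re
from typing import Dict, List, Tuple

# One term: optional sign, optional numeric coefficient, one criterion identifier.
_TERM = re.compile(r"([+-]?)(\d+(?:\.\d*)?|\.\d+)?([BICMAR])")

Term = Tuple[float, str]


def _parse_terms(part: str) -> List[Term]:
    """Parse e.g. 'B+2C-1.1A' into [(1.0, 'B'), (2.0, 'C'), (-1.1, 'A')]."""
    compact = "".join(part.split())  # drop all whitespace
    terms: List[Term] = []
    pos = 0
    while pos < len(compact):
        match = _TERM.match(compact, pos)
        # Every term after the first must carry an explicit sign.
        if match is None or (pos > 0 and not match.group(1)):
            raise ValueError(f"cannot parse ranking expression {part!r}")
        sign = -1.0 if match.group(1) == "-" else 1.0
        magnitude = float(match.group(2)) if match.group(2) else 1.0
        terms.append((sign * magnitude, match.group(3)))
        pos = match.end()
    if not terms:
        raise ValueError("empty ranking expression")
    return terms


def parse_strategy(expr: str) -> Tuple[str, List[Term]]:
    """Return ('lexicographic' | 'linear', [(coefficient, identifier), ...])."""
    groups = [_parse_terms(part) for part in expr.split(",")]
    if len(groups) > 1:
        if any(len(group) != 1 for group in groups):
            raise ValueError("lexicographic and linear forms cannot be combined")
        return "lexicographic", [group[0] for group in groups]
    terms = groups[0]
    # A single identifier is interpreted as a lexicographic ordering.
    return ("lexicographic" if len(terms) == 1 else "linear"), terms


def ranking_key(kind: str, terms: List[Term], values: Dict[str, float]) -> tuple:
    """Sort key for one tree; smaller is better. `values` maps identifier -> value."""
    if kind == "linear":
        return (sum(coeff * values[ident] for coeff, ident in terms),)
    # Zero-coefficient criteria are reported but do not influence the ordering.
    return tuple(coeff * values[ident] for coeff, ident in terms if coeff != 0)


# Hypothetical criterion values for three trees.
trees = [{"B": 10.0, "A": 3}, {"B": 10.0, "A": 2}, {"B": 12.0, "A": 1}]
kind, terms = parse_strategy("B,A")
best = min(trees, key=lambda tree: ranking_key(kind, terms, tree))
```

With ``"B,A"`` the tie on ``B`` is broken by the smaller allele count, while ``"B,-A"`` would break it in favor of the larger one.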

diff --git a/gctree/branching_processes.py b/gctree/branching_processes.py
@@ -1173,6 +1173,7 @@ def filter_trees(  # noqa: C901
         outbase: str = "gctree.out",
         summarize_forest: bool = False,
         tree_stats: bool = False,
+        img_type: str = "svg",
     ) -> CollapsedForest:
         """Filter trees according to specified criteria.
 
@@ -1196,6 +1197,7 @@ def filter_trees(  # noqa: C901
             outbase: file name stem for a file with information for each tree in the DAG.
             summarize_forest: whether to write a summary of the forest to file `[outbase].forest_summary.log`
             tree_stats: whether to write stats for each tree in the forest to file `[outbase].tree_stats.log`
+            img_type: format for output plots.
 
         Returns:
             The trimmed forest, containing all optimal trees according to the specified criteria, and a tuple
@@ -1613,7 +1615,7 @@ def minfunckey(tup):
                 hue="set",
                 diag_kind="hist",
             )
-            pplot.savefig(outbase + ".tree_stats.pairplot.pdf")
+            pplot.savefig(outbase + f".tree_stats.pairplot.{img_type}")
 
         return (ctrees, trimmed_forest, weighttuples)
 

diff --git a/gctree/cli.py b/gctree/cli.py
@@ -222,6 +222,7 @@ def isotype_add(forest):
         mutability_file=args.mutability,
         substitution_file=args.substitution,
         chain_split=args.chain_split,
+        img_type=args.img_type,
     )
 
     if args.colormapfile is not None:
@@ -609,9 +610,9 @@ def get_parser():
         type=float,
         default=None,
         help=(
-            "This argument is deprecated and will throw an error. Use `--ranking_strategy` instead. "
+            "This argument is deprecated and will throw an error. Use ``--ranking_strategy`` instead. "
             "Coefficient used for branching process likelihood, when ranking trees by a linear "
-            "combination of traits. This value will be ignored if `--ranking_coeffs` argument is not "
+            "combination of traits. This value will be ignored if ``--ranking_coeffs`` argument is not "
             "also provided."
         ),
     )
@@ -621,7 +622,7 @@ def get_parser():
         nargs=3,
         default=None,
         help=(
-            "This argument is deprecated and will throw an error. Use `--ranking_strategy` instead. "
+            "This argument is deprecated and will throw an error. Use ``--ranking_strategy`` instead. "
             "List of coefficients for ranking trees by a linear combination of traits. "
             "Coefficients are in order: isotype parsimony, mutation model parsimony, number of alleles. "
             "A coefficient of -1 will be applied to branching process likelihood. "
@@ -635,13 +636,13 @@ def get_parser():
         default=None,
         help=(
             "Expression describing tree ranking strategy. If provided, takes precedence over all other ranking arguments. "
-            "Two types of expressions are permitted: First are those describing lexicographic orderings, like `B,C,A`, which means "
+            "Two types of expressions are permitted: First are those describing lexicographic orderings, like ``B,C,A``, which means "
             "choose trees to minimize branching process log loss, then minimize context log loss, then minimize number "
-            "of alleles. Next are expressions describing linear combinations of criteria, like `B+2C-1.1A`, which means choose "
+            "of alleles. Next are expressions describing linear combinations of criteria, like ``B+2C-1.1A``, which means choose "
             "trees to minimize the specified linear combination of criteria. "
-            "If linear combination expression has leading `-`, use `=` instead of space to separate argument, "
+            "If linear combination expression has leading ``-``, use ``=`` instead of space to separate argument, "
             "e.g. ``--ranking_strategy=-B+R``. "
-            "These two methods of ranking cannot be combined. For example, `B+C,A` is not a valid ranking strategy expression. "
+            "These two methods of ranking cannot be combined. For example, ``B+C,A`` is not a valid ranking strategy expression. "
             "Ranking criteria are specified using the following identifiers. All are by default minimized:\n"
             "B - branching process log loss,\n"
             "I - isotype parsimony,\n"
@@ -650,23 +651,23 @@ def get_parser():
             "A - number of alleles,\n"
             "R - sitewise reversions to naive sequence.\n"
             "To compute the value of a criterion on ranked trees without affecting the ranking, include that ranking criterion "
-            "with a coefficient of zero, as in `B+2C+0A`, or `B,C,0A`.\n"
+            "with a coefficient of zero, as in ``B+2C+0A``, or ``B,C,0A``.\n"
             "To maximize instead of minimizing a criterion in lexicographic ranking, provide a negative coefficient. "
-            "For example, `B,-A` will first minimiz branching process log loss, then maximize the number of alleles. \n"
+            "For example, ``B,-A`` will first minimize branching process log loss, then maximize the number of alleles. \n"
             "A ranking strategy string containing a single ranking criterion identifier will be interpreted as a lexicographic ordering. "
-            "`gctree infer --verbose` will describe the ranking strategy used. Examine this output to make sure it's as expected."
+            "``gctree infer --verbose`` will describe the ranking strategy used. Examine this output to make sure it's as expected."
         ),
     )
     parser_infer.add_argument(
         "--use_old_mut_parsimony",
         action="store_true",
         help=(
             "This argument is deprecated and will throw an error. Use the identifier 'M' with the "
-            "argument `--ranking_strategy` instead. "
+            "argument ``--ranking_strategy`` instead. "
             "Use old mutability parsimony instead of poisson context likelihood. Not recommended "
             "unless attempting to reproduce results from older versions of gctree. "
             "This argument will have no effect unless an S5F model is provided with the arguments "
-            "`--mutability` and `--substitution`."
+            "``--mutability`` and ``--substitution``."
         ),
     )
     parser_infer.add_argument(
@@ -680,14 +681,14 @@ def get_parser():
         "--summarize_forest",
         action="store_true",
         help=(
-            "write a file `[outbase].forest_summary.log` with a summary of traits for trees in the forest."
+            "write a file ``[outbase].forest_summary.log`` with a summary of traits for trees in the forest."
         ),
     )
     parser_infer.add_argument(
         "--tree_stats",
         action="store_true",
         help=(
-            "write a file `[outbase].tree_stats.log` with stats for all trees in the forest. "
+            "write a file ``[outbase].tree_stats.log`` with stats for all trees in the forest. "
             "For large forests, this is slow and memory intensive."
         ),
     )
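
Note on the ``--ranking_strategy`` help text above, which asks for ``=`` when the expression has a leading ``-``: this follows from how ``argparse`` treats a separate argument that starts with ``-`` and is not a negative number, namely as another option rather than a value. The sketch below reproduces that behavior with a stand-in parser; ``build_parser`` and ``try_parse`` are hypothetical helpers that model only this one option and are not part of ``gctree/cli.py``.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in: only the --ranking_strategy option is modelled.
    parser = argparse.ArgumentParser(prog="ranking-strategy-sketch")
    parser.add_argument("--ranking_strategy", type=str, default=None)
    return parser


def try_parse(argv: List[str]) -> Optional[str]:
    """Return the parsed strategy string, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv).ranking_strategy
    except SystemExit:
        # argparse reports the usage error on stderr and exits; treat as rejection.
        return None


with_equals = try_parse(["--ranking_strategy=-B+R"])
with_space = try_parse(["--ranking_strategy", "-B+R"])
no_leading_dash = try_parse(["--ranking_strategy", "B,C,A"])
```

Here ``with_equals`` holds the full expression, ``with_space`` is ``None`` because argparse reports that the option expected one argument, and expressions without a leading ``-`` can be passed either way.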