Skip to content

Commit

Permalink
Merge pull request #313 from fedarko/sm-barplot-optimization
Browse files Browse the repository at this point in the history
Speed up and test sample metadata barplot computations
  • Loading branch information
kwcantrell authored Aug 13, 2020
2 parents 6b73e58 + 472a248 commit 5bccc7f
Show file tree
Hide file tree
Showing 4 changed files with 248 additions and 62 deletions.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -171,17 +171,17 @@ We have a new layer to work with!

One thing we might be interested in doing is seeing what types of samples contain each tip. This is possible using the _Sample Metadata Coloring_ functionality described above, but this only lets us see information about tips that are unique to a given sample metadata category -- and in practice many tips are often shared between multiple metadata categories, complicating things.

Let's revisit our analysis above of which tips are unique to which body sites in this dataset -- now, we'll instead be asking the related question of "which tips are most common in the body sites in this dataset?" To investigate this, we'll use our new barplot layer to show this information.
Let's revisit our analysis above of which tips are unique to which body sites in this dataset -- now, we'll instead be asking the related question of "which tips are most frequently seen in which body sites in this dataset?" To investigate this, we'll use our new barplot layer to show this information.

In order to do this, we'll need to change our new layer ("Layer 2") from a feature metadata layer to a sample metadata layer. You can do this by clicking on the _Sample Metadata_ button underneath the text "Layer 2". The controls available for this barplot layer should change; in order to show sample presence information for body sites, change the _Show sample info for..._ drop-down menu to `body-site`. Try clicking _Update_ to see what our new Layer 2 looks like.

![empress barplots: class coloring layer 1, bodysite layer 2, and tree phylum coloring](docs/moving-pictures/img/empress_barplots_6.png)

Layer 2 now shows a stacked barplot for each tip. The colors used in this barplot are the same as with sample metadata coloring -- red corresponds to gut samples, green corresponds to tongue samples, etc. When we zoom in, we can see things in detail (you may want to re-color the tree using the _Sample Metadata Coloring_ controls and `body-site` so you can see what these colors mean):
Layer 2 now shows a stacked barplot for each tip, based on the proportions of sample groups containing a given tip. The colors used in this barplot are the same as with sample metadata coloring -- red corresponds to gut samples, green corresponds to tongue samples, etc. When we zoom in, we can see things in detail (you may want to re-color the tree using the _Sample Metadata Coloring_ controls and `body-site` so you can see what these colors mean):

![empress barplots: zoomed in on barplots: class coloring layer 1, bodysite layer 2](docs/moving-pictures/img/empress_barplots_7.png)

The top-most tip is only present in right palm samples, the second-from-the-top tip is only present in gut samples, and so on. The length taken up by a "block" for a given tip is proportional to how many samples of that type contain the tip -- so the fourth-from-the-top tip, for example, is present in mostly tongue samples but also present (albeit in fewer) left palm samples. When we click on this tip (name `9f1913b781d2cde1c8a4c57b7dc2ab83`) in the tree, we can see that this matches up with the _Sample Presence Information_ `body-site` summary for this tip: it's present in 2 tongue samples and 1 left palm sample.
The top-most tip is only present in right palm samples, the second-from-the-top tip is only present in gut samples, and so on. The length taken up by a "block" for a given tip is proportional to how many samples of that type contain the tip (relative to the total number of samples containing the tip; it's [not absolute](https://github.com/biocore/empress/issues/322)) -- so the fourth-from-the-top tip, for example, is present in mostly tongue samples but also present (albeit in fewer) left palm samples. When we click on this tip (name `9f1913b781d2cde1c8a4c57b7dc2ab83`) in the tree, we can see that this matches up with the _Sample Presence Information_ `body-site` summary for this tip: it's present in 2 tongue samples and 1 left palm sample.

This was a brief introduction to some of the barplot functionality available in Empress. There's a lot more that hasn't been documented here -- scaling bars' lengths by a continuous feature metadata field, coloring bars by a continuous feature metadata field (e.g. an importance score or [Songbird](https://github.com/biocore/songbird/)/[ALDEx2](https://www.bioconductor.org/packages/release/bioc/html/ALDEx2.html)-style feature differential), adjusting the default colors or lengths of bars, and so on. We encourage you to try things out; feel free to contact us if you have any questions!

Expand Down
100 changes: 98 additions & 2 deletions empress/support_files/js/biom-table.js
Original file line number Diff line number Diff line change
Expand Up @@ -323,7 +323,7 @@ define(["underscore", "util"], function (_, util) {
// This sample actually contains the feature!
cVal = scope._sm[sIdx][colIdx];
// Update our output Object's count info accordingly.
valueToCountOfSampleWithObs[cVal] += 1;
valueToCountOfSampleWithObs[cVal]++;
}
});
return valueToCountOfSampleWithObs;
Expand Down Expand Up @@ -472,13 +472,109 @@ define(["underscore", "util"], function (_, util) {
var sampleIdx = scope._getSampleIndexFromID(sID);
var cVal = scope._sm[sampleIdx][colIdx];
if (_.has(valueToSampleCount, cVal)) {
valueToSampleCount[cVal] += 1;
valueToSampleCount[cVal]++;
} else {
valueToSampleCount[cVal] = 1;
}
});
return valueToSampleCount;
};

/**
* Maps each feature ID in the table to a "frequencies" Object for a sample
* metadata field.
*
* Each "frequencies" Object contains information on the number of samples
* from each unique sample metadata value that contain the feature ID in
* question. Keys in these objects are unique sample metadata values, and
* values in these objects are the proportion of samples containing the
* feature that have this unique value. Only frequency information for
* unique values where at least 1 sample with this value contains the
* feature is included in a given "frequencies" Object.
*
* This function is designed to be reasonably fast, which is a big part of
* why this works on the order of "each feature ID in the table" rather
* than on a feature-per-feature basis. (The reason for this design is that
* this is used for generating sample metadata barplots, and that was
* previously very slow on large trees: see issue #298 on GitHub. Thanks
* to Yoshiki for discussing this with me.)
*
* @param {String} col Sample metadata column
*
* @return {Object} fID2Freqs
*
* @throws {Error} If the sample metadata column is unrecognized.
*/
BIOMTable.prototype.getFrequencyMap = function (col) {
var scope = this;
var colIdx = this._getSampleMetadataColIndex(col);
var fIdx2Counts = [];
var fIdx2SampleCt = [];
var containingSampleCount, cVal, cValIdx;

// Find unique (sorted) values in this sample metadata column; map
// sample metadata values to a consistent index. (Using an index to
// store this data means we can store the sample metadata values for
// each feature in an Array rather than in an Object for now.)
var uniqueSMVals = this.getUniqueSampleValues(col);
var numUniqueSMVals = uniqueSMVals.length;
var smVal2Idx = {};
_.each(uniqueSMVals, function (smVal, c) {
smVal2Idx[smVal] = c;
});

// Assign each feature an empty counts array with all 0s. Also set
// things up so we can keep track of the total number of samples
// containing each feature easily.
var i, emptyCounts;
_.each(this._fIDs, function (fID, fIdx) {
emptyCounts = [];
for (i = 0; i < numUniqueSMVals; i++) {
emptyCounts.push(0);
}
fIdx2Counts.push(emptyCounts);
fIdx2SampleCt.push(0);
});

// Iterate through each the feature presence data for each sample in
// the BIOM table, storing unique s.m. value counts and total sample
// counts for each feature
_.each(this._tbl, function (presentFeatureIndices, sIdx) {
// Figure out what metadata value this sample has at the column.
cVal = scope._sm[sIdx][colIdx];
cValIdx = smVal2Idx[cVal];
// Increment s.m. value counts for each feature present in this
// sample
_.each(presentFeatureIndices, function (fIdx) {
fIdx2Counts[fIdx][cValIdx]++;
fIdx2SampleCt[fIdx]++;
});
});

// Convert counts to frequencies
// Also, return an Object where the keys are feature IDs pointing to
// other Objects where the keys are sample metadata values, rather than
// a 2D array (which is how fIdx2Counts has been stored).
//
// TODO: It should be possible to return a 2D array without
// constructing an Object, which would save some space. This would
// require decently substantial refactoring of the tests / of
// Empress.addSMBarplotLayerCoords(), but if this gets to be too
// inefficient for large trees it's an option.
var fID2Freqs = {};
var totalSampleCount;
_.each(this._fIDs, function (fID, fIdx) {
totalSampleCount = fIdx2SampleCt[fIdx];
fID2Freqs[fID] = {};
_.each(fIdx2Counts[fIdx], function (count, smValIdx) {
if (count > 0) {
fID2Freqs[fID][uniqueSMVals[smValIdx]] =
count / totalSampleCount;
}
});
});
return fID2Freqs;
};

return BIOMTable;
});
118 changes: 61 additions & 57 deletions empress/support_files/js/empress.js
Original file line number Diff line number Diff line change
Expand Up @@ -682,7 +682,7 @@ define([
this.getNodeInfo(rNode, "highestchildyr")
);
}
// iterate throught the tree in postorder, skip root
// iterate through the tree in postorder, skip root
for (var i = 1; i < tree.size; i++) {
// name of current node
var nodeInd = i;
Expand Down Expand Up @@ -1139,66 +1139,70 @@ define([
);
var colorer = new Colorer(layer.colorBySMColorMap, sortedUniqueValues);
var sm2color = colorer.getMapRGB();
// Do most of the hard work: compute the frequencies for each tip (only
// the tips present in the BIOM table, that is)
var feature2freqs = this._biom.getFrequencyMap(layer.colorBySMField);
// Bar thickness
var halfyrscf = this._yrscf / 2;
for (i = 1; i < this._tree.size; i++) {
if (this._tree.isleaf(this._tree.postorderselect(i))) {
var node = this._treeData[i];
var name = this.getNodeInfo(node, "name");
// Don't draw bars for tips that aren't in the BIOM table
// (Note that this is only for the sample metadata barplots --
// these tips could still ostensibly have associated
// feature metadata)
if (this._biom.getObsIDsDifference([name]).length > 0) {
continue;
}
// Figure how many samples across each unique value in the
// selected sample metadata field contain this tip. (This is
// computed the same way as the information shown in the
// selected node menu's "Sample Presence Information" section.)
var spi = this.computeTipSamplePresence(name, [
layer.colorBySMField,
])[layer.colorBySMField];

// Sum the values of the sample presence information, getting
// us the total number of samples containing this tip.
// JS doesn't have a built-in sum() function, so I couldn't
// think of a better way to do this. Taken from
// https://underscorejs.org/#reduce.
var totalSampleCt = _.reduce(
_.values(spi),
function (a, b) {
return a + b;
},
0
);
var prevSectionMaxX = prevLayerMaxX;
for (var v = 0; v < sortedUniqueValues.length; v++) {
var smVal = sortedUniqueValues[v];
var ct = spi[smVal];
if (ct > 0) {
var sectionColor = sm2color[smVal];
// Assign each unique sample metadata value a length
// proportional to its, well, proportion within the sample
// presence information for this tip.
var barSectionLen =
layer.lengthSM * (ct / totalSampleCt);
var thisSectionMaxX = prevSectionMaxX + barSectionLen;
var y = this.getY(node);
var ty = y + halfyrscf;
var by = y - halfyrscf;
var corners = {
tL: [prevSectionMaxX, ty],
tR: [thisSectionMaxX, ty],
bL: [prevSectionMaxX, by],
bR: [thisSectionMaxX, by],
};
this._addTriangleCoords(coords, corners, sectionColor);
prevSectionMaxX = thisSectionMaxX;
}
// For each tip in the BIOM table...
// (We implicitly ignore [and don't draw anything for] tips that
// *aren't* in the BIOM table.)
_.each(feature2freqs, function (freqs, tipName) {
// Get the tree data for this tip.
// We can just get the 0-th key because tip names are guaranteed to
// be unique, so the nameToKeys entry for a tip name should be an
// array with 1 element.
var node = scope._treeData[scope._nameToKeys[tipName][0]];

// This variable defines the left x-coordinate for drawing the next
// "section" of the stacked barplot. It'll be updated as we iterate
// through the unique values in this sample metadata field below.
var prevSectionMaxX = prevLayerMaxX;

// For each unique value for this sample metadata field...
// NOTE: currently we iterate through all of sortedUniqueValues
// once for every tip in the table, detecting and skipping
// unique values where no samples contain this tip.
// The reason we do things this way, rather than just
// iterating directly over the keys of this tip's Object within
// the frequency map, is that we want to ensure that unique
// values are processed in the same order for every tip (so for
// a "body site" barplot you'd always see e.g. gut, left palm,
// right palm, tongue in that order).
//
// Ideally we'd skip having to do this full iteration, though,
// and only look at the unique values containing this tip from
// the start (saving time). This might require refactoring the
// output of BiomTable.getFrequencyMap(), though.
for (var v = 0; v < sortedUniqueValues.length; v++) {
var smVal = sortedUniqueValues[v];
var freq = freqs[smVal];
// Ignore sample metadata values where no sample with this
// value contains this tip. We can detect this using
// !_.isUndefined() because freqs should only include
// entries for metadata values where this feature is
// present in at least one sample with that value.
if (!_.isUndefined(freq)) {
var sectionColor = sm2color[smVal];
// Assign each unique sample metadata value a length
// proportional to its, well, proportion within the sample
// presence information for this tip.
var barSectionLen = layer.lengthSM * freq;
var thisSectionMaxX = prevSectionMaxX + barSectionLen;
var y = scope.getY(node);
var ty = y + halfyrscf;
var by = y - halfyrscf;
var corners = {
tL: [prevSectionMaxX, ty],
tR: [thisSectionMaxX, ty],
bL: [prevSectionMaxX, by],
bR: [thisSectionMaxX, by],
};
scope._addTriangleCoords(coords, corners, sectionColor);
prevSectionMaxX = thisSectionMaxX;
}
}
}
});
// The bar lengths are identical for all tips in this layer, so no need
// to do anything fancy to compute the maximum X coordinate.
return prevLayerMaxX + layer.lengthSM;
Expand Down
Loading

0 comments on commit 5bccc7f

Please sign in to comment.