-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add model_embedding_dim
argument to Dataset constructors
#107
Conversation
To recap our conversation just now about this:
Yes, we've decided so. I'll backup the current working status of this branch (and the corresponding dnsm-experiments-1 branch) to
We'll just use pretrained crepe names to make this clear.
We understand that although all sites in the input are included in the embedding, the model ignores sites masked sites while keeping track of the fact that they take up space. We may want to try unmasking X's that aren't just for padding here, as a separate issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to take a quick look (half-attention as attending conference)...
netam/dasm.py
Outdated
# just have it take aa strings, but that's not what it did before, so I'm | ||
# keeping the original behavior for now. (although, the old docstring | ||
# claimed incorrectly that it took an aa sequence) | ||
def build_selection_matrix_from_parent(self, parent: Tuple[str, str]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not used for branch length optimization?
If not, let's cut it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in a couple of tests, so I prepended an underscore to its name instead of removing it.
netam/framework.py
Outdated
if chosen_v_families is not None: | ||
chosen_v_families = set(chosen_v_families) | ||
# TODO is this the right way to handle this? Or should it be OR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we decided on OR.
I am still waiting for all the notebooks to run, and I expect that changes will be needed to make some of them run. However, all tests pass, and the test snakemake configs run for both dnsm and dasm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
This PR makes DXSM datasets specific to the embedding dimension of the model that the dataset is intended for.
Minimal changes in dnsm-experiments-1 were required, and these are in https://github.com/matsengrp/dnsm-experiments-1/pull/78.
Here are some discussion points: