Hashset to ranges #2
Conversation
… int set and 100x less memory (for 100 long reads)
…on every 100000 reads by default
Thank you @dhakim87, a few questions and suggestions.
@@ -15,7 +16,7 @@ def calculate_coverages(input, output, database):
###################################
#Calculate coverage of each contig#
###################################
-gotu_dict = defaultdict(set)
+gotu_dict = defaultdict(SortedRangeList)
file_list = glob(input + "/*.sam")
Would it be worth doing `"/*.sam"` or perhaps `"/*.sam"` and `"/*.sam.xz"`? If I remember correctly woltka also supports xz.
See my other PR: #1. It's just not the focus of this one.
# case 2: active range continues through this range
# extend active range
end_val = max(end_val, r[1])
else:  # if end_val < r[0] - 1:
Should the comment after the else be deleted?
Getting the indexing right was tricky; I think a reminder that it will join not just [x, y] [y, z] but also [x, y] [y+1, z] is helpful.
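To make the adjacency case concrete, here is a small stand-alone sketch of the merge condition being discussed (`merge_inclusive_ranges` is a hypothetical helper written for illustration, not code from cover.py):

```python
def merge_inclusive_ranges(ranges):
    """Merge a sorted list of inclusive (start, end) ranges,
    joining both overlapping and directly adjacent ranges."""
    merged = []
    start_val, end_val = ranges[0]
    for r in ranges[1:]:
        if end_val >= r[0] - 1:
            # overlapping OR adjacent: [x, y] + [y + 1, z] -> [x, z]
            end_val = max(end_val, r[1])
        else:
            # gap of at least one integer: close out the active range
            merged.append((start_val, end_val))
            start_val, end_val = r
    merged.append((start_val, end_val))
    return merged

print(merge_inclusive_ranges([(1, 5), (6, 9), (12, 14)]))
# (1, 5) and (6, 9) are adjacent, not overlapping, yet still merge:
# -> [(1, 9), (12, 14)]
```

The `- 1` in the condition is exactly what the comment is reminding the reader of: without it, touching-but-not-overlapping ranges like [1, 5] and [6, 9] would stay separate.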
Got it, not really important, but perhaps worth merging into the comment below.
read_len = random.randint(85, 150)
for j in range(read_start, read_start + read_len):
    intset.add(j)
print("SET_ADD: ", perf_counter() - start_set)
Are this print and the other below needed? Normally tests are silent, right?
The first and second could be made into pure unit tests, but the second and third also function as performance tests to show the improvement. There was no unit testing framework in place as far as I could tell, and it seemed unnecessary to add one for a repo with two scripts in it.
I wasn't sure whether cover_test.py would be deleted and the 60 lines of cover.py added to the top of calculate_coverage.py.
If we do go with a unit test framework, is there a reason unit tests should be silent?
IMO, if the idea is to test the performance of the code, you need to add a test for that specifically, and the specific test will depend on how you define "better". For example, if "better" means that x > y, then the test should be something like `self.assertTrue(x > y)`.
Now, I have never seen prints in tests before, except for debugging, because normally the idea is that nobody should be looking at the output of the tests to confirm that things are good or bad; instead, the tests themselves should check that ...
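For what it's worth, the suggestion above could look roughly like this (a hypothetical sketch: `PerfComparisonTest` and the inline range-merging stand-in are illustrative, not the actual cover.py or cover_test.py code):

```python
import timeit
import unittest

class PerfComparisonTest(unittest.TestCase):
    """Silent performance test: instead of printing timings,
    assert the relationship the change is claimed to improve."""

    def test_ranges_faster_than_set(self):
        # pre-sorted, overlapping inclusive (start, end) reads
        reads = [(i * 10, i * 10 + 100) for i in range(2000)]

        def with_set():
            # original approach: one set entry per covered position
            s = set()
            for start, end in reads:
                s.update(range(start, end + 1))
            return len(s)

        def with_ranges():
            # stand-in for the range-based approach: reads are sorted,
            # so a single running range is enough for this sketch
            total, (lo, hi) = 0, reads[0]
            for start, end in reads[1:]:
                if start <= hi + 1:
                    hi = max(hi, end)
                else:
                    total, (lo, hi) = total + (hi - lo + 1), (start, end)
            return total + (hi - lo + 1)

        # correctness: both approaches must agree on total coverage
        self.assertEqual(with_set(), with_ranges())
        # performance: "better" defined here as strictly faster
        t_set = timeit.timeit(with_set, number=3)
        t_ranges = timeit.timeit(with_ranges, number=3)
        self.assertLess(t_ranges, t_set)
```

Run with `python -m unittest` from the file's directory; a passing run produces no output beyond the usual dot-per-test summary.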
@@ -0,0 +1,67 @@
from cover import SortedRangeList
Is there a reason to not use unittest?
None afaik, but no framework was in place and I didn't feel comfortable deciding on one in this PR.
Got it, it will be good to decide that before adding more code ... could you open an issue so it can be discussed?
It would also be nice to discuss whether the repo needs continuous integration, so flake8 and the tests can be run automatically.
Thanks! What I propose is that we merge this now, discuss the longer-term plan for this repository at the thinktank meeting this Monday, and proceed from there (@dhakim87, please let me know if you don't know what I'm talking about). The codebase is small enough that resolving #3 and #4 outside of this PR wouldn't be that bad. Does that sound reasonable?
I'm fine with seeing it merged in; it ran successfully on the several thousand .sam files of my dataset after these changes.
When running calculate_coverage on several thousand .sam files, I noticed the memory usage was quite high, causing me to run out of memory. I tracked usage to the sets of integers used to keep track of coverage.
I've added two new files, cover.py and cover_test.py.
cover.py implements a new data structure: SortedRangeList. This takes inclusive integer ranges and maintains the total coverage of those ranges, automatically compressing its internal representation over time.
cover_test.py is a collection of unit tests and performance test functions to compare it against the original hashset implementation.
Theoretical performance from cover_test is about 3x faster, with 1/100 the memory usage, for set operations.
Real-world performance is closer to 1.2-1.5x faster for the whole program, with 1/10 the memory usage.
This takes it from ~40-50 GB to ~4 GB for the first set of files in my use case.
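For readers of this thread, the core idea might look roughly like the following minimal sketch. The method names `add_range` and `compute_length` are illustrative assumptions, not the actual cover.py API, and the real implementation compresses only every 100,000 reads by default (per the commit message above) rather than on every add as shown here:

```python
class SortedRangeList:
    """Sketch: track coverage as merged inclusive integer ranges
    instead of one set entry per covered position."""

    def __init__(self):
        self.ranges = []  # list of inclusive (start, end) tuples

    def add_range(self, start, end):
        self.ranges.append((start, end))
        # compress on every add for brevity; a real implementation
        # would compress periodically to amortize the sort cost
        self.compress()

    def compress(self):
        """Sort and merge overlapping or adjacent inclusive ranges."""
        if not self.ranges:
            return
        self.ranges.sort()
        merged = [self.ranges[0]]
        for start, end in self.ranges[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end + 1:  # overlap or direct adjacency
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        self.ranges = merged

    def compute_length(self):
        """Total number of positions covered."""
        return sum(end - start + 1 for start, end in self.ranges)
```

Since overlapping 100 bp reads collapse into a handful of tuples rather than thousands of individual integers, this is where the large memory savings over a plain set of positions would come from.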