Hashset to ranges #2

Merged (9 commits) on May 18, 2021
Conversation

dhakim87 (Collaborator):

When running calculate_coverage on several thousand .sam files, I noticed the memory usage was quite high, causing me to run out of memory. I tracked usage to the sets of integers used to keep track of coverage.

I've added two new files, cover.py and cover_test.py.

cover.py implements a new data structure: SortedRangeList. This takes inclusive integer ranges and maintains the total coverage of those ranges, automatically compressing its internal representation over time.

cover_test.py is a collection of unit tests and performance test functions to compare it against the original hashset implementation.

In the cover_test benchmarks, the set operations are about 3x faster with roughly 1/100 the memory usage.
Real-world performance is closer to 1.2-1.5x faster for the whole program, with about 1/10 the memory usage.

This takes it from ~40-50GB to ~4GB for the first set of files in my use case.
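
For context, a minimal sketch of what a range-compressing structure like this could look like. The class name SortedRangeList matches the PR; the method names (add_range, compress, compute_length) and the compression threshold are illustrative assumptions, not necessarily the PR's API.

class SortedRangeList:
    """Track coverage as a list of inclusive integer ranges, merging
    overlapping or adjacent ranges to keep memory bounded."""

    def __init__(self):
        self.ranges = []

    def add_range(self, start, end):
        self.ranges.append((start, end))
        # Compress periodically rather than on every add to amortize the
        # cost of sorting.
        if len(self.ranges) >= 10000:
            self.compress()

    def compress(self):
        # Sort by start, then sweep once, merging ranges that overlap or
        # touch: [3, 5] and [6, 9] collapse to [3, 9].
        self.ranges.sort()
        merged = []
        for start, end in self.ranges:
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        self.ranges = merged

    def compute_length(self):
        # Total number of covered positions across all ranges.
        self.compress()
        return sum(end - start + 1 for start, end in self.ranges)

# Example: two adjacent reads merge into a single range covering 21 positions.
srl = SortedRangeList()
srl.add_range(10, 20)
srl.add_range(21, 30)
assert srl.compute_length() == 21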

@antgonza (Collaborator) left a comment:

Thank you @dhakim87, a few questions and suggestions.

@@ -15,7 +16,7 @@ def calculate_coverages(input, output, database):
     ###################################
     #Calculate coverage of each contig#
     ###################################
-    gotu_dict = defaultdict(set)
+    gotu_dict = defaultdict(SortedRangeList)
     file_list = glob(input + "/*.sam")
antgonza (Collaborator):

Would it be worth doing "/*.sam*", or perhaps "/*.sam" and "/*.sam.xz"? If I remember correctly, woltka also supports xz.

dhakim87 (Collaborator, Author):

See my other PR: #1. It's just not the focus of this one.
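
Stepping back to the hunk above, a hedged sketch of what recording one aligned read could look like after the swap from set to SortedRangeList. The names record_alignment, gotu, start, and length are illustrative, not the PR's variables; add_range follows the sketch in the description.

from collections import defaultdict

from cover import SortedRangeList

gotu_dict = defaultdict(SortedRangeList)

def record_alignment(gotu, start, length):
    # Before this PR, every covered position went into a set one integer
    # at a time:
    #     for j in range(start, start + length):
    #         gotu_dict[gotu].add(j)
    # With SortedRangeList, the whole read is stored as one inclusive range.
    gotu_dict[gotu].add_range(start, start + length - 1)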

cover.py (outdated, resolved):
    # case 2: active range continues through this range
    # extend active range
    end_val = max(end_val, r[1])
else:  # if end_val < r[0] - 1:
antgonza (Collaborator):

Should the comment after the else be deleted?

dhakim87 (Collaborator, Author):

Getting the indexing right was tricky; I think a reminder that it will join not just [x, y] and [y, z] but also [x, y] and [y+1, z] is helpful.
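
To make the adjacency case concrete, a small self-contained check of the condition under discussion. The helper name merges is hypothetical; the inclusive-range convention follows the snippet above.

def merges(end_val, next_start):
    # Two inclusive ranges merge when they overlap or are adjacent, i.e. the
    # next range starts no later than one position past the active range's end.
    return next_start <= end_val + 1

# [3, 5] and [5, 9] overlap; [3, 5] and [6, 9] touch; [3, 5] and [7, 9] leave a gap.
assert merges(5, 5)
assert merges(5, 6)
assert not merges(5, 7)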

antgonza (Collaborator):

Got it; it's not really important, but perhaps worth merging it into the comment below.

read_len = random.randint(85, 150)
for j in range(read_start, read_start + read_len):
    intset.add(j)
print("SET_ADD: ", perf_counter() - start_set)
antgonza (Collaborator):

Are this print and the other one below needed? Normally tests are silent, right?

dhakim87 (Collaborator, Author):

The first and second could be made into pure unit tests, but the second and third also function as performance tests to show the improvement. There was no unit testing framework in place as far as I could tell, and it seemed unnecessary to add one for a repo with two scripts in it.

I wasn't sure if cover_test.py would be deleted and the 60 lines of cover.py added to the top of calculate_coverage.py.

If we do go with a unit test framework, is there a reason unit tests should be silent?

antgonza (Collaborator):

IMO, if the idea is to test the performance of the code, you need to add a test for that specifically, and the specific test will depend on how you define "better"; for example, if better means that x > y, then the test should be something like self.assertTrue(x > y).

Now, I have never seen prints in tests before, except for debugging, because normally nobody should be looking at the output of the tests to confirm that things are good or bad; the tests themselves should check that ...
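
As a hedged sketch of the kind of assertion being suggested: it assumes cover.SortedRangeList exposes an add_range method (as in the sketch above), and the workload loosely follows the cover_test.py snippet; timing-based assertions like this can be flaky on loaded machines.

import random
import unittest
from time import perf_counter

from cover import SortedRangeList


class PerformanceTest(unittest.TestCase):
    def test_range_list_adds_faster_than_set(self):
        random.seed(42)
        reads = [(random.randint(0, 10**6), random.randint(85, 150))
                 for _ in range(5000)]

        # Baseline: track every covered position in a plain set.
        start = perf_counter()
        intset = set()
        for read_start, read_len in reads:
            for j in range(read_start, read_start + read_len):
                intset.add(j)
        set_time = perf_counter() - start

        # Candidate: record each read as a single inclusive range.
        start = perf_counter()
        srl = SortedRangeList()
        for read_start, read_len in reads:
            srl.add_range(read_start, read_start + read_len - 1)
        range_time = perf_counter() - start

        # "Better" defined explicitly, as suggested: the range structure
        # should not be slower than the set for the same workload.
        self.assertLess(range_time, set_time)


if __name__ == "__main__":
    unittest.main()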

@@ -0,0 +1,67 @@
from cover import SortedRangeList
antgonza (Collaborator):

Is there a reason to not use unittest?

dhakim87 (Collaborator, Author):

None afaik, but no framework was in place and I didn't feel comfortable deciding on one in this PR.

antgonza (Collaborator):

Got it, it will be good to decide that before adding more code ... could you open an issue so it can be discussed?

antgonza (Collaborator):

It would also be nice to discuss whether the repo needs continuous integration, so that flake8 and the tests can be run automatically.

@antgonza (Collaborator) left a comment:

Thank you @dhakim87.

IMO, #3 and #4 should be decided before merging this code, more so because this push (screenshot: "Screen Shot 2021-05-14 at 7 15 20 AM") shows why it is important to have CI. However, I understand that this might still be a POC. @wasade, could you take a look? Thank you.

@wasade (Member) commented on May 14, 2021:

Thanks! What I propose is we merge this now, discuss the longer term plan for this repository at the thinktank meeting this Monday, and proceed from there (@dhakim87, please let me know if you don't know what I'm talking about). The codebase is small enough that resolving #3 and #4 outside of this PR wouldn't be that bad. Does that sound reasonable?

@dhakim87 (Collaborator, Author):

I'm fine with seeing it merged in; it successfully ran on the several thousand .sam files in my dataset after these changes.

@adswafford merged commit e5c793a into biocore:master on May 18, 2021