Hashset to ranges #2
Conversation
… int set and 100x less memory (for 100 long reads)
…on every 100000 reads by default
Thank you @dhakim87, a few questions and suggestions.
@@ -15,7 +16,7 @@ def calculate_coverages(input, output, database):
###################################
#Calculate coverage of each contig#
###################################
-gotu_dict = defaultdict(set)
+gotu_dict = defaultdict(SortedRangeList)
file_list = glob(input + "/*.sam")
Would it be worth doing `"/*.sam"` or perhaps `"/*.sam"` and `"/*.sam.xz"`? If I remember correctly woltka also supports xz.
See my other PR: #1. It's just not the focus of this one.
# case 2: active range continues through this range
# extend active range
end_val = max(end_val, r[1])
else:  # if end_val < r[0] - 1:
Should the comment after the else be deleted?
Getting the indexing right was tricky; I think a reminder that it will join not just [x, y] [y, z] but also [x, y] [y+1, z] is helpful.
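To make the adjacency case concrete, here is a small stand-alone sketch of the merge condition being discussed (`merge_inclusive_ranges` is a hypothetical helper written for illustration, not code from cover.py):

```python
def merge_inclusive_ranges(ranges):
    """Merge a sorted list of inclusive (start, end) ranges,
    joining both overlapping and directly adjacent ranges."""
    merged = []
    start_val, end_val = ranges[0]
    for r in ranges[1:]:
        if end_val >= r[0] - 1:
            # overlapping OR adjacent: [x, y] + [y + 1, z] -> [x, z]
            end_val = max(end_val, r[1])
        else:
            # gap of at least one integer: close out the active range
            merged.append((start_val, end_val))
            start_val, end_val = r
    merged.append((start_val, end_val))
    return merged

print(merge_inclusive_ranges([(1, 5), (6, 9), (12, 14)]))
# (1, 5) and (6, 9) are adjacent, not overlapping, yet still merge:
# -> [(1, 9), (12, 14)]
```

The `- 1` in the condition is exactly what the comment is reminding the reader of: without it, touching-but-not-overlapping ranges like [1, 5] and [6, 9] would stay separate.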
Got it, not really important, but perhaps worth merging into the comment below.
read_len = random.randint(85, 150)
for j in range(read_start, read_start + read_len):
    intset.add(j)
print("SET_ADD: ", perf_counter() - start_set)
Are this print and the other below needed? Normally tests are silent, right?
The first and second could be made into pure unit tests, but the second and third also function as performance tests to show the improvement. There was no unit testing framework in place as far as I could tell, and it seemed unnecessary to add one for a repo with two scripts in it.
I wasn't sure whether cover_test.py would be deleted and the 60 lines of cover.py added to the top of calculate_coverage.py.
If we do go with a unit test framework, is there a reason unit tests should be silent?
IMO, if the idea is to test the performance of the code, you need to add a test for that specifically, and the specific test will depend on how you define "better". For example, if "better" means that x > y, then the test should be something like `self.assertTrue(x > y)`.
Now, I have never seen prints in tests before, except for debugging, because normally the idea is that nobody should be looking at the output of the tests to confirm that things are good or bad; instead, the tests themselves should check that ...
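For what it's worth, the suggestion above could look roughly like this (a hypothetical sketch: `PerfComparisonTest` and the inline range-merging stand-in are illustrative, not the actual cover.py or cover_test.py code):

```python
import timeit
import unittest

class PerfComparisonTest(unittest.TestCase):
    """Silent performance test: instead of printing timings,
    assert the relationship the change is claimed to improve."""

    def test_ranges_faster_than_set(self):
        # pre-sorted, overlapping inclusive (start, end) reads
        reads = [(i * 10, i * 10 + 100) for i in range(2000)]

        def with_set():
            # original approach: one set entry per covered position
            s = set()
            for start, end in reads:
                s.update(range(start, end + 1))
            return len(s)

        def with_ranges():
            # stand-in for the range-based approach: reads are sorted,
            # so a single running range is enough for this sketch
            total, (lo, hi) = 0, reads[0]
            for start, end in reads[1:]:
                if start <= hi + 1:
                    hi = max(hi, end)
                else:
                    total, (lo, hi) = total + (hi - lo + 1), (start, end)
            return total + (hi - lo + 1)

        # correctness: both approaches must agree on total coverage
        self.assertEqual(with_set(), with_ranges())
        # performance: "better" defined here as strictly faster
        t_set = timeit.timeit(with_set, number=3)
        t_ranges = timeit.timeit(with_ranges, number=3)
        self.assertLess(t_ranges, t_set)
```

Run with `python -m unittest` from the file's directory; a passing run produces no output beyond the usual dot-per-test summary.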
@@ -0,0 +1,67 @@
from cover import SortedRangeList
Is there a reason to not use unittest?
None afaik, but no framework was in place and I didn't feel comfortable deciding on one in this PR.
Got it, it will be good to decide that before adding more code ... could you open an issue so it can be discussed?
It would also be nice to discuss whether the repo needs continuous integration, so flake8 and the tests can be run automatically.
Thanks! What I propose is that we merge this now, discuss the longer-term plan for this repository at the thinktank meeting this Monday, and proceed from there (@dhakim87, please let me know if you don't know what I'm talking about). The codebase is small enough that resolving #3 and #4 outside of this PR wouldn't be that bad. Does that sound reasonable?
I'm fine with seeing it merged in; it ran successfully on the several thousand .sam files of my dataset after these changes.
When running calculate_coverage on several thousand .sam files, I noticed the memory usage was quite high, causing me to run out of memory. I tracked usage to the sets of integers used to keep track of coverage.
I've added two new files, cover.py and cover_test.py.
cover.py implements a new data structure: SortedRangeList. This takes inclusive integer ranges and maintains the total coverage of those ranges, automatically compressing its internal representation over time.
cover_test.py is a collection of unit tests and performance test functions to compare it against the original hashset implementation.
Theoretical performance from cover_test is about 3x faster, with 1/100 the memory usage, for set operations.
Real-world performance is closer to 1.2-1.5x faster for the whole program, with 1/10 the memory usage.
This takes it from ~40-50 GB to ~4 GB for the first set of files in my use case.
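For readers of this thread, the core idea might look roughly like the following minimal sketch. The method names `add_range` and `compute_length` are illustrative assumptions, not the actual cover.py API, and the real implementation compresses only every 100,000 reads by default (per the commit message above) rather than on every add as shown here:

```python
class SortedRangeList:
    """Sketch: track coverage as merged inclusive integer ranges
    instead of one set entry per covered position."""

    def __init__(self):
        self.ranges = []  # list of inclusive (start, end) tuples

    def add_range(self, start, end):
        self.ranges.append((start, end))
        # compress on every add for brevity; a real implementation
        # would compress periodically to amortize the sort cost
        self.compress()

    def compress(self):
        """Sort and merge overlapping or adjacent inclusive ranges."""
        if not self.ranges:
            return
        self.ranges.sort()
        merged = [self.ranges[0]]
        for start, end in self.ranges[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end + 1:  # overlap or direct adjacency
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        self.ranges = merged

    def compute_length(self):
        """Total number of positions covered."""
        return sum(end - start + 1 for start, end in self.ranges)
```

Since overlapping 100 bp reads collapse into a handful of tuples rather than thousands of individual integers, this is where the large memory savings over a plain set of positions would come from.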