A memory efficient implementation of the .mtx reading function #3389

Merged
merged 5 commits into from
Feb 25, 2025

Conversation

Contributor

@gjeuken gjeuken commented Nov 28, 2024

  • Closes #
  • Tests included or not required because: test_datasets.py already implemented
  • Release notes not necessary because: This is a backend change

The pandas read_csv function is very memory-intensive, which makes loading data (especially large datasets from the EBI Single Cell Expression Atlas) impossible on computers with 16 GB of RAM or less. The subsequent analysis of such datasets with scanpy, however, works well on such computers.

Loading the data in chunks, using the same pandas function, solves this problem.
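
For illustration, here is a minimal sketch of the chunked approach. It is not necessarily the code in this PR: the function name `read_mtx_chunked`, the column names, and the dtypes are assumptions, and the input is assumed to be a Matrix Market coordinate file opened in text mode. The idea is to read the size line first, preallocate the coordinate arrays, and let `pd.read_csv(..., chunksize=...)` fill them piece by piece so that only one chunk is held as a DataFrame at a time.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def read_mtx_chunked(file, chunksize=10_000_000):
    """Read a Matrix Market coordinate file into a CSR matrix, chunk by chunk.

    `file` is a text-mode file object (hypothetical helper, for illustration).
    """
    # Skip the banner and any '%' comment lines; the next line holds the sizes.
    line = file.readline()
    while line.startswith("%"):
        line = file.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())

    # Preallocate the final arrays once (dtypes are an assumption).
    rows = np.empty(n_entries, dtype=np.int32)
    cols = np.empty(n_entries, dtype=np.int32)
    data = np.empty(n_entries, dtype=np.float32)

    # With `chunksize`, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(
        file,
        sep=r"\s+",
        header=None,
        names=["row", "col", "value"],
        dtype={"row": np.int32, "col": np.int32, "value": np.float32},
        chunksize=chunksize,
    )
    start = 0
    for chunk in reader:
        stop = start + len(chunk)
        # Matrix Market indices are 1-based; convert to 0-based.
        rows[start:stop] = chunk["row"].to_numpy() - 1
        cols[start:stop] = chunk["col"].to_numpy() - 1
        data[start:stop] = chunk["value"].to_numpy()
        start = stop

    return sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))
```

Peak memory is then roughly the three preallocated arrays plus one chunk, rather than the arrays plus a DataFrame covering the whole file.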

codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.41%. Comparing base (9741ca6) to head (73c2a21).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3389      +/-   ##
==========================================
- Coverage   75.44%   75.41%   -0.04%     
==========================================
  Files         113      113              
  Lines       13250    13266      +16     
==========================================
+ Hits         9997    10005       +8     
- Misses       3253     3261       +8     
Files with missing lines Coverage Δ
src/scanpy/datasets/_ebi_expression_atlas.py 94.44% <100.00%> (+0.46%) ⬆️

... and 3 files with indirect coverage changes

Member

@flying-sheep flying-sheep left a comment

Good idea! Some small notes:

Contributor Author

gjeuken commented Feb 22, 2025

Please note that the chunk size of 1e7 was chosen rather arbitrarily.
This value solves the loading issue for computers with 8 or 16 GB of RAM without looping through too many chunks.
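
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. The helper `chunk_plan` is hypothetical, and the figure of 24 bytes per row assumes three 8-byte columns (row index, column index, value); pandas' own parsing overhead is not included, so real usage will be higher.

```python
import math


def chunk_plan(n_entries, chunksize=10_000_000, bytes_per_row=3 * 8):
    """Return (number of chunks, approximate bytes held by the largest chunk).

    bytes_per_row is an assumption: three 8-byte columns per entry.
    """
    n_chunks = math.ceil(n_entries / chunksize)
    largest_chunk_rows = min(n_entries, chunksize)
    return n_chunks, largest_chunk_rows * bytes_per_row


# A file with one billion non-zero entries, read with the 1e7 chunk size:
n_chunks, chunk_bytes = chunk_plan(1_000_000_000)
print(n_chunks, chunk_bytes / 1e6)  # 100 chunks of about 240 MB each
```

Under these assumptions, a smaller chunk size lowers the per-chunk footprint but increases the number of loop iterations proportionally.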

@flying-sheep flying-sheep added this to the 1.11.1 milestone Feb 25, 2025
Member

flying-sheep commented Feb 25, 2025

Thank you! Since you don’t seem interested in appearing in a release note:

[image]

… I’ll merge this as-is. If you want a release note after all, please comment, and I’ll add one!

@flying-sheep flying-sheep merged commit f6a665b into scverse:main Feb 25, 2025
15 of 16 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Feb 25, 2025
flying-sheep pushed a commit that referenced this pull request Feb 25, 2025
4 participants