Block compression support for CRAM writer #320

Merged on Aug 26, 2024 (2 commits)

@athos athos (Member) commented Aug 21, 2024

This PR adds support for block compression to the CRAM writer.

The primary features are as follows:

  • Blocks are compressed using the compression method specified for each data series and tag encoding
    • The currently available compression methods are :raw, :gzip, :bzip, :lzma, and :best; :best automatically selects the method that achieves the highest compression ratio
  • The default compression methods are:
    • data series: :gzip, although the default compression method for each data series may be tuned in the future
    • tags: :best
  • The compression methods for data series and tags can be overridden using the ds-compressor-overrides and tag-compressor-overrides options, respectively (explained below)

Block compression is applied along with record encoding rather than after all the records have been fully encoded.

Compressor overrides

To override the compression method for data series and tags, pass the :ds-compressor-overrides and :tag-compressor-overrides options to the CRAM writer:

(require '[cljam.io.cram :as cram])

(def writer
  (cram/writer "path/to/cram/file"
               { ...
                :ds-compressor-overrides <ds compressor overrides>
                :tag-compressor-overrides <tag compressor overrides>
                ... }))

The full specification of ds-compressor-overrides and tag-compressor-overrides is somewhat intricate. The description here is not exhaustive; it aims to give the big picture and some practical usage examples.

ds-compressor-overrides

ds-compressor-overrides is a function that takes a keyword representing a data series and returns a keyword representing a compression method. Here are some examples:

  • To compress only the BF and CF data series with :bzip and leave the others with their default methods
    :ds-compressor-overrides (fn [ds] (when (#{:BF :CF} ds) :bzip))
  • Since Clojure maps also behave like functions, the same thing can be written using a map
    :ds-compressor-overrides {:BF :bzip, :CF :bzip}
  • To change the compression method for all data series to :bzip
    :ds-compressor-overrides (constantly :bzip)
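
The three forms above are interchangeable because the option only needs something that can be called with a data series keyword. The following REPL sketch uses plain Clojure and no cljam calls. Reading a nil result as "keep the default method" is an assumption drawn from the `when` example above, and :RL is included only as a data series that none of the selective overrides mention.

```clojure
;; Three override styles, evaluated as ordinary Clojure functions.
(def by-fn    (fn [ds] (when (#{:BF :CF} ds) :bzip)))
(def by-map   {:BF :bzip, :CF :bzip})
(def all-bzip (constantly :bzip))

;; nil is assumed to mean "no override, use the default method".
(mapv by-fn    [:BF :CF :RL]) ;=> [:bzip :bzip nil]
(mapv by-map   [:BF :CF :RL]) ;=> [:bzip :bzip nil]
(mapv all-bzip [:BF :CF :RL]) ;=> [:bzip :bzip :bzip]
```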

The function can also return a set of compression method keywords. In this case, the method that achieves the highest compression ratio is selected from the specified set:

  • To try both :bzip and :lzma on the BF data series and keep the more efficient one
    :ds-compressor-overrides {:BF #{:bzip :lzma}}
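
To illustrate what selecting from a set of methods could look like, here is a minimal sketch that compresses the same bytes with every candidate and keeps the smallest result. It is not cljam's implementation: `compressors` and `smallest-method` are hypothetical names, and only :raw and :gzip are implemented because the JDK ships no bzip2 or LZMA codec.

```clojure
(import '(java.io ByteArrayOutputStream)
        '(java.util.zip GZIPOutputStream))

(defn gzip-bytes
  "Gzip-compresses a byte array using only the JDK."
  [^bytes bs]
  (let [out (ByteArrayOutputStream.)]
    (with-open [gz (GZIPOutputStream. out)]
      (.write gz bs))
    (.toByteArray out)))

;; Hypothetical table from method keyword to compression function.
(def compressors {:raw identity, :gzip gzip-bytes})

(defn smallest-method
  "Returns [method compressed-bytes] for the candidate with the smallest output."
  [candidates ^bytes bs]
  (->> candidates
       (map (fn [m] [m ((compressors m) bs)]))
       (apply min-key (fn [[_ out]] (alength ^bytes out)))))

;; Highly repetitive input: gzip wins.
(def repetitive (.getBytes ^String (apply str (repeat 1000 "ACGT"))))
(first (smallest-method #{:raw :gzip} repetitive))      ;=> :gzip

;; A single byte: the gzip header alone is larger than the input, so :raw wins.
(first (smallest-method #{:raw :gzip} (.getBytes "A"))) ;=> :raw
```

The second call shows why restricting or widening the candidate set can matter: for very small blocks, leaving the data uncompressed can be the most compact choice.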

Additionally, the function can return another function to further condition the compression method based on the codec:

  • To compress all the blocks encoded with the :external codec using the :bzip compression method:
    :ds-compressor-overrides (constantly {:external :bzip})
  • To compress all the blocks allocated for the :len-encoding of the :byte-array-len codec using :lzma and the blocks for the :val-encoding using :bzip
    :ds-compressor-overrides (constantly {:byte-array-len/len :lzma, :byte-array-len/val :bzip})
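
The shapes described so far (a keyword, a set, or a nested function keyed by codec) can be read as a two-step lookup. The sketch below is one way to picture it, not the writer's actual resolution code: `resolve-method` is a hypothetical helper, and falling back to a default whenever a lookup yields nil is an assumption.

```clojure
(defn resolve-method
  "Hypothetical helper: looks up the override for data series `ds` and codec
  key `codec`, falling back to `default` when nothing matches."
  [overrides ds codec default]
  (let [r (when overrides (overrides ds))]
    (cond
      (nil? r)                   default
      ;; Keywords and sets are also callable in Clojure, so test them first.
      (or (keyword? r) (set? r)) r
      ;; Otherwise treat the result as a function (or map) keyed by codec.
      :else                      (or (r codec) default))))

(resolve-method (constantly {:external :bzip}) :BF :external :gzip)
;=> :bzip
(resolve-method (constantly {:external :bzip}) :BF :byte-array-len/len :gzip)
;=> :gzip
(resolve-method {:BF #{:bzip :lzma}} :BF :external :gzip)
;=> #{:bzip :lzma}
(resolve-method {:BF :bzip} :CF :external :gzip)
;=> :gzip
```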

For more detailed usage, see the test code for ds-compressor-overrides.

tag-compressor-overrides

tag-compressor-overrides works similarly to ds-compressor-overrides, but can also add conditions based on tag types by returning a function:

  • To change the compression method for all tags to :bzip
    :tag-compressor-overrides (constantly :bzip)
  • To compress the XA:c tag with :gzip and the XA:i tag with :bzip
    :tag-compressor-overrides {:XA {\c :gzip, \i :bzip}}
  • To compress all the blocks allocated for the :val-encoding of all the tags of type Z using :bzip
    :tag-compressor-overrides (constantly {\Z {:byte-array-len/val :bzip}})
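
Because the tag examples above are nested maps (or functions returning maps), their behaviour can be checked at the REPL with ordinary lookups. This sketch uses plain Clojure only; interpreting nil as "use the default method for tags" is an assumption, and the names `per-type` and `z-vals` are illustrative.

```clojure
;; Tag keyword -> tag type character -> compression method.
(def per-type {:XA {\c :gzip, \i :bzip}})

(get-in per-type [:XA \c]) ;=> :gzip
(get-in per-type [:XA \i]) ;=> :bzip
(get-in per-type [:XA \Z]) ;=> nil (no override for this type)
(get-in per-type [:NM \c]) ;=> nil (no override for this tag)

;; Any tag -> type Z -> codec key -> compression method.
(def z-vals (constantly {\Z {:byte-array-len/val :bzip}}))

(-> (z-vals :XA) (get \Z) (get :byte-array-len/val)) ;=> :bzip
(-> (z-vals :MD) (get \Z) (get :byte-array-len/len)) ;=> nil
```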

For more detailed usage, see the test code for tag-compressor-overrides.

@athos athos requested a review from a team August 21, 2024 01:07
@athos athos self-assigned this Aug 21, 2024
@athos athos requested review from niyarin and removed request for a team August 21, 2024 01:07
@athos athos requested review from alumi and a team as code owners August 21, 2024 01:07
@athos athos requested review from r6eve and a team and removed request for a team, alumi, r6eve and niyarin August 21, 2024 01:07

codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 95.28796% with 9 lines in your changes missing coverage. Please review.

Project coverage is 89.72%. Comparing base (66dad5b) to head (3643da8).

Files                                      Patch %   Lines
src/cljam/io/cram/encode/compressor.clj    87.09%    8 Missing ⚠️
src/cljam/io/cram/encode/structure.clj     91.66%    0 Missing and 1 partial ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #320      +/-   ##
==========================================
+ Coverage   89.61%   89.72%   +0.11%     
==========================================
  Files         100      101       +1     
  Lines        9032     9129      +97     
  Branches      480      480              
==========================================
+ Hits         8094     8191      +97     
  Misses        458      458              
  Partials      480      480              


@alumi alumi self-requested a review August 21, 2024 06:55
@alumi alumi self-assigned this Aug 21, 2024
@r6eve r6eve (Contributor) left a comment

LGTM! Thank you for the useful overrides.

@r6eve r6eve merged commit b0b8cb4 into master Aug 26, 2024
18 checks passed
@r6eve r6eve deleted the feature/block-compression branch August 26, 2024 09:39