Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow comparison of XVector objects (part 2 of 2) #6

Open
wants to merge 3 commits into
base: devel
Choose a base branch
from

Conversation

ahl27
Copy link
Contributor

@ahl27 ahl27 commented Jan 7, 2025

This is part 2 of a PR, you can find part 1 here: Bioconductor/S4Vectors#127

Background is covered in that PR description; this one will just cover was wasn't mentioned there.

This PR implements the XVector methods required to allow comparisons between XVectors. After merging both these PRs, the following are now supported:

x <- DNAString("ATGC")
x[order(x)]
## 4-letter DNAString object
## seq: ACGT

x == "A"
## [1] FALSE FALSE FALSE  TRUE

pcompare(x, DNAString("GCCC")) == 0
## [1]  TRUE  TRUE FALSE FALSE

x <- as(1:10, "XInteger")
x == 1:10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

x <= 10:1
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

identical(10:1 < x, 10:1 < as.integer(x))
## [1] TRUE

And similarly for XRaw and XDouble objects.

Known Issues:

  1. Different sorting methods aren't supported for SharedVector.order or XVector.order. That implementation will depend on the backend being implemented in S4Vectors (see this comment).
  2. No sameAsPreviousROW function exists for SharedVector objects, only XVector. Do we need one for SharedVector? Since the class isn't exported I'm not sure if it's necessary.

@hpages
Copy link
Contributor

hpages commented Jan 18, 2025

Hi Aidan,

Sorry for the slow response.

Comparing 2 DNAString objects (or other XString derivatives) with == or != works atomically i.e. it doesn't compare the 2 objects letter-wise but as a whole:

library(Biostrings)

DNAString("GACC") == DNAString("GACC")
# [1] TRUE

x <- DNAString("GACC")
y <- DNAString("GACCTAT")
x == y
# [1] FALSE

I made this decision a long time ago, based on my feeling at the time that this semantic would be more useful than a vectorized letter-wise comparison. The letter-wise comparison can be useful too, and you can obtain it by coercing the 2 objects to raw vectors:

as.raw(x) == as.raw(y)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
# Warning message:
# In as.raw(x) == as.raw(y) :
#   longer object length is not a multiple of shorter object length

I don't know if this is documented somewhere but it probably should.

Had I made the decision to go the other way around (i.e. have == do a letter-wise comparison), then the user would still have the ability to perform atomic comparison with something like length(x) == length(y) && all(x == y). However, note that this wouldn't be as efficient as an optimized atomic comparison, especially when comparing long sequences of hundreds of millions of nucleotides (like Human chromosomes). In such case, x == y would expand to a big logical vector of 4 * length(x) bytes in memory (i.e. 1 Gb for Human chr1).

Note that XRaw objects (which XString objects derive from) also compare atomically:

xraw1 <- as(charToRaw('q5^#.*A'), "XRaw")
xraw2 <- as(charToRaw('q5^#.*a'), "XRaw")
xraw1 != xraw2
# [1] TRUE

Anyways, for better or worse, we're stuck with the atomic comparison 😉

This means that <=, >=, <, and > should follow.

Note that all these comparisons are already implemented for XStringSet derivatives so an easy way to make them work on XString objects is to coerce the latter to the former (this coercion is a no-copy operation so is very cheap). Something like this:

setMethod("<=", c("XString", "XString"),
    function(e1, e2) as(e1, "XStringSet") <= as(e2, "XStringSet")
)

Note that there's no need to define the >=, <, and > methods. These work out-of-the-box, as long as <= works, thanks to the default >=, <, and > methods defined in the S4Vectors package for all Vector derivatives e.g.:

> selectMethod('<', c("DNAString", "DNAString"))
Method Definition:

function (e1, e2) 
{
    !(e2 <= e1)
}
<bytecode: 0x60cf37611818>
<environment: namespace:S4Vectors>

Signatures:
        e1          e2         
target  "DNAString" "DNAString"
defined "Vector"    "Vector"   

Thanks for working on this. The S4Vectors/IRanges/XVector/Biostrings code base is a maze with 2 levels, the R level and the C level, each of them split across 4 packages. So 8 regions! It can be hard to navigate and understand the interactions between the 8 regions. It's great that you were able to find your way from the R level in Biostrings all the way down to the C level in S4Vectors. This was a good exercise that will help you in the future.

Thanks again and let me know if you have questions,

H.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants