Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

questions about caching with path #442

Closed
mkoohafkan opened this issue Feb 15, 2024 · 3 comments · Fixed by #531
Closed

questions about caching with path #442

mkoohafkan opened this issue Feb 15, 2024 · 3 comments · Fixed by #531
Labels
bug an unexpected problem or unintended behavior

Comments

@mkoohafkan
Copy link

mkoohafkan commented Feb 15, 2024

I'm a little confused about if/how caching works when path is specified. Modifying the example in the documentation for req_cache():

library(httr2)

# create cache directory
td = file.path(tempdir(), "cache")
dir.create(td)

url <- paste0(
  "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/",
  "master/inst/extdata/penguins.csv"
)

# Here I set debug = TRUE so you can see what's happening
req <- request(url) |> req_cache(td, debug = TRUE)

# First request downloads the data
tf <- tempfile(fileext = ".csv")
resp <- req |> req_perform(path = tf)

toString(resp$body)
## [1] "C:\\Temp\\1\\RtmpExocAw\\file53f823441cef.csv"

# Second request retrieves it from the cache
tf2 <- tempfile(fileext = ".csv")
resp <- req |> req_perform(path = tf2)
## Found url in cache "d5d1ddd7f99f55dbc920c63f942804c0"
## Cached value is fresh; retrieving response from cache

toString(resp$body)
## [1] "C:\\Temp\\1\\RtmpExocAw/foo/d5d1ddd7f99f55dbc920c63f942804c0.body"

file.exists(tf2)
## [1] FALSE


# wait a while, cache is now stale
tf3 <- tempfile(fileext = ".csv")
 resp <- req |> req_perform(path = tf3)
## Found url in cache "d5d1ddd7f99f55dbc920c63f942804c0"
## Cached value is stale; checking for updates
## Cached value still ok; retrieving body from cache

toString(resp$body)
## [1] "C:\\Temp\\1\\RtmpExocAw\\file53f815c51993.csv"

file.exists(tf3)
## [1] TRUE

When the cached value is fresh, the response returns a different path, which makes sense since there is no guarantee that the path specified in the original call still exists when the request is made a second time. My questions are:

  1. Is the file downloaded once and written to both the path and cache? (Appears to be the case).
  2. Is there a way to tell the cache to use the same file extension as specified in path? I can see this possibly being an issue for some functions that expect a certain file extension.
  3. If the cache is stale, it appears to re-download the file. If it decides the cached value is still ok, it claims to retrieve the body from cache but actually provides path. It seems to do this for every subsequent request, i.e., the cache is not "refreshed" and it continues to think the cache is stale. Is this a bug?
@hadley
Copy link
Member

hadley commented Feb 20, 2024

Hmmmm, I'm not sure how well I thought through the case of using path with cache files, because the behaviour certainly doesn't look correct to me. I think the behaviour you see in the first request is correct: we save the file to the cache and copy it to the requested path. But we also need to do that for the second request, so that the specified path is actually used (that would fix the problem with the extension too). I'm not sure what's going on with the stale cache; that definitely sounds like a bug.

I don't have time to look into this in more detail right now, but I'll definitely take a look and fix when I'm next working on httr2 so thanks for filing this issue!

@hadley hadley added the bug an unexpected problem or unintended behavior label Feb 20, 2024
@mkoohafkan
Copy link
Author

mkoohafkan commented Mar 4, 2024

Regarding question 2, one approach might be to pass path to cache_pre_fetch(), and replace

cache_get(req)

with something like

    cached_req = cache_get(req)
    if (!is.null(path)) {
      file.copy(cached_req$body, path, overwrite = TRUE)
      cached_req$body <- path
    }
    cached_req

But I haven't thought through the full consequences of that.

@hadley
Copy link
Member

hadley commented Sep 3, 2024

Simpler reprex:

library(httr2)
library(testthat, warn.conflicts = FALSE)

req <- request(example_url()) |>
  req_url_path("/cache/2") |> 
  req_cache(tempfile(), debug = TRUE)

path1 <- tempfile()
resp1 <- req |> req_perform(path = path1)
#> Pruning cache
#> Saving response to cache "38c683fd8d6c408b437f509bc0f0ca9b"
expect_equal(resp1$body[[1]], path1)

path2 <- tempfile() 
resp2 <- req |> req_perform(path = path2)
#> Found url in cache "38c683fd8d6c408b437f509bc0f0ca9b"
#> Cached value is fresh; using response from cache
expect_equal(resp2$body[[1]], path2)
#> Error: resp2$body[[1]] not equal to `path2`.
#> 1/1 mismatches
#> x[1]: "/tmp/RtmpgJztVU/file79df1addc1e6/38c683fd8d6c408b437f509bc0f0ca9b.body"
#> y[1]: "/tmp/RtmpgJztVU/file79df2a6fb4d6"

Sys.sleep(2) # wait for cache to expire
path3 <- tempfile() 
resp3 <- req |> req_perform(path = path3)
#> Found url in cache "38c683fd8d6c408b437f509bc0f0ca9b"
#> Cached value is stale; checking for updates
#> Saving response to cache "38c683fd8d6c408b437f509bc0f0ca9b"
expect_equal(resp3$body[[1]], path3)

path4 <- tempfile() 
resp4 <- req |> req_perform(path = path4)
#> Found url in cache "38c683fd8d6c408b437f509bc0f0ca9b"
#> Cached value is fresh; using response from cache
expect_equal(resp4$body[[1]], path4)
#> Error: resp4$body[[1]] not equal to `path4`.
#> 1/1 mismatches
#> x[1]: "/tmp/RtmpgJztVU/file79df1addc1e6/38c683fd8d6c408b437f509bc0f0ca9b.body"
#> y[1]: "/tmp/RtmpgJztVU/file79df529ab65d"

Created on 2024-09-03 with reprex v2.1.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants