Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Deserialization (part two) -- warning message incorrect? #209

Open
kratsg opened this issue Aug 27, 2019 · 4 comments
Open

Cache Deserialization (part two) -- warning message incorrect? #209

kratsg opened this issue Aug 27, 2019 · 4 comments

Comments

@kratsg
Copy link

kratsg commented Aug 27, 2019

This is related to #148 - but I would prefer not to necro. I'm seeing this repeatedly when I make the same request over and over again. I'll detail everything I'm using right now.

(stare) $ python --version
Python 3.7.3
(stare) $ pip --version
pip 19.2.3 from /Users/kratsg/.virtualenvs/stare/lib/python3.7/site-packages/pip (python 3.7)
Package             Version   Location               
------------------- --------- -----------------------
appdirs             1.4.3     
aspy.yaml           1.3.0     
atomicwrites        1.3.0     
attrs               19.1.0    
bandit              1.6.2     
betamax             0.8.1     
betamax-serializers 0.2.1     
black               19.3b0    
bleach              3.1.0     
bumpversion         0.5.3     
CacheControl        0.12.5    
certifi             2019.6.16 
cfgv                2.0.1     
chardet             3.0.4     
Click               7.0       
coverage            4.5.4     
docutils            0.15.2    
ecdsa               0.13.2    
future              0.17.1    
gitdb2              2.0.5     
GitPython           3.0.2     
identify            1.4.6     
idna                2.8       
importlib-metadata  0.19      
lockfile            0.12.2    
more-itertools      7.2.0     
msgpack             0.6.1     
nodeenv             1.3.3     
packaging           19.1      
pbr                 5.4.2     
pip                 19.2.3    
pkginfo             1.5.0.1   
pluggy              0.12.0    
pre-commit          1.18.2    
py                  1.8.0     
pyasn1              0.4.6     
pyflakes            2.1.1     
Pygments            2.4.2     
pyparsing           2.4.2     
pytest              5.1.1     
pytest-cov          2.7.1     
pytest-mock         1.10.4    
python-dotenv       0.10.3    
python-jose         3.0.1     
PyYAML              5.1.2     
readme-renderer     24.0      
requests            2.22.0    
requests-toolbelt   0.9.1     
rsa                 4.0       
setuptools          41.2.0    
six                 1.12.0    
smmap2              2.0.5     
stare               0.0.2     /Users/kratsg/stare/src
stevedore           1.30.1    
toml                0.10.0    
tqdm                4.35.0    
twine               1.13.0    
urllib3             1.25.3    
virtualenv          16.7.4    
wcwidth             0.1.7     
webencodings        0.5.1     
wheel               0.33.6    
zipp                0.6.0     

Whenever I make the same request over and over (no warning the first time when there's no cache) but on subsequent requests, I get a warning...

cache entry deserialization failed, entry ignored (controller.py:147)

which indicates this line here: https://github.com/ionrock/cachecontrol/blob/0234b80/cachecontrol/controller.py#L147

So I'm going manually through the steps to find where the crash/error is. I see that we're using v4 so msgpack is being used:

>>> from cachecontrol.serialize import Serializer
>>> s = Serializer()
>>> data = open('.webcache/6/0/a/6/d/60a6d7942f39f888d6dc7b3a89964aea0a63c8f708eef2bdefd601be', 'rb').read()
>>> s.loads('', data)

No error there, so digging in more (given that there's no error on the empty request, I suspect it is not related to the cache but to the request itself...). So reverse-engineering what serialize.py is doing:

>>> ver, data = data.split(b",", 1)
>>> ver
b'cc=4'
>>> ver = ver.split(b"=", 1)[-1].decode("ascii")
>>> ver
'4'
>>> cached = msgpack.loads(data, raw=False)
>>> cached
{'response': {'body': b'<body suppressed for now>', 'headers': {'Date': 'Tue, 27 Aug 2019 14:51:20 GMT', 'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9 PHP/5.4.16', 'Vary': 'Authorization,Accept-Encoding,User-Agent', 'X-Powered-By': 'PHP/5.4.16', 'Set-Cookie': 'PHPSESSID=<COOKIE>; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'private', 'Pragma': 'no-cache', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': 'API-KEY, Authorization', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE', 'Content-Encoding': 'gzip', 'Content-Length': '45806', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/json'}, 'status': 200, 'version': 11, 'reason': 'OK', 'strict': 0, 'decode_content': False}, 'vary': {'Authorization': 'Bearer <BEARER TOKEN>', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.22.0'}}

So now I prep a request to compare against to see how cachecontrol works in the prepare_response section

To do this part, I just insert pdb into the relevant section and just evaluate specific lines, but I found out that the problem is that the headers didn't match (and in fact, I'm making different requests since the Bearer Token changed between requests when I submit repeatedly since I'm not caching the access tokens).

https://github.com/ionrock/cachecontrol/blob/0234b80/cachecontrol/serialize.py#L115-L119

So what ends up happening, to me, is that the warning message being spit out is not very clear at all. In fact, it just (in this case) means that the headers didn't match -- so the cache should be rebuilt. That is cache deserialization did not fail.

@thatch
Copy link
Contributor

thatch commented Feb 15, 2024

I've seen this too; the simplest way to repro is to send a request to an endpoint that serves a Vary: Accept response header, and then send a different Accept request header the second time. It looks to me that the None signals both a corrupt or unacceptable cached resource, and the simplest thing to do is just remove the warning message.

import logging
logging.basicConfig(level=logging.WARNING)

from cachecontrol import CacheControlAdapter
from cachecontrol.caches import SeparateBodyFileCache
from requests.adapters import HTTPAdapter
from requests.sessions import Session

cache_dir = "foo"
sess = Session()
cache_adapter = CacheControlAdapter(
    cache=SeparateBodyFileCache(cache_dir)
)
sess.mount("https://", cache_adapter)

sess.get("https://pypi.org/simple/cachecontrol", headers={"Accept": "text/html"})
sess.get("https://pypi.org/simple/cachecontrol", headers={"Accept": "application/vnd.pypi.simple.v1+json"})

@dvershinin
Copy link
Contributor

@thatch Is your solution just requesting same resource twice with different Accept? I don't understand, can you elaborate?

@thatch
Copy link
Contributor

thatch commented Feb 15, 2024

@dvershinin sorry I wasn't clear -- that's just a simple repro, against a well-behaved server, showing the server using Vary: Accept is enough to trigger this message.

@kratsg
Copy link
Author

kratsg commented Feb 15, 2024

My understanding based on @thatch's reproduction, is that there are perhaps certain keys or fields in the header that do not necessarily change the response from the server (for example the Accept could change the response, and so Vary: Accept is used to tell the cache that you need to also cache based on the value of the header's Accept). My guess is that cachecontrol is not ignoring most of the header fields (and not relying on Vary which admittedly is probably not as well-known?).

Particularly @thatch is expecting two separate caches to be made, rather than trying to (re)use the text/html cache for the json one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants