Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace single curl instance with dynamic pool to allow setting curl options independently from request execution #610

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented May 27, 2019

No description provided.

…options independently from request execution
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 12 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-DBS-PR-test/113/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 12 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-DBS-PR-test/114/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 12 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-DBS-PR-test/115/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor

@yuyiguo just to give you a context. The changes Valentin is proposing here are meant to fix a curl problem that we hit every day in production. Error reported here: dmwm/WMCore#9221

Thanks, Valentin!

@yuyiguo
Copy link
Member

yuyiguo commented May 28, 2019

@vkuznet @amaltaro
We are not going to deploy this For June, right? I will come back this late..

@amaltaro
Copy link
Contributor

Given the high impact of this change, I'd recommend against pushing it now and having only a couple of days to validate it.

@yuyiguo
Copy link
Member

yuyiguo commented May 28, 2019

I need to do the code review and testing before cmsweb-testbed. It would be really hard to make it since production is next Tuesday. Will try, but not guarantee.

@amaltaro
Copy link
Contributor

@yuyiguo what I meant is that you should skip it from this cycle and wait for the July upgrade. It's too dangerous to insert such changes at this stage.

@yuyiguo
Copy link
Member

yuyiguo commented May 28, 2019

@amaltaro
Thanks for correcting me. I read your message too quick .
yes, put it in July release was my plan too.

@amaltaro
Copy link
Contributor

amaltaro commented Jun 4, 2019

BTW, this will of course happen more frequently now that we have DBS running with twice more threads. Which I can confirm by scanning one of the agents logs, it happens 10s of times per day.

@yuyiguo
Copy link
Member

yuyiguo commented Jun 12, 2019

@amaltaro
How many concurrent DBS connections wmagents need? How often these connections shared/reused?

@yuyiguo
Copy link
Member

yuyiguo commented Jun 12, 2019

@vkuznet
"setting curl options independently from request execution" does not mean to make the connection pool. Curl options can be set on single connection too? How to understand that the original issue has to be solved by connection pool?

@amaltaro
Copy link
Contributor

Yuyi, DBS is used in a few different places in central production and most of them are just sequential, so a single thread handling DBS requests.
For the DBS3Upload component, which uploads/inserts data into DBS, we seem to use 4 processes to go through the pool of requests to be made.

Problem reported is actually happening on the sequential application (WorkQueueManager); where a DBSReader (and DbsApi) instance is created:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33

and then reused through all the datasets to be handled (in different http requests though). I don't see where the cache gets deleted as well, so the same DBS object is used among different component cycles.

@yuyiguo
Copy link
Member

yuyiguo commented Jun 13, 2019

Alan,
If we use the pool connection, then we will have a limit on the pool size of 4 or 5. However, I still don't see how the problem will be solved by introduce the pool.
I reuse my connections with different APIs, but I might not do as much times as workQueueManager. Any ideas on how frequent this error happened? or after how many time the error showed up? can we reproduce the error in a simple code?

@amaltaro
Copy link
Contributor

Yuyi, I haven't looked at the internals of pycurl. What I understood is that each http request would use a different (idle) connection, such that the same object isn't shared among different client requests. If that's correct, then I assume the connection pool should have at least the same length of the number of cherrypy threads.

About reproducing this error, I unfortunately don't have any recipe. It happens a few 10s of times per day. Grepping the global workqueue logs, you can see how many times it happened over the last 3 days:

amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190613.log | wc -l
28
amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190612.log | wc -l
18
amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190611.log | wc -l
62

@yuyiguo
Copy link
Member

yuyiguo commented Jun 13, 2019

Alan,
What I understood was that each client held their own pool. So you and I would have different pools if we both connect to DBS at the same time. Then if we both held a pool with the length is the number of cherrypy threads, we will exceed the total number of threads?

This is how WorkQueueManager make DBS connection. Could you pointed me where the code invoke setop?

@vkuznet
Copy link
Contributor Author

vkuznet commented Jun 13, 2019

I think you're mixing different pool concepts. One pool can be on server side and another on client side. My changes create the later, i.e. the client who will get RestApi class instance will get a pool of connection. The pool is dynamic in size, i.e. it will grow to a number of concurrent/independent/parallel invocation of execute API. In other words, the new curl object is added to pool within execute API if it does not exists. Therefore, in your client if you call execute from different threads/processes the pool will grow up to that number of threads/processes. I think it is desired behavior. Of course, we can constraint pool to a certain size, but then in order to get connection from it client code will need to wait when it will be available.

@yuyiguo
Copy link
Member

yuyiguo commented Jun 14, 2019

@vkuznet
Could you add pool size ? And address the question: why connection pool could fix the problem ? The error was from the sequential application (WorkQueueManager).

@vkuznet
Copy link
Contributor Author

vkuznet commented Jun 19, 2019 via email

@yuyiguo
Copy link
Member

yuyiguo commented Jun 19, 2019

@vkuznet
Valentin,
Thanks for the code updating. I still did not get how this will solve the problem?
Yuyi

@vkuznet
Copy link
Contributor Author

vkuznet commented Jun 19, 2019 via email

@yuyiguo
Copy link
Member

yuyiguo commented Jun 19, 2019

@vkuznet
Valentin,
Please see below Alan's comment. It is a single thread/process.
"
Problem reported is actually happening on the sequential application (WorkQueueManager); where a DBSReader (and DbsApi) instance is created:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33
"
You will find his entire comment in this thread. He made the comment about a week ago.

Yuyi

@vkuznet
Copy link
Contributor Author

vkuznet commented Jun 19, 2019

Yuyi, the code you point is telling nothing about usage of dbsClient, while Alan explicitly said quoting:
"in different http requests though". This is where you need to dig since it is how http requests are done. If you provide pointer to the code in WorkQueue which do http requests then we/you can investigate.

@yuyiguo
Copy link
Member

yuyiguo commented Jun 19, 2019

@vkuznet @amaltaro
What I understood "in different http requests though" meant that Alan used the same DBS client connection to call different DBS APIs sequentially. Alan, Could you please point out the code in WorkQueue which do http requests then we/you can investigate as Valentine asked?
Yuyi

@amaltaro
Copy link
Contributor

The traceback is avaialble in the issue opened against WMCore, for full information.

And here is the place where the actual DBS request is made (a simple serverinfo call to evaluate which DBS instance we're talking to):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataLocationMapper.py#L38

For completeness of what I said before, these DBS calls are made in a sequential mode, but I figured that both local and global workqueue are multithreaded. My original report was based on the global workqueue, which runs mostly 2 cherrypy threads performing distinct operations, e.g.:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/ReqMgrInteractionTask.py#L28
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/LocationUpdateTask.py#L23

but diving in the code, they seem to use the same DBSReader and DbsApi object, using this utilitarian function that does caching:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33

If I got all this right, making sure that each thread has its own DBSReader/DbsApi instance would solve this problem. Did I miss anything?

@yuyiguo
Copy link
Member

yuyiguo commented Jun 20, 2019

I think each thread has its own DBSApi object should solve the problem. Could you please test in one of the workqueue?

@vkuznet
Copy link
Contributor Author

vkuznet commented Jun 20, 2019 via email

@amaltaro
Copy link
Contributor

I agree with Valentin. Having DbsApi thread-safe will be important for future larger activities.
I'm also going to reopen the WMCore issue such that I can address it once I get back from vacation.

@yuyiguo
Copy link
Member

yuyiguo commented Jun 21, 2019

Sounds a good plan, Valentin and Alan.
I am very curious that if we still see the same error with the current version when each thread has its own DBS connection object. Alan, could give this a try?
Yuyi

@yuyiguo
Copy link
Member

yuyiguo commented Jan 16, 2020

Valentin,
I came back to this PR. Could you answer the question I wrote in the code comment? How big the pool is going to be? Each requester has its own pool? In other words, a pool is not shared between different requesters?

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 17, 2020

Yuyi,
the pool is a member data of RestApi class, it means that each instance of RestApi class will have their own pool. It also means that of this instance will be used in different threads it will not overlap with other instances since it will carry its own curl pool. Therefore other calls to curl object will not affect each other since curl object within pool belong to concrete instance.

So, it is not about requester, it is about instances of the class. A single requestor may create multiple instance of RestApi and each instance will carry its own pool. The pool will not be shared among different instances of RestApi class.

@yuyiguo
Copy link
Member

yuyiguo commented Jan 17, 2020

Thanks Valentin !
Could you limit the max pool size for each instances?
I will have the NTAS PR today and if it goes well, I may get this PR in for the HG2002 too if Alan can test it.
Yuyi

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 17, 2020

Yuyi, I made it configurable already, please see

def __init__(self, auth=None, proxy=None, additional_curl_options=None, use_shared_handle=False, curl_pool_size=0):

the RestApi contains curl_pool_size parameter which others can use to setup whatever they feel necessary for their use case. Do you want to change the default?

@vkuznet
Copy link
Contributor Author

vkuznet commented Jan 17, 2020

I put 10 as default value of curl_pool_size.

@yuyiguo
Copy link
Member

yuyiguo commented Jan 17, 2020

Thanks, Valentin !

@yuyiguo
Copy link
Member

yuyiguo commented Apr 30, 2020

@vkuznet
I am picking up this again and plan to put it in the May release. I have two questions for your PR.
Q1
def getCurl(self): "Fetch one curl object form the pool or create a new one" if len(self.curl_pool): idx = random.randint(0, len(self.curl_pool)-1) curl = self.curl_pool[idx] else: curl = self.newCurl() return curl
How this getCurl will be thread safe? curl_pool starts with 0 object, then when new ones created and used by execute function, they will be put in the curl_pool. Let's say if the curl_pool has two objects after a while. Then two threads try to get objects from the curl_pool. What will make them to get a different object instead of same one? This is 50% to get one of them.

Q2:
def execute(self, http_request): "Execute given http_request with available curl instance" curl = self.getCurl() res = http_request(curl) if self.curl_pool_size: if len(self.curl_pool) < self.curl_pool_size: self.curl_pool.append(curl) else: self.curl_pool.append(curl) return res
What is the difference curl_pool_size none or 10? if None, the pool size is unlimited?

Thanks,
Yuyi

@vkuznet
Copy link
Contributor Author

vkuznet commented Apr 30, 2020

Yuyi,
you are mixing two different concepts here, the pool within the class and class thread-safeness. The curl pool here refers to class RestApi, i.e. this class holds the pool. Therefore if you use one instance of this class per thread then there is no issues with the pool, while if you use RestApi instance of the class across many threads then the pool (as the class itself) is not thread-safe. Therefore within a thread, the RestApi instance has its own pool and it will speed up things as curl objects within the pool will be reused.

If you want to make RestApi class thread-safe then additional changes will be required, e.g. add a Lock to curl pool.

The same applies to the execute method of the RestApi class. It is only thread-safe if there is one instance of the class.

The point is that a single RestApi class instance can serve multiple requests though. Therefore the pool will be used for them, i.e. all your requests served by the RestApi class instance will go through its own curl pool.

And curl_pool_size controls how large is your pool. If you will not providecurl_pool_size your pool will be unlimited, i.e. the pool will grow with each request.

@yuyiguo
Copy link
Member

yuyiguo commented Apr 30, 2020

@vkuznet
Valentin,

I understood your explanation on the thread safe issue, but how this PR will solve @amaltaro 's original problem. He was running multiple threads and inside a thread the operation was sequential. So 1. thread safe is required. 2. pool is nice to have, but it will not help him since the operation is sequential. Am I right?

If curl_pool_size is none, may the unlimited pool cause memory leaks on the client side? I think if curl_pool_size is none, we make it no pool just one curl object or pool size 1.

Do you think what I suggested make sense? If so, could you please update the PR?

@vkuznet
Copy link
Contributor Author

vkuznet commented May 1, 2020

Yuyi,
I don't have time now to work on this.

As I wrote you should discuss with Alan if thread-safe code is required. This is up to application which uses RestApi, i.e. if one instance of the class is used across many threads than we need to have thread-safe code, if not that it is not required. As I wrote to make code thread safe we need to add Locks to the pool such that curl object from the pool will not be modified between threads. But thread-safeness has nothing to do with sequential/concurrent issue you're talking about. You can still use threads and be sequential or concurrent. In latter case more changes are required to run concurrent requests.

Regarding pool, I provided sensitive default 10. If you want to change code when no pool size is provided then you should modify code accordingly, I don't have time for that anymore.

@yuyiguo
Copy link
Member

yuyiguo commented May 1, 2020

@vkuznet
Thanks for your reply.
@amaltaro
Is thread safe are required for this PR?
Thanks,
Yuyi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants