Resource Request transform under-requests glideins in certain circumstances #197
Comments
Just to note that these behaviors continue in the current production release of the decision engine, 1.1.
Just also to note that the effect is occurring in production as we speak: there are two jobs for GM2 but no glideins submitted at all. When more than one factory entry is matched to one group, we tend to split the jobs out in such a way that no glideins are requested from either entry.
There are three major cases currently that affect production on a regular basis.
There is now a cross-referenced issue in the glideinwms tracker.
Marco said in the stakeholder meeting that this will be fixed in glideinwms 3.6.3. We need to figure out how he plans to do that.
I am currently seeing the following issue:
Found in channel cms_job_classification
+----+---------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+----------+
| | Frontend_Group | Job_Bucket_Criteria_Expr | Site_Bucket_Criteria_Expr | Totals |
|----+---------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+----------|
| 0 | cms_jetstream_passthrough | x509UserProxyVOName=='cms' and (DESIRED_Sites.str.contains('T3_US_OSG')) | [u"(GLIDEIN_CMSSite=='T3_US_OSG') and GLIDEIN_Site=='JetStreamTACC' and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
| 1 | cms_tacc_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_TACC') and (REQUIRED_OS=='rhel7' or REQUIRED_OS=='any') | [u"(GLIDEIN_CMSSite=='T3_US_TACC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 52466 |
| 2 | cms_nersc_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel6') | [u"GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel6'"] | 41015 |
| 3 | cms_nersc_passthrough_sl7 | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel7') | [u"GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel7'"] | 52466 |
| 4 | cms_sdsc_passthrough | x509UserProxyVOName=='cms' and (DESIRED_Sites.str.contains('T3_US_SDSC')) | [u"(GLIDEIN_CMSSite=='T3_US_SDSC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
| 5 | cms_xsede_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_PSC') | [u"(GLIDEIN_CMSSite=='T3_US_PSC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
+----+---------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+----------+
About 93000 idle jobs overall (93481 per the Totals column), including 41015 for SL6.
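As a quick arithmetic check on the Totals column, the sketch below uses only the numbers copied from the table. The interpretation is an assumption rather than something the logs state: the rhel6-only and rhel7-only NERSC groups add up exactly to the total reported by the groups with no OS restriction, which is consistent with all six groups counting the same underlying set of idle jobs.

```python
# Totals copied from the cms_job_classification table in this issue.
totals = {
    "cms_jetstream_passthrough": 93481,   # no OS restriction in the job query
    "cms_tacc_passthrough": 52466,        # rhel7 or any
    "cms_nersc_passthrough": 41015,       # rhel6 only
    "cms_nersc_passthrough_sl7": 52466,   # rhel7 only
    "cms_sdsc_passthrough": 93481,        # no OS restriction in the job query
    "cms_xsede_passthrough": 93481,       # no OS restriction in the job query
}

# Sum of the two OS-specific NERSC groups.
os_split = totals["cms_nersc_passthrough"] + totals["cms_nersc_passthrough_sl7"]

# Total seen by a group that accepts any OS.
unrestricted = totals["cms_sdsc_passthrough"]

print(os_split, unrestricted, os_split == unrestricted)
```

If the groups really do overlap this way, jobs running for one group will also show up as running for the others.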
The mapping of DE/FE groups to factory entries is 1:1, i.e.:
cms_tacc_passthrough -> CMSHTPC_T3_US_TACC (sl7 only)
cms_xsede_passthrough -> CMSHTPC_T3_US_Bridges (both)
cms_nersc_passthrough -> CMSHTPC_T3_US_NERSC_Cori_KNL (sl6 only)
cms_nersc_passthrough_sl7 -> CMSHTPC_T3_US_NERSC_Cori_KNL_SL7 (sl7 only)
cms_jetstream_passthrough -> OSG_US_TACC_JETSTREAM (both)
cms_sdsc_passthrough -> CMSHTPC_T3_US_SDSC-osg_comet_frontend
For purposes of this ticket the one in question is CMSHTPC_T3_US_NERSC_Cori_KNL, the SL6 NERSC entry.
From cms_resource_request.log we get:
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - --------------------------------------------
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Processing glidein requests for the FE Group: cms_nersc_passthrough
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Frontend Group cms_nersc_passthrough job query: x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel6')
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Frontend Group cms_nersc_passthrough site matching expression : GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel6'
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - --------------------------------------------
2019-12-16 11:20:29,801 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Number of credentials found from the configuration 2
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Jobs found total 51024 idle 41015 (good 41015, old(10min 40310, 60min 38280), grid 41015, voms 41015) running 10009
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Group slots found total 0 (limit 60000 curb 59000) idle 0 (limit 60000 curb 59000) running 0
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Frontend slots found total 641 (limit 170000 curb 167000) idle 4 (limit 35000 curb 25000) running 641
2019-12-16 11:20:30,121 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Overall slots found total 7339 (limit 170000 curb 167000) idle 800 (limit 35000 curb 25000) running 6684
2019-12-16 11:20:32,564 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Number of credentials found: 2
2019-12-16 11:20:32,660 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Jobs in schedd queues | Slots | Cores | Glidein Req | Factory Entry Information
2019-12-16 11:20:32,660 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Idle (match eff old uniq ) Run ( here max ) | Total Idle Run Fail | Total Idle Run | Idle MaxRun | State FigureOfMerit EntryName
2019-12-16 11:20:32,673 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 0(mc 0, min 0), available slots 0
2019-12-16 11:20:32,674 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered: NoEffectiveIdle: no glidein is needed
2019-12-16 11:20:32,679 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 60000) | 0 0 0 0 | 0 0 0 | 0 0 | Down 0.0060 CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_fermifactory02@[email protected]
2019-12-16 11:20:32,690 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0
2019-12-16 11:20:32,691 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered:
2019-12-16 11:20:32,696 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 41015(41015 41015 40310 0) 10009( 0 60000) | 0 0 0 0 | 0 0 0 | 17 82 | Up 0.0024 CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@[email protected]
2019-12-16 11:20:32,705 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 0(mc 0, min 0), available slots 0
2019-12-16 11:20:32,705 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered: NoEffectiveIdle: no glidein is needed
2019-12-16 11:20:32,709 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 60000) | 0 0 0 0 | 0 0 0 | 0 0 | Down 0.0012 CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_fermifactory02@[email protected]
So we are requesting only 17 idle glideins for a group in which there are 41015 idle jobs:
CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0
It should be pointed out that the job content of these six different groups is almost the same, differing only
by OS: some entries take both, and some take just one or the other. So it reports that 10009 of this
type of job are already running somewhere else in the global pool. That statement is true, but it greatly cuts
down the number of glideins that we would like submitted to NERSC in this case. If the
DE considers this group in isolation, we need 603 glideins' worth of cores, and one third of that should be 201.
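The 603 and 201 figures can be reproduced with the following sketch. It is only an illustration: the value of 68 cores per glidein is an assumption inferred from 41015 / 603 and is not stated in this issue, single-core jobs are assumed, and the helper name is hypothetical rather than part of the decision engine or glideinwms.

```python
# Values taken from the log excerpt in this issue.
IDLE_JOBS = 41015        # idle jobs matching cms_nersc_passthrough
REQUESTED_IDLE = 17      # "Glidein Req Idle" column for CMSHTPC_T3_US_NERSC_Cori_KNL

# ASSUMPTION: 68 cores per glidein (inferred from 41015 / 603, not stated in the issue).
CORES_PER_GLIDEIN = 68

# One third of the need is requested, as described in the text above.
REQUEST_DIVISOR = 3


def glideins_in_isolation(idle_jobs: int, cores_per_glidein: int) -> int:
    """Whole glideins needed if the group is considered alone (single-core jobs assumed)."""
    return idle_jobs // cores_per_glidein


needed = glideins_in_isolation(IDLE_JOBS, CORES_PER_GLIDEIN)
expected_request = needed // REQUEST_DIVISOR

print(f"needed in isolation: {needed}")
print(f"expected request:    {expected_request}")
print(f"actual request:      {REQUESTED_IDLE}")
```

Under these assumptions the actual request of 17 is roughly a twelfth of what the group would ask for if it were considered alone.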
At other times when I have been watching the decision engine, the "mc" count in the line
CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0
goes much higher than 27, a few hundred glideins are suddenly requested, and then it drops back down to these levels.
Please investigate why the count of glideins requested is artificially low and whether there is anything that could
explain the fluctuation.
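To follow the "mc" value over time, one option is to pull the "Request ..." lines out of cms_resource_request.log. The sketch below is a minimal example whose regular expression is inferred only from the log lines quoted in this issue. It is not a decision engine or glideinwms API, and the function name is hypothetical.

```python
import re

# Pattern inferred from the "Request <entry>@...: prop jobs N(mc X, min Y), available slots Z"
# lines quoted in this issue; other log versions may differ.
REQUEST_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Request (?P<entry>[^@\s]+)@\S*: "
    r"prop jobs (?P<prop>\d+)\(mc (?P<mc>[\d.]+), min (?P<min>\d+)\), "
    r"available slots (?P<slots>\d+)"
)


def parse_requests(lines):
    """Yield one dict per 'Request' log line; all other lines are skipped."""
    for line in lines:
        m = REQUEST_RE.match(line)
        if m:
            yield {
                "ts": m.group("ts"),
                "entry": m.group("entry"),
                "prop_jobs": int(m.group("prop")),
                "mc": float(m.group("mc")),
                "min": int(m.group("min")),
                "available_slots": int(m.group("slots")),
            }


# Two lines copied from the log excerpt in this issue, plus one non-matching line.
SAMPLE = [
    "2019-12-16 11:20:32,673 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 0(mc 0, min 0), available slots 0",
    "2019-12-16 11:20:32,674 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered: NoEffectiveIdle: no glidein is needed",
    "2019-12-16 11:20:32,690 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0",
]

rows = list(parse_requests(SAMPLE))
for row in rows:
    print(row["ts"], row["entry"], row["prop_jobs"], row["mc"])
```

Running this over a longer stretch of the log and filtering on the CMSHTPC_T3_US_NERSC_Cori_KNL entry would show when the "mc" value jumps and when it falls back.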
We have only seen this behavior thus far in the decision engine (standard library version 0.3.14, which is the current version). There is enough similarity in the glidein request code to make me believe it must also happen in the frontend, but I have no direct evidence of that. Factory version is 3.4.5, if it matters.
Steve Timm