Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addressing multiple issues with timing out with check-ssl-qualys.rb: #36

Merged
merged 1 commit into from
Mar 27, 2018

Conversation

majormoses
Copy link
Member

@majormoses majormoses commented Mar 17, 2018

  • added an overall timeout parameter to short circuit slow api
  • when you call the api beyond the initial attempt it needs to wait until the report is done. Previously it would sleep for a specified arbitrary value and would often timeout. Now we look at the api response and if it has an ETA and the ETA is larger than the value of --between-checks we use it. We check if its greater than the specified as in many cases this has been found to be unreliable when seeing small numbers but seems legit when the api is slower to complete the report. In some cases we see a 0 eta which one might assume is complete or near complete but often saw several 0's in a row leading me to put in that stipulation on using it.
  • adding documentation about how long it takes and that you probably need to configure your sensu check to have a longer timeout or sensu will kill the process.
  • moved to v3 of the api (backwards compatible) I saw little to no difference in speed but the docs indicate that v2 is deprecated

testing artifact: https://gist.github.com/majormoses/b549f92fc02894a15a842b29f528da6f

Signed-off-by: Ben Abrams [email protected]

Pull Request Checklist

Fixes #35, #33

General

  • Update Changelog following the conventions laid out here

  • Update README with any necessary configuration snippets

  • Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

Purpose

Make this check actually work with defaults consistently

Known Compatibility Issues

Essentially the check may run longer than in previous versions. There are now multiple nobs to fine tune what you want but it now works every time I have tried it.
Knobs:

  • --timeout: this is an overall timeout circuit breaker and defaults to 5 minutes
  • --number-checks: this indicates how many requests will be attempted before giving up, left default at 24
  • --time-between-checks: previously this was a constant amount of time in between polling for report being done. Now this acts more as a fallback value. The api response includes an eta key and value that indicates when the api thinks you should check back. It seemed very temperamental and you still need to have a value to wait between the initial request and polling for completion so we decided to use the same value and treat this as a fallback. In cases where the eta is less than this we use this value instead

@majormoses
Copy link
Member Author

The failure is unrelated to the change and is currently failing on master as well.

@majormoses
Copy link
Member Author

I found this in the tests: https://github.com/sensu-plugins/sensu-plugins-ssl/blob/1.5.0/test/check-ssl-hsts-preloadable_spec.rb#L2-L5 I am gonna comment it out and raise an issue to address it.

@majormoses
Copy link
Member Author

I have commented out the flaky test and raised an issue to address.

'off'
else
'on'
end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if passing startNew=off is needed. The v3 api doc is vague but it seems to suggests the parameter be omitted completely for some reason.

startNew - if set to "on" then cached assessment results are ignored and a new assessment is started. However, if there's already an assessment in progress, its status is delivered instead. This parameter should be used only once to initiate a new assessment; further invocations should omit it to avoid causing an assessment loop.

https://github.com/ssllabs/ssllabs-scan/blob/master/ssllabs-api-docs-v3.md#invoke-assessment-and-check-progress

This seems a bit odd -- perhaps we should ask for clarification?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels pretty straightforward the first request should be with it off and the second one should be on. This assumes that you do not want cached or in-progress results. This keeps the old behavior and I am not sure the value of checking for a new report on a somewhat regular basis if you are gonna take cached results. Maybe we can expose that as a param so the user can decide whether they want it be sure that it has just been initiated. The reason I think that the current implementation makes sense is that say a report was in progress and you make a change to fix it you will not get the new results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could have an option to control if all requests should ask for cached.

@@ -107,11 +129,37 @@ def ssl_check(from_cache)

def ssl_recheck
1.upto(config[:num_checks]) do |step|
json = ssl_check(step != 1)
p "step: #{step}" if config[:debug]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these debug statements are intended to be long term, wouldn't puts be better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well p shows the objects by doing an inspect while puts does a .to_s which makes it harder to spot issues like improper types. For debug purposes it is much better for normal user interaction puts is generally preferable.

p "step: #{step}" if config[:debug]
start_time = Time.now
p "start_time: #{start_time}" if config[:debug]
json = if step == 1
Copy link

@jhoblitt jhoblitt Mar 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be shorted to ssl_check(step != 1)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could and is closer to what was originally there. I write my code for 3 AM troubleshooting as that is when I care the most about being able to grok it. It's more so a matter of preference.

return json if json['status'] == 'READY'
sleep(config[:between_checks])
if json['endpoints'] && json['endpoints'].is_a?(Array)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ruby 2.2 is supposed to be EOL'd in 2 weeks. It sure would be nice to be able to use the safe nav operator here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel you, it's on my list of things to run through all the plugins after it is EOL and remove support for ruby < 2.3

# insanely long time periods. The highest I have seen the eta go was
# around 250 seconds but put it in just in case as the api has very
# erratic response times.
if json['endpoints'].first.is_a?(Hash) && json['endpoints'].first.key?('eta') && json['endpoints'].first['eta'] > config[:between_checks]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expression is a bit long and it requires considering the precedence of && and >. I think it would be better if it was at least newline escaped across several lines.

Copy link
Member Author

@majormoses majormoses Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that, how'd you like to see it? Does something like this work for you?

if json['endpoints'].first.is_a?(Hash) \
  && json['endpoints'].first.key?('eta') \
  && json['endpoints'].first['eta'] > config[:between_checks]
    p "eta: #{json['endpoints'].first['eta']}" if config[:debug]
    sleep(json['endpoints'].first['eta'])
else
    p "sleeping with default: #{config[:between_checks]}" if config[:debug]
    sleep(config[:between_checks])
end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I did not hear back I left it as is, I welcome you to open a PR to clean it up.

README.md Outdated
@@ -46,6 +46,14 @@ or an online CRL:

Critical and Warning thresholds are specified in minutes.

### `bin/check-ssl-qualysis.rb`

Checks the ssllabs qualysis api for grade of your server, this check can be quite long so it should not be scheduled with a low interval and will probably need to adjust the check `timeout` options per the [check attributes spec](https://docs.sensu.io/sensu-core/1.2/reference/checks/#check-attributes) based on my tests you should expect this to take around 3 minutes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/qualysis/qyalys/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

- added an overall timeout parameter to short circuit slow api
- when you call the api beyond the initial attempt it needs to wait until the report is done. Previously it would sleep for a specified arbitrary value and would often timeout. Now we look at the api response and if it has an ETA and the ETA is larger than the value of `--between-checks` we use it. We check if its greater than the specified as in many cases this has been found to be unreliable when seeing small numbers but seems legit when the api is slower to complete the report. In some cases we see a 0 eta which one might assume is complete or near complete but often saw several 0's in a row leading me to put in that stipulation on using it.
- adding documentation about how long it takes and that you probably need to configure your sensu check to have a longer timeout or sensu will kill the process.
- moved to v3 of the api (backwards compatible) I saw little to no difference in speed but the docs indicate that v2 is deprecated

Signed-off-by: Ben Abrams <[email protected]>
@majormoses
Copy link
Member Author

@majormoses majormoses deleted the fix/35 branch March 27, 2018 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants