
fix: resolve issues with skuid-cli #205

Closed
wants to merge 113 commits

Conversation


@techfg techfg commented Aug 13, 2024

Important

This PR is a Work-in-progress (WIP) and should NOT be merged yet. Please review details below carefully.

Jira Issue URL

N/A

High-Level Description

This PR is a WIP and should not be merged yet. It contains proposed fixes to many of the issues currently open in the repo as well as workarounds for issues that aren't directly tied to the CLI itself but exist server side for APIs that the CLI relies on.

Of the 108 issues I've created recently in this repo, this PR currently addresses, in full or in part, 79 of them. I'm also working on addressing approximately 3 more (the remaining issues require Skuid to resolve directly, as they involve documentation updates and/or server-side API fixes). I hope to have those commits pushed in the next week, time permitting.

I am submitting this PR now, rather than waiting until it is fully complete, because there was recent activity from Skuid. I'm hopeful this means other issues will be reviewed as well, so I want to start the ball rolling on gaining input on the issues themselves and on the proposed changes in this PR, as I suspect (and hope) there will be at least some discussion before things can be finalized and merged.

Unfortunately, managing a separate PR for each issue individually would have been nearly impossible given the number of issues present in the CLI and the fact that solutions to some issues depend on solutions to others. For this reason, and while not typical of a PR, I'm submitting a single PR that contains all of them.

As referenced throughout the 108 issues, version 0.6.7 of the CLI is not reliable; only under a very specific set of circumstances does it complete in an expected manner. In most cases, it either fails entirely or behaves unexpectedly. In short, there are issues in the docs, issues in the CLI itself, and issues in the APIs. All of these lead to significant risk of unexpected outcomes when deploying to NLX sites, especially given that the CLI is documented as a solution for migrating to production environments.

Given all the above, it would be ideal if this PR were reviewed ASAP so that solutions for these issues can be officially released and users of the CLI can start to have confidence that management of their NLX sites via the CLI is predictable and reliable.

Regarding the PR itself, a few things:

  1. This is the first time I've used Go, so while the high number of issues has given me ample opportunity to learn the language, the code should be reviewed closely by someone with at least mid-level Go experience.
  2. I am not a Skuid expert and the knowledge I've obtained regarding Skuid is primarily from reviewing the existing CLI code, tinkering via the Web UI admin interface and the CLI itself. All code changes should be reviewed closely by someone with an expert level of knowledge of Skuid and Skuid Internals.
  3. In the absence of a separate PR per issue, I've tried to be fairly verbose in the commit messages to provide context on the changes in the commit so please look there for details on the commit and its included changes.
  4. Within the code, where I wasn't 100% sure of something, I included a `Skuid Review Required` comment. When reviewing, beyond the general review mentioned in items 1 & 2 above, you can search for this comment text to identify where I knew for certain that Skuid input/review was needed.
  5. I added lots of tests in nearly every area of code that I touched and, in many cases, more tests than I would typically include, but all are meaningful tests (just a little overkill in some cases). The reasons for the "thoroughness" of the tests are twofold:
    1. Version 0.6.7 contained essentially no tests, even in areas of the code with significant string parsing. The lack of tests is a primary reason so many issues exist (any manual testing that was performed did not detect them). This is a fairly significant oversight for an official tool of a commercial product that is built, supported, and maintained by the product's own company, especially one intended for production use.
    2. In reviewing prior PRs, with some exceptions, there does not appear to have been any type of code review process conducted, at least not one visible via the PR itself. The only conversation I see around PRs in general relates to "manual testing" and expected/unexpected results. Given it's unclear what type of formal review (either external or internal) PRs undergo, I felt it best to minimize the chances of regressions such as those introduced via this commit and this commit, which were two of the largest contributors to the issues in the current version of the CLI.
  6. This PR is built on top of and includes everything in fix: resolve several issues with watch command #137 so fix: resolve several issues with watch command #137 could be closed in favor of this one (or both reviewed/merged separately).
  7. Breaking Changes:
    1. Phrasal support in --since has been removed - see comments in commit message for details.
    2. Interpretation of Year/Week/Day in --since parsing now uses static values. It's unclear what the expected treatment of these options was in 0.6.7 since there were no tests or documentation, so this may or may not be a breaking change, but I'm mentioning it as one for clarity. See comments in commit message for details.
  8. Non-Breaking Changes:
    1. --trace, --verbose, --diagnostic have been deprecated - see comments in commit message
    2. --no-console-logging flag added - see commit
    3. --log-level and --debug flag added - see commit
    4. --skip-data-sources flag added - this is a temporary flag to workaround bug/perf: deploy command takes a long time to complete #150 and can be removed once that issue is resolved. See commit.
    5. --no-clean flag added - see commit
    6. All flags now have automatic support for environment variables following the naming convention of SKUID_FLAG_NAME (e.g., SKUID_LOG_DIRECTORY). Any previous environment variables that did not follow this pattern have been deprecated. See commit for details.
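As a sketch of the convention described in item 6 (illustrative only; the PR's actual implementation and helper names may differ), deriving the environment variable name from a flag name could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// envVarName derives the environment variable for a flag following the
// SKUID_FLAG_NAME convention (hypothetical helper, not the PR's actual code):
// hyphens become underscores and the result is uppercased with a SKUID_ prefix.
func envVarName(flagName string) string {
	return "SKUID_" + strings.ToUpper(strings.ReplaceAll(flagName, "-", "_"))
}

func main() {
	fmt.Println(envVarName("log-directory")) // prints "SKUID_LOG_DIRECTORY"
}
```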

Looking forward to hearing from Skuid on this PR and the related open issues and helping to bring the CLI to a state of stability and reliability for Skuid customers!

Changelog:

Individual commits contain references to the issue(s) that it resolves and/or is related to so linking will be done via commit message. Happy to list them all here once everything is finalized.

In the meantime, here is a spreadsheet that contains a cross-reference to all 108 issues I've created in this repo (+1 that already existed in this repo) along with the commits that target the issue and a "note" column with anything extra that is worth mentioning (e.g., this is something only skuid can fix).
SkuidCLIIssueTracker.xlsx

Additional Information

Below is a summary of high-level changes that were made beyond just the baseline of addressing the core issues/bugs:

  1. The previous code had a tremendous amount of string parsing of paths in many different areas, often repeating the same "parsing" to condition behavior one way or another. Much of this parsing was incorrect and led to many of the issues present in 0.6.7. Path parsing is now centralized (here and here), and once parsed, strongly typed structures are passed around to avoid the need to reparse and to make conditional execution safe and reliable.
  2. All metadata types are now represented as enums so there is no longer a need for re-parsing to detect what metadata type something is - another area that led to many issues in the existing code.
  3. Dependency injection was introduced in several areas (e.g. here) that required significant refactoring to resolve existing issues. This improves testability (see below) and sets the table for further use of DI for things like HttpClient, etc. to enable improving test coverage throughout the CLI.
  4. Mocking was introduced via the mockery package to support testing of DI related items.
  5. Code testability improvements - Where new code was added and/or significant changes were required to address issues, code was refactored to improve testability. The large majority of the code in 0.6.7 is not testable short of writing full integration tests because: 1) the logic isn't separated into testable units; 2) there is no use of dependency injection, so anything that exercises things like HttpClient must use a real HttpClient (vs. a mock). Additionally, the integration tests that do exist fall short of being "meaningful" (e.g., they deploy to a server and simply check that no error was returned vs. ensuring that the server contains the expected result, which theoretically might be OK if other unit tests existed to ensure proper behavior). The lack of tests and lack of testability of the code is something that likely (hopefully) would have been caught in a code review process (see the mention of this above in Regarding the PR itself, a few things, Item 5.2). Some examples of these changes are:
    1. The code to "parse" the --since flag is buried inside of a much larger function that orchestrates the entire retrieval process. There is no way, short of full integration tests of which there would have to be many to test all the scenarios of parsing since to test that parsing since behaves as expected. This code was refactored here and here and tests written against behaviors here and here.
    2. The code to archive files contained essentially no tests despite a significant amount of conditional logic, filtering, etc. While the code was testable, writing tests was not straightforward since the only way to write tests was to utilize the physical file system and any test setup would be dependent on fixtures that existed in the repo. The refactored approach utilizes the fs package allowing for MapFS to be used, dependency injection (e.g., fs.FS, FileUtil) and separated filters (each themselves individually testable e.g., MetadataArchiveFilter). In short, test coverage went from near 0% to 100% for everything related to Archive.

techfg added 30 commits June 24, 2024 17:19
Resolves skuid#127
Resolves skuid#128
Resolves skuid#129
Resolves skuid#130
Resolves skuid#131
Resolves skuid#132
Addresses the watch portion of skuid#136
Resolves skuid#138
Addresses the fatal termination portion of 141 & 142
Resolves 141
Resolves 142
Resolves 143
Resolves 144
- Resolves skuid#158
- Resolves skuid#159
- Resolves skuid#160
- Resolves skuid#161
- add mockery tool for autogenerating mocks
Resolves skuid#172

- Intentionally not adding tests in this commit for a few reasons: 1) There are no existing tests; 2) the existing flags package has bugs (see skuid#170) and contains a lot of duplicate & unnecessary code; 3) Due to the existing bugs, any tests written would be to validate broken behaviors.  The added code in this PR behaves exactly like Flag[string] does, the only difference being the backing value.  If/When the Flags package issues are resolved, proper tests can be added.
- The approach for the "one-time use" concept is very primitive.  There is a TODO in the code to reflect that a proper secure in-memory storage method for the password and a one-time use solution is needed.  Without knowing whether or not Skuid supports refresh tokens, it didn't make sense to overengineer a solution for this at this point.  The approach taken is an improvement over what existed, even if not ideal.
- The previous code contained a Refresh method in the auth package that was not used anywhere (except for a test), so I eliminated it.  As with the one-time use, once it's known whether or not Skuid supports refresh tokens, a proper Refresh can be implemented to support skuid#173.
Resolves skuid#174
Resolves skuid#176

Note that this commit does not address the related information mentioned in skuid#174 regarding testability but it does create a new logrus instance for every reset rather than using the StandardLogger global which is a step towards addressing overall testability and reducing cost of code maintenance.

techfg commented Sep 17, 2024

  1. Found a couple more bugs, created issues chore: makefile passes version.Name to LDFLAGS but version.Name does not exist #215, chore: makefile and workflows out of sync, outdated, repetitive and drone not working #216
  2. Addressed chore: makefile passes version.Name to LDFLAGS but version.Name does not exist #215
  3. Updated PR description reflecting above including attaching updated spreadsheet

All other information from this comment still applies.

I considered a few different options here, including naming the flag `overwrite` instead of `no-clean`, or making it an enum with `replace` | `overwrite`.  I also considered changing the default behavior to not `clean`.  In the end, the only real way to know what the default should be is to know more about the "average" user persona and their use cases, so for now I'm leaving the existing default to avoid any breaking changes and adding a `no-clean` option, since `replace`/`overwrite` could be a little ambiguous.

This is easy enough to change so input is welcomed and happy to make adjustments.

Resolves skuid#217
Removing JSONTime type which was added in ce913dc but wasn't used in that commit and still does not appear to be used anywhere.
…r codebase

This commit started with the goal of improving logging (skuid#220) to help identify when outcomes were not as expected due to the many issues currently present in the CLI and on the server side.  During the course of improving the logging, which required going through essentially every line of code in the codebase, many new issues were identified and fixes for nearly all of them were implemented.  Additionally, validation of all steps in the CLI was improved and logs are emitted with the results.  Lastly, the code was significantly refactored in many areas to improve code maintainability (skuid#230).

Due to the size of this commit, there is too much to explain in detail in a commit message.  See forthcoming comment in skuid#205 for full details of this commit.

Resolves skuid#59
Resolves skuid#165
Resolves skuid#199
Resolves skuid#218
Resolves skuid#220
Resolves skuid#221
Resolves skuid#222
Resolves skuid#223
Resolves skuid#224
Resolves skuid#228
Resolves skuid#230
Resolves skuid#231
Note that some issues mentioned in skuid#229 were addressed in af20e1e74ff14ee0bfeab32a2c95f060670fe3a0 or in prior commits in skuid#205
Resolves skuid#229
Related af20e1e
Windows does not support SIGTSTP (Ctrl+Z).  It's not worth introducing a platform-specific solution for this, since signal handling is only in place to provide a more graceful shutdown UI-wise (as opposed to having to clean up resources, etc.), so SIGTSTP is simply no longer handled.

techfg commented Oct 11, 2024

TL;DR

  1. Created 19 new issues (14 bugs, 1 feature, 4 chores) - feat: add ability to not "clean" target directory during retrieve #217, bug: bearer tokens leaked in logs #218, bug: WithField(s) entries are output to log when --diagnostic is NOT set #219, chore: improve consistency, accuracy, and amount of information being logged across all commands #220, bug: errors not handled when failing to create http request #221, bug: NewRetrievalRequestBody ignores any error that may occur during marshaling #222, bug: panic's not logged to log file when --file-logging is specified #223, bug: temporary file not closed during retrieve #224, bug: retrieve indicates success but results in an empty targetDirectory without any warning or error #225, bug: retrieve panics when skuidMetadataService is not present in response #226, chore: eliminate fragility of code that processes retrieved results from potentially processing results in the wrong order #227, chore: improve error messaging when retrieve fails when response from ExecuteRetrievalPlan is invalid #228, bug: deploy indicates success when nothing was deployed #229, chore: improve code maintainability #230, bug: CreateDirectoryDeep does not ensure the directory exists #231, bug: GetRetrievePlan API returns unexpected JSON result #232, bug: since value inconsistent between send/receive during retrieve #233, bug: various structure fields (e.g., NlxPlan) inaccurate #234, bug: App not found error encountered when app exists on server #235
  2. Addressed 16 of the 19 issues - feat: add ability to not "clean" target directory during retrieve #217, bug: bearer tokens leaked in logs #218, bug: WithField(s) entries are output to log when --diagnostic is NOT set #219, chore: improve consistency, accuracy, and amount of information being logged across all commands #220, bug: errors not handled when failing to create http request #221, bug: NewRetrievalRequestBody ignores any error that may occur during marshaling #222, bug: panic's not logged to log file when --file-logging is specified #223, bug: temporary file not closed during retrieve #224, bug: retrieve indicates success but results in an empty targetDirectory without any warning or error #225, bug: retrieve panics when skuidMetadataService is not present in response #226, chore: eliminate fragility of code that processes retrieved results from potentially processing results in the wrong order #227, chore: improve error messaging when retrieve fails when response from ExecuteRetrievalPlan is invalid #228, bug: deploy indicates success when nothing was deployed #229, chore: improve code maintainability #230, bug: CreateDirectoryDeep does not ensure the directory exists #231, bug: since value inconsistent between send/receive during retrieve #233
  3. Updated PR description reflecting above including attaching updated spreadsheet

Up until this batch of issues/commits, I avoided any significant refactoring of the existing code base (except where necessary to resolve an issue) due to the high risk of regression, given the lack of unit tests and the high complexity of the code as written (e.g., 200+ line methods with many different things all performed inline).

However, given all the issues still present even with all the fixes/improvements in this PR, having meaningful log messages was critical to be able to identify where things are not working as expected. As an example, if v0.6.7 had anything resembling adequate logging such as what now exists in this PR, identifying root cause of issues that have been found would have been much simpler.

To that end, to address #220, refactoring the code in the remaining areas that hadn't been refactored yet was necessary in order to not further complicate an already unnecessarily complex/difficult to follow code base. Along that line, #230 was created so that #220 & #230 could be addressed in tandem as they are essential for each other to produce the originally desired outcome of #220.

This commit covers a large amount of ground issue-wise as a single commit; it would have been nearly impossible to split into smaller commits since logging is foundational to everything. It contains the refactor itself along with fixes for issues related to logging and many bugs found during the refactor, while significantly improving readability of the code and thus lowering the overall cost of maintenance.

Below is a summary of what has been accomplished in the 16 issues addressed since my last PR comment above:

Logging

Note

Due to all the issues with v0.6.7 and all the open issues in this repo, logging is very extensive as it's required for two primary reasons:

  1. PR (fix: resolve issues with skuid-cli #205) is massive, and it will help with the review process
  2. There are still many issues that remain (in the CLI itself and in the server-side APIs) so having the extensive logging is the only way to even try to understand what went wrong during command execution

Once the outstanding server API issues are resolved and the CLI stable, the amount of logging can be reduced to a more "reasonable" level

  • all messages follow same format
  • when file logging is enabled, all messages are logged to file (even panics), except when an error/panic occurs prior to the root command's PersistentPreRunE being invoked
  • all durations & times are formatted consistently and all fields (--diag) are quoted to ensure consistency in their output
  • all higher level activities and several lower level activities are traced/timed
  • Fields (when --diag is passed) contain several key/value pairs consistently for every message:
    • logger - the name of the function
    • error - if an error was present when message was logged
    • command - if the function is the "entry" for a command
    • if the function logging the message is "under tracking" control:
      • start - the time the function started tracking
      • finish - the time the function finished tracking
      • duration - the duration of the function (in nanoseconds)
      • success - false if the function returned error
  • use of color is intentional and carries a meaning (see message.go)
  • eliminated all occurrences of misuse of "WithFields"
  • use of debug & trace messages is clearly defined and message/message format/fields/etc. are identical between them - debug is used for higher-level activities, trace is used for individual plan/entity/file/etc. activities
  • Anything that "may" contain sensitive information requires the "--diag" flag to be passed (code in this commit should not log anything sensitive unless APIs return sensitive information in error messages which hopefully they do not)
  • Deploy & Retrieve commands log all names of entities deployed/retrieved in aggregate and per plan when --log-level is debug (or greater) - note that due to bug: deployment results do not contain full set of entities that were deployed #211, the entity name results from execute deploy plan are not logged because the results returned are incomplete, inaccurate, and/or not existent.

Log Files

Log files can now be evaluated with various tools when --diag is specified, due to having a consistent format. Which log messages are output is controlled via --log-level, so specifying --log-level trace is recommended when inspecting log files in this manner; however, all messages (regardless of level) expose some fields. For example, specifying --log-level trace --diag --file-logging will output all possible messages and their fields to the log file, which can then be evaluated. For example:

  • the following command can be executed to see the duration of every tracked function, returning the duration (nanoseconds), logger/function name and message:
    grep 'duration="[^"]*"' skuid-cli-2024-10-03T031319-0700-2495227290.log | sed -E 's/.*msg=("[^"]*").*duration="([^"]*)".*logger="([^"]*)".*/\2\t | \3\t | \1/' | sort -h | column -s $'\t' -t
    # Retrieve Example
    ...
    834986070    | pkg.Authorize                                                           | "SUCCESS Requesting authentication in 834.98607ms"
    990743036    | pkg.ExecuteRetrieval                                                    | "SUCCESS Executing retrieval plan(s) in 990.743036ms"
    1201127548   | pkg.Retrieve                                                            | "SUCCESS Retrieving site https://<redacted>.skuidsite.com to directory `/sites/mysite/skuid-objects` in 1.201127548s"
    2036255161   | cmd.retrieve                                                            | "SUCCESS Executing command `retrieve` entities from site https://<redacted>.skuidsite.com to directory `/sites/mysite/skuid-objects` in 2.036255161s"
    2036255161   | cmd.retrieve                                                            | "✓ Retrieved site https://<redacted>.skuidsite.com to `/sites/mysite/skuid-objects`"
    
    # Deploy Example
    ...
    1986807659   | pkg.ExecuteDeployPlan.pkg.executeDeployPlan                             | "SUCCESS Executing deployment for plan `Skuid NLX` in 1.986807659s"
    3039649260   | pkg.ExecuteDeployPlan                                                   | "SUCCESS Executing deployment plan(s) in 3.03964926s"
    3362697168   | pkg.Deploy                                                              | "SUCCESS Deploying site https://<redacted>.skuidsite.com from directory `/sites/mysite/skuid-objects` in 3.362697168s"
    3940579660   | cmd.deploy                                                              | "SUCCESS Executing command `deploy` entities to site https://<redacted>.skuidsite.com from directory `/sites/mysite/skuid-objects` in 3.94057966s"
    3940579660   | cmd.deploy                                                              | "✓ Deployed site https://<redacted>.skuidsite.com from `/sites/mysite/skuid-objects`"
    
    
  • the following command can be executed to see the aggregate and per plan entities deployed/retrieved:
    grep 'entities="[^"]*"' skuid-cli-2024-10-03T221256-0700-3834979381.log | sed -E 's/.*entities="([^"]*)".*entitiesFrom="([^"]*)".*/\2\t | \1/' | column -s $'\t' -t
    # Retrieve Example
    Execute retrieval plan(s) [`Skuid Cloud Data Service`, `Skuid NLX`]          | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute retrieval plan `Skuid NLX` request                                   | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute retrieval plan `Skuid Cloud Data Service` request                    | [`apps/SkuidSample`, `apps/sampleapp`, `sitepermissionsets/Admin`, `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute retrieval plan `Skuid NLX` response                                  | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute retrieval plan `Skuid Cloud Data Service` response                   | [`sitepermissionsets/Admin`]
    Execute retrieval plan(s) [`Skuid Cloud Data Service`, `Skuid NLX`] result   | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    
    
    # Deploy Example
    --entities flag                                                        | []
    Get deployment plan(s) request                                         | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute deployment plan(s) [`Skuid Cloud Data Service`, `Skuid NLX`]   | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute deployment plan `Skuid NLX` request                            | [`apps/SkuidSample`, `apps/sampleapp`, `datasources/SkuidSample_NW_OData`, `designsystems/SkuidSample NW`, ..., `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Execute deployment plan `Skuid Cloud Data Service` request             | [`apps/SkuidSample`, `apps/sampleapp`, `sitepermissionsets/Admin`, `sitepermissionsets/Public`, `sitepermissionsets/Standard`]
    Update Permission Sets request                                         | [`test permission set`]
    --entities flag                                                        | []
    --entities flag                                                        | []
    

Coding Structure/Style

  • refactored functions to have a single purpose or orchestrate smaller composable units via function calls to improve overall clarity of purpose and decrease cost of maintenance (and risks)
  • No naked returns
  • Named returns only when needed to clarify meaning of a return value and/or required in a defer function

General

  • general performance improvements (e.g., eliminated copying entire data structures just to change a single property to pass to another function, migrated to github.com/goccy/go-json for all json operations)
  • eliminated several scenarios where very fragile conditional logic was being applied to make decisions, changing the approach to being explicit/intentional and failing when something unexpected was encountered
  • retrieve & deploy commands now use the same data structure for NlxPlan for consistency and also reducing risk of incorrect usage of plans as the order of evaluation/execution of plans is critical and should be very intentional

Resolves #59
Resolves #165
Resolves #199
Resolves #217
Resolves #218
Resolves #219
Resolves #220
Resolves #221
Resolves #222
Resolves #223
Resolves #224
Resolves #225
Resolves #226
Resolves #227
Resolves #228
Resolves #229
Resolves #230
Resolves #231
Resolves #233

Most of these tests are overkill/unnecessary; however, given the history of this repo and the significant number of regressions introduced by prior commits, I've added some basic coverage to uncovered areas to minimize the risk of future regressions being introduced.
@acofer closed this Jan 17, 2025