[Fix - scylla-node] - Ansible builtin + when condition start node #372

unkls · 2024-04-03T14:04:23Z

Hello Scylladb community,

First thanks for that amazing ansible role !!

I'm currently using your ansible-role to deploy a cluster from scratch and adding a node into a cluster. I'm using ansible 2.16 to match your requirement (2.8). Currently when I was using your examples I was facing some issues with 2 tasks : Start scylla non-seeds nodes serially | Start seeders serially.

This issue append when I'm trying to add a node into the cluster.
My error was due to the Resolve scylla_broadcast_address task that was setting the fact only for the current node that run the task.
Then when we enter into start_scylla_service dependent tasks task, it will iterate over all the servers inside our current inventory and try to catch the scylla_broadcast_address. This variable is empty when we call the role for one node.

Examples :

My inventory is like :

[scylla]
scylla-1
scylla-2
scylla-3

I want to add a new node scylla-4 from the group [new_node].
My playbook is like :

- name: Add Scylla node(s)
  hosts: "{{ ansible_limit }}"
  become: true
  gather_facts: true
  vars:
    full_inventory: false
  roles:
    - ansible-scylla-node
  tags: new_node

- name: Run cleanup on the old nodes
  hosts: "{{ target_host | default('scylla') }}"
  gather_facts: false
  serial: 1
  tasks:
    - name: Run cleanup
      script:  files/repair.sh
  tags: new_node

I call it like : ansible-playbook scylla.yaml -i inventories/inventory.yml -u devops --diff -l scylla-4 -t new_node -e "@group_vars/scylla_poc"
When I arrive at the task scylla_broadcast_address it will perform the set_fact only for my node scylla-4. Once I will run the task Start seeders serially ansible will iterate over loop: "{{ groups['scylla'] }}" but by design this should run only the new node.
That will iterate like :

skipping: [scylla-4] => (scylla-1) --> Empty value for scylla_broadcast_address
skipping: [scylla-4] => (scylla-2) --> Empty value for scylla_broadcast_address
skipping: [scylla-4] => (scylla-3) --> Empty value for scylla_broadcast_address
skipping: [scylla-4] => (scylla-4) --> IP from the NIC not into seeder list
skipping: [scylla-4]

And same pattern append with Start scylla non-seeds nodes serially task.

My fix proposal is to add a new check to avoid crash errors on set_fact empty for scylla_broadcast_address.

Let me know if you need more to debug/discuss !

Thanks a lot in advance !

tarzanek · 2024-04-05T10:31:46Z

well the idea looks OK to me, @vladzcloudius ?

tarzanek · 2024-04-05T10:32:43Z

code wise however you should split the builtin into separate commit (since it's refactoring)
and leave functional change (the scylla_broadcast_address stuff) in a separate isolated commit

both patches/commits should have messages prefixed with
ansible-scylla-node:

and the functional change should describe the change and intended behaviour fix
(so instead "fix empty when condition" you should write something like "support running the role on limited number of nodes, skip tasks that miss information " )

of course thinking about it, there might be more corner cases that this change will unblock and that would require more fixes
original design is that you should always run role on top of whole inventory and it can detect which nodes have to be added as new and wait for them until they get added to cluster

cleanup task is then done afterwards (after all nodes are added)

but I see how your idea can speed up the role run for big clusters, so we should consider it

…des, skip tasks that miss information

…ble 2.10

…a-ansible-roles into fix-scylla-node-ansible-2.14

…ommentary sections

unkls · 2024-04-05T12:29:58Z

Hello @tarzanek,

Thanks for the feedback, I've split both commits let me know if it's ok for you. Thanks for the prefix tips I've include it.

We do use -l in order to add quickly one node into our clusters + we manage multiple inventory with multiple clusters that make our life easier to just limit in this way.

Let me know if you need anything else to allow this PR.

unkls · 2024-04-16T08:50:35Z

Hello @tarzanek ,

Any update about this PR ?

Thanks a lot in advance !

igorribeiroduarte · 2024-04-17T11:28:51Z

Please note that we already have another PR under review that will add support for the --limit option: #366.

Also note that this PR couldn't be merged before fixing the adjust_keyspace_replication.yml, which currently assumes that if start_scylla_service is set to true and we already executed the 'start_scylla_service dependent tasks block, it means that all the nodes are up and tries to adjust the replication based on that, as explained here -> 1326e4e

The generate_tokens.yml task also assumes that broadcast_address is defined for all nodes and would stop working if we merged this PR as it is right now.

In general, even though this serves as a workaround, relying on broadcast_address being defined or not to decide if we'll start scylla-service doesn't look like the best approach to me. We should explicitly limit the execution of the start_one_node task to the nodes in the play_hosts variable, which will be the nodes passed to --limit.

unkls · 2024-04-18T13:42:50Z

Hello @igorribeiroduarte ,

Thanks for your feedback.
About your first point I don't really get it sorry. Even if you perform --limit on one node the context can be global using groups into the tasks (as it's already the case). My point is to run faster the playbook on our clusters.
The PR you shared don't perform the same as your context is changed to only --limit which I don't need for now.
For the adjust_keyspace_replication it will still work as your node will be added and others node are already there and inside the inventory. Again the context stay global with groups.

For the broadcast_address adding a try statement to avoid crash sounds good for me even if you start a node you must need to check if the node itself but not the others as my using a --limit.

Let me know if it's not clear I can elaborate more.

vladzcloudius · 2024-05-15T19:51:06Z

@unkls First of all it's awesome that you enjoy Scylla Role.
A few comments about the way you use it: Scylla Node Role is not ready to be executed using --limit and not only for reasons you are trying to fix in this PR.

The main problem is that Role has a few invariants that are being validated in regards to the cluster.
One of them is a validation of the cluster being healthy ("wait for the cluster to become healthy") that relies on the ansible_play_batch to include all scylla nodes.
And --limit is reducing ansible_play_batch to the nodes you are restricting the playbook to.

Another pieces that are probably going to break are the tasks that are replicating the node configuration from the first node in the DC ensuring that all nodes are configured the same way: "Collect IO settings for the first node of every DC".

And we there are a few other places that are going to break unless you execute a role on all nodes of the cluster (scylla group).

And the thing is - there is no much motivation to restrict the Role using --limit AFAICT.
Simply run Ansible with --fork as the number of your scylla nodes and make sure to cache ssh connections.

If you do that the execution of a Role from the beginning to the end on the cluster that has already been configured takes about 2 minutes - in your example the execution on the whole cluster and on a single node that bootstraps is going to take exactly the same time.

Instrumenting a Node Role to support --limit would make us reduce the Role's functionality in such a mode to levels quite significantly:

We would have to remove configuration files syncing.
We would have to remove cluster's health checking.

While this is not impossible but I find the motivation for this unclear while cost of such an instrumentation - quite significant.

jonbuffington · 2024-07-27T20:48:41Z

One of them is a validation of the cluster being healthy ("wait for the cluster to become healthy") that relies on the ansible_play_batch to include all scylla nodes.

The above change now waits for all "unhealthy" cluster nodes to be 0. This seems to be a useful check since this step executes after the check for the CQL port step.

I ran into an issue deploying a test cluster from scratch were the second node of three would never proceed past the "become healthy" step because only one of three nodes were deployed prior to the check.

common - update builtin + fix empty when condition

1ec685d

unkls mentioned this pull request Apr 3, 2024

Request to Release #370

Open

unkls added 4 commits April 5, 2024 14:17

ansible-scylla-node: support running the role on limited number of no…

860101e

…des, skip tasks that miss information

ansible-scylla-node: Update set_fact to use builtin modules from ansi…

8cf774a

…ble 2.10

Merge branch 'fix-scylla-node-ansible-2.14' of github.com:unkls/scyll…

396fbc7

…a-ansible-roles into fix-scylla-node-ansible-2.14

ansible-scylla-node: Remove the added line to adhere to the current c…

6e7fefb

…ommentary sections

tarzanek requested a review from vladzcloudius April 16, 2024 08:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix - scylla-node] - Ansible builtin + when condition start node #372

[Fix - scylla-node] - Ansible builtin + when condition start node #372

unkls commented Apr 3, 2024

tarzanek commented Apr 5, 2024

tarzanek commented Apr 5, 2024 •

edited

Loading

unkls commented Apr 5, 2024

unkls commented Apr 16, 2024

igorribeiroduarte commented Apr 17, 2024

unkls commented Apr 18, 2024

vladzcloudius commented May 15, 2024 •

edited

Loading

jonbuffington commented Jul 27, 2024

[Fix - scylla-node] - Ansible builtin + when condition start node #372

Are you sure you want to change the base?

[Fix - scylla-node] - Ansible builtin + when condition start node #372

Conversation

unkls commented Apr 3, 2024

tarzanek commented Apr 5, 2024

tarzanek commented Apr 5, 2024 • edited Loading

unkls commented Apr 5, 2024

unkls commented Apr 16, 2024

igorribeiroduarte commented Apr 17, 2024

unkls commented Apr 18, 2024

vladzcloudius commented May 15, 2024 • edited Loading

jonbuffington commented Jul 27, 2024

tarzanek commented Apr 5, 2024 •

edited

Loading

vladzcloudius commented May 15, 2024 •

edited

Loading