
Support multiple inventories and --limit #366

Open: wants to merge 7 commits into master from support_multiple_inventories
Conversation

igorribeiroduarte (Collaborator)

No description provided.

Given that ansible allows us to have multiple inventory dirs,
it makes more sense to use the playbook_dir as the default directory
for the metadata used by the role.
…_files_path

This variable is used only by the manager_agents task.
@igorribeiroduarte igorribeiroduarte force-pushed the support_multiple_inventories branch 2 times, most recently from 18e05d7 to b79335c Compare March 18, 2024 13:29
@vladzcloudius (Collaborator):

@igorribeiroduarte, please check why CI is failing

method: GET
register: _failure_detector_result
until: _failure_detector_result.status == 200
retries: 3
Collaborator:

Let's introduce scylla_api_retries, set it to 5 by default, and use it everywhere instead of hard-coded values like this one.
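
A minimal sketch of that suggestion, assuming a role default named scylla_api_retries and an illustrative endpoint URL (neither is taken from the role's actual code):

# defaults/main.yml -- hypothetical default replacing hard-coded retry counts
scylla_api_retries: 5

# tasks -- reuse the default wherever the REST API is polled
- name: Poll the failure detector
  uri:
    url: "http://localhost:10000/failure_detector/endpoints"  # illustrative URL
    method: GET
  register: _failure_detector_result
  until: _failure_detector_result.status == 200
  retries: "{{ scylla_api_retries }}"
  delay: 1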

Collaborator (Author):

This code block has already been removed, since I'm no longer using the API in this task.

Collaborator (Author):

I agree that such a variable should be added, but I think it's out of scope for this PR.


- name: Set all_nodes_up as a fact
  set_fact:
    all_nodes_up: "{% if item.status is defined and item.status == 200 %}{{ True }}{% else %}{{ False }}{% endif %}"
Collaborator:

What if all nodes are responding to REST but some are in UJ state or reshaping/resharding?
This condition is not good: if you require all nodes to be UN, you need to check exactly that, not just that nodes respond on the REST API port.

Collaborator (Author):

Fixed. I'm using nodetool status now.
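
A minimal sketch of that fix, assuming nodetool is available on the target hosts (task names and the grep pattern are illustrative):

- name: Count nodes reported as Up/Normal by nodetool
  shell: nodetool status | grep -c '^UN'
  register: _un_count
  ignore_errors: true  # grep exits non-zero when no node is UN

- name: Set all_nodes_up as a fact
  set_fact:
    all_nodes_up: "{{ _un_count.stdout | int == groups['scylla'] | length }}"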

delay: 1
ignore_errors: true
delegate_to: "{{ item }}"
loop: "{{ groups['scylla'] }}"
Collaborator:

Why run this serially rather than concurrently (without the loop)? Imagine a cluster with hundreds of nodes - this will run forever!

Collaborator (Author):

Fixed

ansible-scylla-node/tasks/common.yml: four review threads marked outdated and resolved
@igorribeiroduarte igorribeiroduarte force-pushed the support_multiple_inventories branch 2 times, most recently from 00562a2 to 08e41cd Compare April 1, 2024 16:02
@igorribeiroduarte igorribeiroduarte changed the title Support multiple inventories Support multiple inventories and for --limit Apr 17, 2024
@igorribeiroduarte igorribeiroduarte changed the title Support multiple inventories and for --limit Support multiple inventories and --limit Apr 17, 2024
@igorribeiroduarte igorribeiroduarte force-pushed the support_multiple_inventories branch 3 times, most recently from 0ef1ba8 to da57435 Compare April 17, 2024 13:05
…ication

Before this patch, we were assuming that if we get to the adjust_keyspace_replication
task and start_scylla_service is set to true, then all the nodes are up,
since the 'start_scylla_service dependent tasks' block was already executed.
However, the role is responsible for starting scylla only on the nodes in
ansible_play_hosts: if the user has a cluster with 6 nodes but runs the role
against only 3 of them (using --limit), the role should start only those 3 nodes,
and the other 3 might or might not have already been started.
With that in mind, this patch checks that ALL the nodes in the inventory (and not only
the ones in ansible_play_hosts) are up before adjusting the replication.
…cation

Keyspace replication needs to be adjusted based on ALL nodes, and not only
the ones for which we're currently executing the role.
…_sources to find cql credentials

A user might pass multiple inventories, and we don't know which of them
contains the credentials, so we need to iterate through all of them to
find out.
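
A minimal sketch of that search using Ansible's ansible_inventory_sources magic variable; the credentials file name is hypothetical:

- name: Look for cql credentials next to each inventory source
  stat:
    path: "{{ item | dirname }}/cql_credentials.yml"  # hypothetical file name
  register: _cred_stat
  delegate_to: localhost
  run_once: true
  loop: "{{ ansible_inventory_sources }}"

- name: Use the first inventory source that has the credentials
  set_fact:
    cql_credentials_file: "{{ _cred_stat.results | selectattr('stat.exists') | map(attribute='stat.path') | list | first }}"
  run_once: true
  when: _cred_stat.results | selectattr('stat.exists') | list | length > 0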
…the 'play_hosts' var

Currently we start the scylla service on all the nodes in the [scylla] section
of the inventory. By doing this we don't respect the limits applied by the
user with the '--limit' option.
This patch fixes that by limiting the execution of the 'Start scylla non-seeds nodes serially'
and 'Start seeders serially' tasks to the nodes in the 'play_hosts' var.
It does the same with the 'Create a map from dc to its list of nodes' task.
In order to run the token_distributor script in the generate_tokens task, we
need broadcast_address and rpc_address defined for all the already
bootstrapped nodes, including the ones that are not in the 'play_hosts' variable.
So far we've been assuming that the variables scylla_broadcast_address and
scylla_rpc_address would always be defined for all the nodes, but the user
is under no obligation to pass variables for nodes that are not in 'play_hosts',
i.e. hosts that were excluded from the playbook execution by the --limit option.
With that in mind, this patch defines these variables for bootstrapped nodes
that are not in 'play_hosts' by using the values from their 'scylla.yaml' files.
The patch assumes that any already bootstrapped node has a scylla.yaml
in /etc/scylla with rpc_address and broadcast_address defined.
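
A minimal sketch of that fallback; the fact name excluded_node_addresses is hypothetical and the role may structure this differently:

- name: Slurp scylla.yaml from bootstrapped nodes excluded by --limit
  slurp:
    src: /etc/scylla/scylla.yaml
  register: _scylla_yaml_out
  delegate_to: "{{ item }}"
  run_once: true
  loop: "{{ groups['scylla'] | difference(ansible_play_hosts) }}"

- name: Record broadcast_address and rpc_address for those nodes
  set_fact:
    excluded_node_addresses: >-
      {{ excluded_node_addresses | default({}) | combine({
           item.item: {
             'broadcast_address': (item.content | b64decode | from_yaml).broadcast_address,
             'rpc_address': (item.content | b64decode | from_yaml).rpc_address
           } }) }}
  run_once: true
  loop: "{{ _scylla_yaml_out.results }}"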
@@ -17,11 +7,14 @@
until: _datacenter_out.status == 200
retries: 5
delay: 1
delegate_to: "{{ item }}"
loop: "{{ groups['scylla'] }}"
run_once: true
Collaborator:

Again: this is bad - it will work terribly for large clusters.
We need a solution that executes concurrently on all hosts - not serially.
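
A minimal sketch of the concurrent variant: drop run_once and the delegate_to loop, and let Ansible's forks run the task on every host at once (the URL is illustrative):

- name: Query the datacenter endpoint on every node concurrently
  uri:
    url: "http://localhost:10000/snitch/datacenter"  # illustrative URL
    method: GET
  register: _datacenter_out
  until: _datacenter_out.status == 200
  retries: 5
  delay: 1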

@@ -79,7 +79,7 @@
- name: Create a map from dc to its list of nodes
set_fact:
dc_to_node_list: "{{ dc_to_node_list | default({}) | combine( {hostvars[item]['dc']: (dc_to_node_list | default({}))[hostvars[item]['dc']] | default([]) + [item]} ) }}"
loop: "{{ groups['scylla'] }}"
loop: "{{ play_hosts }}"
Collaborator:

This is bogus! This MUST include all scylla hosts.

@@ -328,13 +328,13 @@
- name: Start seeders serially
run_once: true
include_tasks: start_one_node.yml
loop: "{{ groups['scylla'] }}"
loop: "{{ play_hosts }}"
Collaborator:

Same here

when: hostvars[item]['broadcast_address'] in scylla_seeds or item in scylla_seeds

- name: Start scylla non-seeds nodes serially
run_once: true
include_tasks: start_one_node.yml
loop: "{{ groups['scylla'] }}"
loop: "{{ play_hosts }}"
Collaborator:

And here

register: node_count
until: node_count.stdout|int == ansible_play_batch|length
until: node_count.stdout|int == play_hosts|length
Collaborator:

Waiting for the nodes of a specific DC to become UN is not enough - we need to wait until the other nodes are UN from these nodes' perspective too. So you should wait for all (!!) nodes in the cluster to become UN from every node's perspective.
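
A minimal sketch of that cluster-wide check, run on every host (no run_once) so that each node's own view must show the whole cluster as UN; nodetool availability is assumed:

- name: Wait until this node sees every cluster member as UN
  shell: nodetool status | grep -c '^UN'
  register: node_count
  until: node_count.stdout | int == groups['scylla'] | length
  retries: 60
  delay: 5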
