Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

askfongjojo · 2023-07-25T19:43:30Z

While testing service failover on rack2, I noticed that nexus held on to the same cockroachdb backend without attempting to use any of the other nodes in the 5-node database cluster, causing requests to fail until the one it favored came back.

The same happened with pantry requests for disk import blocks / bulk writes. I haven't got to the point of seeing the TTL being exhausted. I tried waiting for up to 5 minutes and the request still couldn't succeed until the pantry zone Nexus has been using prior to its outage came up again.

smklein · 2023-07-25T20:08:06Z

I think ignoring the TTL makes sense (not that it's right, just that it makes sense) in the context of this code:

omicron/nexus/src/context.rs

Lines 167 to 195 in a29f08b

    
           let url = match &config.deployment.database { 
        
               nexus_config::Database::FromUrl { url } => url.clone(), 
        
               nexus_config::Database::FromDns => { 
        
                   info!(log, "Accessing DB url from DNS"); 
        
                   let address = loop { 
        
                       match resolver 
        
                           .lookup_socket_v6(ServiceName::Cockroach) 
        
                           .await 
        
                       { 
        
                           Ok(address) => break address, 
        
                           Err(e) => { 
        
                               warn!( 
        
                                   log, 
        
                                   "Failed to lookup cockroach address: {e}" 
        
                               ); 
        
                               tokio::time::sleep(std::time::Duration::from_secs( 
        
                                   1, 
        
                               )) 
        
                               .await; 
        
                           } 
        
                       } 
        
                   }; 
        
                   info!(log, "DB address: {}", address); 
        
                   PostgresConfigWithUrl::from_str(&format!( 
        
                       "postgresql://root@{address}/omicron?sslmode=disable", 
        
                   )) 
        
                   .map_err(|e| format!("Cannot parse Postgres URL: {}", e))? 
        
               } 
        
           };

Our API to interface with CockroachDB from Nexus involves "constructing a URL", and right now, we embed a single point-in-time address of CockroachDB into this URL.

We should probably avoid doing this, and embed the service name into the URL, so it actually use DNS during later lookups.

askfongjojo · 2023-07-25T20:51:38Z

Understood. The current behavior, in conjunction with #3613, results in an 1 in 15 probability that a user will not be able to use the system at all. I happened to be in this situation when my workstation always used the one nexus that had the first CRDB backend I brought down to test failover. :(

jmpesp · 2023-07-26T19:45:56Z

The same happened with pantry requests for disk import blocks / bulk writes. I haven't got to the point of seeing the TTL being exhausted. I tried waiting for up to 5 minutes and the request still couldn't succeed until the pantry zone Nexus has been using prior to its outage came up again.

In this case, there's no TTL: a specific pantry from the list of available ones is selected once and used for a disk until that disk is finalized.

One thing that could be done is to check of the selected pantry is still responsive and choose another one if not, but it was important not keep as much code out of the import chunk hot-path as possible, as any checks there will be multiplied by the number of chunks to import and slow down imports. That being said, slow imports are better than non-working imports :) I'll give this some thought.

askfongjojo · 2023-07-26T20:39:58Z

Sorry for not being clear in pantry's case. I understand the need to stay with the same pantry for the same disk snapshot. But the issue I ran into is with new/separate snapshot requests still holding on to the unresponsive one.

#3783) First pass at #3763 for crdb. Even though we did query internal DNS, we were previously using only a single host as part of connecting to crdb from Nexus. And since the internal DNS server always returns records in the same order, that meant every Nexus instance was always using the same CockroachDB instance even now that we've been provisioning multiple. This also meant if that CRDB instance went down we'd be hosed (as seen in #3763). To help with that, this PR changes Nexus to use all the CRDB hosts reported via Internal DNS when creating the connection URL. There are some comments in the code, but this still not quite as robust as we could be, but short of something cueball-like it's still an improvement. To test I disabled the initial crdb nexus connected to and it was able to recover by connecting to the next crdb instance and continue serving requests. From the log we can see a successful query, connection errors once i disabled `fd00:1122:3344:101::5`, and then a successful query with connection reestablished to next crdb instance (`fd00:1122:3344:101::3`): ``` 23:43:24.729Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result action = Query actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. }) resource = Database result = Ok(()) 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=d isable error_message = Connection error: server is shutting down 23:43:30.803Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): roles roles = RoleSet { roles: {} } 23:43:30.804Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result action = Query actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. }) resource = Database result = Ok(()) ```

luqmana · 2023-07-31T23:02:22Z

#3783 partially addresses this for crdb. We now, grab all the cockroachdb hosts via internal dns at nexus startup and add them all to the connection string. Whenever a new connection is established it'll try the listed hosts in order and use the first one that successfully connects. While that's an improvement in that Nexus won't fail to serve requests if one crdb instance goes down, there are still some issues:

The list of hosts used are resolved and get fixed at Nexus startup. If new cockroachdb zones get spun up, it'll take a restart of Nexus for it to pick them up. As-is, I don't think we're yet dynamically spinning up new services and instead just creating the service plan during RSS.
Internal DNS returns records in the same order everytime (couldn't find existing issue so created one: Internal DNS should permute the order in which it responds for entries with multiple records. #3805). This means every Nexus zone tries to connect to the set of CRDB hosts in the same order.

smklein · 2024-04-29T20:08:48Z

I'm poking at this issue again, now that we're looking at possibly expunging zones which are running CRDB nodes.

Where the System Currently Stands

CockroachDB nodes, when initializing, access internal DNS to find the IP addresses of other nodes within the cluster:

omicron/smf/cockroachdb/method_script.sh

Lines 13 to 19 in cfa6bd9

    
           # We need to tell CockroachDB the DNS names or IP addresses of the other nodes 
        
           # in the cluster.  Look these up in internal DNS.  Per the recommendations in 
        
           # the CockroachDB docs, we choose at most five addresses.  Providing more 
        
           # addresses just increases the time for the cluster to stabilize. 
        
           JOIN_ADDRS="$(/opt/oxide/internal-dns-cli/bin/dnswait cockroach \ 
        
               | head -n 5 \ 
        
               | tr '\n' ,)"

Nexus, when creating a connection to a pool, performs a "one-time DNS lookup" here:

omicron/nexus/src/context.rs

Lines 189 to 224 in d80cd29

    
           nexus_config::Database::FromDns => { 
        
               info!(log, "Accessing DB url from DNS"); 
        
               // It's been requested but unfortunately not supported to 
        
               // directly connect using SRV based lookup. 
        
               // TODO-robustness: the set of cockroachdb hosts we'll use will 
        
               // be fixed to whatever we got back from DNS at Nexus start. 
        
               // This means a new cockroachdb instance won't picked up until 
        
               // Nexus restarts. 
        
               let addrs = loop { 
        
                   match resolver 
        
                       .lookup_all_socket_v6(ServiceName::Cockroach) 
        
                       .await 
        
                   { 
        
                       Ok(addrs) => break addrs, 
        
                       Err(e) => { 
        
                           warn!( 
        
                               log, 
        
                               "Failed to lookup cockroach addresses: {e}" 
        
                           ); 
        
                           tokio::time::sleep(std::time::Duration::from_secs( 
        
                               1, 
        
                           )) 
        
                           .await; 
        
                       } 
        
                   } 
        
               }; 
        
               let addrs_str = addrs 
        
                   .iter() 
        
                   .map(ToString::to_string) 
        
                   .collect::<Vec<_>>() 
        
                   .join(","); 
        
               info!(log, "DB addresses: {}", addrs_str); 
        
               PostgresConfigWithUrl::from_str(&format!( 
        
                   "postgresql://root@{addrs_str}/omicron?sslmode=disable", 
        
               )) 
        
               .map_err(|e| format!("Cannot parse Postgres URL: {}", e))?

So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

What do we want to do

I've tried to dig into the CockroachDB and Postgres docs to explore our options in this area when constructing the postgresql:// URL from the Nexus side. The end-goal here would be to have an implementation where Nexus can easily migrate from querying one CockroachDB node to the next, as we continue removing old nodes and adding new ones.

Options

Specify a host address in the form of a hostname, use DNS to resolve

It should hopefully be possible to provide a hostname to the postgresql::// URL which identifies the CockroachDB service, and which uses our internal DNS server to resolve the name of the host.

As far as I can tell -- feedback welcome if folks see alternate pathways -- the mechanism to point postgres clients at a particular DNS server is by setting the resolv.conf file with the IP address of the desired internal DNS server.

For nexus, this would mean:

Setting up an /etc/resolv.conf file which includes a nameserver entry pointing to the IP address of our internal DNS server
Updating the postgresl:// url to refer to the Cockroach service, and relying on the nameserver to perform translation.

Subsequent experimentation is necessary to determine how failures are propagated back to Nexus when some of these nodes die, and to identify if new nodes are actually made accessible.

Specify multiple hosts during nexus-side construction of `postgresql://` URL

It is possible to specify multiple hostnames, according to libpq:

See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS

Pros: Would give us redundancy from single-node failure
Cons: Would not actually update the set of "all known CRDB nodes", so after enough failures, this would also fall over

Experimentally, I spun up a three-node cluster, killed one, and observed via a psql shell that I could re-connect to another. However, it's also notable that this failure is visible to the client - my postgres client sent a request that failed (to the dead node) before reconnecting to a live one, and this process was not automated.

smklein · 2024-04-29T21:31:34Z

Here's the result of some of my experimentation. I'd really rather have a local test here - that's what I was trying to build out in #5628 - but it's quite difficult to create all this in an isolated environment, since changing the DNS server used by postgres relies on changing "system-wide" config in /etc/resolv.conf.

Setup

I spun up a DNS server within Omicron, sitting on port 53:

sudo ./target/debug/dns-server --http-address [::1]:0 --dns-address [::1]:53 --config-file $PWD/dns-server/examples/config.toml

I then spun up three CockroachDB nodes talking to each other within a cluster. These are all on localhost, on ports 7709, 7710, and 7711.

cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb1 --listen-addr [::1]:7709 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb2 --listen-addr [::1]:7710 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb3 --listen-addr [::1]:7711 --http-addr :0
cockroach init --insecure --host [::]:7709

Then I populated my internal DNS server with these records:

# SRV records
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test 

# AAAA records
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test ac33791c-62c6-43b0-bcbd-b15e7727b533.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test c6cda479-5fde-49a0-a079-7c960022baff.host ::1

Next, I added my nameserver running on localhost to /etc/resolv.conf:

# added to `resolv.conf`
nameserver ::1

To check that DNS is up and running, I used dig:

 $ dig _cockroach._tcp.control-plane.oxide.test

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> _cockroach._tcp.control-plane.oxide.test
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38788
;; flags: qr rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.test. IN A

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test.

;; ADDITIONAL SECTION:
c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test. 0 IN AAAA ::1
ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test. 0 IN AAAA ::1
6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test. 0 IN AAAA ::1

;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Mon Apr 29 14:04:58 PDT 2024
;; MSG SIZE  rcvd: 400

Which looks like I'd expect -- I'm seeing those 7709 - 7711 ports in the SRV records, and a bunch of references to ::1 in the AAAA records.

I can use the Cockroach shell to connect directly to a node via IP:

cockroach sql --url "postgresql://root@[::1]:7709?sslmode=disable"

However, using a hostname appears to be hitting issues:

 $ cockroach sql --url "postgresql://root@_cockroach._tcp.control-plane.oxide.test?sslmode=disable"
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

dial tcp: lookup _cockroach._tcp.control-plane.oxide.test: no such host
Failed running "sql"

luqmana · 2024-04-29T21:52:36Z

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

smklein · 2024-04-29T22:02:59Z

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right?

Just being clear that it's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB

luqmana · 2024-04-29T22:05:06Z

Correct

…

On Mon, Apr 29, 2024, 3:03 PM Sean Klein ***@***.***> wrote: Resolving using SRV records doesn't work with the cli ( cockroachdb/cockroach#64439 <cockroachdb/cockroach#64439>). But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime. Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right? Just being clear that it's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB — Reply to this email directly, view it on GitHub <#3763 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACGCVZN5SBTWMXN5CM5673Y727SRAVCNFSM6AAAAAA2XRPVFWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBTG43DANJYGA> . You are receiving this because you commented.Message ID: ***@***.***>

smklein · 2024-04-29T23:37:16Z

Here's my follow-up -- I'm digging into libpq to try to figure out why DNS resolution isn't happening from Nexus, or any of the psql / cockroachdb sql clients:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L872

It appears the PQConnectStart method parses connection options, then calls pqConnectDBStart to actually try connecting to the backend. This itself calls PQconnectPoll after setting up some config options (most importantly though, setting status to CONNECTION_NEEDED.

I'm pretty sure I'm bottoming out here, because this matches the error I was seeing:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L2737-L2747

Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records -- this appears to match my experiments locally, where I spun up a small C Program, and could read from the "node name" of c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test, but not _cockroach._tcp.control-plane.oxide.test. This does, admittedly, defeat the point of doing DNS lookup, since that AAAA record would presumably change when the CRDB service is provisioned to a new node.

Out of curiosity, I dug into tokio-postgres too, to see if they did any better than libpq:

https://github.com/sfackler/rust-postgres/blob/98f5a11bc0a8e451552d8941ffa078c7eb6cd60c/tokio-postgres/src/connect.rs#L98

This calls into https://docs.rs/tokio/latest/tokio/net/fn.lookup_host.html , which also appears (through local testing) to "work with AAAA records when you also supply the port, but not with SRV records". This pretty much matches the getaddrinfo client issues libpq is facing.

smklein · 2024-04-29T23:53:14Z

I'm kinda getting the sense we have two options for a path forward here:

Nexus resolves IP addresses by doing the SRV lookup itself. This is, more-or-less, what we're doing today. To tolerate CockroachDB node failures, however, we should strengthen this approach a bit, by providing multiple hosts to multiple CRDB nodes, and periodically refreshing our connections to postgres with an updated set of hosts that we re-acquire from DNS, again by doing the SRV lookups ourselves.

There are some drawbacks to this approach that we would need to work through:

Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.

We patch libpq to add lookup via SRV. We aren't actually the first ones to propose this, but I think it would require us to do some upstream work, which may or may not be accepted, and would definitely introduce a delay to deployment.

luqmana · 2024-04-30T18:40:07Z

Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.

Right, last I looked it involved changes across multiple crates:

We create an async-bb8-diesel::ConnectionManager with a fixed database URL string. That wraps diesel::r2d2::ConnectionManager and implements bb8::ManageConnection. In that is the blocking call to the underlying r2d2::ManageConnection::connect which is implemented directly in diesel and calls Connection::establish with the fixed url we created at startup.

smklein · 2024-04-30T18:42:49Z

I'll see if I can modify Diesel and async-bb8-diesel to make that URL mutable. That seems like it would help us form a foundation to modify things from nexus, if we are unable to rely on libpq to act as a DNS client on our behalf.

smklein · 2024-04-30T19:00:27Z

Okay, seems possible to let the database URL be mutable from Nexus. That would at least let Nexus update the "set of valid CRDB nodes" with the latest info it knows about.

See: oxidecomputer/async-bb8-diesel#62 , diesel-rs/diesel#4005

luqmana · 2024-04-30T19:20:52Z

Nice! One thing though is that with that we'd need to explicitly call that for every Nexus every time the set of CRDB nodes changes. Compared to changing connect to somehow get the URL dynamically where we could just query DNS.

smklein · 2024-04-30T20:04:32Z

Nice! One thing though is that with that we'd need to explicitly call that for every Nexus every time the set of CRDB nodes changes. Compared to changing connect to somehow get the URL dynamically where we could just query DNS.

Totally, but this was my takeaway from reading libpq's source: someone on the client-side needs to make a decision to go query DNS, get a new set of nodes, and update the set of IPs we end up talking to. If libpq handled this for us, that would be nice, but if it doesn't, I think it ends up being functionally somewhat similar for Nexus to take this responsibility too.

Arguably, I think Nexus could be more optimal here, since it can perform this action as a downstream operation of an RPW to avoid unnecessary queries to DNS until the set is known to have changed.

davepacheco · 2024-04-30T22:12:45Z

This use case is really similar to the case of managing HTTP clients for our internal services: there's DNS resolution that ideally would be decoupled from connection establishment, and connection establishment that would be decoupled from usage. This is another case where we'll want a more sophisticated connection management component, which likely means building our own pool to better control the behavior here. We basically decided in the update call today to pursue this, so the rest of this might be moot. But in case it's useful, here's responding to a few things above.

So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.

The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

This is a good summary. The CockroachDB part (in the start method) was explicitly designed this way in order to stay in sync with the latest cluster topology and I'd be more surprised if it didn't work! But we've known from the start that there was at least some work to be done to make the pool survive database failures better, and likely we were going to have to do our own pool.

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).
...
Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records

For what it's worth, I've almost never seen non-application-specific components do SRV lookups.

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag --experimental-srv-dns depending on the version?) But afaik that's just limited to the initial bootstrapping and unsure how it deals with the set changing at runtime.

I think we've basically solved this part already (see above). We considered --experimental-dns-srv at the time but ruled it out, if for no other reason than it was documented as unstable and subject to removal or change. I feel like I also found some problem with its behavior but I do not remember what it was. Thinking out loud: such a facility would need: to use the right DNS servers (have we set up /etc/resolv.conf correctly), and deal well with those being down; and importantly if any of it fails it should sit there waiting for the right records to show up, not just give up and exit. It may have been this last one that was the sticking point.

I'm kinda getting the sense we have two options for a path forward here:

Nexus resolves IP addresses by doing the SRV lookup itself...

We patch libpq to add lookup via SRV...

It makes sense to me that the DNS resolution and even TCP connection establishment behavior would ultimately be application-specific and would happen outside libpq. While it's no doubt convenient that clients like tokio_postgres allow you to specify a DNS name at all, that falls apart in situations like ours because there are so many policy choices hardcoded here that we'd probably want to change: which nameservers to use? how many are queried? with what concurrency? on failure, do we fail or retry? how many times? how often? Then most of those same questions apply again for TCP connection establishment. And the challenge is that we likely want to make these choices at a high layer, not deep in tokio_postgres when we're in the process of trying to establish a connection for a client that's currently waiting on it.

askfongjojo · 2024-06-21T18:50:33Z

In my most recent testing, the behavior for CRDB connection failures has changed (or might be expected?). When the crdb node in use has gone down, I saw messages like this in nexus logs:

17:42:35.701Z ERRO 2898657e-4141-4c05-851b-147bffc6bbbd (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
    file = nexus/db-queries/src/db/pool.rs:100

(Note: The crdb instance in use is not necessarily the first one listed; in my case, disabling the 105 instance is what triggered the error.)

Nexus was able to continue serving both read and update API requests in spite of the above error. It's possible that any ongoing requests against the crdb node right at the moment it's going down could have failed but all requests I made afterwards succeeded.

@davepacheco took a look at the code to understand what changed the behavior. Here are his comments:

I didn't think we'd changed anything here since #3763 was filed, but I see there was #3763 (comment), which maybe fixed this? Although subsequent comments suggested it might not have, but it seems like it should have. Anyway, it's good that it worked.

smklein · 2024-06-25T19:19:19Z

Just to be explicit: #5876 will fix this issue for CockroachDB -- in that PR, we use https://github.com/oxidecomputer/qorb to access each CRDB node individually, and create a pool of connections for each one.

If a CRDB node starts returning errors, we reduce our usage of it until it looks healthy
If a CRDB node is removed from DNS, we stop using it
If a new CRDB node shows up in DNS, we start using it

Replaces all usage of bb8 with a new connection pooling library called [qorb](https://github.com/oxidecomputer/qorb). qorb, detailed in RFD 477, provides the following benefits over bb8: - It allows lookup of multiple backends via DNS SRV records - It dynamically adjusts the number of connections to each bakend based on their health, and prioritizes vending out connections to healthy backends - It should be re-usable for both our database and progenitor clients (using a different "backend connector", but the same core library and DNS resolution mechanism). Fixes #4192 Part of #3763 (fixes CRDB portion)

@jmpesp

The meat of this PR is the change in implementation of `get_pantry_address`: instead of asking our internal DNS resolver to look up a crucible pantry (which does not randomize, so in practice we always get whichever pantry the DNS server listed first), we ask a Qorb connection pool for the address of a healthy client. `get_pantry_address` itself does not use the client directly and only cares about its address, but the pool does keep a client around so that it can call `pantry_status()` as a health check. (It doesn't look at the contents of the result; only whether or not the request succeeded - @jmpesp if that should be more refined, please say so.) This partially addresses #3763; once this lands, if a pantry is down or unhealthy but still present in DNS (i.e., not expunged), Qorb + the status health checks should mean we'll pick a different pantry for new operations, instead of the current behavior of always sticking to the first pantry in DNS. --------- Co-authored-by: Sean Klein <[email protected]>

…es (#6836) This is a much smaller change than the diff stat implies; most of the changes are expectorate outputs because the example system we set up for tests now includes Crucible pantry zones, which shifted a bunch of other zone UUIDs. Fully supporting Crucible pantry replacement depends on #3763, which I'm continuing to work on. But the reconfigurator side of "start new pantries" is about as trivial as things go and does not depend on #3763, hence this PR.

askfongjojo added this to the 1.0.1 milestone Jul 25, 2023

askfongjojo changed the title ~~Nexus holds on to a specific service backends of crdb and crucible-pantry for too long~~ Nexus holds on to specific service backends of crdb and crucible-pantry for too long Jul 27, 2023

smklein mentioned this issue Jul 27, 2023

[nexus] Implement on-boot DB Schema Migration #3760

Merged

luqmana mentioned this issue Jul 27, 2023

nexus: use all CockroachDB hosts from DNS to create DB connection URL. #3783

Merged

askfongjojo modified the milestones: 1.0.1, 1.0.2 Jul 28, 2023

askfongjojo modified the milestones: 1.0.2, 1.0.3 Aug 22, 2023

askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023

morlandi7 modified the milestones: 3, MVP Oct 4, 2023

This was referenced Apr 30, 2024

Allow the r2d2 ConnectionManager to change URL dynamically diesel-rs/diesel#4005

Merged

Add API to update database URL oxidecomputer/async-bb8-diesel#62

Open

smklein mentioned this issue Jun 25, 2024

Qorb integration as connection pool for database #5876

Merged

morlandi7 modified the milestones: MVP, 10 Jul 2, 2024

morlandi7 modified the milestones: 10, 11 Aug 12, 2024

davepacheco removed this from the 11 milestone Aug 21, 2024

jgallagher mentioned this issue Aug 22, 2024

Oximeter collector needs a way to lookup ClickHouse more often #6407

Closed

davepacheco added this to the 12 milestone Oct 4, 2024

davepacheco assigned jgallagher Oct 4, 2024

smklein added the qorb Connection Pooling label Oct 8, 2024

This was referenced Oct 10, 2024

Use a Qorb pool to choose pantry clients #6822

Merged

[reconfigurator] Add planner support for starting new Crucible pantries #6836

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

askfongjojo commented Jul 25, 2023

smklein commented Jul 25, 2023

askfongjojo commented Jul 25, 2023

jmpesp commented Jul 26, 2023

askfongjojo commented Jul 26, 2023 •

edited

Loading

luqmana commented Jul 31, 2023

smklein commented Apr 29, 2024

smklein commented Apr 29, 2024

luqmana commented Apr 29, 2024

smklein commented Apr 29, 2024

luqmana commented Apr 29, 2024 via email

smklein commented Apr 29, 2024 •

edited

Loading

smklein commented Apr 29, 2024

luqmana commented Apr 30, 2024

smklein commented Apr 30, 2024

smklein commented Apr 30, 2024

luqmana commented Apr 30, 2024

smklein commented Apr 30, 2024

davepacheco commented Apr 30, 2024

askfongjojo commented Jun 21, 2024 •

edited

Loading

smklein commented Jun 25, 2024

Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

Comments

askfongjojo commented Jul 25, 2023

smklein commented Jul 25, 2023

askfongjojo commented Jul 25, 2023

jmpesp commented Jul 26, 2023

askfongjojo commented Jul 26, 2023 • edited Loading

luqmana commented Jul 31, 2023

smklein commented Apr 29, 2024

Where the System Currently Stands

What do we want to do

Options

Specify a host address in the form of a hostname, use DNS to resolve

Specify multiple hosts during nexus-side construction of postgresql:// URL

smklein commented Apr 29, 2024

Setup

luqmana commented Apr 29, 2024

smklein commented Apr 29, 2024

luqmana commented Apr 29, 2024 via email

smklein commented Apr 29, 2024 • edited Loading

smklein commented Apr 29, 2024

luqmana commented Apr 30, 2024

smklein commented Apr 30, 2024

smklein commented Apr 30, 2024

luqmana commented Apr 30, 2024

smklein commented Apr 30, 2024

davepacheco commented Apr 30, 2024

askfongjojo commented Jun 21, 2024 • edited Loading

smklein commented Jun 25, 2024

askfongjojo commented Jul 26, 2023 •

edited

Loading

Specify multiple hosts during nexus-side construction of `postgresql://` URL

smklein commented Apr 29, 2024 •

edited

Loading

askfongjojo commented Jun 21, 2024 •

edited

Loading