
Nexus holds on to specific service backends of crdb and crucible-pantry for too long #3763

Open
askfongjojo opened this issue Jul 25, 2023 · 20 comments

@askfongjojo

While testing service failover on rack2, I noticed that Nexus held on to the same CockroachDB backend without attempting to use any of the other nodes in the 5-node database cluster, causing requests to fail until the one it favored came back.

The same happened with pantry requests for disk import blocks / bulk writes. I never got to the point of seeing the TTL exhausted: I tried waiting for up to 5 minutes, and the requests still couldn't succeed until the pantry zone Nexus had been using prior to its outage came back up.

@smklein
Collaborator

smklein commented Jul 25, 2023

I think ignoring the TTL makes sense (not that it's right, just that it makes sense) in the context of this code:

let url = match &config.deployment.database {
    nexus_config::Database::FromUrl { url } => url.clone(),
    nexus_config::Database::FromDns => {
        info!(log, "Accessing DB url from DNS");
        let address = loop {
            match resolver
                .lookup_socket_v6(ServiceName::Cockroach)
                .await
            {
                Ok(address) => break address,
                Err(e) => {
                    warn!(
                        log,
                        "Failed to lookup cockroach address: {e}"
                    );
                    tokio::time::sleep(std::time::Duration::from_secs(
                        1,
                    ))
                    .await;
                }
            }
        };
        info!(log, "DB address: {}", address);
        PostgresConfigWithUrl::from_str(&format!(
            "postgresql://root@{address}/omicron?sslmode=disable",
        ))
        .map_err(|e| format!("Cannot parse Postgres URL: {}", e))?
    }
};

Our API to interface with CockroachDB from Nexus involves "constructing a URL", and right now, we embed a single point-in-time address of CockroachDB into this URL.

We should probably avoid doing this and instead embed the service name into the URL, so it actually uses DNS during later lookups.

@askfongjojo
Author

Understood. The current behavior, in conjunction with #3613, results in a 1 in 15 probability that a user will not be able to use the system at all. I happened to be in this situation because my workstation always used the one Nexus that was stuck on the first CRDB backend I brought down to test failover. :(

@askfongjojo askfongjojo added this to the 1.0.1 milestone Jul 25, 2023
@jmpesp
Contributor

jmpesp commented Jul 26, 2023

The same happened with pantry requests for disk import blocks / bulk writes. I never got to the point of seeing the TTL exhausted: I tried waiting for up to 5 minutes, and the requests still couldn't succeed until the pantry zone Nexus had been using prior to its outage came back up.

In this case, there's no TTL: a specific pantry from the list of available ones is selected once and used for a disk until that disk is finalized.

One thing that could be done is to check if the selected pantry is still responsive and choose another one if not, but it was important to keep as much code out of the import chunk hot path as possible, since any checks there are multiplied by the number of chunks to import and slow imports down. That being said, slow imports are better than non-working imports :) I'll give this some thought.
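One way to keep that check out of the per-chunk path is to do it once when a pantry is first selected for a new disk. Below is a rough sketch of that idea; `PantryClient` and its `status` call are hypothetical placeholders, not the real Crucible pantry client API, and the actual Nexus selection logic isn't shown.

```rust
use std::net::SocketAddrV6;

// Hypothetical handle for talking to a pantry zone; the only assumption is
// that some cheap status/health call exists.
struct PantryClient {
    address: SocketAddrV6,
}

impl PantryClient {
    // Hypothetical health probe; Ok(()) means the pantry responded.
    async fn status(&self) -> Result<(), String> {
        // ... issue a lightweight request to `self.address` here ...
        Ok(())
    }
}

// Pick the first responsive pantry for a *new* disk. This runs once per disk,
// outside the per-chunk import hot path, so the extra round trips are bounded.
async fn pick_responsive_pantry(
    candidates: &[SocketAddrV6],
) -> Option<SocketAddrV6> {
    for &address in candidates {
        let client = PantryClient { address };
        if client.status().await.is_ok() {
            return Some(address);
        }
    }
    None
}
```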

@askfongjojo
Author

askfongjojo commented Jul 26, 2023

Sorry for not being clear about the pantry case. I understand the need to stay with the same pantry for the same disk snapshot. But the issue I ran into is that new/separate snapshot requests were still holding on to the unresponsive one.

@askfongjojo askfongjojo changed the title Nexus holds on to a specific service backends of crdb and crucible-pantry for too long Nexus holds on to specific service backends of crdb and crucible-pantry for too long Jul 27, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.1, 1.0.2 Jul 28, 2023
luqmana added a commit that referenced this issue Jul 31, 2023
#3783)

First pass at #3763 for crdb.

Even though we did query internal DNS, we were previously using only a
single host as part of connecting to crdb from Nexus. And since the
internal DNS server always returns records in the same order, that meant
every Nexus instance was always using the same CockroachDB instance even
now that we've been provisioning multiple. This also meant if that CRDB
instance went down we'd be hosed (as seen in #3763).

To help with that, this PR changes Nexus to use all the CRDB hosts
reported via internal DNS when creating the connection URL. There are
some comments in the code; this is still not quite as robust as we could
be, but short of something cueball-like it's still an improvement.

To test, I disabled the initial crdb instance Nexus connected to, and it was
able to recover by connecting to the next crdb instance and continue serving
requests. From the log we can see a successful query, connection errors
once I disabled `fd00:1122:3344:101::5`, and then a successful query
with the connection reestablished to the next crdb instance
(`fd00:1122:3344:101::3`):
```
23:43:24.729Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.729Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:24.730Z ERRO 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:101::5]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:101::6]:32221,[fd00:1122:3344:101::4]:32221,[fd00:1122:3344:101::7]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
23:43:30.803Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): roles
    roles = RoleSet { roles: {} }
23:43:30.804Z DEBG 7be03b0d-48bf-4f43-a11e-7303236a3c5e (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
```
@luqmana
Contributor

luqmana commented Jul 31, 2023

#3783 partially addresses this for crdb. We now grab all the cockroachdb hosts via internal DNS at Nexus startup and add them all to the connection string. Whenever a new connection is established, it'll try the listed hosts in order and use the first one that successfully connects. While that's an improvement in that Nexus won't fail to serve requests if one crdb instance goes down, there are still some issues:

  1. The list of hosts is resolved and fixed at Nexus startup. If new cockroachdb zones get spun up, it'll take a restart of Nexus to pick them up. As-is, I don't think we're yet dynamically spinning up new services anyway; we just create the service plan during RSS.
  2. Internal DNS returns records in the same order every time (couldn't find an existing issue so I created one: Internal DNS should permute the order in which it responds for entries with multiple records. #3805). This means every Nexus zone tries to connect to the set of CRDB hosts in the same order.
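Until the DNS server itself permutes its answers (#3805), a client-side shuffle of the resolved addresses would at least spread different Nexus instances across the CRDB backends. A minimal sketch, assuming the `Vec<SocketAddrV6>` returned by the existing `lookup_all_socket_v6` call and the rand crate's 0.8-style API:

```rust
use rand::seq::SliceRandom;
use std::net::SocketAddrV6;

// Shuffle the addresses returned by internal DNS before building the
// connection URL, so every Nexus doesn't pile onto whichever CRDB host the
// DNS server happens to list first.
fn shuffled_addrs(mut addrs: Vec<SocketAddrV6>) -> Vec<SocketAddrV6> {
    addrs.shuffle(&mut rand::thread_rng());
    addrs
}
```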

@askfongjojo askfongjojo modified the milestones: 1.0.2, 1.0.3 Aug 22, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023
@morlandi7 morlandi7 modified the milestones: 3, MVP Oct 4, 2023
@smklein
Collaborator

smklein commented Apr 29, 2024

I'm poking at this issue again, now that we're looking at possibly expunging zones which are running CRDB nodes.

Where the System Currently Stands

CockroachDB nodes, when initializing, access internal DNS to find the IP addresses of other nodes within the cluster:

# We need to tell CockroachDB the DNS names or IP addresses of the other nodes
# in the cluster. Look these up in internal DNS. Per the recommendations in
# the CockroachDB docs, we choose at most five addresses. Providing more
# addresses just increases the time for the cluster to stabilize.
JOIN_ADDRS="$(/opt/oxide/internal-dns-cli/bin/dnswait cockroach \
| head -n 5 \
| tr '\n' ,)"

Nexus, when creating its connection pool, performs a "one-time DNS lookup" here:

nexus_config::Database::FromDns => {
    info!(log, "Accessing DB url from DNS");
    // It's been requested but unfortunately not supported to
    // directly connect using SRV based lookup.
    // TODO-robustness: the set of cockroachdb hosts we'll use will
    // be fixed to whatever we got back from DNS at Nexus start.
    // This means a new cockroachdb instance won't picked up until
    // Nexus restarts.
    let addrs = loop {
        match resolver
            .lookup_all_socket_v6(ServiceName::Cockroach)
            .await
        {
            Ok(addrs) => break addrs,
            Err(e) => {
                warn!(
                    log,
                    "Failed to lookup cockroach addresses: {e}"
                );
                tokio::time::sleep(std::time::Duration::from_secs(
                    1,
                ))
                .await;
            }
        }
    };
    let addrs_str = addrs
        .iter()
        .map(ToString::to_string)
        .collect::<Vec<_>>()
        .join(",");
    info!(log, "DB addresses: {}", addrs_str);
    PostgresConfigWithUrl::from_str(&format!(
        "postgresql://root@{addrs_str}/omicron?sslmode=disable",
    ))
    .map_err(|e| format!("Cannot parse Postgres URL: {}", e))?

So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

  • The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
  • The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

What do we want to do

I've tried to dig into the CockroachDB and Postgres docs to explore our options in this area when constructing the postgresql:// URL from the Nexus side. The end-goal here would be to have an implementation where Nexus can easily migrate from querying one CockroachDB node to the next, as we continue removing old nodes and adding new ones.

Options

Specify a host address in the form of a hostname, use DNS to resolve

It should hopefully be possible to provide a hostname in the postgresql:// URL which identifies the CockroachDB service, and which uses our internal DNS server to resolve the name of the host.

As far as I can tell -- feedback welcome if folks see alternate pathways -- the mechanism to point postgres clients at a particular DNS server is by setting the resolv.conf file with the IP address of the desired internal DNS server.

For Nexus, this would mean:

  • Setting up an /etc/resolv.conf file which includes a nameserver entry pointing to the IP address of our internal DNS server
  • Updating the postgresql:// URL to refer to the Cockroach service, and relying on the nameserver to perform translation.

Subsequent experimentation is necessary to determine how failures are propagated back to Nexus when some of these nodes die, and to identify if new nodes are actually made accessible.

Specify multiple hosts during nexus-side construction of postgresql:// URL

It is possible to specify multiple hostnames, according to libpq:

See: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS

Pros: Would give us redundancy from single-node failure
Cons: Would not actually update the set of "all known CRDB nodes", so after enough failures, this would also fall over

Experimentally, I spun up a three-node cluster, killed one, and observed via a psql shell that I could re-connect to another. However, it's also notable that this failure is visible to the client - my postgres client sent a request that failed (to the dead node) before reconnecting to a live one, and this process was not automated.
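For reference, the same try-hosts-in-order behavior is available from Rust via tokio-postgres, whose connect logic walks the configured host list until one accepts a connection. This is only a sketch; the hosts, ports, and database name are placeholders rather than real deployment values:

```rust
use tokio_postgres::{Client, Config, NoTls};

// Multi-host fallback: each host/port pair is tried in order until one
// accepts a connection, similar to libpq's multiple-hosts support.
async fn connect_to_any_crdb_node() -> Result<Client, tokio_postgres::Error> {
    let mut config = Config::new();
    config.user("root").dbname("omicron");
    for port in [7709, 7710, 7711] {
        // Placeholder local nodes; in practice these would come from DNS.
        config.host("localhost").port(port);
    }
    let (client, connection) = config.connect(NoTls).await?;
    // tokio-postgres hands back a Connection future that must be driven
    // separately from the Client.
    tokio::spawn(connection);
    Ok(client)
}
```

As with the psql experiment above, a failed node still surfaces as an error on any in-flight connection; the fallback only helps when establishing new connections.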

@smklein
Collaborator

smklein commented Apr 29, 2024

Here's the result of some of my experimentation. I'd really rather have a local test here - that's what I was trying to build out in #5628 - but it's quite difficult to create all this in an isolated environment, since changing the DNS server used by postgres relies on changing "system-wide" config in /etc/resolv.conf.

Setup

I spun up a DNS server within Omicron, sitting on port 53:

sudo ./target/debug/dns-server --http-address [::1]:0 --dns-address [::1]:53 --config-file $PWD/dns-server/examples/config.toml

I then spun up three CockroachDB nodes talking to each other within a cluster. These are all on localhost, on ports 7709, 7710, and 7711.

cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb1 --listen-addr [::1]:7709 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb2 --listen-addr [::1]:7710 --http-addr :0
cockroach start --insecure --join [::1]:7709,[::1]:7710,[::1]:7711 --store /var/tmp/crdb3 --listen-addr [::1]:7711 --http-addr :0
cockroach init --insecure --host [::]:7709

Then I populated my internal DNS server with these records:

# SRV records
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test
./target/debug/dnsadm -a "[::1]:45901" add-srv control-plane.oxide.test _cockroach._tcp 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test 

# AAAA records
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test ac33791c-62c6-43b0-bcbd-b15e7727b533.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host ::1
./target/debug/dnsadm -a "[::1]:45901" add-aaaa control-plane.oxide.test c6cda479-5fde-49a0-a079-7c960022baff.host ::1

Next, I added my nameserver running on localhost to /etc/resolv.conf:

# added to `resolv.conf`
nameserver ::1

To check that DNS is up and running, I used dig:

 $ dig _cockroach._tcp.control-plane.oxide.test

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> _cockroach._tcp.control-plane.oxide.test
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38788
;; flags: qr rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.test. IN A

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7709 c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7710 ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test.
_cockroach._tcp.control-plane.oxide.test. 0 IN SRV 0 0 7711 6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test.

;; ADDITIONAL SECTION:
c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test. 0 IN AAAA ::1
ac33791c-62c6-43b0-bcbd-b15e7727b533.host.control-plane.oxide.test. 0 IN AAAA ::1
6eb5fbb1-fa70-4ee9-aabf-53c450e138f7.host.control-plane.oxide.test. 0 IN AAAA ::1

;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Mon Apr 29 14:04:58 PDT 2024
;; MSG SIZE  rcvd: 400

Which looks like I'd expect -- I'm seeing those 7709 - 7711 ports in the SRV records, and a bunch of references to ::1 in the AAAA records.

I can use the Cockroach shell to connect directly to a node via IP:

cockroach sql --url "postgresql://root@[::1]:7709?sslmode=disable"

However, using a hostname appears to be hitting issues:

 $ cockroach sql --url "postgresql://root@_cockroach._tcp.control-plane.oxide.test?sslmode=disable"
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

dial tcp: lookup _cockroach._tcp.control-plane.oxide.test: no such host
Failed running "sql"

@luqmana
Contributor

luqmana commented Apr 29, 2024

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag, --experimental-srv-dns, depending on the version?). But AFAIK that's just limited to the initial bootstrapping, and I'm unsure how it deals with the set changing at runtime.

@smklein
Collaborator

smklein commented Apr 29, 2024

Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag, --experimental-srv-dns, depending on the version?). But AFAIK that's just limited to the initial bootstrapping, and I'm unsure how it deals with the set changing at runtime.

Thanks for the pointer, I'll look into this flag! To be clear, that would be for CockroachDB nodes to connect to each other using DNS, right?

Just to be clear, that's distinct from any attempts by e.g. Nexus to use a libpq client to connect to CockroachDB.

@luqmana
Contributor

luqmana commented Apr 29, 2024 via email

@smklein
Collaborator

smklein commented Apr 29, 2024

Here's my follow-up -- I'm digging into libpq to try to figure out why DNS resolution isn't happening from Nexus, or any of the psql / cockroachdb sql clients:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L872

It appears the PQConnectStart method parses connection options, then calls pqConnectDBStart to actually try connecting to the backend. This itself calls PQconnectPoll after setting up some config options (most importantly, setting status to CONNECTION_NEEDED).

I'm pretty sure I'm bottoming out here, because this matches the error I was seeing:

https://github.com/postgres/postgres/blob/dd0183469bb779247c96e86c2272dca7ff4ec9e7/src/interfaces/libpq/fe-connect.c#L2737-L2747

Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records -- this appears to match my experiments locally, where I spun up a small C Program, and could read from the "node name" of c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test, but not _cockroach._tcp.control-plane.oxide.test. This does, admittedly, defeat the point of doing DNS lookup, since that AAAA record would presumably change when the CRDB service is provisioned to a new node.

Out of curiosity, I dug into tokio-postgres too, to see if they did any better than libpq:

https://github.com/sfackler/rust-postgres/blob/98f5a11bc0a8e451552d8941ffa078c7eb6cd60c/tokio-postgres/src/connect.rs#L98

This calls into https://docs.rs/tokio/latest/tokio/net/fn.lookup_host.html, which also appears (through local testing) to "work with AAAA records when you also supply the port, but not with SRV records". This pretty much matches the getaddrinfo client issues libpq is facing.
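For concreteness, here's roughly what that local test looks like. This is just a sketch reproducing the behavior described above (the hostnames are the ones from the earlier dig output), not a statement about documented SRV support:

```rust
use tokio::net::lookup_host;

// `lookup_host` goes through getaddrinfo-style resolution, so it only
// understands names with A/AAAA records plus an explicit port, not SRV names.
async fn demo() -> std::io::Result<()> {
    // Works: the per-node name has an AAAA record and we supply the port.
    let node =
        "c6cda479-5fde-49a0-a079-7c960022baff.host.control-plane.oxide.test:7709";
    for addr in lookup_host(node).await? {
        println!("resolved backend: {addr}");
    }

    // Fails: there is no A/AAAA record for the SRV name, so resolution
    // reports "no such host" instead of returning the SRV targets/ports.
    let srv = "_cockroach._tcp.control-plane.oxide.test:0";
    assert!(lookup_host(srv).await.is_err());
    Ok(())
}
```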

@smklein
Collaborator

smklein commented Apr 29, 2024

I'm kinda getting the sense we have two options for a path forward here:

  1. Nexus resolves IP addresses by doing the SRV lookup itself. This is, more or less, what we're doing today. To tolerate CockroachDB node failures, however, we should strengthen this approach a bit by providing multiple hosts pointing at multiple CRDB nodes, and by periodically refreshing our connections to postgres with an updated set of hosts that we re-acquire from DNS, again by doing the SRV lookups ourselves.

There are some drawbacks to this approach that we would need to work through:

  • Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.
  2. We patch libpq to add lookup via SRV. We aren't actually the first ones to propose this, but I think it would require us to do some upstream work, which may or may not be accepted, and would definitely introduce a delay to deployment.

@luqmana
Contributor

luqmana commented Apr 30, 2024

Currently, the connection manager APIs used by Nexus (and Diesel) are static -- the postgresql:// URL is supplied once, and exists for the duration of the connection pool. To implement something like this on top of such an API, I think we'd need to create a new connection pool and switch new connections over to it. Alternatively, we could modify the connection pooling libraries to change the URL which we use for connections dynamically, and internalize this detail to the pool itself.

Right, last I looked it involved changes across multiple crates:

We create an async-bb8-diesel::ConnectionManager with a fixed database URL string. That wraps diesel::r2d2::ConnectionManager and implements bb8::ManageConnection. Inside that is the blocking call to the underlying r2d2::ManageConnection::connect, which is implemented directly in diesel and calls Connection::establish with the fixed URL we created at startup.
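For illustration, here's the rough shape of the "get the URL dynamically inside connect" alternative, written against bb8 and tokio-postgres rather than the diesel-based stack Nexus actually uses; `resolve_crdb_url` is a hypothetical stand-in for the SRV lookup discussed above:

```rust
use async_trait::async_trait;
use bb8::ManageConnection;
use tokio_postgres::{Client, Config, NoTls};

// Hypothetical stand-in for the internal-DNS SRV lookup; a real version
// would return the current set of CRDB backends.
async fn resolve_crdb_url() -> String {
    "postgresql://root@localhost:7709,localhost:7710,localhost:7711/omicron?sslmode=disable"
        .to_string()
}

struct DnsAwareManager;

#[async_trait]
impl ManageConnection for DnsAwareManager {
    type Connection = Client;
    type Error = tokio_postgres::Error;

    async fn connect(&self) -> Result<Self::Connection, Self::Error> {
        // Re-resolve on every connection attempt so new or removed CRDB
        // nodes are picked up without restarting the pool.
        let config: Config = resolve_crdb_url().await.parse()?;
        let (client, connection) = config.connect(NoTls).await?;
        tokio::spawn(connection);
        Ok(client)
    }

    async fn is_valid(&self, conn: &mut Self::Connection) -> Result<(), Self::Error> {
        conn.simple_query("SELECT 1").await.map(|_| ())
    }

    fn has_broken(&self, conn: &mut Self::Connection) -> bool {
        conn.is_closed()
    }
}
```

A pool built from this manager (e.g. `bb8::Pool::builder().build(DnsAwareManager).await`) would pick up DNS changes on each new connection, though connections already established to a removed node would still need to be cycled out.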

@smklein
Collaborator

smklein commented Apr 30, 2024

I'll see if I can modify Diesel and async-bb8-diesel to make that URL mutable. That seems like it would help us form a foundation to modify things from nexus, if we are unable to rely on libpq to act as a DNS client on our behalf.

@smklein
Collaborator

smklein commented Apr 30, 2024

Okay, seems possible to let the database URL be mutable from Nexus. That would at least let Nexus update the "set of valid CRDB nodes" with the latest info it knows about.

See: oxidecomputer/async-bb8-diesel#62 , diesel-rs/diesel#4005

@luqmana
Contributor

luqmana commented Apr 30, 2024

Nice! One thing, though: with that we'd need to explicitly call it for every Nexus every time the set of CRDB nodes changes, compared to changing connect to somehow get the URL dynamically, where we could just query DNS.

@smklein
Collaborator

smklein commented Apr 30, 2024

Nice! One thing, though: with that we'd need to explicitly call it for every Nexus every time the set of CRDB nodes changes, compared to changing connect to somehow get the URL dynamically, where we could just query DNS.

Totally, but this was my takeaway from reading libpq's source: someone on the client-side needs to make a decision to go query DNS, get a new set of nodes, and update the set of IPs we end up talking to. If libpq handled this for us, that would be nice, but if it doesn't, I think it ends up being functionally somewhat similar for Nexus to take this responsibility too.

Arguably, I think Nexus could be more optimal here, since it can perform this action as a downstream operation of an RPW to avoid unnecessary queries to DNS until the set is known to have changed.
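A rough sketch of what that RPW-driven refresh could look like, assuming the hypothetical `set_url` mutation from oxidecomputer/async-bb8-diesel#62 / diesel-rs/diesel#4005 and reusing the existing SRV lookup; none of these names are the real Nexus types:

```rust
use std::net::SocketAddrV6;
use std::sync::Arc;
use std::time::Duration;

// Hypothetical pool handle exposing the URL setter; the real type and
// method names in async-bb8-diesel/diesel may differ.
struct DbPool;
impl DbPool {
    fn set_url(&self, _url: String) { /* swap the URL used for new connections */ }
}

// Hypothetical stand-in for the internal-DNS lookup of ServiceName::Cockroach.
async fn resolve_crdb_hosts() -> Result<Vec<SocketAddrV6>, String> {
    Ok(Vec::new())
}

// Background task: only rebuild the URL when the backend set actually
// changes, i.e. run this downstream of the RPW rather than per query.
async fn refresh_db_url(pool: Arc<DbPool>) {
    let mut last_hosts: Vec<SocketAddrV6> = Vec::new();
    loop {
        if let Ok(hosts) = resolve_crdb_hosts().await {
            if hosts != last_hosts {
                let joined = hosts
                    .iter()
                    .map(ToString::to_string)
                    .collect::<Vec<_>>()
                    .join(",");
                pool.set_url(format!(
                    "postgresql://root@{joined}/omicron?sslmode=disable"
                ));
                last_hosts = hosts;
            }
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```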

@davepacheco
Collaborator

This use case is really similar to the case of managing HTTP clients for our internal services: there's DNS resolution that ideally would be decoupled from connection establishment, and connection establishment that would be decoupled from usage. This is another case where we'll want a more sophisticated connection management component, which likely means building our own pool to better control the behavior here. We basically decided in the update call today to pursue this, so the rest of this might be moot. But in case it's useful, here are responses to a few things above.


So, specifically eyeing the case of "a single CockroachDB node fails, what do we do?":

  • The CockroachDB nodes themselves should be resilient. I've tested this manually with a three-node setup, and as expected, forcefully killing one node means that the others remain responsive to requests.
  • The Nexus connection pool may become unresponsive, depending on which IP address was returned out of the original DNS lookup.

This is a good summary. The CockroachDB part (in the start method) was explicitly designed this way in order to stay in sync with the latest cluster topology and I'd be more surprised if it didn't work! But we've known from the start that there was at least some work to be done to make the pool survive database failures better, and likely we were going to have to do our own pool.


Resolving using SRV records doesn't work with the cli (cockroachdb/cockroach#64439).
...
Under the hood, this appears to be calling getaddrinfo. I believe this is only compatible with A/AAAA records, and cannot properly parse SRV records

For what it's worth, I've almost never seen non-application-specific components do SRV lookups.

But you should be able to have the nodes discover each other via DNS without explicitly listing them out for --join (though that's perhaps behind another flag, --experimental-srv-dns, depending on the version?). But AFAIK that's just limited to the initial bootstrapping, and I'm unsure how it deals with the set changing at runtime.

I think we've basically solved this part already (see above). We considered --experimental-dns-srv at the time but ruled it out, if for no other reason than that it was documented as unstable and subject to removal or change. I feel like I also found some problem with its behavior but I do not remember what it was. Thinking out loud: such a facility would need to use the right DNS servers (have we set up /etc/resolv.conf correctly?) and deal well with those being down; and importantly, if any of it fails, it should sit there waiting for the right records to show up, not just give up and exit. It may have been this last one that was the sticking point.


I'm kinda getting the sense we have two options for a path forward here:

  1. Nexus resolves IP addresses by doing the SRV lookup itself...
  2. We patch libpq to add lookup via SRV...

It makes sense to me that the DNS resolution and even TCP connection establishment behavior would ultimately be application-specific and would happen outside libpq. While it's no doubt convenient that clients like tokio_postgres allow you to specify a DNS name at all, that falls apart in situations like ours because there are so many policy choices hardcoded here that we'd probably want to change: which nameservers to use? how many are queried? with what concurrency? on failure, do we fail or retry? how many times? how often? Then most of those same questions apply again for TCP connection establishment. And the challenge is that we likely want to make these choices at a high layer, not deep in tokio_postgres when we're in the process of trying to establish a connection for a client that's currently waiting on it.

@askfongjojo
Author

askfongjojo commented Jun 21, 2024

In my most recent testing, the behavior for CRDB connection failures has changed (or maybe this was always the expected behavior?). When the crdb node in use went down, I saw messages like this in the Nexus logs:

17:42:35.701Z ERRO 2898657e-4141-4c05-851b-147bffc6bbbd (ServerContext): database connection error
    database_url = postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
    error_message = Connection error: server is shutting down
    file = nexus/db-queries/src/db/pool.rs:100

(Note: The crdb instance in use is not necessarily the first one listed; in my case, disabling the 105 instance is what triggered the error.)

Nexus was able to continue serving both read and update API requests in spite of the above error. It's possible that any ongoing requests against the crdb node right at the moment it's going down could have failed but all requests I made afterwards succeeded.

@davepacheco took a look at the code to understand what changed the behavior. Here are his comments:

I didn't think we'd changed anything here since #3763 was filed, but I see there was #3763 (comment), which maybe fixed this? Subsequent comments suggested it might not have, but it seems like it should have. Anyway, it's good that it worked.

@smklein
Collaborator

smklein commented Jun 25, 2024

Just to be explicit: #5876 will fix this issue for CockroachDB -- in that PR, we use https://github.com/oxidecomputer/qorb to access each CRDB node individually, and create a pool of connections for each one.

  • If a CRDB node starts returning errors, we reduce our usage of it until it looks healthy
  • If a CRDB node is removed from DNS, we stop using it
  • If a new CRDB node shows up in DNS, we start using it
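For readers unfamiliar with the approach, the rough shape of that behavior looks like the sketch below. This is only an illustration of the per-backend idea, not qorb's actual API or implementation:

```rust
use std::collections::HashMap;
use std::net::SocketAddrV6;

// One small pool per CRDB backend discovered in DNS.
struct BackendPool {
    healthy: bool,
    // ...connections to this specific backend would live here...
}

struct MultiBackendPool {
    backends: HashMap<SocketAddrV6, BackendPool>,
}

impl MultiBackendPool {
    // Called whenever a DNS (SRV) lookup completes: add newly seen backends,
    // drop the ones that no longer appear in DNS.
    fn update_from_dns(&mut self, addrs: &[SocketAddrV6]) {
        self.backends.retain(|addr, _| addrs.contains(addr));
        for &addr in addrs {
            self.backends
                .entry(addr)
                .or_insert(BackendPool { healthy: true });
        }
    }

    // Vend a backend, preferring ones that look healthy; fall back to
    // anything we have rather than failing outright.
    fn pick(&self) -> Option<&BackendPool> {
        self.backends
            .values()
            .find(|b| b.healthy)
            .or_else(|| self.backends.values().next())
    }
}
```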

@morlandi7 morlandi7 modified the milestones: MVP, 10 Jul 2, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 12, 2024
@davepacheco davepacheco removed this from the 11 milestone Aug 21, 2024
smklein added a commit that referenced this issue Aug 27, 2024
Replaces all usage of bb8 with a new connection pooling library called
[qorb](https://github.com/oxidecomputer/qorb).

qorb, detailed in RFD 477, provides the following benefits over bb8:
- It allows lookup of multiple backends via DNS SRV records
- It dynamically adjusts the number of connections to each backend based
on their health, and prioritizes vending out connections to healthy
backends
- It should be re-usable for both our database and progenitor clients
(using a different "backend connector", but the same core library and
DNS resolution mechanism).

Fixes #4192
Part of #3763 (fixes CRDB portion)
@davepacheco davepacheco added this to the 12 milestone Oct 4, 2024
@smklein smklein added the qorb Connection Pooling label Oct 8, 2024
jgallagher added a commit that referenced this issue Oct 11, 2024
The meat of this PR is the change in implementation of
`get_pantry_address`: instead of asking our internal DNS resolver to
look up a crucible pantry (which does not randomize, so in practice we
always get whichever pantry the DNS server listed first), we ask a Qorb
connection pool for the address of a healthy client.
`get_pantry_address` itself does not use the client directly and only
cares about its address, but the pool does keep a client around so that
it can call `pantry_status()` as a health check. (It doesn't look at the
contents of the result; only whether or not the request succeeded -
@jmpesp if that should be more refined, please say so.)

This partially addresses #3763; once this lands, if a pantry is down or
unhealthy but still present in DNS (i.e., not expunged), Qorb + the
status health checks should mean we'll pick a different pantry for new
operations, instead of the current behavior of always sticking to the
first pantry in DNS.

---------

Co-authored-by: Sean Klein <[email protected]>
jgallagher added a commit that referenced this issue Oct 14, 2024
…es (#6836)

This is a much smaller change than the diff stat implies; most of the
changes are expectorate outputs because the example system we set up for
tests now includes Crucible pantry zones, which shifted a bunch of other
zone UUIDs.

Fully supporting Crucible pantry replacement depends on #3763, which I'm
continuing to work on. But the reconfigurator side of "start new
pantries" is about as trivial as things go and does not depend on #3763,
hence this PR.