Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cockroach version panic in schema change #83864

Closed
davepacheco opened this issue Jul 5, 2022 · 11 comments
Closed

cockroach version panic in schema change #83864

davepacheco opened this issue Jul 5, 2022 · 11 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner

Comments

@davepacheco
Copy link

davepacheco commented Jul 5, 2022

(originally filed as #83706, moving here as requested)

Describe the problem

While trying to reproduce #82958, I found a case where cockroach version exited with status 2. Thinking maybe it would be reproducible, I started running that in a loop. I saw this panic:

panic: invalid predicate with signature *fmt.pp != func(*scpb.Column, scplan.Params) bool for *scpb.Column[0]

goroutine 1 [running]:
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan.buildSchemaChangeOpGenFunc(0x703b2a0, 0x0, 0xc000e0aff0, 0xc000e0b050, 0xc000e08c18)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/declarative.go:171 +0xd93
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan.buildSchemaChangePlanner(0xc000e0ab40, 0xc000e0ab40)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan/declarative.go:60 +0x125
github.com/cockroachdb/cockroach/pkg/sql/schemachanger/scplan.init()
	/ws/gc/cockroach/cache/gopath/src/github.co

To Reproduce

Just run cockroach version in a loop unti it exits. At first, I just did:

for ((i = 0; ; i++)) { echo "ATTEMPT $i"; cockroach version || break; }

Now I'm using this:

#!/bin/bash

# attempt to reproduce failure from `cockroach version`

# specifies a specific `cockroach` binary
export PATH="$HOME/tools/cockroachdb-v21.2.9/bin:$PATH"

date +%FT%TZ > start_time
which cockroach > which_cockroach

for ((i = 0; ; i++)) {
	echo "ATTEMPT $i"
	echo $i > attempt_number
	cockroach version > stdout 2>stderr
	rv=$?
	if [[ $rv -ne 0 ]]; then
		echo $rv > exit_status
		echo "exited with status $rv";
		break;
	fi
}

Unfortunately I lost the status code but I've fixed the above script to avoid that.

It took just over 53h and 1.4M iterations to hit this.

Expected behavior

This should run indefinitely without issue.

Environment:

$ cockroach version
Build Tag:        v21.2.9
Build Time:       2022/04/28 04:02:42
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.16.10
C Compiler:       gcc 10.3.0
Build Commit ID:  11787edfcfc157a0df951abc34684e4e18b3ef20
Build Type:       release

This is on helios helios-1.0.21004.

CC @knz

Jira issue: CRDB-17319

@davepacheco davepacheco added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 5, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 5, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added A-disaster-recovery O-community Originated from the community X-blathers-triaged blathers was able to find an owner T-disaster-recovery labels Jul 5, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 5, 2022

cc @cockroachdb/bulk-io

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

This crash seems unreasonable without something wild going on. Maybe some sort of memory corruption? The code is here:

if pt := reflect.TypeOf(rule.predicate); pt != predicateTyp {

This is init time code processing some narrowly defined structures. There's not a ton of leeway here. The only instantiations of that struct are in the below file and given the rest of the error, we know that the structs must have been one of these.
Also, do *fmt.pp objects ever make their way out of the fmt package (see https://go.dev/src/fmt/print.go#L104)?

predicate: columnInSecondaryIndex,
},
{
dirPredicate: sameDirection,
thatStatus: scpb.Status_DELETE_AND_WRITE_ONLY,
predicate: columnInPrimaryIndex,
},
},
scpb.Status_PUBLIC: {
{
dirPredicate: bothDirectionsEqual(scpb.Target_ADD),
thatStatus: scpb.Status_PUBLIC,
predicate: columnInSecondaryIndex,
},
{
dirPredicate: bothDirectionsEqual(scpb.Target_ADD),
thatStatus: scpb.Status_PUBLIC,
predicate: columnInPrimaryIndex,

@knz
Copy link
Contributor

knz commented Jul 6, 2022

They shouldnt. See the latest comment in #83706, I now suspect this particular go runtime / OS combination is borked in a way independent of crdb.

@ajwerner
Copy link
Contributor

Given the evidence of runtime corruption, I'm closing this.

@davepacheco
Copy link
Author

If the root cause is memory corruption, isn't this still a bug?

@ajwerner
Copy link
Contributor

ajwerner commented Jul 12, 2022

If the memory corruption is in the go runtime, is it a cockroach bug? We generally write code assuming the go runtime works according to the language specification, I think in this case, it doesn't.

@davepacheco
Copy link
Author

I didn't think we had any evidence that the corruption was caused by the Go runtime. (Obviously if anything in the process is corrupting memory, then the corruption can appear in any subsystem -- and indeed we've seen corruption in CockroachDB as well as a few different parts of the Go runtime.) But if it's preferable, I can track this issue outside this project and reopen it if/when I have more data.

@ajwerner
Copy link
Contributor

I interpreted #83706 (comment) as an indication of widespread memory corruption. I can't make a claim as to whether this corruption is in go or in some other code, perhaps some linked cgo code. Either way, I do not thing there's value in maintaining this issue separately. Any root cause here doesn't have anything to do with the code implicated by this issue.

I hope we all can get to the bottom of it.

@davepacheco
Copy link
Author

Just to close the loop on this: see oxidecomputer/omicron#1146 for gory details, but at this point I believe the corruption in that issue was due to this illumos OS issue. The behavior there is that memory that is supposed to be zero'd by the Go runtime may not be properly zero'd. It looks to me like that could explain this bug, so that seems the likely culprit.

@knz
Copy link
Contributor

knz commented Dec 21, 2022

thank you.

@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-disaster-recovery T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

No branches or pull requests

3 participants