Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes failed cluster causes sql driver panic #41

Open
axllent opened this issue Apr 14, 2024 · 3 comments
Open

Sometimes failed cluster causes sql driver panic #41

axllent opened this issue Apr 14, 2024 · 3 comments

Comments

@axllent
Copy link

axllent commented Apr 14, 2024

Hi @otoolep! I've been doing some cluster testing in my local network using three computers running rqlited, and a single instance of Mailpit. While ingesting a constant stream of incoming emails via Mailpit (thus writing as fast as rqlite can handle) I have been randomly dropping out one of the three rqlite nodes to test fault tolerance / recovery etc. rqlited seems to handle this very elegantly, but every now and then I get a panic from the gorqlite sql driver panic: runtime error: index out of range [0] with length 0:

ERRO[2024/04/14 16:04:21] [db] error storing message: tried all peers unsuccessfully. here are the results:
   peer #0: http://192.168.7.2:4001/db/execute?timings&level=weak&transaction failed due to Post "http://192.168.7.2:4001/db/execute?timings&level=weak&transaction": dial tcp 192.168.7.2:4001: connect: connection refused
   peer #1: http://192.168.7.50:4001/db/execute?timings&level=weak&transaction failed, got: 503 Service Unavailable, message: leader not found

   peer #2: http://192.168.7.1:4001/db/execute?timings&level=weak&transaction failed, got: 503 Service Unavailable, message: leader not found
 
panic: runtime error: index out of range [0] with length 0

goroutine 376180 [running]:
github.com/rqlite/gorqlite.(*Connection).WriteOneParameterizedContext(...)
        /home/ralph/.cache/go/pkg/mod/github.com/rqlite/[email protected]/write.go:79
github.com/rqlite/gorqlite/stdlib.(*Stmt).ExecContext(0xc008ce1410, {0x108d1b8, 0x1745480}, {0xc000e79860, 0xa, 0x1083460?})
        /home/ralph/.cache/go/pkg/mod/github.com/rqlite/[email protected]/stdlib/sql.go:96 +0x3f8
database/sql.ctxDriverStmtExec({0x108d1b8, 0x1745480}, {0x108d730, 0xc008ce1410}, {0xc000e79860, 0xa, 0x108cb80?})
        /usr/lib/go-1.22/src/database/sql/ctxutil.go:65 +0xa3
database/sql.resultFromStatement({0x108d1b8, 0x1745480}, {0x108cb80, 0xc00011c518}, 0xc000cf5438, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:2670 +0x13a
database/sql.(*DB).execDC(0x0?, {0x108d1b8, 0x1745480}, 0xc0001cc750, 0x10?, {0xc0043c1ae0, 0x99}, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:1722 +0x42c
database/sql.(*Tx).ExecContext(0xc007d56a80, {0x108d1b8, 0x1745480}, {0xc0043c1ae0, 0x99}, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:2506 +0xad
database/sql.(*Tx).Exec(...)
        /usr/lib/go-1.22/src/database/sql/sql.go:2515
...

exit status 2

I've tried to find the cause of the panic and I can clearly see where it is happening, just not why it is happening. WriteOne(), WriteOneContext(), WriteOneParameterized() and WriteOneParameterizedContext() are all potentially affected by this too as they all return wra[0], err (eg: here). If wra[0] does not exist (ie: empty slice) you will get a panic. I just can't work out what is causing wra to be empty as your error handling appears to append the error as a result. The only thing I can think of is if the final:

 else {
	return results, nil
}

... if results has no results... Hopefully this means more to you than me?

Ensuring the wra[0] exists before trying to return it is obviously the safest solution (though maybe not so elegant), and I do not know what the consequences of that are "down the food chain" (you're way more familiar with the code than I am).

@otoolep
Copy link
Member

otoolep commented Apr 14, 2024

What I would guess Is happening is that this client library is not handling a leader election properly. When you remove a node randomly, and that node is the Leader, a new leader will be elected. That takes only a second or so, but doing a write during that time may result in the node responding to the kind of response you see above ("Leader not found").

What do you think the client library should do? Retry the write is probably the right thing to do, since it's a transient issue.

@axllent
Copy link
Author

axllent commented Apr 15, 2024

It sounds like you may be on to it - that brief time while nodes are electing a new leader. It seems to be there are two separate (but related) things here, how to deal with queries during the brief time during a leader election, and the fact that a populated return slice (wra[0]) is seemingly not guaranteed under such (and possibly other?) conditions.

I'm probably not qualified to answer the first - but a retry would be the most elegant (as long as it does not result in an endless loop of retries). As for the panic though, that should never happen as it literally kills the application through no fault of its own.

@otoolep
Copy link
Member

otoolep commented Apr 15, 2024

OK, let me see if I can fix up the library to handle this situation better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants