Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cheroot==8.1.0 regression] Spin on Linux when socket filedescriptor is too big for select #249

Closed
kmduk opened this issue Nov 22, 2019 · 20 comments · Fixed by #301
Closed
Labels
bug Something is broken help wanted Somebody help us, please! Linux regression Something that worked earlier got broken in new release reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR

Comments

@kmduk
Copy link

kmduk commented Nov 22, 2019

Once the filedescriptor for a socket used by cherrypy has a value greater than 1024 a thread goes in to a loop just writing blocks like the following to stderr:

[22/Nov/2019:11:55:07] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1770, in serve
    self.tick()
  File "/usr/lib/python3.6/dist-packages/cheroot/server.py", line 1993, in tick
    conn = self.connections.get_conn(self.socket)
  File "/usr/lib/python3.6/dist-packages/cheroot/connections.py", line 142, in get_conn
    rlist, _, _ = select.select(list(socket_dict), [], [], 0.1)
ValueError: filedescriptor out of range in select()

This does not happen with cheroot versions prior to 8.1.0

@webknjaz
Copy link
Member

Hi, do you have a reproducer? Also, please fill out the issue template.

@webknjaz webknjaz added regression Something that worked earlier got broken in new release reproducer: missing This PR or issue lacks code, which reproduce the problem described or clearly understandable STR labels Nov 22, 2019
@webknjaz
Copy link
Member

@the-allanc I assume this may be related to your change ^

@webknjaz
Copy link
Member

@the-allanc ?

@timurgen
Copy link

timurgen commented Jan 17, 2020

This error happens after some days of work in Docker container (Linux), reload CherryPy works but would be better to fix this issue. Looks like CherryPy runs out of file descriptors in some way, don't close after a request is completed probably?

@webknjaz
Copy link
Member

webknjaz commented Feb 3, 2020

@timurgen starting v8.1.0 Cheroot has been refactored by @the-allanc to use per-HTTP-request worker threads rather than per-connection as it was before.

@webknjaz webknjaz added the help wanted Somebody help us, please! label Feb 3, 2020
@webknjaz webknjaz changed the title Spin on Linux when socket filedescriptor is too big for select [cheroot==8.1.0 regression] Spin on Linux when socket filedescriptor is too big for select Feb 3, 2020
@webknjaz webknjaz mentioned this issue Feb 3, 2020
15 tasks
@jakeczyz
Copy link

jakeczyz commented Mar 9, 2020

We've started seeing this fairly often too.
E.g.:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cheroot/server.py", line 1770, in serve
    self.tick()
  File "/usr/lib/python3.6/site-packages/cheroot/server.py", line 1993, in tick
    conn = self.connections.get_conn(self.socket)
  File "/usr/lib/python3.6/site-packages/cheroot/connections.py", line 142, in get_conn
    rlist, _, _ = select.select(list(socket_dict), [], [], 0.1)
ValueError: filedescriptor out of range in select()

This is on:
cherrypy 18.5.0
cheroot 8.2.1
python 3.6.9
os: alpine 3.9.4

@webknjaz
Copy link
Member

@webknjaz
Copy link
Member

@the-allanc ^

@liamstask
Copy link
Contributor

just browsing through issues as i'm evaluating wsgi options, so not a ton of context, but it would appear that if you go through with #252, then reimplementing the select()-based implementation of #199 in terms of https://docs.python.org/3/library/selectors.html should be a clear win.

@webknjaz
Copy link
Member

@liamstask so the problem is that I don't want to have the last py2-compatible release w/ this regression.

Looks like there's a backport: https://pypi.org/project/selectors2/.

@jaraco
Copy link
Member

jaraco commented Apr 4, 2020

Looks like CherryPy runs out of file descriptors in some way, don't close after a request is completed probably?

That seems like the likely cause to me. The solution should be to find a suitable design that doesn't leak file descriptors. Just switching to selectors to support higher-numbered file selectors where they weren't required before seems like papering over the root cause.

Perhaps related, when I was working on #263, at 5d4f1ca I was creating a fixture, but when I attempted to shut down the server (e36d1de), I see intermittent errors when running the test that invokes the fixture tox -- -k test_wsgi. The test passes, but during teardown, I see an error:

Traceback (most recent call last):
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/server.py", line 1776, in serve
    self.tick()
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/server.py", line 1999, in tick
    conn = self.connections.get_conn(self.socket)
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/connections.py", line 142, in get_conn
    rlist, _, _ = select.select(list(socket_dict), [], [], 0.1)
ValueError: file descriptor cannot be a negative integer (-1)

Then hundreds of these error messages:

Error in HTTPServer.tick
Traceback (most recent call last):
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/server.py", line 1776, in serve
    self.tick()
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/server.py", line 1999, in tick
    conn = self.connections.get_conn(self.socket)
  File "/Users/jaraco/code/public/cherrypy/cheroot/cheroot/connections.py", line 139, in get_conn
    ss_fileno = server_socket.fileno()
AttributeError: 'NoneType' object has no attribute 'fileno'

The reason I suspect it's related as -1 as a file descriptor could be another symptom of a race condition on incrementing/decrementing descriptors.

@jaraco
Copy link
Member

jaraco commented Apr 4, 2020

I see intermittent errors when running the test.

After reading the Server docs, I see the fixture wasn't setting things up properly and one is supposed to invoke prepare and serve instead. After ba857bd, that issue is addressed, so is probably unrelated.

@webknjaz
Copy link
Member

webknjaz commented Apr 4, 2020

but when I attempted to shut down the server (e36d1de), I see intermittent errors when running the test that invokes the fixture tox -- -k test_wsgi.

FTR I also saw something similar after reproducing #263 and hitting Ctrl+C. It seems like KeyboardInterrupt needs to be caught too.

@webknjaz
Copy link
Member

webknjaz commented Apr 4, 2020

The solution should be to find a suitable design that doesn't leak file descriptors.

Yes, but we also need a higher descriptors limit. I think that the problem is not the number of open descriptors but the fact that at some point their ids become bigger than 1024 just because the OS increments them.

@hardikmodha
Copy link

I'm also seeing this issue while upgrading to Cheroot v8.3.0. Following is the list of open file descriptors for the process. I've highlighted the socket file descriptors which are > 1024. I'll try to create a minimal example where the issue gets reproduced and will update it here.

Details
lr-x------ 1 root root 64 May 15 09:10 0 -> 'pipe:[4067373]'
l-wx------ 1 root root 64 May 15 09:10 1 -> 'pipe:[4067374]'
l-wx------ 1 root root 64 May 15 09:10 10 -> 'pipe:[4089444]'
lrwx------ 1 root root 64 May 15 09:10 1001 -> 'socket:[4542090]'
lrwx------ 1 root root 64 May 15 09:10 1008 -> 'socket:[4542068]'
lrwx------ 1 root root 64 May 15 09:10 1014 -> 'socket:[4542007]'
lrwx------ 1 root root 64 May 15 09:10 1016 -> 'socket:[4542014]'
lrwx------ 1 root root 64 May 15 09:10 1022 -> 'socket:[4542033]'
lrwx------ 1 root root 64 May 15 09:10 103 -> 'socket:[4409130]'
lrwx------ 1 root root 64 May 15 09:10 1030 -> 'socket:[4542060]'
lrwx------ 1 root root 64 May 15 09:10 1032 -> 'socket:[4542067]'
lrwx------ 1 root root 64 May 15 09:10 1037 -> 'socket:[4542086]'
lrwx------ 1 root root 64 May 15 09:10 1039 -> 'socket:[4542093]'
lrwx------ 1 root root 64 May 15 09:10 107 -> 'socket:[4410904]'
lrwx------ 1 root root 64 May 15 09:10 11 -> 'socket:[4410953]'
lrwx------ 1 root root 64 May 15 09:10 12 -> 'socket:[4398701]'
lrwx------ 1 root root 64 May 15 09:10 13 -> 'socket:[4394408]'
lrwx------ 1 root root 64 May 15 09:10 166 -> 'socket:[4415354]'
lrwx------ 1 root root 64 May 15 09:10 17 -> 'socket:[4413727]'
lrwx------ 1 root root 64 May 15 09:10 177 -> 'socket:[4421826]'
lrwx------ 1 root root 64 May 15 09:10 178 -> 'socket:[4420791]'
lr-x------ 1 root root 64 May 15 09:10 19 -> /dev/urandom
lrwx------ 1 root root 64 May 15 09:10 192 -> 'socket:[4415348]'
lrwx------ 1 root root 64 May 15 09:10 194 -> 'socket:[4415355]'
lrwx------ 1 root root 64 May 15 09:10 195 -> 'socket:[4415356]'
lrwx------ 1 root root 64 May 15 09:10 197 -> 'socket:[4409073]'
l-wx------ 1 root root 64 May 15 09:10 2 -> 'pipe:[4067375]'
lrwx------ 1 root root 64 May 15 09:10 202 -> 'socket:[4409102]'
lrwx------ 1 root root 64 May 15 09:10 204 -> 'socket:[4415378]'
lrwx------ 1 root root 64 May 15 09:10 205 -> 'socket:[4410866]'
lrwx------ 1 root root 64 May 15 09:10 210 -> 'socket:[4415388]'
lrwx------ 1 root root 64 May 15 09:10 214 -> 'socket:[4409090]'
lrwx------ 1 root root 64 May 15 09:10 217 -> 'socket:[4519931]'
lrwx------ 1 root root 64 May 15 09:10 220 -> 'socket:[4409106]'
lrwx------ 1 root root 64 May 15 09:10 231 -> 'socket:[4409084]'
lrwx------ 1 root root 64 May 15 09:10 243 -> 'socket:[4409120]'
lrwx------ 1 root root 64 May 15 09:10 246 -> 'socket:[4409122]'
lrwx------ 1 root root 64 May 15 09:10 248 -> 'socket:[4409126]'
lrwx------ 1 root root 64 May 15 09:10 259 -> 'socket:[4390834]'
lrwx------ 1 root root 64 May 15 09:10 264 -> 'socket:[4398611]'
lrwx------ 1 root root 64 May 15 09:10 265 -> 'socket:[4398612]'
lrwx------ 1 root root 64 May 15 09:10 269 -> 'socket:[4398618]'
lrwx------ 1 root root 64 May 15 09:10 273 -> 'socket:[4390866]'
lrwx------ 1 root root 64 May 15 09:10 275 -> 'socket:[4390867]'
lrwx------ 1 root root 64 May 15 09:10 299 -> 'socket:[4528213]'
lrwx------ 1 root root 64 May 15 09:10 3 -> 'socket:[4071269]'
lrwx------ 1 root root 64 May 15 09:10 345 -> 'socket:[4530350]'
l-wx------ 1 root root 64 May 15 09:10 4 -> /var/log/dummy/dummy.log
lrwx------ 1 root root 64 May 15 09:10 458 -> 'socket:[4530368]'
lrwx------ 1 root root 64 May 15 09:10 475 -> 'socket:[4532420]'
lrwx------ 1 root root 64 May 15 09:10 58 -> 'socket:[4394381]'
lrwx------ 1 root root 64 May 15 09:10 6 -> 'socket:[4071936]'
lrwx------ 1 root root 64 May 15 09:10 60 -> 'socket:[4418027]'
lrwx------ 1 root root 64 May 15 09:10 62 -> 'socket:[4398640]'
lrwx------ 1 root root 64 May 15 09:10 627 -> 'socket:[4530335]'
lrwx------ 1 root root 64 May 15 09:10 631 -> 'socket:[4522479]'
lrwx------ 1 root root 64 May 15 09:10 64 -> 'socket:[4420797]'
lrwx------ 1 root root 64 May 15 09:10 645 -> 'socket:[4528565]'
lrwx------ 1 root root 64 May 15 09:10 65 -> 'socket:[4422822]'
lrwx------ 1 root root 64 May 15 09:10 652 -> 'socket:[4527970]'
lrwx------ 1 root root 64 May 15 09:10 654 -> 'socket:[4529300]'
lrwx------ 1 root root 64 May 15 09:10 659 -> 'socket:[4532391]'
lrwx------ 1 root root 64 May 15 09:10 66 -> 'socket:[4398678]'
lrwx------ 1 root root 64 May 15 09:10 661 -> 'socket:[4529308]'
lrwx------ 1 root root 64 May 15 09:10 662 -> 'socket:[4527994]'
lrwx------ 1 root root 64 May 15 09:10 665 -> 'socket:[4529311]'
lrwx------ 1 root root 64 May 15 09:10 67 -> 'socket:[4398679]'
lrwx------ 1 root root 64 May 15 09:10 722 -> 'socket:[4542099]'
lrwx------ 1 root root 64 May 15 09:10 73 -> 'socket:[4413701]'
lrwx------ 1 root root 64 May 15 09:10 753 -> 'socket:[4542109]'
lrwx------ 1 root root 64 May 15 09:10 789 -> 'socket:[4516716]'
lrwx------ 1 root root 64 May 15 09:10 790 -> 'socket:[4519908]'
lrwx------ 1 root root 64 May 15 09:10 8 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 May 15 09:10 802 -> 'socket:[4516746]'
lrwx------ 1 root root 64 May 15 09:10 813 -> 'socket:[4519888]'
lrwx------ 1 root root 64 May 15 09:10 816 -> 'socket:[4516726]'
lrwx------ 1 root root 64 May 15 09:10 818 -> 'socket:[4519892]'
lrwx------ 1 root root 64 May 15 09:10 82 -> 'socket:[4415385]'
lrwx------ 1 root root 64 May 15 09:10 829 -> 'socket:[4527267]'
lrwx------ 1 root root 64 May 15 09:10 848 -> 'socket:[4541983]'
lrwx------ 1 root root 64 May 15 09:10 853 -> 'socket:[4528184]'
lrwx------ 1 root root 64 May 15 09:10 857 -> 'socket:[4525430]'
lrwx------ 1 root root 64 May 15 09:10 858 -> 'socket:[4527287]'
lrwx------ 1 root root 64 May 15 09:10 860 -> 'socket:[4528194]'
lrwx------ 1 root root 64 May 15 09:10 868 -> 'socket:[4542121]'
lrwx------ 1 root root 64 May 15 09:10 892 -> 'socket:[4542072]'
lrwx------ 1 root root 64 May 15 09:10 896 -> 'socket:[4542135]'
lr-x------ 1 root root 64 May 15 09:10 9 -> 'pipe:[4089444]'
lrwx------ 1 root root 64 May 15 09:10 920 -> 'socket:[4542133]'
lrwx------ 1 root root 64 May 15 09:10 956 -> 'socket:[4541985]'
lrwx------ 1 root root 64 May 15 09:10 957 -> 'socket:[4541988]'
lrwx------ 1 root root 64 May 15 09:10 973 -> 'socket:[4542140]'
lrwx------ 1 root root 64 May 15 09:10 980 -> 'socket:[4542142]'
lrwx------ 1 root root 64 May 15 09:10 985 -> 'socket:[4542156]'
lrwx------ 1 root root 64 May 15 09:10 990 -> 'socket:[4542009]'
lrwx------ 1 root root 64 May 15 09:10 991 -> 'socket:[4542035]'
lrwx------ 1 root root 64 May 15 09:10 997 -> 'socket:[4542037]'
lrwx------ 1 root root 64 May 15 09:10 999 -> 'socket:[4542036]' 

@webknjaz
Copy link
Member

@hardikmodha thanks, it'd be great if you could submit such reproducer in a form of the pytest test so that we could have it in our CI (marked as xfail) as per https://pganssle-talks.github.io/xfail-lightning/.

manthey added a commit to girder/large_image that referenced this issue Jun 8, 2020
There is an issue that was introduced in 8.1.0 that throws an rror when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
manthey added a commit to girder/large_image that referenced this issue Jun 15, 2020
There is an issue that was introduced in 8.1.0 that throws an error when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
manthey added a commit to girder/large_image that referenced this issue Jun 18, 2020
There is an issue that was introduced in 8.1.0 that throws an error when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
manthey added a commit to girder/large_image that referenced this issue Jul 9, 2020
There is an issue that was introduced in 8.1.0 that throws an error when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
manthey added a commit to girder/large_image that referenced this issue Jul 9, 2020
There is an issue that was introduced in 8.1.0 that throws an error when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
manthey added a commit to girder/large_image that referenced this issue Jul 9, 2020
There is an issue that was introduced in 8.1.0 that throws an error when
file descriptor numbers are large (see
cherrypy/cheroot#249).  We can't pin cheroot
to an older version, as that breaks support for modern python versions.

It isn't clear if limiting the size of our cache will really help -- the
cheroot issue implies that it is leaking file descriptors, so this may
not do anything useful.
@kavehtp
Copy link

kavehtp commented Jul 15, 2020

Is there a chance this issue may get fixed any time soon? Server crashes quite often because of this bug.

@webknjaz
Copy link
Member

webknjaz commented Jul 15, 2020

Hey @kavehtp, I'm mostly able to dedicate some free time on the weekends and this one is next on the list after the recent merge of #277. So it'll probably take a while to address.

But if anybody is open to helping out, please contribute PRs:

  1. We need a reproducer integrated into our pytest setup, per https://pganssle-talks.github.io/xfail-lightning.
  2. Then, we need the actual fix that will probably mean switching from low-level select() to using selectors+selectors2 that also implement different low-level mechanisms abstractions, like epoll() and others.

@tommilligan
Copy link
Contributor

@webknjaz I've submitted a reproduction of this bug as PR #301

You can see the error reproduced in the Travis logs here

@webknjaz webknjaz added bug Something is broken Linux reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR and removed reproducer: missing This PR or issue lacks code, which reproduce the problem described or clearly understandable STR labels Jul 20, 2020
@webknjaz
Copy link
Member

Looks like #301 fix is close to be mergeable, everyone is welcome to test it before I merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken help wanted Somebody help us, please! Linux regression Something that worked earlier got broken in new release reproducer: present This PR or issue contains code, which reproduce the problem described or clearly understandable STR
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants