std/process: Default to libc closefrom in spawnProcessPosix #9048

Merged
merged 1 commit into from
Oct 3, 2024


the-horo
Contributor

The current implementation of spawnProcessPosix is broken on systems with a large ulimit -n: it always runs out of memory, making it impossible to spawn processes. Using the libc implementation, when available, for doing file descriptor operations en masse solves this problem.

This PR requires dlang/dmd#16806 to be merged first.

@dlang-bot
Contributor

dlang-bot commented Aug 23, 2024

Thanks for your pull request and interest in making D better, @the-horo! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close Bugzilla Severity Description
24715 normal std/process: Default to libc `closefrom` in spawnProcessPosix

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#9048"

@the-horo
Contributor Author

An example of such a system is a docker container, which comes with a ulimit -n of 1073741816. This has come up as an issue in https://github.com/mesonbuild/meson, where docker containers could not be built or run locally because dub would crash them (and optionally the host system as well) whenever it was invoked: mesonbuild/meson#13555 (comment)
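To give a sense of scale, here is a rough sketch of the memory a descriptor-closing fallback would need if it allocated one pollfd entry per possible descriptor. That allocation strategy is an assumption made for illustration, not something stated in this thread, and pollfdBytes is a hypothetical helper name.

```d
import core.sys.posix.poll : pollfd;
import core.sys.posix.sys.resource : getrlimit, rlimit, RLIMIT_NOFILE;
import std.stdio : writefln;

/// Bytes needed for one pollfd entry per possible descriptor.
/// Hypothetical helper; assumes a buffer sized by the descriptor limit.
ulong pollfdBytes(ulong maxDescriptors)
{
    return maxDescriptors * pollfd.sizeof;
}

void main()
{
    // The limit reported for the docker container in the comment above.
    writefln("buffer for 1073741816 descriptors: %s bytes",
             pollfdBytes(1_073_741_816));

    // Soft limit of the current process, for comparison.
    rlimit r;
    if (getrlimit(RLIMIT_NOFILE, &r) == 0)
        writefln("RLIMIT_NOFILE soft limit here: %s", r.rlim_cur);
}
```

With an 8-byte pollfd this comes to roughly 8 GiB for the container's limit, which would be consistent with the out-of-memory behaviour described above.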

@the-horo force-pushed the closefrom-spawn-process branch from 2bfbb5d to 3a85f42 on August 23, 2024 04:45
@the-horo
Contributor Author

Retrigger CI and remove some braces that didn't conform to the style guide.

@thewilsonator
Contributor

Is there a bugzilla issue for this? If not, please create one, then retitle the commit message to say "Fix bugzilla #####: std/process: Default to libc closefrom in spawnProcessPosix"

@the-horo
Contributor Author

Is there a bugzilla issue for this? If not, please create one, then retitle the commit message to say "Fix bugzilla #####: std/process: Default to libc closefrom in spawnProcessPosix"

I didn't find any when I wrote the changes. I'll go and create one.

@the-horo
Contributor Author

CircleCI is failing because it uses ubuntu-20.04 as a base, which ships glibc-2.31, and closefrom was only added in glibc-2.34: https://sourceware.org/pipermail/libc-alpha/2021-August/129718.html. Is there any way to check the version of glibc, or is it acceptable to bump the CircleCI image to ubuntu-22.04?

@CyberShadow
Member

But wouldn't that mean that this PR would make D programs incompatible with Ubuntu versions older than 22.04?

@the-horo
Contributor Author

But wouldn't that mean that this PR would make D programs incompatible with Ubuntu versions older than 22.04?

Yes, it would break compiling on systems that don't have glibc >= 2.34, which was released in August 2021. The current code is, under the circumstances of a large ulimit, broken on any POSIX system. Either way something will be broken.

If I were to fix this I would introduce a HAVE_CALLFROM version and check for that in spawnProcessPosix. Then the compilers would ship with a config file that defines it (I don't know if gdc has a config file) and, in the worst case, users of older systems would need to remove that definition. It's pretty ugly but I'm struggling to find anything better. Perhaps anyone else has a better idea.

std/process.d Outdated
Comment on lines 1046 to 1047
version (CRuntime_Glibc)
    import core.sys.linux.unistd : closefrom;
Member

This mismatches the version gate of the linux.unistd module (it is possible to be glibc without being on linux).

Contributor Author

I thought about that when writing the code but I assumed that glibc is linux only, thanks for clarifying this.

Member

Hurd is one example of an actively developed system that uses glibc.

kFreeBSD and kOpenSolaris were two others - both thankfully confined to history as a failed experiment (for the latter when developers realized the complexity of the task :-)

The good news is that glibc's closefrom implementation is not in the linux-specific directory (it is io/closefrom.c), and the abilist for hurd includes the expected:

GLIBC_2.34 closefrom F

@ibuclaw
Member

ibuclaw commented Aug 23, 2024

To make this a druntime issue, it might be best for Phobos to do the following:

version (linux) import core.sys.linux.unistd;
...
static if (!__traits(compiles, closefrom))
{
    void closefrom (int lowfd)
    {
        ...
    }
}
...
closefrom(forkPipeOut + 1);

@CyberShadow
Member

@the-horo Can you check if #8990 fixes this issue? It is some work which, I think, goes in the same direction.

@CyberShadow
Member

CyberShadow commented Aug 23, 2024

Yes, it would break compiling on systems that don't have glibc >= 2.34, which was released in August 2021.

But won't this make it impossible to also run D programs on older systems, rather than just build them?

We would need to use dlsym to preserve compatibility with systems that don't have closefrom.

@the-horo
Contributor Author

@the-horo Can you check if #8990 fixes this issue? It is some work which, I think, goes in the same direction.

It looks like what I want but I think the closefrom libc solution can be better if I can get it to work properly. #8990 can also be applied on top of this to help the systems that don't have closefrom.

To make this a druntime issue, it might be best for Phobos to do the following:

version (linux) import core.sys.linux.unistd;
...
static if (!__traits(compiles, closefrom))
{
    void closefrom (int lowfd)
    {
        ...
    }
}
...
closefrom(forkPipeOut + 1);

Yes, I want to rewrite it like this but I would want the version (linux) check to become version (linux) version (GLIBC_HAS_CALLFROM) and only then use the glibc function. GLIBC_HAS_CALLFROM can then be set when building phobos. Since gdc uses autotools and ldc uses cmake it should be easy for them to do the equivalent of if function_exists('closefrom') then DFLAGS+=-version=GLIBC_HAS_CALLFROM, it would be more awkward to do for dmd but not impossible. Correct me if I'm saying something dumb but I think it can work.

Yes, it would break compiling on systems that don't have glibc >= 2.34, which was released in August 2021.

But won't this make it impossible to also run D programs on older systems, rather than just build them?

I don't understand what you are referring to. phobos isn't backward ABI compatible, so the changes here would affect only programs written in the future. If they can't be built then they won't run, obviously. Do you mean that one won't be able to compile a program on a recent glibc and run it on an older one, so that, for example, the dmd binary provided by dlang.org will no longer work?

We would need to use dlsym to preserve compatibility with systems that don't have closefrom.

I think we can solve all of this during the building of phobos, see above. From my understanding, because it is not a template, the function spawnProcessPosix should never be compiled when building user code, the code should always use the version that is built into phobos. So changing the implementation at phobos' build time shouldn't affect consumers of the library, they will all use what phobos was built with.

@ibuclaw
Member

ibuclaw commented Aug 23, 2024

Yes, I want to rewrite it like this but I would want the version (linux) check to become version (linux) version (GLIBC_HAS_CALLFROM) and only then use the glibc function. GLIBC_HAS_CALLFROM can then be set when building phobos. Since gdc uses autotools and ldc uses cmake it should be easy for them to do the equivalent of if function_exists('closefrom') then DFLAGS+=-version=GLIBC_HAS_CALLFROM, it would be more awkward to do for dmd but not impossible. Correct me if I'm saying something dumb but I think it can work.

That's why the __traits(compiles) condition. The XXX_HAS_CALLFROM can be a package enum value in druntime via one of the */config.d modules.

Phobos doesn't need to concern itself about what platform its running on, only what features are available at run-time.

@the-horo
Contributor Author

That's why the __traits(compiles) condition. The XXX_HAS_CALLFROM can be a package enum value in druntime via one of the */config.d modules.

Yes, there needs to be a way to check for the existence of the function. If you do it in druntime, how do you support multiple versions of glibc? Either they are all assumed to have closefrom or none have it. It's my understanding that the druntime bindings are just wrappers for what the corresponding C headers provide for the target system, which isn't known in advance, so unless they are able to importC the system header and expose those values to D, I don't see how one could support multiple libc versions.

The difference with phobos is that the build system knows for what target it is compiling, so it can inspect those values and make a decision.

Phobos doesn't need to concern itself about what platform its running on, only what features are available at run-time.

And that decision is made at build time, right?

@the-horo
Contributor Author

Do you suggest having druntime be like:

module core.sys.linux.unistd;

public import core.sys.linux.config;

static if (GLIBC_VER_NUM >= 234)
    void closefrom(int);

and

module core.sys.linux.config;

enum GLIBC_VER_NUM = @SED_ME_IN@;

And replace @SED_ME_IN@ when building druntime?

@CyberShadow
Member

CyberShadow commented Aug 23, 2024

I don't understand what you are referring to. phobos isn't backward ABI compatible, so the changes here would affect only programs written in the future.

closefrom is only available in some glibc versions, right? It wasn't there since the beginning. Hence the CI failure.

We want to allow people to build and run D programs even on computers which have older glibc versions. This means that those computers won't have closefrom.

So, if we wanted to keep supporting those situations, but still use closefrom if it's available, then we need to check if the function is available at runtime. We can do this with dlopen(LIBC_SO, ...) + dlsym(..., "closefrom").

If we try to simply use closefrom as it is declared in Druntime in dlang/dmd#16806, we will get

  1. link errors when trying to build D on machines with older glibc versions
  2. start-up errors when trying to run D programs on machines with older glibc versions.
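A minimal sketch of the runtime lookup described above, under some assumptions: libc is dynamically linked, dlopen(null, ...) is used instead of naming the libc shared object, and closeFromCompat, lookupClosefrom and the maxfd parameter are hypothetical names chosen for illustration. The fallback loop is deliberately naive and is not the approach taken in #8990.

```d
import core.sys.posix.dlfcn : dlopen, dlsym, RTLD_LAZY;
import core.sys.posix.fcntl : fcntl, F_GETFD;
import core.sys.posix.unistd : close, pipe;
import std.stdio : writeln;

// Signature of closefrom as documented by glibc. The cast below is unchecked.
alias CloseFromFn = extern (C) void function(int);

/// Returns libc's closefrom if the running process can resolve it, else null.
CloseFromFn lookupClosefrom()
{
    // A null path yields a handle for the main program and the libraries
    // already loaded with it (assumption: libc is among them).
    void* handle = dlopen(null, RTLD_LAZY);
    if (handle is null)
        return null;
    return cast(CloseFromFn) dlsym(handle, "closefrom");
}

/// Closes descriptors starting at lowfd. With closefrom, every descriptor
/// >= lowfd is closed; the naive fallback only covers [lowfd, maxfd).
void closeFromCompat(int lowfd, int maxfd)
{
    if (auto fn = lookupClosefrom())
    {
        fn(lowfd);
        return;
    }
    foreach (fd; lowfd .. maxfd)
        close(fd);
}

void main()
{
    int[2] p;
    if (pipe(p) != 0)
        return;
    immutable low = p[0] < p[1] ? p[0] : p[1];
    closeFromCompat(low, 1024);
    // Both pipe ends should now be invalid descriptors.
    writeln(fcntl(p[0], F_GETFD) == -1 && fcntl(p[1], F_GETFD) == -1);
}
```

A binary built this way carries no link-time reference to closefrom, so the missing-symbol failures in the list above would not occur for that function; whether dlsym itself introduces a newer symbol-version dependency is a separate question.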

@the-horo
Contributor Author

closefrom is only available in some glibc versions, right? It wasn't there since the beginning. Hence the CI failure.

We want to allow people to build and run D programs even on computers which have older glibc versions. This means that those computers won't have closefrom.

So, if we wanted to keep supporting those situations, but still use closefrom if it's available, then we need to check if the function is available at runtime. We can do this with dlopen(LIBC_SO, ...) + dlsym(..., "closefrom").

Therefore the bindings D uses to the libc headers provided by the host should match the definitions for that host. If the host provides closefrom then there's no issue with using it, and if the host doesn't then you shouldn't use it. I still don't understand where runtime comes into play.

If we try to simply use closefrom as it is declared in Druntime in dlang/dmd#16806, we will get

  1. link errors when trying to build D on machines with older glibc versions
  2. start-up errors when trying to run D programs on machines with older glibc versions.

I don't think you're supposed to be able to downgrade glibc which is exactly what happens when you take a binary compiled for a newer glibc and try to run it against an older one. If you want to support systems with older glibcs you can compile for the older one, in that case the application will work on newer systems as well because glibc does provide backwards compatibility. Am I misunderstanding something?

@CyberShadow
Member

Therefore the bindings D uses to the libc headers provided by the host should match the definitions for that host.

But that's not relevant, is it? D does not use the libc headers.

I don't think you're supposed to be able to downgrade glibc which is exactly what happens when you take a binary compiled for a newer glibc and try to run it against an older one.

This is only true if we're only talking about never distributing binaries.

I think many D users do rely on building a binary on their computer and then running it on another computer.

@the-horo
Contributor Author

But that's not relevant, is it? D does not use the libc headers.

What are the headers in core.sys.* for then? If they're meant to only represent the functions that are available in a libc implementation in an arbitrary (but consistent) range of versions of such libc then sure, those don't represent what is available for the target host. They only represent what is considered, by D's standards, portable enough. This would allow applications that only use such functions to be portable across libc versions so people can copy their programs safely across systems and have them work.

Note that just because that arbitrary range of versions is not written anywhere doesn't mean that it doesn't exist. Functions that are either soon to be removed or too recent shouldn't be part of those headers in that situation, and this should be documented somewhere.

Now, in phobos' case, it does need to perform system calls, so it would need the prototypes for those functions somewhere. It makes the most sense that those prototypes be under core.sys. If phobos needs to be portable to some degree, it would basically mean that every time a system call is introduced it must be checked that it exists on all versions of libc that phobos needs to support. Again, if this is desired, it should be clearly stated and enforced. This is why I suggested having the core.sys modules represent the portability basis so that one can code in phobos without worrying about the portability of the final binary.

The second approach for the headers is to have them represent the interface that is provided by libc for the target system. This is what, I think, @ibuclaw suggested. In this situation they should match what the C system headers define so we might as well call them that.

This is only true if we're only talking about never distributing binaries.

Distributing binaries can be done with any of the approaches above, you just need to make sure that you compile the programs properly. This does mean obvious things like not compiling for a different architecture or OS, but it also includes not using CPU extensions that not all target machines support and not making calls to external functions that may not exist.

It all comes down to what are the actual systems that you want to support. In phobos' case, if we only resort to what CI says, portable enough means being able to run on ubuntu 20.04. If portability is desired then, in the first situation in which core.sys headers represent the portability basis no work needs to be done in phobos, include everything that you need and rewrite anything that isn't available.

In the second case, in which the core.sys headers are properly versioned, phobos would then declare that it wants to see the interface provided by, say, glibc-2.30 and use only those functions in order to be portable. The only requirement is that functions that are removed in later versions aren't used, so there is a little bit more work when deciding if something can be used.

The fundamental difference in the second case is that the core.sys files can then be used in any D program and they will represent what the host environment offers and not some phobos subset of it. I think you should be able to also do this with importC but then, what would be the point of the core.sys headers?

I think many D users do rely on building a binary on their computer and then running it on another computer.

If people do want to do this then they should do it willingly, by writing code that is portable. This implies that libraries that they link statically (like phobos by default) should also be portable. If we want to support this, then having a minimum version of libc that we target is more helpful than "if CI is green it's portable".

If we take this approach we should also consider the users who would prefer that the phobos code uses the efficient functions provided by their system rather than doing its own inefficient thing.

At least I understand now what you mean by doing the check at runtime. libphobos can be compiled for an older glibc and check if the one that it loaded into memory happens to provide the function that it needs. In this situation I would highlight that if the code is bad enough that it would rather be replaced at runtime, maybe it's worth considering bringing the minimum supported version higher. In this particular case I agree with you, depending on a version of glibc from 3 years ago is a little bit much.

All in all, I think it would be nice if:

  1. core.sys became versioned
  2. phobos was explicit about which libc functions are allowed and which aren't
  3. phobos' build system could be configured to either target a "portable" libc or the current one.

What I really want to see changed is having phobos be in this limbo of portability, where there's no documentation about what's allowed and what isn't and the enforcement is a single CI run.

@CyberShadow
Member

What are the headers in core.sys.* for then? If they're meant to only represent the functions that are available in a libc implementation in an arbitrary (but consistent) range of versions of such libc then sure, those don't represent what is available for the target host. They only represent what is considered, by D's standards, portable enough. This would allow applications that only use such functions to be portable across libc versions so people can copy their programs safely across systems and have them work.

Note that just because that arbitrary range of versions is not written anywhere doesn't mean that it doesn't exist. Functions that are either soon to be removed or too recent shouldn't be part of those headers in that situation, and this should be documented somewhere.

Now, in phobos' case, it does need to perform system calls, so it would need the prototypes for those functions somewhere. It makes the most sense that those prototypes be under core.sys. If phobos needs to be portable to some degree, it would basically mean that every time a system call is introduced it must be checked that it exists on all versions of libc that phobos needs to support. Again, if this is desired, it should be clearly stated and enforced. This is why I suggested having the core.sys modules represent the portability basis so that one can code in phobos without worrying about the portability of the final binary.

Agreed, and this makes sense to me. I can't speak authoritatively but I think contributions which implement the above would be welcome.

The second approach for the headers is to have them represent the interface that is provided by libc for the target system. This is what, I think, @ibuclaw suggested. In this situation they should match what the C system headers define so we might as well call them that.

There must have been a misunderstanding because I have no idea how this would be implemented in practice. We distribute Phobos+Druntime as precompiled libraries. The decision whether to use a certain function or not in this way would need to be done at compile time, so it clearly cannot be done in that way.

Distributing binaries can be done with any of the approaches above, you just need to make sure that you compile the programs properly. This does mean obvious things like not compiling for a different architecture or OS, but it also includes not using CPU extensions that not all target machines support and not making calls to external functions that may not exist.

True in theory, but in practice, compilers already generally produce fairly portable binaries at their default settings, which even run well across multiple Linux distributions. Yes, this is not codified in any way right now (except said CI), but I don't think that's an excuse to drastically worsen the situation.

It all comes down to what are the actual systems that you want to support.

I don't know if it reflects the current situation or not, but to the best of my knowledge, the DFL's position is that D targets all OS versions which are supported by their vendor. So, for Windows, that would be Windows 10 or newer. Looking at what Canonical supports, I guess that would be Ubuntu 14.04 and newer.

@eli-schwartz

This is only true if we're only talking about never distributing binaries.

I think many D users do rely on building a binary on their computer and then running it on another computer.

True in theory, but in practice, compilers already generally produce fairly portable binaries at their default settings, which even run well across multiple Linux distributions. Yes, this is not codified in any way right now (except said CI), but I don't think that's an excuse to drastically worsen the situation.

In theory you "can" do this, in practice this is incredibly dangerous.

This is NOT about compilers, at all, period, end of story. This is about glibc, a library which provides a library API and a library ABI. And glibc also has a strict backwards compatibility story. You can compile programs using an old glibc and it shall continue working with new glibc. The reverse is not true.

glibc's ABI is versioned. Every time they add a new symbol to the ABI, they version it by the release it was added in. So the minimum version of glibc you need to run a binary you have built, is the newest ABI your binary ended up including when it was compiled. This may be older than the glibc you compiled with, but it depends on glibc private implementation details. They do sometimes improve the codebase, providing newer versions of an older symbol that are an ABI break and therefore require providing two copies of the symbol: one is symname@GLIBC_2.2.5 or whatever, and the other is symname@GLIBC_2.41.

Your suggestion is to avoid binding to the ABI at compile time for closefrom, because there is no closefrom older than closefrom@GLIBC_2.34. Using dlsym is a workaround for this, but not for other symbols.

e.g. dlsym@GLIBC_2.34 overrides dlsym@GLIBC_2.2.5, so I dearly hope you're not depending on that.

@schveiguy
Member

Wow, closefrom is nice. That's exactly what we need for this very long standing problem! There have been several PRs to try and fix this, the most recent being #8990 with accompanying bug report https://issues.dlang.org/show_bug.cgi?id=24524

So in response to the discussion about using directly closefrom or using dlsym to look it up, I prefer the dlsym because it's very robust, and does not require synchronized compile-time versioning (a chronic problem). If you have the symbol, it's there, and if not, you do some fallback.

@eli-schwartz

And also as I already pointed out above:

If you use dlsym then it adds a dependency on newer glibc, which you already said you don't want to do.

@CyberShadow
Member

CyberShadow commented Aug 25, 2024

... Ubuntu 24.04 ships with glibc 2.39, and glibc 2.39 contains a dlsym() that, just like closefrom, is a @GLIBC_2.34 symbol and will not run on old versions of Ubuntu.

I see, so we are in that situation already, as we use it in std.net.curl.

I guess that means that it is already not possible to use an older glibc than what a D binary is compiled against.

So, using dlsym in std.process will not affect the status quo.

Is this accurate?

which you rejected as invalid

It was meant as a joke, apologies if it didn't go across well.

@the-horo
Contributor Author

Apologies, this discussion is a little overwhelming with respect to how much time I can dedicate to it. I'm aware of many things that have been written above and have done a poor job communicating this, so we're going in circles somewhat. It sounds like you're trying to convince someone of something, but I'm not the correct person who needs convincing. In any case, discussing a mechanism that allows Druntime to make some declarations available depending on the glibc version it was built against seems a little out of scope for this PR.

I agree, the whole discussion is about what druntime should do so I will open an issue against it. Thank you for taking the time to discuss this, even if it seems like we haven't gotten anywhere from the start.

You can leave this PR open and wait for the resolution of the issue, or you can close it now and I'll reopen it if druntime changes in a way that allows phobos code to do what I want.

I guess that means that it is already not possible to use an older glibc than what a D binary is compiled against.

Nothing that uses glibc declarations and links to it can do this, regardless of programming language.

@eli-schwartz

... Ubuntu 24.04 ships with glibc 2.39, and glibc 2.39 contains a dlsym() that, just like closefrom, is a @GLIBC_2.34 symbol and will not run on old versions of Ubuntu.

I see, so we are in that situation already, as we use it in std.net.curl.

I guess that means that it is already not possible to use an older glibc than what a D binary is compiled against.

So, using dlsym in std.process will not affect the status quo.

Is this accurate?

In a sense, I guess? Using dlsym doesn't make the glibc versioned API status quo worse because currently they are at the same version. Who knows what the future may bring -- if dlsym is upgraded again, it will make the status quo worse, if closefrom is upgraded again then the reverse will occur.

But there are other reasons to strictly avoid dlsym! It means the compiler cannot offer a safety contract for your use of the ABI. If it changes in the future you won't know what to call and how, whereas by using the symbol normally, you can compile against the expected implementation. You can typecheck how you call it. It's also faster to avoid the indirection overhead.

I'm afraid I don't really see a good reason to use a thoroughly substandard and unsafe technology such as dlsym when no real justification for doing so has been provided.

What is the downside of using a regular function call if it is available?

@the-horo
Contributor Author

What is the downside of using a regular function call if it is available?

I can think of only two:

  1. we currently don't have a check to know if it's available and a check for that is best put in druntime so we can't go further here without changing how druntime handles its declarations. We could do it here but if we need a discussion followed by a decision on how (and if) we perform the check just place it in druntime and have everyone benefit instead of leaving it internally for phobos.
  2. it would fix the problem only for phobos compiled on newer systems, which the D provided archives are not. For this there is Fix bugzilla 24524: Very slow process fork if RLIMIT_NOFILE is too high #8990, and the dlsym trick would find the definition when compiled for an older glibc and run on a recent system. I also agree with you that calling a function through a void* can not ever be considered safe and, in my opinion, spawnProcessPosix could not be marked @safe or @trusted because you can not verify that the call will never do something nasty, but I'm not the language designer.

@CyberShadow
Member

I also agree with you that calling a function through a void* can not ever be considered safe

To be fair, it's not like there is a mechanism that checks that the declaration in Druntime matches the declaration in the C headers... so, it's not that different from casting from a void*.

spawnProcessPosix could not be marked @safe or @trusted because you can not verify that the call will never do something

A function which only takes an int argument but loads and calls closefrom via dlopen/dlsym would still have a memory-safe interface, so it can be @trusted.

@the-horo
Contributor Author

To be fair, it's not like there is a mechanism that checks that the declaration in Druntime matches the declaration in the C headers... so, it's not that different from casting from a void*.

This is not correct. When you specify extern(C) void closefrom(int); in your program and then reference that symbol your program will need to link to that symbol inside glibc. The linker's job is to map that declaration to a function in glibc, in our case it will find closefrom@GLIBC_2.34. Then, at runtime, when in your code you called closefrom you will actually be calling closefrom@GLIBC_2.34. Because of the backwards compatibility guarantees of glibc you will always be able to find that symbol on any version of glibc higher or equal to 2.34 and it will have the exact same signature as the one you saw during the build.

The check is not automatic, you have to look inside the headers of glibc, inspect the function that you need and correctly translate that into a D declaration, but it is possible to do.

This is not the case when you use dlsym. If you use the closefrom function you get from dlsym you have absolutely no way of knowing that its semantics are the same as you saw in the glibc headers you built against. We like to think that things will keep working because they haven't broken yet but there are very clear limitations on what you can and can't do and you are not able to verify the type of a function from its void*.

Maybe the closefrom function will keep its semantics even 20 years from now, but imagine any other example. Imagine that a glibc function took a struct as an argument. How can you verify that the layout/size of the struct you defined in your program is the same as the one expected by the function you found through dlsym? You can't control how many iterations the struct inside glibc went through because you can't know which version of glibc dlopen will find.
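The unchecked cast can be sketched with a function whose signature is stable, such as abs. This is an illustration only: AbsFn and lookupAbs are hypothetical names, and it assumes libc is dynamically linked so that dlopen(null, ...) can resolve the symbol.

```d
import core.sys.posix.dlfcn : dlopen, dlsym, RTLD_LAZY;
import std.stdio : writeln;

// The type we believe abs has. Nothing verifies this belief.
alias AbsFn = extern (C) int function(int);

/// Looks up abs at runtime; returns null if the symbol cannot be resolved.
AbsFn lookupAbs()
{
    void* handle = dlopen(null, RTLD_LAZY);
    if (handle is null)
        return null;
    // dlsym returns a bare void*. The compiler would accept a cast to any
    // other function pointer type here just as readily.
    return cast(AbsFn) dlsym(handle, "abs");
}

void main()
{
    if (auto fn = lookupAbs())
        writeln(fn(-5));
}
```

The call works only because the alias happens to match the real function; the same lookup with a mismatched alias would compile identically.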

A function which only takes an int argument but loads and calls closefrom via dlopen/dlsym would still have a memory-safe interface, so it can be @trusted.

This is, again, an assumption we make because we expect code to be properly written. You can not infer the safety of a function just by looking at its signature. Take:

/** Get the number of a cpu related structure

choice can be one of:
- PHYS_CPUS => the function returns the number of physical cpus in the system
- PHYS_CORES => the function returns the number of physical cores
- LOG_THRDS => the function returns the number of logical threads
*/
int get_cpus(int choice);

You can not have a @trusted declaration for get_cpus because the behavior is undefined if you don't call the function with the correct argument. Assuming that a function is safe if it takes an int and unsafe if it takes a pointer is just an assumption; if you want to say that a call to a function is trusted, you have to check its documentation for its supported usage. In the case of no documentation, you have to go and read the code and deduce it based on that.

On top of this, as I explained above, you can't even know the signature of a function you get from dlsym, so it's pointless to even say that the function you get takes in an int.
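As an illustration of the get_cpus argument, the following C sketch shows one hypothetical way such a function could be implemented so that an int parameter still leads to undefined behavior. The body of get_cpus, the table values and the get_cpus_checked wrapper are all invented for this example.

```c
#include <stddef.h>

enum { PHYS_CPUS = 0, PHYS_CORES = 1, LOG_THRDS = 2 };

/* Made-up numbers standing in for whatever a real library would report. */
static const int cpu_counts[3] = { 1, 8, 16 };

/* Hypothetical implementation: any other value of choice indexes
 * outside the table, which is undefined behavior. */
static int get_cpus(int choice)
{
    return cpu_counts[choice];
}

/* A wrapper whose behavior is defined for every possible int. Only an
 * interface like this one could reasonably be treated as trusted. */
static int get_cpus_checked(int choice, int *result)
{
    if (choice < PHYS_CPUS || choice > LOG_THRDS || result == NULL)
        return -1;
    *result = get_cpus(choice);
    return 0;
}
```

The signature int get_cpus(int) is identical in the safe and the unsafe reading. Only the documented set of valid arguments tells them apart.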

@CyberShadow
Member

CyberShadow commented Aug 25, 2024

This is not correct.

Here is what I mean:

extern(C) int closefrom(const char* str);

void main() { closefrom("Please close file descriptor 5 and higher"); }

This compiles and links with no error. There is no mechanism that would verify that the function signature is correct, much like how a cast from void* requires using the correct function pointer type.

Edit: I understand now that what you're trying to say is that the correct type to cast to from void* is unknowable.

You can not infer the safety of a function just by looking at its signature.

That doesn't sound right, and I really don't understand what you're even trying to say. Do we need a sticker on every function that says "this function is bug free and doesn't do anything crazy" before we can put @trusted on it? Maybe there are people who think so, but that's not what we've been doing, and I don't see any practical reason to stop.

Assuming that a function is safe if it takes an int and unsafe if it takes a pointer is just an assumption

I'm not sure if you're just exaggerating, but in case you're not, that's not how memory safety in D works. We generally define safety at the function level as being with respect to the arguments. So, if the function accepted an int* p and did p[0] = 1, then it's safe, but if it did p[1] = 1 then it would not be. What functions do in their internal structures that are not visible from D is of no concern (though, if they modified the D global state in an unsafe way, that would be different).

@CyberShadow
Member

CyberShadow commented Aug 25, 2024

This is not the case when you use dlsym.

I see what you mean, and let's assume that we are indeed worried about the hypothetical situation that a future glibc version might change the ABI of closefrom.

Can we call dlsym with e.g. "closefrom@GLIBC_2.34" so that we get a specific function with a signature that we know how to use?

@eli-schwartz

There is no mechanism that would verify that the function signature is correct, much like how a cast from void* requires using the correct function pointer type.

As a general thing, this is doable for non-C languages by computing the language-specific function signature from the C header. If I understand correctly, that's what @the-horo suggested above. It would necessitate a build step, probably -- or shipping with versioned libc interfaces and using a version test to see which version of the shipped interface to use.

@the-horo
Contributor Author

Here is what I mean:

extern(C) int closefrom(const char* str);

void main() { closefrom("Please close file descriptor 5 and higher"); }

This compiles and links with no error. There is no mechanism that would verify that the function signature is correct, much like how a cast from void* requires using the correct function pointer type.

If we're going to argue the semantics of @safe let's check what the standard says. @trusted and @safe functions have safe interfaces. This means that they "can not exhibit undefined behaviour".

@safe functions have these restrictions enforced by the compiler while with @trusted "it is the responsibility of the programmer to ensure that the interface of a trusted function is safe."

The standard doesn't say how you need to ensure that, only that it's your responsibility. I have said in my previous comment that you can verify that extern(C) int closefrom(const char *str) can be @trusted by checking inside the file unistd.h provided by your glibc installation and verifying that those declarations would match. They don't, so the declaration can only be @system.

To explicitly point out what is broken:

Calling the function extern(C) int closefrom(const char*) (in the case that it doesn't crash) produces erratic results: the return value is undefined, and passing a const char* may truncate that pointer to an integer and close all file descriptors higher than or equal to that integer value, or just be undefined behavior. The spec categorizes all of these as undefined behavior => closefrom is @system.

Because there is no way to call this function without observing undefined behavior then all functions that (indirectly) can call our closefrom are @system. That is what the spec states.

That doesn't sound right, and I really don't understand what you're even trying to say. Do we need a sticker on every function that says "this function is bug free and doesn't do anything crazy" before we can put @trusted on it? Maybe there are people who think so, but that's not what we've been doing, and I don't see any practical reason to stop.

If you are using @trusted, this is exactly what would be required of you, according to the spec. I am not arguing for this, however. My main and only concern is that, eventually, @trusted functions will break their contract, because it requires human verification and humans make mistakes. There is one big difference, in my opinion, based on what actually caused the breakage:

If our @trusted function made an application crash because it called dlsym and got a function that it incorrectly cast to the wrong type and called that then it is entirely our fault and we are responsible for breaking people's code.

If our @trusted function called the glibc implementation of closefrom, which then proceeded to crash because the code passed in the value 0xdeadbeef and there were 42 open file descriptors, which caused a weird interaction in the glibc code that led to a failure, then, even if we technically broke the contract, we did everything in our power to prevent this, and blame should be put on the bad implementation of closefrom in glibc.

All that I want is for the code not to introduce, willingly, more points of failure without any reason. Calling dlsym to get a symbol that can be provided at build time is explicitly ignoring the safe and simple approach for complexity. If the code is so bad that it's better to do runtime inspection of libraries to check for functions that do what we need because our current implementation is so bad then just change the implementation or bump the minimum required versions of glibc.

As a general thing, this is doable for non-C languages by computing the language-specific function signature from the C header. If I understand correctly, that's what @the-horo suggested above. It would necessitate a build step, probably -- or shipping with versioned libc interfaces and using a version test to see which version of the shipped interface to use.

There are two approaches to this I suggested. First we can have the D code inspect the C headers, just like they would be seen by a C program. We can use it in two cases:

  1. Replace the current modules with a .c file that includes the corresponding header:
// core/sys/linux/unistd.c
#include <unistd.h>
  2. Add versioning to the current headers and include a C header that gives the versions:
// core/sys/linux/unistd.d
import core.sys.linux.config;
static if (__GLIBC_PREREQ(2, 34))
    extern(C) void closefrom(int);
// core/sys/linux/config.c
#include <features.h>

The first approach is simpler but, looking at all the expected failures in https://github.com/dlang/dmd/blob/master/compiler/test/compilable/stdcheaders.c, if we do it we will probably start failing to compile on certain platforms with incompatible headers when the previous code worked, by some definition of the word. This will mostly affect gdc and ldc2 because they do target more exotic setups.

The second approach is more portable only because it imports less so it has less of a chance to fail. Maybe the D header can also be considered more readable compared to looking in unistd.h.

Both of these have the advantage that, if they work, the C bindings we get from it match exactly the ones in the glibc headers and, by extension, in the glibc library we link against.

The second approach I discussed is embedding the version of glibc we build druntime against inside it and have the declarations match that version of glibc. It would look like:

// core/sys/linux/unistd.d
import core.sys.linux.config;
static if (__GLIBC_PREREQ(2, 34))
extern(C) void closefrom(int);
// core/sys/linux/config.d
enum __GLIBC__ = 2;
enum __GLIBC_MINOR__ = 39;
bool __GLIBC_PREREQ(int, int) => ...;

The enums will be auto populated at build time so the bindings will be for the version of glibc that was present during the build. This has the advantage that, once generated, the code will continue to work. By doing it during the build we can also support more targets because we can use a lot more tools, compared to our C parser implementation, to extract those values from the headers. Once generated the code will also not fail on exotic platforms because we can control all the parameters.

The downside is that the headers can become out of sync if we compile for an older glibc and distribute on systems with newer ones. I want to highlight that this is exactly how the headers are currently implemented, so this change has only upsides compared to the current situation. To fix the out-of-sync headers, all that's needed is to modify the __GLIBC_MINOR__ definition.
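The version gate behind __GLIBC_PREREQ, and the out-of-sync case described above, can be sketched in C as follows. The helpers version_at_least and parse_version are hypothetical names introduced for illustration. On glibc, the build-time numbers would come from the __GLIBC__ and __GLIBC_MINOR__ macros, and a run-time version string from gnu_get_libc_version(); the sketch takes plain values so that it stays portable.

```c
#include <stdio.h>

/* Returns 1 if have_major.have_minor is at least want_major.want_minor.
 * This is the comparison that a __GLIBC_PREREQ-style gate performs. */
static int version_at_least(int have_major, int have_minor,
                            int want_major, int want_minor)
{
    return have_major > want_major ||
           (have_major == want_major && have_minor >= want_minor);
}

/* Parses a "2.39"-style version string. Returns 0 on success, -1 otherwise. */
static int parse_version(const char *text, int *major, int *minor)
{
    return sscanf(text, "%d.%d", major, minor) == 2 ? 0 : -1;
}
```

With bindings generated for 2.17 and a host running 2.39, version_at_least(2, 17, 2, 34) is 0 while version_at_least(2, 39, 2, 34) is 1. The declaration stays hidden even though the running libc provides the function, which is the out-of-sync case.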

I admit that I use the D compilers that are compiled for my system so I am biased towards this solution because it has zero downsides and almost all the upsides of the first two solutions. The D provided archives, however, don't share the same circumstances but, since changing the version is pretty easy we can have the install.sh script fill in those values after extracting the archive by inspecting the target environment.

I am so much against using dlsym because, to me, it seems like the kind of solution you shove into your code because you can't be bothered to fix a bad implementation but you also don't want to remove the bad implementation because that would break other parts of it. This way you try to get the number of users who are actually on older systems without a closefrom implementation below a threshold so that, hopefully, they won't report this issue.

It is unreasonable for me to think that phobos code should be dictated by what I want, so I will accept the dlsym solution if that's what's decided. I only want to add to it by first checking for a closefrom declaration at compile time and, if found, using it instead of dlsym. This way everyone gets what they ask for, but it would require getting the druntime headers situation sorted out, so don't consider this a blocker. The important thing is that there is a major bug in phobos' code and a PR to fix it was opened 4 months ago. Even having a subpar solution is better than having this code continue to be put into all D executables.

@CyberShadow
Member

I have said in my previous comment that you can verify that extern(C) int closefrom(const char *str) can be @trusted by checking inside the file unistd.h provided by your glibc installation and verifying that those declarations would match.

Apologies for the confusion but I made no claims about the safety of closefrom(const char *str) - the two were separate topics.

  • we currently don't have a check to know if it's available and a check for that is best put in druntime so we can't go further here without changing how druntime handles its declarations. We could do it here but if we need a discussion followed by a decision on how (and if) we perform the check just place it in druntime and have everyone benefit instead of leaving it internally for phobos.

  • it would fix the problem only for phobos compiled on newer systems, which the D provided archives are not. For this there is

Here is one more advantage of using dlsym that I see: it would allow D users to build redistributable binaries which work on old systems but take advantage of new additions such as closefrom on new systems.

Can we call dlsym with e.g. "closefrom@GLIBC_2.34" so that we get a specific function with a signature that we know how to use?

Would this work?

@the-horo
Contributor Author

Apologies for the confusion but I made no claims about the safety of closefrom(const char *str) - the two were separate topics.

All I wanted to show is the process of verifying that a glibc function declaration can be marked @trusted. If you can or can not mark it @trusted is only the final conclusion, the important part is the steps you take.

Here is one more advantage of using dlsym that I see: it would allow D users to build redistributable binaries which work on old systems but take advantage of new additions such as closefrom on new systems.

Yes, if your use case is compiling for an older glibc and, possibly, running on a newer one then the only way to take advantage of the functions in the newer glibc is by using dlsym. I want to, however, differentiate about what the "advantage" actually is:

  1. Performance. In this case I find using dlsym similar to generating fat binaries that take advantage of CPU extensions in order to run slightly more performant code. I know that mir-ion does this, so this use case is not unreasonable. Of course, if we're interested in performance, it is best to perform benchmarks so we know that using dlsym is doing what we want it to do.

  2. Not using the default implementation because it sucks. If that's the case then the proper solution is fixing the implementation, using dlsym is just a bandage. There's nothing wrong with adding a bandage over the wound but we will still need a proper solution, which we do have in the works as Fix bugzilla 24524: Very slow process fork if RLIMIT_NOFILE is too high #8990. You said that bumping the glibc requirement is not acceptable so I won't consider that approach.

Using plain dlsym will come with the downside that you can't typecheck your function but if we're talking about using dlsym just as a temporary fix then I have nothing against it.

I'm starting to find the argument about people wanting portability weak. If they desire portability, then adding a dependency on dlsym and depending on more external code in addition to our own sounds counterintuitive. If you had to support multiple systems would you rather:

  1. Have the best implementation on newer systems and a terrible implementation on older systems
  2. Have a generally good implementation on all systems.

Currently we are in situation 3: having a terrible implementation on all systems, so, yes, using dlsym is an improvement. The consequence of this action is that it would make us head towards solution 1, which I believe is the incorrect approach. If we acknowledge that we use dlsym only temporarily because the current implementation is just so bad, and we are open to picking either of the two solutions, then, again, I have nothing against it.

I don't believe that phobos should cater itself specifically to people who want portability. The standard library should be a collection of general code that works well in general. If people have a specific use case that needs specific code then the standard library is not the place for it.

The 100% solution to this problem, employed by any software project that I have seen so far, would be to:

  1. Link against closefrom if we know it's available (we currently don't have a mechanism for it)
  2. If we couldn't link we should use a sane default implementation (PR Fix bugzilla 24524: Very slow process fork if RLIMIT_NOFILE is too high #8990 is not merged)

If, after this, we benchmark the default implementation code and compare it to using closefrom through dlsym and we determine that that advantage is significant enough and there are sufficient people who can benefit from it we can add the dlsym mechanism.
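For the "sane default implementation" half of that plan, one commonly used technique on Linux is to enumerate the descriptors that are actually open instead of looping up to RLIMIT_NOFILE. The sketch below is only an illustration of that idea. It is not the code from #8990, the name close_fds_from is hypothetical, it assumes /proc is mounted, and it is not async-signal-safe because opendir allocates memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical fallback: close every open descriptor >= lowfd by walking
 * /proc/self/fd, so the cost depends on how many descriptors are open
 * and not on the RLIMIT_NOFILE ceiling. Returns 0 on success, or -1 if
 * /proc/self/fd cannot be opened. */
int close_fds_from(int lowfd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int self = dirfd(dir); /* the directory stream's own descriptor */
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue;      /* skips "." and ".." */
        if (fd >= lowfd && fd != self)
            close((int)fd);
    }
    closedir(dir);         /* releases the directory descriptor as well */
    return 0;
}
```

A real fallback for the child of a fork would also need to avoid allocation and handle a missing /proc, which is part of why a libc-provided closefrom is attractive when it is available.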

Can we call dlsym with e.g. "closefrom@GLIBC_2.34" so that we get a specific function with a signature that we know how to use?

Would this work?

If you need a specific version of a symbol you need to use dlvsym. Will it work? I have no idea because I have never seen someone use it.

@the-horo the-horo force-pushed the closefrom-spawn-process branch from 43c7b05 to ab94533 Compare August 26, 2024 10:04
@the-horo
Contributor Author

I hate it when bug fixes get stalled because people can't decide on a solution and I don't want to be part of the problem.

All the downsides of using dlsym have been outlined during this conversation. If we can't get anything better merged in a timely fashion, just allow using dlsym. We can come back to the code and repair it properly but, right now, the important thing is being able to run a D program on a POSIX system without the fear that it will crash said system if RLIMIT_NOFILE is high enough.

@CyberShadow
Member

I am not sure why #8990 is stalled. @schveiguy What do you think, should we go ahead and merge that one?

If we can't get anything better merged in a timely fashion just allow using dlsym.

No one is forbidding it, right?

I'll happily approve a patch which uses dlsym or dlvsym in a version (Cruntime_Glibc) { ... } block.

Member

@CyberShadow CyberShadow left a comment

Oh, I see you've already done just that!

Question: what do you think about doing the dlopen/dlsym calls in the parent process, so that we can reuse the result across forks? I think it might also make it easier in the future to use one of those lighter-weight fork variants that restrict what we can do between fork and exec.

static bool tryGlibcClosefrom (int lowfd) {
import core.sys.posix.dlfcn;

void *handle = dlopen("libc.so.6", RTLD_LAZY);
Member

  1. Why libc.so.6 and not libc.so?
  2. What do you think about using RTLD_NOLOAD?

Contributor Author

  1. Why libc.so.6 and not libc.so?

If dlopen("libc.so") ever finds something different than dlopen("libc.so.6") then we definitely don't want to continue trying to load symbols. It is very unlikely that this will ever happen, but better safe than sorry.

  2. What do you think about using RTLD_NOLOAD?

Looking only at the man page of dlopen I can't tell what those values actually do so I went with the value they showed in the example.

Member

@CyberShadow CyberShadow Aug 26, 2024

My thinking is that in the unlikely case that the current binary is actually using a libc that's not libc.so.6 then we probably don't want to actually load a second libc. I understand that RTLD_NOLOAD should achieve that.

Contributor Author

RTLD_NOLOAD makes it so that closefrom is no longer found so I'll keep RTLD_LAZY.

  1. This debate could be solved by avoiding dlopen/dlsym
  2. I find it vanishingly unlikely in the event Cruntime_Glibc is true but the soversion of glibc has advanced past "6", that you'll also have libc.so.6 available
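The two suggestions in this review thread (resolving the symbol once in the parent, and not loading a second libc) can be sketched together in C. This is an illustration only and not the code that was merged; the names resolve_closefrom, close_from_cached and the two static variables are hypothetical. According to the dlopen man page, one of RTLD_LAZY or RTLD_NOW must be included in the flags and RTLD_NOLOAD may be ORed in, so the sketch passes both. Whether that explains the lookup failure reported above is not known.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Assumed signature; dlsym cannot verify it. */
typedef void (*closefrom_fn)(int);

/* Resolved at most once. A child created by fork() inherits both values. */
static closefrom_fn cached_closefrom = NULL;
static int closefrom_resolved = 0;

/* Hypothetical helper: look closefrom up in the libc that is already
 * mapped into the process, without loading another copy. */
static closefrom_fn resolve_closefrom(void)
{
    if (closefrom_resolved)
        return cached_closefrom;
    closefrom_resolved = 1;

    void *handle = dlopen("libc.so.6", RTLD_LAZY | RTLD_NOLOAD);
    if (handle == NULL)
        return NULL;       /* libc.so.6 is not what this process uses */

    /* The handle is deliberately kept open for the life of the process. */
    *(void **)(&cached_closefrom) = dlsym(handle, "closefrom");
    return cached_closefrom;
}

/* Hypothetical helper: returns 0 if closefrom was called, or -1 if it is
 * unavailable and the caller has to fall back to another strategy. */
static int close_from_cached(int lowfd)
{
    closefrom_fn fn = resolve_closefrom();
    if (fn == NULL)
        return -1;
    fn(lowfd);
    return 0;
}
```

Calling resolve_closefrom() before fork() would leave only a plain function-pointer call between fork and exec, with no dynamic-loader work in the child.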

@the-horo the-horo force-pushed the closefrom-spawn-process branch from ab94533 to 0f51821 Compare August 26, 2024 11:38
Member

@CyberShadow CyberShadow left a comment

LGTM!

…nProcessPosix

The current implementation of spawnProcessPosix is broken on systems
with a large `ulimit -n` because it always OOMs making it impossible
to spawn processes. Using the libc implementation, when available, for
doing file descriptor operations en-mass partially solves this problem.

Signed-off-by: Andrei Horodniceanu <[email protected]>
@the-horo the-horo force-pushed the closefrom-spawn-process branch from 0f51821 to 48d581a Compare August 26, 2024 12:01
@ibuclaw
Member

ibuclaw commented Aug 26, 2024

The second approach for the headers is to have them represent the interface that is provided by libc for the target system. This is what I think @ibuclaw suggested. In this situation they should match what the C system headers define so we might as well call them that.

There must have been a misunderstanding because I have no idea how this would be implemented in practice. We distribute Phobos+Druntime as precompiled libraries. The decision whether to use a certain function or not in this way would need to be done at compile time, so it clearly cannot be done in that way.

FAOD.

  • Yes, druntime is precompiled - though not applicable to dmd, other D compilers can have a configure-time where they can check if closefrom is available for the target before druntime is built (i.e: here's one for qsort_r)
  • From there, the determination of whether closefrom is available can be done at compile time with __traits(compiles) in Phobos. No hard assumptions about OS's needed.

@schveiguy
Member

I think we should merge both this and #8990, as the fallback. That way, we always pick the best available option for high rlimit.

closefrom seems like the ideal solution here, as it doesn't require any tricks or complex code.

IMO, the dlsym feels more robust, and if we don't use it, I'm sure we'll find out right away whether it breaks things. But I'm OK merging it without that if everyone feels it's not going to cause problems. I trust @ibuclaw has the goods on how exactly it could break or not.

@the-horo
Contributor Author

  • Yes, druntime is precompiled - though not applicable to dmd, other D compilers can have a configure-time where they can check if closefrom is available for the target before druntime is built (i.e: here's one for qsort_r)

There's nothing stopping dmd from doing the same, but it will be a little bit more complicated as it would have to be done during the build. I suggested above a solution that embeds the glibc version inside core.sys.linux.config, where the core.sys.linux.unistd header would static if on the glibc version and decide whether to expose a function declaration. I did it like this because I thought it would be easier to extend the header in the future compared to doing feature tests on individual functions. Do you think the individual functions approach is better?

@ibuclaw
Member

ibuclaw commented Aug 27, 2024

Do you think the individual functions approach is better?

I think so, as it doesn't matter whether you're on AIX or Windows, if the function exists, it's there to be taken advantage of.

@the-horo
Contributor Author

Well, the function may exist but have a different signature. Even with closefrom, OpenBSD has its implementation return an int while FreeBSD's returns void. You still need to check the environment somewhat to make sure that there's no risk of providing a bad signature, as that can lead to very hard to diagnose bugs.

@the-horo
Contributor Author

I've made the bug report about carrying out the check in druntime: https://issues.dlang.org/show_bug.cgi?id=24725. @ibuclaw feel free to write any thoughts there.

@schveiguy
Member

Is there any more discussion to be had here? I think we should merge this.

@dkorpel dkorpel merged commit eab6595 into dlang:master Oct 3, 2024
10 checks passed
8 participants