
Allow the global allocator to use thread-local storage and std::thread::current() #144465

Merged
bors merged 12 commits into rust-lang:main from orlp:system-alloc-tls
Nov 29, 2025

Conversation

@orlp
Contributor

orlp commented Jul 25, 2025

Fixes #115209.

Currently the thread-local storage implementation uses the Global allocator if it needs to allocate memory in some places. This effectively means the global allocator cannot use thread-local variables. This is a shame, as an allocator is precisely one of the places where you'd really want to use thread-locals. We also see that this led to hacks such as #116402, where we detect re-entrance and abort.

So I've made the places where I could find allocation happening in the TLS implementation use the System allocator instead. I also applied this change to the storage allocated for a Thread handle so that it may be used care-free in the global allocator as well, for e.g. registering it to a central place or parking primitives.

r? @joboet
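For readers who want to see the use case in code, here is a minimal sketch of a global allocator that keeps per-thread state in a thread-local. The names CountingAlloc, ALLOCS_ON_THIS_THREAD and allocs_on_this_thread are illustrative assumptions and do not come from this PR or its tests.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Const-initialized and without a destructor.
    static ALLOCS_ON_THIS_THREAD: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical allocator: forwards to `System` and counts calls per thread.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // `try_with` never panics; the result is ignored on purpose.
        let _ = ALLOCS_ON_THIS_THREAD.try_with(|n| n.set(n.get().wrapping_add(1)));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of `alloc` calls observed on the calling thread so far.
fn allocs_on_this_thread() -> usize {
    ALLOCS_ON_THIS_THREAD.with(|n| n.get())
}
```

Calling allocs_on_this_thread() before and after creating a Box should show the counter going up on the current thread only.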

@rustbot added the S-waiting-on-review, T-compiler, and T-libs labels Jul 25, 2025
@orlp changed the title from "Use System allocator for thread-local storage" to "Use System allocator for thread-local storage and std::thread::Thread" Jul 25, 2025
@Mark-Simulacrum added the I-libs-api-nominated label Jul 26, 2025
@Mark-Simulacrum
Member

I think this merits some libs-api discussion (unless that's already happened? I didn't quickly find it). Landing this IMO implies at least an implicit guarantee (even if not necessarily stable without actual docs) that we don't introduce global allocator usage in thread local code. I think we should have some discussion and either commit to that or not. And we should discuss where we draw the line (e.g., is other runtime-like code in std supposed to do this? For example, environment variables or other bits that need allocation early, potentially even before main starts -- is that OK to call into the allocator for?)

OTOH, if we can make a weaker guarantee here while still serving the purpose, that may also be good to look at? For example, can we guarantee that a pattern like dhat's for ignoring re-entrant alloc/dealloc calls is safe? i.e., that we don't need allocations for non-Drop thread locals? Or if we can only do that for a limited set, perhaps std could define a public thread local for allocators to use?

@orlp
Contributor Author

orlp commented Jul 26, 2025

@Mark-Simulacrum

> I think this merits some libs-api discussion (unless that's already happened? I didn't quickly find it). Landing this IMO implies at least an implicit guarantee (even if not necessarily stable without actual docs) that we don't introduce global allocator usage in thread local code.

Yes, this should probably be properly discussed.

> And we should discuss where we draw the line (e.g., is other runtime-like code in std supposed to do this? For example, environment variables

You make a good point here about environment variables. It is very common to allow configuration and debugging of allocators through them. I'd like to at least guarantee that std::env::var_os does not result in allocation through the global allocator, I will look into that to see how feasible that is.

> or other bits that need allocation early, potentially even before main starts -- is that OK to call into the allocator for?)

I think it's good to brainstorm a bit about what a global allocator would need, although I don't think an allocator needs all that much from std. Environment variables, thread locals, and getting a thread handle come to mind, and I guess spawning a thread is also rather important for creating background cleanup. I will also investigate that.

> OTOH, if we can make a weaker guarantee here while still serving the purpose, that may also be good to look at?

I don't think there's much difference in constraint for the stdlib between guaranteeing this for all thread_local instances as opposed to only non-Drop thread locals, especially on platforms where every single thread-local already does an allocation regardless.

> For example, can we guarantee that a pattern like dhat's for ignoring re-entrant alloc/dealloc calls is safe?

Not without putting undue burden on the allocator. dhat doesn't actually change how allocation works, so regardless of re-entrance it calls System.alloc(layout) or System.dealloc(ptr, layout). If it was actually changing allocation method depending on re-entrance it would need to keep track of which pointers were allocated normally, and which were allocated re-entrant, and call the appropriate dealloc function for each. This makes all deallocation significantly slower regardless of re-entrance, in addition to adding extra complexity.
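To make the trade-off concrete, here is a minimal sketch of the re-entrance-ignoring pattern under discussion. It assumes a hypothetical profiling allocator named Profiling with a byte counter TRACKED_BYTES; it is not dhat's actual code. The guard only decides whether bookkeeping runs, while the memory always comes from System, which is why dealloc needs no per-pointer tracking.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

thread_local! {
    // True while this thread is inside the allocator's bookkeeping.
    static IN_BOOKKEEPING: Cell<bool> = const { Cell::new(false) };
}

// Hypothetical statistic kept by the profiler.
static TRACKED_BYTES: AtomicUsize = AtomicUsize::new(0);

struct Profiling;

impl Profiling {
    /// Runs `f` unless this thread is already inside bookkeeping.
    fn with_guard(f: impl FnOnce()) {
        let _ = IN_BOOKKEEPING.try_with(|flag| {
            if !flag.replace(true) {
                f();
                flag.set(false);
            }
        });
    }
}

unsafe impl GlobalAlloc for Profiling {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        Profiling::with_guard(|| {
            // Real bookkeeping might allocate here; the guard makes such a
            // nested call skip bookkeeping instead of recursing.
            TRACKED_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        });
        // Same allocation path whether or not the call was re-entrant.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        Profiling::with_guard(|| {
            TRACKED_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
        });
        unsafe { System.dealloc(ptr, layout) }
    }
}
```

An allocator that instead served re-entrant calls from a different backing store would have to remember, for every pointer, which store it came from, which is the extra cost described above.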

@orlp
Contributor Author

orlp commented Jul 26, 2025

I did some investigation into some of the problems of making std::thread::spawn global-allocator-free:

Details
  • You can't give the thread a name because the interface for setting a name inherently takes a String. Perhaps an extra method could be added static_name which takes a &'static str, and the internal field changed to a Cow.

  • The RUST_MIN_STACK environment variable gets parsed to an integer if the stack size is not explicitly specified, so var_os would need to be global-allocator-free. I initially wanted that regardless, but it is impossible with the current API, so any call would need to explicitly specify the stack size.

  • no_spawn_hooks almost surely has to be set.

  • The Arc containing the Packet needs to be made to use System.

  • The Box containing the main function needs to be made to use System.

The above seems feasible to me, although I also had another thought: I don't know if std::thread::spawn needs to be global-allocator-free. The only use-case I can think of for spawning a thread in the global allocator is for spawning cleanup threads, and I think it's perfectly fine if this happens after first init:

if !alloc_is_init() {
    init_alloc();
    // Now the allocator works and may be re-entrant.
    spawn_background_threads();
}

So I'm perfectly happy if we don't make std::thread::spawn global-allocator-free.
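The pseudocode above can be expanded into a compilable sketch. Everything here is an assumption made for illustration: the names LazyInit, INITIALIZED, SPAWN_CALLS and spawn_background_threads are hypothetical, the allocator has no real setup work, and it is deliberately not registered with #[global_allocator].

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static INITIALIZED: AtomicBool = AtomicBool::new(false);
// Counts how often background threads were started (demonstration only).
static SPAWN_CALLS: AtomicUsize = AtomicUsize::new(0);

fn spawn_background_threads() {
    SPAWN_CALLS.fetch_add(1, Ordering::Relaxed);
    // Dropping the JoinHandle detaches the thread; a spawn error is ignored.
    let _ = thread::Builder::new().spawn(|| {
        // Periodic cleanup work would go here.
    });
}

/// Hypothetical allocator that starts its helpers on first use.
struct LazyInit;

unsafe impl GlobalAlloc for LazyInit {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if !INITIALIZED.swap(true, Ordering::AcqRel) {
            // The flag is already set here, so any allocation made while
            // spawning skips this branch and goes straight to `System`.
            spawn_background_threads();
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}
```

A real allocator would need to finish its own setup before publishing the flag to other threads; a single boolean is enough only because this sketch has nothing to set up.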


While investigating the above I also found an additional blocker for making std::thread::current() global-allocator-free that I hadn't realized before: ThreadId::new on platforms which don't have 64-bit atomics uses a Mutex, so either Mutex would have to be global-allocator-free (likely infeasible) or this implementation would have to change. Which led me to ask:

It's possible to do this with a fairly simple spinlock, but are there platforms we support that have threads and Mutex but not an atomic we can use for a spinlock?

To answer my own question: std requires the existence of AtomicBool.
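As an illustration of the idea, here is a minimal sketch of a 64-bit ID counter protected by a spinlock built on AtomicBool only. It is not the code from this PR; SpinCounter, next_id and IDS are hypothetical names.

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

/// A 64-bit counter that needs only `AtomicBool`, not 64-bit atomics.
struct SpinCounter {
    locked: AtomicBool,
    last: UnsafeCell<u64>,
}

// SAFETY: `last` is only accessed while `locked` is held.
unsafe impl Sync for SpinCounter {}

impl SpinCounter {
    const fn new() -> Self {
        SpinCounter {
            locked: AtomicBool::new(false),
            last: UnsafeCell::new(0),
        }
    }

    /// Returns a fresh, non-zero ID without allocating.
    fn next_id(&self) -> u64 {
        // Acquire the lock by flipping `false` to `true`.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spin_loop();
        }
        // SAFETY: the lock is held, so no other thread touches `last`.
        let id = unsafe {
            let last = &mut *self.last.get();
            *last = last.wrapping_add(1);
            *last
        };
        // Release the lock before doing anything that could panic.
        self.locked.store(false, Ordering::Release);
        // Zero means the counter wrapped around.
        assert!(id != 0, "ID space exhausted");
        id
    }
}

static IDS: SpinCounter = SpinCounter::new();
```

The critical section is a single increment, which is what makes a spinlock tolerable here.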

@orlp
Contributor Author

orlp commented Jul 26, 2025

I've added a commit to use a spinlock for ThreadId if 64-bit atomics are unavailable, and with that I believe std::thread::current() should be safe to call in a global allocator.

I also investigated environment variables and... it's hopeless without a new API. The current API returns owned Strings or OsStrings both of which allocate with the global allocator. Currently if the global allocator wants to read environment variables it'll have to use libc::getenv (and thus bypass Rust's lock). I don't think that's the end of the world.
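A minimal sketch of that workaround follows. It assumes a C runtime that provides getenv and declares the function by hand instead of depending on the libc crate; env_usize is a hypothetical helper. As noted above, this bypasses the lock std takes around environment access, so it can race with a concurrent std::env::set_var.

```rust
use std::ffi::{c_char, CStr};

unsafe extern "C" {
    // C standard library function, declared by hand so that the sketch
    // needs no external crate.
    fn getenv(name: *const c_char) -> *mut c_char;
}

/// Reads `name` and parses it as a decimal `usize` without heap allocation.
fn env_usize(name: &CStr) -> Option<usize> {
    // SAFETY: `name` is NUL-terminated. The call is only sound if no other
    // thread modifies the environment at the same time.
    let value = unsafe { getenv(name.as_ptr()) };
    if value.is_null() {
        return None;
    }
    // SAFETY: a non-null `getenv` result points to a NUL-terminated string.
    let bytes = unsafe { CStr::from_ptr(value) }.to_bytes();
    // Borrowing and parsing do not allocate.
    std::str::from_utf8(bytes).ok()?.trim().parse().ok()
}
```

Taking a &CStr keeps the caller responsible for NUL termination, so the helper never has to build a C string itself.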

@orlp changed the title from "Use System allocator for thread-local storage and std::thread::Thread" to "Allow the global allocator to use thread-local storage and std::thread::current()" Jul 26, 2025
@Amanieu removed the T-compiler label Jul 29, 2025
@joshtriplett added the T-libs-api label and removed the T-libs label Jul 29, 2025
@joshtriplett
Member

We talked about this in today's @rust-lang/libs-api meeting.

We were supportive of doing this, and we agreed that using TLS in an allocator seems desirable.

We also decided there's no point in doing this if we don't make it a guarantee of support. Thus, labeling this libs-api.

@rfcbot merge

Also, before merging this we'd like to see the documentation for TLS updated to make this guarantee.

@rfcbot

rfcbot commented Jul 29, 2025

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot added the proposed-final-comment-period and disposition-merge labels Jul 29, 2025
@Amanieu removed the I-libs-api-nominated label Jul 29, 2025
@joshtriplett
Member

Separately, it would also be useful for the global allocator documentation to provide an explicit safelist of other things you're allowed to use.

@rfcbot added the final-comment-period label and removed the proposed-final-comment-period label Jul 29, 2025
@rfcbot

rfcbot commented Jul 29, 2025

🔔 This is now entering its final comment period, as per the review above. 🔔

@orlp
Contributor Author

orlp commented Jul 29, 2025

@joshtriplett

> Also, before merging this we'd like to see the documentation for TLS updated to make this guarantee.

Done.

> Separately, it would also be useful for the global allocator documentation to provide an explicit safelist of other things you're allowed to use.

Where? On the std::alloc page where the #[global_allocator] attribute is described, or on GlobalAlloc?

@RalfJung
Member

RalfJung commented Aug 1, 2025

So with this, for the first time, a program with a #[global_allocator] might still use the System allocator for some std operations the user has no control over? That seems really surprising and potentially breaking to me. The point of #[global_allocator] is to entirely replace which code std uses for allocations. We even have tests relying on that, using a custom #[global_allocator] to count how many allocations have been created. Those tests will now be less effective.

At the very least, this should be documented with that attribute.

@matthieu-m
Contributor

Much like Ralf, I am quite surprised at the "allocator fork" occurring here.

Now, I would like to note that there is precedent for this. One of the early pains I encountered in replacing the system allocator with a custom allocator was that the loader will use the system allocator for loading the code (and constants, etc...) regardless. Thus, to an extent, there are already allocations bypassing the global allocator.

Still, up until now, this was a well-defined set. Loaded sections of library/binary would be allocated with the system allocator, all further allocations would be allocated with the global allocator. The divide is clear. And one can (try to) write their own loader if they wish to change the status quo.

This change removes control from the hands of the (allocator) developer. It may be fine. It may be the beginning of a slippery slope.


For the point at hand, is there any reason that the thread-local destructors could not be registered in an intrusive linked list instead?

That is, each thread-local variable requiring a destructor would be accompanied by a thread-local:

struct Node {
    pointer: *mut u8,
    destructor: unsafe extern "C" fn(*mut u8),
    next: *mut Node,
}

And destruction would simply consist of popping the first destructor of the linked-list and executing it in a loop.

(This doesn't solve all problems identified in this PR, but it may solve the most salient one)
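The proposal can be sketched as follows, assuming each node lives in storage that stays valid until the thread's destructors have run. HEAD, register, run_destructors and increment_u32 are hypothetical names, and this is not how std registers destructors.

```rust
use std::cell::Cell;
use std::ptr;

struct Node {
    pointer: *mut u8,
    destructor: unsafe extern "C" fn(*mut u8),
    next: *mut Node,
}

thread_local! {
    // Head of this thread's intrusive destructor list; needs no allocation.
    static HEAD: Cell<*mut Node> = const { Cell::new(ptr::null_mut()) };
}

/// Pushes `node` onto the current thread's list.
///
/// Safety: `node` must stay valid and unmoved until `run_destructors`
/// has run on this thread.
unsafe fn register(node: *mut Node) {
    HEAD.with(|head| {
        unsafe { (*node).next = head.get() };
        head.set(node);
    });
}

/// Pops and runs destructors until the list is empty.
///
/// Safety: every registered node and its `pointer` must still be valid.
unsafe fn run_destructors() {
    HEAD.with(|head| loop {
        let node = head.get();
        if node.is_null() {
            break;
        }
        // Pop before running, so a destructor may register further nodes.
        let (pointer, destructor) = unsafe {
            head.set((*node).next);
            ((*node).pointer, (*node).destructor)
        };
        unsafe { destructor(pointer) };
    });
}

/// Example "destructor" that increments the `u32` behind `p`.
unsafe extern "C" fn increment_u32(p: *mut u8) {
    unsafe { *p.cast::<u32>() += 1 };
}
```

Because the list is intrusive, registering a destructor only writes two pointers and never calls an allocator.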

@orlp
Contributor Author

orlp commented Aug 1, 2025

> So with this, for the first time, a program with a #[global_allocator] might still use the System allocator for some std operations the user has no control over? That seems really surprising and potentially breaking to me.

Well, yes and no. I think technically yes, this is the first time specifically System is explicitly called from the Rust standard library.

However, System essentially maps to libc::malloc, which is still called all the time by other libc functions and other things we use internally in std. Is there a reason you specifically want to single out this usage of malloc from the others?
For example, if I currently set a breakpoint on malloc on macOS, I see that we call tlv_get_addr in our thread-local implementation, which calls malloc internally. Why was that okay, but would this be breaking?


Note that this doesn't affect no_std applications which absolutely rely on #[global_allocator] and for which there is no system allocator available. The thread_local! and std::thread implementation is firmly in std.

@orlp
Contributor Author

orlp commented Aug 1, 2025

> For the point at hand, is there any reason that the thread-local destructors could not be registered in an intrusive linked list instead?

On some platforms we can only install a key (that is, a pointer) into thread-local storage, which requires boxing of the thread-local variable. We must allocate on these platforms. See here:

We also require allocation for std::thread::current() as the thread handle may survive the thread it refers to, so it can't live on the thread's stack.

> This doesn't solve all problems identified in this PR, but it may solve the most salient one.

Well, considering there are platforms which hard require allocation for thread-locals, and every platform requiring allocation for std::thread::current(), I don't particularly agree.
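The claim that a thread handle may outlive its thread can be seen with stable APIs alone. In this small sketch, handle_of_finished_thread is a hypothetical helper name.

```rust
use std::thread::{self, Thread};

/// Spawns a named thread, lets it finish, and returns the handle that the
/// thread obtained for itself via `thread::current()`.
fn handle_of_finished_thread() -> Thread {
    thread::Builder::new()
        .name("short-lived".to_string())
        .spawn(|| thread::current())
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

The returned Thread can still report its name and ID after the join, so its storage cannot live on the stack of the thread it describes.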

@orlp
Contributor Author

orlp commented Aug 4, 2025

@RalfJung @matthieu-m Sorry, forgot to tag you earlier, awaiting your response.

@bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Nov 28, 2025
@bors
Collaborator

bors commented Nov 29, 2025

⌛ Testing commit 492fbc5 with merge 467250d...

@bors
Collaborator

bors commented Nov 29, 2025

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 467250d to main...

@bors added the merged-by-bors label Nov 29, 2025
@bors merged commit 467250d into rust-lang:main Nov 29, 2025
12 checks passed
@github-actions
Contributor

What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 1eb0657 (parent) -> 467250d (this PR)

Test differences

Show 192 test diffs

Stage 1

  • [ui] tests/ui/threads-sendsync/tls-in-global-alloc.rs: [missing] -> pass (J1)

Stage 2

  • [ui] tests/ui/threads-sendsync/tls-in-global-alloc.rs: [missing] -> pass (J0)

Additionally, 190 doctest diffs were found. These are ignored, as they are noisy.

Job group index

Test dashboard

Run

cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard 467250ddb25a93315a8dee02dd6cc1e398be46ff --output-dir test-dashboard

And then open test-dashboard/index.html in your browser to see an overview of all executed tests.

Job duration changes

  1. dist-aarch64-apple: 8173.3s -> 6219.0s (-23.9%)
  2. dist-x86_64-apple: 7322.6s -> 8441.8s (+15.3%)
  3. aarch64-apple: 9870.3s -> 8712.9s (-11.7%)
  4. dist-apple-various: 3497.2s -> 3838.1s (+9.7%)
  5. x86_64-gnu-llvm-21-2: 5846.6s -> 5303.8s (-9.3%)
  6. dist-aarch64-msvc: 5760.4s -> 5332.3s (-7.4%)
  7. x86_64-gnu-llvm-20: 2752.0s -> 2569.2s (-6.6%)
  8. aarch64-gnu-debug: 4116.0s -> 4381.6s (+6.5%)
  9. aarch64-msvc-1: 7179.3s -> 6749.6s (-6.0%)
  10. dist-x86_64-netbsd: 4461.8s -> 4720.3s (+5.8%)
How to interpret the job duration changes?

Job durations can vary a lot, based on the actual runner instance
that executed the job, system noise, invalidated caches, etc. The table above is provided
mostly for t-infra members, for simpler debugging of potential CI slow-downs.

@rust-timer
Collaborator

Finished benchmarking commit (467250d): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary -1.0%, secondary 4.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

                             mean     range             count
Regressions ❌ (primary)      2.8%     [2.8%, 2.8%]      1
Regressions ❌ (secondary)    4.3%     [4.3%, 4.3%]      1
Improvements ✅ (primary)     -4.7%    [-4.7%, -4.7%]    1
Improvements ✅ (secondary)   -        -                 0
All ❌✅ (primary)             -1.0%    [-4.7%, 2.8%]     2

Cycles

Results (secondary -2.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

                             mean     range             count
Regressions ❌ (primary)      -        -                 0
Regressions ❌ (secondary)    -        -                 0
Improvements ✅ (primary)     -        -                 0
Improvements ✅ (secondary)   -2.5%    [-2.8%, -2.2%]    2
All ❌✅ (primary)             -        -                 0

Binary size

Results (primary -0.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

                             mean     range             count
Regressions ❌ (primary)      -        -                 0
Regressions ❌ (secondary)    -        -                 0
Improvements ✅ (primary)     -0.5%    [-0.5%, -0.5%]    1
Improvements ✅ (secondary)   -        -                 0
All ❌✅ (primary)             -0.5%    [-0.5%, -0.5%]    1

Bootstrap: 472.886s -> 473.687s (0.17%)
Artifact size: 386.93 MiB -> 386.86 MiB (-0.02%)

@laurmaedje
Contributor

I realize that it's somewhat late (just saw this in the release notes), but I agree with Ralf that it's kind of surprising to switch stdlib functionality to using System over the global allocator where the docs previously stated that all allocations are routed through the global allocator.

While, on native systems, it might be the case that there were already intermittent malloc calls before, on WebAssembly this was not the case. In my multi-threaded WebAssembly application, I am overriding the global allocator and rely on the fact that all requests are routed through it to manage memory manually.[1] Concurrent requests to the System allocator cause corruption because the global allocator expects unique control over growing the Wasm linear memory. This setup is now broken because of the new uncontrollable System allocations coming from the thread local storage implementation.

Given that Rust now gives a guarantee that global allocators can use TLS without modification, I doubt there is much that can be done about this at this point. But I do wonder whether there wouldn't have been a way to support TLS in allocator impls without giving up on the existing global allocator guarantees. For example, a thread local API that takes an explicit allocator akin to Vec::new_in.

Footnotes

  1. Specifically, I was forced to write a custom allocator because memory.grow alongside multi-threaded WebAssembly is currently broken in Safari, so I'm pre-equipping the Wasm module with lots of memory and have an allocator that serves requests from this existing memory. Meanwhile, System always grows the memory, leading to an immediate crash. But the specifics of how cursed this all is are a bit beside the point, the point being that there are valid uses of the global allocator that expect total control over memory allocation.

@orlp
Contributor Author

orlp commented Jan 22, 2026

> the point being that there are valid uses of the global allocator that expect total control over memory allocation

I'm not sure I agree that "my platform's System is broken and I'm trying to work around it with a hack" is a valid use case (insofar as we should have designed with that in mind). Your workaround was still broken regardless, since anyone can call System directly, not just the standard library.

What is the precise platform that is broken? Perhaps we can patch System for that platform?

@laurmaedje
Contributor

laurmaedje commented Jan 23, 2026

> What is the precise platform that is broken? Perhaps we can patch System for that platform?

It's not that System is broken per se, but rather that memory.grow on a shared memory (i.e. wasm32-unknown-unknown with atomics) is broken in WebKit and I had to work around it. I'm not sure there's much that can be done on Rust's side here, unfortunately.

> I'm not sure I agree that "my platform's System is broken and I'm trying to work around it with a hack" is a valid use case

My point was not at all that "my platform's System is broken" is the primary use case. That's why I only put it into a footnote. The point was just that a previously stated guarantee ("route all default allocation requests to a custom object") that I could imagine other people depended on as well was softened. I know it's hard to make any kind of change without the potential for breakage and sometimes one needs to be practical. What concerns me here though is that the feature "override all std allocations", which previously existed, is now gone and we only have "override most std allocations" (that's at least the effect on Wasm, even if malloc was already used in some places before on native targets).

> Your workaround was still broken regardless, since anyone can call System directly, not just the standard library.

In practice, my workaround was not broken, as in, it is currently deployed in production and working. Patching a dependency to avoid System is also much simpler than to avoid any TLS in the whole dep tree. I haven't looked into it much yet, but I fear that for me patching std is the only way to avoid a crash now.

@orlp
Contributor Author

orlp commented Jan 23, 2026

> Given that Rust now gives a guarantee that global allocators can use TLS without modification, I doubt there is much that can be done about this at this point.

I can't speak for the libs team, but if we do a patch release reverting the guarantees that thread_local and Thread::current don't allocate with the global allocator, I don't think it's too late. But to do that I do think we need to have:

  • a viable alternative for #[global_allocator]-compatible thread-locals. I think with a substantial engineering effort LocalKey::try_with_in(System, || { ... }) could work, although I don't know if it could work without making all thread-locals slower to support type-erasure over the allocator used.

    Another alternative is thread_local_in! which would create some sort of LocalKeyIn (or adding an allocator argument to LocalKey, if that's possible) which would avoid this type-erasure.

    Another alternative is some thread_local_ptr! which may only hold a pointer argument and does not support destructors, I believe we should be able to support this on all platforms without allocation support. This would be the easiest path forward, I think. (Whoever writes the allocator can still get destructors by installing a normal thread-local registering the pointer once initialization is done and re-entrance can be allowed.) This last one may not actually need an API change if we guarantee this behavior for anything which fits in a pointer and does not have a destructor.

  • a viable alternative for Thread::current. We construct a Thread object when spawning a new thread on the old thread, so usually this is not a problem. However, if Rust gets called in a non-main thread we didn't spawn (e.g. from foreign code) we absolutely need to be able to synthesize a Thread object from thin air. I think the only way to solve this is by adding a Thread::current_in(System)-style of API and type-erasing the destructor inside a Thread.

However, if this means the stabilization of #[global_allocator]-viable thread-locals becomes dependent on the stabilization of the Allocator API I don't think this is acceptable, as the timeline is too long/uncertain. But again, I can't speak for the libs team.

All of the above is of course assuming the libs team actually considers it problematic we use System in the stdlib internals, seeing @laurmaedje's plight.
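The pointer-only alternative can be approximated today with a const-initialized Cell holding a raw pointer, which has no destructor. The sketch below makes several assumptions: ThreadState, STATE, thread_state and note_allocation are hypothetical names, the state is allocated with System on first use, and it is deliberately leaked at thread exit because nothing registers a destructor.

```rust
use std::alloc::{handle_alloc_error, GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::ptr;

/// Hypothetical per-thread allocator state.
struct ThreadState {
    allocations: usize,
}

thread_local! {
    // Pointer-sized, const-initialized, and without a destructor.
    static STATE: Cell<*mut ThreadState> = const { Cell::new(ptr::null_mut()) };
}

/// Returns this thread's state, creating it with `System` on first use.
fn thread_state() -> *mut ThreadState {
    STATE.with(|slot| {
        let mut state = slot.get();
        if state.is_null() {
            let layout = Layout::new::<ThreadState>();
            // SAFETY: `ThreadState` has a non-zero size.
            state = unsafe { System.alloc(layout) }.cast::<ThreadState>();
            if state.is_null() {
                handle_alloc_error(layout);
            }
            // SAFETY: `state` is non-null and valid for a `ThreadState`.
            unsafe { state.write(ThreadState { allocations: 0 }) };
            slot.set(state);
        }
        state
    })
}

/// Bumps and returns the calling thread's allocation count.
fn note_allocation() -> usize {
    let state = thread_state();
    // SAFETY: the pointer is only ever used by the thread that owns it.
    unsafe {
        (*state).allocations += 1;
        (*state).allocations
    }
}
```

As the comment above suggests, an allocator that wants cleanup could later install a normal thread-local whose destructor frees this pointer, once re-entrance is allowed.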

@ChrisDenton
Member

I've renominated this for libs-api to discuss possibly reverting this change. Also cc @Amanieu if you have any thoughts.

@Mark-Simulacrum
Member

One option might be to add something like this to GlobalAlloc. That way allocators that can deal with re-entrancy from std just fine (e.g., ones using C only, or just counting bytes allocated in a global static) could be defined with false and continue to be used by std. Name should obviously be bikeshed but it feels like a fairly reasonable tradeoff between allowing a general override and working out of the box for people who don't want to think about it.

(A more complicated variant, and probably slower to stabilize, would be a separate global allocator item for std internals, but I'm not convinced that's better).

Allocators are already unsafe impl, so we can require the value returned is always the same pretty easily.

fn requires_std_reentrancy(&self) -> bool {
    true
}

@orlp
Contributor Author

orlp commented Feb 9, 2026

@Mark-Simulacrum I don't really understand your proposed solution. Who would call requires_std_reentrancy (and when), and what would they do differently if it returns true or false?

@Mark-Simulacrum
Member

std's code that this PR changed to call System directly would instead condition on the global_allocator's return from requires_std_reentrancy and either call System (the default) or call the global allocator (if it opts in).
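To illustrate the dispatch being described, here is a sketch that models the proposal with a stand-in trait, since requires_std_reentrancy does not exist on GlobalAlloc. StdReentrancy, std_internal_alloc, std_internal_dealloc and Counting are all hypothetical names, and nothing here reflects how std would actually implement it.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for the proposed method; not part of the real `GlobalAlloc`.
trait StdReentrancy: GlobalAlloc {
    /// Must return the same value on every call.
    fn requires_std_reentrancy(&self) -> bool {
        true
    }
}

/// What std-internal allocation sites could do under the proposal.
unsafe fn std_internal_alloc<A: StdReentrancy>(global: &A, layout: Layout) -> *mut u8 {
    if global.requires_std_reentrancy() {
        // Default: keep std internals away from the global allocator.
        unsafe { System.alloc(layout) }
    } else {
        // Opted in: the allocator sees every std allocation again.
        unsafe { global.alloc(layout) }
    }
}

unsafe fn std_internal_dealloc<A: StdReentrancy>(global: &A, ptr: *mut u8, layout: Layout) {
    if global.requires_std_reentrancy() {
        unsafe { System.dealloc(ptr, layout) }
    } else {
        unsafe { global.dealloc(ptr, layout) }
    }
}

/// Hypothetical allocator that counts `alloc` calls.
struct Counting {
    calls: AtomicUsize,
    needs_reentrancy: bool,
}

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.calls.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

impl StdReentrancy for Counting {
    fn requires_std_reentrancy(&self) -> bool {
        self.needs_reentrancy
    }
}
```

The answer has to be constant per allocator because a pointer must be freed by the same allocator that produced it.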

@Mark-Simulacrum
Member

We discussed this in today's libs-api meeting. We're not inclined to revert/backport at this time (since this has reached stable and there may be code depending on the new guarantees already). The proposed next step is to open a libs-api ACP that proposes the API I suggested above to allow GlobalAlloc impls to opt in to allowing the re-entrancy.

@laurmaedje Do you think that's sufficient for your case? i.e., the override you previously had wasn't using the APIs std guaranteed here to not call into GlobalAlloc?

We also noticed you mentioned "the docs previously stated that all allocations are routed through the global allocator", can you link those docs? We'd like to make sure they're updated (and presuming the ACP would get accepted, point at the strategy for opting in).

@Mark-Simulacrum
Member

Cut the ACP: rust-lang/libs-team#743

@laurmaedje
Contributor

Thanks for discussing this among the libs team!

> Do you think that's sufficient for your case?

Yes, for my use case, this would absolutely suffice. While I think a TLS design akin to Vec::new_in might have been cleaner, I totally understand the real-world difficulties with the fact that it's already merged and guaranteed, so I'm positively surprised that a fix might be coming at all. :)


> We also noticed you mentioned "the docs previously stated that all allocations are routed through the global allocator", can you link those docs?

I was referring to the global allocator section in the previous docs:

> You can use this to implement a completely custom global allocator to route all default allocation requests to a custom object.

Of course, the "default" is doing some heavy lifting, so it's not 100% clear cut. The docs have already been updated to include the footnote so I don't think anything further needs to change here.


Labels

disposition-merge, finished-final-comment-period, merged-by-bors, O-hermit, O-itron, O-SGX, O-unix, O-windows, S-waiting-on-bors, T-libs-api


Development

Successfully merging this pull request may close these issues.

Calling std::thread::current() in global allocator results in non-obvious error