Fix for ASE MACE calculator failures with TorchScript on ROCm#1044
Conversation
Hey @7radians, thank you for that; this seems very weird indeed. Can you tell me what error appeared?
@ilyes319 here are the errors my collaborator had: |
Mmm, were you using a model that was compiled beforehand on an older version of mace?
The model was compiled with mace 0.3.13, and failed with both 0.3.13 and 0.3.14, giving the same errors.
Develop patch
Some other commit tagged along from main and caused errors with the heads; here is the clean fix based on the develop branch. Hopefully that's less hassle, @ilyes319.
Hi all,
Not sure if others have run into this issue on Archer2 or elsewhere, but in case this fix is useful:
Context
My collaborator's ASE MD runs with MACE 0.3.13/0.3.14 failed when using a ROCm PyTorch build on Archer2. TorchScript was enforcing strict schema checks, silently ignoring unknown kwargs and omitting optional outputs, causing runtime errors and deadlocks.
This PR
Improves robustness of the ASE MACE calculator to handle these scenarios:
Dynamic kwarg gating
Inspect model.forward at runtime and pass compute_edge_forces / compute_atomic_stresses only if supported, eliminating unknown-kwarg errors on TorchScripted models on ROCm builds.
Safe output access
Replace out["..."] with out.get("...") plus null checks for atomic_stresses and atomic_virials, preventing KeyErrors or hangs when keys are absent.
Empty-list stacking guard
Before aggregating per-model tensors with torch.stack(), verify that the corresponding list is non-empty, avoiding deadlocks.
All changes are in mace/calculators/mace.py.
Tested on CUDA, ROCm, and CPUs on the following machines:
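For illustration, here is a minimal pure-Python sketch of the three guards described above, using a stub model and plain lists in place of torch tensors (the `StubModel` class and `supported_kwargs` helper are hypothetical, not the actual code in mace/calculators/mace.py):

```python
import inspect


def supported_kwargs(forward, requested):
    """Return the subset of `requested` kwargs that `forward` accepts.

    TorchScript-compiled forwards may not expose a Python signature;
    in that case, conservatively pass no optional kwargs at all.
    """
    try:
        params = inspect.signature(forward).parameters
    except (TypeError, ValueError):
        return {}
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(requested)
    return {k: v for k, v in requested.items() if k in params}


class StubModel:
    # Stand-in for a (possibly TorchScripted) MACE model; note it only
    # accepts compute_edge_forces, not compute_atomic_stresses.
    def forward(self, data, compute_edge_forces=False):
        out = {"energy": -1.23}
        if compute_edge_forces:
            out["edge_forces"] = [0.0, 0.1]
        return out


model = StubModel()

# 1) Dynamic kwarg gating: the unsupported kwarg is filtered out before the call.
kwargs = supported_kwargs(
    model.forward,
    {"compute_edge_forces": True, "compute_atomic_stresses": True},
)
out = model.forward({}, **kwargs)

# 2) Safe output access: .get() plus a null check instead of out["..."].
atomic_stresses = out.get("atomic_stresses")
if atomic_stresses is not None:
    pass  # process stresses only when the model actually returned them

# 3) Empty-list stacking guard: only aggregate when something was collected
#    (torch.stack on an empty list would fail; shown here with plain lists).
collected = [out["edge_forces"]] if "edge_forces" in out else []
stacked = collected if collected else None
```

The same pattern applies per optional output: gate the request, fetch the result defensively, and skip aggregation when nothing came back.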
The overheads should be (and appear to be) negligible.