`torch.distributions.Categorical(logits=...).sample()` returns -9223372036854775808 on MPS; works correctly on the CPU backend
🐛 Describe the bug
The code snippet below demonstrates an issue with the `torch.distributions.Categorical` class on the MPS (Metal Performance Shaders) backend: the `sample()` method returns the invalid value -9223372036854775808 on MPS, whereas it works as expected on the CPU backend.
```python
import torch

# Sampling on CPU returns a valid category index.
device = 'cpu'
t = torch.tensor([-0.6194, 0.2150, 0.0741, -0.5155, -0.3574, 0.1880, 0.3493, 0.2933,
                  0.3222, 0.1351, -0.1676, 0.2195, -0.2661, -0.1681, 0.0102, -0.2942,
                  0.1377, -0.3102, 0.0231, -0.3813, -0.8353, -0.0413, -0.2836, -0.0108,
                  -0.6760, -0.0350, -0.6092], device=device)
print(torch.distributions.Categorical(logits=t).sample())

# The same logits on MPS return INT64_MIN instead of a valid index.
device = 'mps'
t = torch.tensor([-0.6194, 0.2150, 0.0741, -0.5155, -0.3574, 0.1880, 0.3493, 0.2933,
                  0.3222, 0.1351, -0.1676, 0.2195, -0.2661, -0.1681, 0.0102, -0.2942,
                  0.1377, -0.3102, 0.0231, -0.3813, -0.8353, -0.0413, -0.2836, -0.0108,
                  -0.6760, -0.0350, -0.6092], device=device)
print(torch.distributions.Categorical(logits=t).sample())
```
Output

```
tensor(18)
tensor(-9223372036854775808, device='mps:0')
```
The code works correctly on the CPU backend but not on MPS.
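A quick arithmetic check (a sketch; no GPU needed) shows the bogus value is exactly the minimum representable 64-bit signed integer, which hints that the MPS kernel is emitting an uninitialized or overflowed index rather than a valid category:

```python
# -9223372036854775808 is exactly INT64_MIN, i.e. -(2**63): the value a
# signed 64-bit integer wraps to, and a common sign of an uninitialized
# or overflowed index coming out of a kernel.
INT64_MIN = -(2**63)
assert INT64_MIN == -9223372036854775808

# With torch available, the same constant is exposed as
# torch.iinfo(torch.int64).min.
print(INT64_MIN)  # -9223372036854775808
```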
Versions

```
Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.3.1 (x86_64)
GCC version: Could not collect
Clang version: 13.0.1
CMake version: version 3.23.2
Libc version: N/A
Python version: 3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:51:21) [Clang 17.0.6 ] (64-bit runtime)
Python platform: macOS-15.3.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] gpytorch==1.9.1
[pip3] hamiltorch==0.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==1.9.3
[pip3] pytorch-metric-learning==1.7.3
[pip3] torch==2.0.0
[pip3] torchaudio==0.12.1
[pip3] torchmetrics==0.8.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.15.1
[conda] autopytorch 0.2.1 pypi_0 pypi
[conda] gpytorch 1.9.1 pypi_0 pypi
[conda] hamiltorch 0.4.1 pypi_0 pypi
[conda] mkl 2022.2.1 h44ed08c_16952 conda-forge
[conda] mkl-service 2.4.0 py39h9032bd8_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.26.4 py39h28c39a1_0 conda-forge
[conda] pytorch-lightning 1.9.3 pypi_0 pypi
[conda] pytorch-metric-learning 1.7.3 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchaudio 0.12.1 py39_cpu pytorch
[conda] torchmetrics 0.8.2 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi
```
Possible Causes
- MPS backend issue: the bug may lie in the Metal-based MPS implementation of the sampling kernel, since `Categorical.sample()` draws its index via `torch.multinomial` internally.
- Data type mismatch: the returned value -9223372036854775808 is exactly the minimum representable `int64`, which suggests the kernel writes an uninitialized or overflowed index rather than a valid category.
- PyTorch version: the issue might be specific to the PyTorch version in use. The bug may have been introduced in a particular version, and a fix may exist in a later release.
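Because `Categorical.sample()` draws its index via `torch.multinomial` on the softmaxed probabilities, calling `torch.multinomial` directly is one way to check whether the MPS multinomial kernel itself is at fault (a sketch; it needs a build with MPS support to actually exercise the failing path):

```python
import torch

logits = torch.randn(27)
probs = torch.softmax(logits, dim=-1)  # same probabilities Categorical uses

# Run on MPS when available, otherwise fall back to CPU so the snippet
# still executes; on an affected build the MPS result is INT64_MIN.
device = "mps" if torch.backends.mps.is_available() else "cpu"
idx = torch.multinomial(probs.to(device), num_samples=1)
print(idx)  # a valid draw is an index in [0, 27)
```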
Workarounds
- Use the CPU backend: as a temporary workaround, set `device` to `'cpu'`. The code then works correctly, though giving up GPU acceleration may not be suitable for production use.
- Update PyTorch: if you're using an older version, try updating to the latest release; the bug may already be fixed there.
- Use a different distribution: if the issue is specific to `Categorical`, try a different distribution, such as `Bernoulli` or `Multinomial`.
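The CPU-fallback workaround can be wrapped in a small helper so call sites don't have to change (a hypothetical helper, not part of PyTorch; it trades a device round-trip for correct samples):

```python
import torch

def sample_categorical_cpu(logits: torch.Tensor) -> torch.Tensor:
    """Sample a categorical index on CPU, then move it back to the
    device of `logits`, sidestepping the broken MPS sampling path."""
    sample = torch.distributions.Categorical(logits=logits.cpu()).sample()
    return sample.to(logits.device)

# Usage: also works for CPU tensors, so calling code stays device-agnostic.
s = sample_categorical_cpu(torch.randn(10))
print(s)
```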
Conclusion
The issue with `torch.distributions.Categorical(logits=...).sample()` returning -9223372036854775808 on MPS requires further investigation. Possible causes include a bug in the MPS sampling kernel, a data type or overflow problem in the returned index, and PyTorch version-specific behavior. Workarounds such as sampling on the CPU backend, updating PyTorch, or using a different distribution can mitigate the problem, but a permanent fix requires identifying and addressing the root cause in the MPS implementation.
Future Work
- Investigate the MPS backend: analyze the Metal-based sampling implementation to identify the faulty kernel or limitation.
- Address the data type mismatch: ensure the index returned by `sample()` on MPS is a valid `int64` category, as it is on CPU.
- PyTorch version-specific fix: if the issue affects particular versions, land a fix in those versions or document a workaround for their users.

By addressing these areas, `torch.distributions.Categorical(logits=...).sample()` can be made to work correctly on MPS.
Q&A
Q: What is the issue with `torch.distributions.Categorical(logits=...).sample()` on MPS?
A: It returns the invalid value -9223372036854775808 on MPS, whereas it works as expected on the CPU backend.
Q: Is this issue specific to a particular PyTorch version?
A: The report above was produced with PyTorch 2.0.0. The bug may have been introduced in that version, and later releases may already contain a fix.
Q: Can I use the CPU backend as a workaround?
A: Yes. Set `device` to `'cpu'` and the code works correctly, though giving up GPU acceleration may not be suitable for production use.
Q: Are there any other workarounds available?
A: Yes. Try updating to the latest PyTorch version, or use a different distribution, such as `Bernoulli` or `Multinomial`.
Q: What are the possible causes of this issue?
A: The possible causes include MPS backend issues, data type mismatches, and PyTorch version-specific bugs.
Q: How can I investigate the MPS backend issues?
A: You can analyze the Metal-based implementation and identify potential bugs or limitations. This might involve using debugging tools or consulting the PyTorch documentation.
Q: Can I modify the code to ensure that the data types match?
A: Yes, you can check that the data types match between the `logits` tensor and the `sample()` method's return value.
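As a concrete check (a sketch using only public PyTorch APIs): a valid sample is always an `int64` index in `[0, num_categories)`, so assertions like the following catch the corrupted MPS output immediately:

```python
import torch

logits = torch.randn(27)
sample = torch.distributions.Categorical(logits=logits).sample()

# On a healthy backend the sample is a non-negative int64 index smaller
# than the number of categories; INT64_MIN fails both range checks.
assert sample.dtype == torch.int64
assert 0 <= sample.item() < logits.numel()
print(sample)
```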
Q: Is there a fix available for this issue?
A: This report indicates a fix landed in PyTorch 2.1.0; if you are affected, updating to 2.1.0 or later is the first thing to try.
Q: What are the future work items for this issue?
A: The future work items include investigating MPS backend issues, addressing data type mismatches, and implementing a PyTorch version-specific fix.
Q: Can I use a different distribution as a workaround?
A: Yes, you can try a different distribution, such as `Bernoulli` or `Multinomial`, as a workaround.
Q: Are there any other resources available for this issue?
A: Yes, you can consult the PyTorch documentation, GitHub issues, or Stack Overflow for more information on this issue.
Q: How can I report this issue to the PyTorch team?
A: You can report this issue to the PyTorch team by creating a GitHub issue or submitting a pull request with a fix.
Q: What is the expected behavior of `torch.distributions.Categorical(logits=...).sample()` on MPS?
A: It should return a random category index drawn from the categorical distribution, just as it does on the CPU backend.
Q: Can I use `torch.distributions.Categorical(logits=...).sample()` in production?
A: Not on MPS until the issue is resolved. Sample on CPU instead, or restrict MPS sampling to development and testing environments.
Q: Are there any other PyTorch distributions that are affected by this issue?
A: Possibly. `Categorical.sample()` draws its index via `torch.multinomial`, so distributions that share that sampling path (such as `Multinomial`) could be affected as well, while `Bernoulli`, which samples differently, is less likely to be.