`torch.distributions.Categorical(logits=...).sample()` returns -9223372036854775808 on MPS; works correctly on the CPU backend
🐛 Describe the bug
The code snippet below demonstrates an issue with the `torch.distributions.Categorical` class on the MPS (Metal Performance Shaders) backend: the `sample()` method returns the invalid value -9223372036854775808 on MPS, whereas it works as expected on the CPU backend.
```python
import torch

# Sampling on CPU returns a valid category index.
device = 'cpu'
t = torch.tensor([-0.6194, 0.2150, 0.0741, -0.5155, -0.3574, 0.1880, 0.3493, 0.2933,
                  0.3222, 0.1351, -0.1676, 0.2195, -0.2661, -0.1681, 0.0102, -0.2942,
                  0.1377, -0.3102, 0.0231, -0.3813, -0.8353, -0.0413, -0.2836, -0.0108,
                  -0.6760, -0.0350, -0.6092], device=device)
print(torch.distributions.Categorical(logits=t).sample())

# The same logits on MPS return INT64_MIN instead of a valid index.
device = 'mps'
t = torch.tensor([-0.6194, 0.2150, 0.0741, -0.5155, -0.3574, 0.1880, 0.3493, 0.2933,
                  0.3222, 0.1351, -0.1676, 0.2195, -0.2661, -0.1681, 0.0102, -0.2942,
                  0.1377, -0.3102, 0.0231, -0.3813, -0.8353, -0.0413, -0.2836, -0.0108,
                  -0.6760, -0.0350, -0.6092], device=device)
print(torch.distributions.Categorical(logits=t).sample())
```
Output

```
tensor(18)
tensor(-9223372036854775808, device='mps:0')
```
The code works correctly on the CPU backend but not on MPS.
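A quick arithmetic check (a sketch; no GPU needed) shows the bogus value is exactly the minimum representable 64-bit signed integer, which hints that the MPS kernel is emitting an uninitialized or overflowed index rather than a valid category:

```python
# -9223372036854775808 is exactly INT64_MIN, i.e. -(2**63): the value a
# signed 64-bit integer wraps to, and a common sign of an uninitialized
# or overflowed index coming out of a kernel.
INT64_MIN = -(2**63)
assert INT64_MIN == -9223372036854775808

# With torch available, the same constant is exposed as
# torch.iinfo(torch.int64).min.
print(INT64_MIN)  # -9223372036854775808
```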
Versions

```
Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.3.1 (x86_64)
GCC version: Could not collect
Clang version: 13.0.1
CMake version: version 3.23.2
Libc version: N/A
Python version: 3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:51:21) [Clang 17.0.6 ] (64-bit runtime)
Python platform: macOS-15.3.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] gpytorch==1.9.1
[pip3] hamiltorch==0.4.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==1.9.3
[pip3] pytorch-metric-learning==1.7.3
[pip3] torch==2.0.0
[pip3] torchaudio==0.12.1
[pip3] torchmetrics==0.8.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.15.1
[conda] autopytorch 0.2.1 pypi_0 pypi
[conda] gpytorch 1.9.1 pypi_0 pypi
[conda] hamiltorch 0.4.1 pypi_0 pypi
[conda] mkl 2022.2.1 h44ed08c_16952 conda-forge
[conda] mkl-service 2.4.0 py39h9032bd8_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.26.4 py39h28c39a1_0 conda-forge
[conda] pytorch-lightning 1.9.3 pypi_0 pypi
[conda] pytorch-metric-learning 1.7.3 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchaudio 0.12.1 py39_cpu pytorch
[conda] torchmetrics 0.8.2 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi
```
Possible Causes
- MPS backend issue: the bug may lie in the Metal-based MPS implementation of the sampling kernel, since `Categorical.sample()` draws its index via `torch.multinomial` internally.
- Data type mismatch: the returned value -9223372036854775808 is exactly the minimum representable `int64`, which suggests the kernel writes an uninitialized or overflowed index rather than a valid category.
- PyTorch version: the issue might be specific to the PyTorch version in use. The bug may have been introduced in a particular version, and a fix may exist in a later release.
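Because `Categorical.sample()` draws its index via `torch.multinomial` on the softmaxed probabilities, calling `torch.multinomial` directly is one way to check whether the MPS multinomial kernel itself is at fault (a sketch; it needs a build with MPS support to actually exercise the failing path):

```python
import torch

logits = torch.randn(27)
probs = torch.softmax(logits, dim=-1)  # same probabilities Categorical uses

# Run on MPS when available, otherwise fall back to CPU so the snippet
# still executes; on an affected build the MPS result is INT64_MIN.
device = "mps" if torch.backends.mps.is_available() else "cpu"
idx = torch.multinomial(probs.to(device), num_samples=1)
print(idx)  # a valid draw is an index in [0, 27)
```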
Workarounds
- Use the CPU backend: as a temporary workaround, set `device` to `'cpu'`. The code then works correctly, though giving up GPU acceleration may not be suitable for production use.
- Update PyTorch: if you're using an older version, try updating to the latest release; the bug may already be fixed there.
- Use a different distribution: if the issue is specific to `Categorical`, try a different distribution, such as `Bernoulli` or `Multinomial`.
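The CPU-fallback workaround can be wrapped in a small helper so call sites don't have to change (a hypothetical helper, not part of PyTorch; it trades a device round-trip for correct samples):

```python
import torch

def sample_categorical_cpu(logits: torch.Tensor) -> torch.Tensor:
    """Sample a categorical index on CPU, then move it back to the
    device of `logits`, sidestepping the broken MPS sampling path."""
    sample = torch.distributions.Categorical(logits=logits.cpu()).sample()
    return sample.to(logits.device)

# Usage: also works for CPU tensors, so calling code stays device-agnostic.
s = sample_categorical_cpu(torch.randn(10))
print(s)
```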
Conclusion
The issue with `torch.distributions.Categorical(logits=...).sample()` returning -9223372036854775808 on MPS requires further investigation. Possible causes include a bug in the MPS sampling kernel, a data type or overflow problem in the returned index, and PyTorch version-specific behavior. Workarounds such as sampling on the CPU backend, updating PyTorch, or using a different distribution can mitigate the problem, but a permanent fix requires identifying and addressing the root cause in the MPS implementation.
Future Work
- Investigate the MPS backend: analyze the Metal-based sampling implementation to identify the faulty kernel or limitation.
- Address the data type mismatch: ensure the index returned by `sample()` on MPS is a valid `int64` category, as it is on CPU.
- PyTorch version-specific fix: if the issue affects particular versions, land a fix in those versions or document a workaround for their users.

By addressing these areas, `torch.distributions.Categorical(logits=...).sample()` can be made to work correctly on MPS.
Q&A
Q: What is the issue with `torch.distributions.Categorical(logits=...).sample()` on MPS?
A: It returns the invalid value -9223372036854775808 on MPS, whereas it works as expected on the CPU backend.
Q: Is this issue specific to a particular PyTorch version?
A: The report above was produced with PyTorch 2.0.0. The bug may have been introduced in that version, and later releases may already contain a fix.
Q: Can I use the CPU backend as a workaround?
A: Yes. Set `device` to `'cpu'` and the code works correctly, though giving up GPU acceleration may not be suitable for production use.
Q: Are there any other workarounds available?
A: Yes. Try updating to the latest PyTorch version, or use a different distribution, such as `Bernoulli` or `Multinomial`.
Q: What are the possible causes of this issue?
A: The possible causes include MPS backend issues, data type mismatches, and PyTorch version-specific bugs.
Q: How can I investigate the MPS backend issues?
A: You can analyze the Metal-based implementation and identify potential bugs or limitations. This might involve using debugging tools or consulting the PyTorch documentation.
Q: Can I modify the code to ensure that the data types match?
A: Yes, you can check that the data types match between the `logits` tensor and the `sample()` method's return value.
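As a concrete check (a sketch using only public PyTorch APIs): a valid sample is always an `int64` index in `[0, num_categories)`, so assertions like the following catch the corrupted MPS output immediately:

```python
import torch

logits = torch.randn(27)
sample = torch.distributions.Categorical(logits=logits).sample()

# On a healthy backend the sample is a non-negative int64 index smaller
# than the number of categories; INT64_MIN fails both range checks.
assert sample.dtype == torch.int64
assert 0 <= sample.item() < logits.numel()
print(sample)
```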
Q: Is there a fix available for this issue?
A: This report indicates a fix landed in PyTorch 2.1.0; if you are affected, updating to 2.1.0 or later is the first thing to try.
Q: What are the future work items for this issue?
A: The future work items include investigating MPS backend issues, addressing data type mismatches, and implementing a PyTorch version-specific fix.
Q: Can I use a different distribution as a workaround?
A: Yes, you can try a different distribution, such as `Bernoulli` or `Multinomial`, as a workaround.
Q: Are there any other resources available for this issue?
A: Yes, you can consult the PyTorch documentation, GitHub issues, or Stack Overflow for more information on this issue.
Q: How can I report this issue to the PyTorch team?
A: You can report this issue to the PyTorch team by creating a GitHub issue or submitting a pull request with a fix.
Q: What is the expected behavior of `torch.distributions.Categorical(logits=...).sample()` on MPS?
A: It should return a random category index drawn from the categorical distribution, just as it does on the CPU backend.
Q: Can I use `torch.distributions.Categorical(logits=...).sample()` in production?
A: Not on MPS until the issue is resolved. Sample on CPU instead, or restrict MPS sampling to development and testing environments.
Q: Are there any other PyTorch distributions that are affected by this issue?
A: Possibly. `Categorical.sample()` draws its index via `torch.multinomial`, so distributions that share that sampling path (such as `Multinomial`) could be affected as well, while `Bernoulli`, which samples differently, is less likely to be.