[Feature Request] MAX Models Auto Load Balancing/Loader
What is your request?
We propose the addition of a load balancer/model loader to MAX, enabling users to define a set of models that can be accessed through an OpenAI-compatible API, even if they aren’t currently loaded into VRAM. This feature would significantly enhance the usability and efficiency of MAX, particularly in self-hosted scenarios where multiple open-source models need to be utilized.
What is your motivation for this change?
Our motivation for this change stems from the limitations of the current MAX implementation. We run two different systems, each using GPUStack to serve model APIs with a different backend depending on the quantization. While the vLLM backend on the HGX 8xH100 is fast and straightforward, the Ollama backend on the four RTX 3090s offers a unique feature: an automatic model loader and unloader. This is incredibly useful for self-hosted setups, as it eliminates the need to manually unload and load models.
When a request arrives for a model that isn't loaded into GPU VRAM, the Ollama backend checks whether anyone is still using the currently loaded model. If it is idle, Ollama unloads it from memory and loads the requested one. This is particularly useful when multiple open-source models need to be served without manual unloading and loading.
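To make the mechanism concrete, here is a minimal illustrative sketch of that check-and-swap logic in Python. The names below are ours for illustration only and do not correspond to Ollama's actual internals:

```python
# Illustrative sketch of the swap logic described above (not Ollama's real code).
class SingleModelServer:
    def __init__(self) -> None:
        self.model: str | None = None  # name of the one resident model
        self.active_requests = 0       # in-flight requests against it

    def handle(self, requested: str) -> str:
        if requested != self.model:
            if self.active_requests > 0:
                # The resident model is still in use: don't evict it.
                return "busy: resident model has active requests"
            # Idle: unload the old weights and load the requested ones.
            self.model = requested
        self.active_requests += 1
        try:
            return f"served by {self.model}"  # inference would run here
        finally:
            self.active_requests -= 1
```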
Current Limitations
GPUStack is attempting to implement a similar feature for its vLLM backend, but it is not yet reliable. This gap highlights the need for a more robust and efficient load balancer/model loader in MAX.
Proposed Solution
We propose the implementation of a load balancer/model loader in MAX that can handle the following features:
- Auto Load Balancing: The system should check if there’s enough available VRAM to load a requested model. If there isn’t, it should unload the current model from memory before loading the requested one.
- User-Defined List of Pinned Models: Users should be able to define a list of models that are exempt from unloading and stay resident for as long as MAX is running, for better response times (a sketch of both behaviors follows this list).
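As a starting point for discussion, here is a minimal Python sketch of how such a policy could combine the VRAM check with a pinned-model list. Every name here (`Resident`, `ensure_loaded`, the VRAM accounting) is a hypothetical placeholder; MAX exposes no such API today:

```python
from dataclasses import dataclass

PINNED = {"llama-3.1-8b"}  # hypothetical user-defined models that must stay resident
TOTAL_VRAM_GB = 24.0       # e.g. a single RTX 3090

@dataclass
class Resident:
    name: str
    vram_gb: float
    active_requests: int = 0  # in-flight requests against this model

def ensure_loaded(requested: str, needed_gb: float,
                  resident: dict[str, Resident]) -> None:
    """Evict idle, non-pinned models until `requested` fits, then load it."""
    if requested in resident:
        return

    def free_gb() -> float:
        return TOTAL_VRAM_GB - sum(m.vram_gb for m in resident.values())

    for name in [n for n in resident if n not in PINNED]:
        if free_gb() >= needed_gb:
            break
        if resident[name].active_requests == 0:  # never evict a busy model
            resident.pop(name)                   # real code would free VRAM here
    if free_gb() < needed_gb:
        raise RuntimeError(f"cannot fit {requested!r}: VRAM exhausted")
    resident[requested] = Resident(requested, needed_gb)  # weights load here
```

An obvious refinement would be to evict idle models in least-recently-used order rather than dictionary order, but the core contract stays the same: pinned and busy models are never evicted.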
Benefits of the Proposed Solution
The proposed solution would bring several benefits to MAX users, including:
- Improved Usability: The auto load balancer/model loader would eliminate the need for manual unloading and loading of models, making it easier for users to work with multiple open-source models.
- Increased Efficiency: The system would be able to handle multiple models more efficiently, reducing the need for manual intervention and improving overall response times.
- Better Resource Utilization: Available VRAM would be used more efficiently, since idle models are evicted only when the memory is actually needed for another model.
Technical Requirements
To implement the proposed solution, the following technical requirements would need to be met:
- OpenAI-Compatible API: The load balancer/model loader should sit behind MAX's existing OpenAI-compatible API, so clients can request any registered model without knowing whether it is currently loaded (see the client-side example after this list).
- GPUStack Integration: The load balancer/model loader should integrate with GPUStack to ensure seamless interaction with the vLLM and Ollama backends.
- User-Defined Configuration: Users should be able to define the list of pinned models that stay loaded for as long as MAX is running.
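On the API side, the client experience could look like a standard OpenAI SDK call pointed at a MAX endpoint. The `base_url`, API key, and model name below are placeholders; the point is that requesting a model that is not yet resident would transparently trigger a load instead of returning an error:

```python
from openai import OpenAI

# Placeholder endpoint and key for a self-hosted MAX server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "qwen2.5-14b" stands in for any model registered with the loader but not
# currently in VRAM; the server would swap it in before answering.
response = client.chat.completions.create(
    model="qwen2.5-14b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```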
Conclusion
The proposed auto load balancer/model loader would significantly enhance the usability and efficiency of MAX, particularly for self-hosted deployments that serve many open-source models. We believe it would bring improved usability, increased efficiency, and better resource utilization, and we look forward to discussing the proposal and its technical requirements in more detail.
Future Work
Future work on this proposal could include:
- Implementation of the Auto Load Balancer/Model Loader: The implementation of the auto load balancer/model loader would require significant technical effort, including the development of the necessary algorithms and integration with GPUStack.
- Testing and Validation: Thorough testing and validation of the auto load balancer/model loader would be necessary to ensure that it meets the required technical specifications and performs as expected in various scenarios (a minimal unit-test sketch follows this list).
- User Feedback and Evaluation: User feedback and evaluation would be essential to ensure that the auto load balancer/model loader meets the needs of MAX users and provides the expected benefits.
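To illustrate the kind of unit tests we have in mind, here is a pytest-style sketch against the hypothetical `ensure_loaded`/`Resident` names from the policy sketch above:

```python
import pytest

def test_idle_model_is_evicted_for_new_request():
    resident = {"old": Resident("old", vram_gb=20.0)}
    ensure_loaded("new", needed_gb=20.0, resident=resident)
    assert "new" in resident and "old" not in resident

def test_busy_model_is_never_evicted():
    resident = {"busy": Resident("busy", vram_gb=20.0, active_requests=1)}
    with pytest.raises(RuntimeError):
        ensure_loaded("new", needed_gb=20.0, resident=resident)
    assert "busy" in resident  # the in-use model stayed resident
```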
References
- GPUStack Documentation
- Ollama Documentation
- OpenAI API Documentation
Q&A: MAX Models Auto Load Balancing/Loader
Q: What is the purpose of the proposed auto load balancer/model loader in MAX?
A: The proposed auto load balancer/model loader in MAX aims to improve the usability and efficiency of the system by allowing users to define a set of models that can be accessed through an OpenAI-compatible API, even if they aren’t currently loaded into VRAM.
Q: Why is the current implementation of MAX limited in terms of model loading and unloading?
A: The current implementation of MAX is limited in terms of model loading and unloading because it requires manual intervention to unload and load models. This can be time-consuming and inefficient, particularly in scenarios where multiple open-source models need to be utilized.
Q: How does the Ollama backend handle concurrent requests for models?
A: When a request targets a model that is not resident in GPU VRAM, the Ollama backend checks whether the currently loaded model is still in use. If it is idle, it is unloaded from memory and the requested model is loaded in its place; in-flight requests are never interrupted.
Q: What are the benefits of the proposed auto load balancer/model loader in MAX?
A: As outlined under "Benefits of the Proposed Solution" above: improved usability (no more manual load/unload cycles), increased efficiency (better response times with less manual intervention), and better resource utilization (VRAM is freed only when another model needs it).
Q: What technical requirements would need to be met to implement the proposed auto load balancer/model loader in MAX?
A: The requirements listed under "Technical Requirements" above: the loader must sit behind MAX's OpenAI-compatible API, integrate with GPUStack's vLLM and Ollama backends, and support a user-defined list of pinned models.
Q: What is the expected timeline for implementing the proposed auto load balancer/model loader in MAX?
A: The expected timeline for implementing the proposed auto load balancer/model loader in MAX would depend on the complexity of the implementation and the availability of resources. However, we anticipate that the implementation would take several months to complete.
Q: How would the proposed auto load balancer/model loader in MAX be tested and validated?
A: The proposed auto load balancer/model loader in MAX would be tested and validated through a combination of unit testing, integration testing, and user testing. This would ensure that the system meets the required technical specifications and performs as expected in various scenarios.
Q: What is the expected impact of the proposed auto load balancer/model loader in MAX on MAX users?
A: The expected impact would be significant: improved usability, increased efficiency, and better resource utilization. We anticipate that MAX would become markedly easier to operate wherever multiple open-source models are served from limited VRAM.
Q: How would the proposed auto load balancer/model loader in MAX be maintained and updated in the future?
A: The proposed auto load balancer/model loader in MAX would be maintained and updated through a combination of bug fixes, feature enhancements, and security patches. We would work closely with MAX users to ensure that the system meets their needs and continues to provide value over time.
Q: What are the next steps for implementing the proposed auto load balancer/model loader in MAX?
A: The steps listed under "Future Work" above: implementing the loader (including the eviction algorithms and GPUStack integration), thorough testing and validation, and gathering user feedback and evaluation.