Serverless Inference
NetMind Power also offers serverless inference capabilities. With this feature, you are billed based on the actual runtime of the inference service, which can significantly reduce operational costs for small to medium-sized applications or for workloads with peak-and-valley API usage patterns.
To create a serverless inference service, go to the "Serverless" page and click the "New Serverless" button.
First, choose an appropriate GPU. The right choice depends on the VRAM requirements of your model and the inference performance you expect; as a rough guide, a 7-billion-parameter model served in 16-bit precision needs about 14 GB of VRAM for its weights alone.
Next, fill in the required details, including basic information such as the inference name and description, the payment method, and the scaling method. The platform currently supports both manual and automatic scaling.
Manual Scaling: Users manually manage the number of workers for their model inference instances.
Automatic Scaling: The platform automatically adjusts the number of workers based on the scaling rules chosen by the user.
The platform supports two types of scaling rules: concurrency and RPS (Requests Per Second). Both rules allow users to configure a threshold. When the platform detects that the request volume exceeds the threshold for the selected rule, it increases the number of inference workers; when the request volume falls below the threshold, it decreases them.
Additionally, in automatic mode, the platform allows you to set a minimum and maximum number of workers:
Setting the minimum replica count to a value greater than 0 avoids cold-start issues.
Setting a maximum replica count helps you stay within budget (a rough sketch of this scaling logic follows this list).
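As a conceptual illustration only, not the platform's actual implementation: an RPS rule can be thought of as deriving a target worker count from the observed request rate and clamping it to the configured bounds. The per-worker capacity assumption and the formula below are purely illustrative.

```python
import math

def target_workers(observed_rps: float, rps_threshold: float,
                   min_workers: int, max_workers: int) -> int:
    """Illustrative sketch: estimate how many workers an RPS rule might request.

    Assumes each worker handles up to `rps_threshold` requests per second and
    clamps the result to the configured minimum and maximum replica counts.
    """
    desired = math.ceil(observed_rps / rps_threshold) if observed_rps > 0 else 0
    return max(min_workers, min(max_workers, desired))

# Example: a 10 RPS threshold and 35 RPS of traffic suggest 4 workers,
# but a configured maximum of 3 caps the result at 3.
print(target_workers(35, 10, min_workers=1, max_workers=3))  # -> 3
```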
You also need to package your inference service into a Docker image and push it to a container registry. Then enter the image's URL in the provided form. Both public and private images are supported.
Note that the maximum number of workers is also limited by the number of machines currently available on the platform.
Please note that the inference service inside the image must listen on port 8080; we currently do not support inference services running on other ports.
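For reference, here is a minimal sketch of a containerized inference service that satisfies the port requirement. It uses Flask, and the `/predict` route and echo logic are illustrative placeholders; replace them with your actual model-serving code.

```python
# Minimal sketch of an inference service that listens on port 8080.
# Flask and the /predict route are illustrative choices; the prediction
# logic is a placeholder to be replaced with your own model code.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    # Placeholder: run your model here and build a real response.
    return jsonify({"echo": payload})

if __name__ == "__main__":
    # The platform requires the service to listen on port 8080.
    app.run(host="0.0.0.0", port=8080)
```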
Next, you will see the newly created serverless instance with the status "Deploying". Click "Detail" to view more information. Here, you can see the details you provided during creation, obtain the model inference endpoint, check billing information, monitor the number of workers, and access the workers' running logs. You can also stop or edit this Serverless instance from this page.
A Serverless instance can have the following statuses:
Deploying: The status immediately after creating a new instance or redeploying a stopped one. It means the instance is being initialized.
Running: Typically follows the "Deploying" status. This indicates the instance is running normally, can be scaled, and the endpoint is accessible.
Stopped: The instance is stopped, and the number of workers will be reduced to zero.
Failed: The instance deployment or scaling failed due to an error, which may cause the endpoint to become inaccessible.
Finally, you can obtain the request URL by clicking Endpoint Info or checking the Request URL displayed on the interface.
Note that when making actual inference requests, you may need to append the URL path required by the inference service in your image to the platform-provided URL, depending on the service's requirements.
When making requests to the serverless endpoint, you still need to include an API token; this ensures the security of the API. See the example below.
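As a sketch only: the endpoint placeholder, the `/predict` path, the `Authorization: Bearer` header format, and the request body below are assumptions. Substitute the Request URL shown on the platform, the path your containerized service actually exposes, and your own NetMind API token.

```python
import requests

# Placeholder values; replace with your own endpoint, path, and token.
BASE_URL = "https://<your-serverless-endpoint>"  # Request URL shown on the platform
SERVICE_PATH = "/predict"                        # path defined by the service in your image
API_TOKEN = "<your-netmind-api-token>"

response = requests.post(
    BASE_URL + SERVICE_PATH,
    headers={"Authorization": f"Bearer {API_TOKEN}"},  # assumed auth header format
    json={"inputs": "Hello, world!"},                  # payload depends on your service
    timeout=60,
)
response.raise_for_status()
print(response.json())
```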