# Dedicated Endpoints

NetMind also offers serverless inference capabilities. With this feature, you are billed for the actual runtime of the inference service, which can significantly reduce operating costs for small to medium-sized applications or for applications whose API usage alternates between peaks and quiet periods.

1. To create a dedicated endpoint, go to the "Dedicated Endpoint" page and click the "New Dedicated Endpoint" button.

<figure><img src="/files/2eLbTvqDpFUcgPydegqf" alt=""><figcaption><p>Dedicated Endpoint Page</p></figcaption></figure>

2. First, choose whether to create the endpoint from a "Custom Image" or a "NetMind Model". If you choose "Custom Image", you then need to select an appropriate GPU, which depends on the model's VRAM requirements and the inference efficiency you expect.

<figure><img src="/files/HGlnlqaoZoGew8J4pnrE" alt=""><figcaption><p>"Custom Image" or "NetMind Model"</p></figcaption></figure>

<figure><img src="/files/eqsDdylAkoq8b9yFQryV" alt=""><figcaption><p>Select Dedicated Endpoint GPU</p></figcaption></figure>

3. Next, fill in the endpoint's details, including basic information such as the inference name and description, the payment method, and the scaling method. The platform currently supports both manual and automatic scaling.

   * **Manual Scaling**: Users manage the number of model inference workers themselves.
   * **Automatic Scaling**: The platform adjusts the number of model inference workers automatically, based on the scaling rule chosen by the user.

   The platform supports two types of scaling rules: **concurrency** and **RPS (Requests Per Second)**. Both rules allow users to configure a **threshold**. When the platform detects that the request volume exceeds the threshold for the selected rule, it increases the number of model inference workers. Conversely, if the request volume falls below the threshold, it decreases the number of workers.

   In automatic mode, the platform also lets you set a minimum and a maximum number of workers (replicas):

   * Setting the **minimum replica count greater than 0** can avoid cold start issues.
   * Setting a **maximum replica count** helps users prevent exceeding their budget.

   You also need to package your inference service into a Docker image and push it to a container registry. After that, enter the image's URL in the provided form. Both public and private images are supported.

{% hint style="info" %}
The maximum number of workers is also limited by the number of machines currently available on the platform.
{% endhint %}
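
The automatic scaling behaviour described in step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's actual algorithm: the function name `next_worker_count`, the one-worker-at-a-time step, and the direct comparison of the observed metric against the threshold are all hypothetical.

```python
def next_worker_count(current, metric, threshold,
                      min_workers, max_workers, available_machines):
    """Return the worker count after one hypothetical scaling decision.

    metric is the observed value for the selected rule (concurrency or RPS).
    Assumption: the platform moves one worker at a time; the real step size
    is not documented on this page.
    """
    if metric > threshold:
        target = current + 1      # load above threshold: scale out
    elif metric < threshold:
        target = current - 1      # load below threshold: scale in
    else:
        target = current          # exactly at threshold: no change

    # The configured maximum is further capped by machine availability.
    upper = min(max_workers, available_machines)
    # Apply the minimum first, then the cap, so capacity always wins.
    return min(upper, max(min_workers, target))


print(next_worker_count(2, metric=15, threshold=10,
                        min_workers=1, max_workers=5, available_machines=10))
```

With a minimum of 1, the count never drops to zero, which is how a non-zero minimum avoids cold starts in this sketch.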

{% hint style="info" %}
Please note that the inference service inside the image must listen on port 8080. Inference services running on other ports are not currently supported.
{% endhint %}
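
As an illustration of the port requirement, here is a minimal sketch of an inference service that listens on port 8080, using only the Python standard library. The `predict` function is a placeholder that echoes its input, and the JSON request and response shapes are assumptions; a real image would load a model and define its own routes.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PORT = 8080  # the platform only routes traffic to this port


def predict(payload):
    """Placeholder for real model inference: echoes the request payload."""
    return {"echo": payload}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON request body (empty body becomes {}).
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            payload = json.loads(raw or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "Request body must be valid JSON")
            return
        body = json.dumps(predict(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=PORT):
    # Bind to all interfaces so the container's published port is reachable.
    return ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler)


def main():
    make_server().serve_forever()
```

Calling `main()` from the container's entrypoint starts the server. Binding to `0.0.0.0` rather than `127.0.0.1` matters inside a container, because a loopback-only listener is not reachable from outside it.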

<figure><img src="/files/aXQrl6whq6bpF6H5Ovod" alt=""><figcaption><p>Fill in Information to Create Serverless</p></figcaption></figure>

4. Next, you will see the newly created serverless instance with the status "**Deploying**". Click **"Detail"** to view more information. Here, you can see the details you provided during creation, obtain the model inference endpoint, check billing information, monitor the number of workers, and access the workers' running logs. You can also stop or edit the serverless instance from this page.

   A serverless instance can have the following statuses:

   * **Deploying**: The status immediately after creating a new instance or redeploying a stopped instance. It means the instance is being initialized.
   * **Running**: Typically follows the "Deploying" status. This indicates the instance is running normally, can be scaled, and the endpoint is accessible.
   * **Stopped**: The instance is stopped, and the number of workers will be reduced to zero.
   * **Failed**: The instance deployment or scaling failed due to an error, which may cause the endpoint to become inaccessible.

<figure><img src="/files/AXRvqGys7Xy4QKPFMye1" alt=""><figcaption><p>Serverless Page</p></figcaption></figure>

<figure><img src="/files/y0vMlEREmpaTMZjeF43y" alt=""><figcaption><p>Serverless Detail</p></figcaption></figure>
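
The four statuses in step 4 suggest a simple wait loop for client-side automation. The sketch below is hypothetical: this page does not describe a status API, so `get_status` is a caller-supplied function that returns one of the status names above, and the timeout and polling interval are arbitrary defaults.

```python
import time
from enum import Enum


class Status(str, Enum):
    DEPLOYING = "Deploying"
    RUNNING = "Running"
    STOPPED = "Stopped"
    FAILED = "Failed"


def wait_until_running(get_status, timeout_s=600.0, interval_s=5.0,
                       sleep=time.sleep):
    """Poll get_status() until the instance is Running.

    get_status is a hypothetical callable returning a status name as a string.
    Raises RuntimeError on Stopped or Failed, TimeoutError if still Deploying.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = Status(get_status())
        if status is Status.RUNNING:
            return status  # endpoint is accessible
        if status in (Status.STOPPED, Status.FAILED):
            raise RuntimeError(f"instance is {status.value}")
        if time.monotonic() >= deadline:
            raise TimeoutError("instance is still Deploying")
        sleep(interval_s)
```

For example, `wait_until_running(lambda: "Running")` returns immediately, while a function that keeps returning `"Deploying"` eventually raises `TimeoutError`.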

5. Finally, you can obtain the request URL by clicking **Endpoint Info** or checking the **Request URL** displayed on the interface.

<figure><img src="/files/ag0eDkveDrf2dOc5Xz5Y" alt=""><figcaption><p>Get Endpoint Info</p></figcaption></figure>

{% hint style="info" %}
Note that when making actual inference requests, you may need to concatenate the platform-provided URL with the URL path required by the inference service in the image, depending on the service's requirements.

Requests to the serverless endpoint must still include an API token, which keeps the API secure.
{% endhint %}
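
The two notes above can be sketched as a small request builder. Everything specific here is an assumption for illustration: the base URL and the `/predict` path are placeholders, the JSON body depends on your service, and the `Authorization: Bearer` header is a common convention that this page does not confirm, so check how your token must be sent.

```python
import json
import urllib.request


def join_url(base_url, service_path):
    """Concatenate the platform URL with the path your service expects."""
    path = service_path.lstrip("/")
    if not path:
        return base_url
    return base_url.rstrip("/") + "/" + path


def build_request(base_url, service_path, api_token, payload):
    """Build (but do not send) a POST request to the endpoint."""
    headers = {
        "Content-Type": "application/json",
        # Assumed header scheme; the real one may differ.
        "Authorization": f"Bearer {api_token}",
    }
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        join_url(base_url, service_path),
        data=data, headers=headers, method="POST",
    )


# Placeholder values; substitute the Request URL shown on the detail page.
req = build_request("https://example.invalid/endpoint/", "/predict",
                    "YOUR_API_TOKEN", {"input": "hello"})
print(req.full_url)
```

Passing the resulting object to `urllib.request.urlopen(req)` sends the request. Normalizing the slashes in `join_url` avoids the doubled or missing `/` that plain string concatenation can produce.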

