Management Requests¶
The Inference Server can be managed through the TorchServe Management API. Find out more in the official TorchServe Management API documentation.
Server Configuration¶
| Variable | Value |
|---|---|
| inference_server_endpoint | localhost |
| management_port | 8081 |
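For convenience, these values can be exported as shell variables and substituted into the template commands that follow. A minimal sketch, assuming a bash-compatible shell:
# Values from the table above; adjust them for your deployment.
export inference_server_endpoint=localhost
export management_port=8081
# Example substitution: list all registered models.
curl http://${inference_server_endpoint}:${management_port}/models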
The following are example cURL commands to send management requests to the Inference Server.
List Registered Models¶
To list all registered models, the template command is:
curl http://{inference_server_endpoint}:{management_port}/models
Example¶
For all registered models:
curl http://localhost:8081/models
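The response is a JSON object listing the name and archive URL of every registered model. A representative response, assuming only the llama2_7b model used later in this guide is registered:
{
  "models": [
    {
      "modelName": "llama2_7b",
      "modelUrl": "llama2_7b_6fdf2e6.mar"
    }
  ]
}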
Describe Registered Models¶
Once a model is loaded on the Inference Server, we can use the following request to describe the model and its configuration.
The following is the template command, followed by a sample response:
curl http://{inference_server_endpoint}:{management_port}/models/{model_name}
[
{
"modelName": "llama2_7b",
"modelVersion": "6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9",
"modelUrl": "llama2_7b_6fdf2e6.mar",
"runtime": "python",
"minWorkers": 1,
"maxWorkers": 1,
"batchSize": 1,
"maxBatchDelay": 200,
"loadedAtStartup": false,
"workers": [
{
"id": "9000",
"startTime": "2023-11-28T06:39:28.081Z",
"status": "READY",
"memoryUsage": 0,
"pid": 57379,
"gpu": true,
"gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::13423 MiB"
}
],
"jobQueueStatus": {
"remainingCapacity": 1000,
"pendingRequests": 0
}
}
]
Note
From this request, you can verify whether a model is ready for inference by checking the "status" field of each entry under the "workers" key of the response.
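To perform this readiness check from a script, the worker status can be extracted with jq (assuming jq is installed); a minimal sketch for the llama2_7b model shown above:
# Prints READY for each worker that has finished loading the model.
curl -s http://localhost:8081/models/llama2_7b | jq -r '.[0].workers[].status'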
Examples¶
For the MPT-7B, Falcon-7B, and Llama2-7B models:
curl http://localhost:8081/models/mpt_7b
curl http://localhost:8081/models/falcon_7b
curl http://localhost:8081/models/llama2_7b
Register Additional Models¶
TorchServe allows registering (loading) multiple models simultaneously. To register multiple models, make sure that the Model Archive files for those models are stored in the same model store directory.
The following is the template command:
curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true"
Examples¶
For the MPT-7B, Falcon-7B, and Llama2-7B models:
curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true"
curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true"
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true"
Note
Make sure the Model Archive file name given in the cURL request is correct and that the file is present in the model store directory.
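Batching parameters such as batch_size and max_batch_delay are set at registration time. A sketch, assuming the llama2_7b archive from the example above supports batched requests:
curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true&batch_size=4&max_batch_delay=200"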
Edit Registered Model Configuration¶
A model's worker configuration can be changed after registration using the TorchServe Management API (batching parameters, as noted above, are fixed at registration time).
The following is the template command:
curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_worker={number}&max_worker={number}"
Examples¶
For the MPT-7B, Falcon-7B, and Llama2-7B models:
curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2"
curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2"
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2"
Note
Make sure there is enough GPU and system memory before increasing the number of workers; otherwise, the additional workers will fail to load.
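By default, the scale-workers call returns immediately while the workers start in the background. To block until scaling completes, the synchronous flag can be added; a sketch for the llama2_7b model:
curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2&synchronous=true"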
Unregister a Model¶
The following is the template command to unregister a model from the Inference Server:
curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}"
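Example¶
For the llama2_7b model, using the model version from the describe response above:
curl -X DELETE "http://localhost:8081/models/llama2_7b/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9"
If the version segment is omitted, the default version of the model is unregistered.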