
Deploying Inference Server

Start and run Kubeflow Serving

Run the following command to start Kubeflow serving and run inference on the given input:

bash $WORK_DIR/llm/run.sh  -n <MODEL_NAME> -g <NUM_GPUS> -f <NFS_ADDRESS_WITH_SHARE_PATH> -m <NFS_LOCAL_MOUNT_LOCATION> -e <KUBE_DEPLOYMENT_NAME> [OPTIONAL -d <INPUT_PATH> -v <REPO_COMMIT_ID> -t <HUGGINGFACE_HUB_TOKEN>]

  • n: Name of a validated model
  • d: Absolute path of the input data folder (optional)
  • g: Number of GPUs to use for execution (set to 0 to use CPU)
  • f: NFS server address with share path information
  • m: Local mount path for the NFS server, used in the Kubernetes PV where the model files and model archive are stored
  • e: Desired name of the deployment metadata (will be created)
  • v: Commit ID of the model's HuggingFace repository (optional; if not provided, the default set in model_config is used)
  • t: Your HuggingFace hub token (required for the Llama 2 model)

The script should print "Inference Run Successful" once the Inference Server has started successfully.
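
Before starting, it can help to confirm that the cluster and the NFS share are reachable and that the local mount location exists. The following is a minimal pre-flight sketch, not part of run.sh; the NFS address and mount path are the placeholder values used in the examples below, and showmount requires the NFS client utilities to be installed:

NFS_SHARE='1.1.1.1:/llm'   # value passed to -f: NFS server address with share path
MOUNT_POINT=/mnt/llm       # value passed to -m: local mount location for the NFS share
# Verify kubectl can reach the cluster
kubectl cluster-info > /dev/null || { echo "kubectl cannot reach the cluster"; exit 1; }
# Verify the NFS export is visible and make sure the mount point exists
showmount -e "${NFS_SHARE%%:*}"
sudo mkdir -p "$MOUNT_POINT"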

Examples

The following are example commands to start the Inference Server.

For 1 GPU inference with the official MPT-7B model, keeping the inference server alive:

bash $WORK_DIR/llm/run.sh -n mpt_7b -d data/translate -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm

For 1 GPU inference with the official Falcon-7B model, keeping the inference server alive:

bash $WORK_DIR/llm/run.sh -n falcon_7b -d data/qa -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm

For 1 GPU inference with the official Llama2-7B model, keeping the inference server alive:

bash $WORK_DIR/llm/run.sh -n llama2_7b -d data/summarize -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -t <HUGGINGFACE_HUB_TOKEN>
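
Once a run completes with the server kept alive, a quick way to confirm the deployment is healthy is to list its resources and follow the serving pod's logs. This is a hedged sketch: it assumes the pods and claims created by run.sh carry the deployment name passed with -e (llm-deploy in the examples above), which may differ in your setup:

# List resources tied to the deployment name (assumes they contain "llm-deploy")
kubectl get pods,svc,pv,pvc | grep llm-deploy
# Follow the logs of the first matching serving pod to watch model loading and inference
POD=$(kubectl get pods -o name | grep llm-deploy | head -n 1)
kubectl logs -f "$POD"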

Cleanup Inference deployment

Run the following command to stop the Inference Server and unmount the PV and PVC.

python3 $WORK_DIR/llm/cleanup.py --deploy_name <DEPLOYMENT_NAME>

Example:

python3 $WORK_DIR/llm/cleanup.py --deploy_name llm-deploy
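
To verify that cleanup completed, check that no resources tied to the deployment name remain. As above, this sketch assumes the resource names contain the deployment name, which may not hold for every setup:

# Should print the confirmation message once all llm-deploy resources are gone
kubectl get pods,pv,pvc | grep llm-deploy || echo "No llm-deploy resources remain - cleanup complete"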