Jarvis exposes an OpenAI-compatible API through LiteLLM. Any tool or library that works with the OpenAI API works with Jarvis — point it at your Jarvis host instead of api.openai.com.

Inference workflow

1. Choose a model

Pick a model for your task: a general-purpose model like llama3 or mistral for most work, a code-focused model like deepseek-coder for programming, or a fast quantized model for high-throughput needs. Not sure which to use? See Models for a breakdown by category.
2. Send a completion request

POST your request to the /v1/chat/completions endpoint on your Jarvis host. Include your model name, messages, and API key.
curl https://your-jarvis-host/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "llama3",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain how neural networks work." }
    ]
  }'
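The same request can be assembled in Python and sent with any HTTP client. A minimal sketch — the host and API key are placeholders, and the send step is left as a comment:

```python
import json

# The same chat/completions payload the curl example sends.
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how neural networks work."},
    ],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your-api-key",  # placeholder key
}

body = json.dumps(payload)
# POST `body` with `headers` to https://your-jarvis-host/v1/chat/completions
# using any HTTP client (urllib.request, requests, httpx, ...).
print(body)
```

Because the API is OpenAI-compatible, the OpenAI Python and Node SDKs also work unchanged — point their base URL at your Jarvis host instead of api.openai.com.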
3. Read the response

The response follows the standard OpenAI format. Your completion is in choices[0].message.content.
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "llama3",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Neural networks are..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 180,
    "total_tokens": 204
  }
}
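In Python, pulling the completion text and token counts out of that response is straightforward once the JSON is decoded. A sketch using the sample payload above:

```python
import json

# Sample response body in the standard OpenAI format.
raw = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "llama3",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Neural networks are..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 180, "total_tokens": 204}
}
"""

resp = json.loads(raw)
content = resp["choices"][0]["message"]["content"]  # the completion text
total = resp["usage"]["total_tokens"]               # tokens consumed in total
print(content, total)
```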

Request format

The request body follows the OpenAI chat/completions schema:
Field        Type     Description
-----        ----     -----------
model        string   Model name (e.g. llama3, mistral, deepseek-coder)
messages     array    Conversation history with role and content
temperature  number   Sampling temperature, 0–2. Lower = more focused.
max_tokens   number   Maximum tokens in the response
stream       boolean  Set to true to receive a streaming response
node         string   Optional. Target a specific mesh node.
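Putting the optional fields together, a request body with sampling controls might look like this — the values are illustrative, not recommendations for every task:

```python
import json

request = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "Summarize this in one line."}],
    "temperature": 0.3,  # lower = more focused output
    "max_tokens": 256,   # cap on response length
    "stream": False,     # set True for SSE streaming
}
print(json.dumps(request, indent=2))
```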

Specify a node

To run inference on a specific node — for example, to guarantee GPU access — pass the node field alongside your model:
{
  "model": "llama3",
  "node": "ai-max",
  "messages": [{ "role": "user", "content": "Your prompt here" }]
}
See Nodes for the full list of available nodes and their roles.

Streaming responses

Set "stream": true to receive tokens as they are generated instead of waiting for the full response. This improves perceived latency for long completions.
curl https://your-jarvis-host/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "llama3",
    "stream": true,
    "messages": [
      { "role": "user", "content": "Write a short story about a robot." }
    ]
  }'
The response is a stream of data: lines in Server-Sent Events (SSE) format. Each chunk contains a partial choices[0].delta.content value. The stream ends with data: [DONE].
Streaming works with any client that supports SSE or chunked transfer encoding. The OpenAI Python and Node SDKs handle this automatically when you pass stream=True.
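If you handle the stream yourself rather than through an SDK, each data: line is parsed the same way. A minimal sketch that accumulates delta content from sample chunks (abbreviated; real chunks carry more fields):

```python
import json

# Sample SSE lines as a chat/completions stream would emit them.
lines = [
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    'data: {"choices": [{"delta": {"content": " a time"}}]}',
    "data: [DONE]",
]

text = ""
for line in lines:
    if not line.startswith("data: "):
        continue                      # skip blank keep-alive lines
    data = line[len("data: "):]
    if data == "[DONE]":
        break                         # end-of-stream sentinel
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    text += delta.get("content", "")  # some chunks may omit content

print(text)  # "Once upon a time"
```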

Tips for good results

Use a system prompt. Setting a clear system message focuses the model on your use case and improves consistency across requests.
Keep context concise. Local models have fixed context windows. Long conversation histories can crowd out space for the response — trim messages that are no longer relevant.
Tune temperature for the task. Use a lower temperature (0.1–0.4) for factual or code tasks where accuracy matters. Use a higher temperature (0.7–1.0) for creative or generative tasks.
Use fast models for high-throughput pipelines. If you’re running many requests in parallel — such as in an n8n workflow — route to a smaller quantized model to avoid saturating GPU memory.
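The temperature guidance above can be encoded as a small helper. The function name and preset values here are illustrative, not part of the Jarvis API:

```python
def suggested_temperature(task: str) -> float:
    """Map a task category to a conservative sampling temperature."""
    presets = {
        "code": 0.2,      # accuracy matters
        "factual": 0.3,   # accuracy matters
        "creative": 0.8,  # variety matters
    }
    return presets.get(task, 0.7)  # sensible default otherwise

print(suggested_temperature("code"))
```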

Next steps

Models

See which models are available and how to choose one.

API reference

Full API reference including all request parameters.

Agents

Use agents to run multi-step tasks without managing individual requests.

n8n workflows

Integrate inference into automated workflows.