Jarvis exposes an OpenAI-compatible chat completions API, so clients written against api.openai.com can be pointed at your Jarvis host instead.
## Inference workflow
### Choose a model

Pick a model for your task. Use a general-purpose model like `llama3` or `mistral` for most tasks, a code-focused model like `deepseek-coder` for programming work, or a fast quantized model for high-throughput needs. Not sure which to use? See Models for a breakdown by category.

### Send a completion request
POST your request to the `/v1/chat/completions` endpoint on your Jarvis host. Include your model name, messages, and API key.

### Request format
The request body follows the OpenAI `chat/completions` schema:
| Field | Type | Description |
|---|---|---|
| `model` | string | Model name (e.g. `llama3`, `mistral`, `deepseek-coder`) |
| `messages` | array | Conversation history with `role` and `content` |
| `temperature` | number | Sampling temperature, 0–2. Lower = more focused. |
| `max_tokens` | number | Maximum tokens in the response |
| `stream` | boolean | Set to `true` to receive a streaming response |
| `node` | string | Optional. Target a specific mesh node. |
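Putting those fields together, here is a minimal sketch of a request in Python using only the standard library. The host URL and API key are placeholders, and the helper name `send` is an illustration, not part of the Jarvis API:

```python
import json
import urllib.request

# Placeholder values -- substitute your own Jarvis host and API key.
JARVIS_URL = "http://localhost:8080/v1/chat/completions"
API_KEY = "your-api-key"

# A request body following the schema in the table above.
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Server-Sent Events in one sentence."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

def send(body: dict) -> dict:
    """POST the request body and return the parsed JSON response."""
    req = urllib.request.Request(
        JARVIS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# reply = send(payload)
# print(reply["choices"][0]["message"]["content"])
```

Because the schema is OpenAI-compatible, the same payload works with any OpenAI-style client library once you point it at your Jarvis host.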
### Specify a node

To run inference on a specific node (for example, to guarantee GPU access), pass the `node` field alongside your model:
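A sketch of such a request body follows; the node name `gpu-01` is hypothetical, so substitute a node from your own mesh:

```python
import json

# "node" is optional; "gpu-01" is a hypothetical node name from your mesh.
payload = {
    "model": "deepseek-coder",
    "messages": [{"role": "user", "content": "Write a binary search in Go."}],
    "node": "gpu-01",
}

print(json.dumps(payload, indent=2))
```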
### Streaming responses

Set `"stream": true` to receive tokens as they are generated instead of waiting for the full response. This improves perceived latency for long completions.
The response arrives as `data:` lines in Server-Sent Events (SSE) format. Each chunk contains a partial `choices[0].delta.content` value. The stream ends with `data: [DONE]`.
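A minimal sketch of consuming such a stream by hand. The chunk shape below mirrors the description above; real chunks carry additional fields:

```python
import json

def accumulate(sse_lines):
    """Collect partial content from SSE 'data:' lines into one string."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # server signals end of stream
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Example chunks, shaped like the description above.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(accumulate(stream))  # -> Hello
```

In practice an SDK does this for you; hand-rolling is only needed for clients without SSE support.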
Streaming works with any client that supports SSE or chunked transfer encoding. The OpenAI Python and Node SDKs handle this automatically when you pass `stream=True`.

### Tips for good results
## Next steps

- **Models**: See which models are available and how to choose one.
- **API reference**: Full API reference including all request parameters.
- **Agents**: Use agents to run multi-step tasks without managing individual requests.
- **n8n workflows**: Integrate inference into automated workflows.