🔧 Configuration

Model Configuration

The assistant supports multiple model types:

Amazon Bedrock

MODEL_ID:
  NAME: us.amazon.nova-pro-v1:0
  TYPE: bedrock
  REGION: us-east-1
  TEMPERATURE: 0.1

Note: Newer Claude models on Bedrock reject temperature as deprecated. Omit TEMPERATURE for those — it is only sent when explicitly configured.

Using a named AWS profile (Bedrock, SageMaker, Mantle). These providers use the standard boto3 credential chain (default profile / env vars / instance role). To select a specific named profile instead, set AWS_PROFILE via the config ENV: section. Values there are exported as environment variables at startup, and boto3 picks them up automatically. No model-level config key is needed:

ENV:
  AWS_PROFILE: my-bedrock-profile
  # AWS_REGION: us-east-1   # any AWS env var works here too

Using a Bedrock API key (instead of AWS credentials). Bedrock supports short-term API keys (a bedrock-api-key-... value from the console). For standard Bedrock (TYPE: bedrock), set it as AWS_BEARER_TOKEN_BEDROCK; langchain-aws reads it automatically, and no model config is needed:

ENV:
  AWS_BEARER_TOKEN_BEDROCK: bedrock-api-key-XXXXXXXX

(For Mantle, the same key is supplied differently; see the Mantle section below.)
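
The ENV: export described above can be pictured with a short Python sketch. The helper name export_env and its overwrite behavior are assumptions made for illustration, not the application's actual startup code. The only points taken from this page are that ENV values become environment variables at startup and that boto3 then reads them from the environment.

```python
import os


def export_env(config: dict) -> list[str]:
    """Copy the config's ENV mapping into os.environ and return the exported names.

    Hypothetical helper: the real startup code may differ, for example by
    refusing to overwrite variables that are already set.
    """
    exported = []
    for name, value in (config.get("ENV") or {}).items():
        os.environ[name] = str(value)  # environment values are always strings
        exported.append(name)
    return exported


config = {"ENV": {"AWS_PROFILE": "my-bedrock-profile"}}
exported = export_env(config)
```

After this runs, any library that consults the process environment (boto3 included) sees AWS_PROFILE without a model-level key.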

Amazon Bedrock Mantle

Bedrock Mantle is an OpenAI-compatible API (not the Bedrock Converse API). By default it authenticates with a short-lived bearer token minted from your standard AWS credentials via aws-bedrock-token-generator, so your normal aws configure / SSO setup works — no extra keys to manage. Use TYPE: mantle and a bare model ID from the Mantle catalog.

MODEL_ID:
  NAME: qwen.qwen3-32b # bare Mantle model id (e.g. anthropic.claude-opus-4-8)
  TYPE: mantle
  REGION: us-east-1
  MAX_TOKENS: 8192

Authenticating with a Bedrock API key (no AWS credentials). Instead of minting a token, you can supply a short-term Bedrock API key directly. Mantle reads it from the BEDROCK_API_KEY environment variable (set it via the config ENV: section), or from a per-model API_KEY field. When a key is present it's used as-is; otherwise the app falls back to minting from AWS credentials. (Note: standard Bedrock uses AWS_BEARER_TOKEN_BEDROCK for the same key — Mantle uses BEDROCK_API_KEY.)

# Option A — environment variable (applies to all Mantle calls)
ENV:
  BEDROCK_API_KEY: bedrock-api-key-XXXXXXXX

# Option B — per-model key
MODEL_ID:
  NAME: qwen.qwen3-32b
  TYPE: mantle
  REGION: us-east-1
  API_KEY: bedrock-api-key-XXXXXXXX
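
The key lookup order described above (per-model API_KEY, then BEDROCK_API_KEY, then a minted token) might look roughly like the following. This is a sketch under stated assumptions: resolve_mantle_token and the mint_token callback are hypothetical names, and the real minting step goes through aws-bedrock-token-generator rather than a stub.

```python
from typing import Callable, Mapping


def resolve_mantle_token(
    model_cfg: Mapping[str, str],
    env: Mapping[str, str],
    mint_token: Callable[[], str],
) -> str:
    """Pick the bearer token for a Mantle call (hypothetical sketch)."""
    per_model_key = model_cfg.get("API_KEY")
    if per_model_key:
        return per_model_key  # Option B: per-model key, used as-is
    env_key = env.get("BEDROCK_API_KEY")
    if env_key:
        return env_key  # Option A: environment key, used as-is
    return mint_token()  # fallback: mint from standard AWS credentials


# Stub standing in for the real token minting step.
token = resolve_mantle_token(
    {"NAME": "qwen.qwen3-32b", "TYPE": "mantle"},
    {"BEDROCK_API_KEY": "bedrock-api-key-XXXXXXXX"},
    mint_token=lambda: "minted-token",
)
```

Passing the environment and the minting function as arguments keeps the sketch easy to exercise without real credentials.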

API protocols. Mantle serves models under three protocols. Select with API_PROTOCOL (works for both chat and vision):

chat_completions (default) — base /v1, OpenAI Chat Completions API. Most models (Qwen, Gemma, GPT-OSS, DeepSeek, …).
responses — base /openai/v1, OpenAI Responses API. Required by models that only expose Responses, such as openai.gpt-5.4.
anthropic — base /anthropic, Anthropic Messages API. For Claude models (e.g. anthropic.claude-haiku-4-5).

# OpenAI Responses model (e.g. GPT-5.4)
MODEL_ID:
  NAME: openai.gpt-5.4
  TYPE: mantle
  REGION: us-west-2 # gpt-5.4 is in us-west-2, not us-east-1
  API_PROTOCOL: responses
  MAX_TOKENS: 8192

# Anthropic Claude model
MODEL_ID:
  NAME: anthropic.claude-haiku-4-5
  TYPE: mantle
  REGION: us-east-1
  API_PROTOCOL: anthropic
  MAX_TOKENS: 8192

ENDPOINT_URL is optional; it defaults to https://bedrock-mantle.<REGION>.api.aws/{v1 | openai/v1 | anthropic} depending on the protocol.
The Mantle catalog (Qwen, Mistral, DeepSeek, GLM, Gemma, Claude, GPT-5.4, …) differs from standard Bedrock and varies by account/region.
TYPE: mantle works for both MODEL_ID (chat) and VISION_MODEL_ID (image description) — vision-capable models like qwen.qwen3-vl-235b-a22b-instruct are supported.
Caveats: Pick the right API_PROTOCOL per model; using the wrong one returns a 400 "does not support the '/v1/…' API" error. The anthropic protocol requires the langchain-anthropic package (listed in requirements.txt). Models like anthropic.claude-fable-5 also require the account's data-retention mode to be provider_data_share; otherwise they are reported as unavailable.
Reasoning models need a generous MAX_TOKENS. Reasoning models (e.g. Grok, GPT-5) on the responses protocol spend output tokens reasoning before they answer. A small MAX_TOKENS can be consumed entirely by reasoning, leaving no answer — the agent detects this and tells you to raise MAX_TOKENS rather than returning an empty reply. Set it to a few thousand (e.g. 8192) for these models.
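
To make the ENDPOINT_URL default concrete, the mapping from API_PROTOCOL to base path listed above can be written as a small lookup. The function name default_mantle_endpoint is hypothetical; the URL pattern and the three base paths are the ones given in this section.

```python
MANTLE_BASE_PATHS = {
    "chat_completions": "v1",
    "responses": "openai/v1",
    "anthropic": "anthropic",
}


def default_mantle_endpoint(region: str, api_protocol: str = "chat_completions") -> str:
    """Build the default Mantle endpoint for a region and protocol (illustrative)."""
    try:
        path = MANTLE_BASE_PATHS[api_protocol]
    except KeyError:
        raise ValueError(f"unknown API_PROTOCOL: {api_protocol!r}") from None
    return f"https://bedrock-mantle.{region}.api.aws/{path}"


chat_url = default_mantle_endpoint("us-east-1")
responses_url = default_mantle_endpoint("us-west-2", "responses")
```

Setting ENDPOINT_URL explicitly would bypass a lookup like this entirely.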

For standard Bedrock (Converse API), ENDPOINT_URL is also accepted on MODEL_ID/VISION_MODEL_ID with TYPE: bedrock to override the default endpoint.

Ollama (Local)

MODEL_ID:
  NAME: qwen3-4b-thinking-2507-q6-k:latest
  TYPE: ollama
  HOST: localhost
  PORT: 11434
  REPETITION_PENALTY: 1.1
  PRESENCE_PENALTY: 1.5
  TEMPERATURE: 0.1
  TOP_P: 0.95

OpenAI

MODEL_ID:
  NAME: gpt-5-mini-2025-08-07
  TYPE: openai
  STREAM: true
  REASONING_EFFORT: medium
# Requires OPENAI_API_KEY environment variable

Anthropic (Claude API)

The direct Anthropic API (api.anthropic.com) via langchain-anthropic. This is distinct from the Bedrock Mantle anthropic protocol (which reaches Claude through Bedrock) — TYPE: anthropic talks to Anthropic directly. STOP maps to Anthropic's stop_sequences, and extended thinking is enabled with REASONING (+ optional REASONING_EFFORT / THINKING_TOKENS).

MODEL_ID:
  NAME: claude-opus-4-8
  TYPE: anthropic
  MAX_TOKENS: 4096
  TEMPERATURE: 0.4
  # REASONING: true          # enable extended thinking
  # REASONING_EFFORT: high   # low | medium | high | max
  # ENDPOINT_URL: https://...  # optional custom base URL
# Requires ANTHROPIC_API_KEY env var, or set MODEL_ID.API_KEY

Amazon SageMaker AI

MODEL_ID:
  NAME: your-endpoint-name
  TYPE: sagemaker
  REGION: us-east-1
  REPETITION_PENALTY: 1.1
  PRESENCE_PENALTY: 1.5
  TEMPERATURE: 0.1
  MAX_TOKENS: 4096

LiteLLM (100+ Providers)

MODEL_ID:
  NAME: openai/your-model-name
  TYPE: litellm
  API_BASE: http://localhost:8000/v1
  API_KEY: your-api-key
  TEMPERATURE: 0.1
  MAX_TOKENS: 4096

Vision Model Configuration

For Bedrock:

VISION_MODEL_ID:
  NAME: global.anthropic.claude-haiku-4-5-20251001-v1:0
  TYPE: bedrock
  REGION: us-east-1
  TEMPERATURE: 0.3

For Ollama:

VISION_MODEL_ID:
  NAME: qwen3-vl:2b
  TYPE: ollama
  HOST: localhost
  PORT: 11434
  TEMPERATURE: 0.3

For OpenAI:

VISION_MODEL_ID:
  NAME: gpt-5-mini-2025-08-07
  TYPE: openai
  STREAM: true
  REASONING_EFFORT: medium

For Anthropic (Claude is multimodal):

VISION_MODEL_ID:
  NAME: claude-opus-4-8
  TYPE: anthropic
  MAX_TOKENS: 1500
  TEMPERATURE: 0.3
# Requires ANTHROPIC_API_KEY env var, or set VISION_MODEL_ID.API_KEY

For SageMaker AI (endpoint must serve a vision-capable model accepting the OpenAI image format):

VISION_MODEL_ID:
  NAME: your-endpoint-name
  TYPE: sagemaker
  REGION: us-east-1
  INPUT_FORMAT: openai_chat
  TEMPERATURE: 0.3

For LiteLLM (any of its vision-capable models):

VISION_MODEL_ID:
  NAME: openai/gpt-4o # provider-prefixed model id
  TYPE: litellm
  API_BASE: http://localhost:4000 # optional (proxy / self-hosted)
  API_KEY: your-api-key # optional (else the provider's env var)

Model Parameters

This is the full reference for what you can put under MODEL_ID, VISION_MODEL_ID, and RAG.EMBED_MODEL_ID. Only NAME and TYPE are required; everything else is optional, and omitted keys fall back to the provider/model default. The interactive configurator (/config, /model) sets the common ones; use this reference to hand-tune config.yaml for anything else a provider or model supports.

Identity, connection & auth

Parameter	Applies to `TYPE`	Description
`NAME`	all (required)	Model id / Ollama model / Bedrock model id / Mantle bare id / SageMaker endpoint name
`TYPE`	all (required)	`ollama`, `bedrock`, `mantle`, `openai`, `anthropic`, `sagemaker`, `litellm` (embeddings: `ollama`, `bedrock`, `openai`, `sagemaker`, `litellm`)
`HOST`	`ollama`	Ollama host (default `localhost`)
`PORT`	`ollama`	Ollama port (default `11434`)
`REGION`	`bedrock`, `mantle`, `sagemaker`	AWS region (default `us-east-1`)
`API_PROTOCOL`	`mantle`	`chat_completions` (default), `responses`, or `anthropic`
`ENDPOINT_URL`	`bedrock`, `mantle`, `anthropic`	Override the default endpoint URL (Anthropic: custom base URL)
`API_KEY`	`mantle`, `anthropic`, `litellm`	Mantle: Bedrock API key (else `BEDROCK_API_KEY` env / minted token). Anthropic: else `ANTHROPIC_API_KEY` env. LiteLLM: provider key
`API_BASE`	`litellm`	LiteLLM API base URL
`INPUT_FORMAT`	`sagemaker`	`openai_chat` (default) or `huggingface`

Standard Bedrock also reads the AWS_BEARER_TOKEN_BEDROCK env var, and all AWS providers honor AWS_PROFILE — see the API-key/profile notes under Amazon Bedrock.

Inference parameters

Optional generation settings. The Honored by column lists the providers that actually send each one (others ignore it). These apply to MODEL_ID and VISION_MODEL_ID; EMBED_MODEL_ID takes none of them (embeddings only use NAME/TYPE + connection).

This table is derived from models/provider_params.py — the single source of truth that the controllers build their client kwargs from — so it reflects exactly what each provider's init path forwards. (mantle reads TEMPERATURE/MAX_TOKENS/TOP_P via the Mantle factory.)

Parameter	Description	Honored by (`MODEL_ID`)
`MAX_TOKENS`	Max output tokens to generate	ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm
`TEMPERATURE`	Sampling temperature	ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm
`TOP_P`	Top-p (nucleus) sampling	ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm
`TOP_K`	Top-k sampling	ollama, anthropic, sagemaker
`STOP`	Stop sequences (YAML list)	ollama, bedrock, anthropic, sagemaker, litellm
`STREAM`	Stream tokens (default `true`)	mantle, openai, anthropic, litellm
`PRESENCE_PENALTY`	Presence penalty	ollama, openai
`FREQUENCY_PENALTY`	Frequency penalty	ollama
`REPETITION_PENALTY`	Repetition penalty	ollama, litellm
`REASONING`	Enable extended thinking (boolean)	bedrock, anthropic
`THINKING_TOKENS`	Thinking token budget (default `2048`)	bedrock, anthropic
`REASONING_EFFORT`	Reasoning effort (provider-dependent: `none`/`minimal`/`low`/`medium`/`high`/`xhigh`/`max`)	bedrock, mantle, openai, anthropic, litellm

VISION_MODEL_ID supports the same seven providers as MODEL_ID. It accepts a subset of params: MAX_TOKENS/TEMPERATURE/TOP_P across providers, plus TOP_K on ollama/anthropic/sagemaker and STOP on ollama/sagemaker. Connection keys follow the provider: HOST/PORT for Ollama, REGION for the AWS providers, API_PROTOCOL for Mantle, INPUT_FORMAT for SageMaker, API_BASE/API_KEY for LiteLLM, and API_KEY/ENDPOINT_URL for Anthropic.

/params only offers what the provider supports. The set of tunable params is taken per-provider from the registry, so /params never prompts for — and never writes — a key the model ignores (e.g. Anthropic has no PRESENCE_PENALTY/FREQUENCY_PENALTY; only the params it honors are offered).
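
As a rough illustration of per-provider filtering, the "Honored by" table can be transcribed into a lookup and queried by provider. The names HONORED_BY and tunable_params are hypothetical; the real registry lives in models/provider_params.py and may be shaped differently.

```python
ALL_PROVIDERS = frozenset(
    {"ollama", "bedrock", "mantle", "openai", "anthropic", "sagemaker", "litellm"}
)

# Transcribed from the "Honored by" table for MODEL_ID.
HONORED_BY = {
    "MAX_TOKENS": ALL_PROVIDERS,
    "TEMPERATURE": ALL_PROVIDERS,
    "TOP_P": ALL_PROVIDERS,
    "TOP_K": {"ollama", "anthropic", "sagemaker"},
    "STOP": {"ollama", "bedrock", "anthropic", "sagemaker", "litellm"},
    "STREAM": {"mantle", "openai", "anthropic", "litellm"},
    "PRESENCE_PENALTY": {"ollama", "openai"},
    "FREQUENCY_PENALTY": {"ollama"},
    "REPETITION_PENALTY": {"ollama", "litellm"},
    "REASONING": {"bedrock", "anthropic"},
    "THINKING_TOKENS": {"bedrock", "anthropic"},
    "REASONING_EFFORT": {"openai", "anthropic", "bedrock", "mantle", "litellm"},
}


def tunable_params(provider: str) -> list[str]:
    """Return the inference parameters a provider honors, sorted by name."""
    return sorted(name for name, providers in HONORED_BY.items() if provider in providers)


mantle_params = tunable_params("mantle")
anthropic_params = tunable_params("anthropic")
```

A tuner built on a lookup like this never needs a hard-coded exclusion list: a parameter that a provider ignores simply does not appear in its result.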

REASONING_EFFORT is a single, first-class knob translated per provider. Set one effort value and mnemoai maps it to each provider's mechanism: forwarded as reasoning_effort on OpenAI and Mantle's responses protocol; mapped to a thinking token budget on Anthropic, standard Bedrock, and Mantle's anthropic protocol; passed through LiteLLM (which translates it per backend). When thinking is enabled this way, temperature/top_p/top_k are dropped automatically (the providers reject them). For finer control, set the raw provider parameter via EXTRA_PARAMS (below), which overrides this.
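
The per-provider translation described above can be sketched as a dispatch on TYPE and API_PROTOCOL. Both function names are hypothetical, the actual budget values are not documented here and are left out, and the "unspecified" result marks combinations this section does not describe.

```python
SAMPLING_KEYS = ("TEMPERATURE", "TOP_P", "TOP_K")


def reasoning_mechanism(model_type: str, api_protocol: str = "chat_completions") -> str:
    """Name the mechanism REASONING_EFFORT is translated to (illustrative)."""
    if model_type == "openai" or (model_type == "mantle" and api_protocol == "responses"):
        return "reasoning_effort"
    if model_type in ("anthropic", "bedrock") or (
        model_type == "mantle" and api_protocol == "anthropic"
    ):
        return "thinking_budget"
    if model_type == "litellm":
        return "passthrough"
    return "unspecified"  # not described in this section


def effective_params(params: dict, model_type: str, api_protocol: str = "chat_completions") -> dict:
    """Drop sampling keys when effort is mapped to a thinking budget."""
    thinking = (
        "REASONING_EFFORT" in params
        and reasoning_mechanism(model_type, api_protocol) == "thinking_budget"
    )
    if not thinking:
        return dict(params)
    return {key: value for key, value in params.items() if key not in SAMPLING_KEYS}


claude = effective_params(
    {"REASONING_EFFORT": "high", "TEMPERATURE": 0.4, "MAX_TOKENS": 4096}, "anthropic"
)
```

In this sketch the Claude section loses TEMPERATURE once effort is set, while an OpenAI section with the same keys is passed through unchanged.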

`EXTRA_PARAMS`: generic passthrough for anything else

The table above is the curated set. For provider-specific knobs it doesn't model, or new ones that ship after a release, add an EXTRA_PARAMS dict to any MODEL_ID / VISION_MODEL_ID. Its contents are forwarded verbatim to the underlying model's request body, with no interpretation by mnemoai, so you use the provider's own parameter names. This means new parameters need no code change. It works for every provider, and it is the right place for raw reasoning controls on Mantle (such as a thinking budget), since the REASONING and THINKING_TOKENS rows above do not cover mantle.

# OpenAI / GPT-5.x (TYPE: openai, or Mantle API_PROTOCOL: responses)
MODEL_ID:
  NAME: openai.gpt-5.5
  TYPE: mantle
  API_PROTOCOL: responses
  EXTRA_PARAMS:
    reasoning_effort: high      # none | low | medium | high | xhigh
    # verbosity: low

# Anthropic / Claude (TYPE: anthropic, or Mantle API_PROTOCOL: anthropic)
MODEL_ID:
  NAME: anthropic.claude-opus-4-8
  TYPE: mantle
  API_PROTOCOL: anthropic
  EXTRA_PARAMS:
    thinking: { type: enabled, budget_tokens: 10000 }

Notes: reasoning_effort is lifted to a first-class argument on OpenAI-family clients (so it isn't double-specified); everything else is merged into model_kwargs. A non-dict EXTRA_PARAMS is ignored rather than crashing. It is not offered by the /params interactive tuner (it's a free-form dict, not a scalar) — set it in config.yaml directly.
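
The merge rules in the notes above can be expressed as a short function. This is a sketch, not the controllers' actual code: apply_extra_params is a hypothetical name, and whether a given client counts as OpenAI-family is passed in as a flag because this page does not list those clients.

```python
def apply_extra_params(client_kwargs: dict, extra_params: object, openai_family: bool) -> dict:
    """Merge EXTRA_PARAMS into client kwargs following the documented rules."""
    merged = dict(client_kwargs)
    if not isinstance(extra_params, dict):
        return merged  # a non-dict EXTRA_PARAMS is ignored rather than crashing
    extras = dict(extra_params)
    if openai_family and "reasoning_effort" in extras:
        # Lifted to a first-class argument so it is not specified twice.
        merged["reasoning_effort"] = extras.pop("reasoning_effort")
    if extras:
        merged["model_kwargs"] = {**merged.get("model_kwargs", {}), **extras}
    return merged


openai_kwargs = apply_extra_params(
    {"model": "openai.gpt-5.5"},
    {"reasoning_effort": "high", "verbosity": "low"},
    openai_family=True,
)
claude_kwargs = apply_extra_params(
    {"model": "anthropic.claude-opus-4-8"},
    {"thinking": {"type": "enabled", "budget_tokens": 10000}},
    openai_family=False,
)
```

Here openai_kwargs carries reasoning_effort at the top level and verbosity inside model_kwargs, while claude_kwargs keeps the whole thinking block inside model_kwargs.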

Provider-appropriate tuning matters. Newer Claude and GPT models reject TEMPERATURE outright; STOP, penalties, and TOP_K are largely Ollama/SageMaker concepts. When /model switches a section's provider, it automatically drops the keys the new provider doesn't consume; for everything else, edit config.yaml to match what your specific provider/model accepts.

The context window is set separately, at the top level (it's not part of a model section): MAX_CONVERSATION_TOKENS (see General Parameters below).

General Parameters

# Context window size (passed to model as num_ctx for Ollama)
MAX_CONVERSATION_TOKENS: 65536

# Maximum tokens when reading documents (CSV, JSON, text files)
DOC_MAX_TOKENS: 16384

# Profile configuration
PROFILE:
  NAME: default # Used for session data isolation (~/.mnemoai/{NAME}/)
  USE_PROFILING: true # Enable automatic user profiling

Embeddings Configuration

Embeddings settings are nested under the RAG section:

RAG:
  EMBEDDINGS:
    CACHE_ENABLED: true # LRU cache for embedding vectors (avoids re-embedding same text)
    CACHE_SIZE: 1000 # Maximum cached embeddings
    FALLBACK_ENABLED: true # Fall back to SHA256 if embedding model unavailable
    FALLBACK_TYPE: "sha256" # Fallback type (sha256, random, zeros)

LLM Interaction Configuration

LLM:
  ENABLE_THINKING: true # Enable thinking tags (verbose mode)
  RETRY_ENABLED: true # Retry failed LLM calls
  MAX_RETRIES: 3 # Maximum retry attempts
  RETRY_DELAY: 1.0 # Seconds between retries
  RETRY_BACKOFF: 2.0 # Exponential backoff multiplier
  SUMMARIZATION_THINK: false # Include thinking in summarization
  TOKEN_COUNTING:
    OLLAMA_APPROXIMATION: 1.3 # Chars-to-tokens multiplier for Ollama
    FALLBACK_MODEL: "gpt-4" # Tiktoken model for fallback counting
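
To show how RETRY_DELAY and RETRY_BACKOFF interact, here is a minimal sketch that assumes the wait before retry n is RETRY_DELAY × RETRY_BACKOFF^n and that MAX_RETRIES counts retries after the first attempt. Both assumptions and both function names are illustrative; the application's retry loop may count attempts or compute delays differently.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_delays(max_retries: int, retry_delay: float, retry_backoff: float) -> list[float]:
    """Waits before each retry, assuming delay * backoff**attempt."""
    return [retry_delay * retry_backoff ** attempt for attempt in range(max_retries)]


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 1.0,
    retry_backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing waits."""
    for delay in retry_delays(max_retries, retry_delay, retry_backoff):
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt; an exception here propagates to the caller


delays = retry_delays(3, 1.0, 2.0)  # the defaults shown in the LLM section
```

With the default values, the waits under these assumptions are 1, 2, and 4 seconds.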

System Prompt

The system prompt lives in prompts.yaml (a sibling of config.yaml, in the same config/ directory — all model-facing prompts live there, separate from configuration). Customize the SYSTEM_PROMPT field to change the assistant's personality, instructions, and tool usage patterns. Key sections in the default prompt:

<identity>: Basic identity and core principles
<reasoning_discipline>: Thinking rules and loop detection
<output_format>: Response formatting requirements
<information_sources>: RAG vs web vs internal knowledge decision tree
<file_operations>: Read/write/edit workflow rules
<search_tools>: Glob and grep usage guidance
<git_operations>: Git safety rules
<task_management>: Todo, plan mode, and background task rules
<error_handling>: Error response guidelines
<communication>: Style and security rules

RAG Configuration

ENABLE_RAG: true # Master toggle for RAG system
RAG:
  MAX_TOKENS: 8192 # Threshold: documents above this are ingested into RAG
  CHUNK_TOKENS: 1024 # Chunk size in tokens (recommended: 512-2048)
  SEARCH:
    SEMANTIC_WEIGHT: 0.5 # Semantic similarity weight (0-1)
    KEYWORD_WEIGHT: 0.5 # BM25 keyword weight (0-1)
  VECTOR_STORE:
    TYPE: chromadb # Vector store backend: "faiss" or "chromadb"
  EMBEDDINGS:
    CACHE_ENABLED: true
    CACHE_SIZE: 1000
    FALLBACK_ENABLED: true
    FALLBACK_TYPE: "sha256"

Requires: An embedding model configured via RAG.EMBED_MODEL_ID (see Embeddings Model).
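
One common way to combine SEMANTIC_WEIGHT and KEYWORD_WEIGHT is a weighted sum of the two scores. The sketch below assumes exactly that, with both scores already normalized to the 0-1 range; the page does not state the formula, so treat hybrid_score and rank as hypothetical.

```python
def hybrid_score(
    semantic: float,
    keyword: float,
    semantic_weight: float = 0.5,
    keyword_weight: float = 0.5,
) -> float:
    """Weighted sum of a semantic-similarity score and a BM25-style keyword score."""
    return semantic_weight * semantic + keyword_weight * keyword


def rank(candidates: dict, semantic_weight: float, keyword_weight: float) -> list[str]:
    """Order candidate ids by descending hybrid score."""
    return sorted(
        candidates,
        key=lambda name: hybrid_score(*candidates[name], semantic_weight, keyword_weight),
        reverse=True,
    )


# (semantic score, keyword score) per chunk; the values are made up for illustration.
chunks = {"a": (0.9, 0.1), "b": (0.6, 0.7)}
balanced = rank(chunks, 0.5, 0.5)
semantic_heavy = rank(chunks, 0.7, 0.3)
```

With equal weights, chunk b ranks first because of its keyword match; shifting to 0.7/0.3 puts the semantically closer chunk a on top.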

Episodic Memory Configuration

ENABLE_EPISODIC_MEMORY: true
EPISODIC_MEMORY:
  STORE_TYPE: chromadb # or faiss
  # Similarity Thresholds
  DUPLICATE_THRESHOLD: 0.95 # Higher = stricter duplicate detection
  RETRIEVAL_THRESHOLD: 0.7 # Minimum similarity to retrieve episodes
  FOLLOW_UP_THRESHOLD: 0.4 # Similarity to detect follow-up questions (skips injection)
  REDUNDANCY_THRESHOLD: 0.5 # Filter episodes redundant with conversation
  # Hybrid Search Weights
  SEMANTIC_WEIGHT: 0.7 # Semantic similarity weight (0-1)
  KEYWORD_WEIGHT: 0.3 # Keyword matching weight (0-1)
  # Token and Size Limits
  MAX_TOKENS_PER_EPISODE: 400 # Max tokens for episode text
  MAX_EPISODES: 1000 # Maximum stored episodes
  MAX_AGE_DAYS: 90 # Maximum episode age in days
  # Success Detection
  SUCCESS_MARKERS: # Phrases that indicate task success
    - thanks
    - perfect
    - great
    - worked
  CORRECTION_MARKERS: # Phrases that indicate errors
    - wrong
    - error
    - fix
    - actually
  # Storage Behavior
  IMMEDIATE_STORAGE: true # Store episodes immediately
  MIN_TOOLS_OR_LENGTH: 300 # Min response length if no tools used
  # Query Enhancement
  ENABLE_QUERY_EXPANSION: true # Expand queries with synonyms
  QUERY_EXPANSION_TERMS: 3 # Max terms to add per query
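
As a rough picture of how the similarity thresholds could be applied, the sketch below treats DUPLICATE_THRESHOLD as a store-time check and RETRIEVAL_THRESHOLD as a query-time filter. The function names and the inclusive comparisons (>=) are assumptions; the actual episodic memory code is not shown on this page.

```python
def is_duplicate(similarity_to_stored: float, duplicate_threshold: float = 0.95) -> bool:
    """Store-time check: skip an episode that nearly matches an existing one."""
    return similarity_to_stored >= duplicate_threshold


def retrievable(
    scored_episodes: list[tuple[str, float]],
    retrieval_threshold: float = 0.7,
) -> list[tuple[str, float]]:
    """Query-time filter: keep episodes at or above the threshold, best first."""
    kept = [(text, score) for text, score in scored_episodes if score >= retrieval_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Episode texts and scores are made up for illustration.
candidates = [("fixed the build", 0.82), ("renamed a file", 0.55), ("updated CI", 0.71)]
selected = retrievable(candidates)
```

Raising RETRIEVAL_THRESHOLD in a scheme like this trades recall for precision: fewer episodes are injected, but those that remain are closer matches.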