Configuration¶
🔧 Configuration¶
Model Configuration¶
The assistant supports multiple model types:
Amazon Bedrock¶
Note: Newer Claude models on Bedrock reject
temperatureas deprecated. OmitTEMPERATUREfor those — it is only sent when explicitly configured.Using a named AWS profile (Bedrock, SageMaker, Mantle). These providers use the standard boto3 credential chain (default profile / env vars / instance role). To select a specific named profile instead, set
AWS_PROFILEvia the configENV:section — values there are exported as environment variables at startup, and boto3 picks them up automatically. No model-level config key is needed:Using a Bedrock API key (instead of AWS credentials). Bedrock supports short-term API keys (a
bedrock-api-key-...value from the console). For standard Bedrock (TYPE: bedrock), set it asAWS_BEARER_TOKEN_BEDROCK—langchain-awsreads it automatically, no model config needed:(For Mantle, the same key is supplied differently — see the Mantle section below.)
Amazon Bedrock Mantle¶
Bedrock Mantle is an OpenAI-compatible API (not the Bedrock Converse API). By default it authenticates with a short-lived bearer token minted from your standard AWS credentials via aws-bedrock-token-generator, so your normal aws configure / SSO setup works — no extra keys to manage. Use TYPE: mantle and a bare model ID from the Mantle catalog.
MODEL_ID:
NAME: qwen.qwen3-32b # bare Mantle model id (e.g. anthropic.claude-opus-4-8)
TYPE: mantle
REGION: us-east-1
MAX_TOKENS: 8192
Authenticating with a Bedrock API key (no AWS credentials). Instead of minting a token, you can supply a short-term Bedrock API key directly. Mantle reads it from the BEDROCK_API_KEY environment variable (set it via the config ENV: section), or from a per-model API_KEY field. When a key is present it's used as-is; otherwise the app falls back to minting from AWS credentials. (Note: standard Bedrock uses AWS_BEARER_TOKEN_BEDROCK for the same key — Mantle uses BEDROCK_API_KEY.)
# Option A — environment variable (applies to all Mantle calls)
ENV:
BEDROCK_API_KEY: bedrock-api-key-XXXXXXXX
# Option B — per-model key
MODEL_ID:
NAME: qwen.qwen3-32b
TYPE: mantle
REGION: us-east-1
API_KEY: bedrock-api-key-XXXXXXXX
API protocols. Mantle serves models under three protocols. Select with API_PROTOCOL (works for both chat and vision):
chat_completions(default) — base/v1, OpenAI Chat Completions API. Most models (Qwen, Gemma, GPT-OSS, DeepSeek, …).responses— base/openai/v1, OpenAI Responses API. Required by models that only expose Responses, such asopenai.gpt-5.4.anthropic— base/anthropic, Anthropic Messages API. For Claude models (e.g.anthropic.claude-haiku-4-5).
# OpenAI Responses model (e.g. GPT-5.4)
MODEL_ID:
NAME: openai.gpt-5.4
TYPE: mantle
REGION: us-west-2 # gpt-5.4 is in us-west-2, not us-east-1
API_PROTOCOL: responses
MAX_TOKENS: 8192
# Anthropic Claude model
MODEL_ID:
NAME: anthropic.claude-haiku-4-5
TYPE: mantle
REGION: us-east-1
API_PROTOCOL: anthropic
MAX_TOKENS: 8192
ENDPOINT_URLis optional; it defaults tohttps://bedrock-mantle.<REGION>.api.aws/{v1 | openai/v1 | anthropic}depending on the protocol.- The Mantle catalog (Qwen, Mistral, DeepSeek, GLM, Gemma, Claude, GPT-5.4, …) differs from standard Bedrock and varies by account/region.
TYPE: mantleworks for bothMODEL_ID(chat) andVISION_MODEL_ID(image description) — vision-capable models likeqwen.qwen3-vl-235b-a22b-instructare supported.- Caveats: Pick the right
API_PROTOCOLper model (using the wrong one returns a 400 "does not support the '/v1/…' API" error).anthropicrequires thelangchain-anthropicpackage (inrequirements.txt). Models likeanthropic.claude-fable-5also require the account's data-retention mode to beprovider_data_share, otherwise they reportunavailable. - Reasoning models need a generous
MAX_TOKENS. Reasoning models (e.g. Grok, GPT-5) on theresponsesprotocol spend output tokens reasoning before they answer. A smallMAX_TOKENScan be consumed entirely by reasoning, leaving no answer — the agent detects this and tells you to raiseMAX_TOKENSrather than returning an empty reply. Set it to a few thousand (e.g.8192) for these models.
For standard Bedrock (Converse API),
ENDPOINT_URLis also accepted onMODEL_ID/VISION_MODEL_IDwithTYPE: bedrockto override the default endpoint.
Ollama (Local)¶
MODEL_ID:
NAME: qwen3-4b-thinking-2507-q6-k:latest
TYPE: ollama
HOST: localhost
PORT: 11434
REPETITION_PENALTY: 1.1
PRESENCE_PENALTY: 1.5
TEMPERATURE: 0.1
TOP_P: 0.95
OpenAI¶
MODEL_ID:
NAME: gpt-5-mini-2025-08-07
TYPE: openai
STREAM: true
REASONING_EFFORT: medium
# Requires OPENAI_API_KEY environment variable
Anthropic (Claude API)¶
The direct Anthropic API (api.anthropic.com) via langchain-anthropic. This is distinct from the Bedrock Mantle anthropic protocol (which reaches Claude through Bedrock) — TYPE: anthropic talks to Anthropic directly. STOP maps to Anthropic's stop_sequences, and extended thinking is enabled with REASONING (+ optional REASONING_EFFORT / THINKING_TOKENS).
MODEL_ID:
NAME: claude-opus-4-8
TYPE: anthropic
MAX_TOKENS: 4096
TEMPERATURE: 0.4
# REASONING: true # enable extended thinking
# REASONING_EFFORT: high # low | medium | high | max
# ENDPOINT_URL: https://... # optional custom base URL
# Requires ANTHROPIC_API_KEY env var, or set MODEL_ID.API_KEY
Amazon SageMaker AI¶
MODEL_ID:
NAME: your-endpoint-name
TYPE: sagemaker
REGION: us-east-1
REPETITION_PENALTY: 1.1
PRESENCE_PENALTY: 1.5
TEMPERATURE: 0.1
MAX_TOKENS: 4096
LiteLLM (100+ Providers)¶
MODEL_ID:
NAME: openai/your-model-name
TYPE: litellm
API_BASE: http://localhost:8000/v1
API_KEY: your-api-key
TEMPERATURE: 0.1
MAX_TOKENS: 4096
Vision Model Configuration¶
For Bedrock:
VISION_MODEL_ID:
NAME: global.anthropic.claude-haiku-4-5-20251001-v1:0
TYPE: bedrock
REGION: us-east-1
TEMPERATURE: 0.3
For Ollama:
For OpenAI:
For Anthropic (Claude is multimodal):
VISION_MODEL_ID:
NAME: claude-opus-4-8
TYPE: anthropic
MAX_TOKENS: 1500
TEMPERATURE: 0.3
# Requires ANTHROPIC_API_KEY env var, or set VISION_MODEL_ID.API_KEY
For SageMaker AI (endpoint must serve a vision-capable model accepting the OpenAI image format):
VISION_MODEL_ID:
NAME: your-endpoint-name
TYPE: sagemaker
REGION: us-east-1
INPUT_FORMAT: openai_chat
TEMPERATURE: 0.3
For LiteLLM (any of its vision-capable models):
VISION_MODEL_ID:
NAME: openai/gpt-4o # provider-prefixed model id
TYPE: litellm
API_BASE: http://localhost:4000 # optional (proxy / self-hosted)
API_KEY: your-api-key # optional (else the provider's env var)
Model Parameters¶
This is the full reference for what you can put under MODEL_ID,
VISION_MODEL_ID, and RAG.EMBED_MODEL_ID. Only NAME and TYPE are
required; everything else is optional and omitted keys fall back to the
provider/model default. The interactive configurator (/config, /model)
sets the common ones — use this reference to hand-tune config.yaml for
anything else a provider or model supports.
Identity, connection & auth¶
| Parameter | Applies to TYPE |
Description |
|---|---|---|
NAME |
all (required) | Model id / Ollama model / Bedrock model id / Mantle bare id / SageMaker endpoint name |
TYPE |
all (required) | ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm (embeddings: ollama, bedrock, openai, sagemaker, litellm) |
HOST |
ollama |
Ollama host (default localhost) |
PORT |
ollama |
Ollama port (default 11434) |
REGION |
bedrock, mantle, sagemaker |
AWS region (default us-east-1) |
API_PROTOCOL |
mantle |
chat_completions (default), responses, or anthropic |
ENDPOINT_URL |
bedrock, mantle, anthropic |
Override the default endpoint URL (Anthropic: custom base URL) |
API_KEY |
mantle, anthropic, litellm |
Mantle: Bedrock API key (else BEDROCK_API_KEY env / minted token). Anthropic: else ANTHROPIC_API_KEY env. LiteLLM: provider key |
API_BASE |
litellm |
LiteLLM API base URL |
INPUT_FORMAT |
sagemaker |
openai_chat (default) or huggingface |
Standard Bedrock also reads the
AWS_BEARER_TOKEN_BEDROCKenv var, and all AWS providers honorAWS_PROFILE— see the API-key/profile notes under Amazon Bedrock.
Inference parameters¶
Optional generation settings. The Honored by column lists the providers that
actually send each one (others ignore it). These apply to MODEL_ID and
VISION_MODEL_ID; EMBED_MODEL_ID takes none of them (embeddings only use
NAME/TYPE + connection).
This table is derived from models/provider_params.py — the single source of
truth that the controllers build their client kwargs from — so it reflects
exactly what each provider's init path forwards. (mantle reads
TEMPERATURE/MAX_TOKENS/TOP_P via the Mantle factory.)
| Parameter | Description | Honored by (MODEL_ID) |
|---|---|---|
MAX_TOKENS |
Max output tokens to generate | ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm |
TEMPERATURE |
Sampling temperature | ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm |
TOP_P |
Top-p (nucleus) sampling | ollama, bedrock, mantle, openai, anthropic, sagemaker, litellm |
TOP_K |
Top-k sampling | ollama, anthropic, sagemaker |
STOP |
Stop sequences (YAML list) | ollama, bedrock, anthropic, sagemaker, litellm |
STREAM |
Stream tokens (default true) |
mantle, openai, anthropic, litellm |
PRESENCE_PENALTY |
Presence penalty | ollama, openai |
FREQUENCY_PENALTY |
Frequency penalty | ollama |
REPETITION_PENALTY |
Repetition penalty | ollama, litellm |
REASONING |
Enable extended thinking (boolean) | bedrock, anthropic |
THINKING_TOKENS |
Thinking token budget (default 2048) |
bedrock, anthropic |
REASONING_EFFORT |
reasoning effort (provider-dependent: none/minimal/low/medium/high/xhigh/max) |
openai, anthropic, bedrock, mantle, litellm |
VISION_MODEL_ID supports the same seven providers as MODEL_ID. It accepts a
subset of params: MAX_TOKENS/TEMPERATURE/TOP_P across providers, plus
TOP_K on ollama/anthropic/sagemaker and STOP on ollama/sagemaker. Connection
keys follow the provider (host/port, region, Mantle protocol, SageMaker
INPUT_FORMAT, LiteLLM/Anthropic API_BASE/API_KEY/base URL).
/paramsonly offers what the provider supports. The set of tunable params is taken per-provider from the registry, so/paramsnever prompts for — and never writes — a key the model ignores (e.g. Anthropic has noPRESENCE_PENALTY/FREQUENCY_PENALTY; only the params it honors are offered).
REASONING_EFFORTis a single, first-class knob translated per provider. Set one effort value and mnemoai maps it to each provider's mechanism: forwarded asreasoning_efforton OpenAI and Mantle'sresponsesprotocol; mapped to athinkingtoken budget on Anthropic, standard Bedrock, and Mantle'santhropicprotocol; passed through LiteLLM (which translates it per backend). When thinking is enabled this way,temperature/top_p/top_kare dropped automatically (the providers reject them). For finer control, set the raw provider parameter viaEXTRA_PARAMS(below), which overrides this.
EXTRA_PARAMS — generic passthrough for anything else¶
The table above is the curated set. For provider-specific knobs it doesn't model
— or new ones that ship after a release — add an EXTRA_PARAMS dict to any
MODEL_ID / VISION_MODEL_ID. Its contents are forwarded verbatim to the
underlying model's request body, with no interpretation by mnemoai, so you
use the provider's own parameter names. This means new parameters need no
code change. Works for every provider; it's the right place for reasoning
controls on Mantle, which the curated columns don't cover.
# OpenAI / GPT-5.x (TYPE: openai, or Mantle API_PROTOCOL: responses)
MODEL_ID:
NAME: openai.gpt-5.5
TYPE: mantle
API_PROTOCOL: responses
EXTRA_PARAMS:
reasoning_effort: high # none | low | medium | high | xhigh
# verbosity: low
# Anthropic / Claude (TYPE: anthropic, or Mantle API_PROTOCOL: anthropic)
MODEL_ID:
NAME: anthropic.claude-opus-4-8
TYPE: mantle
API_PROTOCOL: anthropic
EXTRA_PARAMS:
thinking: { type: enabled, budget_tokens: 10000 }
Notes: reasoning_effort is lifted to a first-class argument on OpenAI-family
clients (so it isn't double-specified); everything else is merged into
model_kwargs. A non-dict EXTRA_PARAMS is ignored rather than crashing. It is
not offered by the /params interactive tuner (it's a free-form dict, not a
scalar) — set it in config.yaml directly.
Provider-appropriate tuning matters. Newer Claude and GPT models reject
TEMPERATUREoutright;STOP, penalties, andTOP_Kare largely Ollama/SageMaker concepts. When/modelswitches a section's provider it drops the keys the new provider doesn't consume for you, but for everything else editconfig.yamlto match what your specific provider/model accepts.
The context window is set separately, at the top level (it's not part of a model
section): MAX_CONVERSATION_TOKENS (see General Parameters below).
General Parameters¶
# Context window size (passed to model as num_ctx for Ollama)
MAX_CONVERSATION_TOKENS: 65536
# Maximum tokens when reading documents (CSV, JSON, text files)
DOC_MAX_TOKENS: 16384
# Profile configuration
PROFILE:
NAME: default # Used for session data isolation (~/.mnemoai/{NAME}/)
USE_PROFILING: true # Enable automatic user profiling
Embeddings Configuration¶
Embeddings settings are nested under the RAG section:
RAG:
EMBEDDINGS:
CACHE_ENABLED: true # LRU cache for embedding vectors (avoids re-embedding same text)
CACHE_SIZE: 1000 # Maximum cached embeddings
FALLBACK_ENABLED: true # Fall back to SHA256 if embedding model unavailable
FALLBACK_TYPE: "sha256" # Fallback type (sha256, random, zeros)
LLM Interaction Configuration¶
LLM:
ENABLE_THINKING: true # Enable thinking tags (verbose mode)
RETRY_ENABLED: true # Retry failed LLM calls
MAX_RETRIES: 3 # Maximum retry attempts
RETRY_DELAY: 1.0 # Seconds between retries
RETRY_BACKOFF: 2.0 # Exponential backoff multiplier
SUMMARIZATION_THINK: false # Include thinking in summarization
TOKEN_COUNTING:
OLLAMA_APPROXIMATION: 1.3 # Chars-to-tokens multiplier for Ollama
FALLBACK_MODEL: "gpt-4" # Tiktoken model for fallback counting
System Prompt¶
The system prompt lives in prompts.yaml (a sibling of config.yaml, in the same config/ directory — all model-facing prompts live there, separate from configuration). Customize the SYSTEM_PROMPT field to change the assistant's personality, instructions, and tool usage patterns. Key sections in the default prompt:
<identity>: Basic identity and core principles<reasoning_discipline>: Thinking rules and loop detection<output_format>: Response formatting requirements<information_sources>: RAG vs web vs internal knowledge decision tree<file_operations>: Read/write/edit workflow rules<search_tools>: Glob and grep usage guidance<git_operations>: Git safety rules<task_management>: Todo, plan mode, and background task rules<error_handling>: Error response guidelines<communication>: Style and security rules
RAG Configuration¶
ENABLE_RAG: true # Master toggle for RAG system
RAG:
MAX_TOKENS: 8192 # Threshold: documents above this are ingested into RAG
CHUNK_TOKENS: 1024 # Chunk size in tokens (recommended: 512-2048)
SEARCH:
SEMANTIC_WEIGHT: 0.5 # Semantic similarity weight (0-1)
KEYWORD_WEIGHT: 0.5 # BM25 keyword weight (0-1)
VECTOR_STORE:
TYPE: chromadb # Vector store backend: "faiss" or "chromadb"
EMBEDDINGS:
CACHE_ENABLED: true
CACHE_SIZE: 1000
FALLBACK_ENABLED: true
FALLBACK_TYPE: "sha256"
Requires: An embedding model configured via RAG.EMBED_MODEL_ID (see Embeddings Model).
Episodic Memory Configuration¶
ENABLE_EPISODIC_MEMORY: true
EPISODIC_MEMORY:
STORE_TYPE: chromadb # or faiss
# Similarity Thresholds
DUPLICATE_THRESHOLD: 0.95 # Higher = stricter duplicate detection
RETRIEVAL_THRESHOLD: 0.7 # Minimum similarity to retrieve episodes
FOLLOW_UP_THRESHOLD: 0.4 # Similarity to detect follow-up questions (skips injection)
REDUNDANCY_THRESHOLD: 0.5 # Filter episodes redundant with conversation
# Hybrid Search Weights
SEMANTIC_WEIGHT: 0.7 # Semantic similarity weight (0-1)
KEYWORD_WEIGHT: 0.3 # Keyword matching weight (0-1)
# Token and Size Limits
MAX_TOKENS_PER_EPISODE: 400 # Max tokens for episode text
MAX_EPISODES: 1000 # Maximum stored episodes
MAX_AGE_DAYS: 90 # Maximum episode age in days
# Success Detection
SUCCESS_MARKERS: # Phrases that indicate task success
- thanks
- perfect
- great
- worked
CORRECTION_MARKERS: # Phrases that indicate errors
- wrong
- error
- fix
- actually
# Storage Behavior
IMMEDIATE_STORAGE: true # Store episodes immediately
MIN_TOOLS_OR_LENGTH: 300 # Min response length if no tools used
# Query Enhancement
ENABLE_QUERY_EXPANSION: true # Expand queries with synonyms
QUERY_EXPANSION_TERMS: 3 # Max terms to add per query
Requires: An embedding model configured via RAG.EMBED_MODEL_ID (see Embeddings Model).
How it works:
- Automatically stores successful task completions with full conversation context
- Uses hybrid search (70% semantic + 30% BM25) to find similar past tasks
- Conversation-aware injection: Only injects episodic memory when relevant
- Detects follow-up questions and skips injection (uses conversation context instead)
- Filters out episodes redundant with current conversation
- Uses semantic similarity (with embeddings) or Jaccard similarity (fallback)
- Injects compact context showing: task → tools used → outcome
- Automatic cleanup: keeps max 1000 episodes, removes entries older than 90 days
Success detection:
- User feedback: "thanks", "perfect", "great"
- No error markers in response
- All tools executed successfully
- Filters out simple greetings and short responses
Embeddings Model¶
All embedding configuration is nested under RAG::
For Bedrock:
For Ollama:
For OpenAI:
For SageMaker:
For LiteLLM (any of its 100+ providers via one OpenAI-style API):
RAG:
EMBED_MODEL_ID:
NAME: openai/text-embedding-3-small # provider-prefixed model id
TYPE: litellm
API_BASE: http://localhost:4000 # optional (proxy / self-hosted)
API_KEY: your-api-key # optional (else the provider's env var)
Vector Store Options:
- ChromaDB (default): Persistent vector database with built-in metadata support
- FAISS: Fast, in-memory vector search with disk persistence
Switch between stores by changing RAG.VECTOR_STORE.TYPE in config. The system uses a controller pattern, so all RAG functionality works identically regardless of the store.