Tools & Registry Reference
Built-in tool catalog and the async discovery/registration system that connects tools to the pipeline executor.
Quick reference
| Tool | DSL name | LLM required | What it does |
|---|---|---|---|
| ingest_document | ingest_document | For OCR only | Load and parse a document; OCR any scanned pages |
| select | select | For NL selection | Filter a document to relevant pages |
| extract_from_texts | extract_from_texts | Yes | Freeform field extraction from document text |
| extract_from_tables | extract_from_tables | Yes | Row/column/cell extraction from tabular data |
| extract_fields | extract_fields | Yes | Schema-bound typed field extraction |
| extract_chart | extract_chart | — | Chart data extraction (stub) |
| load_schema | load_schema | No | Resolve a field schema from file, dict, or document |
| llm_job | llm_job | Yes | General LLM reasoning and generation |
| fetch_data | fetch_data | No | Fetch SEC EDGAR filings or HTTP URLs |
| search_web | search_web | No | Web search via DuckDuckGo or SerpAPI |
| compute | compute | No | Invoke a registered deterministic function |
| store | store | No | Persist a value to the session blackboard |
| export | export | No | Write output to JSON, Markdown, CSV, or XLSX |
ingest_document
Loads a document and fully populates the text of every page. For digital PDFs the native text layer is extracted. For scanned pages (image coverage above the threshold), a vision LLM is called to OCR each page. After this tool runs, all pages in the returned DocumentHandle are ready for select, extract_from_texts, and extract_fields — those tools never need to consider OCR.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| path | str, URL, list, or DocumentHandle | Yes | — | File path, HTTPS URL, list of paths, or the output dict of fetch_data |
| model | str | No | INGEST_OCR_MODEL env var or openai/gpt-4o | LiteLLM model string used for OCR (ignored for native-text pages) |
Output — DocumentHandle
{
"source": "https://example.com/report.pdf",
"format": "PDF", # PDF | TEXT | IMAGE | XLSX | DOCX | UNKNOWN
"page_count": 42,
"is_scanned": False, # True if any page was OCR'd
"pages": [
{
"number": 1, # 1-based
"text": "Annual Report...", # populated for every page
"is_scanned": False,
"sheet_name": None # set for XLSX sheets
},
# ...
]
}
Environment variables
| Variable | Default | Effect |
|---|---|---|
| INGEST_OCR_MODEL | openai/gpt-4o | Vision model for OCR |
| PYMUPDF_RASTERIZE_COVERAGE_THRESHOLD | 0.25 | Image-coverage ratio above which a page is rasterized |
| PYMUPDF_RASTERIZE_DPI | 150 | DPI used when rasterizing pages |
| EXTRACT_MIN_NATIVE_CHARS | 80 | Char count below which OCR is preferred over native text |
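The two thresholds above suggest a per-page decision along the following lines. This is a minimal sketch rather than the tool's actual code: the function name `needs_ocr`, the `env` parameter, and the choice to OCR when either condition holds are assumptions. Only the variable names and defaults come from the table.

```python
import os
from typing import Mapping, Optional


def needs_ocr(
    image_coverage: float,
    native_text: str,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical per-page OCR decision (not the library's implementation)."""
    env = os.environ if env is None else env
    # Defaults mirror the environment-variable table above.
    coverage_threshold = float(env.get("PYMUPDF_RASTERIZE_COVERAGE_THRESHOLD", "0.25"))
    min_native_chars = int(env.get("EXTRACT_MIN_NATIVE_CHARS", "80"))

    mostly_image = image_coverage > coverage_threshold
    too_little_text = len(native_text.strip()) < min_native_chars
    # Assumption: either signal alone is enough to prefer OCR.
    return mostly_image or too_little_text
```

Passing an explicit `env` mapping keeps the decision testable without touching the process environment.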
Minimal example
Pass the output of fetch_data directly — ingest_document understands the SEC EDGAR result structure:
- id: fetch
  tool: fetch_data
  inputs:
    source: sec_edgar
    ticker: AAPL
    period_end: "2024-09-30"
    count: 1
- id: ingest
  tool: ingest_document
  inputs:
    path: "{{fetch.output}}"  # passes the full fetch_data result dict
select
Filters a DocumentHandle or PageList to the subset of pages that match a selection criterion. Three selection modes, in priority order, plus a passthrough fallback:
- Explicit page numbers — pass pages: [3, 5, 12]; no LLM call, fastest
- BM25 chunk retrieval — granularity: chunk; builds an in-memory BM25 index over structural chunks and scores them against a keyword query; no LLM call; best for large documents
- NL prompt — granularity: page (default); the LLM reads a page inventory (page number + first 300 chars) and returns the relevant page numbers
- Passthrough — if neither pages nor prompt is provided, all pages are returned unchanged
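The priority order above can be sketched as a small dispatch function. This is an illustration only: `choose_select_mode` and the returned labels are hypothetical names, not part of the tool's API.

```python
from typing import List, Optional


def choose_select_mode(
    pages: Optional[List[int]] = None,
    prompt: Optional[str] = None,
    granularity: str = "page",
) -> str:
    """Return which selection mode would apply (hypothetical helper)."""
    if pages:
        # Explicit page numbers take priority over everything else.
        return "explicit_pages"
    if prompt and granularity == "chunk":
        # Keyword query scored with BM25; no LLM call.
        return "bm25_chunk"
    if prompt:
        # Default page mode: the LLM picks pages from an inventory.
        return "nl_page"
    # Neither pages nor prompt: return everything unchanged.
    return "passthrough"
```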
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| document | DocumentHandle, PageList, list, or str | Yes | — | Ingested document; pages must already have text populated |
| prompt | str | No | None | Keyword query (chunk mode) or NL instruction (page mode) |
| pages | list[int] | No | None | Explicit 1-based page numbers to select |
| granularity | str | No | "page" | "page" for LLM page scan; "chunk" for BM25 keyword retrieval |
| top_k | int | No | 15 | Number of top-scoring chunks to retrieve (chunk mode only) |
| model | str | No | SELECT_MODEL or openai/gpt-4o | LiteLLM model string for NL page-mode selection |
Output — PageList (or list[PageList] when input is a list)
{
"parent_source": "https://example.com/report.pdf",
"parent_format": "PDF",
"pages": [
{"number": 4, "text": "Income Statement..."},
{"number": 5, "text": "Notes to Financial Statements..."}
],
"selector_prompt": "consolidated income statement" # or "[explicit pages]" / "[passthrough]"
}
Environment variables
| Variable | Default | Effect |
|---|---|---|
| SELECT_MODEL | openai/gpt-4o | Model for NL page-mode selection |
Examples
# BM25 keyword retrieval — no LLM, best for large documents
- id: select_section
  tool: select
  inputs:
    document: "{{ingest.output}}"
    granularity: chunk
    prompt: "Human Capital Resources employees workforce headcount"
    top_k: 10

# NL prompt selection (default page mode)
- id: select_income_stmt
  tool: select
  inputs:
    document: "{{ingest.output}}"
    prompt: "Select only the consolidated income statement pages"

# Explicit page numbers — no LLM call
- id: select_pages
  tool: select
  inputs:
    document: "{{ingest.output}}"
    pages: [3, 4, 5]
See the BM25 Section Extraction tutorial for a full worked example and guidance on choosing top_k and writing effective keyword queries.
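To make the chunk mode concrete, here is a minimal BM25 (Okapi) ranking sketch over plain-text chunks. It is not the tool's implementation: whitespace tokenization, the `k1` and `b` values, and the function name `bm25_rank` are assumptions chosen for illustration, and the real chunking is structural rather than a list of strings.

```python
import math
from collections import Counter
from typing import List


def bm25_rank(chunks: List[str], query: str, top_k: int = 15,
              k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return indices of the top_k chunks by BM25 score (illustrative only)."""
    docs = [chunk.lower().split() for chunk in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: number of chunks containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; ties keep document order. Drop zero-score chunks.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in ranked[:top_k] if score > 0]
```

Because scoring is purely lexical, queries work best as a bag of distinctive keywords rather than a full sentence.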
extract_from_texts
Sends the text content of selected pages to an LLM with a freeform extraction prompt. Returns the extracted fields as a structured JSON dict. Use this when you want flexible, prompt-driven extraction without a rigid schema.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| document | DocumentHandle, PageList, list, or str | Yes | — | Document with populated page text |
| prompt | str | Yes | — | What to extract, e.g. "Extract the revenue and net income figures" |
| model | str | No | EXTRACT_MODEL or openai/gpt-4o | LiteLLM model override |
Output — TextExtractionResult
{
"extracted": { # LLM-parsed JSON dict
"revenue": "391035",
"net_income": "93736",
"reporting_currency": "USD"
},
"source_pages": [4, 5], # 1-based pages processed
"sources": ["https://example.com/report.pdf"],
"prompt": "Extract the revenue and net income figures",
"model": "openai/gpt-4o"
}
Access extracted fields in downstream tasks as {{extract.output.extracted.revenue}}.
Environment variables
| Variable | Default | Effect |
|---|---|---|
| EXTRACT_MODEL | openai/gpt-4o | Extraction model |
Minimal example
- id: extract
  tool: extract_from_texts
  inputs:
    document: "{{select_pages.output}}"
    prompt: "Extract the total revenue, gross profit, and net income for the most recent period"
extract_from_tables
Extracts structured table data (headers and rows) from document pages. Can target a specific table with an optional selector hint. Useful for XLSX, HTML tables embedded in filings, and tabular PDFs.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| document | DocumentHandle, PageList, list, or str | Yes | — | Document with populated page text |
| selector | str | No | None | NL hint to target a specific table, e.g. "income statement" |
| model | str | No | EXTRACT_MODEL or openai/gpt-4o | LiteLLM model override |
Output — TableExtractionResult
{
"tables": [
{
"headers": ["", "FY2024", "FY2023"],
"rows": [
{"": "Total Revenues", "FY2024": "391,035", "FY2023": "383,285"},
{"": "Cost of Sales", "FY2024": "210,352", "FY2023": "214,137"},
# ...
],
"source_page": 4,
"sheet_name": None,
"selector": "income statement"
}
],
"source_pages": [4],
"sources": ["report.pdf"],
"model": "openai/gpt-4o"
}
Minimal example
- id: extract_table
  tool: extract_from_tables
  inputs:
    document: "{{select_pages.output}}"
    selector: "consolidated balance sheet"
extract_fields
Schema-bound extraction: extracts values for every field declared in a SchemaHandle in a single LLM call. Fields that cannot be located are set to the sentinel string "__not_found__". Use this when you have a predefined list of fields and want typed, validated output.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| document | DocumentHandle, PageList, list, or str | Yes | — | Pre-selected document pages |
| schema | SchemaHandle | Yes | — | Field definitions to extract |
| rules | DocumentHandle | No | None | Spreading manual or extraction rules document injected as context |
| selector | str | No | None | NL hint to scope extraction to a sub-region of the document |
| period_end | str | No | None | ISO date (YYYY-MM-DD); instructs the model to extract only this period's values |
| section_filter | str | No | None | Extract only fields with this section value (face, segments, footnotes, etc.) |
Output — dict[str, Any]
{
"revenue": 391035,
"gross_profit": 180683,
"operating_income": 123216,
"net_income": 93736,
"eps_diluted": "__not_found__" # field not located in document
}
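Downstream code usually needs to separate located values from the "__not_found__" sentinel before computing or exporting. A minimal sketch, assuming the output is the flat dict shown above; the helper name `split_found` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

NOT_FOUND = "__not_found__"  # sentinel string documented for extract_fields


def split_found(result: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split an extract_fields result into located values and missing field names."""
    found = {name: value for name, value in result.items() if value != NOT_FOUND}
    missing = [name for name, value in result.items() if value == NOT_FOUND]
    return found, missing
```

The list of missing names can then drive a follow-up step, for example a narrower select or an llm_job review.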
Environment variables
| Variable | Default | Effect |
|---|---|---|
| EXTRACT_FIELDS_MODEL | openai/gpt-4o | Extraction model |
Minimal example
- id: extract
  tool: extract_fields
  inputs:
    document: "{{select_pages.output}}"
    schema: "{{load_schema.output}}"
    period_end: "{{params.period_end}}"
    section_filter: face
With a spreading manual for higher accuracy
- id: extract
  tool: extract_fields
  inputs:
    document: "{{select_pages.output}}"
    schema: "{{load_schema.output}}"
    rules: "{{ingest_manual.output}}"
    period_end: "{{params.period_end}}"
extract_chart
Extracts numerical data from charts in a document. Currently a stub — returns an empty charts list. Intended for future multimodal chart parsing.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| document | DocumentHandle, PageList, or str | Yes | — | Document containing charts |
| classification | PageClassification or list | No | None | Page classification to guide the backend |
Output
load_schema
Resolves a field schema from multiple source types and returns a SchemaHandle. Use this as the first step in any extraction pipeline that uses extract_fields.
Source resolution order
- If source is already a SchemaHandle → returns it unchanged (pass-through)
- If a SchemaRegistry is configured and source is a registered name → looked up
- If source is a .json / .yaml / .yml file path → loaded and parsed
- If source is a dict → interpreted as {field_name: type_hint} or {"fields": [...]}
- If source is a list → treated as a list of field name strings or field definition dicts
- If source is a DocumentHandle → field names derived from Markdown table rows or XLSX column headers
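The resolution order above amounts to a chain of type checks. The sketch below only classifies a source. The `SchemaHandle` and `DocumentHandle` classes are empty stand-ins for the real types, and `classify_source` and `registered_names` are hypothetical names used for illustration.

```python
from pathlib import Path
from typing import Any, Collection


class SchemaHandle:
    """Stand-in for the real SchemaHandle type."""


class DocumentHandle:
    """Stand-in for the real DocumentHandle type."""


def classify_source(source: Any, registered_names: Collection[str] = ()) -> str:
    """Return which resolution branch applies, checked in the documented order."""
    if isinstance(source, SchemaHandle):
        return "passthrough"
    if isinstance(source, str) and source in registered_names:
        return "registry"
    if isinstance(source, str) and Path(source).suffix.lower() in {".json", ".yaml", ".yml"}:
        return "file"
    if isinstance(source, dict):
        return "dict"
    if isinstance(source, list):
        return "list"
    if isinstance(source, DocumentHandle):
        return "document"
    raise TypeError(f"Unsupported schema source: {type(source).__name__}")
```

Because the registry lookup runs before the file check, a registered name wins even if a file with a similar name exists.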
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| source | str, dict, list, DocumentHandle, or SchemaHandle | Yes | — | Schema source (see resolution order above) |
| hint | str | No | None | NL hint guiding field derivation when source is a document |
Output — SchemaHandle
{
"fields": [
{
"name": "revenue",
"type_hint": "number",
"section": "face",
"required": True,
"description": "Total consolidated revenue",
"computed": False,
"formula": None,
"sign_convention": None,
"manual_ref": "§2.1"
},
# ...
],
"source": "income_statement.json",
"task_id": "load_schema"
}
JSON schema file format
{
"fields": [
{
"name": "revenue",
"type_hint": "number",
"section": "face",
"description": "Total revenues",
"required": true,
"sign_convention": null,
"formula": null,
"manual_ref": "§2.1"
}
]
}
Minimal example
llm_job
General-purpose LLM invocation backed by LiteLLM. Accepts any number of extra keyword arguments alongside prompt — each extra kwarg is serialised as a labelled context block that is prepended to the prompt before the model sees it.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | str | Yes | — | Instruction text for the LLM |
| model | str | No | TRELLIS_LLM_MODEL or openai/gpt-4o-mini | LiteLLM model string |
| temperature | float | No | 0.7 | Sampling temperature |
| max_tokens | int | No | 2000 | Maximum response tokens |
| **kwargs | any | No | — | Any additional key-value pairs become labelled context blocks in the prompt |
Output — str (the LLM response text)
Context injection
Any input key that is not prompt, model, temperature, or max_tokens is injected into the prompt as a labelled section:
- id: review
  tool: llm_job
  inputs:
    extracted: "{{extract.output}}"  # injected as "--- extracted ---\n{...}"
    schema: "{{schema.output}}"      # injected as "--- schema ---\n{...}"
    prompt: "Review the extracted values and fix any __not_found__ entries."
The model receives:
--- extracted ---
{"revenue": "391035", "eps_diluted": "__not_found__"}
--- schema ---
{"source": "income_statement.json", "fields": ["revenue", "eps_diluted"]}
--- prompt ---
Review the extracted values and fix any __not_found__ entries.
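The assembly of that final prompt can be approximated as follows. This is a sketch under assumptions: the function name `build_prompt` is hypothetical, and serialising non-string values with `json.dumps` is a guess at how context values are rendered.

```python
import json
from typing import Any

# Keys consumed by llm_job itself rather than injected as context.
RESERVED_KEYS = {"prompt", "model", "temperature", "max_tokens"}


def build_prompt(prompt: str, **inputs: Any) -> str:
    """Prepend each extra input as a labelled block, then append the prompt."""
    blocks = []
    for label, value in inputs.items():
        if label in RESERVED_KEYS:
            continue  # e.g. model or temperature passed alongside context
        body = value if isinstance(value, str) else json.dumps(value)
        blocks.append(f"--- {label} ---\n{body}")
    blocks.append(f"--- prompt ---\n{prompt}")
    return "\n".join(blocks)
```

Context blocks appear in the order the inputs are declared, so the most important context can be placed closest to the instruction.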
Environment variables
| Variable | Default | Effect |
|---|---|---|
| TRELLIS_LLM_MODEL | openai/gpt-4o-mini | Default model for all llm_job tasks |
fetch_data
Fetches data from SEC EDGAR or a generic HTTP URL. In SEC mode, resolves the company or ticker to a CIK, queries the EDGAR submissions API, and returns a list of matching filings with URLs. In URL mode, performs a plain HTTP GET and returns the parsed response.
SEC EDGAR mode (source: sec_edgar)
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| source | str | Yes | — | sec_edgar, sec, or edgar |
| companies | str or list[str] | Yes* | — | Company names or tickers; ticker is an alias |
| ticker | str or list[str] | Yes* | — | Alias for companies |
| period_end | str | No | None | ISO date YYYY-MM-DD; year extracted for filing filter |
| period_type | str | No | None | annual → 10-K only; quarterly/ytd_current/ytd_prior → 10-Q only |
| forms | list[str] | No | None | Explicit form filter, e.g. ["10-K", "10-Q"] |
| year | int | No | None | Filter by filing year (overrides year derived from period_end) |
| count | int | No | 20 | Max filings per company |
*One of companies or ticker is required.
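How period_end, period_type, forms, and year combine into a filing filter can be sketched as below. The mapping of period types to forms and the year override come from the table. Letting an explicit forms list take precedence over period_type is an assumption, and `resolve_filing_filter` is a hypothetical name.

```python
from datetime import date
from typing import Dict, List, Optional, Union


def resolve_filing_filter(
    period_end: Optional[str] = None,
    period_type: Optional[str] = None,
    forms: Optional[List[str]] = None,
    year: Optional[int] = None,
) -> Dict[str, Union[List[str], int, None]]:
    """Derive the form and year filters from the SEC-mode inputs (illustrative)."""
    if forms is None and period_type is not None:
        if period_type == "annual":
            forms = ["10-K"]
        elif period_type in {"quarterly", "ytd_current", "ytd_prior"}:
            forms = ["10-Q"]
        else:
            raise ValueError(f"Unknown period_type: {period_type!r}")
    # An explicit year overrides the year derived from period_end.
    if year is None and period_end is not None:
        year = date.fromisoformat(period_end).year
    return {"forms": forms, "year": year}
```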
Output
{
"status": "success",
"source": "sec_edgar",
"results": [
{
"company_input": "AAPL",
"company_name": "Apple Inc.",
"ticker": "AAPL",
"cik": "0000320193",
"filings": [
{
"form": "10-K",
"filing_date": "2024-11-01",
"accession_no": "0000320193-24-000123",
"url": "https://www.sec.gov/Archives/edgar/data/320193/.../aapl20240928.htm",
"primary_document": "aapl20240928.htm"
}
]
}
]
}
Environment variables
| Variable | Default | Effect |
|---|---|---|
| SEC_USER_AGENT | Trellis/0.1 (contact@example.com) | User-Agent header sent to SEC EDGAR (required by SEC policy) |
| SEC_THROTTLE_SECONDS | 0.2 | Delay between EDGAR requests |
HTTP URL mode (source: url)
Additional inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| url | str | Yes | — | HTTP URL to fetch |
| method | str | No | GET | HTTP method |
| headers | dict | No | None | Custom HTTP headers |
Output
{
"status": "success",
"source": "url",
"url": "https://api.example.com/data.json",
"content_type": "application/json",
"data": {"key": "value"} # parsed JSON, text, or raw bytes
}
Examples
# Annual 10-K for Apple, fiscal year ending 2024-09-30
- id: fetch
  tool: fetch_data
  inputs:
    source: sec_edgar
    ticker: AAPL
    period_end: "2024-09-30"
    period_type: annual
    count: 1

# Multiple companies
- id: fetch_multi
  tool: fetch_data
  inputs:
    source: sec_edgar
    companies: ["Apple Inc.", "Microsoft Corporation"]
    forms: ["10-K"]
    year: 2024

# HTTP URL
- id: fetch_json
  tool: fetch_data
  inputs:
    source: url
    url: "https://api.example.com/metrics.json"
search_web
Performs web search and returns titles, snippets, and URLs. Uses DuckDuckGo by default (no API key required). Falls back automatically to DuckDuckGo if SerpAPI is configured but fails.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| query | str or list[str] | Yes | — | Search query or list of queries |
| top_n | int | No | 5 | Max results per query |
| provider | str | No | duckduckgo | duckduckgo or serpapi |
| timeout | int | No | 15 | HTTP timeout in seconds |
Output
{
"status": "success",
"results": [
{
"title": "Apple Reports First Quarter Results",
"snippet": "Apple today announced financial results for its fiscal 2025 first quarter...",
"url": "https://www.apple.com/newsroom/...",
"source_query": "Apple earnings Q1 2025"
},
# ...
]
}
When query is a list, all results are merged into a single flat list with the originating query preserved in source_query.
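The merge behaviour can be pictured with a short sketch. The helper name `merge_results` and the shape of its input (a mapping from query to that query's hits) are assumptions for illustration; only the flat output list and the source_query field come from the description above.

```python
from typing import Dict, List


def merge_results(per_query: Dict[str, List[dict]]) -> List[dict]:
    """Flatten per-query hits into one list, tagging each with its query."""
    merged = []
    for query, hits in per_query.items():
        for hit in hits:
            # Copy the hit so the caller's data is not mutated.
            merged.append({**hit, "source_query": query})
    return merged
```

Grouping the flat list back by source_query recovers the per-query view when a downstream task needs it.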
Environment variables
| Variable | Default | Effect |
|---|---|---|
| TRELLIS_SEARCH_PROVIDER | duckduckgo | Default provider |
| TRELLIS_SEARCH_TOP_N | 5 | Default result count |
| TRELLIS_SEARCH_TIMEOUT | 15 | Default HTTP timeout |
| SERPAPI_API_KEY | — | Required for SerpAPI provider |
Minimal example
compute
Invokes a named deterministic function from the FunctionRegistry. This is the DSL surface for all codeable, side-effect-free computations: date arithmetic, currency normalization, fiscal period resolution, financial ratios, etc. The function body lives in the operator's registry — the pipeline only names it.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| function | str | Yes | — | Registered function name |
| **kwargs | any | No | — | Forwarded to the function as keyword arguments |
Output — whatever the registered function returns (depends on the function)
The built-in FunctionRegistry is populated by build_finance_registry(). See the Extensibility guide for how to add functions.
Minimal example
- id: resolve_periods
  tool: compute
  inputs:
    function: fiscal_period_logic
    as_of_date: "{{params.period_end}}"
    company: "{{params.company}}"
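The name-to-function indirection behind compute can be illustrated with a toy registry. This is not the real FunctionRegistry API: the class `TinyFunctionRegistry`, its `register` and `call` methods, and the `days_between` function are all hypothetical and exist only to show the pattern.

```python
from datetime import date
from typing import Any, Callable, Dict


class TinyFunctionRegistry:
    """Toy stand-in showing how a pipeline can name a function without defining it."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._functions[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._functions:
            raise KeyError(f"No function registered under {name!r}")
        # Remaining inputs are forwarded as keyword arguments.
        return self._functions[name](**kwargs)


def days_between(start: str, end: str) -> int:
    """Deterministic, side-effect-free example function."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


registry = TinyFunctionRegistry()
registry.register("days_between", days_between)
```

A compute task would then correspond to something like `registry.call("days_between", start="2024-01-01", end="2024-09-30")`.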
store
Persists a value to the session blackboard under a named key. The actual persistence happens in the DAG executor, which writes to the Blackboard after store returns — the value is then visible to all subsequent tasks in the same pipeline via {{session.key}}.
In a multi-pipeline Plan, values written by store in one sub-pipeline are available as {{session.key}} in any later sub-pipeline that declares the key in its reads list.
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| key | str | Yes | — | Blackboard key to write under |
| value | any | Yes | — | Value to persist; can be any Python object |
| append | bool | No | False | Append to an existing list under this key instead of replacing |
Output
{
"status": "success",
"key": "extraction_result",
"append": False,
"value": { ... } # echoes back the stored value
}
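The difference between replacing and appending can be sketched with a plain dict standing in for the Blackboard. The function `apply_store` is hypothetical, and wrapping an existing non-list value into a list when append is set is an assumption about an edge case the reference does not specify.

```python
from typing import Any, Dict


def apply_store(blackboard: Dict[str, Any], key: str, value: Any,
                append: bool = False) -> Dict[str, Any]:
    """Write value under key, replacing or appending (illustrative only)."""
    if append:
        existing = blackboard.get(key, [])
        if not isinstance(existing, list):
            existing = [existing]  # assumption: promote a scalar to a list
        blackboard[key] = existing + [value]
    else:
        blackboard[key] = value
    # Mirrors the documented output shape, echoing back the stored value.
    return {"status": "success", "key": key, "append": append, "value": value}
```

Append mode suits loops over companies or periods, where each iteration adds one result to a shared list.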
Minimal example
Keys can be parameterized:
- id: persist
  tool: store
  inputs:
    key: "{{params.ticker}}_{{params.period_end}}_result"
    value: "{{extract.output}}"
export
Writes any pipeline value to disk as JSON, Markdown, CSV, or XLSX. Output goes to TRELLIS_OUTPUT_DIR (default: outputs/) unless overridden by output_dir. JSON strings from llm_job are parsed automatically before writing.
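The automatic parsing of JSON strings might look roughly like this. It is a sketch, not the tool's code: `coerce_payload` is a hypothetical name, and accepting only parsed objects and arrays (so that a bare string such as "42" stays a string) is an assumption.

```python
import json
from typing import Any


def coerce_payload(data: Any) -> Any:
    """Parse data as JSON when it is a string holding an object or array."""
    if not isinstance(data, str):
        return data
    try:
        parsed = json.loads(data)
    except json.JSONDecodeError:
        return data  # not JSON: export the text as-is
    # Assumption: only structured results replace the original string.
    return parsed if isinstance(parsed, (dict, list)) else data
```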
Inputs
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| data | any | No* | — | Content to export (takes precedence over content if both supplied) |
| content | any | No* | — | Alias for data |
| format | str | No | markdown | json, markdown, csv, xlsx |
| filename | str | No | artifact | Base filename without extension |
| output_dir | str | No | outputs | Directory to write into (created if absent) |
| schema | SchemaHandle | No | None | When provided, validates conformance before writing |
| strict | bool | No | False | Raise on extra fields when schema is supplied |
| metadata | dict | No | None | Document header fields for Markdown export (company, ticker, currency, units, audited, source_filing, filed) |
| periods | list | No | None | Period label dicts ({"label": "FY2024"}) used as Markdown column headers |
| analyst_notes | list or str | No | None | Notes appended as an Analyst Notes table in Markdown output |
*At least one of data or content should be supplied.
Output
{
"status": "success",
"format": "json",
"filename": "AAPL_extraction",
"path": "/absolute/path/to/outputs/AAPL_extraction.json",
"size": 512,
"schema_source": "income_statement.json" # only when schema= was supplied
}
Environment variables
| Variable | Default | Effect |
|---|---|---|
| TRELLIS_OUTPUT_DIR | outputs | Default output directory |
Minimal example
- id: write
  tool: export
  inputs:
    data: "{{extract.output}}"
    format: json
    filename: "{{params.ticker}}_result"
Markdown export with metadata
- id: write_md
  tool: export
  inputs:
    data: "{{extract.output}}"
    format: markdown
    filename: income_statement
    metadata:
      company: "Apple Inc."
      ticker: AAPL
      currency: USD
      units: millions
      audited: true
    periods:
      - label: "FY2024"
      - label: "FY2023"
Registry mechanics
How tools are discovered
AsyncToolRegistry.discover_impls() scans every module under trellis.tools.impls, finds all BaseTool subclasses, instantiates them with default arguments, and registers them. The build_default_registry() factory calls this and then re-registers ComputeTool with the built-in FunctionRegistry:
from trellis.tools.registry import build_default_registry
registry = build_default_registry()
print(registry.registered_tools())
# ['compute', 'export', 'extract_chart', 'extract_fields', 'extract_from_tables',
# 'extract_from_texts', 'fetch_data', 'ingest_document', 'llm_job', 'load_schema',
# 'mock', 'search_web', 'select', 'store', ...]
The Orchestrator calls build_default_registry() on construction, so all built-in tools are available automatically.
How invocation works
When the executor runs a task, it resolves the task's template inputs and calls registry.invoke(task.tool, resolved_inputs). Internally:
- If the tool's execute() method is a coroutine function, it is await-ed directly.
- If it is a synchronous function, it is run in a worker thread via asyncio.to_thread() so it does not block the event loop.
This means all tools — sync or async — are safe to run concurrently inside a wave.
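The dispatch described above can be reproduced with the standard library alone. The sketch below is not the registry's code: `invoke` here takes a plain callable rather than a tool name, and `sync_tool`, `async_tool`, and `run_wave` are hypothetical examples.

```python
import asyncio
import inspect
import time
from typing import Any, Callable, List


async def invoke(tool: Callable[..., Any], **inputs: Any) -> Any:
    """Await coroutine functions directly; run sync callables in a worker thread."""
    if inspect.iscoroutinefunction(tool):
        return await tool(**inputs)
    return await asyncio.to_thread(tool, **inputs)


def sync_tool(text: str) -> str:
    time.sleep(0.05)  # blocking work, kept off the event loop by to_thread
    return text.upper()


async def async_tool(text: str) -> str:
    await asyncio.sleep(0.05)  # non-blocking wait
    return text[::-1]


async def run_wave() -> List[str]:
    # Both tools run concurrently, as they would inside a wave.
    return await asyncio.gather(
        invoke(sync_tool, text="abc"),
        invoke(async_tool, text="abc"),
    )


if __name__ == "__main__":
    print(asyncio.run(run_wave()))
```

Note that asyncio.to_thread requires Python 3.9 or later, and that a synchronous tool run this way must itself be thread-safe.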
Registering a custom tool
Subclass BaseTool and pass an instance to registry.register_tool():
from trellis.tools.base import BaseTool, ToolInput, ToolOutput
from trellis.tools.registry import build_default_registry
from trellis.execution.orchestrator import Orchestrator
from typing import Any, Dict


class MyTool(BaseTool):
    def __init__(self):
        super().__init__("my_tool", "Does something custom")

    def execute(self, text: str, *, multiplier: int = 1, **kwargs) -> Dict[str, Any]:
        return {"result": text * multiplier}

    def get_inputs(self) -> Dict[str, ToolInput]:
        return {
            "text": ToolInput("text", "Input text", required=True),
            "multiplier": ToolInput("multiplier", "Repeat count", required=False, default=1),
        }

    def get_output(self) -> ToolOutput:
        return ToolOutput("result", "Repeated text", type_="string")


# Wire it into the orchestrator
registry = build_default_registry()
registry.register_tool(MyTool())
orch = Orchestrator(registry=registry)
The tool name ("my_tool") is what you use in the DSL tool: field. Note that KNOWN_TOOLS in trellis/models/pipeline.py is the set of names that pass validation — add your tool name there, or disable validation for your custom pipelines.
Registering a plain callable
For lightweight tools that don't need BaseTool metadata, register any callable directly:
async def my_async_tool(query: str, **kwargs) -> dict:
    return {"answer": f"result for {query}"}

registry.register_callable("my_async_tool", my_async_tool)
Sync and async callables are both supported — the registry detects iscoroutinefunction at invoke time.