BM25 Section Extraction
Extract structured fields from a specific section of a large document using keyword-based BM25 retrieval — without reading every page with an LLM.
Tools used: ingest_document → select (BM25 mode) → extract_from_texts → export
Requires: LLM API key (for extract_from_texts only)
Time: ~10 minutes
The problem with page-level selection
The default select mode (NL prompt, granularity: page) works by asking an LLM to scan a short inventory of every page — page number plus the first 300 characters of text. For most short documents this is fine. For long filings it breaks down:
- A 200-page HTML 10-K has cover pages, risk factors, and table-of-contents entries that dominate the early snippet for many pages.
- The section you want may start mid-page, past the snippet window.
- The LLM may confuse section headings that appear in passing references.
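The snippet-window problem can be illustrated with a minimal sketch. The `build_inventory` helper and the sample page below are hypothetical and are not part of the tool's API; only the 300-character window comes from the description above.

```python
SNIPPET_CHARS = 300  # snippet window described above


def build_inventory(pages):
    """Return (page_number, snippet) pairs, mimicking a page-level inventory."""
    return [(number, text[:SNIPPET_CHARS]) for number, text in pages]


# Hypothetical page: the tail of a previous section fills the top of the page,
# so the target heading only appears after the snippet window.
filler = "Risk factors continued. " * 20  # 480 characters of unrelated text
page_text = filler + "Human Capital Resources\nWe had a global workforce of ..."
pages = [(28, page_text)]

number, snippet = build_inventory(pages)[0]
heading = "Human Capital Resources"

print(heading in page_text)  # True: the heading is on the page
print(heading in snippet)    # False: it is not in what the page scan would see
```

Under these assumptions, a selector that only reads the snippet has no way to tell that page 28 contains the section.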
granularity: chunk solves this by building a BM25 index over the document's structural chunks (every heading, prose paragraph, and table) and scoring them against your query terms. The pages containing the highest-scoring chunks are returned. There is no LLM page scan, no snippet truncation, and no context-window pressure.
How BM25 chunk selection works
When granularity: chunk is set on a select task:
- Chunking: each page is split into structural chunks (headings, prose blocks of ≤512 tokens each, table rows, and footnotes).
- BM25 indexing: all chunks from the document are indexed in memory using rank-bm25.
- Retrieval: the prompt field is treated as a keyword query. Every chunk is scored with BM25 and the top_k highest-scoring chunks are kept.
- Page mapping: the page numbers of the matched chunks are collected, deduplicated, and returned as a PageList.
- Downstream compatibility: the output is a plain PageList, identical to what the default mode returns. extract_from_texts and extract_fields work unchanged.
The BM25 index is built on-the-fly from the ingested document — no pre-indexing or infrastructure needed.
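The steps above can be sketched end to end in a few lines. This is a hand-rolled BM25 variant (Lucene-style IDF, whitespace tokenization) written for illustration; it is not the tool's implementation and does not use rank-bm25. The `TinyBM25` class, the `select_pages` helper, and the sample chunks are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the real tokenizer may differ.
    return text.lower().split()


class TinyBM25:
    """Minimal in-memory BM25 index over a list of chunk texts."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of chunks containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query, index):
        doc = self.docs[index]
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            # Length normalisation: long chunks are penalised relative to the average.
            norm = f + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total


def select_pages(chunks, query, top_k):
    """Score every (page, text) chunk, keep the top_k, return deduplicated pages."""
    index = TinyBM25([text for _, text in chunks])
    scored = [(index.score(query, i), page) for i, (page, _) in enumerate(chunks)]
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]
    return sorted({page for score, page in ranked if score > 0})


# Hypothetical (page_number, chunk_text) pairs.
chunks = [
    (1, "table of contents human capital resources page 28"),
    (12, "risk factors our business is subject to competition"),
    (28, "human capital resources we had a global workforce of employees"),
    (28, "employees workforce headcount by office cities"),
    (29, "attrition and diversity of our workforce"),
    (40, "notes to consolidated financial statements"),
]

query = "human capital resources employees workforce headcount attrition diversity"
pages = select_pages(chunks, query, top_k=3)
print(pages)  # [28, 29]
```

With these sample chunks the table-of-contents entry on page 1 matches three query terms but is outranked by the body chunks, so it falls outside the top three.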
The pipeline
pipeline:
  id: bm25_field_extraction
  tasks:
    - id: ingest
      tool: ingest_document
      inputs:
        path: "https://www.sec.gov/Archives/edgar/data/1326801/000162828026003942/meta-20251231.htm"
    - id: select_section
      tool: select
      inputs:
        document: "{{ingest.output}}"
        granularity: chunk
        prompt: "Human Capital Resources employees workforce headcount office cities attrition diversity"
        top_k: 10
    - id: extract
      tool: extract_from_texts
      inputs:
        document: "{{select_section.output}}"
        prompt: |
          Extract every workforce and human capital fact stated in the document.
          Return a JSON object with the following fields (null if not stated):
          - total_employees: integer
          - as_of_date: string (ISO date, e.g. "2025-12-31")
          - office_cities: integer
          - full_time_employees: integer or null
          - part_time_employees: integer or null
          - female_employees_pct: number or null
          - underrepresented_minorities_pct: number or null
          - voluntary_attrition_pct: number or null
          - source_quote: string (exact sentence(s) stating employee count and office cities)
    - id: export_result
      tool: export
      inputs:
        data: "{{extract.output.extracted}}"
        format: json
        filename: "meta_2025_workforce"
Execution waves
| Wave | Tasks | Why this wave |
|---|---|---|
| 1 | ingest | Network fetch + parse; no dependencies |
| 2 | select_section | Needs ingest output; builds BM25 in-memory |
| 3 | extract | Needs select_section output; LLM call |
| 4 | export_result | Needs extract output |
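One way to see where these waves come from is to derive them from the `{{task.output}}` references in each task's inputs. The sketch below is an assumption about how such a schedule could be computed, not the orchestrator's actual code; `dependencies`, `execution_waves`, and the placeholder URL are hypothetical.

```python
import re

# Task inputs mirroring the pipeline above; only the template references matter.
tasks = {
    "ingest": {"path": "https://example.com/filing.htm"},  # placeholder URL
    "select_section": {"document": "{{ingest.output}}"},
    "extract": {"document": "{{select_section.output}}"},
    "export_result": {"data": "{{extract.output.extracted}}"},
}

# Captures the task id at the start of a {{task_id.something}} reference.
REF = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\.")


def dependencies(inputs):
    """Collect the ids of tasks referenced by any string input."""
    deps = set()
    for value in inputs.values():
        if isinstance(value, str):
            deps.update(REF.findall(value))
    return deps


def execution_waves(tasks):
    """Group tasks into waves; a task is ready once all its dependencies ran."""
    remaining = {task_id: dependencies(inputs) for task_id, inputs in tasks.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown task reference")
        waves.append(ready)
        done.update(ready)
        for task_id in ready:
            del remaining[task_id]
    return waves


for number, wave in enumerate(execution_waves(tasks), start=1):
    print(number, wave)
```

Because each task here depends on the one before it, every wave holds a single task. Two tasks that referenced only `ingest` would land together in wave 2.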
Run it: CLI
Run it: Python SDK
import asyncio

from trellis.models.pipeline import Pipeline
from trellis.execution.orchestrator import Orchestrator


async def main():
    pipeline = Pipeline.from_yaml_file("examples/pipelines/bm25_field_extraction.yaml")
    orch = Orchestrator()
    result = await orch.run_pipeline(pipeline)

    # The BM25 selection narrows the 200-page filing to a handful of pages.
    # The LLM only ever sees those pages.
    selection = result.outputs["select_section"]
    print(f"Pages selected: {[p['number'] for p in selection['pages']]}")

    extracted = result.outputs["extract"]["extracted"]
    print("\nExtracted workforce facts:")
    for field, value in extracted.items():
        print(f"  {field}: {value}")

    export = result.outputs["export_result"]
    print(f"\nJSON written to: {export['path']} ({export['size']} bytes)")


asyncio.run(main())
Build a BM25 index directly (no pipeline)
If you want to use the select tool programmatically — for example to inspect scores or reuse the index across multiple queries — construct it directly:
from trellis.tools.impls.document import IngestDocumentTool
from trellis.tools.impls.select import SelectTool

# Ingest
ingest_tool = IngestDocumentTool()
handle = ingest_tool.execute(
    path="examples/data/MET_Annual_Report_FY_2025.pdf"
)

# BM25 selection (builds the index on first call)
select_tool = SelectTool()
pages = select_tool.execute(
    document=handle,
    granularity="chunk",
    prompt="investment returns endowment performance asset allocation",
    top_k=8,
)
print(f"Selected {len(pages.pages)} pages: {[p.number for p in pages.pages]}")
# → Selected 3 pages: [14, 15, 16]

# Pass directly to extract_from_texts
from trellis.tools.impls.extract import ExtractFromTextsTool

extract_tool = ExtractFromTextsTool()
result = extract_tool.execute(
    document=pages,
    prompt="Extract total endowment value, annual return percentage, and asset allocation breakdown.",
)
print(result.extracted)
Expected output
# result.outputs["select_section"] (PageList)
{
    "parent_source": "https://www.sec.gov/Archives/edgar/data/.../meta-20251231.htm",
    "parent_format": "HTML",
    "pages": [
        {"number": 28, "text": "Human Capital Resources\nWe had a global workforce..."},
        {"number": 29, "text": "...continued..."},
    ],
    "selector_prompt": "[chunk-fallback-passthrough]"
}

# result.outputs["extract"]["extracted"]
{
    "total_employees": 78865,
    "as_of_date": "2025-12-31",
    "office_cities": 100,
    "full_time_employees": null,
    "part_time_employees": null,
    "female_employees_pct": null,
    "underrepresented_minorities_pct": null,
    "voluntary_attrition_pct": null,
    "source_quote": "We had a global workforce of 78,865 employees as of December 31, 2025, ..."
}

# result.outputs["export_result"]
{
    "status": "success",
    "format": "json",
    "filename": "meta_2025_workforce",
    "path": "/absolute/path/to/outputs/meta_2025_workforce.json",
    "size": 512
}
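Since the extraction prompt asks for a source_quote, one inexpensive sanity check is to confirm that a numeric field actually appears in that quote. The sketch below is an optional post-processing idea and not part of the pipeline; `numbers_in` and `quote_supports` are hypothetical helpers, and the sample dict copies values from the expected output above.

```python
import re

# Subset of the expected extraction output shown above.
extracted = {
    "total_employees": 78865,
    "as_of_date": "2025-12-31",
    "source_quote": "We had a global workforce of 78,865 employees as of December 31, 2025, ...",
}


def numbers_in(text):
    """Return the integers found in text, accepting thousands separators."""
    return {int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", text)}


def quote_supports(extracted, field):
    """True if the field's integer value literally appears in source_quote."""
    value = extracted.get(field)
    return value is not None and value in numbers_in(extracted["source_quote"])


print(quote_supports(extracted, "total_employees"))  # True
```

A False result does not prove the extraction is wrong (the quote may state the number in words), but it is a cheap signal for flagging records that deserve a manual look.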
Choosing top_k
top_k controls how many chunks are retrieved. More chunks means more pages in the PageList, which gives the LLM more context but increases token cost.
| Document type | Recommended top_k |
|---|---|
| Single focused section (2–4 pages) | 5–8 |
| Section with sub-sections or tables | 8–15 |
| Multiple related sections | 15–25 |
If a section spans many pages, or the keywords appear in both headings and body text, a higher top_k makes it more likely that the surrounding context pages are included.
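The relationship between top_k and the number of returned pages is not linear, because several top-ranked chunks often sit on the same page. A minimal sketch, using a made-up ranking (the page numbers below are hypothetical, not real output):

```python
# Hypothetical ranking: the page number of each chunk, best BM25 score first.
ranked_chunk_pages = [28, 28, 29, 28, 29, 30, 27, 28, 31, 1, 30, 45]


def pages_for(top_k):
    """Deduplicated, sorted pages covered by the top_k chunks."""
    return sorted(set(ranked_chunk_pages[:top_k]))


for top_k in (3, 5, 8, 12):
    print(top_k, pages_for(top_k))
# 3 [28, 29]
# 5 [28, 29]
# 8 [27, 28, 29, 30]
# 12 [1, 27, 28, 29, 30, 31, 45]
```

In this example, raising top_k from 3 to 5 adds no pages, while raising it to 12 starts pulling in weakly related pages such as the table of contents, which is the token-cost trade-off described above.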
Choosing the query prompt
The prompt in chunk mode is a keyword query, not an instruction to the LLM. Write it as a dense list of terms that appear in the target section:
# Good — terms that appear in the section text
prompt: "Human Capital Resources employees workforce headcount office cities attrition diversity"
# Less effective — natural language instruction (works but wastes query budget on stopwords)
prompt: "Select only the pages that contain information about employees and workforce"
BM25 tokenizes the query and scores each chunk on term frequency, weighted by how rare each term is across the document, so including synonyms and related domain terms (e.g. headcount, workforce, employees) boosts recall.
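To see why instruction-style wording adds little, it helps to look at the per-term weights. The sketch below computes a Lucene-style inverse document frequency over a few made-up chunks; the `idf` helper and the sample text are hypothetical, and the exact formula used by the tool's BM25 library may differ.

```python
import math

# Hypothetical chunk texts standing in for an indexed document.
chunks = [
    "the company had a global workforce of employees",
    "the risk factors affecting the business",
    "the notes to the financial statements",
    "headcount and attrition across the office cities",
]
tokenized = [set(chunk.split()) for chunk in chunks]


def idf(term):
    """Lucene-style IDF; 0.0 for terms that occur in no chunk."""
    n = len(tokenized)
    df = sum(term in doc for doc in tokenized)
    if df == 0:
        return 0.0  # a term that never occurs cannot match any chunk
    return math.log((n - df + 0.5) / (df + 0.5) + 1)


for query in ("workforce headcount attrition", "select the pages that contain information"):
    print({term: round(idf(term), 2) for term in query.split()})
```

Under these assumptions the keyword query's terms each carry a substantial weight, while the instruction's words either occur in every chunk ("the") and weigh almost nothing, or occur nowhere and contribute nothing.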
When to use each selection mode
| Mode | Inputs | Best for |
|---|---|---|
| granularity: page (default) | NL prompt | Short documents; sections with distinctive first-paragraph text |
| granularity: chunk | Keyword prompt, top_k | Long documents; sections that start mid-page; when LLM page scan is unreliable |
| pages: [3, 5, 12] | Explicit page list | Known page numbers; deterministic pipelines |
Next steps
- SEC Filing Field Extraction: schema-bound extraction with extract_fields and typed validation
- PDF Ingest, Page Selection, and Extraction: LLM-based page selection for shorter documents
- Exporting Results: write extracted data to Markdown, CSV, or XLSX