Honcho

A content intelligence platform that extends AI beyond its training data. Ingest your organization’s documents, feeds, and archives into a single searchable index—then get answers grounded in real sources rather than confident-sounding hallucination.

The Problem

AI models have a knowledge cutoff. Anything after that date is simply unknown. For content within their training data, they reconstruct answers from statistical patterns—details can be wrong or conflated. They can’t cite a primary source because they don’t have one. And they have no access to your organization’s internal documents at all.

The result: plausible-sounding but subtly wrong answers—which is worse than “I don’t know.”

Honcho solves this by giving AI the actual source material. Ask a question, Honcho retrieves the real documents, and you get grounded answers with citations back to the originals.

Honcho extends the AI’s effective knowledge cutoff to now, grounded in real sources rather than reconstructed training data.
  • Recency: Your feeds and documents are current. The AI’s training data isn’t.
  • Precision: Answers from primary sources, not reconstructed from memory.
  • Provenance: Every answer cites the actual entry. You can always check the original.

The Short Version

Honcho is a content intelligence platform that indexes large collections of documents—research archives, institutional publications, news corpora, internal knowledge bases—and makes them searchable and accessible to AI assistants. Import tens of thousands of articles from a CMS, bulk-load a document archive, or crawl hundreds of feeds over time. Everything lands in the same full-text index.

What makes it powerful is what happens next. Connect Honcho to Claude or another AI assistant, and the entire corpus becomes conversational. Ask “how have expert views on China’s economy evolved over the past decade?” and the AI synthesizes across thousands of indexed documents—threading together analysis, tracking how positions shifted, surfacing connections that no keyword search would find. Combine results from your library with live web content to compare, verify, and fill gaps.

Content flows in from many directions: bulk importers for CMS migrations and document archives, an extract API with declarative rules for structured data sources, pull replication from WordPress and other CMS platforms, RSS/Atom crawling for ongoing feeds, pluggable connectors for custom sources, and AI assistants that save articles and notes on your behalf.

It supports multiple users and groups, so a team can share a common pool of content and collaborate through their AI assistants. An analyst can summarize a report and share it to the team’s feed. A writer can research across the entire corpus and send findings to colleagues. A shared knowledge base builds organically without anyone copying and pasting links around.

Honcho is also a persistence layer for AI workflows. Agent apps that summarize articles, monitor topics, or produce analysis can write their output back to Honcho, where it becomes searchable alongside everything else. The content you curate and the content your tools produce all live in one place.

What does this look like in practice?

Cross-topic synthesis across years of financial commentary. Briefings assembled from dozens of sources in minutes. Expert opinion tracked over time. Pattern recognition across industries.

See real use cases →

What It Does

Most content platforms are assembled from separate services—a CMS for storage, Elasticsearch for search, a feed reader for ingestion, S3 for assets, a custom API layer to glue it all together. Honcho replaces that stack with a single integrated system deployed on your infrastructure.

  • Aggregate: Crawl RSS, Atom, and HTML sources on configurable schedules. A declarative rules engine lets you define per-site extraction logic—CSS selectors, field mappings, timestamp parsing, fallback chains—in a config file instead of code. Built-in deduplication and change detection keep the index clean.
  • Ingest: Content doesn’t just come from feeds. Pull replication syncs from WordPress and other CMS platforms with incremental updates. A pluggable connector API supports custom Java connectors dropped in as JARs. AI assistants save articles and notes via MCP. A save-URL endpoint supports bookmarklets and mobile shortcuts. An extract API accepts raw JSON, HTML, or XML and runs it through extraction rules. Bulk importers handle CMS migrations. Everything lands in the same index.
  • Index: Full-text search powered by Lucene 9 with a configurable text analysis pipeline. Boolean queries, time-range filters, tag and topic facets, custom numeric fields, and configurable relevance scoring.
  • Store: Persistent content storage with content hashing for change detection. Encrypted backups with cloud KMS integration. Protobuf-based metadata model.
  • Serve: GraphQL, REST, and MCP (Model Context Protocol) endpoints for search, retrieval, and content distribution. No rendering opinions—bring your own frontend, feed reader, mobile app, or AI assistant.
  • Replicate: Push and pull replication between instances and from external CMS platforms. WordPress connector with incremental sync, pluggable connector API for custom sources. Distribute content across organizational boundaries with topic-based routing.
Content flows from crawl to index to API to replication as a single transaction with a single data model.
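The change detection mentioned in the Store step can be sketched with a content hash. This is a minimal illustration, assuming a SHA-256 digest over normalized text; the class and method names are invented for the example, and Honcho’s actual hashing scheme may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Minimal sketch of hash-based change detection; names are illustrative. */
public class ChangeDetector {

    /** Hash normalized content so trivial whitespace edits don't register as changes. */
    static String contentHash(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.strip().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is guaranteed on every JVM
        }
    }

    /** An entry is re-indexed only if its stored hash differs from the new content's hash. */
    static boolean hasChanged(String storedHash, String newContent) {
        return storedHash == null || !storedHash.equals(contentHash(newContent));
    }

    public static void main(String[] args) {
        String v1 = contentHash("Honcho indexes content at write time.");
        System.out.println(hasChanged(v1, "Honcho indexes content at write time.")); // false
        System.out.println(hasChanged(v1, "Honcho indexes content at write time!")); // true
    }
}
```

The same digest doubles as a deduplication key: two crawled documents with identical normalized bodies hash to the same value and can be collapsed before indexing.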
Concern     Typical Stack             Honcho
Storage     PostgreSQL / DynamoDB     Built-in
Search      Algolia / Elasticsearch   Built-in
Crawling    Scrapy / custom crawlers  Built-in
Ingestion   Custom importers / ETL    Built-in (MCP, save URL, extract API, bulk import)
Assets      S3 / cloud storage        Built-in
API         Custom REST / GraphQL     Built-in (GraphQL + REST + MCP)
Sync        Custom ETL / webhooks     Built-in
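As a sketch of what querying the API layer might look like, here is a hypothetical GraphQL search query. The enhance flag corresponds to the opt-in AI query enhancement; the other field and argument names are illustrative assumptions, not Honcho’s actual schema.

```graphql
# Hypothetical query shape; actual schema field names may differ.
query RecentSecurityEntries {
  search(query: "tag:security AND published:[2024-01-01 TO *]",
         enhance: false,
         limit: 10) {
    entries {
      uid
      title
      author
      publishedAt
    }
  }
}
```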

What Makes It Different

  • Search is built in, not bolted on: The Lucene index is the primary read path. Content is indexed at write time with no sync lag. The query API exposes the full power of Lucene: phrase queries, field-scoped search, boolean composition, range-boosted relevance, and custom numeric fields for domain-specific ranking.
  • Structured content fragments: Content is modeled as typed fragments—paragraphs, code blocks, headings, pull quotes, recipe steps. Each fragment type is indexed separately, so you can search within specific block types: find all entries where a code block mentions HashMap, or where a heading contains authentication.
  • Declarative content extraction: A rules DSL lets you define how to extract entries from any JSON, HTML, or XML source—selectors, extractors, transforms, timestamp parsers—without writing code. Rules compose via config layering: define a generic RSS base, then override only the fields that differ per source. You can develop rules conversationally through an AI assistant—paste in your raw content, iterate on the rules until the extraction is right, then save them.
  • AI-powered search & summarization: Ask questions in plain English and the search engine does the right thing. Natural language queries are automatically translated to structured Lucene syntax—“articles about US-Iran nuclear negotiations” becomes a precise boolean query. The AI understands tag, topic, and type fields too—“show me things tagged with bug” just works. Available in the admin UI, and opt-in via the REST API (ai=true) and GraphQL API (enhance: true). Click Summarize to get a concise synthesis of any set of search results. Each user configures their own API key and model preference; usage is tracked per-user with full transparency. Users who know Lucene syntax can still use it directly; the enhancement only activates for natural language.
  • One system, not six: Each service in the typical stack is another deployment, another set of credentials, another failure mode, another thing to keep in sync. Honcho’s tight integration eliminates the boundaries where things break.
Most content platforms are assembled from parts that weren’t designed to work together. Honcho was built as one system from the start—search, storage, ingestion, and API share the same data model, the same transaction, and the same deployment. Your data stays on your infrastructure.
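As an illustration of that config layering, a per-source rules file might look like the following sketch. The document does not show Honcho’s actual DSL syntax, so every key and value here is a hypothetical stand-in.

```yaml
# Hypothetical rules syntax; the real DSL keys may differ.
base: rss-default                  # layer on a generic RSS base config
entry:
  selector: "article.post"         # one entry per matched element
  fields:
    title:     { extract: "h1.headline" }
    author:    { extract: "span.byline", transform: trim }
    published: { extract: "time@datetime", parse: iso8601 }
  fallbacks:                       # tried in order when the primary extractor misses
    published:
      - "meta[property=article:published_time]@content"
```

Because only the overridden fields appear in the per-source file, adding a new site means writing a handful of selectors rather than a full scraper.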

Architecture

Built on standard enterprise infrastructure that any Java team can deploy and maintain. No exotic dependencies, no cloud-specific lock-in, no operational surprises.

  • Runtime: Java 17 on Jetty 12 with Jakarta Servlet.
  • Search engine: Lucene 9 with unified numeric fields, near-real-time search with searcher warm-up, and proper FILTER vs MUST clause handling.
  • API layer: GraphQL for flexible queries, REST for JSON/RSS/Sitemap output, and MCP (Model Context Protocol) for AI assistant integration.
  • Data model: Protocol Buffers for the internal data model and wire format, with a purpose-built JSON encoder for browser and API clients.
  • Instrumentation: Dropwizard Metrics on every significant operation—search latency, indexing throughput, crawl rates, storage I/O.
  • Text analysis: Pluggable stemmer (KStem, Porter, minimal, none), stop words, protected words, ASCII folding, per-index analyzer selection.
  • Custom fields: Define domain-specific indexed fields via configuration with configurable tokenization and case handling.
Every dependency is production-grade open source with a permissive license—Apache 2.0, MIT, or BSD. Deploy anywhere, license freely, and know exactly what you’re running.
Java 17 · Jetty 12 · Jakarta Servlet · Lucene 9 · Protocol Buffers · GraphQL · Java MCP SDK · MariaDB · Maven · Dropwizard Metrics
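A text-analysis and custom-fields configuration might be sketched like this. The option values (KStem stemming, ASCII folding, numeric fields) come from the list above, but the file format and key names are assumptions for illustration only.

```yaml
# Hypothetical configuration keys, shown for illustration.
index:
  analyzer:
    stemmer: kstem             # kstem | porter | minimal | none
    stopWords: default
    protectedWords: [ "NATO", "COVID-19" ]   # never stemmed or folded
    asciiFolding: true
  customFields:
    - name: riskScore
      type: numeric            # usable in range queries and relevance boosts
    - name: ticker
      type: keyword
      caseSensitive: false
```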

Built-in MCP Server

Honcho includes a built-in Model Context Protocol server, so AI assistants like Claude can search, retrieve, and create content directly. Tools cover search, content management, collaboration, memory, and personalized digests. Point any MCP client at the /mcp endpoint and the entire index becomes conversational. See practical use cases →

Tool                      What It Does
search_content            Full-text search with host, author, date range, tag, topic, type, and sort filters. Natural language queries are automatically enhanced to Lucene syntax when LLM integration is enabled.
get_entry                 Retrieve a single entry by UID, or the most recent entry
get_entries               Retrieve multiple entries by UID in a single call (max 25)
find_similar_entries      Find content similar to a given entry
list_hosts                List content sources, optionally filtered
term_frequency            Term frequency statistics for any indexed field
create_entry              Create a new entry with title, content, tags, topics, type, author, and metadata
update_entry              Update fields on an existing entry—replace or append tags/topics, merge metadata
delete_entry              Soft-delete an entry from the database and search index
tag_entries               Add or remove tags from entries matching a search query
list_groups               List your collaborative groups with members and shared feed status
discover_feeds            Discover RSS/Atom feeds from any URL
add_source                Add a content source and enable crawling
add_source_to_group       Share an existing source with a group so all members can access it
remove_source_from_group  Remove a source from a group
send_message              Send a message to a group’s shared feed
share_to_group            Share an existing entry to a group feed, preserving original author
get_status                Account overview—sources, entries, groups, and favorites
save_memory               Save a note that can be recalled in future conversations
recall_memory             Search or list saved memories
get_digest                Get a personalized feed from favorited authors, sources, hosts, and saved searches
list_favorites            List favorites with enabled/disabled status
add_favorite              Follow an author, source, host, or search query
remove_favorite           Remove a favorite from the personalized digest
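A search_content invocation from an MCP client follows the standard MCP tools/call request shape (JSON-RPC 2.0). The argument names below are assumptions based on the filters listed in the table, not a confirmed schema.

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_content",
    "arguments": {
      "query": "articles about US-Iran nuclear negotiations",
      "host": "example.com",
      "tags": ["diplomacy"],
      "sort": "newest"
    }
  }
}
```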
  • Read and write: AI assistants can search and retrieve content, but also create entries, update metadata, manage tags, and curate the index—all through the same MCP interface.
  • Structured retrieval: Content fragments let MCP tools return specific block types—code examples, definitions, key paragraphs—rather than dumping entire documents into the context window.
  • Multi-source aggregation: Hundreds of curated sources through one interface—industry news, internal docs, regulatory updates—without the model needing to know where each piece lives.
  • Real-time content: Continuous crawling and near-real-time indexing. Content is searchable within seconds of arrival.
  • Group collaboration: Multiple users sharing an MCP server can write to shared group feeds. “Summarize this and send it to the research group”—an AI-mediated collaboration channel where each person’s assistant contributes to a shared knowledge pool.
  • AI memory: Assistants can save and recall notes across conversations. “Remember that the client prefers weekly reports on Mondays”—and it’s there next time you ask.
  • Messaging: Lightweight messaging through group feeds. AI assistants can send messages, share summaries, and post updates on behalf of their users.
  • Cross-posting: Share entries across groups while preserving original author attribution. Add commentary when sharing—provenance metadata tracks who shared what and from where.
  • Personalized digest: Users build a personal digest by favoriting authors, sources, hosts, and search queries. “Give me my morning briefing” returns a curated feed of what matters to them, updated in real time.
  • Feed management: AI assistants can discover feeds from any URL and add them as crawled sources—“follow this site” becomes a single conversational command.
  • Multi-user isolation: Each authenticated user sees only their assigned sources. OAuth with DB-backed tokens provides secure, persistent access.

Use Cases

  • Research & analysis: Analysts search across the organization’s entire corpus, combine internal documents with web sources, and share synthesized findings with their team. The AI grounds every answer in actual source material—no hallucination, no guesswork.
  • Team knowledge base: Groups organize teams and content. An analyst summarizes a report and shares it to the research feed. A writer pulls from the archive and sends findings to colleagues. Shared knowledge builds organically through AI-mediated collaboration.
  • Content aggregation and monitoring: Crawl and normalize hundreds of sources—industry publications, competitor blogs, wire services, regulatory feeds—into a single searchable interface. Automated digests deliver briefings from the sources and topics that matter to each user.
  • CMS integration & replication: Pull content from WordPress and other CMS platforms with incremental sync. Pluggable connector API for custom sources. Bulk importers for migrations. Everything lands in the same searchable index.
  • Headless search API: Lucene-quality full-text search as a service. Boolean queries, faceted filtering, configurable relevance, time-range constraints, and custom ranking signals—via GraphQL, REST, and MCP endpoints.
  • Content archiving: Long-term content storage with full-text search and retrieval. Encrypted backups with cloud KMS integration for compliance and retention requirements.

Status

Honcho is actively developed and available for licensing, collaboration, or investment.

Designed for on-premise and private cloud deployment—your documents, indexes, and user data stay on your infrastructure. No external dependencies for core functionality; AI features use your organization’s own API keys with the provider of your choice. If you are interested in licensing, collaboration, or investment, please get in touch.