Draft plugin specification for source adapters, mirroring RFC 001's role for storage backends. Formalizes the contract that six community ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's metadata-only mode have been reinventing ad hoc, so adapter authors can build to a stable surface.
Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim promise with a verifiable one; byte_preserving adapters declare the empty set, declared_lossy adapters enumerate. Existing miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata); no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4): entity hints as json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat closet_hints. Closes existing gap where convo drawers get no closets
- No per-drawer field renames: source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode all preserved. Spec adds adapter_name, adapter_version, privacy_class
§9 enumerates the cleanup PR prerequisites (mempalace/sources/ module, PalaceContext facade, KnowledgeGraph.add_triple gaining backwards-compatible source_drawer_id + adapter_name params).
Tracking issue: #989
RFC 002 — Source Adapter Plugin Specification
- Status: Draft
- Tracking issue: #989
- Related: #274, #23, #169, #232, #567, #98, #591, #592, #702, #981, #244, #419, #300, #952, #389, #434
- Sibling spec: RFC 001 — Storage Backend Plugin Specification
- Spec version: 1.0
Summary
A formal contract for MemPalace source adapters so third parties can ship pip install mempalace-source-<name> packages (Cursor, OpenCode, git, Slack, Notion, email, calendar, Whisper transcripts, …) that drop into mempalace mine without patching core. The spec defines the adapter interface, record shape, metadata schema contract, privacy class, entry-point registration, incremental-ingest semantics, closet integration, a declared-transformation model that replaces the informal "verbatim" promise with a verifiable one, conformance tests, and the refactor of the existing file and conversation miners into first-party adapters on the same contract.
RFC 001 formalized the write side (where drawers are stored). This RFC formalizes the read side (where content comes from). Both are required for MemPalace to function as a durable daemon managing heterogeneous palaces across many source types.
Motivation
Six source ingesters are currently in flight, each solving the same problem a different way:
| PR / Issue | Source | Mechanism |
|---|---|---|
| #274 | Cursor | workspaceStorage/*.vscdb SQLite extraction |
| #23 | OpenCode | SQLite session database |
| #169 | Pi agent | JSONL session normalizer |
| #232 | Cursor (JSONL variant) | JSONL normalizer |
| #567, #98 | Git | git log + gh pr view with structured diff summary |
| #591, #592 | Delphi Oracle | Real-time intelligence signals |
| #702 | Cursor + factory.ai | Combined session miners |
Plus three ingesters already grafted into core:
- mempalace/miner.py: filesystem project miner, fixed char-window chunking, keyword hall routing
- mempalace/convo_miner.py: chat transcript miner with exchange-pair chunking
- mempalace/normalize.py: format detection for four chat-export shapes (Claude Code JSONL, Codex JSONL, Claude.ai / ChatGPT / Slack JSON)
Plus one open proposal for a different ingest semantic:
- #981 — path-level descriptions: mine metadata-as-content instead of raw bytes for matched paths. This is a legitimate third ingest mode (alongside chunked-content and whole-record) that the current architecture has no home for.
Each contributor has reinvented source discovery, source-item identity, incremental-ingest bookkeeping, metadata shape, and chunking strategy. Format detection for new chat exports lands in normalize.py as one more branch in an if chain. There is no shared abstraction, no conformance suite, and no contract new adapter authors can build against.
This is the same situation RFC 001 addresses for storage backends: a pattern that emerged organically, now needs a specification so the community can contribute cleanly and enterprises can build against a stable surface.
Why this matters beyond developer tooling
The adapter pattern is source-agnostic. What has so far shown up as "Cursor transcripts" and "git commits" generalizes to:
- Knowledge work — Notion, Obsidian, Logseq, Google Docs, iA Writer, Zettlr
- Communications — Slack, Discord, Teams, Signal backups, mbox/eml email, iMessage
- Research — arXiv PDFs, Zotero libraries, bookmarked articles, Kindle highlights, web archives
- Creator workflows — YouTube captions, podcast transcripts (Whisper/Deepgram), Descript projects
- Regulated domains — medical records, legal filings, financial statements (all gated on §6 privacy class)
Enterprises key on their own domain metadata — repo/PR/SHA for engineering, patient/encounter/CPT for healthcare, case/docket/jurisdiction for legal. The schema lives in the adapter; the content lives in the drawer. This is how structured-data use cases are served without violating the byte-preservation commitments adapters make.
Goals
- A source adapter ships as a standalone Python package; pip install mempalace-source-<name> is sufficient to use it.
- mempalace mine and the MCP mine tool are source-agnostic: all extraction goes through registered adapters. No if source_type == 'foo' branches in core.
- Content transformations are declared (§1.4): each adapter advertises the set of transformations it applies to source bytes. Byte-preserving adapters declare the empty set. Consumers can programmatically determine what happened to their data.
- Incremental ingest is cheap and correct: re-running mine only touches items whose source-side version changed, using the palace itself as the cursor (no sidecar).
- Each adapter declares a structured metadata schema. Enterprises index and filter on that schema. Core is schema-agnostic beyond the universal fields in §5.1.
- The existing miner.py and convo_miner.py become the first two first-party adapters on the new contract. Drawer metadata fields and field names are preserved: the spec adds fields and does not rename them.
- A privacy class is declarable at the adapter boundary so sensitive sources (medical, financial, personal comms) are handled with explicit policy rather than implicit trust.
Non-goals
- Defining chunking. Each adapter owns its chunking strategy — tree-sitter for code, exchange-pair for chat, whole-record for a PR. Core does not impose a chunk size.
- Defining live-stream / webhook shapes (the Delphi Oracle pattern of continuous signal ingestion). That is a separate future RFC; v1 is pull-mode.
- Defining LLM-based structured extraction. Adapters MAY use an LLM; the spec does not mandate or standardize this.
- Defining cross-adapter dedup. When the same content appears via two adapters (e.g., a PR body mined via git and as a conversation quote mined via claude-code), both drawers land. Deduplication policy is a separate concern handled at query time by searcher.py.
- Defining closet construction. Core continues to build closets from adapter-yielded drawers (§1.7); the closet-building algorithm itself is not part of this spec.
1. Source adapter contract
1.1 Required method
All adapters implement BaseSourceAdapter. Extraction goes through a single keyword-only ingest method; describe_schema is the only other abstract method:
class BaseSourceAdapter(ABC):
    @abstractmethod
    def ingest(
        self,
        *,
        source: SourceRef,
        palace: PalaceContext,
    ) -> Iterator[IngestResult]:
        """Enumerate and extract content from a source.

        Yields a stream of IngestResult values. Lazy adapters yield
        `SourceItemMetadata` ahead of the drawers for that item, so core
        can report progress and check `is_current` before the adapter
        commits to the fetch. Adapters with no lazy-fetch benefit may
        interleave `SourceItemMetadata` and `DrawerRecord` items freely.
        """

    @abstractmethod
    def describe_schema(self) -> AdapterSchema:
        """Declare the structured metadata this adapter attaches.

        Returned value is stable for a given adapter version. Enterprises
        index on this schema; core uses it to validate adapter output.
        """
The single-method ingest() contract was chosen over a discover / extract split. Most current ingesters have no meaningful laziness benefit (filesystem walking is cheap, transcript normalizing is cheap). Adapters that do (git-mine's gh pr list vs gh pr view; hypothetical Slack/Notion API) express laziness by yielding SourceItemMetadata first and deferring fetch until core confirms staleness via is_current().
1.2 Optional methods (default implementations on the ABC)
def is_current(
    self,
    *,
    item: SourceItemMetadata,
    existing_metadata: dict | None,
) -> bool:
    """Return True if the palace already has an up-to-date copy.

    Called by core after querying the palace for existing drawers with
    matching source_file. The adapter compares its version token against
    the stored metadata and returns True to skip extraction.

    Default implementation: returns False (always re-extract). Adapters
    advertising `supports_incremental` override this.
    """
    return False

def source_summary(self, *, source: SourceRef) -> SourceSummary:
    """Describe a source without extracting (e.g., 'git repo mempalace,
    847 commits, 132 PRs'). Default: returns empty summary."""
    return SourceSummary(description=self.name)

def close(self) -> None:
    return None
Core's incremental loop (pseudocode):
for result in adapter.ingest(source=source, palace=ctx):
    if isinstance(result, SourceItemMetadata):
        existing = ctx.collection.get(where={"source_file": result.source_file}, limit=1)
        if adapter.is_current(item=result, existing_metadata=existing):
            ctx.skip_current_item()  # adapter stops yielding drawers for this item
    elif isinstance(result, DrawerRecord):
        ctx.upsert_drawer(result)
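To make the loop concrete, the following is a minimal, self-contained sketch using in-memory stand-ins. ToyContext, its store dict, the skipped flag the adapter consults, and ToyAdapter are all hypothetical names introduced for illustration. The real PalaceContext is defined in §9, and this spec does not say how an adapter observes skip_current_item(), so the flag is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Union


@dataclass(frozen=True)
class SourceItemMetadata:
    source_file: str
    version: str


@dataclass(frozen=True)
class DrawerRecord:
    content: str
    source_file: str
    chunk_index: int = 0
    metadata: dict = field(default_factory=dict)


IngestResult = Union[SourceItemMetadata, DrawerRecord]


class ToyContext:
    """Hypothetical stand-in for PalaceContext, backed by a dict."""

    def __init__(self) -> None:
        self.store = {}       # source_file -> stored drawer metadata
        self.skipped = False  # assumed signal that the adapter reads

    def skip_current_item(self) -> None:
        self.skipped = True

    def upsert_drawer(self, drawer: DrawerRecord) -> None:
        self.store[drawer.source_file] = dict(drawer.metadata)


class ToyAdapter:
    """Lazy adapter: yields metadata first, fetches only if not skipped."""

    def __init__(self, items: dict) -> None:
        self.items = items  # source_file -> (version, content)
        self.fetches = 0

    def ingest(self, *, palace: ToyContext) -> Iterator[IngestResult]:
        for source_file, (version, content) in self.items.items():
            palace.skipped = False
            yield SourceItemMetadata(source_file=source_file, version=version)
            if palace.skipped:  # core reported the item as current
                continue
            self.fetches += 1   # the expensive fetch would happen here
            yield DrawerRecord(content=content, source_file=source_file,
                               metadata={"version": version})

    def is_current(self, *, item: SourceItemMetadata,
                   existing_metadata: Optional[dict]) -> bool:
        if not existing_metadata:
            return False
        return existing_metadata.get("version") == item.version


def mine(adapter: ToyAdapter, ctx: ToyContext) -> None:
    for result in adapter.ingest(palace=ctx):
        if isinstance(result, SourceItemMetadata):
            existing = ctx.store.get(result.source_file)
            if adapter.is_current(item=result, existing_metadata=existing):
                ctx.skip_current_item()
        elif isinstance(result, DrawerRecord):
            ctx.upsert_drawer(result)


adapter = ToyAdapter({"a.txt": ("v1", "hello"), "b.txt": ("v1", "world")})
ctx = ToyContext()
mine(adapter, ctx)  # first run fetches both items
mine(adapter, ctx)  # second run skips both because the versions match
```

The sketch relies on generator suspension: the adapter pauses at the metadata yield, core decides staleness, and the adapter resumes knowing whether to fetch.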
1.3 Typed records
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    """A handle to the source a user wants to ingest.

    local_path is for filesystem-rooted sources (project dir, mbox file).
    uri is for URL-like references (github.com/org/repo, slack://workspace/channel).
    options carries adapter-specific config (non-secret values only; §2.2).
    """
    local_path: str | None = None
    uri: str | None = None
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SourceItemMetadata:
    """Lightweight pointer yielded before drawers for lazy-fetch adapters."""
    source_file: str                     # Logical identity: filesystem path, PR URI, etc.
    version: str                         # Source-side version token (mtime, commit SHA, ETag, rev id).
    size_hint: int | None = None         # Bytes, if known. Used for progress reporting.
    route_hint: RouteHint | None = None


@dataclass(frozen=True)
class DrawerRecord:
    """One drawer's worth of content plus metadata."""
    content: str                         # Subject to §1.4 declared transformations.
    source_file: str                     # Foreign key to SourceItemMetadata.source_file.
    chunk_index: int = 0                 # 0 for single-drawer items; 0..N-1 for chunked items.
    metadata: dict = field(default_factory=dict)  # Flat: str/int/float/bool only. Must conform to adapter schema.
    route_hint: RouteHint | None = None


@dataclass(frozen=True)
class RouteHint:
    wing: str | None = None
    room: str | None = None
    hall: str | None = None


@dataclass(frozen=True)
class SourceSummary:
    description: str
    item_count: int | None = None


# IngestResult is the union type adapters yield.
IngestResult = SourceItemMetadata | DrawerRecord

# PalaceContext carries collection handles, palace config, and progress hooks
# into the adapter. Full definition in §9 (cleanup prerequisite).
1.4 Declared transformations
Adapters cannot silently alter content. Every adapter declares the set of transformations it applies:
class BaseSourceAdapter(ABC):
    declared_transformations: ClassVar[frozenset[str]] = frozenset()
The invariant: no transformation is applied that is not declared in this set. Adapters declaring frozenset() are byte-preserving end-to-end (modulo the read, which may itself involve utf8_replace_invalid — see below).
Reserved transformation names (v1):
| Name | Meaning |
|---|---|
| utf8_replace_invalid | Undecodable bytes replaced with U+FFFD on read (equivalent to open(..., errors="replace")). |
| newline_normalize | CRLF / CR converted to LF. |
| whitespace_trim | Leading / trailing whitespace stripped at a record boundary. |
| whitespace_collapse_internal | Runs of three or more blank lines collapsed to two. |
| line_trim | Each line individually stripped of leading / trailing whitespace. |
| line_join_spaces | Adjacent lines joined with single spaces, newlines discarded. |
| blank_line_drop | Empty lines between non-empty lines dropped. |
| strip_tool_chrome | System tags, hook output, tool UI chrome removed (see normalize.strip_noise). |
| tool_result_truncate | Tool output heads/tails kept; middle replaced with a marker string. |
| spellcheck_user | User turns rewritten by spellcheck. |
| synthesized_marker | Adapter inserts its own strings (e.g., [N lines omitted], [registry] …, Slack provenance footer). |
| speaker_role_assignment | Multi-party speakers alternately assigned user / assistant roles (Slack). |
| tool_result_omitted | Some tool outputs fully omitted from transcript (e.g., Read/Edit/Write results in normalize._format_tool_result). |
Adapters MAY define their own transformation names for behaviors the reserved list does not cover. Third-party names SHOULD be prefixed with the adapter name to avoid collisions (e.g., cursor.composer_ordering).
Capability derivation:
- byte_preserving: declared_transformations is empty AND output bytes equal input bytes for any source the adapter can read. Advertised via the byte_preserving capability (§2.1). MUST be verified by §7.2 round-trip test.
- declared_lossy: declared_transformations is non-empty. The adapter's output is reproducible from source by applying only the declared transformations. MUST be verified by §7.3 declared-transformation test.
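As a rough illustration of the two derivations above, the sketch below implements three of the reserved transformations and checks whether an output is reproducible from its source using only the declared set. The function names, the TRANSFORMS table, and the fixed application ORDER are assumptions made for this example; the spec does not define an ordering, and the actual §7.2 / §7.3 tests may work differently.

```python
from typing import Callable

# Illustrative implementations of three reserved names from the table above.
TRANSFORMS: dict[str, Callable[[str], str]] = {
    "newline_normalize": lambda s: s.replace("\r\n", "\n").replace("\r", "\n"),
    "line_trim": lambda s: "\n".join(line.strip() for line in s.split("\n")),
    "whitespace_trim": lambda s: s.strip(),
}

# Assumed canonical application order (not defined by the spec).
ORDER = ("newline_normalize", "line_trim", "whitespace_trim")


def derive_capability(declared: frozenset) -> str:
    """Empty set is a candidate for byte_preserving; anything else is declared_lossy.

    An empty set alone is not sufficient: the round-trip check must also pass.
    """
    return "declared_lossy" if declared else "byte_preserving"


def reproduces(source: str, output: str, declared: frozenset) -> bool:
    """True if output equals source after applying only declared transformations."""
    expected = source
    for name in ORDER:
        if name in declared:
            expected = TRANSFORMS[name](expected)
    return expected == output


source = "  hello\r\nworld  \r\n"
output = "hello\nworld"
honest = frozenset({"newline_normalize", "whitespace_trim"})
under_declared = frozenset({"newline_normalize"})

print(reproduces(source, output, honest))          # declared set explains the output
print(reproduces(source, output, under_declared))  # trimming was applied but not declared
```

An adapter that trims whitespace while declaring only newline_normalize fails the check, which is the situation TransformationViolationError (§2.7) is meant to surface.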
Existing code mapping (for the cleanup PR):
| Module | Declared transformations |
|---|---|
| filesystem (current miner.py) | utf8_replace_invalid, whitespace_trim |
| conversations (current convo_miner.py + normalize.py) | utf8_replace_invalid, newline_normalize, line_trim, line_join_spaces, blank_line_drop, whitespace_collapse_internal, strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment |
The filesystem adapter is nearly byte-preserving today; the conversations adapter is extensively transformed. Both are honest after this spec lands because both are fully declared.
This replaces the MISSION.md promise of "verbatim always" with a stronger one: every adapter publishes what it does to your data, and the conformance suite verifies it hasn't lied. "Verbatim" becomes a capability some adapters hold (byte_preserving), not a global claim about a lossy pipeline.
1.5 Three ingest modes
A single adapter declares one or more of three modes via a class attribute:
class BaseSourceAdapter(ABC):
    supported_modes: ClassVar[frozenset[Literal["chunked_content", "whole_record", "metadata_only"]]]
| Mode | Content origin |
|---|---|
| chunked_content | Source bytes, split into chunks the adapter chooses (current filesystem behavior). |
| whole_record | Source bytes, one drawer per source item (e.g., PR → 1 drawer). |
| metadata_only | Synthesized description of a source item (absorbs #981). The description bytes are authored by the user or adapter, not the source. Declared transformations (§1.4) do not apply because content is not derived from source bytes. |
metadata_only resolves #981: description-mode matches a path pattern and produces one drawer whose content is the user-authored description rather than the file contents. Conformance tests (§7.2, §7.3) skip metadata_only records.
An adapter MAY support multiple modes and select per-item; the per-item mode is recorded in metadata["ingest_mode"] (§5.1). This field already exists on conversation drawers (convo_miner.py:346) and is the only existing field whose semantics this spec extends rather than preserves.
1.6 Chunking delegation
Core does not impose chunking. miner.py's 800-character sliding window is the filesystem adapter's default for unknown file types — not a contract. Adapter authors choose what makes sense:
- Code files → tree-sitter function/class boundaries (future enhancement to the filesystem adapter).
- Chat transcripts → exchange pairs (current convo_miner.py behavior).
- PRs → whole-record (current git-mine behavior in #567).
- PDFs → page or section.
- Voice transcripts → speaker turn.
The sole cross-adapter requirement for chunked_content mode: chunks for a given source_file, re-assembled in chunk_index order and accounting for declared transformations in §1.4, reproduce the adapter's internal representation of the source. The conformance suite verifies this.
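The reassembly requirement can be pictured with a small sketch. It assumes non-overlapping fixed-size windows, which may not match miner.py's actual sliding window, and the helper names chunk_fixed and reassemble are hypothetical.

```python
def chunk_fixed(text: str, size: int = 800) -> list:
    """Split text into non-overlapping windows as (chunk_index, content) pairs."""
    return [(i // size, text[i:i + size]) for i in range(0, len(text), size)]


def reassemble(chunks: list) -> str:
    """Concatenate chunk contents in chunk_index order."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return "".join(content for _, content in ordered)


text = "abc" * 1000                  # 3000 characters
chunks = chunk_fixed(text)           # indices 0..3; the last chunk is shorter
shuffled = list(reversed(chunks))    # storage order need not match source order
```

With overlapping windows the check would have to drop the overlap before concatenating; the property being tested stays the same.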
1.7 Closet integration
Closets are the AAAK-compressed index layer (palace.build_closet_lines, upsert_closet_lines) that points to drawer content and enables LLM-scale scanning without reading every drawer. Closet-building is not an adapter concern:
- Core builds closets from adapter-yielded drawers as a post-step, via the existing palace.py helpers. Adapters do not call these APIs.
- Adapters MAY emit closet hints in drawer metadata via a flat ;-joined string:
  metadata["closet_hints"] = "decided GraphQL; migrated to Postgres; fixed PR-567"
  Core splits on ; and feeds these as candidate topics alongside the content-scanned ones in build_closet_lines. The git adapter can hint decision-signal quotes that raw content-scanning would miss; the conversations adapter can hint section headers; the filesystem adapter has no need and omits the field.
- metadata_only drawers get closets too. Core builds them from the synthesized description content the same way it builds closets for any other drawer. This is how #981's path-level descriptions become searchable.
- Closet purging remains keyed on source_file (purge_file_closets in palace.py:221). Adapters' source_file values must be stable so purge is correct on re-ingest.
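The closet_hints encoding from the list above is simple enough to sketch. The helper names below are hypothetical, and rejecting hints that contain the delimiter is an assumed policy, since the spec does not define escaping.

```python
def join_closet_hints(hints: list) -> str:
    """Encode hints as a flat ';'-joined string suitable for drawer metadata."""
    for hint in hints:
        if ";" in hint:
            # Assumed policy: refuse hints that would be split on decode.
            raise ValueError(f"hint contains the delimiter: {hint!r}")
    return "; ".join(hint.strip() for hint in hints)


def split_closet_hints(raw: str) -> list:
    """Decode the flat string into candidate topics, dropping empty pieces."""
    return [part.strip() for part in raw.split(";") if part.strip()]


raw = "decided GraphQL; migrated to Postgres; fixed PR-567"
topics = split_closet_hints(raw)
```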
Current convo_miner.py does not build closets for conversation drawers — an existing gap. The cleanup PR (§9) routes the conversations adapter through the same post-step closet builder as filesystem, closing the gap as a side effect.
2. Adapter contract
2.1 Identity and capabilities
class BaseSourceAdapter(ABC):
    name: ClassVar[str]                                 # "filesystem", "cursor", "git", "slack", ...
    spec_version: ClassVar[str] = "1.0"
    adapter_version: ClassVar[str]                      # Independent of spec_version; recorded on every drawer.
    capabilities: ClassVar[frozenset[str]]
    supported_modes: ClassVar[frozenset[str]]           # Per §1.5.
    declared_transformations: ClassVar[frozenset[str]]  # Per §1.4.
    default_privacy_class: ClassVar[str]                # Per §6.
Defined capability tokens (v1):
| Token | Meaning |
|---|---|
| byte_preserving | declared_transformations is empty AND extracted content equals source bytes. |
| supports_incremental | Implements is_current() meaningfully; ingest() respects ctx.skip_current_item(). |
| supports_structured_metadata | Attaches fields beyond §5.1 universals. |
| supports_entity_hints | Emits entity hints via metadata["entity_hints_json"] (§5.4). |
| supports_kg_triples | Writes knowledge-graph triples directly to the SQLite KG (§5.5). |
| supports_closet_hints | Emits metadata["closet_hints"] (§1.7). |
| requires_auth | Needs credentials at runtime (env vars, §4.2). |
| requires_external_service | Needs a running service (Slack API, email server). |
| requires_local_tool | Needs a local binary (gh, rg, whisper). |
| adapter_owns_routing | Returns authoritative RouteHint values from ingest() that core uses as-is (§2.5). |
| respects_privacy_class | Honors §6 privacy-class filtering. |
Capability tokens are free-form strings; third-party adapters MAY declare novel tokens for their ecosystem. Core only inspects the above.
2.2 Source references
See SourceRef in §1.3. The shape is deliberately open — adapters parse uri and options as they see fit. Core does not canonicalize URIs.
Secrets in SourceRef.options: credentials MUST NOT be placed in options. The spec reserves options for non-secret values (paths, filters, date ranges). Secrets come from env vars per §4.2. An adapter that reads a credential from options violates the spec and MUST be rejected by the conformance suite.
2.3 Lifecycle
- __init__: lightweight. No I/O, no network, no credential fetch.
- First call to ingest: may open resources. All I/O is lazy.
- close(): releases all resources. After close(), further calls MUST raise AdapterClosedError.
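One way to honor these three lifecycle rules is sketched below. LazyAdapter, _open, and _iter_items are hypothetical names, and the object() resource is a placeholder for whatever handle a real adapter would hold.

```python
from typing import Iterator


class AdapterClosedError(RuntimeError):
    """Raised when a method is called after close()."""


class LazyAdapter:
    def __init__(self) -> None:
        self._resource = None  # nothing is opened in __init__
        self._closed = False

    def _open(self):
        if self._closed:
            raise AdapterClosedError("adapter is closed")
        if self._resource is None:
            self._resource = object()  # placeholder for a DB handle or API client
        return self._resource

    def ingest(self) -> Iterator[str]:
        # Plain method, not a generator, so the closed check runs on the call
        # itself rather than on the first next().
        resource = self._open()
        return self._iter_items(resource)

    def _iter_items(self, resource) -> Iterator[str]:
        yield from ()  # a real adapter would yield IngestResult values here

    def close(self) -> None:
        self._resource = None
        self._closed = True


adapter = LazyAdapter()
opened_before_ingest = adapter._resource is not None
items = list(adapter.ingest())
opened_after_ingest = adapter._resource is not None
adapter.close()
```

Splitting ingest into an eager wrapper and a generator helper is a design choice: a generator body does not execute until iterated, so a closed check placed inside it would be deferred.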
2.4 Concurrency
An adapter instance is long-lived and serves many mine operations. Adapters MUST be thread-safe for concurrent ingest calls across different SourceRef values. MemPalace core serializes calls within a single SourceRef unless an adapter advertises supports_parallel_ingest (not in v1 — reserved for v1.1).
2.5 Routing
Routing is the adapter's responsibility. The filesystem adapter reads mempalace.yaml (hall keywords, rooms list) via MempalaceConfig() and returns RouteHint(wing=..., room=..., hall=...) on each drawer. This relocates detect_room() and detect_hall() (currently in miner.py and convo_miner.py) into their respective adapters.
Order of precedence for routing:
- Explicit --wing / --room CLI flags → passed through SourceRef.options → adapter honors verbatim.
- Palace config match (mempalace.yaml hall keywords, room keywords) → adapter computes.
- Adapter-internal fallback (e.g., filesystem adapter falls back to "general" room).
Adapters advertising adapter_owns_routing return the final answer; core uses it verbatim. Adapters not advertising it return None and core applies a generic fallback router (writing to wing default, room general, hall general). Absent any adapter, this is how mempalace mine behaves today.
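The precedence order can be sketched for the room dimension as follows. The "room" options key and the room_keywords mapping shape are assumptions for illustration; they are not the actual mempalace.yaml layout or CLI plumbing.

```python
def resolve_room(options: dict, text: str, room_keywords: dict) -> str:
    """Pick a room using the three-step precedence described above."""
    # 1. Explicit CLI flag, assumed to arrive as options["room"].
    if options.get("room"):
        return options["room"]
    # 2. Palace config match: first room with a keyword present in the text.
    lowered = text.lower()
    for room, keywords in room_keywords.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return room
    # 3. Adapter-internal fallback.
    return "general"


keywords = {"auth": ["login", "oauth"], "billing": ["invoice"]}
explicit = resolve_room({"room": "release-notes"}, "fix login bug", keywords)
matched = resolve_room({}, "Fix the Login redirect", keywords)
fallback = resolve_room({}, "update README", keywords)
```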
2.6 Incremental ingest
is_current() is the incremental-ingest primitive. The palace itself is the cursor — no separate persisted state. Correctness requirements:
- The adapter's SourceItemMetadata.source_file MUST be stable across re-ingests of the same logical item. Filesystem adapter uses the absolute path (as today). Git adapter uses a URI shape like github.com/org/repo#pr=567 or github.com/org/repo#commit=abc123.
- is_current() returns True when the stored metadata matches the adapter's current version token. The default implementation returns False (always re-extract); adapters advertising supports_incremental override it.
- Deletion tombstones: an adapter MAY yield a SourceItemMetadata(source_file=..., version="__deleted__") entry; core purges drawers with matching source_file and builds no new drawers for that item. Advertised via supports_deletion_tombstones.
- Adapters without supports_incremental ignore is_current() and fully re-extract. Core logs a warning.
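A filesystem-style staleness decision, including the tombstone case, might look like the sketch below. Comparing the version token against a stringified source_mtime is an assumption about how the token is stored, and plan_item is a hypothetical helper rather than part of the contract.

```python
from typing import Optional

DELETED = "__deleted__"


def is_current(version: str, existing_metadata: Optional[dict]) -> bool:
    """True when the stored mtime matches the adapter's current version token."""
    if not existing_metadata:
        return False
    return str(existing_metadata.get("source_mtime")) == version


def plan_item(version: str, existing_metadata: Optional[dict]) -> str:
    """Decide what core should do with one source item."""
    if version == DELETED:
        return "purge"    # tombstone: drop drawers, build nothing new
    if is_current(version, existing_metadata):
        return "skip"
    return "extract"


stored = {"source_mtime": 1712000000.0}
decisions = [
    plan_item("1712000000.0", None),     # never ingested
    plan_item("1712000000.0", stored),   # unchanged
    plan_item("1712000500.0", stored),   # modified since last run
    plan_item(DELETED, stored),          # removed at the source
]
```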
2.7 Errors
- SourceNotFoundError: the SourceRef does not resolve.
- AuthRequiredError: adapter needs credentials; raises with a message describing which env vars to set.
- AdapterClosedError: method called after close().
- TransformationViolationError: conformance suite raises this when the content round-trip requires an undeclared transformation.
- SchemaConformanceError: a DrawerRecord.metadata is missing required fields declared in describe_schema() or violates declared types.
3. Registration and discovery
3.1 Entry points (primary mechanism)
Third-party adapters ship as installable packages:
# pyproject.toml of mempalace-source-cursor
[project.entry-points."mempalace.sources"]
cursor = "mempalace_source_cursor:CursorAdapter"
MemPalace discovers adapters at process start via importlib.metadata.entry_points(group="mempalace.sources").
3.2 In-tree registry (secondary)
from mempalace.sources.registry import register
register("my-experimental-adapter", MyAdapter)
Entry-point discovery and explicit register() populate the same registry. Explicit registration wins on name conflict.
3.3 Selection (explicit only — no auto-detect)
Unlike storage backends (RFC 001 §3.3), source adapters are never auto-detected. The user selects the adapter explicitly:
mempalace mine --source cursor ~/ # explicit adapter
mempalace mine --source git /path/to/repo # explicit adapter
mempalace mine --source filesystem /path/to/project # explicit adapter
mempalace mine /path/to/project # implicit: filesystem (default)
The default when no --source is given is filesystem, preserving current mempalace mine <path> behavior.
Backwards compatibility with --mode. Current cli.py:517-519 exposes --mode {projects,convos}. This spec maps:
- --mode projects → --source filesystem (the new default)
- --mode convos → --source conversations
--mode stays as a deprecated alias through v4.x with a deprecation warning on use; removed in v5.0.
Auto-detection would be hostile — a directory containing a .git folder, a workspaceStorage/ subdir, and an mbox file is not a signal of user intent.
4. Configuration
4.1 Shape
{
  "sources": {
    "my-cursor": {
      "type": "cursor",
      "workspace_storage": "~/Library/Application Support/Cursor/User/workspaceStorage"
    },
    "my-git": {
      "type": "git",
      "repos": ["/projects/mempalace", "/projects/site"]
    }
  },
  "palaces": {
    "work": {
      "sources": ["my-git"],
      "privacy_floor": "internal"
    },
    "personal": {
      "sources": ["my-cursor"]
    }
  }
}
Single-user local mode: config is optional. mempalace mine <path> with no config uses the filesystem adapter and defaults.
4.2 Environment variables
- MEMPALACE_SOURCE_<NAME>_*: per-adapter secrets and connection info. Examples: MEMPALACE_SOURCE_SLACK_TOKEN, MEMPALACE_SOURCE_NOTION_API_KEY, MEMPALACE_SOURCE_GIT_GITHUB_TOKEN.
- Secrets MUST be readable from env vars; config files carry structure, env vars carry credentials. Same rule as RFC 001 §4.2.
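An adapter could collect its own variables with a helper along these lines. The function name adapter_env, the uppercasing of the adapter name, and the hyphen-to-underscore mapping are assumptions; the spec only fixes the MEMPALACE_SOURCE_<NAME>_ prefix shape.

```python
import os
from typing import Mapping, Optional


def adapter_env(name: str, environ: Optional[Mapping] = None) -> dict:
    """Return this adapter's env settings, keyed by the lowercased suffix."""
    environ = os.environ if environ is None else environ
    prefix = "MEMPALACE_SOURCE_" + name.upper().replace("-", "_") + "_"
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


sample = {
    "MEMPALACE_SOURCE_SLACK_TOKEN": "xoxb-example",
    "MEMPALACE_SOURCE_GIT_GITHUB_TOKEN": "ghp-example",
    "HOME": "/home/user",
}
slack = adapter_env("slack", sample)
git = adapter_env("git", sample)
```

Passing the mapping explicitly keeps the helper testable without mutating the process environment.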
4.3 Adapter-specific options
SourceRef.options is a free-form dict of non-secret values (§2.2). Each adapter documents its accepted keys. Unknown keys MUST be ignored (forward compatibility); the adapter MAY log a warning.
5. Metadata schema contract
5.1 Universal fields
Existing drawer metadata fields are preserved — the spec adds the following:
| New field | Type | Added by | Purpose |
|---|---|---|---|
| adapter_name | str | core, from BaseSourceAdapter.name | Which registered source produced this drawer. |
| adapter_version | str | adapter | Adapter's own version (distinct from palace normalize_version). Enables re-extract workflows targeted at drawers from a known-buggy adapter version. |
| privacy_class | str | adapter default, config override | Per §6. |
Existing fields retain their current semantics (verified against miner.py:542-561 and convo_miner.py:338-350):
| Existing field | Role in the spec |
|---|---|
| source_file | Functions as the adapter's source-item identifier. Adapter defines the shape: a filesystem path for filesystem, a URI like github.com/org/repo#pr=123 for git. MUST be stable across re-ingests of the same logical item. |
| source_mtime | Functions as the source-item version for filesystem. Adapters without mtime semantics MAY omit this field and use a different version discriminator (e.g., commit SHA in a separate metadata["commit_sha"] field); the spec only requires that is_current() can decide staleness from the stored metadata. |
| filed_at | When the record was written. ISO-8601 string. |
| added_by | Agent name (e.g., lumi, claude-code). Orthogonal to adapter_name: the agent is who triggered mining; the adapter is how data was extracted. |
| wing, room, hall | Palace routing. Populated by adapter per §2.5. |
| chunk_index | Per §1.6. Always 0 for whole_record / metadata_only. |
| normalize_version | Palace-wide schema version (currently palace.py:50). Unchanged. Separate from adapter_version. |
| entities | Semicolon-joined candidate entity names. Already flat; kept flat (§5.4 adds structured hints alongside it rather than replacing it). |
| ingest_mode | Per §1.5. Already on conversation drawers; added to filesystem drawers by the cleanup PR. |
| extract_mode | Conversation-adapter-specific (exchange vs general). Moves into the conversations adapter's declared schema per §5.2. |
Nothing is renamed. Nothing is removed. The spec formalizes the shape ingesters already converge on. Existing where={"source_file": ...} queries in searcher.py, palace.py, and callers keep working.
Chroma metadata constraint: all metadata values MUST be str | int | float | bool. No lists, no nested dicts. This matches RFC 001 §1.4 and the underlying ChromaDB contract. Structured side-data goes to the SQLite knowledge graph (§5.5) or to a declared flat JSON-encoded string field (§5.4).
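A check for the flat-value constraint is short. The helper name non_flat_keys is hypothetical; the allowed types are exactly the four listed above.

```python
FLAT_TYPES = (str, int, float, bool)


def non_flat_keys(metadata: dict) -> list:
    """Return the keys whose values are not str, int, float, or bool."""
    return [key for key, value in metadata.items()
            if not isinstance(value, FLAT_TYPES)]


good = {"source_file": "/repo/a.py", "chunk_index": 0, "score": 0.5, "pinned": True}
bad = {"source_file": "/repo/a.py", "tags": ["x", "y"], "extra": {"k": 1}, "note": None}
```

Note that None is rejected as well, so optional fields are omitted rather than stored as null.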
5.2 Adapter schemas
Each adapter returns an AdapterSchema from describe_schema():
@dataclass(frozen=True)
class AdapterSchema:
    fields: dict[str, FieldSpec]   # Keyed by metadata key.
    version: str


@dataclass(frozen=True)
class FieldSpec:
    type: Literal["string", "int", "float", "bool", "delimiter_joined_string", "json_string"]
    required: bool
    description: str
    indexed: bool = False          # Hint to backends that can build indexes (RFC 001 §2.1).
    # delimiter_joined_string: the delimiter character (default ";").
    delimiter: str = ";"
    # json_string: the JSON schema of the encoded object (informational only).
    json_schema: dict | None = None
delimiter_joined_string covers the entities shape (current ;-joined list of names). json_string is the escape hatch for adapters needing to pack nested data — the value stored is still a single flat str from Chroma's perspective, but the adapter is allowed to document its parsed shape.
Example for a hypothetical slack adapter:
AdapterSchema(
    version="1.0",
    fields={
        "channel_name": FieldSpec(type="string", required=True, description="Slack channel name", indexed=True),
        "channel_id": FieldSpec(type="string", required=True, description="Slack channel ID"),
        "thread_ts": FieldSpec(type="string", required=False, description="Thread root timestamp"),
        "author_id": FieldSpec(type="string", required=True, description="Slack user ID", indexed=True),
        "author_name": FieldSpec(type="string", required=True, description="Display name at extraction time"),
        "reactions": FieldSpec(type="delimiter_joined_string", required=False, description="Emoji shortcodes"),
    },
)
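How core might validate a drawer's metadata against such a schema is sketched below. The FieldSpec here is trimmed to the two attributes the check needs, and the strictness choices (no int accepted for float, bool rejected for int) are assumptions rather than spec requirements.

```python
from dataclasses import dataclass


class SchemaConformanceError(ValueError):
    """Metadata is missing a required field or has a wrongly typed value."""


@dataclass(frozen=True)
class FieldSpec:
    type: str
    required: bool


PY_TYPES = {
    "string": str, "int": int, "float": float, "bool": bool,
    "delimiter_joined_string": str, "json_string": str,
}


def validate(metadata: dict, fields: dict) -> None:
    """Raise SchemaConformanceError if metadata violates the declared fields."""
    for key, spec in fields.items():
        if key not in metadata:
            if spec.required:
                raise SchemaConformanceError(f"missing required field: {key}")
            continue
        value, expected = metadata[key], PY_TYPES[spec.type]
        # bool is a subclass of int, so reject it unless the field is bool.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            raise SchemaConformanceError(f"{key}: expected {spec.type}")


fields = {
    "channel_id": FieldSpec(type="string", required=True),
    "thread_ts": FieldSpec(type="string", required=False),
}
validate({"channel_id": "C01234"}, fields)  # passes: optional field may be absent
```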
5.3 Enterprise keying
The adapter schema is the stable surface enterprises filter on. A support team querying the palace for channel_id = "C01234" does not care about ChromaDB's internal representation. The schema field is declared by the adapter, indexed by the backend (RFC 001 §2.1 supports_metadata_filters), and exposed through the existing where= clause.
This is how "structured data" serves company use cases without breaking transformation guarantees: declared-transformation content in the drawer, structured fields in the metadata, schema declared by the adapter, filtering done by the backend.
5.4 Entity hints (optional)
Adapters with supports_entity_hints MAY include:
metadata["entity_hints_json"] = '[{"type":"person","name":"Milla Jovovich","confidence":0.95,"offset":120},{"type":"project","name":"MemPalace","confidence":1.0,"offset":0}]'
The value is a JSON-encoded string (type json_string in the adapter schema). Core parses on read and feeds into mempalace/entity_detector.py as a prior: hints with confidence >= 0.9 bypass the heuristic detector; lower-confidence hints feed into it as candidates.
This is additive to the existing flat entities field: entity_hints_json carries structure (type, confidence, offset); entities remains the Chroma-indexable flat string. An adapter that produces entity_hints_json MUST also populate entities as the flat name-only projection, so existing filter queries keep working.
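The split at the 0.9 threshold and the flat projection can be sketched as follows. The function names are hypothetical, and the sketch does not call mempalace/entity_detector.py; it only shows the data handling on the hint string.

```python
import json


def split_hints(raw: str, threshold: float = 0.9) -> tuple:
    """Partition hints into (trusted, candidates) around the confidence threshold."""
    hints = json.loads(raw)
    trusted = [hint for hint in hints if hint["confidence"] >= threshold]
    candidates = [hint for hint in hints if hint["confidence"] < threshold]
    return trusted, candidates


def flat_entities(raw: str, delimiter: str = ";") -> str:
    """Project hints to the flat, name-only entities string (order kept, no duplicates)."""
    names = []
    for hint in json.loads(raw):
        if hint["name"] not in names:
            names.append(hint["name"])
    return delimiter.join(names)


raw = json.dumps([
    {"type": "person", "name": "Milla Jovovich", "confidence": 0.95, "offset": 120},
    {"type": "project", "name": "MemPalace", "confidence": 1.0, "offset": 0},
    {"type": "project", "name": "Postgres", "confidence": 0.4, "offset": 310},
])
trusted, candidates = split_hints(raw)
entities = flat_entities(raw)
```

The third hint (Postgres, confidence 0.4) is invented for the example to exercise the low-confidence branch.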
5.5 Knowledge-graph triples (optional)
Adapters with supports_kg_triples write directly to the SQLite knowledge graph via mempalace/knowledge_graph.py — not to drawer metadata. Chroma cannot store structured triples; the KG already exists for this purpose.
The adapter calls the existing KnowledgeGraph.add_triple() (signature verified against mempalace/knowledge_graph.py:130):
palace.kg.add_triple(
    subject="Ben",
    predicate="committed",
    obj="PR-567",                    # `object` is a Python builtin; the API uses `obj`.
    valid_from="2026-03-12",
    confidence=1.0,
    source_file=drawer.source_file,  # Existing provenance parameter.
)
Drawer metadata includes a flat counter — metadata["kg_triples_count"]: int — so search consumers can see at a glance that KG side-data exists for a drawer without hitting SQLite.
The existing API has source_closet and source_file provenance parameters but no source_drawer_id or adapter_name. The cleanup PR (§9) should add these two optional parameters to add_triple() so adapter-written triples can be traced back to (a) the specific drawer that produced them and (b) the adapter that authored them — necessary for re-extraction workflows. Until that lands, adapters use source_file as the provenance key and record adapter authorship via a separate table or a predicate naming convention (e.g., adapter:git:committed).
This aligns with the existing architecture in CLAUDE.md ("Knowledge Graph: ENTITY → PREDICATE → ENTITY with valid_from / valid_to dates") — the RFC formalizes the adapter-side write path.
5.6 Source encoding and newline
Current ingesters handle encoding lossily (errors="replace" in miner.py:595 and normalize.py:124) and do not record original encoding. The spec does not require per-drawer source_encoding / source_newline — most runs are uniform UTF-8 / LF, and storing the same value on every drawer wastes bytes.
Instead: adapters that handle non-UTF-8 or non-LF sources record the values once on the adapter's SourceSummary and per-drawer only when a specific drawer diverges from the adapter default. The utf8_replace_invalid declared transformation (§1.4) already communicates that lossy decoding happened; specific drawer-level provenance is opt-in.
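The "record once, override only on divergence" rule can be sketched as follows. The function name and the idea of storing the raw newline sequence as the `source_newline` value are assumptions; the spec only names the two fields.

```python
def drawer_encoding_metadata(
    drawer_encoding: str,
    drawer_newline: str,
    default_encoding: str = "utf-8",
    default_newline: str = "\n",
) -> dict:
    """Return per-drawer provenance fields only where the drawer diverges.

    The defaults stand in for the values recorded once on the adapter's
    SourceSummary; a drawer matching them gets no extra metadata.
    """
    metadata = {}
    if drawer_encoding.lower() != default_encoding.lower():
        metadata["source_encoding"] = drawer_encoding
    if drawer_newline != default_newline:
        metadata["source_newline"] = drawer_newline
    return metadata


uniform = drawer_encoding_metadata("utf-8", "\n")
divergent = drawer_encoding_metadata("latin-1", "\r\n")
```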
6. Privacy class
6.1 Defined levels
| Level | Meaning | Example sources |
|---|---|---|
| public | Content intended for public consumption. | arXiv papers, public GitHub repos, published blogs. |
| internal | Organizational content, not for public disclosure. | Corporate Slack, internal Notion, private git repos. |
| pii_potential | May contain personally identifiable information. | Email, iMessage, Claude/ChatGPT transcripts. |
| sensitive | Known to contain PII, financial, or health data. | Medical records, financial statements, legal filings. |
| secrets_possible | May contain credentials or secrets. | Git history, environment dumps, CI logs. |
An adapter declares a default on BaseSourceAdapter.default_privacy_class. Users MAY override per-source in config.
6.2 Enforcement
- Each palace declares a privacy_floor. Drawers above the floor (equal to or laxer) are admitted; drawers below are rejected at write time and surfaced in a rejected list on the CLI and MCP tool.
- Default floor: none. v1 accepts all levels unless the palace explicitly configures a floor. This keeps the single-user local default low-friction (users who run mempalace mine on a git repo expect secrets_possible drawers to land). Enterprise deployments MUST set a floor; docs for regulated-domain setup will recommend starting strict and relaxing as needed.
- Search results surface privacy_class in result metadata. MCP tool wrappers MAY redact results above a caller-declared ceiling.
- secrets_possible drawers SHOULD pass through a secrets-scan pre-index hook when one is available. PR #389 (sensitive content scanner) is the expected enforcement mechanism for v1; until it lands, secrets_possible is a label without automated scanning. The label is still useful: it enables floor-based rejection and alerts downstream consumers.
- The privacy class is recorded in drawer metadata and cannot be downgraded without a migration log entry, matching RFC 001's embedder-identity pattern.
Privacy class is how a regulated-domain deployment (medical, legal, financial) can use MemPalace safely. Without it, flexible ingest becomes a liability; with it, ingest is scoped by policy.
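Floor-gated admission might look like the sketch below. It assumes the five levels form a single linear order, laxest to strictest, in the order the table in §6.1 lists them; the spec does not state that ordering explicitly, and the function names are hypothetical.

```python
# Assumed ordering: laxest first, strictest last (table order in §6.1).
PRIVACY_LEVELS = ["public", "internal", "pii_potential", "sensitive", "secrets_possible"]


def is_admitted(privacy_class: str, floor=None) -> bool:
    """Admit a drawer when it is equal to or laxer than the palace floor.

    A floor of None models the v1 default: every level is accepted.
    """
    if floor is None:
        return True
    return PRIVACY_LEVELS.index(privacy_class) <= PRIVACY_LEVELS.index(floor)


def partition_by_floor(drawers, floor=None):
    """Split drawer metadata dicts into (admitted, rejected) lists."""
    admitted, rejected = [], []
    for drawer in drawers:
        target = admitted if is_admitted(drawer["privacy_class"], floor) else rejected
        target.append(drawer)
    return admitted, rejected


drawers = [
    {"id": "d1", "privacy_class": "internal"},
    {"id": "d2", "privacy_class": "secrets_possible"},
]
admitted, rejected = partition_by_floor(drawers, floor="pii_potential")
```

Returning the rejected drawers rather than dropping them silently matches the requirement that rejections be surfaced on the CLI and MCP tool.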
7. Testing contract
7.1 The abstract suite
MemPalace ships mempalace.sources.testing.AbstractSourceAdapterContractSuite — a pytest mixin. Every adapter package ships a concrete subclass:
import pytest

from mempalace.sources.testing import AbstractSourceAdapterContractSuite


class TestCursorAdapter(AbstractSourceAdapterContractSuite):
    @pytest.fixture
    def adapter(self):
        return CursorAdapter()

    @pytest.fixture
    def fixture_source(self, tmp_path):
        """Build a minimal Cursor workspaceStorage fixture."""
        ...
        return SourceRef(local_path=str(tmp_path))

    @pytest.fixture
    def canonical_source_bytes(self, fixture_source):
        """Return a mapping of source_file -> authoritative bytes.

        For filesystem sources: the file's raw bytes.
        For SQLite sources: the extracted value column bytes for each row.
        For API sources: the canonical HTTP response body bytes.
        Adapter-defined: the adapter knows what its 'source bytes' are.
        """
        ...
The suite covers:
- ingest yields items with stable source_file and well-formed version.
- is_current() returns True when metadata matches, False when it differs.
- close() releases resources; subsequent calls raise AdapterClosedError.
- Unicode content and unicode identifiers are preserved end-to-end.
- Large-source handling: 10k+ items ingest without loading all into memory.
- Error paths: SourceNotFoundError, AuthRequiredError raise with correct types.
- SourceRef.options MUST NOT contain secrets: the adapter raises if it detects a value matching a common-secret pattern (GitHub token prefix, Slack token prefix, etc.). Advisory test, not blocking.
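The advisory secrets check on SourceRef.options could be approximated as below. The pattern table and the function name are illustrative assumptions, not the suite's actual rules; the prefixes are the publicly documented GitHub (`ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_`, `github_pat_`) and Slack (`xoxb-`, `xoxp-`, and siblings) token prefixes.

```python
import re

# Illustrative prefix patterns only; a real scanner would carry a longer list.
SECRET_PATTERNS = {
    "github_token": re.compile(r"^(gh[pousr]_|github_pat_)"),
    "slack_token": re.compile(r"^xox[abprs]-"),
}


def find_secret_like_options(options: dict) -> list:
    """Return (option_key, pattern_label) pairs for values that look like secrets."""
    hits = []
    for key, value in options.items():
        if not isinstance(value, str):
            continue  # only string values can match a token prefix
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.match(value):
                hits.append((key, label))
    return hits


suspicious = find_secret_like_options({"token": "ghp_example", "channel": "C01234"})
clean = find_secret_like_options({"workspace": "acme", "limit": 10})
```

An adapter would raise on a non-empty result; because the test is advisory, a false positive should be reported rather than block a release.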
7.2 Byte-preserving round-trip (for byte_preserving adapters only)
Required for adapters advertising byte_preserving:
def test_byte_preserving_round_trip(self, adapter, fixture_source, canonical_source_bytes):
    """Concatenated chunks must equal the canonical source bytes.

    For each source_file in the fixture:
    1. Read canonical_source_bytes[source_file].
    2. Collect all DrawerRecords for that source_file from adapter.ingest(...).
       Skip metadata_only drawers (§1.5).
    3. Sort by chunk_index.
    4. Concatenate record.content values.
    5. Assert equality with the canonical bytes (UTF-8 decoded).
    """
Failure raises TransformationViolationError.
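The five steps in the docstring can be sketched as a standalone check. The `DrawerRecord` dataclass and the exception class here are simplified stand-ins defined locally for illustration (the real types live in the adapter contract), and treating `ingest_mode == "metadata_only"` as the skip condition is an assumption.

```python
from dataclasses import dataclass


class TransformationViolationError(Exception):
    """Local stand-in for the spec's error type."""


@dataclass
class DrawerRecord:
    """Simplified stand-in carrying only the fields the check needs."""
    source_file: str
    chunk_index: int
    content: str
    ingest_mode: str = "chunked"


def check_byte_preserving(records, canonical_source_bytes):
    """Raise if concatenated chunk content differs from the canonical bytes."""
    for source_file, raw in canonical_source_bytes.items():
        chunks = sorted(
            (r for r in records
             if r.source_file == source_file and r.ingest_mode != "metadata_only"),
            key=lambda r: r.chunk_index,
        )
        rebuilt = "".join(r.content for r in chunks)
        if rebuilt != raw.decode("utf-8"):
            raise TransformationViolationError(source_file)


records = [
    DrawerRecord("a.txt", 1, "world\n"),
    DrawerRecord("a.txt", 0, "hello "),
    DrawerRecord("a.txt", 2, "a description", ingest_mode="metadata_only"),
]
check_byte_preserving(records, {"a.txt": b"hello world\n"})  # passes silently
```

Note that a single trimmed trailing newline is enough to fail the check, which is the point: whitespace_trim must then be declared instead.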
7.3 Declared-transformation round-trip (for declared_lossy adapters)
Required for adapters with non-empty declared_transformations:
def test_declared_transformation_round_trip(self, adapter, fixture_source, canonical_source_bytes):
    """Adapter output must be reproducible by applying ONLY declared transformations.

    1. For each source_file, read canonical_source_bytes.
    2. Apply each declared transformation in declared_transformations to the bytes,
       in the order declared by the adapter, using the reference implementations
       in mempalace.sources.transforms.
    3. Compare the result to the concatenated record.content values.
    4. If they differ, the adapter has applied a transformation it did not declare.
       Raise TransformationViolationError.
    """
For transformations not in the reserved list (§1.4) — adapter-custom names — the adapter MUST provide a reference implementation callable under mempalace.sources.transforms.<adapter_name>_<transform_name>. The conformance suite imports and applies it. Undiscoverable custom transforms fail the test.
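A sketch of how the suite might replay declared transformations. The two transform bodies are guesses at what `utf8_replace_invalid` and `whitespace_trim` do (their reference semantics are defined in §1.4, not here), and the registry dict stands in for lookups in mempalace.sources.transforms. The declared order is passed as a tuple in this sketch because a frozenset carries no order.

```python
class TransformationViolationError(Exception):
    """Local stand-in for the spec's error type."""


def utf8_replace_invalid(data):
    # Assumed semantics: lossy decode, invalid bytes become U+FFFD.
    return data.decode("utf-8", errors="replace") if isinstance(data, bytes) else data


def whitespace_trim(data):
    # Assumed semantics: strip leading and trailing whitespace.
    text = data.decode("utf-8") if isinstance(data, bytes) else data
    return text.strip()


# Hypothetical registry standing in for mempalace.sources.transforms.
REFERENCE_TRANSFORMS = {
    "utf8_replace_invalid": utf8_replace_invalid,
    "whitespace_trim": whitespace_trim,
}


def expected_content(raw: bytes, declared: tuple) -> str:
    """Apply only the declared transformations, in declared order."""
    value = raw
    for name in declared:
        value = REFERENCE_TRANSFORMS[name](value)  # KeyError = undiscoverable transform
    return value if isinstance(value, str) else value.decode("utf-8")


def check_declared(raw: bytes, declared: tuple, concatenated_content: str) -> None:
    if expected_content(raw, declared) != concatenated_content:
        raise TransformationViolationError("undeclared transformation applied")


declared = ("utf8_replace_invalid", "whitespace_trim")
result = expected_content(b"  caf\xff \n", declared)
```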
7.4 Schema conformance
A generator-based property test validates that every record yielded by ingest across the fixture source has metadata matching describe_schema(). Missing required fields, wrong types, or (in strict mode) undeclared fields fail the test.
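The per-record check behind that property test might look like this. `FieldSpec` is re-declared locally with the three attributes used in §5's examples; the type names `string` and `int`, the `TYPE_CHECKS` table, and the function name are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Local stand-in mirroring the attributes used in the schema examples."""
    type: str
    required: bool = False
    description: str = ""


def _is_json_string(value) -> bool:
    if not isinstance(value, str):
        return False
    try:
        json.loads(value)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
    return True


# Assumed type vocabulary; only the last two names appear in this spec section.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "delimiter_joined_string": lambda v: isinstance(v, str),
    "json_string": _is_json_string,
}


def schema_violations(metadata: dict, fields: dict, strict: bool = False) -> list:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, spec in fields.items():
        if name not in metadata:
            if spec.required:
                problems.append(f"missing required field: {name}")
            continue
        if not TYPE_CHECKS[spec.type](metadata[name]):
            problems.append(f"wrong type for {name}: expected {spec.type}")
    if strict:
        problems.extend(f"undeclared field: {n}" for n in metadata if n not in fields)
    return problems


fields = {
    "channel_id": FieldSpec(type="string", required=True),
    "reactions": FieldSpec(type="delimiter_joined_string"),
}
ok = schema_violations({"channel_id": "C01234"}, fields)
```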
7.5 Note on current corpus
No existing test in tests/ asserts byte-preservation or declared-transformation correctness (verified via grep of tests/ for verbatim|byte.?preserv|round.?trip). This RFC's conformance suite introduces the first such coverage. The existing MISSION.md claim of "verbatim always" is a social contract until this lands; afterward it becomes a machine-verified property of adapters that declare byte_preserving.
8. Versioning and compatibility
- BaseSourceAdapter.spec_version declares which spec version an adapter implements.
- MemPalace refuses to load an adapter declaring a different major spec version.
- Minor spec versions are additive: new optional methods, new capability tokens, new reserved transformation names, new universal metadata fields with sensible defaults.
- Adapters MAY declare their own adapter_version independent of the spec version; this is recorded on every drawer (§5.1) and enables "this drawer was extracted by cursor-adapter 0.3; 0.4 fixed a parsing bug; re-extract affected drawers" workflows.
- This is spec v1.0.
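The load-time compatibility rule reduces to a major-version comparison. The sketch below assumes spec versions are "major.minor" strings such as "1.0"; the function and constant names are hypothetical.

```python
CORE_SPEC_VERSION = "1.0"  # the version this document defines


def parse_spec_version(version: str):
    """Parse a 'major.minor' string into a tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_load(adapter_spec_version: str, core_spec_version: str = CORE_SPEC_VERSION) -> bool:
    """Adapters load only when their major spec version matches core's."""
    adapter_major, _ = parse_spec_version(adapter_spec_version)
    core_major, _ = parse_spec_version(core_spec_version)
    return adapter_major == core_major


same_major = can_load("1.3")
other_major = can_load("2.0")
```

Because minor versions are additive, the minor component is parsed but deliberately ignored in the comparison.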
9. Cleanup prerequisite (not in this spec, but gating)
The existing in-tree ingesters are not adapter-shaped. Before RFC 002 can be enforced, the following refactor lands in a separate PR:
- Introduce mempalace/sources/base.py defining BaseSourceAdapter, the typed records, and the registry.
- Introduce mempalace/sources/transforms.py with reference implementations of every reserved transformation in §1.4. Adapters and the conformance suite both consume these.
- mempalace/miner.py → mempalace/sources/filesystem.py implementing BaseSourceAdapter. Current behavior preserved: 800-char chunking becomes the adapter's default; READABLE_EXTENSIONS moves to the adapter; detect_room() and detect_hall() move to the adapter per §2.5. declared_transformations = frozenset({"utf8_replace_invalid", "whitespace_trim"}).
- mempalace/convo_miner.py → mempalace/sources/conversations.py. Exchange-pair chunking stays. The format-detection logic in normalize.py becomes per-format plugins the conversations adapter composes (one for Claude Code JSONL, one for Codex JSONL, one for ChatGPT mapping trees, one for Claude.ai JSON, one for Slack JSON), each small and independently testable, eliminating the if source_type chain. declared_transformations enumerates every transformation normalize.py and convo_miner._chunk_by_exchange actually perform (see §1.4 "Existing code mapping").
- Closet-building wired into the conversations adapter's post-step (currently missing, per §1.7), a side effect of routing through the unified core post-step.
- mempalace/cli.py subcommand mine routes through the mempalace.sources registry. --mode {projects,convos} becomes a deprecated alias for --source {filesystem,conversations}.
- mempalace/mcp_server.py mempalace_mine tool accepts a source parameter.
- mempalace/palace.py exposes PalaceContext, a per-mine-invocation facade that bundles the drawer collection, closet collection, knowledge graph, palace config, and progress hooks. Adapters receive this; they do not import palace.py directly.
- NORMALIZE_VERSION (currently a module-level constant in palace.py:50) stays. It is the palace-wide schema version, orthogonal to per-adapter adapter_version.
- KnowledgeGraph.add_triple() (knowledge_graph.py:130) gains two optional parameters: source_drawer_id: Optional[str] = None and adapter_name: Optional[str] = None. Existing callers are unaffected; adapters advertising supports_kg_triples (§5.5) populate both. Backwards-compatible change.
This cleanup is substantial — comparable to RFC 001 §10's chroma-import removal — and should land before any new third-party adapter PR merges. Each new adapter is easier after the cleanup, not harder.
10. Impact on in-flight PRs
| PR / Issue | Effort to align |
|---|---|
| #274 Cursor SQLite | Becomes mempalace-source-cursor third-party package. Author has a working prototype on Windows; needs describe_schema(), declared_transformations, and the conformance suite. Prior #287 (closed unmerged) is predecessor work. |
| #23 OpenCode SQLite | Becomes mempalace-source-opencode. Same shape as Cursor. |
| #169 Pi agent | Becomes mempalace-source-pi or a format plugin under the conversations adapter (depending on format similarity). |
| #232 Cursor JSONL | Deprecated in favor of #274's SQLite path; or a second mode of mempalace-source-cursor. |
| #567, #98 git-mine | Closest existing work to what the spec envisions. Becomes first-party mempalace/sources/git.py. Exercises whole_record mode, supports_structured_metadata, supports_closet_hints (decision-signal quotes), supports_kg_triples (commit authorship, PR review relationships). |
| #591, #592 Delphi Oracle | Deferred. The live-stream pattern is out of scope for v1 (§Non-goals). A v1.1 addition will specify webhook/stream adapters. |
| #702 Cursor + factory.ai | Splits into two adapter packages. |
| #981 path-level descriptions | Absorbed by §1.5 metadata_only mode + §5.1 ingest_mode. A new first-party descriptions adapter or a second mode on filesystem. |
| #244 Cursor memory-first MCP workflow docs | Points at mempalace-source-cursor once the adapter lands. |
| #419, #300, #952 language-extension additions to READABLE_EXTENSIONS | Becomes per-language config on the filesystem adapter. Contributors can publish domain-specific adapters without touching core. |
| #389 sensitive content scanner | Expected enforcement mechanism for the secrets_possible privacy class (§6.2). Not a blocker for this spec, but a natural consumer. |
| #434 auto-populate KG from drawers | Complementary: post-hoc derivation of KG triples from drawer content. Adapters with supports_kg_triples provide the up-front path; #434 handles everything else. |
11. Open questions
- Cross-adapter dedup. When a PR body is mined via git AND shows up as a conversation quote mined via claude-code, both drawers land. Is query-time dedup in searcher.py sufficient, or should core maintain a content-hash index across adapters? Declared non-goal in v1 but worth revisiting if user feedback demands it.
- Live-stream pattern. Delphi Oracle (#591/592) and potentially Slack/Discord real-time ingestion need a push-mode contract. This is a v1.1 addition (streaming adapter trait + webhook surface), not blocking.
- LLM-assisted structured extraction. Some adapters will want to call an LLM to extract structured fields. The spec does not standardize this; should it? Argument for: conformance test for LLM-driven fields, consistent caching. Argument against: local-first / zero-API is a core promise; LLM dependencies are opt-in per adapter.
- Adapter-vs-format split for conversations. §9 proposes format plugins composed under a single conversations adapter. Alternative: one adapter per format (claude-code, chatgpt, codex, cursor-jsonl, slack). The trade-off is discoverability (one adapter is easier to find) vs. encapsulation (format plugins are simpler to test). Preference leans toward the single-adapter + plugin model; open to counter-argument.
- Default privacy_floor. v1 defaults to none (§6.2) so single-user local mining is frictionless. An argument exists for defaulting to pii_potential, which forces regulated-domain users to opt in to sensitive levels rather than opt out. Open to changing the default before v1 ships.
- canonical_source_bytes for API-backed adapters. §7.1 defines this as adapter-declared. For API-backed adapters (Slack, Notion), what constitutes "canonical bytes" in a conformance test: the fixture's captured HTTP response, or a serialized representation of the parsed object? This is left to the adapter; it may need a follow-up spec for common conventions.
- adapter_version bump semantics. When does an adapter bump adapter_version? On any behavior change? On declared-transformation changes only? This suggests a follow-up doc on adapter SemVer conventions for the community to agree on.
12. Rollout
- Land the cleanup PR (§9): introduce mempalace/sources/, refactor miner.py → filesystem adapter, convo_miner.py → conversations adapter, route CLI and MCP through the sources registry. Behavior preserved end-to-end. Closets get built for conversation drawers as a side effect.
- Land this spec as-is. Add AbstractSourceAdapterContractSuite, entry-point discovery, AdapterSchema validation, privacy-class enforcement (floor-gated writes), declared-transformation reference implementations in mempalace/sources/transforms.py.
- Land mempalace/sources/git.py as the first-party adapter absorbing #567. Exercises whole_record, supports_structured_metadata, supports_closet_hints, supports_kg_triples together.
- Encourage the Cursor (#274), OpenCode (#23), and Pi (#169) authors to publish as third-party packages under mempalace-source-*. Offer review help against the spec.
- Publish adapter-authoring docs at mempalaceofficial.com/guide/authoring-sources.
- Update ROADMAP.md with spec v1.0 adoption under v4.0.0-alpha.