refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext

Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: reference implementations of the 13
  reserved transformations (§1.4). Seven are pure functions:
  utf8_replace_invalid, newline_normalize, whitespace_trim,
  whitespace_collapse_internal, line_trim, line_join_spaces, and
  blank_line_drop. The other six are identity shims for the
  adapter-specific transformations (strip_tool_chrome, tool_result_truncate,
  tool_result_omitted, spellcheck_user, synthesized_marker,
  speaker_role_assignment) that the conversations adapter will override
  when migrated. get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
Igor Lins e Silva
2026-04-18 16:05:32 -03:00
parent 2b9f17c401
commit 552e9927b7
8 changed files with 1263 additions and 2 deletions
+30 -2
@@ -93,8 +93,24 @@ class KnowledgeGraph:
 CREATE INDEX IF NOT EXISTS idx_triples_predicate ON triples(predicate);
 CREATE INDEX IF NOT EXISTS idx_triples_valid ON triples(valid_from, valid_to);
 """)
+self._migrate_schema(conn)
 conn.commit()
+def _migrate_schema(self, conn):
+"""Backwards-compatible schema migration for older triples tables.
+RFC 002 §5.5 adds two optional provenance columns so adapter-written
+triples can be traced back to (a) the specific drawer that produced
+them and (b) the adapter that authored them. Older palaces predate
+these columns; SQLite has no ``ADD COLUMN IF NOT EXISTS``, so we
+introspect the schema first and only issue the ALTER if needed.
+"""
+existing = {row["name"] for row in conn.execute("PRAGMA table_info(triples)")}
+if "source_drawer_id" not in existing:
+conn.execute("ALTER TABLE triples ADD COLUMN source_drawer_id TEXT")
+if "adapter_name" not in existing:
+conn.execute("ALTER TABLE triples ADD COLUMN adapter_name TEXT")
 def _conn(self):
 if self._connection is None:
 self._connection = sqlite3.connect(self.db_path, timeout=10, check_same_thread=False)
@@ -137,10 +153,16 @@ class KnowledgeGraph:
 confidence: float = 1.0,
 source_closet: str = None,
 source_file: str = None,
+source_drawer_id: str = None,
+adapter_name: str = None,
 ):
 """
 Add a relationship triple: subject → predicate → object.
+``source_drawer_id`` and ``adapter_name`` are RFC 002 §5.5 provenance
+fields populated by adapters that advertise ``supports_kg_triples``;
+they default to ``None`` so every existing caller stays source-compatible.
 Examples:
 add_triple("Max", "child_of", "Alice", valid_from="2015-04-01")
 add_triple("Max", "does", "swimming", valid_from="2025-01-01")
@@ -173,8 +195,12 @@ class KnowledgeGraph:
 triple_id = f"t_{sub_id}_{pred}_{obj_id}_{hashlib.sha256(f'{valid_from}{datetime.now().isoformat()}'.encode()).hexdigest()[:12]}"
 conn.execute(
-"""INSERT INTO triples (id, subject, predicate, object, valid_from, valid_to, confidence, source_closet, source_file)
-VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+"""INSERT INTO triples (
+id, subject, predicate, object,
+valid_from, valid_to, confidence,
+source_closet, source_file,
+source_drawer_id, adapter_name
+) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
 (
 triple_id,
 sub_id,
@@ -185,6 +211,8 @@ class KnowledgeGraph:
 confidence,
 source_closet,
 source_file,
+source_drawer_id,
+adapter_name,
 ),
 )
 return triple_id
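The introspect-then-ALTER pattern used by `_migrate_schema` can be tried in isolation. The following is a minimal sketch against an in-memory SQLite database, not the palace's own code: the helper name `add_missing_columns` is hypothetical, and it reads the column name by position (index 1 of each `PRAGMA table_info` row) so it does not depend on a row factory being configured.

```python
from __future__ import annotations

import sqlite3


def add_missing_columns(
    conn: sqlite3.Connection, table: str, columns: dict[str, str]
) -> list[str]:
    """Add each column in ``columns`` (name -> SQL type) that ``table`` lacks.

    Hypothetical helper. ``table`` and the column names are interpolated into
    SQL, so they must be trusted identifiers, never user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added


# A legacy-shaped table that predates the provenance columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (id TEXT PRIMARY KEY, subject TEXT)")

wanted = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}
first_run = add_missing_columns(conn, "triples", wanted)
second_run = add_missing_columns(conn, "triples", wanted)  # nothing left to add
```

Running the helper twice shows why the introspection step matters: the second call finds both columns already present and issues no ALTER, so reopening a migrated palace is a no-op.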
+74
@@ -0,0 +1,74 @@
"""Source adapter subsystem (RFC 002).
Public surface:
* :class:`BaseSourceAdapter` — per-source read-side contract.
* Typed records: :class:`SourceRef`, :class:`SourceItemMetadata`,
:class:`DrawerRecord`, :class:`RouteHint`, :class:`SourceSummary`,
:class:`AdapterSchema`, :class:`FieldSpec`.
* Error classes: :class:`SourceNotFoundError`, :class:`AuthRequiredError`,
:class:`AdapterClosedError`, :class:`TransformationViolationError`,
:class:`SchemaConformanceError`.
* Registry: :func:`register`, :func:`get_adapter`, :func:`available_adapters`,
:func:`resolve_adapter_for_source`.
* :class:`PalaceContext` — facade core passes to adapters during ``ingest``.
* :mod:`transforms` — reference implementations of the reserved §1.4
transformations + :func:`get_transformation` resolver.
"""
from .base import (
AdapterClosedError,
AdapterSchema,
AuthRequiredError,
BaseSourceAdapter,
DrawerRecord,
FieldSpec,
IngestMode,
IngestResult,
RouteHint,
SchemaConformanceError,
SourceAdapterError,
SourceItemMetadata,
SourceNotFoundError,
SourceRef,
SourceSummary,
TransformationViolationError,
)
from .context import PalaceContext, ProgressHook
from .registry import (
available_adapters,
get_adapter,
get_adapter_class,
register,
reset_adapters,
resolve_adapter_for_source,
unregister,
)
__all__ = [
"AdapterClosedError",
"AdapterSchema",
"AuthRequiredError",
"BaseSourceAdapter",
"DrawerRecord",
"FieldSpec",
"IngestMode",
"IngestResult",
"PalaceContext",
"ProgressHook",
"RouteHint",
"SchemaConformanceError",
"SourceAdapterError",
"SourceItemMetadata",
"SourceNotFoundError",
"SourceRef",
"SourceSummary",
"TransformationViolationError",
"available_adapters",
"get_adapter",
"get_adapter_class",
"register",
"reset_adapters",
"resolve_adapter_for_source",
"unregister",
]
+245
@@ -0,0 +1,245 @@
"""Source adapter contract for MemPalace (RFC 002).
Mirrors what ``mempalace/backends/base.py`` does for the write side: it defines
the read-side surface every source adapter must implement. A source adapter
extracts content from a specific origin (filesystem, git, Slack, Cursor …) and
yields typed records (``SourceItemMetadata`` / ``DrawerRecord``) that core
routes into the palace.
This module is spec scaffolding. The first-party miners (``mempalace/miner.py``
and ``mempalace/convo_miner.py``) are migrated onto it in a follow-up PR;
in this PR we publish the contract so third-party adapters can begin building
against a stable surface.
See ``docs/rfcs/002-source-adapter-plugin-spec.md`` for the authoritative
spec text.
"""
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, ClassVar, Iterator, Literal, Optional
if TYPE_CHECKING:
from .context import PalaceContext # noqa: F401 (used in string annotation)
# ---------------------------------------------------------------------------
# Errors
# ---------------------------------------------------------------------------
class SourceAdapterError(Exception):
"""Base class for every source-adapter error raised by core."""
class SourceNotFoundError(SourceAdapterError):
"""Raised when a ``SourceRef`` does not resolve to a readable source."""
class AuthRequiredError(SourceAdapterError):
"""Raised when an adapter needs credentials that were not provided.
The message MUST name the env vars (or other supported mechanism) the
operator needs to set.
"""
class AdapterClosedError(SourceAdapterError):
"""Raised when an adapter method is called after ``close()``."""
class TransformationViolationError(SourceAdapterError):
"""Raised by the conformance suite when round-tripping a drawer requires
an undeclared transformation (RFC 002 §7.2–7.3)."""
class SchemaConformanceError(SourceAdapterError):
"""Raised when a ``DrawerRecord.metadata`` violates the adapter schema
returned by :meth:`BaseSourceAdapter.describe_schema`."""
# ---------------------------------------------------------------------------
# Value objects
# ---------------------------------------------------------------------------
@dataclass(frozen=True)
class SourceRef:
"""A handle to the source a user wants to ingest.
``local_path`` is for filesystem-rooted sources (project dir, mbox file).
``uri`` is for URL-like references (``github.com/org/repo``,
``slack://workspace/channel``).
``options`` carries adapter-specific non-secret config. Secrets MUST NOT
be placed here; see §4.2.
"""
local_path: Optional[str] = None
uri: Optional[str] = None
options: dict = field(default_factory=dict)
@dataclass(frozen=True)
class RouteHint:
"""Adapter-supplied routing hint (RFC 002 §2.5)."""
wing: Optional[str] = None
room: Optional[str] = None
hall: Optional[str] = None
@dataclass(frozen=True)
class SourceItemMetadata:
"""Lightweight pointer yielded before drawers for lazy-fetch adapters.
Core inspects ``version`` via :meth:`BaseSourceAdapter.is_current` to
decide whether to skip extraction; an adapter that responds positively
stops yielding drawers for this item and moves to the next.
"""
source_file: str
version: str
size_hint: Optional[int] = None
route_hint: Optional[RouteHint] = None
@dataclass(frozen=True)
class DrawerRecord:
"""One drawer's worth of extracted content plus flat metadata.
``metadata`` values MUST be flat scalars (``str``/``int``/``float``/``bool``)
per RFC 001 §1.4 — the chroma constraint. Nested data belongs on the
knowledge graph (§5.5) or in a declared ``json_string`` field (§5.4).
"""
content: str
source_file: str
chunk_index: int = 0
metadata: dict = field(default_factory=dict)
route_hint: Optional[RouteHint] = None
@dataclass(frozen=True)
class SourceSummary:
"""High-level description of a source returned by :meth:`source_summary`."""
description: str
item_count: Optional[int] = None
IngestMode = Literal["chunked_content", "whole_record", "metadata_only"]
@dataclass(frozen=True)
class FieldSpec:
"""Declared shape of a single per-adapter metadata field (§5.2)."""
type: Literal["string", "int", "float", "bool", "delimiter_joined_string", "json_string"]
required: bool
description: str
indexed: bool = False
delimiter: str = ";"
json_schema: Optional[dict] = None
@dataclass(frozen=True)
class AdapterSchema:
"""The per-adapter metadata schema returned by :meth:`describe_schema`."""
fields: dict[str, FieldSpec]
version: str
# The union type adapters yield from ``ingest``.
IngestResult = object # intentionally broad; runtime checks in core
# ---------------------------------------------------------------------------
# Adapter contract
# ---------------------------------------------------------------------------
class BaseSourceAdapter(ABC):
"""Long-lived adapter serving many ``SourceRef`` invocations (RFC 002 §2).
Instances are lightweight on construction — no I/O, no network, no
credential fetch. All work is deferred to :meth:`ingest`. Instances are
thread-safe for concurrent ``ingest`` calls across different ``SourceRef``
values (v1 serializes within a single ``SourceRef``).
Class attributes form the adapter's identity contract:
* ``name`` — stable adapter name used for registration and drawer metadata.
* ``adapter_version`` — adapter's own version, independent of
``spec_version``. Recorded on every drawer so re-extract workflows can
target drawers from a known-buggy adapter version.
* ``capabilities`` — free-form tokens; core inspects a documented subset.
* ``supported_modes`` — subset of ``chunked_content``, ``whole_record``,
``metadata_only``.
* ``declared_transformations`` — set of transformation names the adapter
applies to source bytes. The empty set marks a byte-preserving adapter.
* ``default_privacy_class`` — privacy class level (§6) applied unless the
palace config overrides it.
"""
name: ClassVar[str]
spec_version: ClassVar[str] = "1.0"
adapter_version: ClassVar[str] = "0.0.0"
capabilities: ClassVar[frozenset[str]] = frozenset()
supported_modes: ClassVar[frozenset[str]] = frozenset({"chunked_content"})
declared_transformations: ClassVar[frozenset[str]] = frozenset()
default_privacy_class: ClassVar[str] = "pii_potential"
# ------------------------------------------------------------------
# Required methods
# ------------------------------------------------------------------
@abstractmethod
def ingest(
self,
*,
source: SourceRef,
palace: "PalaceContext",
) -> Iterator[IngestResult]:
"""Enumerate and extract content from a source.
Yields a stream of ``SourceItemMetadata`` and ``DrawerRecord`` values.
Lazy adapters yield ``SourceItemMetadata`` ahead of the drawers for
that item so core can check :meth:`is_current` before committing to
the fetch. Eager adapters MAY interleave freely.
"""
@abstractmethod
def describe_schema(self) -> AdapterSchema:
"""Declare the structured metadata this adapter attaches.
The returned schema MUST be stable for a given ``adapter_version``.
Enterprises index on it; core uses it to validate adapter output.
"""
# ------------------------------------------------------------------
# Optional methods with default implementations
# ------------------------------------------------------------------
def is_current(
self,
*,
item: SourceItemMetadata,
existing_metadata: Optional[dict],
) -> bool:
"""Return True if the palace already has an up-to-date copy of ``item``.
Default: always returns False (re-extract every time). Adapters
advertising ``supports_incremental`` MUST override.
"""
return False
def source_summary(self, *, source: SourceRef) -> SourceSummary:
"""Describe a source without extracting."""
return SourceSummary(description=self.name)
def close(self) -> None:
"""Release any resources the adapter holds. Default: no-op."""
return None
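To make the contract above concrete, here is a rough sketch of what a third-party adapter might look like. It assumes nothing beyond what this diff shows: the `SourceRef`, `DrawerRecord` and `BaseSourceAdapter` definitions below are trimmed stand-ins redefined locally so the example runs without mempalace installed, `describe_schema()` is simplified to return a plain dict, and `InlineTextAdapter` / `IncompleteAdapter` are hypothetical names.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, Optional


# Trimmed stand-ins for the types in this diff. They are not the real classes.
@dataclass(frozen=True)
class SourceRef:
    local_path: Optional[str] = None
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class DrawerRecord:
    content: str
    source_file: str
    chunk_index: int = 0
    metadata: dict = field(default_factory=dict)


class BaseSourceAdapter(ABC):
    name = "base"

    @abstractmethod
    def ingest(self, *, source: SourceRef, palace: object) -> Iterator[object]:
        """Yield records for ``source``; keyword-only, as in the contract."""

    @abstractmethod
    def describe_schema(self) -> dict:
        """Simplified here to a plain dict instead of AdapterSchema."""


class InlineTextAdapter(BaseSourceAdapter):
    """Hypothetical adapter: one drawer per paragraph of options['text']."""

    name = "inline_text"

    def ingest(self, *, source, palace):
        paragraphs = source.options.get("text", "").split("\n\n")
        for index, paragraph in enumerate(paragraphs):
            yield DrawerRecord(
                content=paragraph,
                source_file="inline",
                chunk_index=index,
                metadata={"length": len(paragraph)},  # flat scalar values only
            )

    def describe_schema(self):
        return {"length": "int"}


class IncompleteAdapter(BaseSourceAdapter):
    """Omits describe_schema(), so the ABC refuses to instantiate it."""

    def ingest(self, *, source, palace):
        yield from ()


source = SourceRef(options={"text": "first paragraph\n\nsecond"})
records = list(InlineTextAdapter().ingest(source=source, palace=None))

try:
    IncompleteAdapter()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = exc
```

Two properties of the contract show up here: a subclass that skips either abstract method fails at construction time rather than mid-ingest, and the bare `*` in the signature means a positional call such as `ingest(source, None)` raises `TypeError` immediately.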
+140
@@ -0,0 +1,140 @@
"""``PalaceContext`` facade passed to source adapters (RFC 002 §9).
Bundles the palace-side surface an adapter needs during :meth:`ingest`:
drawer collection, closet collection, knowledge graph, palace config, and
progress hooks. Adapters receive a ``PalaceContext`` instance and MUST NOT
import ``mempalace.palace`` directly — that coupling is what the facade
exists to prevent.
This module publishes the shape third-party adapters target. Core's mine
loop will construct a concrete ``PalaceContext`` and pass it to adapters
when the filesystem/conversations miners are migrated onto ``BaseSourceAdapter``
in a follow-up PR; until then, no in-tree code constructs one, but the
contract is stable.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Protocol
from .base import DrawerRecord
class _CollectionLike(Protocol):
"""Minimum of :class:`mempalace.backends.BaseCollection` adapters rely on.
Declared as a Protocol so tests and third-party adapters can substitute
any object with compatible method signatures without importing the
concrete backend. See ``mempalace/backends/base.py`` for the full surface.
"""
def add(self, **kwargs: Any) -> None: ...
def upsert(self, **kwargs: Any) -> None: ...
def query(self, **kwargs: Any) -> Any: ...
def get(self, **kwargs: Any) -> Any: ...
def delete(self, **kwargs: Any) -> None: ...
def count(self) -> int: ...
class _KnowledgeGraphLike(Protocol):
def add_triple(self, subject: str, predicate: str, obj: str, **kwargs: Any) -> Any: ...
# Progress hook signature: ``fn(event_name, **details) -> None``.
ProgressHook = Callable[..., None]
@dataclass
class PalaceContext:
"""Per-mine-invocation facade passed to :meth:`BaseSourceAdapter.ingest`.
Fields:
drawer_collection: The palace's drawer collection (via RFC 001 backend).
closet_collection: The palace's closet collection, or ``None`` if the
palace has no closets yet. Adapters should not write to this
directly; core builds closets post-step (RFC 002 §1.7).
knowledge_graph: The palace's SQLite knowledge graph. Adapters
advertising ``supports_kg_triples`` call ``add_triple`` on it.
palace_path: Filesystem root of the palace (convenience; same as
``backend.PalaceRef.local_path``).
config: Palace config object (hall keywords, rooms list, privacy
floor, etc.). Shape is the existing :class:`MempalaceConfig`.
adapter_name: Name of the adapter currently ingesting; populated by
core so drawers can carry ``metadata["adapter_name"]``.
adapter_version: Version of the adapter currently ingesting.
progress_hooks: Optional callables core invokes on progress events.
Methods are intentionally thin wrappers so the concrete mine loop in
core can swap implementations without changing adapter code.
"""
drawer_collection: _CollectionLike
knowledge_graph: _KnowledgeGraphLike
palace_path: str
closet_collection: Optional[_CollectionLike] = None
config: Optional[Any] = None
adapter_name: str = ""
adapter_version: str = ""
progress_hooks: list[ProgressHook] = field(default_factory=list)
# Internal: flag set by :meth:`skip_current_item` and checked by the core
# mine loop between yields. Not part of the adapter-facing contract; the
# adapter only needs to know that calling :meth:`skip_current_item` stops
# drawer emission for the current ``SourceItemMetadata``.
_skip_requested: bool = False
# ------------------------------------------------------------------
# Adapter-facing surface
# ------------------------------------------------------------------
def upsert_drawer(self, record: DrawerRecord) -> None:
"""Persist a ``DrawerRecord`` to the drawer collection.
Applies the spec-mandated ``adapter_name`` and ``adapter_version``
metadata stamps (§5.1) so adapters never need to populate them.
"""
meta = dict(record.metadata)
meta.setdefault("source_file", record.source_file)
meta.setdefault("chunk_index", record.chunk_index)
if self.adapter_name:
meta.setdefault("adapter_name", self.adapter_name)
if self.adapter_version:
meta.setdefault("adapter_version", self.adapter_version)
drawer_id = _build_drawer_id(record)
self.drawer_collection.upsert(
documents=[record.content],
ids=[drawer_id],
metadatas=[meta],
)
def skip_current_item(self) -> None:
"""Signal to core that the current ``SourceItemMetadata`` is up-to-date
and no drawers should be emitted for it. Core resets the flag after
advancing past the item."""
self._skip_requested = True
def emit(self, event: str, **details: Any) -> None:
"""Invoke each registered progress hook with ``(event, **details)``."""
for hook in self.progress_hooks:
try:
hook(event, **details)
except Exception: # pragma: no cover - hook errors never fail mine
import logging
logging.getLogger(__name__).exception("progress hook failed on %r", event)
def _build_drawer_id(record: DrawerRecord) -> str:
"""Deterministic drawer id: ``<sha1(source_file)[:16]>_<chunk_index>``.
Matches the shape existing miners rely on (``source_file`` + chunk index
pair) while keeping the id chroma-safe (no separators that collide with
existing metadata values). Adapters that need a different id scheme can
bypass :meth:`PalaceContext.upsert_drawer` and write through
``drawer_collection.upsert`` directly.
"""
import hashlib
digest = hashlib.sha1(record.source_file.encode("utf-8")).hexdigest()[:16]
return f"{digest}_{record.chunk_index}"
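The combination of a deterministic id and `upsert` means re-ingesting the same `(source_file, chunk_index)` pair overwrites the earlier drawer instead of duplicating it. The sketch below illustrates that under simplifying assumptions: `InMemoryCollection` is a hypothetical stand-in for the drawer collection, the functions take plain arguments rather than a `DrawerRecord`, and the adapter stamps are applied unconditionally (the code above skips them when the context fields are empty).

```python
import hashlib


class InMemoryCollection:
    """Hypothetical stand-in exposing only upsert(), keyed by drawer id."""

    def __init__(self):
        self.rows = {}

    def upsert(self, *, documents, ids, metadatas):
        for document, drawer_id, meta in zip(documents, ids, metadatas):
            self.rows[drawer_id] = (document, meta)


def build_drawer_id(source_file: str, chunk_index: int) -> str:
    # First 16 hex chars of sha1(source_file), then the chunk index.
    digest = hashlib.sha1(source_file.encode("utf-8")).hexdigest()[:16]
    return f"{digest}_{chunk_index}"


def upsert_drawer(collection, *, content, source_file, chunk_index, metadata,
                  adapter_name, adapter_version):
    meta = dict(metadata)
    # setdefault() never overwrites a key the adapter already supplied.
    meta.setdefault("source_file", source_file)
    meta.setdefault("chunk_index", chunk_index)
    meta.setdefault("adapter_name", adapter_name)
    meta.setdefault("adapter_version", adapter_version)
    drawer_id = build_drawer_id(source_file, chunk_index)
    collection.upsert(documents=[content], ids=[drawer_id], metadatas=[meta])
    return drawer_id


collection = InMemoryCollection()
common = dict(source_file="notes/a.md", chunk_index=0, metadata={"lang": "en"},
              adapter_name="notes", adapter_version="0.1.0")
first_id = upsert_drawer(collection, content="draft", **common)
second_id = upsert_drawer(collection, content="final", **common)
stored_content, stored_meta = collection.rows[first_id]
```

After both calls the collection holds a single row containing `"final"`. Because the stamps go through `setdefault`, a value the adapter already placed in `metadata` under the same key is kept rather than replaced.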
+162
@@ -0,0 +1,162 @@
"""Source adapter registry + entry-point discovery (RFC 002 §3).
Third-party adapters ship as installable packages that declare a
``mempalace.sources`` entry point::
# pyproject.toml of mempalace-source-cursor
[project.entry-points."mempalace.sources"]
cursor = "mempalace_source_cursor:CursorAdapter"
MemPalace discovers them lazily, on the first registry lookup. In-tree tests
and local development can register manually via :func:`register`. Explicit
registration wins on name conflict (RFC 002 §3.2).
Unlike storage backends (RFC 001 §3.3), source adapters are never auto-
detected — the user selects the adapter explicitly via ``--source NAME``
or config (§3.3). The default when no adapter is named is ``filesystem``
(to preserve current ``mempalace mine <path>`` behavior).
"""
from __future__ import annotations
import logging
from importlib import metadata
from threading import Lock
from typing import Type
from .base import BaseSourceAdapter
logger = logging.getLogger(__name__)
_ENTRY_POINT_GROUP = "mempalace.sources"
_DEFAULT_ADAPTER = "filesystem"
_registry: dict[str, Type[BaseSourceAdapter]] = {}
_instances: dict[str, BaseSourceAdapter] = {}
_explicit: set[str] = set()
_discovered = False
_lock = Lock()
def register(name: str, adapter_cls: Type[BaseSourceAdapter]) -> None:
"""Register ``adapter_cls`` under ``name``.
Explicit registration wins over entry-point discovery on conflict (§3.2).
"""
with _lock:
_registry[name] = adapter_cls
_explicit.add(name)
_instances.pop(name, None)
def unregister(name: str) -> None:
"""Remove an adapter registration (primarily for tests)."""
with _lock:
_registry.pop(name, None)
_explicit.discard(name)
_instances.pop(name, None)
def _discover_entry_points() -> None:
global _discovered
if _discovered:
return
with _lock:
if _discovered:
return
try:
eps = metadata.entry_points()
group = (
eps.select(group=_ENTRY_POINT_GROUP)
if hasattr(eps, "select")
else eps.get(_ENTRY_POINT_GROUP, [])
)
except Exception:
logger.exception("entry-point discovery for %s failed", _ENTRY_POINT_GROUP)
group = []
for ep in group:
if ep.name in _explicit:
continue # explicit registration wins
try:
cls = ep.load()
except Exception:
logger.exception("failed to load adapter entry point %r", ep.name)
continue
if not isinstance(cls, type) or not issubclass(cls, BaseSourceAdapter):
logger.warning(
"entry point %r did not resolve to a BaseSourceAdapter subclass (got %r)",
ep.name,
cls,
)
continue
_registry.setdefault(ep.name, cls)
_discovered = True
def available_adapters() -> list[str]:
"""Return sorted list of all registered adapter names."""
_discover_entry_points()
return sorted(_registry.keys())
def get_adapter_class(name: str) -> Type[BaseSourceAdapter]:
"""Return the registered adapter class for ``name``."""
_discover_entry_points()
try:
return _registry[name]
except KeyError as e:
raise KeyError(f"unknown source adapter {name!r}; available: {available_adapters()}") from e
def get_adapter(name: str) -> BaseSourceAdapter:
"""Return a long-lived instance of the named adapter.
Instances are cached per-name; repeated calls return the same object.
Call :func:`reset_adapters` in tests that need isolation.
"""
_discover_entry_points()
with _lock:
inst = _instances.get(name)
if inst is not None:
return inst
cls = _registry.get(name)
if cls is None:
raise KeyError(
f"unknown source adapter {name!r}; available: {sorted(_registry.keys())}"
)
inst = cls()
_instances[name] = inst
return inst
def reset_adapters() -> None:
"""Close and drop all cached adapter instances (primarily for tests)."""
with _lock:
for inst in _instances.values():
try:
inst.close()
except Exception:
logger.exception("error closing adapter during reset")
_instances.clear()
def resolve_adapter_for_source(
*,
explicit: str | None = None,
config_value: str | None = None,
default: str = _DEFAULT_ADAPTER,
) -> str:
"""Resolve the adapter name per RFC 002 §3.3 priority order.
1. Explicit ``--source`` flag or kwarg
2. Per-source config value
3. Default (``filesystem``)
Auto-detection is *intentionally* absent on the read side (§3.3); a
directory containing ``.git`` + ``workspaceStorage/`` + an ``mbox`` file
is not a signal of user intent.
"""
for candidate in (explicit, config_value):
if candidate:
return candidate
return default
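The "explicit registration wins" rule (§3.2) is easy to lose among the locking and error handling in `_discover_entry_points`. The sketch below isolates just that merge step. It is an illustration, not the registry itself: `FakeEntryPoint` is a hypothetical stand-in for an `importlib.metadata` entry point, and `merge_registrations` and the three adapter classes are invented names.

```python
from dataclasses import dataclass


@dataclass
class FakeEntryPoint:
    """Hypothetical stand-in for an importlib.metadata entry point."""

    name: str
    target: type

    def load(self) -> type:
        return self.target


class ExplicitCursor: ...
class PackagedCursor: ...
class PackagedSlack: ...


def merge_registrations(explicit: dict, discovered: list) -> dict:
    """Combine explicit registrations with discovered entry points."""
    registry = dict(explicit)
    for ep in discovered:
        if ep.name in explicit:
            continue  # explicit registration wins; the entry point is not loaded
        # setdefault keeps the first entry point seen for a given name.
        registry.setdefault(ep.name, ep.load())
    return registry


registry = merge_registrations(
    {"cursor": ExplicitCursor},
    [FakeEntryPoint("cursor", PackagedCursor), FakeEntryPoint("slack", PackagedSlack)],
)
```

Here `registry["cursor"]` stays bound to the explicitly registered class while `slack` is picked up from discovery, which is the behavior a test suite relies on when it registers a stub adapter under a name that an installed package also claims.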
+179
@@ -0,0 +1,179 @@
"""Reference implementations of the reserved content transformations (RFC 002 §1.4).
Every source adapter declares the set of transformations it applies to source
bytes via ``declared_transformations``. The conformance suite then verifies
that the adapter's output can be reproduced from the source bytes by applying
*only* the declared transformations in declaration order, using these
reference implementations.
Each transformation is a pure function on strings (text content after UTF-8
decoding). ``utf8_replace_invalid`` is the one that operates on bytes.
The invariant the spec enforces: **no transformation is applied that is not
declared in the adapter's set**. Adapters with an empty set are byte-preserving
end-to-end (modulo the initial UTF-8 decode itself, which is captured by
``utf8_replace_invalid`` when applicable).
Adapters MAY add custom transformations beyond the reserved set; third-party
names SHOULD be prefixed with the adapter name (``cursor.composer_ordering``).
Custom transformations MUST expose a reference implementation under
``mempalace.sources.transforms.<adapter_name>_<transform_name>`` so the
conformance suite can locate and apply them.
"""
from __future__ import annotations
import re
from typing import Callable
# ---------------------------------------------------------------------------
# Reserved transformations
# ---------------------------------------------------------------------------
def utf8_replace_invalid(raw: bytes) -> str:
"""Decode bytes as UTF-8; replace invalid sequences with U+FFFD.
Equivalent to ``raw.decode("utf-8", errors="replace")``. This is the one
reserved transformation that operates on bytes rather than decoded text.
"""
return raw.decode("utf-8", errors="replace")
def newline_normalize(text: str) -> str:
"""Convert CRLF and bare-CR line endings to LF."""
return text.replace("\r\n", "\n").replace("\r", "\n")
def whitespace_trim(text: str) -> str:
"""Strip leading and trailing whitespace at the record boundary only."""
return text.strip()
_RUN_OF_THREE_OR_MORE_BLANK = re.compile(r"(?:\n[ \t]*){3,}\n")
def whitespace_collapse_internal(text: str) -> str:
"""Collapse runs of three or more blank lines to exactly two blank lines.
A "blank line" here is a line containing only spaces or tabs. Single and
double blank-line runs are preserved.
"""
# Normalise inputs before collapsing: turn internal blank lines with
# whitespace content into pure \n so the regex matches consistently.
lines = text.split("\n")
normalised = "\n".join(line if line.strip() else "" for line in lines)
return _RUN_OF_THREE_OR_MORE_BLANK.sub("\n\n\n", normalised)
def line_trim(text: str) -> str:
"""Strip leading and trailing whitespace from each individual line."""
return "\n".join(line.strip() for line in text.split("\n"))
def line_join_spaces(text: str) -> str:
"""Join adjacent non-blank lines with a single space, preserving paragraph breaks.
Two lines separated by at least one blank line remain on separate lines;
runs of non-blank lines collapse into a single space-separated line.
"""
paragraphs = re.split(r"\n[ \t]*\n", text)
joined = [" ".join(line.strip() for line in p.split("\n") if line.strip()) for p in paragraphs]
return "\n\n".join(joined)
def blank_line_drop(text: str) -> str:
"""Drop every blank (empty or whitespace-only) line, keeping non-blank lines only."""
return "\n".join(line for line in text.split("\n") if line.strip())
# The following reserved transformations are declared in the spec but are
# deeply adapter-specific. Rather than guess a single reference implementation
# now, we provide identity shims that return their input unchanged. Adapters
# that declare these MUST either override with a concrete implementation or
# provide a namespaced reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these only when none exists.
def strip_tool_chrome(text: str) -> str:
"""Adapter-supplied: remove system tags, hook output, tool UI chrome.
The reference implementation here is intentionally an identity function
because the noise patterns differ per transcript format (Claude Code,
Codex, ChatGPT, Slack). The conversations adapter, when migrated, will
register a concrete reference implementation under
``mempalace.sources.transforms.conversations_strip_tool_chrome``.
"""
return text
def tool_result_truncate(text: str) -> str:
"""Adapter-supplied: head/tail window on tool output with a middle marker."""
return text
def tool_result_omitted(text: str) -> str:
"""Adapter-supplied: fully omit some tool outputs (e.g., Read/Edit/Write)."""
return text
def spellcheck_user(text: str) -> str:
"""Adapter-supplied: rewrite user turns via autocorrect.
Requires the optional ``spellcheck`` extra and a tokenizer; the spec does
not mandate a specific language model, so the reference is adapter-owned.
"""
return text
def synthesized_marker(text: str) -> str:
"""Adapter-supplied: adapter inserts its own strings (e.g., '[N lines omitted]')."""
return text
def speaker_role_assignment(text: str) -> str:
"""Adapter-supplied: multi-party speakers alternately assigned user/assistant."""
return text


# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

# Reserved transformation name → reference implementation.
# Adapters look up by name to compose a round-trip pipeline during testing.
RESERVED_TRANSFORMATIONS: dict[str, Callable[..., str]] = {
    "utf8_replace_invalid": utf8_replace_invalid,
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
    "whitespace_collapse_internal": whitespace_collapse_internal,
    "line_trim": line_trim,
    "line_join_spaces": line_join_spaces,
    "blank_line_drop": blank_line_drop,
    "strip_tool_chrome": strip_tool_chrome,
    "tool_result_truncate": tool_result_truncate,
    "tool_result_omitted": tool_result_omitted,
    "spellcheck_user": spellcheck_user,
    "synthesized_marker": synthesized_marker,
    "speaker_role_assignment": speaker_role_assignment,
}


def get_transformation(name: str) -> Callable[..., str]:
    """Resolve a reserved transformation by name.

    Raises :class:`KeyError` if the name is not reserved. Adapter-namespaced
    references (``<adapter>_<transform>``) are not resolved here; callers
    looking for them SHOULD ``getattr`` on this module first.
    """
    try:
        return RESERVED_TRANSFORMATIONS[name]
    except KeyError as e:
        raise KeyError(
            f"unknown transformation {name!r}; reserved names: {sorted(RESERVED_TRANSFORMATIONS)}"
        ) from e
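
The lookup order described in the comments above (adapter-namespaced reference first, reserved shim as fallback) can be sketched as follows. This is only an illustration under stated assumptions: `resolve_reference`, `identity_shim`, `fake_transforms_module`, and `fake_reserved` are hypothetical stand-ins, not names from this PR, and the real conformance suite may resolve references differently.

```python
from types import SimpleNamespace
from typing import Callable, Mapping


def identity_shim(text: str) -> str:
    """Stand-in for a reserved identity shim such as strip_tool_chrome."""
    return text


def conversations_strip_tool_chrome(text: str) -> str:
    """Hypothetical adapter-specific reference: remove literal <system> markers."""
    return text.replace("<system>", "").replace("</system>", "")


# Stand-ins (assumptions) for the transforms module and RESERVED_TRANSFORMATIONS.
fake_transforms_module = SimpleNamespace(
    conversations_strip_tool_chrome=conversations_strip_tool_chrome
)
fake_reserved: Mapping[str, Callable[..., str]] = {"strip_tool_chrome": identity_shim}


def resolve_reference(
    adapter_name: str,
    transform_name: str,
    module=fake_transforms_module,
    reserved: Mapping[str, Callable[..., str]] = fake_reserved,
) -> Callable[..., str]:
    """Prefer ``<adapter>_<transform>`` on the module, else the reserved entry."""
    namespaced = getattr(module, f"{adapter_name}_{transform_name}", None)
    if callable(namespaced):
        return namespaced
    # Raises KeyError when the name is not reserved either.
    return reserved[transform_name]


cleaned = resolve_reference("conversations", "strip_tool_chrome")("<system>hi</system>")
fallback = resolve_reference("git", "strip_tool_chrome")
```

Here the hypothetical `conversations` adapter gets its own implementation, while `git`, which registers nothing, falls back to the shared shim.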
@@ -40,6 +40,12 @@ mempalace = "mempalace.cli:main"
[project.entry-points."mempalace.backends"]
chroma = "mempalace.backends.chroma:ChromaBackend"
# RFC 002 source-adapter entry-point group. Core publishes no first-party
# adapters under this group yet; ``miner.py`` and ``convo_miner.py`` migrate
# onto ``BaseSourceAdapter`` in a follow-up PR. Third-party adapter packages
# (``mempalace-source-cursor``, ``mempalace-source-git``, …) register here.
[project.entry-points."mempalace.sources"]
[project.optional-dependencies]
dev = ["pytest>=7.0", "pytest-cov>=4.0", "ruff>=0.4.0", "psutil>=5.9"]
spellcheck = ["autocorrect>=2.0"]
@@ -0,0 +1,427 @@
"""Tests for the RFC 002 source-adapter scaffolding."""

import pytest

from mempalace.sources import (
    AdapterSchema,
    BaseSourceAdapter,
    DrawerRecord,
    FieldSpec,
    PalaceContext,
    RouteHint,
    SourceItemMetadata,
    SourceRef,
    SourceSummary,
    available_adapters,
    get_adapter,
    get_adapter_class,
    register,
    reset_adapters,
    resolve_adapter_for_source,
    unregister,
)
from mempalace.sources.transforms import (
    RESERVED_TRANSFORMATIONS,
    blank_line_drop,
    get_transformation,
    line_join_spaces,
    line_trim,
    newline_normalize,
    utf8_replace_invalid,
    whitespace_collapse_internal,
    whitespace_trim,
)

# ---------------------------------------------------------------------------
# Minimal conforming adapter used as a fixture across tests
# ---------------------------------------------------------------------------


class _TrivialAdapter(BaseSourceAdapter):
    name = "_trivial"
    adapter_version = "0.1.0"
    capabilities = frozenset({"byte_preserving"})
    supported_modes = frozenset({"whole_record"})
    declared_transformations = frozenset()
    default_privacy_class = "public"

    def ingest(self, *, source, palace):
        yield SourceItemMetadata(source_file=source.uri or "x", version="v1")
        yield DrawerRecord(content="hello", source_file=source.uri or "x", chunk_index=0)

    def describe_schema(self):
        return AdapterSchema(
            version="1.0",
            fields={"example": FieldSpec(type="string", required=False, description="x")},
        )


@pytest.fixture(autouse=True)
def _isolate_registry():
    yield
    reset_adapters()
    for name in list(available_adapters()):
        unregister(name)


# ---------------------------------------------------------------------------
# base.py — ABC + typed records
# ---------------------------------------------------------------------------


def test_base_adapter_is_abstract_without_required_methods():
    with pytest.raises(TypeError):

        class Incomplete(BaseSourceAdapter):
            name = "incomplete"

        Incomplete()


def test_conforming_adapter_instantiates_and_yields_typed_records():
    adapter = _TrivialAdapter()
    results = list(adapter.ingest(source=SourceRef(uri="foo"), palace=None))
    assert len(results) == 2
    assert isinstance(results[0], SourceItemMetadata)
    assert isinstance(results[1], DrawerRecord)
    assert results[1].content == "hello"


def test_is_current_default_is_false_always_reextracts():
    adapter = _TrivialAdapter()
    item = SourceItemMetadata(source_file="f", version="v1")
    assert adapter.is_current(item=item, existing_metadata=None) is False
    assert adapter.is_current(item=item, existing_metadata={"version": "v1"}) is False


def test_source_summary_default_uses_adapter_name():
    adapter = _TrivialAdapter()
    summary = adapter.source_summary(source=SourceRef(uri="x"))
    assert isinstance(summary, SourceSummary)
    assert summary.description == "_trivial"


def test_source_ref_options_default_is_empty_dict():
    # Each SourceRef must get its own options dict (default_factory=dict);
    # the frozen dataclass must not share one default across instances.
    a = SourceRef(uri="a")
    b = SourceRef(uri="b")
    a.options["touched"] = True
    assert "touched" not in b.options


# ---------------------------------------------------------------------------
# transforms.py
# ---------------------------------------------------------------------------


def test_reserved_transformations_registry_has_all_13():
    expected = {
        "utf8_replace_invalid",
        "newline_normalize",
        "whitespace_trim",
        "whitespace_collapse_internal",
        "line_trim",
        "line_join_spaces",
        "blank_line_drop",
        "strip_tool_chrome",
        "tool_result_truncate",
        "tool_result_omitted",
        "spellcheck_user",
        "synthesized_marker",
        "speaker_role_assignment",
    }
    assert set(RESERVED_TRANSFORMATIONS) == expected


def test_utf8_replace_invalid_handles_bad_bytes():
    # A lone 0xff byte is never valid UTF-8; U+FFFD should replace it.
    assert utf8_replace_invalid(b"ok \xff end") == "ok \ufffd end"


def test_newline_normalize_converts_crlf_and_cr():
    assert newline_normalize("a\r\nb\rc\nd") == "a\nb\nc\nd"


def test_whitespace_trim_strips_boundaries():
    assert whitespace_trim(" hello\n\n") == "hello"


def test_whitespace_collapse_internal_caps_at_two_blanks():
    # Five consecutive newlines (four blank lines) collapse to exactly three
    # newlines (two blank lines).
    text = "a\n\n\n\n\nb"
    assert whitespace_collapse_internal(text) == "a\n\n\nb"


def test_line_trim_strips_each_line():
    assert line_trim(" a \n\t b \n c") == "a\nb\nc"


def test_line_join_spaces_preserves_paragraph_breaks():
    text = "foo\nbar\nbaz\n\nqux\nquux"
    assert line_join_spaces(text) == "foo bar baz\n\nqux quux"


def test_blank_line_drop_removes_blanks_only():
    assert blank_line_drop("a\n\nb\n\n\nc") == "a\nb\nc"


def test_get_transformation_resolves_reserved_and_rejects_unknown():
    assert get_transformation("newline_normalize") is newline_normalize
    with pytest.raises(KeyError):
        get_transformation("not_a_real_transformation")


# ---------------------------------------------------------------------------
# registry.py
# ---------------------------------------------------------------------------


def test_register_and_get_adapter_roundtrip():
    register("_trivial", _TrivialAdapter)
    assert "_trivial" in available_adapters()
    inst = get_adapter("_trivial")
    assert isinstance(inst, _TrivialAdapter)
    # Cached: repeated calls return the same instance.
    assert get_adapter("_trivial") is inst


def test_get_adapter_class_returns_class_not_instance():
    register("_trivial", _TrivialAdapter)
    assert get_adapter_class("_trivial") is _TrivialAdapter


def test_get_adapter_unknown_raises_key_error():
    with pytest.raises(KeyError):
        get_adapter("does-not-exist")


def test_unregister_drops_registration_and_cached_instance():
    register("_trivial", _TrivialAdapter)
    get_adapter("_trivial")
    unregister("_trivial")
    assert "_trivial" not in available_adapters()
    with pytest.raises(KeyError):
        get_adapter("_trivial")


def test_resolve_adapter_priority_order():
    # Explicit wins over everything.
    assert resolve_adapter_for_source(explicit="cursor", config_value="git") == "cursor"
    # Config wins over default.
    assert resolve_adapter_for_source(config_value="git") == "git"
    # Default is filesystem (preserves existing ``mempalace mine <path>`` behavior).
    assert resolve_adapter_for_source() == "filesystem"


# ---------------------------------------------------------------------------
# PalaceContext
# ---------------------------------------------------------------------------


class _FakeCollection:
    def __init__(self):
        self.upserts = []

    def add(self, **kwargs):
        pass

    def upsert(self, **kwargs):
        self.upserts.append(kwargs)

    def query(self, **kwargs):
        return {}

    def get(self, **kwargs):
        return {}

    def delete(self, **kwargs):
        pass

    def count(self):
        return 0


class _FakeKG:
    def __init__(self):
        self.triples = []

    def add_triple(self, subject, predicate, obj, **kwargs):
        self.triples.append((subject, predicate, obj, kwargs))


def test_palace_context_upsert_drawer_stamps_adapter_metadata():
    drawers = _FakeCollection()
    kg = _FakeKG()
    ctx = PalaceContext(
        drawer_collection=drawers,
        knowledge_graph=kg,
        palace_path="/tmp/palace",
        adapter_name="test-adapter",
        adapter_version="0.1.0",
    )
    record = DrawerRecord(
        content="hello",
        source_file="/abs/path/file.txt",
        chunk_index=2,
        metadata={"wing": "proj"},
    )
    ctx.upsert_drawer(record)
    assert len(drawers.upserts) == 1
    kwargs = drawers.upserts[0]
    assert kwargs["documents"] == ["hello"]
    assert len(kwargs["ids"]) == 1
    meta = kwargs["metadatas"][0]
    assert meta["wing"] == "proj"
    assert meta["adapter_name"] == "test-adapter"
    assert meta["adapter_version"] == "0.1.0"
    assert meta["source_file"] == "/abs/path/file.txt"
    assert meta["chunk_index"] == 2


def test_palace_context_skip_current_item_sets_flag():
    ctx = PalaceContext(
        drawer_collection=_FakeCollection(),
        knowledge_graph=_FakeKG(),
        palace_path="/tmp/p",
    )
    assert ctx._skip_requested is False
    ctx.skip_current_item()
    assert ctx._skip_requested is True


def test_palace_context_emit_dispatches_to_hooks_and_swallows_errors():
    calls = []
    err_calls = []

    def good_hook(event, **details):
        calls.append((event, details))

    def bad_hook(event, **details):
        err_calls.append(event)
        raise RuntimeError("hook exploded")

    ctx = PalaceContext(
        drawer_collection=_FakeCollection(),
        knowledge_graph=_FakeKG(),
        palace_path="/tmp/p",
        progress_hooks=[good_hook, bad_hook],
    )
    ctx.emit("mined_file", path="a.txt", bytes=42)
    assert calls == [("mined_file", {"path": "a.txt", "bytes": 42})]
    assert err_calls == ["mined_file"]  # was invoked; error was swallowed


def test_palace_context_uses_route_hint_when_present():
    # Route hints are frozen dataclasses the adapter passes through.
    hint = RouteHint(wing="proj", room="backend", hall="general")
    assert hint.wing == "proj"
    assert hint.room == "backend"


# ---------------------------------------------------------------------------
# KnowledgeGraph new provenance params (RFC 002 §5.5)
# ---------------------------------------------------------------------------


def test_knowledge_graph_add_triple_accepts_source_drawer_id_and_adapter_name(tmp_path):
    from mempalace.knowledge_graph import KnowledgeGraph

    kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))
    try:
        triple_id = kg.add_triple(
            "Ben",
            "committed",
            "PR-567",
            valid_from="2026-03-12",
            source_file="github.com/org/repo#pr=567",
            source_drawer_id="abc123_0",
            adapter_name="git",
        )
        assert triple_id is not None
        import sqlite3

        conn = sqlite3.connect(str(tmp_path / "kg.sqlite3"))
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT source_drawer_id, adapter_name FROM triples WHERE id=?", (triple_id,)
        ).fetchone()
        assert row["source_drawer_id"] == "abc123_0"
        assert row["adapter_name"] == "git"
        conn.close()
    finally:
        kg.close()


def test_knowledge_graph_migration_adds_missing_columns_to_old_schema(tmp_path):
    """An old-schema triples table (pre-RFC 002) should auto-migrate on open."""
    import sqlite3

    db_path = tmp_path / "legacy.sqlite3"
    conn = sqlite3.connect(str(db_path))
    conn.executescript("""
        CREATE TABLE entities (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            type TEXT DEFAULT 'unknown',
            properties TEXT DEFAULT '{}',
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE triples (
            id TEXT PRIMARY KEY,
            subject TEXT NOT NULL,
            predicate TEXT NOT NULL,
            object TEXT NOT NULL,
            valid_from TEXT,
            valid_to TEXT,
            confidence REAL DEFAULT 1.0,
            source_closet TEXT,
            source_file TEXT,
            extracted_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)
    conn.commit()
    conn.close()

    from mempalace.knowledge_graph import KnowledgeGraph

    kg = KnowledgeGraph(db_path=str(db_path))
    try:
        # New columns must be present after _init_db runs the migration.
        conn = sqlite3.connect(str(db_path))
        cols = {row[1] for row in conn.execute("PRAGMA table_info(triples)")}
        conn.close()
        assert "source_drawer_id" in cols
        assert "adapter_name" in cols
        # New-column insert works.
        kg.add_triple("a", "rel", "b", source_drawer_id="d0", adapter_name="x")
    finally:
        kg.close()


def test_knowledge_graph_add_triple_backwards_compatible_without_new_kwargs(tmp_path):
    """Existing callers that omit the RFC 002 kwargs keep working unchanged."""
    from mempalace.knowledge_graph import KnowledgeGraph

    kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))
    try:
        triple_id = kg.add_triple("Max", "likes", "trains")
        assert triple_id is not None
    finally:
        kg.close()


# ---------------------------------------------------------------------------
# pyproject entry-point group is discoverable even when empty
# ---------------------------------------------------------------------------


def test_entry_point_group_exists_and_returns_zero_or_more_adapters():
    # No in-tree first-party adapters yet (miners migrate in a follow-up PR),
    # but the ``mempalace.sources`` entry-point group is declared so third-
    # party packages can register. ``available_adapters`` MUST NOT raise.
    adapters = available_adapters()
    assert isinstance(adapters, list)
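
The legacy-schema test above exercises an add-missing-columns migration whose implementation is not shown in this part of the diff. One common way to write such a migration, sketched here with the standard `sqlite3` module, is to compare `PRAGMA table_info` against the wanted columns and issue `ALTER TABLE ... ADD COLUMN` for the gaps. `add_missing_columns` and `NEW_COLUMNS` are hypothetical names, and the `TEXT` column types are an assumption; `KnowledgeGraph` may do this differently.

```python
import sqlite3

# Assumed column types; the test above only checks that the columns exist.
NEW_COLUMNS = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}


def add_missing_columns(conn, table="triples", columns=NEW_COLUMNS):
    """Add any column from ``columns`` that ``table`` lacks; return the names added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (id TEXT PRIMARY KEY, subject TEXT NOT NULL)")
first_run = add_missing_columns(conn)   # adds both columns
second_run = add_missing_columns(conn)  # nothing left to add
final_cols = {row[1] for row in conn.execute("PRAGMA table_info(triples)")}
conn.close()
```

Checking the existing columns first keeps the migration idempotent, so opening an already-migrated database is a no-op.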