552e9927b7
Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6).
- mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4). Seven are pure functions (utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop); the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) are identity shims that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name.
- mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") plus an explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, there is no auto-detection on the read side (§3.3 is explicit about that: user intent is never inferred from on-disk artifacts).
- mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks that core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions.
- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). A backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info, then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields.
- pyproject.toml: mempalace.sources entry-point group declared. It is empty on the first-party side for now, since miners migrate in a follow-up; the group being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and backwards-compatible legacy-schema migration.

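
The column migration described for mempalace/knowledge_graph.py (PRAGMA table_info, then ALTER TABLE ADD COLUMN) can be sketched with the standard-library sqlite3 module. This is a minimal illustration only, not the code in this PR: the helper name `ensure_columns`, the `triples` table name, and its three legacy columns are assumptions made for the example; only the two new column names come from the commit message.

```python
import sqlite3


def ensure_columns(conn, table, columns):
    """Add any column from ``columns`` that ``table`` lacks; return the names added.

    Hypothetical helper: ``table`` and the column declarations are assumed to be
    trusted constants, because SQLite cannot bind identifiers as parameters.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, decl in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {decl}")
            added.append(name)
    conn.commit()
    return added


# Simulate a palace created before the provenance columns existed
# (assumed legacy layout, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.execute("INSERT INTO triples VALUES ('alice', 'knows', 'bob')")

new_columns = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}
first = ensure_columns(conn, "triples", new_columns)   # adds both columns
second = ensure_columns(conn, "triples", new_columns)  # nothing left to add
print(first, second)
```

Because the check runs before every ALTER TABLE, calling the helper on each open is idempotent, and rows written by an older version simply read back NULL for the new columns.
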
Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).
246 lines
8.3 KiB
Python
"""Source adapter contract for MemPalace (RFC 002).

Mirrors what ``mempalace/backends/base.py`` does for the write side: it defines
the read-side surface every source adapter must implement. A source adapter
extracts content from a specific origin (filesystem, git, Slack, Cursor …) and
yields typed records (``SourceItemMetadata`` / ``DrawerRecord``) that core
routes into the palace.

This module is spec scaffolding. The first-party miners (``mempalace/miner.py``
and ``mempalace/convo_miner.py``) are migrated onto it in a follow-up PR;
in this PR we publish the contract so third-party adapters can begin building
against a stable surface.

See ``docs/rfcs/002-source-adapter-plugin-spec.md`` for the authoritative
spec text.
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, ClassVar, Iterator, Literal, Optional

if TYPE_CHECKING:
    from .context import PalaceContext  # noqa: F401 (used in string annotation)


# ---------------------------------------------------------------------------
# Errors
# ---------------------------------------------------------------------------


class SourceAdapterError(Exception):
    """Base class for every source-adapter error raised by core."""


class SourceNotFoundError(SourceAdapterError):
    """Raised when a ``SourceRef`` does not resolve to a readable source."""


class AuthRequiredError(SourceAdapterError):
    """Raised when an adapter needs credentials that were not provided.

    The message MUST name the env vars (or other supported mechanism) the
    operator needs to set.
    """


class AdapterClosedError(SourceAdapterError):
    """Raised when an adapter method is called after ``close()``."""


class TransformationViolationError(SourceAdapterError):
    """Raised by the conformance suite when round-tripping a drawer requires
    an undeclared transformation (RFC 002 §7.2–7.3)."""


class SchemaConformanceError(SourceAdapterError):
    """Raised when a ``DrawerRecord.metadata`` violates the adapter schema
    returned by :meth:`BaseSourceAdapter.describe_schema`."""


# ---------------------------------------------------------------------------
# Value objects
# ---------------------------------------------------------------------------


@dataclass(frozen=True)
class SourceRef:
    """A handle to the source a user wants to ingest.

    ``local_path`` is for filesystem-rooted sources (project dir, mbox file).
    ``uri`` is for URL-like references (``github.com/org/repo``,
    ``slack://workspace/channel``).
    ``options`` carries adapter-specific non-secret config. Secrets MUST NOT
    be placed here; see §4.2.
    """

    local_path: Optional[str] = None
    uri: Optional[str] = None
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RouteHint:
    """Adapter-supplied routing hint (RFC 002 §2.5)."""

    wing: Optional[str] = None
    room: Optional[str] = None
    hall: Optional[str] = None


@dataclass(frozen=True)
class SourceItemMetadata:
    """Lightweight pointer yielded before drawers for lazy-fetch adapters.

    Core inspects ``version`` via :meth:`BaseSourceAdapter.is_current` to
    decide whether to skip extraction; an adapter that responds positively
    stops yielding drawers for this item and moves to the next.
    """

    source_file: str
    version: str
    size_hint: Optional[int] = None
    route_hint: Optional[RouteHint] = None


@dataclass(frozen=True)
class DrawerRecord:
    """One drawer's worth of extracted content plus flat metadata.

    ``metadata`` values MUST be flat scalars (``str``/``int``/``float``/``bool``)
    per RFC 001 §1.4 (the chroma constraint). Nested data belongs on the
    knowledge graph (§5.5) or in a declared ``json_string`` field (§5.4).
    """

    content: str
    source_file: str
    chunk_index: int = 0
    metadata: dict = field(default_factory=dict)
    route_hint: Optional[RouteHint] = None


@dataclass(frozen=True)
class SourceSummary:
    """High-level description of a source returned by :meth:`source_summary`."""

    description: str
    item_count: Optional[int] = None


IngestMode = Literal["chunked_content", "whole_record", "metadata_only"]


@dataclass(frozen=True)
class FieldSpec:
    """Declared shape of a single per-adapter metadata field (§5.2)."""

    type: Literal["string", "int", "float", "bool", "delimiter_joined_string", "json_string"]
    required: bool
    description: str
    indexed: bool = False
    delimiter: str = ";"
    json_schema: Optional[dict] = None


@dataclass(frozen=True)
class AdapterSchema:
    """The per-adapter metadata schema returned by :meth:`describe_schema`."""

    fields: dict[str, FieldSpec]
    version: str


# The union type adapters yield from ``ingest``.
IngestResult = object  # intentionally broad; runtime checks in core


# ---------------------------------------------------------------------------
# Adapter contract
# ---------------------------------------------------------------------------


class BaseSourceAdapter(ABC):
    """Long-lived adapter serving many ``SourceRef`` invocations (RFC 002 §2).

    Instances are lightweight on construction: no I/O, no network, no
    credential fetch. All work is deferred to :meth:`ingest`. Instances are
    thread-safe for concurrent ``ingest`` calls across different ``SourceRef``
    values (v1 serializes within a single ``SourceRef``).

    Class attributes form the adapter's identity contract:

    * ``name``: stable adapter name used for registration and drawer metadata.
    * ``adapter_version``: adapter's own version, independent of
      ``spec_version``. Recorded on every drawer so re-extract workflows can
      target drawers from a known-buggy adapter version.
    * ``capabilities``: free-form tokens; core inspects a documented subset.
    * ``supported_modes``: subset of ``chunked_content``, ``whole_record``,
      ``metadata_only``.
    * ``declared_transformations``: set of transformation names the adapter
      applies to source bytes. The empty set marks a byte-preserving adapter.
    * ``default_privacy_class``: privacy class level (§6) applied unless the
      palace config overrides it.
    """

    name: ClassVar[str]
    spec_version: ClassVar[str] = "1.0"
    adapter_version: ClassVar[str] = "0.0.0"
    capabilities: ClassVar[frozenset[str]] = frozenset()
    supported_modes: ClassVar[frozenset[str]] = frozenset({"chunked_content"})
    declared_transformations: ClassVar[frozenset[str]] = frozenset()
    default_privacy_class: ClassVar[str] = "pii_potential"

    # ------------------------------------------------------------------
    # Required methods
    # ------------------------------------------------------------------

    @abstractmethod
    def ingest(
        self,
        *,
        source: SourceRef,
        palace: "PalaceContext",
    ) -> Iterator[IngestResult]:
        """Enumerate and extract content from a source.

        Yields a stream of ``SourceItemMetadata`` and ``DrawerRecord`` values.
        Lazy adapters yield ``SourceItemMetadata`` ahead of the drawers for
        that item so core can check :meth:`is_current` before committing to
        the fetch. Eager adapters MAY interleave freely.
        """

    @abstractmethod
    def describe_schema(self) -> AdapterSchema:
        """Declare the structured metadata this adapter attaches.

        The returned schema MUST be stable for a given ``adapter_version``.
        Enterprises index on it; core uses it to validate adapter output.
        """

    # ------------------------------------------------------------------
    # Optional methods with default implementations
    # ------------------------------------------------------------------

    def is_current(
        self,
        *,
        item: SourceItemMetadata,
        existing_metadata: Optional[dict],
    ) -> bool:
        """Return True if the palace already has an up-to-date copy of ``item``.

        Default: always returns False (re-extract every time). Adapters
        advertising ``supports_incremental`` MUST override.
        """
        return False

    def source_summary(self, *, source: SourceRef) -> SourceSummary:
        """Describe a source without extracting."""
        return SourceSummary(description=self.name)

    def close(self) -> None:
        """Release any resources the adapter holds. Default: no-op."""
        return None