mempalace/mempalace/sources/base.py
Igor Lins e Silva 552e9927b7 refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext
Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: reference implementations of the 13
  reserved transformations (§1.4). Seven are pure functions
  (utf8_replace_invalid, newline_normalize, whitespace_trim,
  whitespace_collapse_internal, line_trim, line_join_spaces,
  blank_line_drop); the other six are identity shims for the
  adapter-specific ones (strip_tool_chrome, tool_result_truncate,
  tool_result_omitted, spellcheck_user, synthesized_marker,
  speaker_role_assignment) that the conversations adapter will override
  when migrated. get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.
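
The column migration described in the knowledge_graph.py item (PRAGMA table_info, then ALTER TABLE ADD COLUMN) can be sketched as below. This is a minimal illustration, not the code in mempalace/knowledge_graph.py: the `triples` table, its three legacy columns, the TEXT column types, and the helper name `ensure_columns` are all assumptions made for the example.

```python
import sqlite3


def ensure_columns(conn, table, columns):
    """Add any columns missing from ``table``; return the names that were added.

    ``columns`` maps column name -> SQL type declaration. Identifiers are
    interpolated directly, so they must come from trusted code, not user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, decl in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {decl}")
            added.append(name)
    conn.commit()
    return added


# Hypothetical pre-RFC 002 schema without provenance columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

provenance = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}
first_run = ensure_columns(conn, "triples", provenance)   # adds both columns
second_run = ensure_columns(conn, "triples", provenance)  # nothing left to add
```

Because the helper checks the live schema before altering it, running it on every open is idempotent: old palaces gain the columns once, new palaces are left alone.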

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.
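
For readers unfamiliar with the round-trip idea behind the deferred contract suite, here is a rough sketch of what such a check might look like. It is not the planned AbstractSourceAdapterContractSuite: the function `round_trips`, the behaviour assigned to `newline_normalize` and `whitespace_trim`, and the choice to apply declared transformations in the order given are assumptions for illustration only.

```python
def newline_normalize(text: str) -> str:
    # Assumed behaviour: CRLF and lone CR both become LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    # Assumed behaviour: strip leading and trailing whitespace.
    return text.strip()


# Hypothetical lookup table keyed by reserved transformation name.
TRANSFORMS = {
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
}


def round_trips(source_text, drawer_contents, declared):
    """Return True if the joined drawers equal the source after applying
    only the declared transformations, in the order given."""
    expected = source_text
    for name in declared:
        expected = TRANSFORMS[name](expected)
    return "".join(drawer_contents) == expected


source = "  hello\r\nworld  "

# Byte-preserving adapter: nothing declared, content must match exactly.
preserved = round_trips(source, ["  hello\r\n", "world  "], ())

# Adapter that declares both transformations.
declared_ok = round_trips(
    source, ["hello\n", "world"], ("newline_normalize", "whitespace_trim")
)

# Same drawers with nothing declared: an undeclared transformation slipped in.
undeclared = round_trips(source, ["hello\n", "world"], ())
```

In this sketch the third case is the one a conformance suite would presumably report as a transformation violation.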

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
2026-04-18 16:05:32 -03:00

246 lines
8.3 KiB
Python

"""Source adapter contract for MemPalace (RFC 002).

Mirrors what ``mempalace/backends/base.py`` does for the write side: it defines
the read-side surface every source adapter must implement. A source adapter
extracts content from a specific origin (filesystem, git, Slack, Cursor …) and
yields typed records (``SourceItemMetadata`` / ``DrawerRecord``) that core
routes into the palace.

This module is spec scaffolding. The first-party miners (``mempalace/miner.py``
and ``mempalace/convo_miner.py``) are migrated onto it in a follow-up PR;
in this PR we publish the contract so third-party adapters can begin building
against a stable surface.

See ``docs/rfcs/002-source-adapter-plugin-spec.md`` for the authoritative
spec text.
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, ClassVar, Iterator, Literal, Optional

if TYPE_CHECKING:
    from .context import PalaceContext  # noqa: F401 (used in string annotation)


# ---------------------------------------------------------------------------
# Errors
# ---------------------------------------------------------------------------


class SourceAdapterError(Exception):
    """Base class for every source-adapter error raised by core."""


class SourceNotFoundError(SourceAdapterError):
    """Raised when a ``SourceRef`` does not resolve to a readable source."""


class AuthRequiredError(SourceAdapterError):
    """Raised when an adapter needs credentials that were not provided.

    The message MUST name the env vars (or other supported mechanism) the
    operator needs to set.
    """


class AdapterClosedError(SourceAdapterError):
    """Raised when an adapter method is called after ``close()``."""


class TransformationViolationError(SourceAdapterError):
    """Raised by the conformance suite when round-tripping a drawer requires
    an undeclared transformation (RFC 002 §7.2–7.3)."""


class SchemaConformanceError(SourceAdapterError):
    """Raised when a ``DrawerRecord.metadata`` violates the adapter schema
    returned by :meth:`BaseSourceAdapter.describe_schema`."""


# ---------------------------------------------------------------------------
# Value objects
# ---------------------------------------------------------------------------


@dataclass(frozen=True)
class SourceRef:
    """A handle to the source a user wants to ingest.

    ``local_path`` is for filesystem-rooted sources (project dir, mbox file).
    ``uri`` is for URL-like references (``github.com/org/repo``,
    ``slack://workspace/channel``).
    ``options`` carries adapter-specific non-secret config. Secrets MUST NOT
    be placed here; see §4.2.
    """

    local_path: Optional[str] = None
    uri: Optional[str] = None
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RouteHint:
    """Adapter-supplied routing hint (RFC 002 §2.5)."""

    wing: Optional[str] = None
    room: Optional[str] = None
    hall: Optional[str] = None


@dataclass(frozen=True)
class SourceItemMetadata:
    """Lightweight pointer yielded before drawers for lazy-fetch adapters.

    Core inspects ``version`` via :meth:`BaseSourceAdapter.is_current` to
    decide whether to skip extraction; an adapter that responds positively
    stops yielding drawers for this item and moves to the next.
    """

    source_file: str
    version: str
    size_hint: Optional[int] = None
    route_hint: Optional[RouteHint] = None


@dataclass(frozen=True)
class DrawerRecord:
    """One drawer's worth of extracted content plus flat metadata.

    ``metadata`` values MUST be flat scalars (``str``/``int``/``float``/``bool``)
    per RFC 001 §1.4 (the chroma constraint). Nested data belongs on the
    knowledge graph (§5.5) or in a declared ``json_string`` field (§5.4).
    """

    content: str
    source_file: str
    chunk_index: int = 0
    metadata: dict = field(default_factory=dict)
    route_hint: Optional[RouteHint] = None


@dataclass(frozen=True)
class SourceSummary:
    """High-level description of a source returned by :meth:`source_summary`."""

    description: str
    item_count: Optional[int] = None


IngestMode = Literal["chunked_content", "whole_record", "metadata_only"]


@dataclass(frozen=True)
class FieldSpec:
    """Declared shape of a single per-adapter metadata field (§5.2)."""

    type: Literal["string", "int", "float", "bool", "delimiter_joined_string", "json_string"]
    required: bool
    description: str
    indexed: bool = False
    delimiter: str = ";"
    json_schema: Optional[dict] = None


@dataclass(frozen=True)
class AdapterSchema:
    """The per-adapter metadata schema returned by :meth:`describe_schema`."""

    fields: dict[str, FieldSpec]
    version: str


# The union type adapters yield from ``ingest``.
IngestResult = object  # intentionally broad; runtime checks in core


# ---------------------------------------------------------------------------
# Adapter contract
# ---------------------------------------------------------------------------


class BaseSourceAdapter(ABC):
    """Long-lived adapter serving many ``SourceRef`` invocations (RFC 002 §2).

    Instances are lightweight on construction: no I/O, no network, no
    credential fetch. All work is deferred to :meth:`ingest`. Instances are
    thread-safe for concurrent ``ingest`` calls across different ``SourceRef``
    values (v1 serializes within a single ``SourceRef``).

    Class attributes form the adapter's identity contract:

    * ``name``: stable adapter name used for registration and drawer metadata.
    * ``adapter_version``: adapter's own version, independent of
      ``spec_version``. Recorded on every drawer so re-extract workflows can
      target drawers from a known-buggy adapter version.
    * ``capabilities``: free-form tokens; core inspects a documented subset.
    * ``supported_modes``: subset of ``chunked_content``, ``whole_record``,
      ``metadata_only``.
    * ``declared_transformations``: set of transformation names the adapter
      applies to source bytes. The empty set marks a byte-preserving adapter.
    * ``default_privacy_class``: privacy class level (§6) applied unless the
      palace config overrides it.
    """

    name: ClassVar[str]
    spec_version: ClassVar[str] = "1.0"
    adapter_version: ClassVar[str] = "0.0.0"
    capabilities: ClassVar[frozenset[str]] = frozenset()
    supported_modes: ClassVar[frozenset[str]] = frozenset({"chunked_content"})
    declared_transformations: ClassVar[frozenset[str]] = frozenset()
    default_privacy_class: ClassVar[str] = "pii_potential"

    # ------------------------------------------------------------------
    # Required methods
    # ------------------------------------------------------------------

    @abstractmethod
    def ingest(
        self,
        *,
        source: SourceRef,
        palace: "PalaceContext",
    ) -> Iterator[IngestResult]:
        """Enumerate and extract content from a source.

        Yields a stream of ``SourceItemMetadata`` and ``DrawerRecord`` values.
        Lazy adapters yield ``SourceItemMetadata`` ahead of the drawers for
        that item so core can check :meth:`is_current` before committing to
        the fetch. Eager adapters MAY interleave freely.
        """

    @abstractmethod
    def describe_schema(self) -> AdapterSchema:
        """Declare the structured metadata this adapter attaches.

        The returned schema MUST be stable for a given ``adapter_version``.
        Enterprises index on it; core uses it to validate adapter output.
        """

    # ------------------------------------------------------------------
    # Optional methods with default implementations
    # ------------------------------------------------------------------

    def is_current(
        self,
        *,
        item: SourceItemMetadata,
        existing_metadata: Optional[dict],
    ) -> bool:
        """Return True if the palace already has an up-to-date copy of ``item``.

        Default: always returns False (re-extract every time). Adapters
        advertising ``supports_incremental`` MUST override.
        """
        return False

    def source_summary(self, *, source: SourceRef) -> SourceSummary:
        """Describe a source without extracting."""
        return SourceSummary(description=self.name)

    def close(self) -> None:
        """Release any resources the adapter holds. Default: no-op."""
        return None
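
To show how a third-party author might build against this contract, the following is a minimal sketch of a concrete adapter. So that it runs on its own, it redefines trimmed stand-ins for `SourceRef`, `DrawerRecord`, `FieldSpec`, `AdapterSchema`, and `BaseSourceAdapter` rather than importing them from mempalace. The `InlineTextAdapter` class, its `"text"` option, the `line_length` metadata field, and passing `palace=None` are hypothetical choices for the example, not part of the spec.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import ClassVar, Iterator, Optional


# Trimmed stand-ins for the records defined in this module.
@dataclass(frozen=True)
class SourceRef:
    local_path: Optional[str] = None
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class DrawerRecord:
    content: str
    source_file: str
    chunk_index: int = 0
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class FieldSpec:
    type: str
    required: bool
    description: str


@dataclass(frozen=True)
class AdapterSchema:
    fields: dict
    version: str


class BaseSourceAdapter(ABC):
    name: ClassVar[str]
    declared_transformations: ClassVar[frozenset] = frozenset()

    @abstractmethod
    def ingest(self, *, source, palace) -> Iterator[object]: ...

    @abstractmethod
    def describe_schema(self) -> AdapterSchema: ...


class InlineTextAdapter(BaseSourceAdapter):
    """Hypothetical adapter that reads text from ``source.options["text"]``."""

    name = "inline_text"
    adapter_version = "0.1.0"
    # Empty set: content is yielded exactly as found (byte-preserving).
    declared_transformations = frozenset()

    def ingest(self, *, source, palace):
        text = source.options.get("text", "")
        # keepends=True keeps line terminators, so joining the drawers
        # reproduces the input unchanged.
        for index, line in enumerate(text.splitlines(keepends=True)):
            yield DrawerRecord(
                content=line,
                source_file="inline",
                chunk_index=index,
                metadata={"line_length": len(line)},  # flat scalar only
            )

    def describe_schema(self):
        return AdapterSchema(
            fields={
                "line_length": FieldSpec(
                    type="int",
                    required=True,
                    description="Length of the line including its terminator.",
                )
            },
            version="1",
        )


adapter = InlineTextAdapter()
# This sketch never touches the palace, so None stands in for PalaceContext.
records = list(
    adapter.ingest(source=SourceRef(options={"text": "a\nbc\n"}), palace=None)
)
```

Both required methods must be overridden before the class can be instantiated; calling `BaseSourceAdapter()` directly raises `TypeError` because of the abstract methods.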