552e9927b7
Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6).
- mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4). Seven are pure functions (utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop); the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) are identity shims that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name.
- mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") plus an explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, there is no auto-detection on the read side (§3.3 is explicit that user intent is never inferred from on-disk artifacts).
- mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks that core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions.
- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). A backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info, then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields.
- pyproject.toml: mempalace.sources entry-point group declared. It is empty on the first-party side for now, since miners migrate in a follow-up; the group being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and the backwards-compatible legacy-schema migration.
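
The column migration mentioned for mempalace/knowledge_graph.py (PRAGMA table_info, then ALTER TABLE ADD COLUMN) could look roughly like the sketch below. It is only an illustration: the table name `triples`, the TEXT column types, and the helper `ensure_provenance_columns` are assumptions, not the names used in the actual module.

```python
import sqlite3

# Assumption: the table name "triples", the TEXT types, and this helper are
# illustrative only; the real schema lives in mempalace/knowledge_graph.py.
PROVENANCE_COLUMNS = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}


def ensure_provenance_columns(conn: sqlite3.Connection, table: str = "triples") -> list:
    """Add any missing provenance columns and return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column, sql_type in PROVENANCE_COLUMNS.items():
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
            added.append(column)
    conn.commit()
    return added


# Simulate opening a legacy database that predates the new columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
first = ensure_provenance_columns(conn)   # adds both columns
second = ensure_provenance_columns(conn)  # nothing left to add
columns = [row[1] for row in conn.execute("PRAGMA table_info(triples)")]
```

Because the second call finds both columns already present, a check of this shape can run on every open without touching an up-to-date database.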
Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).
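
As a rough illustration of how the reserved transformations are meant to be used, the sketch below replays an adapter's declared transformations over the source bytes and compares the result with the adapter's output. The helpers `apply_declared` and `round_trips` are hypothetical, the three transforms are local copies of the reference implementations, and the strict decode used when `utf8_replace_invalid` is not declared is an assumption about how the conformance suite might behave.

```python
from __future__ import annotations

from typing import Callable


# Local copies of three reserved reference implementations.
def utf8_replace_invalid(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


def newline_normalize(text: str) -> str:
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    return text.strip()


TEXT_TRANSFORMS: dict[str, Callable[[str], str]] = {
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
}


def apply_declared(raw: bytes, declared: list[str]) -> str:
    """Hypothetical helper: replay the declared transformations in order."""
    if "utf8_replace_invalid" in declared:
        text = utf8_replace_invalid(raw)
    else:
        # Assumption: without the declaration, decoding must be lossless.
        text = raw.decode("utf-8")
    for name in declared:
        if name != "utf8_replace_invalid":
            text = TEXT_TRANSFORMS[name](text)
    return text


def round_trips(raw: bytes, declared: list[str], adapter_output: str) -> bool:
    """True when the output is reproducible from the declared set alone."""
    return apply_declared(raw, declared) == adapter_output


raw = b"  hello\r\nworld\r\n"
honest = round_trips(raw, ["newline_normalize", "whitespace_trim"], "hello\nworld")
undeclared_trim = round_trips(raw, ["newline_normalize"], "hello\nworld")
```

Here `honest` is true, while `undeclared_trim` is false because the output was trimmed without `whitespace_trim` being declared, which is the kind of violation the declared-transformation check is meant to surface.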
180 lines
6.9 KiB
Python
"""Reference implementations of the reserved content transformations (RFC 002 §1.4).

Every source adapter declares the set of transformations it applies to source
bytes via ``declared_transformations``. The conformance suite then verifies
that the adapter's output can be reproduced from the source bytes by applying
*only* the declared transformations in declaration order, using these
reference implementations.

Each transformation is a pure function on strings (text content after UTF-8
decoding). ``utf8_replace_invalid`` is the one that operates on bytes.

The invariant the spec enforces: **no transformation is applied that is not
declared in the adapter's set**. Adapters with an empty set are byte-preserving
end-to-end (modulo the initial UTF-8 decode itself, which is captured by
``utf8_replace_invalid`` when applicable).

Adapters MAY add custom transformations beyond the reserved set; third-party
names SHOULD be prefixed with the adapter name (``cursor.composer_ordering``).
Custom transformations MUST expose a reference implementation under
``mempalace.sources.transforms.<adapter_name>_<transform_name>`` so the
conformance suite can locate and apply them.
"""

from __future__ import annotations

import re
from typing import Callable


# ---------------------------------------------------------------------------
# Reserved transformations
# ---------------------------------------------------------------------------


def utf8_replace_invalid(raw: bytes) -> str:
    """Decode bytes as UTF-8; replace invalid sequences with U+FFFD.

    Equivalent to ``raw.decode("utf-8", errors="replace")``. This is the one
    reserved transformation that operates on bytes rather than decoded text.
    """
    return raw.decode("utf-8", errors="replace")


def newline_normalize(text: str) -> str:
    """Convert CRLF and bare-CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    """Strip leading and trailing whitespace at the record boundary only."""
    return text.strip()


_RUN_OF_THREE_OR_MORE_BLANK = re.compile(r"(?:\n[ \t]*){3,}\n")


def whitespace_collapse_internal(text: str) -> str:
    """Collapse runs of three or more blank lines to exactly two blank lines.

    A "blank line" here is a line that is empty or contains only whitespace.
    Single and double blank-line runs keep their length, but any whitespace
    inside a blank line is removed by the normalisation step.
    """
    # Normalise inputs before collapsing: turn internal blank lines with
    # whitespace content into pure \n so the regex matches consistently.
    lines = text.split("\n")
    normalised = "\n".join(line if line.strip() else "" for line in lines)
    return _RUN_OF_THREE_OR_MORE_BLANK.sub("\n\n\n", normalised)


def line_trim(text: str) -> str:
    """Strip leading and trailing whitespace from each individual line."""
    return "\n".join(line.strip() for line in text.split("\n"))


def line_join_spaces(text: str) -> str:
    """Join adjacent non-blank lines with a single space, preserving paragraph breaks.

    Two lines separated by at least one blank line remain on separate lines;
    runs of non-blank lines collapse into a single space-separated line.
    """
    paragraphs = re.split(r"\n[ \t]*\n", text)
    joined = [" ".join(line.strip() for line in p.split("\n") if line.strip()) for p in paragraphs]
    return "\n\n".join(joined)


def blank_line_drop(text: str) -> str:
    """Drop every blank (empty or whitespace-only) line, keeping non-blank lines only."""
    return "\n".join(line for line in text.split("\n") if line.strip())


# The following reserved transformations are declared in the spec but are
# deeply adapter-specific. Rather than guess a single reference implementation
# now, we provide identity shims that return their input unchanged. Adapters
# that declare these MUST either override with a concrete implementation or
# provide a namespaced reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these only when none exists.


def strip_tool_chrome(text: str) -> str:
    """Adapter-supplied: remove system tags, hook output, tool UI chrome.

    The reference implementation here is intentionally an identity function
    because the noise patterns differ per transcript format (Claude Code,
    Codex, ChatGPT, Slack). The conversations adapter, when migrated, will
    register a concrete reference implementation under
    ``mempalace.sources.transforms.conversations_strip_tool_chrome``.
    """
    return text


def tool_result_truncate(text: str) -> str:
    """Adapter-supplied: head/tail window on tool output with a middle marker."""
    return text


def tool_result_omitted(text: str) -> str:
    """Adapter-supplied: fully omit some tool outputs (e.g., Read/Edit/Write)."""
    return text


def spellcheck_user(text: str) -> str:
    """Adapter-supplied: rewrite user turns via autocorrect.

    Requires the optional ``spellcheck`` extra and a tokenizer; the spec does
    not mandate a specific language model, so the reference is adapter-owned.
    """
    return text


def synthesized_marker(text: str) -> str:
    """Adapter-supplied: adapter inserts its own strings (e.g., '[N lines omitted]')."""
    return text


def speaker_role_assignment(text: str) -> str:
    """Adapter-supplied: multi-party speakers alternately assigned user/assistant."""
    return text


# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------


# Reserved transformation name → reference implementation.
# Adapters look up by name to compose a round-trip pipeline during testing.
RESERVED_TRANSFORMATIONS: dict[str, Callable[..., str]] = {
    "utf8_replace_invalid": utf8_replace_invalid,
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
    "whitespace_collapse_internal": whitespace_collapse_internal,
    "line_trim": line_trim,
    "line_join_spaces": line_join_spaces,
    "blank_line_drop": blank_line_drop,
    "strip_tool_chrome": strip_tool_chrome,
    "tool_result_truncate": tool_result_truncate,
    "tool_result_omitted": tool_result_omitted,
    "spellcheck_user": spellcheck_user,
    "synthesized_marker": synthesized_marker,
    "speaker_role_assignment": speaker_role_assignment,
}


def get_transformation(name: str) -> Callable[..., str]:
    """Resolve a reserved transformation by name.

    Raises :class:`KeyError` if the name is not one of the reserved names.
    Adapter-namespaced references (``<adapter>_<transform>``) are not resolved
    here: callers looking for them SHOULD ``getattr`` on this module first;
    this helper only covers the reserved names.
    """
    try:
        return RESERVED_TRANSFORMATIONS[name]
    except KeyError as e:
        raise KeyError(
            f"unknown transformation {name!r}; reserved names: {sorted(RESERVED_TRANSFORMATIONS)}"
        ) from e