mempalace/mempalace/sources/transforms.py
Igor Lins e Silva 552e9927b7 refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext
Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: covers the 13 reserved transformations
  (§1.4). Seven have reference implementations as pure functions
  (utf8_replace_invalid, newline_normalize, whitespace_trim,
  whitespace_collapse_internal, line_trim, line_join_spaces,
  blank_line_drop); the six adapter-specific ones (strip_tool_chrome,
  tool_result_truncate, tool_result_omitted, spellcheck_user,
  synthesized_marker, speaker_role_assignment) are identity shims that the
  conversations adapter will override when migrated.
  get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
2026-04-18 16:05:32 -03:00

180 lines
6.9 KiB
Python

"""Reference implementations of the reserved content transformations (RFC 002 §1.4).

Every source adapter declares the set of transformations it applies to source
bytes via ``declared_transformations``. The conformance suite then verifies
that the adapter's output can be reproduced from the source bytes by applying
*only* the declared transformations in declaration order, using these
reference implementations.

Each transformation is a pure function on strings (text content after UTF-8
decoding). ``utf8_replace_invalid`` is the one that operates on bytes.

The invariant the spec enforces: **no transformation is applied that is not
declared in the adapter's set**. Adapters with an empty set are byte-preserving
end-to-end (modulo the initial UTF-8 decode itself, which is captured by
``utf8_replace_invalid`` when applicable).

Adapters MAY add custom transformations beyond the reserved set; third-party
names SHOULD be prefixed with the adapter name (``cursor.composer_ordering``).
Custom transformations MUST expose a reference implementation under
``mempalace.sources.transforms.<adapter_name>_<transform_name>`` so the
conformance suite can locate and apply them.
"""

from __future__ import annotations

import re
from typing import Callable

# ---------------------------------------------------------------------------
# Reserved transformations
# ---------------------------------------------------------------------------


def utf8_replace_invalid(raw: bytes) -> str:
    """Decode bytes as UTF-8; replace invalid sequences with U+FFFD.

    Equivalent to ``raw.decode("utf-8", errors="replace")``. This is the one
    reserved transformation that operates on bytes rather than decoded text.
    """
    return raw.decode("utf-8", errors="replace")


def newline_normalize(text: str) -> str:
    """Convert CRLF and bare-CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    """Strip leading and trailing whitespace at the record boundary only."""
    return text.strip()


_RUN_OF_THREE_OR_MORE_BLANK = re.compile(r"(?:\n[ \t]*){3,}\n")


def whitespace_collapse_internal(text: str) -> str:
    """Collapse runs of three or more blank lines to exactly two blank lines.

    A "blank line" here is a line containing only spaces or tabs. Single and
    double blank-line runs are preserved.
    """
    # Normalise inputs before collapsing: turn internal blank lines with
    # whitespace content into pure \n so the regex matches consistently.
    lines = text.split("\n")
    normalised = "\n".join(line if line.strip() else "" for line in lines)
    return _RUN_OF_THREE_OR_MORE_BLANK.sub("\n\n\n", normalised)


def line_trim(text: str) -> str:
    """Strip leading and trailing whitespace from each individual line."""
    return "\n".join(line.strip() for line in text.split("\n"))


def line_join_spaces(text: str) -> str:
    """Join adjacent non-blank lines with a single space, preserving paragraph breaks.

    Two lines separated by at least one blank line remain on separate lines;
    runs of non-blank lines collapse into a single space-separated line.
    """
    paragraphs = re.split(r"\n[ \t]*\n", text)
    joined = [" ".join(line.strip() for line in p.split("\n") if line.strip()) for p in paragraphs]
    return "\n\n".join(joined)


def blank_line_drop(text: str) -> str:
    """Drop blank lines between non-blank lines, keeping non-blank lines only."""
    return "\n".join(line for line in text.split("\n") if line.strip())


# The following reserved transformations are declared in the spec but are
# deeply adapter-specific. Rather than guess a single reference implementation
# now, we provide identity shims that return their input unchanged. Adapters
# that declare these MUST either override with a concrete implementation or
# provide a namespaced reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these only when none exists.


def strip_tool_chrome(text: str) -> str:
    """Adapter-supplied: remove system tags, hook output, tool UI chrome.

    The reference implementation here is intentionally an identity function
    because the noise patterns differ per transcript format (Claude Code,
    Codex, ChatGPT, Slack). The conversations adapter, when migrated, will
    register a concrete reference implementation under
    ``mempalace.sources.transforms.conversations_strip_tool_chrome``.
    """
    return text


def tool_result_truncate(text: str) -> str:
    """Adapter-supplied: head/tail window on tool output with a middle marker."""
    return text


def tool_result_omitted(text: str) -> str:
    """Adapter-supplied: fully omit some tool outputs (e.g., Read/Edit/Write)."""
    return text


def spellcheck_user(text: str) -> str:
    """Adapter-supplied: rewrite user turns via autocorrect.

    Requires the optional ``spellcheck`` extra and a tokenizer; the spec does
    not mandate a specific language model, so the reference is adapter-owned.
    """
    return text


def synthesized_marker(text: str) -> str:
    """Adapter-supplied: adapter inserts its own strings (e.g., '[N lines omitted]')."""
    return text


def speaker_role_assignment(text: str) -> str:
    """Adapter-supplied: multi-party speakers alternately assigned user/assistant."""
    return text


# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

# Reserved transformation name → reference implementation.
# Adapters look up by name to compose a round-trip pipeline during testing.
RESERVED_TRANSFORMATIONS: dict[str, Callable[..., str]] = {
    "utf8_replace_invalid": utf8_replace_invalid,
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
    "whitespace_collapse_internal": whitespace_collapse_internal,
    "line_trim": line_trim,
    "line_join_spaces": line_join_spaces,
    "blank_line_drop": blank_line_drop,
    "strip_tool_chrome": strip_tool_chrome,
    "tool_result_truncate": tool_result_truncate,
    "tool_result_omitted": tool_result_omitted,
    "spellcheck_user": spellcheck_user,
    "synthesized_marker": synthesized_marker,
    "speaker_role_assignment": speaker_role_assignment,
}


def get_transformation(name: str) -> Callable[..., str]:
    """Resolve a reserved transformation by name.

    Raises :class:`KeyError` if the name is not a reserved transformation.
    Adapter-namespaced references (``<adapter>_<transform>``) are not resolved
    here; callers looking for them SHOULD ``getattr`` on this module first.
    This helper only covers the reserved names.
    """
    try:
        return RESERVED_TRANSFORMATIONS[name]
    except KeyError as e:
        raise KeyError(
            f"unknown transformation {name!r}; reserved names: {sorted(RESERVED_TRANSFORMATIONS)}"
        ) from e
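
The round-trip check described in the module docstring (reproduce the adapter's
output by applying only the declared transformations, in declaration order)
might be composed along the lines of the sketch below. It is kept
self-contained for illustration: `apply_declared`, the `REFERENCE` mapping, and
the sample inputs are hypothetical, the two text transformations are restated
locally rather than imported, and the strict decode used when
`utf8_replace_invalid` is not declared is an assumption, not something the
spec text above states.

```python
from typing import Callable, Dict, List


def newline_normalize(text: str) -> str:
    # Restated from the module above: CRLF and bare CR become LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    # Restated from the module above: strip at the record boundary only.
    return text.strip()


# Hypothetical stand-in for RESERVED_TRANSFORMATIONS (text-to-text entries only).
REFERENCE: Dict[str, Callable[[str], str]] = {
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
}


def apply_declared(raw: bytes, declared: List[str]) -> str:
    """Decode the source bytes, then apply the declared names in order."""
    names = list(declared)
    if names and names[0] == "utf8_replace_invalid":
        # The one bytes-level transformation: invalid sequences become U+FFFD.
        text = raw.decode("utf-8", errors="replace")
        names = names[1:]
    else:
        # Assumption: without that declaration, the decode must succeed strictly.
        text = raw.decode("utf-8")
    for name in names:
        text = REFERENCE[name](text)  # KeyError for any undeclared/unknown name
    return text


reproduced = apply_declared(b"  hello\r\nworld\r  ", ["newline_normalize", "whitespace_trim"])
replaced = apply_declared(b"a\xffb", ["utf8_replace_invalid"])
```

A conformance test would then compare `reproduced` with what the adapter
actually stored; any difference points to an undeclared transformation.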