// Package sanitize provides input-cleaning primitives and safe-emit
// builders for each markdown lexical slot. Realm authors wrap
// user-supplied strings with these helpers before flowing them into
// rendered markdown output. Each helper targets one specific slot (link
// text, heading text, URL href, table cell, HTML attribute, fenced code
// block, blockquote, footnote definition, link-reference definition,
// etc.) and neutralizes the bytes that would otherwise let user content
// break out of that slot or inject new top-level structure.
//
// Pick the right helper from the table under "Picking the right helper"
// below, then wrap each user-supplied argument exactly once at the call
// site (see "The audit rule").
//
// # Wrap once
//
// Most escapers and safe-emit builders in this package are NOT
// idempotent: applying them twice re-escapes bytes the first pass
// added (`\*` becomes `\\\*`, `&amp;` becomes `&amp;amp;`, a fenced
// block gets re-fenced). Wrap each user-derived string with at most
// one sanitize.* call. Block and BlockRich are exceptions (idempotent
// by design), but the at-most-once rule is still the safest default.
// See the "Idempotence classes" enumeration below for the full
// breakdown.
//
// Some markdown-builder packages (e.g. p/moul/md) sanitize the args of
// specific helpers internally; see each builder's package doc for the
// per-helper contract. If the builder sanitizes for you, pass the raw
// user input; if it doesn't, wrap the input with the right sanitize.*
// helper at the call site.
//
// # Picking the right helper
//
// Match the helper to the slot the user content lands in:
//
//	slot                                          helper
//	-------------------------------------------------------------
//	[text](url)                                   InlineText (text)
//	# Heading text                                InlineText
//	**bold** _italic_                             InlineText
//	![alt](src)                                   InlineText (alt)
//	> [!NOTE] one-line title                      InlineText
//	multi-paragraph post body                     Block
//	multi-paragraph post body w/ rich block       BlockRich
//	  structure (headings, lists, tables, etc.)
//	multi-line blockquote (`> ` prefixed)         Blockquote
//	multi-line blockquote w/ rich block body      BlockquoteRich
//	[text](url "title")                           LinkTitle (title)
//	| cell |                                      TableCell
//	HTML attribute value (X)                      HTMLEscape
//	HTML element text content (X)                 HTMLEscape
//	any URL going into ](X)                       URL
//	any image src going into (X)                  ImageURL
//	`inline code` inside running prose            InlineCode
//	multi-line fenced code block                  CodeBlock
//	multi-line fenced code with language tag      LanguageCodeBlock
//	[^name]: footnote body                        FootnoteDefinition
//	[label]: url "title" reference def            LinkReferenceDefinition
//	r/sys/users handle                            UserName (validator)
//	g1.../gpub1... etc.                           BechString (validator)
//	footnote / LRD label / {#id} anchor name      FootnoteLabel (validator)
//	fenced-code language tag                      LanguageName (validator)
//	prefix arg to md.Nested                       NestedPrefix (validator)
//
// # Invariants
//
// All helpers in this package are panic-free for any string input and
// run in O(len(input)) time with bounded allocation.
//
// Idempotence classes:
//
//	Idempotent (calling twice == calling once):
//	  StripBidiAndZeroWidth, NormalizeBreaks
//	  UserName, BechString, FootnoteLabel, LanguageName, NestedPrefix
//	  URL, ImageURL     (accept→identity; reject→"")
//	  Block             (bracket walker treats \[/\] as ordinary;
//	                     line-leader escapes don't re-fire on
//	                     already-escaped `\#` etc.)
//	  BlockRich         (TrimLeft/TrimRight + "\n\n" wrap is stable)
//
//	NOT idempotent (never wrap an already-sanitized string):
//	  InlineText, LinkTitle, TableCell  (re-escape backslashes)
//	  HTMLEscape                        (re-escapes `&amp;` → `&amp;amp;`)
//	  Blockquote, BlockquoteRich        (re-prefixes `> `, nesting the quote each pass)
//	  InlineCode, CodeBlock,
//	  LanguageCodeBlock                 (wrap with a fence; calling twice double-wraps)
//	  FootnoteDefinition,
//	  LinkReferenceDefinition           (compose Block/InlineText/URL internally;
//	                                     passing already-sanitized strings double-escapes)
//
// CodeFence is pure: same inputs always give the same output.
//
// Validators (UserName / BechString / FootnoteLabel / LanguageName /
// NestedPrefix) return either the cleaned input verbatim or "". They
// never partially-sanitize: if the input doesn't match the slot's
// charset/shape, the answer is rejection.
//
// # Composition rules
//
// Direct sanitize use (when emitting markdown without a builder package):
//
//	out := "# " + sanitize.InlineText(userTitle) + "\n\n" +
//		sanitize.Block(userBody)
//	out += sanitize.Blockquote(userQuote)
//	out += sanitize.LanguageCodeBlock(realmLang, userCode)
//
// Use with a builder package (e.g. p/moul/md): pass raw user input to
// the builder helpers that sanitize internally. Do NOT pre-wrap with
// sanitize.*, or the input gets double-escaped (escapers are not
// idempotent). See the builder's package doc for the per-helper
// contract. For example, with p/moul/md:
//
//	md.Blockquote(userProse)                          // good: md.Blockquote sanitizes
//	md.LanguageCodeBlock(realmLang, userCode)         // good: sanitizes both args
//	md.Link(userText, userURL)                        // good: sanitizes both slots
//
//	md.Blockquote(sanitize.Block(userProse))          // BAD: double-wrap
//	md.Link(sanitize.InlineText(t), sanitize.URL(u))  // BAD: double-wrap
//
// Wrong (across all callers):
//
//	sanitize.InlineText(sanitize.InlineText(s))      double-wrap (re-escape)
//	sanitize.TableCell(sanitize.InlineText(s))       TableCell already calls InlineText
//	sanitize.URL(sanitize.InlineText(href))          inline-escape backslash-escapes `.` `-` `_`
//	                                                 inside the URL, corrupting the host/path
//	sanitize.Blockquote(sanitize.Blockquote(s))      double-wrap: outer would escape the
//	                                                 inner `> ` prefixes
//	sanitize.Block(sanitize.BlockRich(s))            double-sanitize: strict Block re-escapes
//	                                                 the markers BlockRich preserved (headings,
//	                                                 lists, tables); BlockRich's rich structure
//	                                                 renders as literal text after Block escapes
//	                                                 its line-leaders
//	sanitize.BlockRich(sanitize.Block(s))            pointless double-sanitize: Block already
//	                                                 escaped every line-leader to `\#`/`\>`/etc.;
//	                                                 BlockRich preserves the backslash escapes
//	                                                 as visible artifacts in user prose
//	sanitize.Blockquote(sanitize.BlockRich(s))       double-sanitize: Blockquote's Block step
//	                                                 re-escapes the markers BlockRich preserved
//	sanitize.BlockRich(sanitize.Blockquote(s))       nonsense: Blockquote already line-prefixed
//	                                                 with `> `; BlockRich expects raw user content
//	sanitize.BlockquoteRich(sanitize.BlockRich(s))   double-wrap: Rich + Rich nests twice
//	sanitize.BlockRich(sanitize.TableCell(s))        wrong slot: use TableCell for cell content,
//	                                                 BlockRich for multi-paragraph block content
//	sanitize.TableCell(multiParagraphProse)          newlines fold to space silently; use a
//	                                                 non-table layout for multi-paragraph text
//
// # Threat model
//
// Sanitizers in this package defend against:
//
//   - bidi/zero-width injection: invisible characters that make
//     displayed text disagree with stored bytes (e.g. an address `g1abc...`
//     that renders as `g1xyz...`, or a username that visually collides
//     with another). Stripped by StripBidiAndZeroWidth, which runs as
//     the first step of every text-shaped helper.
//   - line-ending homoglyphs: CR-only and Unicode separators
//     (U+0085 NEL, U+2028, U+2029) that some renderers treat as line
//     breaks. Folded uniformly.
//   - markdown-structure injection: user content opening a heading,
//     blockquote, list, code fence, link-reference def, setext underline,
//     gnoweb extension delimiter, or GFM table row at document level.
//     Strict Block escapes the line-leading `|` of any GFM table row so
//     user content cannot inject `<table>`-shaped structure; permissive
//     BlockRich preserves table rows so authors can compose `<table>`
//     elements (gnoweb loads extension.Table per render_config.go).
//   - HTML block type 1-5 absorption: CommonMark §4.6 HTML block types 1
//     (`