Redact: multi-doc upload, contextual person naming, aliases

- Extract limit raised from 32K to 128K chars per file (long legal docs now fit)
- Redact API body/text limits raised (400KB / 128K chars) to match
- Upload zone accepts multiple files (up to 5); extracted text is joined with a
  doc separator before redaction; UI shows per-file char counts
- LLM redact pass now infers contextual person roles (FATHER, MOTHER, CHILD,
  ATTORNEY, JUDGE, etc.) instead of generic [PERSON] for all names; same
  individual gets consistent tag throughout the document
- Tag validation widened to allow any [A-Za-z0-9_- ] pattern (not just the
  five hardcoded tags), supporting contextual and alias tags
- Alias UI added to Redact mode: user maps real names to bracketed aliases
  (e.g. "David Jr" -> [Junior]); aliases injected into LLM system prompt as
  override instructions; max 20 aliases, 100 chars each
- max_tokens raised from 2000 to 4000; timeout from 60s to 90s for larger docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
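
The widened tag check from the message above can be tried in isolation. This is a minimal sketch, assuming a hypothetical `normalizeTag()` helper (the commit inlines the check in `redact()`); the regex and the `[IDENTIFIER]` fallback are copied from the diff, and the sample tags are invented for illustration.

```php
<?php
// Hypothetical helper: the commit performs this check inline in redact().
function normalizeTag(string $tag): string
{
    // Accept any bracketed tag made of ASCII letters, digits, underscore,
    // hyphen, or space; anything else falls back to the generic tag.
    if (!preg_match('/^\[[A-Za-z0-9_\- ]+\]$/', $tag)) {
        return '[IDENTIFIER]';
    }
    return $tag;
}

// Invented sample tags: contextual roles, an alias, and malformed input.
$samples = ['[FATHER]', '[CHILD_1]', '[Junior]', '[Expert Witness]',
            'FATHER', '[]', '[A][B]', '[Søster]'];
foreach ($samples as $tag) {
    echo $tag, ' => ', normalizeTag($tag), PHP_EOL;
}
```

Because the character class is ASCII-only and the pattern has no `u` modifier, an alias containing non-ASCII letters (for example `[Søster]`) does not match and is rewritten to `[IDENTIFIER]`.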
2026-05-13 07:17:02 +02:00
parent bbe5307c03
commit 95685862ab
6 changed files with 276 additions and 55 deletions
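
The alias handling this commit adds to `llmRedactionPass()` can be read as a small pure function. The sketch below mirrors the sanitising and formatting from the diff under a hypothetical name, `buildAliasBlock()`. The 20-alias cap comes from the commit message; applying it at this point is an assumption, since the visible hunks do not show where it is enforced.

```php
<?php
// Hypothetical standalone version of the alias logic inlined in llmRedactionPass().
function buildAliasBlock(array $aliases): string
{
    // Characters replaced with a space so a name cannot break the prompt layout.
    $strip = ["\n", "\r", '`', '"', '{', '}'];
    $lines = [];
    // Assumption: the 20-alias cap is applied here; the commit may enforce it elsewhere.
    foreach (array_slice($aliases, 0, 20) as $a) {
        $orig = str_replace($strip, ' ', substr(trim((string)($a['original'] ?? '')), 0, 100));
        $lbl  = str_replace($strip, ' ', substr(trim((string)($a['alias'] ?? '')), 0, 100));
        if ($orig !== '' && $lbl !== '') {
            $lines[] = "  \"{$orig}\" → [{$lbl}]";
        }
    }
    if (!$lines) {
        return ''; // nothing usable: the prompt stays unchanged
    }
    return "\n\nALIAS OVERRIDES — use these exact replacement tags for these specific names instead of inferring a role:\n"
        . implode("\n", $lines);
}

// Usage with the example from the commit message.
echo buildAliasBlock([['original' => 'David Jr', 'alias' => 'Junior']]), PHP_EOL;
```

Note that `substr()` counts bytes, so the 100-character limit is a 100-byte limit and can cut a multibyte character such as `ø` in half; `mb_substr()` would count characters instead and may be worth considering for Norwegian names.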
+43 -20
@@ -330,7 +330,7 @@ PROMPT;
];
}
-public function redact(string $text, string $mode = 'standard', string $region = 'nordic', string $language = 'en'): array
+public function redact(string $text, string $mode = 'standard', string $region = 'nordic', string $language = 'en', array $aliases = []): array
{
$text = $this->requirePasteText($text);
$mode = $mode === 'strict' ? 'strict' : 'standard';
@@ -357,7 +357,7 @@ PROMPT;
$pass2Counts = [];
$llmDeployment = null;
-$llmResult = $this->llmRedactionPass($preRedacted, $language);
+$llmResult = $this->llmRedactionPass($preRedacted, $language, $aliases);
if (!empty($llmResult['skipped'])) {
$trace[] = $this->trace('Pass 2 — LLM semantic scan', 'Skipped: ' . ($llmResult['reason'] ?? 'Azure not configured') . '.', 'warning');
@@ -378,7 +378,7 @@ PROMPT;
if ($original === '' || str_starts_with($original, '[')) {
continue;
}
-if (!in_array($tag, ['[PERSON]', '[ORG]', '[PLACE]', '[DOB]', '[IDENTIFIER]'], true)) {
+if (!preg_match('/^\[[A-Za-z0-9_\- ]+\]$/', $tag)) {
$tag = '[IDENTIFIER]';
}
if (str_contains($finalRedacted, $original)) {
@@ -780,36 +780,59 @@ PROMPT;
]);
}
-private function llmRedactionPass(string $preRedacted, string $language = 'en'): array
+private function llmRedactionPass(string $preRedacted, string $language = 'en', array $aliases = []): array
{
$missing = $this->azure->missingChatConfig();
if ($missing) {
return ['skipped' => true, 'reason' => 'Azure chat not configured (' . implode(', ', $missing) . ')'];
}
-$languageNote = $language === 'no' ? "\nThe document may contain Norwegian or mixed-language content." : '';
+$languageNote = $language === 'no' ? "\nThe document may contain Norwegian or mixed-language content." : '';
+$aliasBlock = '';
+if (!empty($aliases)) {
+$lines = [];
+foreach ($aliases as $a) {
+$orig = str_replace(["\n", "\r", '`', '"', '{', '}'], ' ', substr(trim((string)($a['original'] ?? '')), 0, 100));
+$lbl = str_replace(["\n", "\r", '`', '"', '{', '}'], ' ', substr(trim((string)($a['alias'] ?? '')), 0, 100));
+if ($orig !== '' && $lbl !== '') {
+$lines[] = " \"{$orig}\" → [{$lbl}]";
+}
+}
+if ($lines) {
+$aliasBlock = "\n\nALIAS OVERRIDES — use these exact replacement tags for these specific names instead of inferring a role:\n" . implode("\n", $lines);
+}
+}
$system = <<<PROMPT
You are a privacy redaction assistant for legal documents (ECHR judgements, Norwegian family law cases, EU child welfare documents). The text below has already had mechanical identifiers replaced with placeholder tags in [BRACKETS].
Your task: find any remaining identifiable information — person names, organisation names, specific places at city level or below, dates of birth, and identifying descriptions.
-Return ONLY a valid JSON object:
-{"redactions":[{"original":"exact text as it appears","type":"person_name","tag":"[PERSON]"}]}
+STEP 1 — For person names: identify each individual and infer their role or relationship from context.
+Assign each person a consistent contextual tag used for every occurrence of their name:
+• Family roles: FATHER, MOTHER, CHILD, CHILD_1, CHILD_2, GRANDPARENT, SIBLING
+• Professional roles: ATTORNEY, JUDGE, CASEWORKER, EXPERT_WITNESS
+• Generic fallback: PERSON_1, PERSON_2 (use only when role cannot be determined)
+The same individual MUST receive the same tag every time they appear.{$aliasBlock}
-Allowed type values and their tags:
-- person_name → [PERSON]
-- org → [ORG]
-- place → [PLACE]
-- date_of_birth → [DOB]
-- other → [IDENTIFIER]
+Return ONLY a valid JSON object:
+{"redactions":[{"original":"exact text as it appears","type":"person_name","tag":"[FATHER]"}]}
+Allowed types and their tag format:
+person_name → contextual role tag e.g. [FATHER], [CHILD_1], [ATTORNEY] (or alias tag if provided above)
+org → [ORG]
+place → [PLACE]
+date_of_birth → [DOB]
+other → [IDENTIFIER]
Rules:
-- Include only text that appears verbatim in the input. Do not invent or paraphrase.
-- If nothing needs redacting, return {"redactions":[]}.
-- Do not redact text already inside [BRACKETS].
-- Legal citations, statute names, article numbers, and institution names (e.g. "the European Court of Human Rights", "Barnevernloven § 4-12") are NOT PII.
-- Short common words, conjunctions, and prepositions are NOT PII.{$languageNote}
+• Include only text that appears verbatim in the input. Do not invent or paraphrase.
+• The same person MUST get the same tag every time they appear.
+• If nothing needs redacting, return {"redactions":[]}.
+• Do not redact text already inside [BRACKETS].
+• Legal citations, statute names, article numbers, and institution names (e.g. "the European Court of Human Rights", "Barnevernloven § 4-12") are NOT PII.
+• Short common words, conjunctions, and prepositions are NOT PII.{$languageNote}
PROMPT;
try {
@@ -818,9 +841,9 @@ PROMPT;
['role' => 'user', 'content' => $preRedacted],
], [
'temperature' => 0.1,
-'max_tokens' => 2000,
+'max_tokens' => 4000,
'json' => true,
-'timeout' => 60,
+'timeout' => 90,
]);
$content = (string)($response['choices'][0]['message']['content'] ?? '');