Redact: multi-doc upload, contextual person naming, aliases

- Extract limit raised from 32K to 128K chars per file (long legal docs now fit)
- Redact API body/text limits raised (400KB / 128K chars) to match
- Upload zone accepts multiple files (up to 5); extracted text concatenated with doc separator and combined before redaction; shows per-file char counts
- LLM redact pass now infers contextual person roles (FATHER, MOTHER, CHILD, ATTORNEY, JUDGE, etc.) instead of generic [PERSON] for all names; same individual gets consistent tag throughout the document
- Tag validation widened to allow any [A-Za-z0-9_- ] pattern (not just the five hardcoded tags), supporting contextual and alias tags
- Alias UI added to Redact mode: user maps real names to bracketed aliases (e.g. "David Jr" -> [Junior]); aliases injected into LLM system prompt as override instructions; max 20 aliases, 100 chars each
- max_tokens raised from 2000 to 4000; timeout from 60s to 90s for larger docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
+43 -20
@@ -330,7 +330,7 @@ PROMPT;
         ];
     }
 
-    public function redact(string $text, string $mode = 'standard', string $region = 'nordic', string $language = 'en'): array
+    public function redact(string $text, string $mode = 'standard', string $region = 'nordic', string $language = 'en', array $aliases = []): array
     {
         $text = $this->requirePasteText($text);
         $mode = $mode === 'strict' ? 'strict' : 'standard';
@@ -357,7 +357,7 @@ PROMPT;
         $pass2Counts = [];
         $llmDeployment = null;
 
-        $llmResult = $this->llmRedactionPass($preRedacted, $language);
+        $llmResult = $this->llmRedactionPass($preRedacted, $language, $aliases);
 
         if (!empty($llmResult['skipped'])) {
             $trace[] = $this->trace('Pass 2 — LLM semantic scan', 'Skipped: ' . ($llmResult['reason'] ?? 'Azure not configured') . '.', 'warning');
@@ -378,7 +378,7 @@ PROMPT;
                 if ($original === '' || str_starts_with($original, '[')) {
                     continue;
                 }
-                if (!in_array($tag, ['[PERSON]', '[ORG]', '[PLACE]', '[DOB]', '[IDENTIFIER]'], true)) {
+                if (!preg_match('/^\[[A-Za-z0-9_\- ]+\]$/', $tag)) {
                     $tag = '[IDENTIFIER]';
                 }
                 if (str_contains($finalRedacted, $original)) {
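The widened tag check in the hunk above can be exercised in isolation. The following is a minimal sketch, not code from the commit: `normalizeTag()` is a hypothetical helper name introduced only for illustration, while the pattern and the `[IDENTIFIER]` fallback are copied from the diff.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the commit): returns the tag unchanged
 * when it matches the widened pattern, otherwise falls back to [IDENTIFIER].
 */
function normalizeTag(string $tag): string
{
    // Same pattern as in the diff: one or more ASCII letters, digits,
    // underscores, hyphens or spaces, wrapped in square brackets.
    if (!preg_match('/^\[[A-Za-z0-9_\- ]+\]$/', $tag)) {
        return '[IDENTIFIER]';
    }

    return $tag;
}

// Contextual and alias tags pass; empty, unbracketed, nested-bracket and
// non-ASCII tags are replaced by the fallback.
foreach (['[FATHER]', '[CHILD_1]', '[Junior]', '[]', 'PERSON', '[a]b]', '[Søn]'] as $candidate) {
    printf("%s => %s\n", $candidate, normalizeTag($candidate));
}
```

Two properties of the pattern may be worth keeping in mind. The character class is ASCII-only, so an alias such as `[Søn]` falls back to `[IDENTIFIER]`. Also, without the `D` modifier, `$` in PCRE matches before a trailing newline, so a tag like `"[FATHER]\n"` would pass this check as written.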
@@ -780,36 +780,59 @@ PROMPT;
         ]);
     }
 
-    private function llmRedactionPass(string $preRedacted, string $language = 'en'): array
+    private function llmRedactionPass(string $preRedacted, string $language = 'en', array $aliases = []): array
     {
         $missing = $this->azure->missingChatConfig();
         if ($missing) {
             return ['skipped' => true, 'reason' => 'Azure chat not configured (' . implode(', ', $missing) . ')'];
         }
 
-        $languageNote = $language === 'no' ? "\nThe document may contain Norwegian or mixed-language content." : '';
+        $languageNote = $language === 'no' ? "\n • The document may contain Norwegian or mixed-language content." : '';
+
+        $aliasBlock = '';
+        if (!empty($aliases)) {
+            $lines = [];
+            foreach ($aliases as $a) {
+                $orig = str_replace(["\n", "\r", '`', '"', '{', '}'], ' ', substr(trim((string)($a['original'] ?? '')), 0, 100));
+                $lbl = str_replace(["\n", "\r", '`', '"', '{', '}'], ' ', substr(trim((string)($a['alias'] ?? '')), 0, 100));
+                if ($orig !== '' && $lbl !== '') {
+                    $lines[] = " \"{$orig}\" → [{$lbl}]";
+                }
+            }
+            if ($lines) {
+                $aliasBlock = "\n\nALIAS OVERRIDES — use these exact replacement tags for these specific names instead of inferring a role:\n" . implode("\n", $lines);
+            }
+        }
 
         $system = <<<PROMPT
 You are a privacy redaction assistant for legal documents (ECHR judgements, Norwegian family law cases, EU child welfare documents). The text below has already had mechanical identifiers replaced with placeholder tags in [BRACKETS].
 
 Your task: find any remaining identifiable information — person names, organisation names, specific places at city level or below, dates of birth, and identifying descriptions.
 
-Return ONLY a valid JSON object:
-{"redactions":[{"original":"exact text as it appears","type":"person_name","tag":"[PERSON]"}]}
+STEP 1 — For person names: identify each individual and infer their role or relationship from context.
+Assign each person a consistent contextual tag used for every occurrence of their name:
+• Family roles: FATHER, MOTHER, CHILD, CHILD_1, CHILD_2, GRANDPARENT, SIBLING
+• Professional roles: ATTORNEY, JUDGE, CASEWORKER, EXPERT_WITNESS
+• Generic fallback: PERSON_1, PERSON_2 (use only when role cannot be determined)
+The same individual MUST receive the same tag every time they appear.{$aliasBlock}
 
-Allowed type values and their tags:
-- person_name → [PERSON]
-- org → [ORG]
-- place → [PLACE]
-- date_of_birth → [DOB]
-- other → [IDENTIFIER]
+Return ONLY a valid JSON object:
+{"redactions":[{"original":"exact text as it appears","type":"person_name","tag":"[FATHER]"}]}
+
+Allowed types and their tag format:
+person_name → contextual role tag e.g. [FATHER], [CHILD_1], [ATTORNEY] (or alias tag if provided above)
+org → [ORG]
+place → [PLACE]
+date_of_birth → [DOB]
+other → [IDENTIFIER]
 
 Rules:
-- Include only text that appears verbatim in the input. Do not invent or paraphrase.
-- If nothing needs redacting, return {"redactions":[]}.
-- Do not redact text already inside [BRACKETS].
-- Legal citations, statute names, article numbers, and institution names (e.g. "the European Court of Human Rights", "Barnevernloven § 4-12") are NOT PII.
-- Short common words, conjunctions, and prepositions are NOT PII.{$languageNote}
+• Include only text that appears verbatim in the input. Do not invent or paraphrase.
+• The same person MUST get the same tag every time they appear.
+• If nothing needs redacting, return {"redactions":[]}.
+• Do not redact text already inside [BRACKETS].
+• Legal citations, statute names, article numbers, and institution names (e.g. "the European Court of Human Rights", "Barnevernloven § 4-12") are NOT PII.
+• Short common words, conjunctions, and prepositions are NOT PII.{$languageNote}
 PROMPT;
 
         try {
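The alias handling added in the hunk above (trim, cap at 100 characters, strip characters that could break out of the prompt, skip empty pairs) can be sketched as a standalone function. This is an illustration under stated assumptions, not the commit's code: `buildAliasBlock()` and its parameters are hypothetical names, the header text is shortened, and the 20-alias cap is taken from the commit message; where the commit actually enforces that cap is not visible in this diff.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical standalone version of the alias-block logic.
 * Each alias is expected as ['original' => real name, 'alias' => label].
 */
function buildAliasBlock(array $aliases, int $maxAliases = 20, int $maxLen = 100): string
{
    // Characters replaced by a space so user input cannot add prompt lines,
    // close the quoted name, or look like a JSON/interpolation brace.
    $strip = ["\n", "\r", '`', '"', '{', '}'];
    $lines = [];

    // Assumption: the 20-entry cap is applied here; the diff does not show it.
    foreach (array_slice($aliases, 0, $maxAliases) as $a) {
        if (!is_array($a)) {
            continue; // ignore malformed entries
        }
        $orig = str_replace($strip, ' ', substr(trim((string)($a['original'] ?? '')), 0, $maxLen));
        $lbl  = str_replace($strip, ' ', substr(trim((string)($a['alias'] ?? '')), 0, $maxLen));

        // Only complete pairs become override instructions.
        if ($orig !== '' && $lbl !== '') {
            $lines[] = " \"{$orig}\" → [{$lbl}]";
        }
    }

    if (!$lines) {
        return '';
    }

    // Shortened header for the sketch; the commit uses a longer instruction.
    return "\n\nALIAS OVERRIDES:\n" . implode("\n", $lines);
}

echo buildAliasBlock([
    ['original' => 'David Jr', 'alias' => 'Junior'],
    ['original' => "Anna {\"x\"}", 'alias' => 'Mother'],
    ['original' => '', 'alias' => 'Ignored'],
]);
```

One design point to consider: `substr()` counts bytes, so a 100-byte cut can land in the middle of a multi-byte character such as `å` or `ø`. If the `mbstring` extension is available, `mb_substr()` would cut on character boundaries instead.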
@@ -818,9 +841,9 @@ PROMPT;
                 ['role' => 'user', 'content' => $preRedacted],
             ], [
                 'temperature' => 0.1,
-                'max_tokens' => 2000,
+                'max_tokens' => 4000,
                 'json' => true,
-                'timeout' => 60,
+                'timeout' => 90,
             ]);
 
             $content = (string)($response['choices'][0]['message']['content'] ?? '');