feat(timeline): tighten prompt for accuracy — year inference, month names, actor normalization, confidence calibration

- Add 4-step year inference rule for DD.MM. entries (scan backward/forward for anchor year)
- Add Norwegian month-name formats (18. september, den 18. september 2025, etc.) with month lookup table
- Add $relativeInstruction to tell LLM upfront when relative dates are excluded (not just PHP-filtered post-hoc)
- Define confidence calibration criteria explicitly (high/medium/low)
- Improve source_excerpt guidance: most diagnostic phrase, not just any verbatim phrase
- Add actor normalization for Norwegian institutions (Barnevernstjenesten, Fylkesnemnda, Statsforvalteren, etc.)
- Add deduplication rule for events appearing across multiple documents
- Add end_date field for date_type=period events
- Improve what_we_found schema hint to require count/range/actors/gaps
- Increase max_tokens to 8000 for azure_full (gpt-4o) to avoid truncation on large documents
- Tighten system prompt with Norwegian CPS legal chain context

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
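
Reviewer note: the 4-step year inference rule is given to the LLM as prompt text only. The same rule could be mirrored in PHP as a post-hoc sanity check, roughly as sketched below. The function name `inferYearForEntry`, its parameters, and the returned array shape are hypothetical and not part of this commit. The sketch only treats four-digit years as anchors and leaves out the 300-word window.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not in the commit): resolve a year-less "DD.MM." entry.
 * $entryOffset is the byte offset of the entry inside $text.
 * Returns null when no four-digit year anchor exists or the date is invalid.
 */
function inferYearForEntry(
    string $text,
    int $entryOffset,
    int $day,
    int $month,
    ?DateTimeImmutable $writtenOn = null
): ?array {
    $yearPattern = '/\b(?:19|20)\d{2}\b/';
    $year = null;

    // Step 1: nearest absolute year BEFORE the entry (the last match wins).
    if (preg_match_all($yearPattern, substr($text, 0, $entryOffset), $before) > 0) {
        $year = (int) end($before[0]);
    // Step 2: otherwise the nearest absolute year AFTER the entry (the first match).
    } elseif (preg_match($yearPattern, substr($text, $entryOffset), $after) === 1) {
        $year = (int) $after[0];
    }

    if ($year === null || !checkdate($month, $day, $year)) {
        return null;
    }

    $date = new DateTimeImmutable(sprintf('%04d-%02d-%02d', $year, $month, $day));

    // Step 4: a date later than the document's writing date is moved back one year.
    if ($writtenOn !== null && $date > $writtenOn) {
        $date = $date->modify('-1 year');
    }

    // Step 3: an inferred year is never better than medium confidence.
    return ['date' => $date->format('Y-m-d'), 'confidence' => 'medium'];
}
```

A check like this would only flag disagreements with the model output; it is not meant to replace the prompt rule.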
2026-05-18 07:11:31 +02:00
parent f2fbb69e0a
commit e32ee60e78
+73 -29
@@ -312,45 +312,68 @@ PROMPT;
? "\nAlso extract BACKGROUND and NARRATIVE events: dates embedded in contextual paragraphs, historical facts, year-only references, and approximate years (e.g. \"rundt 2011/2012\", \"David ble født den 30.07.2015\", \"familien i 2015\"). These are valid timeline events even when they appear in introductory or background text — do NOT skip them."
: "\nDo NOT include purely historical background or narrative context dates. Focus only on operational events, deadlines, and milestones that are directly actionable in the case.";
$relativeInstruction = $includeRelative
? ''
: "\nDo NOT extract relative, recurring, or conditional date references — extract only events with determinable absolute dates (date_type=absolute).";
$prompt = <<<PROMPT
Build a chronological timeline from the pasted text in {$locale}.
Extract ALL dates, deadlines, milestones, and temporal references.{$focusInstruction}{$backgroundInstruction}
Extract ALL dates, deadlines, milestones, and temporal references.{$focusInstruction}{$backgroundInstruction}{$relativeInstruction}
IMPORTANT — Norwegian date and time formats to recognise:
- DD.MM.YYYY (e.g. 18.09.2025 → 2025-09-18)
- DD.MM.YY (e.g. 18.09.25 = 2025-09-18, 09.04.25 = 2025-04-09)
- D.M.YY (e.g. 6.1.25 = 2025-01-06)
- DD.MM. (e.g. 18.09. — day and month without year; infer year from surrounding context)
- D.M. (e.g. 6.1. — day and month only)
- DD.MM.YYYY (e.g. 18.09.2025)
- Two-digit years: always interpret as 20YY (25 → 2025, 24 → 2024).
- "den DD. Month YYYY" (e.g. "den 18. september 2025" → 2025-09-18)
- "DD. Month YYYY" (e.g. "18. september 2025" → 2025-09-18)
- "DD. Month" (e.g. "18. september" → infer year per the rule below)
- Norwegian month names: januar=01 februar=02 mars=03 april=04 mai=05 juni=06
juli=07 august=08 september=09 oktober=10 november=11 desember=12
- DD.MM. (e.g. 18.09.) and D.M. (e.g. 6.1.) — day and month WITHOUT year:
Step 1: scan BACKWARD in the same document section for the nearest absolute year.
Step 2: if none found before, scan FORWARD for the nearest absolute year.
Step 3: use that year and set confidence=medium.
Step 4: if the resulting date would be in the future relative to the document's apparent writing date, subtract one year.
Only use "year unknown" when no year anchor exists within 300 words.
- Times: "kl. 14:30", "kl 09.00", "14:30", "14.30" → extract as "14:30" (HH:MM 24-hour).
- Diary / log format: lines that begin with a date followed by a colon or space are ALWAYS events.
Example: "18.09.25: Samtale med Davids lærer" → date 2025-09-18, event "Samtale med Davids lærer".
Example: "6.1. Samtaler med David" → date unknown-year-01-06, event "Samtaler med David".
Example: "6.1. Samtaler med David" → infer year from context, event "Samtaler med David".
Example: "18.09.25 kl. 09.00: Møte på skolen" → date 2025-09-18, time "09:00", event "Møte på skolen".
- Do NOT skip a line just because the year is ambiguous — record what you can and set confidence accordingly.
- Do NOT skip a line just because the year is ambiguous — infer from context, record it, and set confidence accordingly.
For each temporal reference provide:
- "date": ISO 8601 date (YYYY-MM-DD) if determinable, otherwise a human-readable description such as "06 Jan (year unknown)"
- "end_date": end date (YYYY-MM-DD) for date_type=period; null for all other types
- "time": time of day in HH:MM (24-hour) if present in the source text, otherwise null
- "date_type": one of absolute | relative | recurring | conditional | period
- "actor": person, institution, or party involved — or "unknown"
- "actor": person, institution, or party involved — or "unknown".
Normalize Norwegian institutional actors: Barnevernstjenesten/BV → "Barnevernstjenesten",
Fylkesnemnda → "Fylkesnemnda", Statsforvalteren/Statsforvalter → "Statsforvalteren",
Tingrett → "Tingrett", Lagmannsrett → "Lagmannsrett", Høyesterett → "Høyesterett",
NAV → "NAV", BUP → "BUP", PPT → "PPT".
- "event": concise description of what happened or is due
- "source_excerpt": the verbatim phrase from the text that grounds this event (≤ 30 words)
- "source_excerpt": the most diagnostic verbatim phrase (≤ 30 words) that directly establishes both
the date and the event — prefer the phrase that would be least ambiguous out of context
- "confidence": high | medium | low
high = explicit date + event, no ambiguity, verbatim in the text
medium = year derived from context, date approximate, or event description is paraphrased
low = no explicit date, year is unknown, or event is implied rather than stated
Sort events chronologically (absolute dates first, then relative, then recurring).
Keep uncertain dates explicit — do not invent dates not in the text.
If multiple documents are separated by "--- Document: … ---" markers, note the source document in the event description where helpful.
If the same event appears in multiple documents, create ONE entry — use the most specific date and note both sources in the event description.
Pasted text:
{$text}
Return JSON only:
{
"what_we_found": "short overview",
"events": [{"date":"...","time":"HH:MM or null","date_type":"absolute","actor":"...","event":"...","source_excerpt":"...","confidence":"high|medium|low"}],
"what_we_found": "total events found; earliest and latest dates; main actors; any notable gaps",
"events": [{"date":"...","end_date":"YYYY-MM-DD or null","time":"HH:MM or null","date_type":"absolute","actor":"...","event":"...","source_excerpt":"...","confidence":"high|medium|low"}],
"evidence_trail": [{"title":"...","excerpt":"..."}],
"what_remains_uncertain": ["..."],
"next_practical_step": "..."
@@ -362,7 +385,7 @@ PROMPT;
['role' => 'system', 'content' => $system],
['role' => 'user', 'content' => $prompt],
];
$chatOptions = ['json' => true, 'temperature' => 0.1, 'max_tokens' => 4000, 'timeout' => 120];
$chatOptions = ['json' => true, 'temperature' => 0.1, 'max_tokens' => ($engine === 'azure_full' ? 8000 : 4000), 'timeout' => 120];
$deployLabel = $this->azure->chatDeployment();
try {
@@ -519,7 +542,15 @@ PROMPT;
if (!preg_match('/^\[[A-Za-z0-9_\- ]+(?::\s*[^\]]+)?\]$/', $tag)) {
$tag = '[IDENTIFIER]';
}
if (str_contains($finalRedacted, $original)) {
// Try word-boundary match first to avoid partial-word substitutions (e.g. "Per" inside "Persson")
$escaped = preg_quote($original, '/');
$replaced = preg_replace('/\b' . $escaped . '\b/u', $tag, $finalRedacted);
if ($replaced !== null && $replaced !== $finalRedacted) {
$finalRedacted = $replaced;
$pass2Counts[$type] = ($pass2Counts[$type] ?? 0) + 1;
$applied++;
} elseif (str_contains($finalRedacted, $original)) {
// Fallback for names adjacent to punctuation or non-word characters
$finalRedacted = str_replace($original, $tag, $finalRedacted);
$pass2Counts[$type] = ($pass2Counts[$type] ?? 0) + 1;
$applied++;
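
Reviewer note: the word-boundary-then-fallback replacement in this hunk can be isolated as a small standalone function for testing. The sketch below is an assumption-laden variant, not the committed code: the name `redactWithBoundary` is hypothetical, it uses `preg_replace_callback` so the tag is always inserted literally, and it returns the actual number of replacements rather than incrementing by one.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical standalone variant of the pass-2 replacement step.
 * Returns [redacted text, number of replacements made].
 */
function redactWithBoundary(string $text, string $original, string $tag): array
{
    $pattern = '/\b' . preg_quote($original, '/') . '\b/u';
    $count = 0;

    // First attempt: whole-word matches only, so "Per" does not hit "Persson".
    // A callback is used so "$" or "\" inside $tag is never read as a backreference.
    $replaced = preg_replace_callback(
        $pattern,
        static fn (array $match): string => $tag,
        $text,
        -1,
        $count
    );
    if ($replaced !== null && $count > 0) {
        return [$replaced, $count];
    }

    // Fallback: plain substring replacement for originals that start or end
    // with punctuation, where \b cannot match.
    $replaced = str_replace($original, $tag, $text, $count);
    return [$replaced, $count];
}
```

One thing to keep in mind with this structure: if the original occurs only inside a longer word, the boundary pass finds nothing and the fallback still performs the partial substitution.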
@@ -607,7 +638,8 @@ PROMPT;
{
$locale = dbnToolsLanguageName($language);
return <<<PROMPT
You are Do Better Norge Legal Tools in a source-grounded legal preparation workflow.
You are Do Better Norge Legal Tools, a source-grounded Norwegian legal preparation assistant.
Norwegian legal context: CPS cases follow the chain Barnevernstjenesten → Fylkesnemnda → Statsforvalteren → Tingrett → Lagmannsrett → Høyesterett. Key order types: akuttvedtak (emergency removal), omsorgsvedtak (care order), samværsvedtak (contact order). Relevant bodies: BUP (child psychiatry), PPT (educational psychology), NAV (welfare).
Use the DBN legal guardrails:
- Answer only from provided source excerpts or pasted text.
- Treat your role as legal information and issue-spotting, not final legal advice.
@@ -1011,35 +1043,47 @@ PROMPT;
}
$system = <<<PROMPT
You are a privacy redaction assistant for legal documents (ECHR judgements, Norwegian family law cases, EU child welfare documents). The text below has already had mechanical identifiers replaced with placeholder tags in [BRACKETS].
You are a privacy redaction assistant for legal documents (ECHR judgements, Norwegian family law cases, EU child welfare documents). The text has already had mechanical identifiers (phone numbers, emails, national ID numbers, addresses) replaced with placeholder tags in [BRACKETS].
Your task: find any remaining identifiable information — person names, organisation names, specific places at city level or below, dates and dates of birth (including soft references like "i 2015", "august 2018", "rundt 2011/2012", "spring of 2019"), and identifying descriptions.
Your task: find ALL remaining identifiable information — person names, organisation names, specific places at city level or below, and dates/years that could identify when events occurred.
STEP 1 — For person names: identify each individual and infer their role or relationship from context.
Assign each person a consistent contextual tag used for every occurrence of their name:
STEP 1 — Identify persons and assign consistent role tags.
Infer each person's role from context and assign a tag used for EVERY occurrence of their name:
• Family roles: FATHER, MOTHER, CHILD, CHILD_1, CHILD_2, GRANDPARENT, SIBLING
• Professional roles: ATTORNEY, JUDGE, CASEWORKER, EXPERT_WITNESS
• Generic fallback: PERSON_1, PERSON_2 (use only when role cannot be determined)
The same individual MUST receive the same tag every time they appear.{$aliasBlock}{$exemptBlock}{$officialsNote}{$skipNote}{$allowedTypesNote}
• Generic fallback: PERSON_1, PERSON_2 (only when role is unclear from context)
The same individual MUST receive the same tag every time they appear.{$aliasBlock}{$exemptBlock}{$officialsNote}
STEP 2 — Name variants: for each person, add a SEPARATE entry for every distinct textual form their name takes in the document. All variants of the same person receive the SAME tag.
Example: if "Per Hansen" also appears as "Per" alone and "Hansen" alone, return three entries: "Per Hansen", "Per", "Hansen" — all tagged [FATHER] (or whichever role applies).
Skip a short form only if it is also a common Norwegian or English word used in a clearly different sense elsewhere in the text.{$skipNote}{$allowedTypesNote}
Return ONLY a valid JSON object:
{"redactions":[{"original":"exact text as it appears","type":"person_name","tag":"[FATHER]"}]}
{"redactions":[{"original":"exact text as it appears in input","type":"person_name","tag":"[FATHER]"}]}
Allowed types and their tag format:
person_name → contextual role tag e.g. [FATHER], [CHILD_1], [ATTORNEY] (or alias tag if provided above)
person_name → contextual role tag e.g. [FATHER], [CHILD_1], [ATTORNEY] (or alias tag if overridden above)
org → [ORG]
place → [PLACE]
place → [PLACE] (city, town, neighbourhood, named location — NOT country names)
date_of_birth → [DOB]
date → [DATE] (years, year ranges, month+year, soft temporal references — e.g. "i 2015" → "i [DATE]", "rundt 2011/2012" → "rundt [DATE]")
date → [DATE] (standalone years, year ranges, month+year references that could identify events)
CRITICAL: "original" must be the date token ONLY — never include surrounding prepositions.
✓ text "i 2015" → original:"2015", tag:"[DATE]"
✓ text "rundt 2011/2012" → original:"2011/2012", tag:"[DATE]"
✓ text "august 2018" → original:"august 2018", tag:"[DATE]"
✓ text "spring of 2019" → original:"spring of 2019", tag:"[DATE]"
✗ WRONG: original:"i 2015" — preposition included, do NOT do this
other → [IDENTIFIER]
Rules:
Include only text that appears verbatim in the input. Do not invent or paraphrase.
The same person MUST get the same tag every time they appear.
If nothing needs redacting, return {"redactions":[]}.
Do not redact text already inside [BRACKETS].
Legal citations, statute names, article numbers, and institution names (e.g. "the European Court of Human Rights", "Barnevernloven § 4-12") are NOT PII.
Short common words, conjunctions, and prepositions are NOT PII.{$languageNote}
"original" must be verbatim text from the input — exact case, no paraphrasing or alterations.
Do not return entries for text already inside [BRACKETS].
The same person MUST get the same tag in every entry.
If nothing remains to redact, return {"redactions":[]}.
NOT PII: legal citations, statute names, article numbers (e.g. "Barnevernloven § 4-12", "Article 8 ECHR").
NOT PII: national institution names ("Barnevernet", "Fylkesnemnda", "Oslo tingrett", "the Court").
• NOT PII: country names. City districts and named locations ARE PII.
• NOT PII: short common words, conjunctions, prepositions.{$languageNote}
PROMPT;
$messages = [