feat(transcribe): GPT cleanup pass + advanced options i18n

Adds optional post-transcription cleanup via GPT-4o/GPT-4o-mini to fix
mishearing errors, punctuation, and domain terms. Speaker role labelling
now accepts a deployment param. Adds i18n strings for advanced options
panel (task, VAD filter, Whisper model, AI cleanup) in all four languages.
Updates BvjAnalyzerAgent and DeepResearchAgent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 07:23:01 +02:00
parent e32ee60e78
commit c4362738c1
5 changed files with 345 additions and 112 deletions
+131 -62
@@ -493,7 +493,7 @@ PROMPT;
  private function extractParties(string $docText, string $language): array
  {
  $locale = dbnToolsLanguageName($language);
- $excerpt = mb_substr($docText, 0, 12000, 'UTF-8');
+ $excerpt = mb_substr($docText, 0, 20000, 'UTF-8');
  $prompt = <<<PROMPT
  You are analysing a Norwegian child welfare (Barnevernet) document.
@@ -502,15 +502,16 @@ Identify ALL named parties — every person or institution referred to by name o
  Respond in {$locale}. Return a JSON object with a single key "parties" containing an array of objects.
  Each object must have these four fields:
  - "name": full name or institution name (string)
- - "role": their role in the case, e.g. Biological mother, Child, Barnevernarbeider, Saksbehandler, Melder, Politi, Lege, Advokat, Foster carer, Rusklinikk
+ - "role": their role in the case, e.g. Biological mother, Biological father, Child, Barnevernarbeider, Saksbehandler, Leder, Melder, Politi, Lege, Psykolog, Advokat, Talsperson for barnet, Tilsynsfører, Sakkyndig, Foster carer (fosterforelder), Rusklinikk, Statsforvalter
  - "organization": employer or institution if mentioned, otherwise null
- - "relationship_to_child": relationship to the child in the document, e.g. Mother, Father, Caseworker, Melder, or null
+ - "relationship_to_child": relationship to the child in the document, e.g. Mother, Father, Sibling, Caseworker, Melder, Supervisor, or null
  Rules:
  - Include every named person and named institution — even peripheral ones.
+ - Include Barnevernvakta (bvv) as an institution even if no individual caseworkers are named.
  - If a name appears to be redacted or anonymised (e.g. "mor", "far", "barnet", initials like "A.B."), include them with role inferred from context.
  - Do not invent parties not present in the text.
- - Maximum 20 parties.
+ - Maximum 25 parties.
  Document text:
  {$excerpt}
@@ -520,14 +521,14 @@ PROMPT;
  $raw = $this->azure->chatText([
  ['role' => 'system', 'content' => 'You return valid JSON only. No markdown fences.'],
  ['role' => 'user', 'content' => $prompt],
- ], ['json' => true, 'temperature' => 0.05, 'max_tokens' => 1500, 'timeout' => 40]);
+ ], ['json' => true, 'temperature' => 0.05, 'max_tokens' => 2000, 'timeout' => 45]);
  $json = $this->azure->decodeJsonObject($raw);
  if (is_array($json) && is_array($json['parties'] ?? null)) {
- return array_slice($json['parties'], 0, 20);
+ return array_slice($json['parties'], 0, 25);
  }
  // Fallback: model returned an array at root level instead of {parties:[...]}
  if (is_array($json) && isset($json[0]['name'])) {
- return array_slice($json, 0, 20);
+ return array_slice($json, 0, 25);
  }
  error_log('BVJ extractParties unexpected structure: ' . substr($raw, 0, 300));
  } catch (Throwable $e) {
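A minimal sketch, not part of this commit, of the response handling in the hunk above: it accepts either the `{"parties": [...]}` object or a bare root-level list and applies one shared cap. The helper name `normalizeParties` and its `$limit` parameter are hypothetical; the commit performs these checks inline in `extractParties`.

```php
<?php
/**
 * Hypothetical helper (illustration only, not in the commit).
 * Accepts the decoded model response and returns at most $limit parties.
 */
function normalizeParties(mixed $json, int $limit = 25): array
{
    if (!is_array($json)) {
        return [];
    }
    // Preferred shape: an object with a "parties" list.
    if (is_array($json['parties'] ?? null)) {
        return array_slice($json['parties'], 0, $limit);
    }
    // Fallback shape: the list sits at the root of the response.
    if (isset($json[0]['name'])) {
        return array_slice($json, 0, $limit);
    }
    // Anything else is treated as "no parties found".
    return [];
}
```

Keeping the cap in one parameter would avoid having to update the literal `25` in two places the next time the limit changes.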
@@ -541,7 +542,7 @@ PROMPT;
  private function extractTimeline(string $docText, string $language): array
  {
  $locale = dbnToolsLanguageName($language);
- $excerpt = mb_substr($docText, 0, 12000, 'UTF-8');
+ $excerpt = mb_substr($docText, 0, 20000, 'UTF-8');
  $prompt = <<<PROMPT
  Build a chronological timeline from this Norwegian child welfare (Barnevernet) document in {$locale}.
@@ -557,14 +558,24 @@ IMPORTANT — Norwegian date and time formats to recognise:
  - Diary/log format: lines beginning with a date or time are always events.
  - Two-digit years: interpret as 20YY (20 → 2020, 21 → 2021).
+ Barnevernet-specific events that are ALWAYS high significance:
+ - Akuttvedtak (emergency placement) under §4-6 or §4-25
+ - Omsorgsovertakelse (care order) under §4-12
+ - Police involvement or assistance (politibistand)
+ - Formal decision (vedtak) or court order (kjennelse)
+ - Deadline breaches: bekymringsmelding not processed within 7 days; investigation not opened within 6 weeks
+ - Forhandlingsmøte (negotiation hearing) or Fylkesnemnda hearing
+ - Supervised contact visits (samvær) being reduced or denied
+ - Placement in foster care or institution (fosterhjem, institusjon)
  For each event provide:
  - "date": ISO 8601 date (YYYY-MM-DD) if determinable, otherwise best-effort description
  - "time_of_day": HH:MM if present, otherwise null
  - "actor": person, institution, or party involved
  - "action": concise description (≤ 80 chars) of what happened
- - "significance": high (acute measure, removal, police involvement, formal decision) | medium (home visit, phone call, meeting) | low (minor update, note)
+ - "significance": high (acute measure, removal, police involvement, formal decision, statutory deadline breach) | medium (home visit, phone call, meeting, assessment) | low (minor update, note)
- Sort chronologically. Maximum 30 events.
+ Sort chronologically. Maximum 40 events.
  Document text:
  {$excerpt}
@@ -579,10 +590,10 @@ PROMPT;
  $raw = $this->azure->chatText([
  ['role' => 'system', 'content' => 'You return valid JSON only. No markdown fences.'],
  ['role' => 'user', 'content' => $prompt],
- ], ['json' => true, 'temperature' => 0.05, 'max_tokens' => 3000, 'timeout' => 45]);
+ ], ['json' => true, 'temperature' => 0.05, 'max_tokens' => 4000, 'timeout' => 55]);
  $json = $this->azure->decodeJsonObject($raw);
  if (is_array($json) && is_array($json['events'] ?? null)) {
- return array_slice($json['events'], 0, 30);
+ return array_slice($json['events'], 0, 40);
  }
  } catch (Throwable $e) {
  error_log('BVJ extractTimeline failed: ' . $e->getMessage());
@@ -600,52 +611,84 @@ PROMPT;
  int $count,
  string $language
  ): array {
- $locale = dbnToolsLanguageName($language);
- $docType = $docMeta['doc_type'] ?? 'BVJ document';
- $roleStr = $advocateRole !== '' ? $advocateRole : 'the affected party';
+ $locale = dbnToolsLanguageName($language);
+ $docType = $docMeta['doc_type'] ?? 'BVJ document';
+ $docDate = $docMeta['doc_date'] ?? 'unknown date';
+ $authority = $docMeta['issuing_authority'] ?? 'the municipality';
+ $roleStr = $advocateRole !== '' ? $advocateRole : 'the affected party';
- // Summarise the top events to give the model context
+ // Summarise high-significance events first, then others
+ $highEvents = array_values(array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') === 'high'));
+ $otherEvents = array_values(array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') !== 'high'));
+ $topEvents = array_slice(array_merge($highEvents, $otherEvents), 0, 12);
  $eventSummary = '';
- $highEvents = array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') === 'high');
- $topEvents = array_slice(array_merge(array_values($highEvents),
- array_values(array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') !== 'high'))), 0, 8);
  foreach ($topEvents as $ev) {
- $eventSummary .= sprintf("- %s: %s (%s)\n", $ev['date'] ?? '?', $ev['action'] ?? '', $ev['actor'] ?? '');
+ $sig = ($ev['significance'] ?? 'low') === 'high' ? '[HIGH] ' : '';
+ $eventSummary .= sprintf("- %s %s%s (%s)\n",
+ $ev['date'] ?? '?', $sig, $ev['action'] ?? '', $ev['actor'] ?? '');
  }
  // Summarise parties
  $partyList = '';
- foreach (array_slice($parties, 0, 8) as $p) {
- $partyList .= sprintf("- %s (%s)\n", $p['name'] ?? '', $p['role'] ?? '');
+ foreach (array_slice($parties, 0, 10) as $p) {
+ $org = !empty($p['organization']) ? ' at ' . $p['organization'] : '';
+ $partyList .= sprintf("- %s (%s%s)\n", $p['name'] ?? '?', $p['role'] ?? '?', $org);
  }
+ $angleGuidance = match (true) {
+ $count >= 5 => <<<ANGLES
+ Cover these five distinct legal angles (one per question):
+ 1. Statutory rights and obligations under Barnevernloven (e.g. §4-2, §4-6, §4-12) specific to the measures taken
+ 2. ECHR Article 8 proportionality and procedural safeguards; cite the specific measures and dates from this case
+ 3. Procedural obligations BVV must fulfil (advance notice, documentation, hearing rights); anchor to documented events
+ 4. Bufdir/Statsforvalter guidance on investigation standards and thresholds for intervention
+ 5. Norwegian appellate court decisions on comparable measures and family circumstances
+ ANGLES,
+ $count === 4 => <<<ANGLES
+ Cover these four distinct legal angles (one per question):
+ 1. Statutory rights under Barnevernloven anchored to the specific measures and dates in this case
+ 2. ECHR Article 8 proportionality of the specific intervention and any procedural violations
+ 3. BVV's procedural obligations — documentation, notice, and hearing rights — as evidenced by the timeline
+ 4. Bufdir guidance and Norwegian court decisions on comparable fact patterns
+ ANGLES,
+ default => <<<ANGLES
+ Cover three distinct legal angles (one per question):
+ 1. Statutory rights under Barnevernloven for the specific type of measure documented
+ 2. ECHR Article 8 proportionality and procedural safeguards
+ 3. BVV's procedural obligations and whether the documented timeline shows any breach
+ ANGLES,
+ };
  $prompt = <<<PROMPT
  You are a Norwegian family-law research assistant building a case for: {$roleStr}.
- A {$docType} has been uploaded. Key events:
+ Case facts extracted from the uploaded document:
+ - Document type: {$docType}
+ - Date: {$docDate}
+ - Issuing authority: {$authority}
+ - Key events (chronological):
  {$eventSummary}
- Key parties:
+ - Key parties:
  {$partyList}
- Generate exactly {$count} targeted sub-questions to research the legal corpus for arguments that SUPPORT {$roleStr}'s position. Each question should explore a different angle:
- 1. Statutory rights and obligations (Barnevernloven, Barneloven)
- 2. ECHR Article 8 and 9 precedents vs Norway
- 3. Procedural requirements BVV must follow (notice, documentation, proportionality)
- 4. Bufdir guidance on case handling standards
- 5. Norwegian court decisions on similar fact patterns
+ Generate exactly {$count} sub-questions to search the Norwegian legal corpus for arguments that SUPPORT {$roleStr}'s position.
+ {$angleGuidance}
+ CRITICAL: Every question MUST embed specific facts from this case — use the actual authority name, document date, type of measure, and parties where relevant. Generic questions ("What are parental rights?") are useless for retrieval. Specific questions ("What notice requirements must {$authority} meet before issuing an emergency placement under Barnevernloven §4-6?") are highly effective.
  Return JSON only in {$locale}:
  {
  "sub_questions": [
- {"id":"q1","question":"...","rationale":"how this angle strengthens {$roleStr}'s position (≤ 120 chars)"}
+ {"id":"q1","question":"...","rationale":"why this angle strengthens {$roleStr}'s position (≤ 120 chars)"}
  ]
  }
  Rules:
- - Exactly {$count} sub-questions, no more no fewer.
- - Every question must be answerable from Norwegian family-law, child-welfare, or ECHR sources.
- - Each question must cover a DIFFERENT legal angle.
- - Questions must be self-contained without needing the raw document.
+ - Exactly {$count} sub-questions.
+ - Each question targets a DIFFERENT legal angle.
+ - Include specific case details (authority, date, measure type) in each question.
+ - Questions must be self-contained and answerable from Norwegian family-law, child-welfare, or ECHR sources.
  - Respond in {$locale}.
  PROMPT;
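The "high-significance first, then the rest, then cap" ordering appears twice in this commit with different limits (12 for sub-question planning, 20 for the analysis summary). A minimal sketch of how it could be factored out follows; the function name `prioritiseEvents` is hypothetical and not part of the commit, and it assumes every event is an array with an optional `significance` key.

```php
<?php
/**
 * Hypothetical helper (illustration only, not in the commit).
 * Returns up to $limit events, all 'high' events first, original order kept within each group.
 */
function prioritiseEvents(array $events, int $limit): array
{
    // An event counts as high only when its significance is exactly 'high'.
    $isHigh = static fn(array $e): bool => ($e['significance'] ?? '') === 'high';

    $high  = array_values(array_filter($events, $isHigh));
    $other = array_values(array_filter($events, static fn(array $e): bool => !$isHigh($e)));

    return array_slice(array_merge($high, $other), 0, $limit);
}
```

With such a helper the two call sites would differ only in the limit they pass, for example `prioritiseEvents($timelineEvents, 12)`.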
@@ -734,16 +777,16 @@ PROMPT;
  // Build parties summary (top 8)
  $partiesSummary = '';
- foreach (array_slice($parties, 0, 8) as $i => $p) {
+ foreach (array_slice($parties, 0, 12) as $i => $p) {
  $org = $p['organization'] ? ' (' . $p['organization'] . ')' : '';
  $rel = $p['relationship_to_child'] ? ' — rel: ' . $p['relationship_to_child'] : '';
  $partiesSummary .= sprintf("%d. %s — %s%s%s\n", $i + 1, $p['name'] ?? '', $p['role'] ?? '', $org, $rel);
  }
- // Build timeline summary (top 15 most significant events)
+ // Build timeline summary (top 20 most significant events)
  $highEvents = array_values(array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') === 'high'));
  $otherEvents = array_values(array_filter($timelineEvents, fn($e) => ($e['significance'] ?? '') !== 'high'));
- $topEvents = array_slice(array_merge($highEvents, $otherEvents), 0, 15);
+ $topEvents = array_slice(array_merge($highEvents, $otherEvents), 0, 20);
  $timelineSummary = '';
  foreach ($topEvents as $ev) {
  $time = $ev['time_of_day'] ? ' kl.' . $ev['time_of_day'] : '';
@@ -783,14 +826,17 @@ PROMPT;
  ? "\n== ADDITIONAL CONTEXT FROM ADVOCATE ==\n{$additionalNotes}\n"
  : '';
- $docExcerpt = mb_substr($docText, 0, 3000, 'UTF-8');
+ $docExcerpt = mb_substr($docText, 0, 8000, 'UTF-8');
  $prompt = <<<PROMPT
- You are Do Better Norge Legal Tools producing a structured Barnevernet case analysis brief.
- You are representing: {$roleStr}
+ You are Do Better Norge Legal Tools. Produce a structured Barnevernet case analysis for: {$roleStr}.
+ HALLUCINATION RULES — READ FIRST:
+ - You may ONLY cite statute sections (§), ECHR article numbers, ECHR application numbers, case names, and Bufdir/Statsforvalter circular references that appear verbatim in the numbered corpus sources below.
+ - Do NOT cite statute sections, case names, or ECHR applications from your training memory — they may be misremembered or no longer in force.
+ - If no source supports a claim, omit the claim rather than invent support.
+ - Every factual legal claim in advocacy_brief MUST end with at least one [n] or [DOC] citation. Unsupported claims are a liability for the client.
- Ground every claim in the numbered corpus sources below using [n] markers, OR in the uploaded document using [DOC].
- Do NOT invent statutes, paragraph numbers, case names, ECHR applications, dates, or parties.
  Return valid JSON only. No markdown fences.
  == DOCUMENT METADATA ==
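The citation rules in this prompt are enforced only by instruction; nothing in the diff shown here checks the model's output. As a hedged sketch, a post-hoc check could flag `[n]` markers that point outside the numbered corpus. The function name `findOutOfRangeCitations` is hypothetical, and the sketch assumes sources are numbered 1 to `$sourceCount`, which the diff does not confirm.

```php
<?php
/**
 * Hypothetical validator (illustration only, not in the commit).
 * Returns the distinct [n] citation numbers in $brief that fall outside 1..$sourceCount.
 * [DOC] markers are ignored because they are not numeric.
 */
function findOutOfRangeCitations(string $brief, int $sourceCount): array
{
    // Collect every purely numeric marker such as [3] or [12].
    preg_match_all('/\[(\d+)\]/', $brief, $matches);

    $invalid = [];
    foreach ($matches[1] as $raw) {
        $n = (int) $raw;
        if ($n < 1 || $n > $sourceCount) {
            $invalid[] = $n;
        }
    }
    // Report each offending number once, in order of first appearance.
    return array_values(array_unique($invalid));
}
```

A non-empty result would be a signal to log the brief or retry the request; it does not prove that the in-range citations actually support the claims they follow.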
@@ -805,51 +851,74 @@ Child: {$childInfo}
  == TIMELINE (from document) ==
  {$timelineSummary}
- == CORPUS SOURCES ({$sourceCount} numbered) ==
+ == CORPUS SOURCES ({$sourceCount} numbered — cite as [n]) ==
  {$sourcesText}
  {$notesSection}
  {$subQText}
- == DOCUMENT EXCERPT (first 3000 chars — use [DOC] to cite) ==
+ == DOCUMENT EXCERPT (first 8000 chars — cite as [DOC]) ==
  {$docExcerpt}
- Return JSON in {$locale}:
+ == ADVOCACY BRIEF FORMAT ==
+ Write the advocacy_brief as a Markdown document with these sections:
+ ## Case Overview
+ Summarise what happened: document type, issuing authority, key events from the timeline. Every factual statement must cite [DOC].
+ ## {$roleStr}'s Core Legal Position
+ The strongest statutory and ECHR arguments in favour of {$roleStr}. Cite [n] for each legal point. Only cite statutes and cases that appear in the corpus sources above.
+ ## Procedural Compliance Issues
+ Where BVV/the authority may have failed their own procedural obligations. Ground each point in a specific documented action from [DOC] and the applicable statute or guidance from [n].
+ ## Client Strengths
+ 3-6 factual and legal advantages for {$roleStr}, each anchored with [n] or [DOC].
+ ## Counter-Arguments and Responses
+ The most likely opposing arguments and how to rebut them. Cite [n] for rebuttal sources.
+ ## Recommended Next Steps
+ 2-4 concrete legal actions {$roleStr} should take now.
+ End with one line: "*This brief is AI-assisted and for discussion purposes only — verify all legal references with a qualified Norwegian family-law lawyer.*"
+ Target length: 600-1000 words.
+ == JSON OUTPUT ==
  {
- "advocacy_brief": "Partisan legal brief in Markdown. Structure:\n## Case Overview\n(What happened according to [DOC] — doc type, authority, key events)\n\n## {$roleStr}'s Core Legal Position\n(Strongest statutory and ECHR arguments — cite [n] and [DOC])\n\n## Procedural Compliance Issues\n(Where BVV may have failed their own procedural obligations — cite [DOC][n])\n\n## Client Strengths\n(Factual and legal advantages for {$roleStr} — cite [n][DOC])\n\n## Counter-Arguments and Responses\n(Likely opposing arguments and how to rebut — cite [n])\n\n## Recommended Next Steps\n(Concrete legal actions)\n\nEnd with a one-line disclaimer. Length: 500-1000 words.",
+ "advocacy_brief": "<the Markdown brief following the format above>",
  "procedural_red_flags": [
  {
  "description": "Concise description of the potential procedural violation",
- "legal_basis": "Statute or ECHR article potentially violated, e.g. Barnevernloven §6-1, ECHR Art.8",
- "severity": "high",
+ "legal_basis": "Statute or ECHR article from a corpus source — e.g. Barnevernloven §4-2 [3]",
+ "severity": "high|medium|low",
  "source_refs": ["[n]", "[DOC]"],
- "what_to_check": "Specific document text or action requiring legal verification"
+ "what_to_check": "Exact document text or action to verify with a lawyer"
  }
  ],
- "client_strengths": ["3-6 items anchored with [n] or [DOC]"],
- "opposing_weaknesses": ["2-5 vulnerabilities in BVV or opposing party position — omit if unsupported by sources"],
- "what_we_found": "2-sentence plain-language summary of the most critical finding",
- "what_remains_uncertain": ["3-5 specific gaps — missing information, unclear authority, conflicting sources"],
- "next_practical_step": "The single most important concrete legal action for {$roleStr}"
+ "client_strengths": ["3-6 items, each ending with [n] or [DOC]"],
+ "opposing_weaknesses": ["2-5 documented vulnerabilities in BVV or opposing position — OMIT if not supported by at least one [n]"],
+ "what_we_found": "2-sentence plain-language summary of the single most critical finding",
+ "what_remains_uncertain": ["3-5 specific information gaps or legal questions that need clarification"],
+ "next_practical_step": "The single most important concrete legal action for {$roleStr} to take within the next 7 days"
  }
  Rules:
  - Every factual claim in advocacy_brief must end with [n] or [DOC].
- - procedural_red_flags must be grounded in documented BVV actions — no speculation.
- - severity: high = likely violation of a codified right; medium = procedural irregularity; low = best-practice gap.
- - If no corpus source supports a claimed weakness, omit it from opposing_weaknesses.
  - Cite statute sections and ECHR articles as they appear in the corpus excerpts.
+ - severity: high = likely violation of a codified statutory right or ECHR guarantee; medium = procedural irregularity; low = best-practice gap only.
+ - procedural_red_flags must be grounded in documented BVV actions visible in [DOC] or the timeline.
+ - If fewer than 2 corpus sources support opposing_weaknesses, return an empty array.
  - Respond in {$locale}.
  PROMPT;
- $sysPrompt = 'You return valid JSON only. No markdown fences.';
+ $sysPrompt = 'You return valid JSON only. No markdown fences. Every legal citation must come from the provided corpus sources, not from training memory.';
  $messages = [
  ['role' => 'system', 'content' => $sysPrompt],
  ['role' => 'user', 'content' => $prompt],
  ];
- $opts = ['json' => true, 'temperature' => $temperature, 'max_tokens' => 3000, 'timeout' => 200];
+ $opts = ['json' => true, 'temperature' => $temperature, 'max_tokens' => 4500, 'timeout' => 240];
  $deployLabel = match ($engine) {
  'gpu' => 'GPU (cuttlefish)',