Adds /corpus.php — a data transparency page showing what powers the
legal tools: 9 coverage categories with live doc counts, a full
sources table pulled from the corpus DB, the AI stack (LLMs, Whisper,
Qdrant, Azure AI Search, embeddings, chunking), and a pipeline flow
diagram. Stats are live via a new /api/corpus-stats.php endpoint
(queries dobetter_rag + bnl_admin). The reasoning sidebar is repurposed
as a Corpus health panel on this page.
Also ships the in-progress timeline background events toggle:
API and UI wired together via include_background param.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AiGateway uses getenv('LITELLM_MASTER_KEY') + stream_context HTTP, which was
failing on the chloe virtualhost process. New dbnToolsLiteLLMEmbedBatch()
helper mirrors dbnToolsCallGpuLlm — hardcoded URL + key, cURL-first, same
pattern already proven for LLM calls. Removes AiGateway dependency from
DeepResearchAgent entirely.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three user-flagged issues after the first real run with a 920KB sakkyndig PDF:
1. dobetternorge.no marketing-website chunks leaked into the retrieval pool.
ClientRagPipeline::searchAll defaults include_beta_website=true; we now
pass false for both website flags, AND defensively drop any returned
chunk whose source_name contains "website" or title contains
"dobetternorge.no" before it can pollute synthesis.
2. Brief returned was "just a paragraph". Bumped synthesis max_tokens
2200→3200, raised timeout 120→180s, and rewrote the prompt to require
400-900 words with min 4 paragraphs when source_count>=3, covering EACH
sub-question in its own paragraph. Now also passes authority + jurisdiction
into the sources block so the model can pinpoint statutes correctly.
3. No way to see what each "sub-question agent" researched or click through
to the source articles. Restructured the results panel so per-sub-question
report cards now render ABOVE the synthesised brief. Each report shows the
question, the rationale, and the top 3 retrieved sources for that sub-Q
with title→deep link + 1-line excerpt. Brief follows. Consolidated
numbered sources list at the bottom, with titles as deep links too.
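
A minimal sketch of the defensive filter from item 1, written in JavaScript for
illustration only (the real filter lives in the PHP agent). The function names
and the case-insensitive matching are assumptions; only the source_name and
title fields come from the description above.

```javascript
// Hypothetical chunk shape: { source_name: string, title: string, ... }.
// Case-insensitive matching is an assumption, not confirmed by the commit.
function isWebsiteChunk(chunk) {
  const sourceName = String(chunk.source_name || '').toLowerCase();
  const title = String(chunk.title || '').toLowerCase();
  return sourceName.includes('website') || title.includes('dobetternorge.no');
}

// Returns the surviving chunks plus a count of what was dropped, so the
// number can be reported alongside other retrieval telemetry.
function dropWebsiteChunks(chunks) {
  const kept = chunks.filter((chunk) => !isWebsiteChunk(chunk));
  return { kept, filteredWebsite: chunks.length - kept.length };
}
```
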
Deep-link construction: source_url is hydrated via dbnV6QueryDocumentMeta
in a single batched call after retrieval. For Lovdata sources with a
section_title containing §<n>, the link is path-anchored to that section
(/§43). For other hosts (HUDOC, Regjeringen, Bufdir, etc.) we link to the
document root URL.
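
A rough sketch of that link rule, in JavaScript for illustration only.
buildDeepLink is a hypothetical name, the lovdata.no host check is an
assumption, and accepting section numbers such as 4-12 or 43a is a guess at
what real section titles contain.

```javascript
// Hypothetical helper: sourceUrl and sectionTitle mirror the fields named
// in the commit message; everything else is an assumption.
function buildDeepLink(sourceUrl, sectionTitle) {
  if (!sourceUrl) return null;
  let host;
  try {
    host = new URL(sourceUrl).hostname;
  } catch {
    return sourceUrl; // not a parseable absolute URL, leave it untouched
  }
  const isLovdata = host === 'lovdata.no' || host.endsWith('.lovdata.no');
  // Capture the section number after the § sign, e.g. "§ 43" or "§4-12".
  const match = /§\s*(\d+(?:-\d+)?[a-z]?)/i.exec(sectionTitle || '');
  if (!isLovdata || !match) return sourceUrl; // other hosts: document root
  return sourceUrl.replace(/\/+$/, '') + '/§' + match[1];
}
```
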
Telemetry: trace_metadata now carries retrieval_counts {raw_corpus,
filtered_website, post_filter_corpus, raw_upload, after_dedupe, after_topk}
so future regressions are diagnosable from the metadata.jsonl log alone.
The completion status pill surfaces the corpus/website/upload split.
Previously the endpoint returned a single JSON object at the end. Apache+
PHP-FPM buffers the entire body until PHP exits, so a 160s azure_full run
caused the browser to drop the fetch as "Failed to fetch" while the server
was still synthesising; the response then arrived at a dead socket.
Switch to application/x-ndjson with one event per line. The endpoint emits
'progress', 'start', 'step' (running/complete/warning/error), 'subq', and a
final 'final' event carrying the full result payload. Output buffering is
explicitly disabled so each line flushes through Apache as soon as the
agent emits it.
DbnDeepResearchAgent::run() now accepts an optional ?callable $emit and
fires step:running before each step + step:complete after, plus a subq
event per sub-question retrieval round.
JS reads response.body as a stream, splits on newlines, updates the
trace panel live, and renders the final result when the final event
arrives. Status pill shows live progress detail (e.g. "Synthesising with
Azure gpt-4o — this is the slowest step…").
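
A minimal sketch of the client-side NDJSON reading described above. The names
createNdjsonParser and readNdjson are hypothetical, not the ones in tools.js,
and the per-event field names are not specified here. The key point is that
a network chunk can end mid-line, so the tail is buffered until its newline
arrives.

```javascript
// Line-buffering parser: feed it decoded text chunks, get one callback
// per complete JSON line.
function createNdjsonParser(onEvent) {
  let buffer = '';
  return {
    push(text) {
      buffer += text;
      const lines = buffer.split('\n');
      buffer = lines.pop(); // last element is an incomplete line (or '')
      for (const line of lines) {
        const trimmed = line.trim();
        if (trimmed !== '') onEvent(JSON.parse(trimmed));
      }
    },
    end() {
      // Flush a final line that arrived without a trailing newline.
      const trimmed = buffer.trim();
      buffer = '';
      if (trimmed !== '') onEvent(JSON.parse(trimmed));
    },
  };
}

// Drives the parser from a fetch() Response body.
async function readNdjson(response, onEvent) {
  const parser = createNdjsonParser(onEvent);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    parser.push(decoder.decode(value, { stream: true }));
  }
  parser.push(decoder.decode()); // flush any buffered multi-byte sequence
  parser.end();
}
```
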
Engine row in the form now shows expected duration per engine
(~15-45s mini, ~60-180s full, ~30-90s GPU) so users know what they're in
for before clicking Run.
New surface at /deep-research.php where the user pastes a question or
uploads PDF/DOCX/TXT case files and an LLM-orchestrated agent researches
the Do Better Norge legal corpus from 3-5 angles, with hybrid retrieval,
cross-encoder rerank, and synthesis that emits an inline-[n]-cited
markdown brief plus a numbered sources panel.
Uploaded documents are chunked + embedded in memory only (nomic-embed-text
via LiteLLM) and searched alongside the shared corpus during the same
request — never persisted to disk, DB, or Qdrant.
Reuses ClientRagPipeline::searchAll (hybrid + rerank), dbnV6 slice
helpers, and the existing extract.php text-extraction logic via a new
dbnToolsExtractUploadedFile() helper. Also adds dbnToolsCallGpuLlm()
helper in bootstrap.php — fixes a latent bug where LegalTools.php
was already calling that name with no definition.
Search.php is unchanged.
The .env default DBN_AZURE_OPENAI_CHAT_DEPLOYMENT is gpt-4o, so the
azure_mini branch (which just called ->chat() without withDeployment)
was silently hitting gpt-4o too. Both UI engine options resolved to
the same model, and timed out together on long Norwegian documents.
Fix: explicitly route azure_mini → gpt-4o-mini in both timeline and
redact paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Timeline was using no explicit timeout, falling back to the gateway's
45s default, which timed out on long Norwegian legal documents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Nordic-pack regex patterns for:
- DD.MM.YYYY / DD/MM/YYYY / YYYY-MM-DD
- Year ranges (2011/2012, 2018-2019)
- Month + year (Norwegian + English, with optional day)
- Year preceded by temporal preposition (i 2015, fra 2019, rundt 2018)
Also renames the entity toggle from "Dates of birth" to "Dates" (broader
scope) in all four languages, and expands the LLM prompt so soft date
references in free text are caught even when regex misses them.
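
An illustrative sketch of the four date shapes listed above, in JavaScript
rather than the PHP of the real Nordic pack. These regexes, the month list,
the [DATE] tag, and the redactDates name are assumptions for demonstration,
not the shipped patterns.

```javascript
const MONTHS =
  'januar|februar|mars|april|mai|juni|juli|august|september|oktober|' +
  'november|desember|january|february|march|may|june|july|october|december';

// Each entry is [pattern, replacement]; applied in order.
const DATE_RULES = [
  // DD.MM.YYYY and DD/MM/YYYY
  [/\b\d{1,2}[./]\d{1,2}[./]\d{4}\b/g, '[DATE]'],
  // YYYY-MM-DD
  [/\b\d{4}-\d{2}-\d{2}\b/g, '[DATE]'],
  // Year ranges such as 2011/2012 or 2018-2019
  [/\b(?:19|20)\d{2}[/-](?:19|20)\d{2}\b/g, '[DATE]'],
  // Month + year with an optional day before or after the month name
  [
    new RegExp(
      `\\b(?:\\d{1,2}\\.?\\s+)?(?:${MONTHS})(?:\\s+\\d{1,2},?)?\\s+(?:19|20)\\d{2}\\b`,
      'gi'
    ),
    '[DATE]',
  ],
  // Year after a temporal preposition; the preposition itself is kept
  [/\b(i|fra|rundt)\s+(?:19|20)\d{2}\b/gi, '$1 [DATE]'],
];

function redactDates(text) {
  return DATE_RULES.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text
  );
}
```
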
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New api/feedback.php stores rating + correction text to tool_feedback
table in bnl_admin. renderFeedbackWidget() appended to all tool results
(timeline, redact, transcribe, ask, summarize, search). Thumbs reveal
a textarea for missed/wrong items on click; submit POSTs asynchronously.
Engine from last run is stored alongside the rating.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add DD.MM.YY, D.M., diary-line format instructions so the model doesn't
skip short Norwegian dates like 18.09.25 or 6.1. Two-digit years always
treated as 20YY. Lines starting with date+colon are always events.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 4-language switcher (EN/NO/UK/PL), engine choice (Azure mini/full,
GPU/cuttlefish), and expandable Advanced panel (Focus, Confidence filter,
Date types) to timeline.php. Wire new params through api/timeline.php and
LegalTools::timeline() with engine routing, focus-aware prompt injection,
and confidence/date-type post-filters. Add TIMELINE_I18N to tools.js with
improved renderTimeline() confidence colour-coding and new CSS classes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wrap Mode/Region/Entities/Officials/Output/Exempt/Aliases in a
<details> toggle so the form opens clean with only engine + input visible
- After redaction: Copy, Download .txt, Download .docx buttons appear
below the redacted output (all four languages translated)
- New api/redact-download.php: returns plain text or a minimal valid
DOCX built from scratch with ZipArchive (no external dependencies)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Custom inline form (EN/NO/UK/PL lang switcher) replacing generic stub
- Engine selector: Azure gpt-4o-mini (default), gpt-4o, GPU cuttlefish, regex-only
- Entity type toggles: names, organisations, places, dates of birth
- Output formats: contextual role tags, generic [PERSON], Norwegian pseudonyms
- Keep officials mode: judges/experts kept as [JUDGE: Andersen] format
- Exempt names list: specific names excluded from redaction
- Hint paragraphs explaining each option in all four languages
- Backend: engine routing, callGpuLlm(), applyGenericTags(), applyPseudonymization()
- AzureOpenAiGateway: withDeployment() clone pattern for per-call model override
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- api/transcribe.php falls back to DBN_AZURE_SPEECH_KEY/REGION env vars so BYOK is not required
- JS hides Azure key input when DBN_AZURE_SPEECH_CONFIGURED is true
- Remove Translate to English task option from Advanced settings
- Add explanatory hint text for Beam size and VAD filter in all 4 languages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Default UI language to English; lang switcher (EN/NO/UK/PL) persisted in localStorage
- Rename 'rettssak/tingrett' preset to 'Mediation / legal meeting' — court recording is illegal
- Add Ukrainian (uk) and Polish (pl) as selectable audio transcription languages
- TRANSCRIBE_I18N translation object drives all status messages, labels, and trace text
- Apache ProxyTimeout raised to 1800s on server (was 300s — caused 504 on large files)
- set_time_limit(0) + ignore_user_abort(true) in api/transcribe.php
- applyTranscribeI18n() patches data-i18n / data-i18n-placeholder / data-i18n-aria attrs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Default language → nb (Bokmål); auto-detect demoted with warning note
- Default model → large-v3; VAD filter on by default
- Vocabulary prompt promoted to main form with 4 preset buttons
(Barnerett/CPS, Rettssak/tingrett, Generell norsk, Egendefinert)
- Multi-file upload queue: drop/select multiple clips, numbered list UI
- Sequential queue processing with cumulative time_offset per clip
- Backend shifts segment timestamps so SRT/VTT covers full court day
- Merged transcript + segments across all clips for single download
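
A minimal sketch of the cumulative-offset idea behind the queue, in JavaScript
for illustration. The clip shape ({ duration, text, segments }) and the
mergeClips name are hypothetical; in the real flow the offset is sent as
time_offset and the backend does the shifting.

```javascript
// Hypothetical clip shape:
//   { duration: seconds, text: string, segments: [{ start, end, text }] }
function mergeClips(clips) {
  let offset = 0; // running total of the durations of earlier clips
  const segments = [];
  const texts = [];
  for (const clip of clips) {
    for (const seg of clip.segments) {
      // Shift each segment so timestamps continue from the previous clip.
      segments.push({ ...seg, start: seg.start + offset, end: seg.end + offset });
    }
    texts.push(clip.text);
    offset += clip.duration;
  }
  return { text: texts.join('\n\n'), segments };
}
```
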
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bootstrap.php: dbnToolsValidateSsoToken(), SSO session check in dbnToolsIsAuthenticated()
- index.php: SSO handler at top, Do Better Norge member panel in login card
- .env: DBN_SSO_SECRET placeholder
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split monolithic index.php into per-tool pages (ask, search, summarize,
timeline, redact, transcribe), each with its own URL and bookmarkable state
- Shared shell: includes/layout.php + layout_footer.php; shared form:
includes/tool_form.php used by all text-tool pages
- index.php now redirects authenticated users to ask.php; unauthenticated
users see the login gate only
- transcribe.php: engine selector (GPU/OpenAI/Azure), model size (small/
medium/large-v3), diarize, language, expert settings (beam, VAD, task,
initial prompt)
- api/transcribe.php: engine routing — GPU (cuttlefish), OpenAI BYOK,
Azure AI Speech; passes model/beam/task/vad/prompt to Whisper server
- tools.js: data-active-tool body attr drives setTool() on load; <a> nav
tabs skip click listeners; null guards on form/passcodeForm; engine radio
toggle shows/hides BYOK key inputs and model selector; RTF shown in status
- tools.css: styles for BYOK inputs, expert settings panel, prompt textarea
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Guard against INPUT clicks bubbling up to zone handler,
which caused the file picker to open twice.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The hidden textarea still had required=true, so browser-native form
validation silently blocked submit when audio was the only input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New sixth tool in the hub. Accepts MP3/WAV/OGG/M4A/FLAC/WEBM up to 200 MB,
proxies to Whisper on cuttlefish GPU. Optional speaker separation with LLM
role labelling (dommer, advokat, forelder, sakkyndig, etc. via GPT-4o-mini).
Client-side TXT / SRT / VTT download from segment data.
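
A small sketch of how an SRT file can be built from segment data in the
browser. The segment shape ({ start, end, text }, times in seconds) and the
function names are assumptions, not the code in tools.js.

```javascript
// Formats seconds as the SRT timestamp HH:MM:SS,mmm.
function formatSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Builds numbered SRT cues separated by blank lines.
function segmentsToSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join('\n');
}
```

WebVTT differs mainly in the "WEBVTT" header line and in using a dot instead
of a comma before the milliseconds.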
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract limit raised from 32K to 128K chars per file (long legal docs now fit)
- Redact API body/text limits raised (400KB / 128K chars) to match
- Upload zone accepts multiple files (up to 5); extracted text concatenated with
doc separator and combined before redaction; shows per-file char counts
- LLM redact pass now infers contextual person roles (FATHER, MOTHER, CHILD,
ATTORNEY, JUDGE, etc.) instead of generic [PERSON] for all names; same
individual gets consistent tag throughout the document
- Tag validation widened to allow any [A-Za-z0-9_- ] pattern (not just the
five hardcoded tags), supporting contextual and alias tags
- Alias UI added to Redact mode: user maps real names to bracketed aliases
(e.g. "David Jr" -> [Junior]); aliases injected into LLM system prompt as
override instructions; max 20 aliases, 100 chars each
- max_tokens raised from 2000 to 4000; timeout from 60s to 90s for larger docs
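
A sketch of the alias limits and the widened tag check, in JavaScript for
illustration only. validateAliases and the { name, tag } shape are
hypothetical, and applying the 100-character cap to both the name and the tag
is an assumption.

```javascript
const MAX_ALIASES = 20;
const MAX_ALIAS_LENGTH = 100;
// Bracketed tag made of letters, digits, underscore, hyphen, or space.
const TAG_PATTERN = /^\[[A-Za-z0-9_\- ]+\]$/;

// aliases: [{ name: 'David Jr', tag: '[Junior]' }, ...] (hypothetical shape)
function validateAliases(aliases) {
  if (aliases.length > MAX_ALIASES) {
    return { ok: false, error: 'too many aliases' };
  }
  for (const { name, tag } of aliases) {
    if (!name || name.length > MAX_ALIAS_LENGTH) {
      return { ok: false, error: 'invalid name' };
    }
    if (!tag || tag.length > MAX_ALIAS_LENGTH || !TAG_PATTERN.test(tag)) {
      return { ok: false, error: 'invalid tag' };
    }
  }
  return { ok: true };
}
```
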
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
api/extract.php — new endpoint accepting .pdf/.docx/.txt up to 4 MB;
pdftotext for PDFs, ZipArchive+DOMXPath for DOCX, mb_convert_encoding
for TXT; truncates to 32 000 chars to stay within redact limit.
index.php — drop/browse upload zone above the textarea, visible only
in Redact mode.
tools.js — setupUpload(), handleFileUpload(), resetUpload(); drag-and-drop
and file picker both call the extract endpoint then populate the textarea.
tools.css — upload zone, drag-over, file-info, clear button styles.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- index.php: public showcase landing page (hero, how-it-works, capabilities,
evidence mock, login form) visible to unauthenticated visitors; full OG/SEO
meta; app shell hidden behind auth as before
- tools.css: showcase section styles (gradient hero, step cards, capability
grid, CTA button, evidence mock, footer)
- LegalTools.php: sourceFromChunk() batch-fetches doc_summaries from RAG DB
for non-private chunks; excerpt shows doc summary when available, falls back
to raw chunk text; chunk_text field always carries the raw excerpt
- tools.js: renderEvidenceItem() shows doc summary as card body; adds a
collapsible "View chunk" toggle when summary differs from raw chunk text
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pass 1: deterministic regex with Nordic/European/ECHR/Global packs
covering fødselsnummer, Swedish personnummer, Danish/Finnish CPR,
UK NI, French INSEE, IBAN, EU phones, ECHR application numbers, DOB,
and national ID label patterns.
Pass 2: LLM semantic scan (Azure OpenAI) finds names, orgs, places
and identifying descriptions missed by regex. Runs on pre-redacted
text so no raw PII reaches the LLM.
Adds region selector (Nordic/European/ECHR/Global) to the Redact UI.
Falls back gracefully when Azure is not yet configured.
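
A simplified sketch of what a Pass 1 rule can look like, in JavaScript for
illustration. The two patterns, the tag names, and the regexPass name are
assumptions and far narrower than the real packs; no checksum validation is
attempted.

```javascript
// Each entry is [pattern, tag]; IBAN runs first so its digits are not
// mistaken for a national ID.
const PASS_ONE_RULES = [
  // IBAN: country code, two check digits, then groups of up to four chars
  [/\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b/g, '[IBAN]'],
  // Norwegian fødselsnummer: 11 digits, optionally split 6 + 5
  [/\b\d{6} ?\d{5}\b/g, '[NATIONAL_ID]'],
];

// Returns text that is safe to hand to the Pass 2 semantic scan for
// these two identifier types.
function regexPass(text) {
  return PASS_ONE_RULES.reduce((out, [pattern, tag]) => out.replace(pattern, tag), text);
}
```
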
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>