23 Commits

Author SHA1 Message Date
daveadmin d5e61d656a Fix MariaDB LIMIT/OFFSET bound-parameter error in corpus API
MariaDB rejects ? placeholders for LIMIT/OFFSET when emulate_prepares=false.
Interpolate $limit and $offset as ints directly into SQL strings in both
corpus-documents.php and corpus-search.php BM25 paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 12:31:20 +02:00
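Note on d5e61d656a: interpolating LIMIT/OFFSET into SQL is only safe if both values are coerced to bounded integers first. The commit's actual code is PHP and is not shown here. The following is a minimal JavaScript sketch of the same idea. The helper names `toBoundedInt` and `withPaging`, the page-size cap of 100, and the default of 20 are assumptions made for illustration.

```javascript
// Hypothetical helpers; names, cap, and defaults are illustrative assumptions.

// Coerce any input to an integer inside [min, max], or use the fallback.
function toBoundedInt(value, min, max, fallback) {
  const n = Number.parseInt(String(value), 10);
  if (!Number.isFinite(n)) return fallback; // non-numeric input
  return Math.min(max, Math.max(min, n));
}

// Append LIMIT/OFFSET as integer literals instead of bound placeholders.
function withPaging(sql, limit, offset) {
  const safeLimit = toBoundedInt(limit, 1, 100, 20);
  const safeOffset = toBoundedInt(offset, 0, Number.MAX_SAFE_INTEGER, 0);
  return `${sql} LIMIT ${safeLimit} OFFSET ${safeOffset}`;
}
```

Because only the parsed integer reaches the SQL string, trailing text in a request parameter is dropped before interpolation.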
daveadmin 640778454f Add Case Advocate tab — partisan brief grounded in Norwegian law
New /advocate.php tab: user selects who they represent (biological
father, mother, foster carer, CWS, etc.) and the agent takes their
side entirely. Adversarial sub-questions target supporting Lovdata
statutes + ECHR precedents; synthesis returns client_strengths[] and
opposing_weaknesses[] alongside the advocate brief.

- DeepResearchAgent: add advocateRole param to run(), interpretSeed(),
  expandQueries(), synthesise(). Neutral path unchanged (empty string).
- api/deep-research.php: extract + validate advocate_role from payload;
  telemetry logs tool='advocate' vs 'deep_research'.
- advocate.php: new page with role dropdown (presets + custom), same
  corpus slices/engine/controls/upload zone as deep research.
- assets/js/advocate.js: page-scoped JS; renders advocate banner,
  client strengths card (teal), advocate brief, opposing weaknesses
  card (amber), sub-Q cards, sources, uncertainty, next step.
- assets/css/tools.css: append .adv-* rules (~120 lines).
- includes/layout.php: add Advocate nav tab between Deep research and
  Summarize.
- index.php: add Advocate cap-card tile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 12:26:05 +02:00
daveadmin 85a6bc8134 Exclude dobetternorge.no docs from all corpus search modes
BM25: adds NOT LIKE filter to SQL WHERE in both FULLTEXT and LIKE paths.
Hybrid + Vector: post-filter hits array by source_url after results return.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 12:10:46 +02:00
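Note on 85a6bc8134: the Hybrid and Vector paths post-filter the hits array by source_url. A minimal JavaScript sketch of such a post-filter follows. The function name `excludeHost`, the choice to compare hostnames rather than substrings, and the choice to keep hits with a missing or unparseable URL are all assumptions. The commit does not say how the comparison is done.

```javascript
// Hypothetical post-filter; only the source_url field name comes from the commit.
function excludeHost(hits, blockedHost) {
  return hits.filter((hit) => {
    try {
      const host = new URL(hit.source_url).hostname.toLowerCase();
      // Drop the blocked host itself and any of its subdomains.
      return host !== blockedHost && !host.endsWith("." + blockedHost);
    } catch {
      return true; // missing or unparseable URL: keep the hit (assumption)
    }
  });
}
```

Filtering after retrieval means a page can return fewer hits than requested, so a caller may want to over-fetch slightly.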
daveadmin 38255669a9 Add corpus explorer: search bar (Hybrid/BM25/Vector), category drill-down, source row expand
- api/corpus-search.php: new endpoint with three search modes (hybrid RAG, BM25 keyword, Qdrant vector)
- api/corpus-documents.php: paginated document browser by category or source name
- corpus.php: search bar with mode+language pills, Browse docs button on each category card with drill-down panel, expand toggle on each source row showing doc count and scraper class
- tools.css: all new corpus interactive styles appended

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:55:54 +02:00
daveadmin d2f9831472 feat: Corpus Intelligence page + timeline background events
Adds /corpus.php — a data transparency page showing what powers the
legal tools: 9 coverage categories with live doc counts, a full
sources table pulled from the corpus DB, the AI stack (LLMs, Whisper,
Qdrant, Azure AI Search, embeddings, chunking), and a pipeline flow
diagram. Stats are live via a new /api/corpus-stats.php endpoint
(queries dobetter_rag + bnl_admin). The reasoning sidebar is repurposed
as a Corpus health panel on this page.

Also ships the in-progress timeline background events toggle:
API and UI wired together via include_background param.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 11:31:24 +02:00
daveadmin a1a7f442a7 Deep Research: NDJSON streaming so the connection survives long runs
Previously the endpoint returned a single JSON object at the end. Apache+
PHP-FPM buffers the entire body until PHP exits, so a 160s azure_full run
caused the browser to drop the fetch as "Failed to fetch" while the server
was still synthesising — the response then arrived to a dead socket.

Switch to application/x-ndjson with one event per line. The endpoint emits
'progress', 'start', 'step' (running/complete/warning/error), 'subq', and a
final 'final' event carrying the full result payload. Output buffering is
explicitly disabled so each line flushes through Apache as soon as the
agent emits it.

DbnDeepResearchAgent::run() now accepts an optional ?callable $emit and
fires step:running before each step + step:complete after, plus a subq
event per sub-question retrieval round.

JS reads response.body as a stream, splits on newlines, updates the
trace panel live, and renders the final result when the final event
arrives. Status pill shows live progress detail (e.g. "Synthesising with
Azure gpt-4o — this is the slowest step…").

Engine row in the form now shows expected duration per engine
(~15-45s mini, ~60-180s full, ~30-90s GPU) so users know what they're in
for before clicking Run.
2026-05-15 10:47:35 +02:00
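Note on a1a7f442a7: the client reads response.body as a stream and splits on newlines. The main pitfall is that a network chunk can end in the middle of a line, so the trailing fragment has to be buffered until the next chunk arrives. A minimal sketch of that buffering follows. The names `createNdjsonParser` and `readNdjson` are hypothetical, and the `type` field used in the usage note is an assumption, since the commit names the events but not the JSON key that carries them.

```javascript
// Hypothetical line-buffering NDJSON parser; not the repo's actual code.
function createNdjsonParser(onEvent) {
  let buffer = "";
  const emit = (line) => {
    const text = line.trim();
    if (text) onEvent(JSON.parse(text)); // skip blank lines
  };
  return {
    push(chunk) {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // last piece may be an incomplete line
      lines.forEach(emit);
    },
    flush() {
      const rest = buffer;
      buffer = "";
      emit(rest); // final line without a trailing newline
    },
  };
}

// Feed a fetch() Response through the parser as bytes arrive.
async function readNdjson(response, onEvent) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parser = createNdjsonParser(onEvent);
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact.
    parser.push(decoder.decode(value, { stream: true }));
  }
  parser.push(decoder.decode());
  parser.flush();
}
```

A caller could pass a handler such as `(ev) => ev.type === "final" ? renderResult(ev) : updateTrace(ev)`, where `renderResult` and `updateTrace` are placeholders for the page's own rendering functions.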
daveadmin 4cbe0a4ac4 Add Deep Research tool — agent + rank/rerank RAG
New surface at /deep-research.php where the user pastes a question or
uploads PDF/DOCX/TXT case files and an LLM-orchestrated agent researches
the Do Better Norge legal corpus from 3-5 angles, with hybrid retrieval,
cross-encoder rerank, and synthesis that emits an inline-[n]-cited
markdown brief plus a numbered sources panel.

Uploaded documents are chunked + embedded in memory only (nomic-embed-text
via LiteLLM) and searched alongside the shared corpus during the same
request — never persisted to disk, DB, or Qdrant.

Reuses ClientRagPipeline::searchAll (hybrid + rerank), dbnV6 slice
helpers, and the existing extract.php text-extraction logic via a new
dbnToolsExtractUploadedFile() helper. Also adds dbnToolsCallGpuLlm()
helper in bootstrap.php — fixes a latent bug where LegalTools.php
was already calling that name with no definition.

Search.php is unchanged.
2026-05-15 10:30:47 +02:00
daveadmin d429e785e8 feat(feedback): thumbs up/down + missed-items widget across all tools
New api/feedback.php stores rating + correction text to tool_feedback
table in bnl_admin. renderFeedbackWidget() appended to all tool results
(timeline, redact, transcribe, ask, summarize, search). Thumbs reveal
a textarea for missed/wrong items on click; submit POSTs asynchronously.
Engine from last run is stored alongside the rating.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 01:13:42 +02:00
daveadmin 7690ed17ee feat(timeline): full form UI with engine selection and advanced settings
Add 4-language switcher (EN/NO/UK/PL), engine choice (Azure mini/full,
GPU/cuttlefish), and expandable Advanced panel (Focus, Confidence filter,
Date types) to timeline.php. Wire new params through api/timeline.php and
LegalTools::timeline() with engine routing, focus-aware prompt injection,
and confidence/date-type post-filters. Add TIMELINE_I18N to tools.js with
improved renderTimeline() confidence colour-coding and new CSS classes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:59:12 +02:00
daveadmin 30915bcb09 Redact: collapsible advanced settings, download TXT/DOCX/copy
- Wrap Mode/Region/Entities/Officials/Output/Exempt/Aliases in a
  <details> toggle so the form opens clean with only engine + input visible
- After redaction: Copy, Download .txt, Download .docx buttons appear
  below the redacted output (all four languages translated)
- New api/redact-download.php: returns plain text or a minimal valid
  DOCX built from scratch with ZipArchive (no external dependencies)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:33:50 +02:00
daveadmin 8c12d5e778 Redact tool: rich UI, multilingual, engine choice, output formats
- Custom inline form (EN/NO/UK/PL lang switcher) replacing generic stub
- Engine selector: Azure gpt-4o-mini (default), gpt-4o, GPU cuttlefish, regex-only
- Entity type toggles: names, organisations, places, dates of birth
- Output formats: contextual role tags, generic [PERSON], Norwegian pseudonyms
- Keep officials mode: judges/experts kept as [JUDGE: Andersen] format
- Exempt names list: specific names excluded from redaction
- Hint paragraphs explaining each option in all four languages
- Backend: engine routing, callGpuLlm(), applyGenericTags(), applyPseudonymization()
- AzureOpenAiGateway: withDeployment() clone pattern for per-call model override

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:20:16 +02:00
daveadmin e3d8daf6ca feat(transcribe): Azure Speech server-side key, remove translate option, add beam/VAD hints
- api/transcribe.php falls back to DBN_AZURE_SPEECH_KEY/REGION env vars so BYOK not required
- JS hides Azure key input when DBN_AZURE_SPEECH_CONFIGURED is true
- Remove Translate to English task option from Advanced settings
- Add explanatory hint text for Beam size and VAD filter in all 4 languages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 23:23:33 +02:00
daveadmin c77efa241c feat(transcribe): English UI default, language switcher (NO/UK/PL), fix 504 timeout
- Default UI language to English; lang switcher (EN/NO/UK/PL) persisted in localStorage
- Rename 'rettssak/tingrett' preset to 'Mediation / legal meeting' — court recording is illegal
- Add Ukrainian (uk) and Polish (pl) as selectable audio transcription languages
- TRANSCRIBE_I18N translation object drives all status messages, labels, and trace text
- Apache ProxyTimeout raised to 1800s on server (was 300s — caused 504 on large files)
- set_time_limit(0) + ignore_user_abort(true) in api/transcribe.php
- applyTranscribeI18n() patches data-i18n / data-i18n-placeholder / data-i18n-aria attrs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 22:47:32 +02:00
daveadmin 26f4e2231b feat(transcribe): Norwegian defaults, vocabulary presets, multi-file court day queue
- Default language → nb (Bokmål); auto-detect demoted with warning note
- Default model → large-v3; VAD filter on by default
- Vocabulary prompt promoted to main form with 4 preset buttons
  (Barnerett/CPS, Rettssak/tingrett, Generell norsk, Egendefinert)
- Multi-file upload queue: drop/select multiple clips, numbered list UI
- Sequential queue processing with cumulative time_offset per clip
- Backend shifts segment timestamps so SRT/VTT covers full court day
- Merged transcript + segments across all clips for single download

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 22:20:11 +02:00
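Note on 26f4e2231b: the queue applies a cumulative time_offset per clip so that merged SRT/VTT timestamps cover the full day. The commit does this shift in the backend. The JavaScript sketch below only illustrates the arithmetic. The function names and the clip shape (`duration` in seconds plus a `segments` array with `start`, `end`, `text`) are assumptions.

```javascript
// Hypothetical helpers illustrating the cumulative offset; not the repo's code.

// Shift each clip's segments by the total duration of all earlier clips.
function mergeClipSegments(clips) {
  const merged = [];
  let offset = 0; // seconds elapsed before the current clip
  for (const clip of clips) {
    for (const seg of clip.segments) {
      merged.push({ start: seg.start + offset, end: seg.end + offset, text: seg.text });
    }
    offset += clip.duration;
  }
  return merged;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}
```

Using each clip's full duration for the offset, rather than the end of its last segment, keeps trailing silence from pulling later clips earlier.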
daveadmin eaff2a4d86 Per-tool pages + multi-engine transcribe with expert controls
- Split monolithic index.php into per-tool pages (ask, search, summarize,
  timeline, redact, transcribe), each with its own URL and bookmarkable state
- Shared shell: includes/layout.php + layout_footer.php; shared form:
  includes/tool_form.php used by all text-tool pages
- index.php now redirects authenticated users to ask.php; unauthenticated
  users see the login gate only
- transcribe.php: engine selector (GPU/OpenAI/Azure), model size (small/
  medium/large-v3), diarize, language, expert settings (beam, VAD, task,
  initial prompt)
- api/transcribe.php: engine routing — GPU (cuttlefish), OpenAI BYOK,
  Azure AI Speech; passes model/beam/task/vad/prompt to Whisper server
- tools.js: data-active-tool body attr drives setTool() on load; <a> nav
  tabs skip click listeners; null guards on form/passcodeForm; engine radio
  toggle shows/hides BYOK key inputs and model selector; RTF shown in status
- tools.css: styles for BYOK inputs, expert settings panel, prompt textarea

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 22:14:20 +02:00
daveadmin d425c99e8e Transcribe: audio-to-text tool with diarization and speaker role labelling
New sixth tool in the hub. Accepts MP3/WAV/OGG/M4A/FLAC/WEBM up to 200 MB,
proxies to Whisper on cuttlefish GPU. Optional speaker separation with LLM
role labelling (dommer, advokat, forelder, sakkyndig, etc. via GPT-4o-mini).
Client-side TXT / SRT / VTT download from segment data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 18:43:22 +02:00
daveadmin bddafea049 Timeline: document upload, upgraded prompt, CSV export, date_type badge
2026-05-13 08:10:40 +02:00
daveadmin 95685862ab Redact: multi-doc upload, contextual person naming, aliases
- Extract limit raised from 32K to 128K chars per file (long legal docs now fit)
- Redact API body/text limits raised (400KB / 128K chars) to match
- Upload zone accepts multiple files (up to 5); extracted text concatenated with
  doc separator and combined before redaction; shows per-file char counts
- LLM redact pass now infers contextual person roles (FATHER, MOTHER, CHILD,
  ATTORNEY, JUDGE, etc.) instead of generic [PERSON] for all names; same
  individual gets consistent tag throughout the document
- Tag validation widened to allow any [A-Za-z0-9_- ] pattern (not just the
  five hardcoded tags), supporting contextual and alias tags
- Alias UI added to Redact mode: user maps real names to bracketed aliases
  (e.g. "David Jr" -> [Junior]); aliases injected into LLM system prompt as
  override instructions; max 20 aliases, 100 chars each
- max_tokens raised from 2000 to 4000; timeout from 60s to 90s for larger docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 07:17:02 +02:00
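Note on 95685862ab: tag validation now accepts any bracketed tag built from the characters A-Z, a-z, 0-9, underscore, hyphen, and space, and aliases are capped at 20 entries of 100 characters each. A minimal JavaScript sketch of both checks follows. The names are hypothetical, and applying the 100-character cap to both the real name and the alias is an assumption, because the commit does not say which side the limit covers.

```javascript
// Hypothetical validators; limits come from the commit, names do not.
const TAG_PATTERN = /^\[[A-Za-z0-9_ -]+\]$/; // hyphen last, so it is literal
const MAX_ALIASES = 20;
const MAX_ALIAS_CHARS = 100;

function isValidTag(tag) {
  return TAG_PATTERN.test(tag);
}

// Return a list of human-readable problems; an empty list means valid.
function validateAliases(aliases) {
  const errors = [];
  if (aliases.length > MAX_ALIASES) {
    errors.push(`too many aliases (max ${MAX_ALIASES})`);
  }
  aliases.forEach(({ name, alias }, i) => {
    if (name.length > MAX_ALIAS_CHARS || alias.length > MAX_ALIAS_CHARS) {
      errors.push(`alias ${i + 1} exceeds ${MAX_ALIAS_CHARS} chars`);
    }
    if (!isValidTag(alias)) {
      errors.push(`alias ${i + 1} is not a valid bracketed tag`);
    }
  });
  return errors;
}
```

Validating aliases before they are injected into the system prompt also limits what user-supplied text can reach the LLM as an instruction.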
daveadmin bbe5307c03 Add document upload to Redact tool
api/extract.php — new endpoint accepting .pdf/.docx/.txt up to 4 MB;
pdftotext for PDFs, ZipArchive+DOMXPath for DOCX, mb_convert_encoding
for TXT; truncates to 32 000 chars to stay within redact limit.

index.php — drop/browse upload zone above the textarea, visible only
in Redact mode.

tools.js — setupUpload(), handleFileUpload(), resetUpload(); drag-and-drop
and file picker both call the extract endpoint then populate the textarea.

tools.css — upload zone, drag-over, file-info, clear button styles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 06:52:14 +02:00
daveadmin 3c8d7ebc34 feat: pass temporal_mode and as_of_date through DBN search API
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:45:54 +02:00
daveadmin 62dbb8d900 Gate tools login with Caveau access
2026-05-08 17:12:38 +02:00
daveadmin 9b22947eb2 Two-pass PII redaction with multi-country pattern packs
Pass 1: deterministic regex with Nordic/European/ECHR/Global packs
covering fødselsnummer, Swedish personnummer, Danish/Finnish CPR,
UK NI, French INSEE, IBAN, EU phones, ECHR application numbers, DOB,
and national ID label patterns.

Pass 2: LLM semantic scan (Azure OpenAI) finds names, orgs, places
and identifying descriptions missed by regex. Runs on pre-redacted
text so no raw PII reaches the LLM.

Adds region selector (Nordic/European/ECHR/Global) to the Redact UI.
Falls back gracefully when Azure is not yet configured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 01:27:52 +02:00
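Note on 9b22947eb2: Pass 1 is a deterministic regex sweep, and Pass 2 only ever sees the pre-redacted text. A minimal JavaScript sketch of the first pass follows, using two of the identifier types the commit lists. The patterns are deliberately simplified assumptions (11 digits for a fødselsnummer, a generic IBAN shape with no checksum validation) and are not the repo's pattern packs. The function name and the tag labels are also hypothetical.

```javascript
// Hypothetical Pass 1 sketch; patterns are simplified and NOT the repo's packs.
const PASS1_PATTERNS = [
  // IBAN shape: country code, two check digits, 11 to 30 alphanumerics.
  { tag: "[IBAN]", regex: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g },
  // Norwegian fødselsnummer shape: 6 date digits + 5 digits, optional space.
  { tag: "[NATIONAL_ID]", regex: /\b\d{6}\s?\d{5}\b/g },
];

// Replace every match of every pattern with its tag, in list order.
function redactPass1(text) {
  return PASS1_PATTERNS.reduce(
    (current, { tag, regex }) => current.replace(regex, tag),
    text
  );
}
```

The output of `redactPass1` is what would be handed to the LLM scan in Pass 2, so structured identifiers never leave the server in raw form.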
daveadmin 2d8d1c7409 Initial release: Do Better Norge Legal Tools Hub
Five MVP tools (Ask, Search, Summarize, Timeline, Redact) with
email+password auth, Azure OpenAI gateway, evidence trail panel,
and process-and-forget privacy default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 00:01:07 +02:00