# AI Extraction Module Improvements ## Status: Complete This document captures planned improvements to the AI extraction module (`extraction/extractor-module.tsx`) based on analysis of community patterns and best practices. --- ## Summary | Improvement | Complexity | Impact | Priority | Status | | ------------------------ | ---------- | ------ | -------- | ---------- | | Prompt Improvements | S | High | 1 | ✅ Done | | Validation Improvements | M | High | 2 | ✅ Done | | Confidence Scoring | M | High | 3 | ✅ Done | | Per-Source Extraction | M | Medium | 4 | ⚠️ Revised | | Schema Selection Pattern | M | Medium | 5 | ✅ Done | --- ## 1. Prompt Improvements **Complexity:** Small (15-30 min) **Impact:** High - improves extraction accuracy immediately ### Problem Current prompts lack: - Field-specific pattern examples - Explicit stopping rules - Example input/output - Guidance on field aliases and normalization ### Solution Enhance `EXTRACTION_SYSTEM_PROMPT` with: ```typescript const EXTRACTION_SYSTEM_PROMPT = `You are a precise data extractor... === FIELD EXTRACTION PATTERNS === EMAIL (field: "address"): - Patterns: user@domain.com, name+tag@company.org - Example: "reach me at john.doe@acme.com" -> "john.doe@acme.com" PHONE (field: "number"): - PRESERVE original formatting exactly - Example: "Cell: (415) 555-1234" -> "(415) 555-1234" BIRTHDAY (fields: "birthMonth", "birthDay", "birthYear"): - Extract as SEPARATE components, not combined date - Example: "March 15, 1990" -> birthMonth: "3", birthDay: "15", birthYear: "1990" ADDRESS (fields: "street", "city", "state", "zip"): - state: Use 2-letter abbreviation (CA, NY, TX) - Example: "123 Main St, San Francisco, CA 94102" SOCIAL MEDIA (fields: "platform", "handle"): - platform: Normalize to lowercase (twitter, linkedin, github) - handle: WITHOUT the @ prefix === STOPPING RULES === 1. Extract ONLY fields in the schema 2. Return null for fields without data 3. Do NOT infer or fabricate data `; ``` ### Files to Modify - `extraction/extractor-module.tsx` lines 94-106 (EXTRACTION_SYSTEM_PROMPT) - `extraction/extractor-module.tsx` lines 114-139 (NOTES_CLEANUP_SYSTEM_PROMPT - minor) --- ## 2. Validation Improvements **Complexity:** Medium (~430 lines) **Impact:** High - catches bugs before apply ### Problem Current validation: - Binary pass/fail only - No handling for LLM returning string `"null"` instead of actual `null` - No duplicate detection - No format validation - Silent failures (only console.warn) ### Solution #### New ValidationResult Interface ```typescript interface ValidationIssue { code: string; // "TYPE_MISMATCH", "STRING_NULL", "DUPLICATE" message: string; severity: "error" | "warning" | "info"; suggestion?: string; } interface FieldValidationResult { valid: boolean; transformed: boolean; // true if value was sanitized sanitizedValue: unknown; issues: ValidationIssue[]; } interface ExtractionValidationResult { canApply: boolean; // false if any errors errorCount: number; warningCount: number; fieldResults: Record; globalIssues: ValidationIssue[]; // e.g., duplicates } ``` #### Key Validations 1. **String "null" handling**: Convert `"null"` string to actual `null` 2. **Duplicate detection**: Warn if extracting duplicate email/phone 3. **Format validation**: Check email format, URL format, date format 4. **Type checking**: Enhanced with suggestions #### UI Changes - Show validation summary before field list - Per-field validation indicators (green/yellow/red) - Disable Apply button when errors exist - Show warnings (non-blocking) distinctly from errors (blocking) ### Files to Modify/Create - `extraction/types.ts` - Add ValidationResult interfaces - `extraction/validation.ts` (NEW) - Pure validation functions - `extraction/extractor-module.tsx` - Integrate validation, add UI --- ## 3. Confidence Scoring **Complexity:** Medium (4-6 hours) **Impact:** High - user trust and quality filtering ### Problem Users have no visibility into extraction quality. Low-confidence extractions look the same as high-confidence ones. ### Solution #### Schema Changes ```typescript interface ExtractedField { fieldName: string; targetModule: string; extractedValue: unknown; currentValue?: unknown; confidence: number; // NEW: 0-1 score confidenceLevel: "high" | "medium" | "low"; confidenceReason?: string; // NEW: LLM explanation } ``` #### LLM Schema Wrapping Instead of `{ email: { type: "string" } }`, use: ```typescript { email: { type: "object", properties: { value: { type: "string" }, confidence: { type: "number", minimum: 0, maximum: 1 }, reason: { type: "string" } } } } ``` #### UI Components - Confidence badges: High (green checkmark), Medium (yellow dot), Low (red warning) - Filter buttons: All | High | Medium | Review - Auto-deselect low-confidence fields by default - Show reason tooltip on hover ### Files to Modify - `extraction/types.ts` - Add ConfidenceLevel, update ExtractedField - `extraction/schema-utils.ts` - Add wrapSchemaWithConfidence() - `extraction/extractor-module.tsx` - Prompt, buildPreview, UI --- ## 4. Per-Source Extraction **Complexity:** Medium (6-7 hours) **Impact:** Medium - performance and caching **Status:** Revised - using combined extraction instead ### Original Problem Current approach combines all sources into single LLM call. Changing one source re-extracts everything. No per-source caching. ### Attempted Solution Use "dumb map approach" from folk wisdom - one `generateObject()` per source: ```typescript // Attempted: Per-source calls with .map() const sourceExtractions = selectedSources.map((source) => ({ sourceIndex: source.index, sourceType: source.type, extraction: generateObject({ prompt: computed(() => source.content), // Per-source prompt schema, model, }), })); ``` ### Why It Didn't Work Creating `computed()` inside `.map()` callbacks caused infinite reactive loops. The issue was specific to this case because: 1. `selectedSourcesForExtraction` depends on `extractPhase` (volatile state) 2. When `generateObject()` completes, state changes propagate back 3. This triggers re-evaluation of the `.map()`, creating new computed nodes 4. New nodes invalidate subscriptions, causing another evaluation loop **Note:** OCR in the same file successfully uses `.map()` + `computed()` because `photoSources` is stable (photos don't change during extraction flow). See: `community-docs/superstitions/2026-01-08-computed-inside-map-callback-infinite-loop.md` ### Current Implementation: Combined Extraction Instead of per-source extraction, we use a single combined prompt: ```typescript // Build single prompt from all sources const combinedExtractionPrompt = computed(() => { const promptParts: string[] = []; for (const source of selectedSources) { if (source.type === "photo") { promptParts.push( `--- ${source.label} (OCR) ---\n${ocrResults[source.index]}`, ); } else { promptParts.push(`--- ${source.label} ---\n${source.content}`); } } return promptParts.join("\n\n"); }); // Single extraction call const singleExtraction = generateObject({ prompt: combinedExtractionPrompt, schema: RECOMMENDATIONS_SCHEMA, model: "anthropic:claude-haiku-4-5", }); ``` ### Trade-offs **Lost benefits:** - No per-source caching (changing one source re-extracts all) - No incremental progress (can't see each source complete) **Gained benefits:** - Works reliably (no infinite loops) - Simpler code - Single LLM call is often faster than multiple small calls - LLM can cross-reference data between sources in combined context ### Files Modified - `extraction/types.ts` - Added SourceExtraction interface (still used for status) - `extraction/extractor-module.tsx` - Uses combined extraction with synthetic per-source array for UI compatibility --- ## 5. Schema Selection Pattern **Complexity:** Medium (12-17 hours) **Impact:** Medium - better UX, follows framework philosophy ### Problem Current "mega-schema" approach: - Combines all module types into one huge schema - No explanation of WHY fields were extracted - All-or-nothing acceptance - Field collision between similar module types ### Solution Replace combined schema with recommendations array: ```typescript interface ExtractionRecommendation { type: string; // "email", "birthday", "phone" score: number; // 0-100 confidence explanation: string; // "Found email format in signature" extractedData: Record; sourceExcerpt?: string; // "Contact: john@example.com" } // LLM returns { recommendations: [ { type: "email", score: 95, explanation: "Explicit email address found", extractedData: { address: "john@example.com" }, }, { type: "birthday", score: 60, explanation: "Date found but unclear if birthday", extractedData: { birthMonth: "3", birthDay: "15" }, }, ]; } ``` #### UI Preview ``` +--------------------------------------------------+ | [x] 📧 Email 95% confidence | | "Found email format in signature" | | address: john@example.com | +--------------------------------------------------+ | [x] 📱 Phone 88% confidence | | "Phone in standard US format" | | number: (415) 555-1234 | +--------------------------------------------------+ | [ ] 🎂 Birthday 45% confidence | | "Date found but may not be birthday" | | ⚠ Low confidence - review before accepting | +--------------------------------------------------+ | Auto-accept threshold: [====70%======] | +--------------------------------------------------+ ``` #### Benefits - Users see WHY each field was extracted - Sorted by confidence (high first) - Low-confidence auto-deselected - Individual accept/reject per recommendation - Follows "data-up" philosophy (framework author recommendation) ### Migration Path 1. Add recommendations mode behind feature flag 2. Test in parallel with existing approach 3. Make default after validation 4. Remove old combined schema code ### Files to Modify - `extraction/types.ts` - Add ExtractionRecommendation interface - `extraction/schema-utils.ts` - Add buildRecommendationsSchema() - `extraction/extractor-module.tsx` - New prompt, UI, apply logic --- ## Implementation Order Rationale 1. **Prompts first**: Quick win, immediate accuracy improvement, no structural changes 2. **Validation second**: Catches bugs like string "null", improves reliability 3. **Confidence third**: Builds on validation, gives users quality visibility 4. **Per-source fourth**: Performance improvement, leverages framework caching 5. **Schema selection last**: Largest change, best UX but requires most testing --- ## References - `community-docs/superstitions/2025-12-18-llm-extraction-schema-selection-not-combined.md` (community-patterns repo) - `community-docs/superstitions/2025-11-29-llm-generateObject-returns-string-null.md` (community-patterns repo) - `community-docs/folk_wisdom/llm.md` (community-patterns repo) - `patterns/jkomoros/hotel-membership-gmail-agent.tsx` (community-patterns repo) --- ## Changelog - 2026-01-08: Per-source extraction revised - changed to combined extraction due to infinite loop issues with `.map()` + `computed()`. Documented trade-offs. - 2026-01-08: All improvements completed - per-source extraction and schema selection pattern with confidence scoring implemented - 2026-01-08: v2 improvements completed (prompts, validation, error messages, progress feedback, helper refactoring) - 2026-01-06: Initial document created from research and sub-agent planning