## 3.1 System Components
```
┌─────────────────────────────────────────────────────────────────┐
│                        Client (Browser)                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐   │
│  │   Exam Setup    │  │ Chat Interface  │  │ Export/Review  │   │
│  │      Modal      │  │   (Streaming)   │  │     Panel      │   │
│  └────────┬────────┘  └────────┬────────┘  └───────┬────────┘   │
└───────────┼────────────────────┼───────────────────┼────────────┘
            │                    │                   │
            ▼                    ▼                   ▼
┌─────────────────────────────────────────────────────────────────┐
│                         FastAPI Backend                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐   │
│  │ POST /api/exam  │  │ POST /api/chat  │  │ GET /api/export│   │
│  │ /session/start  │  │   (streaming)   │  │ /{session_id}  │   │
│  └────────┬────────┘  └────────┬────────┘  └───────┬────────┘   │
│           │                    │                   │            │
│           ▼                    ▼                   ▼            │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                SQLite Database (Enhanced)                │   │
│  │  • conversations   • messages   • exam_sessions          │   │
│  │  • audit_logs (NEW)                                      │   │
│  └──────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────┬────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Azure AI Foundry                         │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Claude Sonnet 4.5 (Anthropic)              │    │
│  │             Endpoint: services.ai.azure.com             │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
```
## 3.2 Data Flow
```
1. Examiner configures exam session
   └─► Subject, AI student level, weaknesses, language
2. System generates AI student persona prompt
   └─► Stored in exam_sessions table
3. Student starts interview
   └─► Each message logged with timestamp
4. AI responds as "student"
   └─► Low temperature for consistency
   └─► Role-play instructions enforced
5. Session ends
   └─► Full transcript available for export
   └─► Examiner can review via JSON/PDF
```
---
# 4. Database Schema
## 4.1 Existing Tables (unchanged)
```sql
-- Conversations table
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL
);
-- Messages table
CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id TEXT NOT NULL,
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TEXT NOT NULL,
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);
```
## 4.2 New Tables
```sql
-- Exam sessions configuration
CREATE TABLE exam_sessions (
    id TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL UNIQUE,
    examiner_id TEXT,                 -- Optional: who configured this
    subject TEXT NOT NULL,            -- Course/subject name
    student_level TEXT NOT NULL,      -- 'weak', 'medium', 'strong'
    language TEXT DEFAULT 'no',       -- 'no', 'en'
    weaknesses TEXT,                  -- JSON array of configured weaknesses
    temperature REAL DEFAULT 0.3,     -- Lower = more deterministic
    system_prompt TEXT NOT NULL,      -- Generated system prompt
    started_at TEXT NOT NULL,
    ended_at TEXT,
    status TEXT DEFAULT 'active',     -- 'active', 'completed', 'exported'
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);
-- Audit log for all actions
CREATE TABLE audit_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT,
    action TEXT NOT NULL,             -- 'session_start', 'message_sent', 'session_end', 'export'
    actor TEXT,                       -- 'student', 'system', 'examiner'
    details TEXT,                     -- JSON with action-specific data
    ip_address TEXT,
    user_agent TEXT,
    created_at TEXT NOT NULL,
    FOREIGN KEY (session_id) REFERENCES exam_sessions(id)
);
```
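
As a rough illustration of how the new tables could be written to, the sketch below starts a session and records the matching audit row in one transaction. It uses only Python's built-in `sqlite3` module against an in-memory database. The schema is trimmed to the columns the example needs, and the helper names `utc_now` and `start_session` are hypothetical, not part of the existing codebase. Note that SQLite only enforces the `FOREIGN KEY` clauses when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Trimmed copy of the schema above (assumption: only the columns used here).
SCHEMA = """
CREATE TABLE conversations (id TEXT PRIMARY KEY, created_at TEXT NOT NULL);
CREATE TABLE exam_sessions (
    id TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL UNIQUE,
    subject TEXT NOT NULL,
    student_level TEXT NOT NULL,
    weaknesses TEXT,
    system_prompt TEXT NOT NULL,
    started_at TEXT NOT NULL,
    status TEXT DEFAULT 'active',
    FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);
CREATE TABLE audit_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT,
    action TEXT NOT NULL,
    actor TEXT,
    details TEXT,
    created_at TEXT NOT NULL,
    FOREIGN KEY (session_id) REFERENCES exam_sessions(id)
);
"""


def utc_now() -> str:
    """Current UTC time in the ISO 8601 form used in the API examples."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def start_session(conn, subject, student_level, weaknesses, system_prompt):
    """Insert a conversation, its exam session, and a 'session_start' audit row."""
    conversation_id = f"conv-{uuid.uuid4()}"
    session_id = f"exam-{uuid.uuid4()}"
    now = utc_now()
    with conn:  # single transaction: all three rows are written, or none
        conn.execute(
            "INSERT INTO conversations (id, created_at) VALUES (?, ?)",
            (conversation_id, now),
        )
        conn.execute(
            "INSERT INTO exam_sessions (id, conversation_id, subject, student_level,"
            " weaknesses, system_prompt, started_at) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (session_id, conversation_id, subject, student_level,
             json.dumps(weaknesses), system_prompt, now),
        )
        conn.execute(
            "INSERT INTO audit_logs (session_id, action, actor, details, created_at)"
            " VALUES (?, ?, ?, ?, ?)",
            (session_id, "session_start", "examiner",
             json.dumps({"subject": subject}), now),
        )
    return session_id


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(SCHEMA)
sid = start_session(conn, "INF-101", "medium", ["kildehenvisning"], "Du er en ...")
```

Storing `weaknesses` and `details` as JSON text keeps the schema flat, at the cost of not being able to filter on individual weaknesses in plain SQL.
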
---
# 5. API Endpoints
## 5.1 Existing Endpoints (enhanced)
| Method | Endpoint | Description | Changes |
|--------|----------|-------------|---------|
| POST | `/api/chat` | Send message, stream response | Add exam mode system prompt |
| GET | `/api/conversations` | List conversations | Filter by exam sessions |
| GET | `/api/history/{id}` | Get conversation history | No changes |
## 5.2 New Endpoints
### POST `/api/exam/session/start`
Start a new exam session with configuration.
**Request:**
```json
{
  "subject": "INF-101 Grunnleggende programmering",
  "student_level": "medium",
  "language": "no",
  "weaknesses": ["kildehenvisning", "feilhaandtering"],
  "examiner_id": "ola.nordmann@uia.no"
}
```
**Response:**
```json
{
  "session_id": "exam-uuid-here",
  "conversation_id": "conv-uuid-here",
  "status": "active",
  "system_prompt_preview": "Du er en middels god student...",
  "started_at": "2026-02-10T14:30:00Z"
}
```
### POST `/api/exam/session/{session_id}/end`
End an exam session.
**Response:**
```json
{
  "session_id": "exam-uuid-here",
  "status": "completed",
  "duration_minutes": 45,
  "message_count": 24,
  "ended_at": "2026-02-10T15:15:00Z"
}
```
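
The `duration_minutes` field can be derived from the stored `started_at` and `ended_at` timestamps. The following is a minimal sketch of that calculation. The function names `parse_utc` and `session_summary` are hypothetical, and rounding down to whole minutes is an assumption, since the spec does not say how partial minutes are handled.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z'."""
    # datetime.fromisoformat() only accepts the 'Z' suffix from Python 3.11 on,
    # so it is rewritten to an explicit offset for older versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def session_summary(session_id, started_at, ended_at, message_count):
    """Build the response body for the end-session endpoint."""
    duration = parse_utc(ended_at) - parse_utc(started_at)
    return {
        "session_id": session_id,
        "status": "completed",
        "duration_minutes": int(duration.total_seconds() // 60),  # rounds down
        "message_count": message_count,
        "ended_at": ended_at,
    }


summary = session_summary(
    "exam-uuid-here", "2026-02-10T14:30:00Z", "2026-02-10T15:15:00Z", 24
)
print(summary["duration_minutes"])  # 45
```
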
### GET `/api/exam/export/{session_id}`
Export session for examiner review.
**Query params:**
- `format`: `json` (default) or `pdf`
**Response (JSON):**
```json
{
  "session_id": "exam-uuid-here",
  "metadata": {
    "subject": "INF-101",
    "student_level": "medium",
    "started_at": "2026-02-10T14:30:00Z",
    "ended_at": "2026-02-10T15:15:00Z",
    "duration_minutes": 45
  },
  "configuration": {
    "weaknesses": ["kildehenvisning"],
    "temperature": 0.3,
    "system_prompt": "..."
  },
  "transcript": [
    {
      "timestamp": "2026-02-10T14:30:15Z",
      "role": "user",
      "content": "Kan du forklare hva en for-løkke er?"
    },
    {
      "timestamp": "2026-02-10T14:30:45Z",
      "role": "assistant",
      "content": "En for-løkke er sånn... du bruker den for å gjenta ting..."
    }
  ],
  "audit_trail": [...]
}
```
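
One way the export payload could be assembled from database rows is sketched below. The function `build_export` and the shape of its inputs (a dict keyed by `exam_sessions` columns, and `(created_at, role, content)` tuples from `messages`) are assumptions made for illustration. The `duration_minutes` field is left out to keep the example short.

```python
import json


def build_export(session, messages, audit_trail):
    """Assemble the export payload from one session row and its messages."""
    return {
        "session_id": session["id"],
        "metadata": {
            "subject": session["subject"],
            "student_level": session["student_level"],
            "started_at": session["started_at"],
            "ended_at": session["ended_at"],
        },
        "configuration": {
            # weaknesses is stored as a JSON array in a TEXT column
            "weaknesses": json.loads(session["weaknesses"] or "[]"),
            "temperature": session["temperature"],
            "system_prompt": session["system_prompt"],
        },
        "transcript": [
            {"timestamp": created_at, "role": role, "content": content}
            for created_at, role, content in messages
        ],
        "audit_trail": audit_trail,
    }


session = {
    "id": "exam-uuid-here",
    "subject": "INF-101",
    "student_level": "medium",
    "started_at": "2026-02-10T14:30:00Z",
    "ended_at": "2026-02-10T15:15:00Z",
    "weaknesses": '["kildehenvisning"]',
    "temperature": 0.3,
    "system_prompt": "...",
}
messages = [("2026-02-10T14:30:15Z", "user", "Kan du forklare hva en for-løkke er?")]

# ensure_ascii=False keeps Norwegian characters readable in the exported file
payload = json.dumps(build_export(session, messages, []), ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` matters for this project because transcripts contain `æ`, `ø`, and `å`. Without it they are written as `\uXXXX` escapes, which is valid JSON but harder for examiners to read.
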
### GET `/api/exam/sessions`
List all exam sessions (for examiners).
**Response:**
```json
{
  "sessions": [
    {
      "session_id": "...",
      "subject": "INF-101",
      "status": "completed",
      "started_at": "...",
      "message_count": 24
    }
  ]
}
```
---
# 6. AI Student System Prompt
## 6.1 Prompt Template (Norwegian)
```
Du er en {level_description} student i faget "{subject}".

VIKTIGE INSTRUKSJONER:
1. Du skal ALDRI opptre som lærer eller ekspert
2. Du svarer som en student som prøver å forklare det du har lært
3. Du kan være usikker, nøle, og gjøre feil
4. Du skal IKKE gi perfekte, læreboksvar

DITT NIVÅ: {level_description}
- Svak student: Gjør mange feil, husker dårlig, blander begreper
- Middels student: Forstår grunnleggende, men sliter med detaljer
- Sterk student: God forståelse, men kan fortsatt ha hull

SPESIFIKKE SVAKHETER DU HAR:
{weaknesses_list}

SPRÅK: Svar alltid på {language}.

OPPFØRSEL:
- Bruk uformelt språk som en student ville gjort
- Si "ehh", "jeg tror...", "var det ikke sånn at..."
- Hvis du blir presset på noe du ikke kan, innrøm at du er usikker
- Ikke gi kilder med mindre du blir spurt, og da kan du finne på feil kilder

EKSEMPEL på hvordan du svarer som {level} student:
Spørsmål: "Hva er en variabel?"
Svak: "Det er sånn... en boks? Der du legger ting i. Tror jeg."
Middels: "En variabel er en navngitt plass i minnet der du kan lagre verdier. Som en boks med en lapp på."
Sterk: "En variabel er en symbolsk referanse til en minnelokasjon som holder en verdi. Den har en type, et navn og en verdi."
```
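
The template's placeholders could be filled with plain `str.format`, as in the sketch below. The template is shortened to three lines here. The function `build_system_prompt` and both lookup tables are hypothetical. Only the wording "middels god" is taken from the `system_prompt_preview` example in this document, while "svak", "sterk", "norsk", and "engelsk" are assumed.

```python
# Assumed mappings from API values to the Norwegian wording in the prompt.
LEVEL_DESCRIPTIONS = {"weak": "svak", "medium": "middels god", "strong": "sterk"}
LANGUAGE_NAMES = {"no": "norsk", "en": "engelsk"}

# Shortened version of the template above, for illustration only.
TEMPLATE = (
    'Du er en {level_description} student i faget "{subject}".\n'
    "SPESIFIKKE SVAKHETER DU HAR:\n"
    "{weaknesses_list}\n"
    "SPRÅK: Svar alltid på {language}.\n"
)


def build_system_prompt(subject, student_level, weaknesses, language="no"):
    """Render the persona prompt for one exam session."""
    if student_level not in LEVEL_DESCRIPTIONS:
        raise ValueError(f"unknown student_level: {student_level!r}")
    if language not in LANGUAGE_NAMES:
        raise ValueError(f"unknown language: {language!r}")
    # One bullet per weakness; a placeholder bullet if none were selected.
    weaknesses_list = "\n".join(f"- {w}" for w in weaknesses) or "- (ingen)"
    return TEMPLATE.format(
        level_description=LEVEL_DESCRIPTIONS[student_level],
        subject=subject,
        weaknesses_list=weaknesses_list,
        language=LANGUAGE_NAMES[language],
    )


prompt = build_system_prompt("INF-101", "medium", ["kildehenvisning"])
print(prompt)
```

Substituted values are not re-parsed by `str.format`, so a subject name containing braces does not break rendering.
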
## 6.2 Level Configurations
| Level | Temperature | Error Rate | Characteristics |
|-------|-------------|------------|-----------------|
| `weak` | 0.5 | High | Confused, wrong facts, poor structure |
| `medium` | 0.3 | Moderate | Basic understanding, gaps in details |
| `strong` | 0.2 | Low | Good grasp, occasional minor errors |
## 6.3 Configurable Weaknesses
| Weakness ID | Norwegian | Description |
|-------------|-----------|-------------|
| `kildehenvisning` | Dårlig kildebruk | Makes up sources, cites incorrectly |
| `feilhaandtering` | Svak feilhåndtering | Doesn't consider edge cases |
| `syntaks` | Syntaksfeil | Makes syntax errors in code |
| `begreper` | Blander begreper | Confuses related concepts |
| `struktur` | Dårlig struktur | Unorganized explanations |
| `dybde` | Overfladisk forståelse | Lacks deep understanding |
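
The level and weakness tables translate naturally into lookup structures that the session-start endpoint could validate against. The sketch below assumes a hypothetical `ExamConfig` dataclass and `make_config` helper. It rejects any weakness that is not one of the ASCII IDs in the table above, including the display spelling `feilhåndtering`.

```python
from dataclasses import dataclass

# Values copied from the tables in 6.2 and 6.3.
LEVEL_TEMPERATURES = {"weak": 0.5, "medium": 0.3, "strong": 0.2}
WEAKNESS_IDS = {
    "kildehenvisning", "feilhaandtering", "syntaks",
    "begreper", "struktur", "dybde",
}


@dataclass(frozen=True)
class ExamConfig:
    student_level: str
    weaknesses: tuple
    temperature: float


def make_config(student_level, weaknesses):
    """Validate the requested level and weaknesses and attach the temperature."""
    if student_level not in LEVEL_TEMPERATURES:
        raise ValueError(f"unknown student_level: {student_level!r}")
    unknown = sorted(set(weaknesses) - WEAKNESS_IDS)
    if unknown:
        raise ValueError(f"unknown weakness IDs: {unknown}")
    return ExamConfig(
        student_level=student_level,
        weaknesses=tuple(weaknesses),
        temperature=LEVEL_TEMPERATURES[student_level],
    )


config = make_config("weak", ["syntaks", "begreper"])
print(config.temperature)  # 0.5
```
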
---
# 7. UI Components
## 7.1 Exam Setup Modal
```
┌─────────────────────────────────────────────────────────┐
│  ⚠️ Flipped Exam - PoC Mode                             │
│                                                         │
│  Fag/Emne:                                              │
│  ┌─────────────────────────────────────────────────┐    │
│  │ INF-101 Grunnleggende programmering             │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  KI-student nivå:                                       │
│  ○ Svak student (mange feil, usikker)                   │
│  ○ Middels student (grunnleggende forståelse)           │
│  ○ Sterk student (god forståelse)                       │
│                                                         │
│  Svakheter (velg 1-3):                                  │
│  ☐ Dårlig kildebruk                                     │
│  ☐ Svak feilhåndtering                                  │
│  ☐ Blander begreper                                     │
│  ☐ Syntaksfeil i kode                                   │
│                                                         │
│  Språk:                                                 │
│  ○ Norsk   ○ English                                    │
│                                                         │
│  ┌─────────────┐  ┌──────────────────┐                  │
│  │   Avbryt    │  │  Start eksamen   │                  │
│  └─────────────┘  └──────────────────┘                  │
└─────────────────────────────────────────────────────────┘
```
## 7.2 Active Exam Session UI
```
┌─────────────────────────────────────────────────────────┐
│  ⚠️ FLIPPED EXAM PoC - Ikke for offisiell bruk          │
│  Fag: INF-101 | Nivå: Middels | Tid: 23:45              │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  [Chat messages appear here as normal]                  │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Du:                                             │    │
│  │ "Kan du forklare hva arv betyr i OOP?"          │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │ KI-Student:                                     │    │
│  │ "Arv er sånn... når en klasse får egenskapene   │    │
│  │ fra en annen klasse. Som at en Bil-klasse kan   │    │
│  │ arve fra Kjøretøy eller noe sånt."              │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
├─────────────────────────────────────────────────────────┤
│  ┌───────────────────────────────────┐  ┌────────────┐  │
│  │ Still et spørsmål...              │  │    Send    │  │
│  └───────────────────────────────────┘  └────────────┘  │
│                                                         │
│  ┌────────────────┐  ┌────────────────────┐             │
│  │  Avslutt økt   │  │  Eksporter (JSON)  │             │
│  └────────────────┘  └────────────────────┘             │
└─────────────────────────────────────────────────────────┘
```
---
# 8. Security & Privacy Considerations
## 8.1 Data Handling
| Aspect | Implementation |
|--------|----------------|
| Data storage | SQLite on local server (PoC only) |
| Data retention | Manual deletion, no auto-purge in PoC |
| Access control | None in PoC (internal testing only) |
| Encryption | Not implemented in PoC |
## 8.2 PoC Limitations (to address before production)
- [ ] No authentication/authorization
- [ ] No HTTPS enforcement
- [ ] No session timeouts
- [ ] No rate limiting
- [ ] Single-tenant architecture
- [ ] No backup/recovery
---
# 9. Testing Strategy
## 9.1 Internal Testing Phases
| Phase | Duration | Participants | Goal |
|-------|----------|--------------|------|
| 1 | 1 week | Dev team | Validate basic functionality |
| 2 | 2 weeks | Extended team | Test AI student behaviors |
| 3 | 2 weeks | Select faculty | Pedagogical feedback |
## 9.2 Metrics to Measure
| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Response consistency | <20% variance | Same prompts, multiple runs |
| Latency | <3s first token | Timestamp logging |
| Error rate (system) | <1% | Exception logging |
| AI "in-character" rate | >90% | Manual review of transcripts |
## 9.3 Stop/Go Criteria
**Stop conditions:**
- AI frequently breaks character (>10% of responses)
- System errors affect >5% of sessions
- Faculty feedback indicates unusable for learning
**Go conditions:**
- All Phase 3 testers recommend continuation
- Consistency metrics met
- Clear value demonstrated for learning (even if not for exams)
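
The two quantitative stop conditions can be checked mechanically from review counts, as in the sketch below. The function `stop_reasons` and its parameters are hypothetical. The faculty-feedback condition is qualitative and is deliberately left out.

```python
def stop_reasons(total_responses, character_breaks, total_sessions, sessions_with_errors):
    """Return the quantitative stop conditions that are currently triggered."""
    reasons = []
    # Stop if the AI breaks character in more than 10% of responses.
    if total_responses and character_breaks / total_responses > 0.10:
        reasons.append(
            f"character breaks in {character_breaks / total_responses:.1%} of responses"
        )
    # Stop if system errors affect more than 5% of sessions.
    if total_sessions and sessions_with_errors / total_sessions > 0.05:
        reasons.append(
            f"system errors in {sessions_with_errors / total_sessions:.1%} of sessions"
        )
    return reasons


# 25 breaks in 200 responses is 12.5%; 1 failing session out of 40 is 2.5%.
print(stop_reasons(200, 25, 40, 1))
```

An empty list means neither threshold is exceeded. It does not by itself satisfy the go conditions, which also depend on tester recommendations.
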
---
# 10. Implementation Checklist
## Phase 1: Core Implementation (This PoC)
- [ ] Enhanced database schema with exam_sessions table
- [ ] Exam setup modal in UI
- [ ] AI student system prompt generator
- [ ] `/api/exam/session/start` endpoint
- [ ] `/api/exam/session/{id}/end` endpoint
- [ ] `/api/exam/export/{id}` endpoint
- [ ] PoC disclaimer banner
- [ ] Basic audit logging
## Phase 2: Enhancements (Future)
- [ ] PDF export with formatting
- [ ] Session replay functionality
- [ ] Multi-examiner support
- [ ] Performance dashboards
- [ ] Consistency analysis tools
## Phase 3: Production Readiness (If approved)
- [ ] Authentication integration (Feide)
- [ ] LMS integration (Canvas)
- [ ] Proper security audit
- [ ] GDPR compliance review
- [ ] Load testing
- [ ] Deployment to production Azure
---
# 11. File Structure
```
azure-ai-foundry-claudesonnet-45/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI app (enhanced)
│   ├── exam/                # NEW: Exam module
│   │   ├── __init__.py
│   │   ├── models.py        # Pydantic models
│   │   ├── prompts.py       # System prompt templates
│   │   └── routes.py        # Exam-specific endpoints
│   └── database.py          # NEW: DB utilities
├── docs/
│   └── FLIPPED_EXAM_POC_TECHNICAL_SPEC.md  # This document
├── tests/                   # NEW: Test suite
│   ├── test_exam_session.py
│   └── test_ai_student.py
├── requirements.txt
├── .env.example
└── README.md
```
---
# 12. Deployment
## 12.1 Local Development
```bash
# Clone and setup
cd azure-ai-foundry-claudesonnet-45
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with Azure credentials

# Run
uvicorn app.main:app --reload --port 8000
```
## 12.2 Environment Variables
```env
# Azure AI Foundry
AZURE_FOUNDRY_ENDPOINT=https://your-resource.services.ai.azure.com
AZURE_API_KEY=your-api-key
CLAUDE_DEPLOYMENT_NAME=claude-sonnet-4-5

# Database
CHAT_DB_PATH=chat.db

# Exam PoC Settings (NEW)
EXAM_MODE_ENABLED=true
EXAM_DEFAULT_TEMPERATURE=0.3
EXAM_MAX_DURATION_MINUTES=60
```
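
The exam settings could be read at startup along the lines of the sketch below. The `ExamSettings` dataclass and `load_exam_settings` function are hypothetical. The fallback values mirror the example file above, except that exam mode defaulting to off when the variable is missing is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExamSettings:
    enabled: bool
    default_temperature: float
    max_duration_minutes: int


def load_exam_settings(env=os.environ):
    """Read the exam PoC settings from a mapping of environment variables."""
    raw_enabled = env.get("EXAM_MODE_ENABLED", "false").strip().lower()
    return ExamSettings(
        enabled=raw_enabled in {"1", "true", "yes"},
        default_temperature=float(env.get("EXAM_DEFAULT_TEMPERATURE", "0.3")),
        max_duration_minutes=int(env.get("EXAM_MAX_DURATION_MINUTES", "60")),
    )


# Passing a plain dict instead of os.environ keeps the function easy to test.
settings = load_exam_settings({
    "EXAM_MODE_ENABLED": "true",
    "EXAM_DEFAULT_TEMPERATURE": "0.3",
    "EXAM_MAX_DURATION_MINUTES": "60",
})
print(settings)
```
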
---
# 13. Monitoring and Operations
## 13.1 AI Monitor Verification
The AI Monitor runs automatically in the background after each AI response during an exam session. To verify it's working:
### 13.1.1 Check Pod Logs (Kubernetes)
View recent monitor activity:
```bash
# Check for monitor logs (last 50 lines)
kubectl logs -n flipped-exam -l app=flipped-exam-chat --tail=50 | grep -i "MONITOR"

# Watch logs in real-time during testing
kubectl logs -n flipped-exam -l app=flipped-exam-chat -f | grep "MONITOR"

# Check specific session logs (replace <session-id> with the actual session ID)
kubectl logs -n flipped-exam deployment/flipped-exam-chat --tail=200 | grep -E "(MONITOR|exam-<session-id>)"
```
Expected log output:
```
[MONITOR] Analyzing response for session exam-6ca810dd-..., message 24
[MONITOR] Found 3 issues in response 24
  - level_mismatch (medium): Too sophisticated strategic analysis...
  - character_break (low): Sounds like a teacher organizing a lesson...
[MONITOR] Response 28 passed quality check (score: 0.85)
```
### 13.1.2 Monitor Log Patterns
**Success patterns:**
- `[MONITOR] Analyzing response for session <session_id>, message <N>` - Monitor started
- `[MONITOR] Response <N> passed quality check (score: X.XX)` - No issues found
- `[MONITOR] Found <count> issues in response <N>` - Issues detected
**Quality flag types:**
- `hallucination` - AI invented facts or sources
- `character_break` - AI stepped out of student role (e.g., acting as teacher)
- `level_mismatch` - Response too sophisticated or too weak for configured level
- `source_fabrication` - Cited non-existent sources
- `inappropriate` - Inappropriate content
**Severity levels:**
- `low` - Minor issue, acceptable variation
- `medium` - Noticeable problem, affects quality
- `high` - Significant issue, undermines session
- `critical` - Severe problem, session should be reviewed
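
For a quick tally of flags and scores, the log lines can be parsed with two regular expressions. The sketch below is based only on the sample log output shown in this document, so the exact line format is an assumption, and `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches flag lines such as "  - level_mismatch (medium): Too sophisticated..."
FLAG_RE = re.compile(
    r"^\s*-\s*(?P<type>[a-z_]+)\s+\((?P<severity>low|medium|high|critical)\):\s*(?P<detail>.*)$"
)
# Matches "[MONITOR] Response 28 passed quality check (score: 0.85)"
PASS_RE = re.compile(
    r"\[MONITOR\] Response (?P<n>\d+) passed quality check \(score: (?P<score>\d+\.\d+)\)"
)


def summarize(lines):
    """Count flags by (type, severity) and collect scores of passed responses."""
    flags = Counter()
    scores = []
    for line in lines:
        flag = FLAG_RE.match(line)
        if flag:
            flags[(flag["type"], flag["severity"])] += 1
            continue
        passed = PASS_RE.search(line)
        if passed:
            scores.append(float(passed["score"]))
    return flags, scores


sample = [
    "[MONITOR] Analyzing response for session exam-6ca810dd-..., message 24",
    "[MONITOR] Found 3 issues in response 24",
    "  - level_mismatch (medium): Too sophisticated strategic analysis...",
    "  - character_break (low): Sounds like a teacher organizing a lesson...",
    "[MONITOR] Response 28 passed quality check (score: 0.85)",
]
flags, scores = summarize(sample)
print(flags, scores)
```
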
### 13.1.3 Understanding Monitor Results
Example from production logs:
```
[MONITOR] Found 3 issues in response 26
  - level_mismatch (medium): Providing specific number '28 ships of the line'
    is quite detailed knowledge that borders on expert-level recall
  - character_break (low): Phrase 'This changed things in a few key ways. First...'
    sounds slightly more like a teacher organizing a lesson
  - level_mismatch (low): Very polished conclusion showing almost no gaps
    in understanding
```
**Interpreting results:**
- **Multiple medium flags** - AI may not be maintaining character consistently
- **Repeated character_break** - System prompt may need adjustment
- **Level_mismatch flags** - Temperature or prompt configuration may need tuning
- **No flags (score 0.7-0.9)** - Healthy student-like response
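
These rules of thumb could be turned into a small helper for reviewing a session's flags, as sketched below. The function `interpret` is hypothetical, and the thresholds are assumptions: "multiple" and "repeated" are read as two or more, which the guidance above does not define.

```python
from collections import Counter


def interpret(flags):
    """Map a session's (flag_type, severity) pairs to follow-up hints."""
    by_type = Counter(flag_type for flag_type, _ in flags)
    by_severity = Counter(severity for _, severity in flags)
    hints = []
    if by_severity["medium"] >= 2:  # assumed meaning of "multiple"
        hints.append("AI may not be maintaining character consistently")
    if by_type["character_break"] >= 2:  # assumed meaning of "repeated"
        hints.append("system prompt may need adjustment")
    if by_type["level_mismatch"] >= 1:
        hints.append("temperature or prompt configuration may need tuning")
    return hints


# The three flags from the log excerpt in this section.
example_flags = [
    ("level_mismatch", "medium"),
    ("character_break", "low"),
    ("level_mismatch", "low"),
]
print(interpret(example_flags))
```
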
### 13.1.4 Testing Monitor Locally
Run the test script from the repository root to verify the monitoring logic:
```bash
cd azure-ai-foundry-claudesonnet-45
python test_monitor.py
```
This tests with intentionally good and bad responses to ensure the monitor can detect issues.
## 13.2 Database Inspection
The monitor saves quality assessments to the database. To inspect:
**Via application export:**
1. Complete an exam session
2. Click the "Export Session" button
3. Review the exported JSON for quality assessments
**Direct database access (local development):**
```bash
sqlite3 chat.db "SELECT * FROM quality_assessments LIMIT 5;"
```
## 13.3 Production Monitoring
**Key metrics to track:**
- Monitor execution rate (should match message count during exams)
- Average quality scores per session
- Flag distribution by type and severity
- Monitor errors or failures
**Health indicators:**
- `[MONITOR] Analyzing response` logs appear for every AI response
- Background tasks complete without errors
- Quality reports accessible via export
---
# 14. Appendix
## A. Example Exam Session Transcript
```
[Session Start: 2026-02-10T14:30:00Z]
[Config: Subject=INF-101, Level=medium, Weaknesses=[kildehenvisning]]

Student: Hei! Kan du forklare hva objektorientert programmering er?

KI-Student: Hei! Eh, objektorientert programmering er en måte å
programmere på der man bruker objekter. Et objekt er liksom... en ting
som har egenskaper og ting den kan gjøre. Som en bil har farge og kan
kjøre. Tror det var noe med klasser også, at en klasse er som en
oppskrift på objekter.

Student: Bra start! Hvor lærte du dette?

KI-Student: Det står i boka vi brukte... tror det var "Java for
nybegynnere" eller noe sånt? Kapittel 3 tror jeg. [MERK: Fiktiv kilde]

Student: Kan du gi et kodeeksempel?

KI-Student: Ja, ehh...
```
```java
class Bil {
    String farge;
    void kjør() {
        System.out.println("Brum brum");
    }
}
```
```
Sånn omtrent? Jeg husker ikke helt syntaksen, men det var noe sånt.

[Session End: 2026-02-10T15:15:00Z]
[Duration: 45 minutes, Messages: 24]
```
## B. Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| AI breaks character | Medium | High | Strict prompts, low temperature |
| System downtime | Low | High | Local SQLite, minimal deps |
| Student games the AI | Medium | Medium | Not for actual grading in PoC |
| Data loss | Low | Medium | Regular manual backups |
| Inconsistent responses | High | Medium | Accept as PoC limitation |
---
**Document End**
*For questions, contact the UiA Development Team.*