DTSC 5082 · Group Project · Model Deployment
Model online
DTSC 5082 · Seminar in Research & Research Methodology · Group Project

Model Deployment
Resume Optimization SLM

Low-Cost Resume Optimization via Distillation of Large Language Model Behavior into a Fine-Tuned Small Language Model (SLM)  ·  Professor: Clifford K. Whitworth

Qwen3-4BBase Model
QLoRAFine-tuning method
1,530Training samples
1,876Mean resume words
94%+Cost reduction
01
Production ReadinessReal-time vs. Batch Inference
Part 1
02
API Interface MockupConceptual Workflow
Part 1
03
Monitoring & MaintenanceOperational Metrics
Part 1
04
Team ContributionsIndividual Responsibilities
Credits
Part 1 — Model Deployment
Section 01
Production Readiness
Real-Time vs. Batch Inference
⚠ Architecture Constraint

Our trained artifact is a LoRA adapter, not a completely standalone model. To generate outputs in production, the system must load both the base Qwen/Qwen3-4B-Instruct-2507 model and the fine-tuned adapter together at runtime. That setup requires suitable hardware, sufficient storage, and a model-serving environment capable of handling the merged inference path. We will only do model deployment.

This project is production-ready in the form of a backend resume-optimization API. In a real-world application, a user would submit a resume together with a target job description, and the service would return a tailored resume in structured JSON format. The project already includes this deployment path through a FastAPI backend with endpoints for health monitoring, schema access, file-based inference, and text-based inference.

The most appropriate production mode for this project is real-time inference. Resume optimization is typically an interactive task in which a user wants one tailored result for one specific job posting, so the main design is request-response based. A secondary batch mode could also be used in future scenarios such as running one resume against multiple job postings, generating offline benchmark outputs, or performing larger-scale testing.

🔴 Real-Time Inference — Primary Mode

User-Facing Interactive Requests

When a candidate submits their resume and a job description, they expect a result promptly. The system handles this through synchronous inference on the self-hosted Qwen3-4B model. At 4-bit quantization, the model requires only 2–3 GB of VRAM, meaning a single A10G or RTX 4090 GPU can serve multiple concurrent users simultaneously — a critical requirement for SaaS scalability.

🔵 Batch Inference — Secondary Mode

Enterprise & Bulk Processing

Batch mode enables an enterprise recruiting use case: a firm wants to optimize many candidate resumes against a single job description overnight. All jobs queue to the same self-hosted GPU, process sequentially during off-hours, and return results by morning — at a marginal infrastructure cost near zero beyond the existing GPU rental.

The recommended deployment design is:

The backend is designed to be run on suitable hardware where the base Qwen model and the project's fine-tuned LoRA adapter can be loaded together. The API layer handles input validation, resume parsing, optional job-description scraping, inference, and JSON response formatting.

Cost Comparison — API vs. Self-Hosted

Deployment ModeMonthly CostLatencyControlVerdict
GPT-4 API (10K tokens/req) ~$8,000/month 3–8s None — external dependency Not viable at scale
Self-hosted Qwen3-4B (A10G GPU) ~$300–500/month 15–40s streamed Full — data stays on-premise Target architecture
Self-hosted Qwen3-4B (RTX 4090) ~$150–250/month 20–50s streamed Full Budget early-stage option

Production Architecture Stack

User Browser
  
FastAPI Backend
  ├── Resume Parser (PDF/DOCX → text) ← OCR pipeline
  ├── Text Cleaner (non-ASCII removal, whitespace normalization)
  ├── Prompt Builder (resume + JD → chat-format JSONL)
  
Inference Server (vLLM / TGI)
  ├── Qwen3-4B-Instruct-2507 base + QLoRA LoRA adapter (4-bit, rank r=16)
  ├── Streaming token output
  
JSON Schema Validator
  ├── Required field presence check
  ├── Hallucination detection (cross-reference vs original resume)
  
Response → User (streamed tailored resume + quality metrics)
Section 02
API Interface Mockup
Conceptual Workflow & Interactive Demo

The following mockup shows exactly how a candidate interacts with the system — from uploading their resume to receiving a structured, validated tailored resume. Every element maps directly to a component of the actual pipeline built in the project's API backend.

Available Live API Endpoints

GET/Root health check
GET/healthModel & hardware status
GET/schemaJSON output schema definition
POST/optimizeFile-based resume inference
POST/optimize/textText-based resume inference
ResumeAI — Qwen3-4B SLM  Self-hosted · LoRA adapter · 4-bit quantized
Model online
Upload resume
Job description
3Configure
4View results
Resume Input
mp_abhinav_resume.pdf 142 KB · 2 pages · text extracted · click to preview ↗
Drop a new file or click to upload (PDF, DOCX)
Resume tokens 1,847 / 10k
JD tokens 0 / 10k
Total prompt 1,847 / 10k
Job Description
JD tokens 312 / 10k
Output Mode
Structured JSON
Plain text
Inference Mode
Real-time (stream)
Batch (queue)
Resume preview
JSON schema
Metrics

MALLARAPU PAVAN ABHINAV

abhinav@email.com · linkedin.com/in/abhinav1426 · Hyderabad, India
Summary

Software Engineer with 3+ years at ADP building scalable microservices, Java/Spring Boot backends, and GenAI-powered data pipelines on AWS and GCP. Experienced in LLM integration, document parsing, and agentic workflows.

Experience
Software Engineer · ADP
June 2022 – Present · Hyderabad, India
  • Built GenAI-powered data extraction system (Project Manthan), cutting onboarding from 6 months to 15 days
  • Designed asynchronous Java application gateway to manage high-traffic microservice routing
  • Managed Spring Boot & JDK lifecycle upgrades across multiple production microservices
  • Contributed to LLM training for complex document parsing and data extraction workflows
  • Resolved security vulnerabilities and optimized Smart Compliance Canada application performance
Skills
JavaSpring BootPython AWSGCPLLM MCPReactJsDocker KubernetesPostgres SQLMongoDB
POST /optimize  ·  REST API endpoint
resume_file: mp_abhinav_resume.pdf
mode: "realtime"
stream: true

← returns: tailored_resume (JSON)
           + schema_valid: true
           + quality_metrics

Conceptual Workflow — How Data Moves Through the System

1

User Input

Upload resume (PDF/DOCX). Paste job description.

2

Parse & Build Prompt

OCR extraction. Chat-format prompt construction.

3

SLM Inference

Qwen3-4B + LoRA adapter (4-bit) generates structured JSON.

4

Validate & Score

Schema check, hallucination detection, quality metrics.

5

Return Result

Streamed resume preview + JSON + quality metrics.

The project also includes saved deployment artifacts that demonstrate the API workflow: the uploaded input file and the generated JSON response show that the backend stores both uploaded content and generated tailored-resume outputs — useful in a production-like environment for traceability, debugging, and quality review.

Section 03
Monitoring & Maintenance
Model Drift, Performance Tracking & Retraining Strategy

A deployed model must be monitored continuously to ensure that it remains reliable and does not become stale as inputs evolve over time. A resume optimizer faces a particularly aggressive form of model drift because the job market vocabulary changes constantly. Skills like "prompt engineering" and "RAG pipelines" barely existed as resume keywords two years ago. A model trained on 1,530 pairs from 2024–2025 will gradually produce outputs that feel dated — missing emerging skills, over-representing obsolete ones, and misaligning with how job descriptions are written today.

⚠ Data Drift

Incoming resumes and job descriptions start looking different from your training distribution. New technology terms enter job postings; shared token ratio calculations become unreliable because the vocabulary the model learned to match is no longer representative of the current market.

⚠ Concept Drift

The relationship between inputs and desired outputs changes even if inputs look similar. A "Senior Data Scientist" posting in 2025 emphasizes LLMs and vector databases; in 2023 it emphasized Spark and Hadoop. The model learned 2024 patterns and anchors outputs to those — not current market reality.

⭐ Output Quality Drift — The Most Dangerous Type

Schema validity and hallucination rates remain stable, but semantic alignment scores quietly decline because the model's vocabulary no longer matches current job market language. This is invisible to structural validators — the JSON is valid, but the content is stale. Without active monitoring of semantic metrics, this drift goes undetected until users start complaining.

Key Operational Metrics

Request Latency

End-to-end response time per API call. Target: under 50s streamed for real-time mode.

API Failure Rate

Percentage of requests returning 4xx/5xx errors. Alert threshold: >2%.

📋

JSON Parse Success Rate

Proportion of outputs successfully parsed against the defined schema structure.

Schema Validity Rate

Whether required schema fields are consistently present in every generated output.

💻

Model & Hardware Status

GPU utilization, VRAM usage, and adapter loading status from the /health endpoint.

🧠

Hallucination Rate

Cross-referencing generated bullets and skills against the original source resume.

Live Monitoring Dashboard — Qwen 4B SLM
All systems nominal
97.2%
Schema validity (7-day avg)
↑ above 93% threshold
29.4
Mean skill items (7-day avg)
≈ baseline 29.95
0.154
Mean shared token ratio
≈ training baseline 0.157
0.4%
Hallucination rate
↓ below 2% threshold
Schema validity rate — 30-day rolling (alert threshold: 93%)Baseline: 97.2%
Mean skill items generated — 30-day rolling (alert threshold: <24)Baseline: 29.95
Active alerts 0 active
Schema validity 97.2% — above 93% threshold. No action needed.
Skill item count 29.4 — above minimum threshold of 24.
Shared token ratio 0.154 — within ±20% of training baseline 0.157.
Hallucination rate 0.4% — below 2% alert threshold.
Input distribution drift — vs training baseline Green = stable · Amber = watch · Red = alert
Resume word count drift
+8% shift
JD word count drift
+5% shift
Shared token ratio drift
−2% shift
Lexical diversity drift
−1% shift
PDF vs DOCX ratio shift
+4% PDF share
Retraining trigger status
Vocabulary drift trigger
Not triggered
18% new vocabThreshold: 30%
Quality degradation trigger
Not triggered
0 of 3 flagsThreshold: 2+ flags
Quarterly scheduled refresh
Due in 47 days
Q2 2026Every 90 days
Retraining Strategy — Trigger-Based, Not Schedule-Based

Rather than retraining on a fixed calendar, the system uses three independent triggers. Any one reaching its threshold initiates a retraining cycle using only the QLoRA adapter weights — not the full Qwen 4B base model. Since adapter weights at rank r=16 are a tiny fraction of the full model, a retraining cycle runs on the same GPU used for inference during off-peak hours at near-zero additional cost.

TriggerConditionActionEst. Cost
Vocabulary drift 30%+ of top-100 JD tokens absent from training vocab over 30-day scrape Queue retraining with fresh Gemini teacher outputs on new JD sample ~$15–30 GPU hours
Quality degradation Schema validity <93% OR skill count <24 OR hallucination >1.5% sustained 14 days Emergency retraining cycle, flag outputs for manual review in interim ~$15–30 GPU hours
Quarterly refresh Every 90 days regardless of alert status Lightweight adapter update on 300–500 new resume–JD pairs ~$8–15 GPU hours

Practical Maintenance Plan

The 94% cost reduction versus GPT-4 API only holds if the self-hosted SLM produces outputs of comparable quality. Monitoring and retraining is what makes the cost savings permanent rather than temporary. A stale model producing mediocre outputs at $300/month is not a better product than a current model producing excellent outputs at $8,000/month.
Section 04
Team Contributions
Deployment Assignment — Individual Responsibilities
VB
Vaishnav Busha
Software Engineer Lead
Led backend architecture design and FastAPI endpoint development. Managed LoRA adapter integration, Spring Boot microservices experience, and GenAI workflow design including the Project Manthan data extraction system.
NM
Nishith Chowdary Mareddy
Model Development
Led QLoRA fine-tuning methodology, adapter training configuration (rank r=16, 4-bit quantization), and vLLM inference server selection. Provided model architecture context for deployment writeup.
JS
Jyothi Swaroop Ganapavarapu
Data & Preprocessing Lead
Managed dataset acquisition (1,530 training pairs), OCR-based resume parsing, multi-format PDF/DOCX handling, and prompt construction pipeline. Built the Gemini batch request workflow for teacher output generation.
TK
Talha Khan
Evaluation & Cost Analysis
Designed the quality metrics framework (schema validity, semantic similarity, skill coverage, hallucination rate). Led the cost modeling comparison showing 94%+ reduction versus GPT-4 API at production scale.
Collaborative Contributions — All Members

Problem framing for production deployment design, API endpoint specification, architecture stack selection, monitoring metric definition, maintenance plan authoring, final document review and integration.