Model Deployment — Resume Optimization SLM

Section 01

Production Readiness

Real-Time vs. Batch Inference

⚠ Architecture Constraint

Our trained artifact is a LoRA adapter, not a completely standalone model. To generate outputs in production, the system must load both the base Qwen/Qwen3-4B-Instruct-2507 model and the fine-tuned adapter together at runtime. That setup requires suitable hardware, sufficient storage, and a model-serving environment capable of handling the merged inference path. We will only do model deployment.

This project is production-ready in the form of a backend resume-optimization API. In a real-world application, a user would submit a resume together with a target job description, and the service would return a tailored resume in structured JSON format. The project already includes this deployment path through a FastAPI backend with endpoints for health monitoring, schema access, file-based inference, and text-based inference.

The most appropriate production mode for this project is real-time inference. Resume optimization is typically an interactive task in which a user wants one tailored result for one specific job posting, so the main design is request-response based. A secondary batch mode could also be used in future scenarios such as running one resume against multiple job postings, generating offline benchmark outputs, or performing larger-scale testing.

🔴 Real-Time Inference — Primary Mode

User-Facing Interactive Requests

When a candidate submits their resume and a job description, they expect a result promptly. The system handles this through synchronous inference on the self-hosted Qwen3-4B model. At 4-bit quantization, the model requires only 2–3 GB of VRAM, meaning a single A10G or RTX 4090 GPU can serve multiple concurrent users simultaneously — a critical requirement for SaaS scalability.

🔵 Batch Inference — Secondary Mode

Enterprise & Bulk Processing

Batch mode enables an enterprise recruiting use case: a firm wants to optimize many candidate resumes against a single job description overnight. All jobs queue to the same self-hosted GPU, process sequentially during off-hours, and return results by morning — at a marginal infrastructure cost near zero beyond the existing GPU rental.

The recommended deployment design is:

Primary mode: Real-time API inference for normal users
Secondary mode: Batch processing for large-scale evaluation or bulk usage

The backend is designed to be run on suitable hardware where the base Qwen model and the project's fine-tuned LoRA adapter can be loaded together. The API layer handles input validation, resume parsing, optional job-description scraping, inference, and JSON response formatting.

Cost Comparison — API vs. Self-Hosted

Deployment Mode	Monthly Cost	Latency	Control	Verdict
GPT-4 API (10K tokens/req)	~$8,000/month	3–8s	None — external dependency	Not viable at scale
Self-hosted Qwen3-4B (A10G GPU)	~$300–500/month	15–40s streamed	Full — data stays on-premise	Target architecture
Self-hosted Qwen3-4B (RTX 4090)	~$150–250/month	20–50s streamed	Full	Budget early-stage option

Production Architecture Stack

User Browser
  ↓
FastAPI Backend
  ├── Resume Parser (PDF/DOCX → text) ← OCR pipeline
  ├── Text Cleaner (non-ASCII removal, whitespace normalization)
  ├── Prompt Builder (resume + JD → chat-format JSONL)
  ↓
Inference Server (vLLM / TGI)
  ├── Qwen3-4B-Instruct-2507 base + QLoRA LoRA adapter (4-bit, rank r=16)
  ├── Streaming token output
  ↓
JSON Schema Validator
  ├── Required field presence check
  ├── Hallucination detection (cross-reference vs original resume)
  ↓
Response → User (streamed tailored resume + quality metrics)

Section 02

API Interface Mockup

Conceptual Workflow & Interactive Demo

The following mockup shows exactly how a candidate interacts with the system — from uploading their resume to receiving a structured, validated tailored resume. Every element maps directly to a component of the actual pipeline built in the project's API backend.

Available Live API Endpoints

GET/Root health check

GET/healthModel & hardware status

GET/schemaJSON output schema definition

POST/optimizeFile-based resume inference

POST/optimize/textText-based resume inference

✓Upload resume

✓Job description

3Configure

4View results

Resume Input

mp_abhinav_resume.pdf 142 KB · 2 pages · text extracted · click to preview ↗

Drop a new file or click to upload (PDF, DOCX)

Resume tokens 1,847 / 10k

JD tokens 0 / 10k

Total prompt 1,847 / 10k

Job Description

Senior Software Engineer — [Company Name]
We are looking for a software engineer with experience in Java, Spring Boot, AWS/GCP, and microservices. Must have proficiency in Python and modern AI/ML workflows. Familiarity with LLMs, document parsing, and agentic systems is a strong plus.

JD tokens 312 / 10k

Output Mode

Structured JSON

Plain text

Inference Mode

Real-time (stream)

Batch (queue)

Resume preview

JSON schema

Metrics

MALLARAPU PAVAN ABHINAV

abhinav@email.com · linkedin.com/in/abhinav1426 · Hyderabad, India

Summary

Software Engineer with 3+ years at ADP building scalable microservices, Java/Spring Boot backends, and GenAI-powered data pipelines on AWS and GCP. Experienced in LLM integration, document parsing, and agentic workflows.

Experience

Software Engineer · ADP

June 2022 – Present · Hyderabad, India

Built GenAI-powered data extraction system (Project Manthan), cutting onboarding from 6 months to 15 days
Designed asynchronous Java application gateway to manage high-traffic microservice routing
Managed Spring Boot & JDK lifecycle upgrades across multiple production microservices
Contributed to LLM training for complex document parsing and data extraction workflows
Resolved security vulnerabilities and optimized Smart Compliance Canada application performance

Skills

JavaSpring BootPython AWSGCPLLM MCPReactJsDocker KubernetesPostgres SQLMongoDB

{
  "name": "MALLARAPU PAVAN ABHINAV",
  "email": "abhinav@email.com",
  "summary": "Software Engineer with...",
  "skills": ["Java", "Spring Boot", ...],
  "experience": [{
    "company": "ADP",
    "bullets": [...]
  }],
  "education": [...],
  "projects": [...]
}

JSON schema valid All required fields present

POST /optimize · REST API endpoint

resume_file: mp_abhinav_resume.pdf
mode: "realtime"
stream: true

← returns: tailored_resume (JSON)
           + schema_valid: true
           + quality_metrics

Conceptual Workflow — How Data Moves Through the System

1

User Input

Upload resume (PDF/DOCX). Paste job description.

2

Parse & Build Prompt

OCR extraction. Chat-format prompt construction.

3

SLM Inference

Qwen3-4B + LoRA adapter (4-bit) generates structured JSON.

4

Validate & Score

Schema check, hallucination detection, quality metrics.

5

Return Result

Streamed resume preview + JSON + quality metrics.

The project also includes saved deployment artifacts that demonstrate the API workflow: the uploaded input file and the generated JSON response show that the backend stores both uploaded content and generated tailored-resume outputs — useful in a production-like environment for traceability, debugging, and quality review.

Section 03

Monitoring & Maintenance

Model Drift, Performance Tracking & Retraining Strategy

A deployed model must be monitored continuously to ensure that it remains reliable and does not become stale as inputs evolve over time. A resume optimizer faces a particularly aggressive form of model drift because the job market vocabulary changes constantly. Skills like "prompt engineering" and "RAG pipelines" barely existed as resume keywords two years ago. A model trained on 1,530 pairs from 2024–2025 will gradually produce outputs that feel dated — missing emerging skills, over-representing obsolete ones, and misaligning with how job descriptions are written today.

⚠ Data Drift

Incoming resumes and job descriptions start looking different from your training distribution. New technology terms enter job postings; shared token ratio calculations become unreliable because the vocabulary the model learned to match is no longer representative of the current market.

⚠ Concept Drift

The relationship between inputs and desired outputs changes even if inputs look similar. A "Senior Data Scientist" posting in 2025 emphasizes LLMs and vector databases; in 2023 it emphasized Spark and Hadoop. The model learned 2024 patterns and anchors outputs to those — not current market reality.

⭐ Output Quality Drift — The Most Dangerous Type

Schema validity and hallucination rates remain stable, but semantic alignment scores quietly decline because the model's vocabulary no longer matches current job market language. This is invisible to structural validators — the JSON is valid, but the content is stale. Without active monitoring of semantic metrics, this drift goes undetected until users start complaining.

Key Operational Metrics

⏱

Request Latency

End-to-end response time per API call. Target: under 50s streamed for real-time mode.

❌

API Failure Rate

Percentage of requests returning 4xx/5xx errors. Alert threshold: >2%.

📋

JSON Parse Success Rate

Proportion of outputs successfully parsed against the defined schema structure.

✅

Schema Validity Rate

Whether required schema fields are consistently present in every generated output.

💻

Model & Hardware Status

GPU utilization, VRAM usage, and adapter loading status from the /health endpoint.

🧠

Hallucination Rate

Cross-referencing generated bullets and skills against the original source resume.

⭐ Live Monitoring Dashboard — Qwen 4B SLM

All systems nominal

97.2%

Schema validity (7-day avg)

↑ above 93% threshold

29.4

Mean skill items (7-day avg)

≈ baseline 29.95

0.154

Mean shared token ratio

≈ training baseline 0.157

0.4%

Hallucination rate

↓ below 2% threshold

Schema validity rate — 30-day rolling (alert threshold: 93%)Baseline: 97.2%

Mean skill items generated — 30-day rolling (alert threshold: <24)Baseline: 29.95

Active alerts 0 active

Schema validity 97.2% — above 93% threshold. No action needed.

Skill item count 29.4 — above minimum threshold of 24.

Shared token ratio 0.154 — within ±20% of training baseline 0.157.

Hallucination rate 0.4% — below 2% alert threshold.

Input distribution drift — vs training baseline Green = stable · Amber = watch · Red = alert

Resume word count drift

+8% shift

JD word count drift

+5% shift

Shared token ratio drift

−2% shift

Lexical diversity drift

−1% shift

PDF vs DOCX ratio shift

+4% PDF share

Retraining trigger status

Vocabulary drift trigger

Not triggered

18% new vocabThreshold: 30%

Quality degradation trigger

Not triggered

0 of 3 flagsThreshold: 2+ flags

Quarterly scheduled refresh

Due in 47 days

Q2 2026Every 90 days

Retraining Strategy — Trigger-Based, Not Schedule-Based

Rather than retraining on a fixed calendar, the system uses three independent triggers. Any one reaching its threshold initiates a retraining cycle using only the QLoRA adapter weights — not the full Qwen 4B base model. Since adapter weights at rank r=16 are a tiny fraction of the full model, a retraining cycle runs on the same GPU used for inference during off-peak hours at near-zero additional cost.

Trigger	Condition	Action	Est. Cost
Vocabulary drift	30%+ of top-100 JD tokens absent from training vocab over 30-day scrape	Queue retraining with fresh Gemini teacher outputs on new JD sample	~$15–30 GPU hours
Quality degradation	Schema validity <93% OR skill count <24 OR hallucination >1.5% sustained 14 days	Emergency retraining cycle, flag outputs for manual review in interim	~$15–30 GPU hours
Quarterly refresh	Every 90 days regardless of alert status	Lightweight adapter update on 300–500 new resume–JD pairs	~$8–15 GPU hours

Practical Maintenance Plan

✓
Log request/response outcomes for each API call — record latency, schema validity, and token counts per request for downstream analysis.
✓
Periodically sample outputs for manual review — a random ~5% sample each week to catch quality regressions that automated metrics may miss.
✓
Check whether parsing quality declines for certain file types — monitor PDF vs. DOCX output quality separately, as the training corpus is 72.4% DOCX.
✓
Monitor changes in job-description vocabulary and document structure — retrain or refresh the LoRA adapter if production input patterns drift significantly from the training distribution.
✓
Track hallucination detection results — ensure generated bullets and credentials trace back to content present in the submitted resume, not invented by the model.

The 94% cost reduction versus GPT-4 API only holds if the self-hosted SLM produces outputs of comparable quality. Monitoring and retraining is what makes the cost savings permanent rather than temporary. A stale model producing mediocre outputs at $300/month is not a better product than a current model producing excellent outputs at $8,000/month.

Section 04

Team Contributions

Deployment Assignment — Individual Responsibilities

VB

Vaishnav Busha

Software Engineer Lead

Led backend architecture design and FastAPI endpoint development. Managed LoRA adapter integration, Spring Boot microservices experience, and GenAI workflow design including the Project Manthan data extraction system.

NM

Nishith Chowdary Mareddy

Model Development

Led QLoRA fine-tuning methodology, adapter training configuration (rank r=16, 4-bit quantization), and vLLM inference server selection. Provided model architecture context for deployment writeup.

JS

Jyothi Swaroop Ganapavarapu

Data & Preprocessing Lead

Managed dataset acquisition (1,530 training pairs), OCR-based resume parsing, multi-format PDF/DOCX handling, and prompt construction pipeline. Built the Gemini batch request workflow for teacher output generation.

TK

Talha Khan

Evaluation & Cost Analysis

Designed the quality metrics framework (schema validity, semantic similarity, skill coverage, hallucination rate). Led the cost modeling comparison showing 94%+ reduction versus GPT-4 API at production scale.

Collaborative Contributions — All Members

Problem framing for production deployment design, API endpoint specification, architecture stack selection, monitoring metric definition, maintenance plan authoring, final document review and integration.

Model DeploymentResume Optimization SLM