Low-Cost Resume Optimization via Distillation of Large Language Model Behavior into a Fine-Tuned Small Language Model (SLM) · Professor: Clifford K. Whitworth
Our trained artifact is a LoRA adapter, not a standalone model. To generate outputs in production, the system must load both the base Qwen/Qwen3-4B-Instruct-2507 model and the fine-tuned adapter at runtime. That setup requires suitable hardware, sufficient storage, and a model-serving environment that can run the combined base-plus-adapter inference path. The scope of this work is limited to model deployment.
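A minimal sketch of loading the base model and the adapter together is given below. It assumes the Hugging Face `transformers`, `peft`, and `bitsandbytes` packages are installed on a CUDA machine; the adapter path `artifacts/lora_adapter` and the specific 4-bit settings are illustrative assumptions, not the project's confirmed configuration.

```python
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"  # base model named in this report
ADAPTER_DIR = "artifacts/lora_adapter"          # hypothetical path to the trained adapter


def load_model_with_adapter(adapter_dir: str = ADAPTER_DIR):
    """Load the quantized base model and attach the LoRA adapter on top of it."""
    # Heavy dependencies are imported lazily so this module can still be
    # imported on a machine without the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumed 4-bit settings (NF4 with bfloat16 compute); adjust to match training.
    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID, quantization_config=quant, device_map="auto"
    )
    # The adapter only adds low-rank weights; it cannot run without the base model.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()
    return tokenizer, model
```

Loading once at service start-up and reusing the returned objects across requests avoids paying the model load time on every call.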
This project is production-ready in the form of a backend resume-optimization API. In a real-world application, a user would submit a resume together with a target job description, and the service would return a tailored resume in structured JSON format. The project already includes this deployment path through a FastAPI backend with endpoints for health monitoring, schema access, file-based inference, and text-based inference.
The most appropriate production mode for this project is real-time inference. Resume optimization is typically an interactive task in which a user wants one tailored result for one specific job posting, so the main design is request-response based. A secondary batch mode could also be used in future scenarios such as running one resume against multiple job postings, generating offline benchmark outputs, or performing larger-scale testing.
When a candidate submits a resume and a job description, they expect a result promptly. The system handles this through synchronous inference on the self-hosted Qwen3-4B model. At 4-bit quantization, the model requires only 2–3 GB of VRAM, so a single A10G or RTX 4090 GPU can serve multiple users concurrently, a critical requirement for SaaS scalability.
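The VRAM figure can be sanity-checked with simple arithmetic. The sketch below counts weight memory only and assumes a nominal 4 billion parameters; KV cache, activations, and adapter weights come on top of it, which is consistent with the 2–3 GB range quoted above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # nominal count implied by the "4B" model name (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized half-precision weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

How many users one GPU can serve at once depends mainly on per-request KV-cache size, which this estimate does not model.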
Batch mode enables an enterprise recruiting use case: a firm wants to optimize many candidate resumes against a single job description overnight. All jobs queue to the same self-hosted GPU, process sequentially during off-hours, and return results by morning — at a marginal infrastructure cost near zero beyond the existing GPU rental.
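The overnight batch flow described above can be sketched as a simple sequential queue. The function names and the result layout are hypothetical, and `fake_optimize` stands in for the real model call so the example runs without a GPU.

```python
from collections import deque
from typing import Callable, Dict, List


def run_batch(
    resumes: Dict[str, str],
    job_description: str,
    optimize: Callable[[str, str], dict],
) -> List[dict]:
    """Process every resume against one job description, one at a time."""
    queue = deque(resumes.items())
    results = []
    while queue:
        candidate_id, resume_text = queue.popleft()
        try:
            output = optimize(resume_text, job_description)
            results.append({"candidate_id": candidate_id, "ok": True, "result": output})
        except Exception as exc:
            # One bad resume should not stop the rest of the overnight run.
            results.append({"candidate_id": candidate_id, "ok": False, "error": str(exc)})
    return results


def fake_optimize(resume_text: str, job_description: str) -> dict:
    """Stand-in for model inference (assumption for this sketch)."""
    if not resume_text.strip():
        raise ValueError("empty resume")
    return {"summary": f"{resume_text[:20]} tailored for {job_description[:20]}"}


batch_results = run_batch(
    {"cand-1": "Java backend engineer", "cand-2": "   "},
    "Senior Backend Engineer",
    fake_optimize,
)
```

Recording failures per candidate instead of raising keeps the morning report complete even when some inputs are unusable.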
The recommended deployment design is as follows. The backend runs on hardware where the base Qwen model and the project's fine-tuned LoRA adapter can be loaded together. The API layer handles input validation, resume parsing, optional job-description scraping, inference, and JSON response formatting.
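The order of the API-layer steps can be illustrated with a small framework-free sketch. The function name, the response keys, and the injected `scrape` and `infer` callables are hypothetical; the project's actual FastAPI handlers may be organized differently, and resume parsing is reduced here to whitespace normalization.

```python
from typing import Callable, Optional


def handle_request(
    resume_text: str,
    jd_text: Optional[str],
    jd_url: Optional[str],
    scrape: Callable[[str], str],
    infer: Callable[[str, str], dict],
) -> dict:
    """Run the API steps in order and return a JSON-serializable response."""
    # 1. Input validation
    if not resume_text or not resume_text.strip():
        return {"status": "error", "detail": "resume is empty"}
    if not jd_text and not jd_url:
        return {"status": "error", "detail": "job description text or URL is required"}

    # 2. Resume parsing (simplified: collapse whitespace)
    resume = " ".join(resume_text.split())

    # 3. Optional job-description scraping, used only when no text was supplied
    jd = jd_text if jd_text else scrape(jd_url)

    # 4. Inference
    tailored = infer(resume, jd)

    # 5. JSON response formatting
    return {"status": "ok", "tailored_resume": tailored}


response = handle_request(
    resume_text="Software Engineer\nJava, Spring Boot",
    jd_text=None,
    jd_url="https://example.com/job/123",
    scrape=lambda url: "Senior Backend Engineer (Java)",
    infer=lambda resume, jd: {"summary": f"{resume} | target: {jd}"},
)
```

Passing the scraper and the model call in as arguments keeps the flow testable without network access or a GPU.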
| Deployment Mode | Monthly Cost | Latency | Control | Verdict |
|---|---|---|---|---|
| GPT-4 API (10K tokens/req) | ~$8,000/month | 3–8s | None — external dependency | Not viable at scale |
| Self-hosted Qwen3-4B (A10G GPU) | ~$300–500/month | 15–40s streamed | Full — data stays on-premise | Target architecture |
| Self-hosted Qwen3-4B (RTX 4090) | ~$150–250/month | 20–50s streamed | Full | Budget early-stage option |
The following mockup shows exactly how a candidate interacts with the system — from uploading their resume to receiving a structured, validated tailored resume. Every element maps directly to a component of the actual pipeline built in the project's API backend.
[Mockup: candidate flow from resume upload to validated tailored resume. Sample summary shown: "Software Engineer with 3+ years at ADP building scalable microservices, Java/Spring Boot backends, and GenAI-powered data pipelines on AWS and GCP. Experienced in LLM integration, document parsing, and agentic workflows." Request fields: resume_file: mp_abhinav_resume.pdf, mode: "realtime", stream: true. Returns: tailored_resume (JSON), schema_valid: true, quality_metrics.]
The project also includes saved deployment artifacts that demonstrate the API workflow: the uploaded input file and the generated JSON response show that the backend stores both uploaded content and generated tailored-resume outputs — useful in a production-like environment for traceability, debugging, and quality review.
A deployed model must be monitored continuously to ensure that it remains reliable and does not become stale as inputs evolve over time. A resume optimizer faces a particularly aggressive form of model drift because the job market vocabulary changes constantly. Skills like "prompt engineering" and "RAG pipelines" barely existed as resume keywords two years ago. A model trained on 1,530 pairs from 2024–2025 will gradually produce outputs that feel dated — missing emerging skills, over-representing obsolete ones, and misaligning with how job descriptions are written today.
Data drift: incoming resumes and job descriptions start to look different from the training distribution. New technology terms enter job postings, and shared token ratio calculations become unreliable because the vocabulary the model learned to match no longer represents the current market.
Concept drift: the relationship between inputs and desired outputs changes even if the inputs look similar. A "Senior Data Scientist" posting in 2025 emphasizes LLMs and vector databases; in 2023 it emphasized Spark and Hadoop. The model learned 2024 patterns and anchors its outputs to those rather than to current market reality.
Silent quality degradation: schema validity and hallucination rates remain stable, but semantic alignment scores quietly decline because the model's vocabulary no longer matches current job market language. This is invisible to structural validators: the JSON is valid, but the content is stale. Without active monitoring of semantic metrics, this drift goes undetected until users start complaining.
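One way to track the semantic side is a shared token ratio between each job description and the tailored output. The definition below (the fraction of distinct job-description tokens that also appear in the tailored resume) and the simple regex tokenizer are assumptions for illustration; the project's own calculation may differ.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word tokens; keeps '+' and '#' so terms like c++ and c# survive."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def shared_token_ratio(job_description: str, tailored_resume: str) -> float:
    """Fraction of distinct job-description tokens found in the tailored resume."""
    jd = tokens(job_description)
    if not jd:
        return 0.0
    return len(jd & tokens(tailored_resume)) / len(jd)


ratio = shared_token_ratio(
    "Python RAG pipelines vector databases",
    "Built Python data pipelines on Spark",
)
print(ratio)  # 2 of the 5 job-description tokens are shared
```

Logging this value per request and watching its rolling average would surface a slow decline that schema validation alone cannot see.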
Latency: end-to-end response time per API call. Target: under 50s streamed for real-time mode.
Error rate: percentage of requests returning 4xx/5xx errors. Alert threshold: >2%.
Schema validity: proportion of outputs that parse successfully against the defined schema.
Field completeness: whether required schema fields are present in every generated output.
Resource health: GPU utilization, VRAM usage, and adapter loading status from the /health endpoint.
Hallucination check: cross-referencing of generated bullets and skills against the original source resume.
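The required-field check and the source cross-reference can be sketched with plain dictionary and string operations. The field names in `REQUIRED_FIELDS` are hypothetical placeholders for the project's schema, and case-insensitive substring matching is a deliberately crude stand-in for whatever matching the real check uses.

```python
REQUIRED_FIELDS = ("summary", "skills", "experience")  # hypothetical schema fields


def missing_fields(output: dict) -> list:
    """Required fields that are absent or empty in a generated output."""
    return [f for f in REQUIRED_FIELDS if f not in output or output[f] in (None, "", [])]


def unsupported_skills(output: dict, source_resume: str) -> list:
    """Generated skills that never appear in the source resume text."""
    source = source_resume.lower()
    return [s for s in output.get("skills", []) if s.lower() not in source]


source_resume = "Built Java and Spring Boot microservices on AWS."
generated = {
    "summary": "Backend engineer focused on microservices.",
    "skills": ["Java", "Spring Boot", "Kubernetes"],
}

missing = missing_fields(generated)
unsupported = unsupported_skills(generated, source_resume)
hallucination_rate = len(unsupported) / len(generated["skills"])
```

Substring matching can miss paraphrases and can falsely accept "Java" when the source only says "JavaScript", so flagged items are best treated as candidates for review rather than confirmed hallucinations.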
Rather than relying on a fixed calendar alone, the system uses three independent triggers. Any one reaching its threshold initiates a retraining cycle that updates only the QLoRA adapter weights, not the full Qwen3-4B base model. Because adapter weights at rank r=16 are a tiny fraction of the full model, a retraining cycle can run on the same GPU used for inference during off-peak hours at near-zero additional cost.
| Trigger | Condition | Action | Est. GPU Cost |
|---|---|---|---|
| Vocabulary drift | 30%+ of top-100 JD tokens absent from training vocab over 30-day scrape | Queue retraining with fresh Gemini teacher outputs on new JD sample | ~$15–30 in GPU hours |
| Quality degradation | Schema validity <93% OR skill count <24 OR hallucination >1.5% sustained 14 days | Emergency retraining cycle, flag outputs for manual review in interim | ~$15–30 in GPU hours |
| Quarterly refresh | Every 90 days regardless of alert status | Lightweight adapter update on 300–500 new resume–JD pairs | ~$8–15 in GPU hours |
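The trigger conditions in the table can be expressed as a small rule check. The thresholds come from the table; the `MonitoringSnapshot` structure, its field names, and the reading of "skill count" as an average per output are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    absent_top100_jd_token_share: float  # share of top-100 JD tokens missing from training vocab
    schema_validity: float               # fraction of outputs passing schema validation
    avg_skill_count: float               # assumed meaning of "skill count" in the table
    hallucination_rate: float            # fraction of generated items not found in the source
    days_quality_breached: int           # consecutive days a quality threshold was breached
    days_since_last_training: int


def retraining_triggers(s: MonitoringSnapshot) -> list:
    """Return the names of all triggers that currently fire."""
    fired = []
    if s.absent_top100_jd_token_share >= 0.30:
        fired.append("vocabulary_drift")
    quality_breach = (
        s.schema_validity < 0.93
        or s.avg_skill_count < 24
        or s.hallucination_rate > 0.015
    )
    if quality_breach and s.days_quality_breached >= 14:
        fired.append("quality_degradation")
    if s.days_since_last_training >= 90:
        fired.append("quarterly_refresh")
    return fired


degraded = MonitoringSnapshot(0.12, 0.91, 26.0, 0.010, 15, 40)
healthy_but_old = MonitoringSnapshot(0.05, 0.98, 27.0, 0.005, 0, 95)
```

Because the triggers are independent, more than one can fire on the same day; a scheduler would typically run a single retraining cycle and record every reason that applied.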
The 94% cost reduction versus the GPT-4 API matters only if the self-hosted SLM produces outputs of comparable quality. Monitoring and retraining are what make the cost savings permanent rather than temporary. A stale model producing mediocre outputs at $300/month is not a better product than a current model producing excellent outputs at $8,000/month.
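The cost-reduction figure follows directly from the monthly costs in the deployment table. The short calculation below uses the A10G range of $300–500 against the $8,000 GPT-4 API estimate; it excludes retraining and engineering costs, which the table does not itemize.

```python
def cost_reduction(baseline_monthly: float, self_hosted_monthly: float) -> float:
    """Fractional saving of the self-hosted option relative to the baseline."""
    return 1 - self_hosted_monthly / baseline_monthly


GPT4_API_MONTHLY = 8000          # from the deployment table
A10G_LOW, A10G_HIGH = 300, 500   # from the deployment table

conservative = cost_reduction(GPT4_API_MONTHLY, A10G_HIGH)  # highest self-hosted cost
optimistic = cost_reduction(GPT4_API_MONTHLY, A10G_LOW)     # lowest self-hosted cost

print(f"Reduction range: {conservative:.2%} to {optimistic:.2%}")
```

The conservative end of the range is 93.75%, which rounds to the 94% quoted above; the lower A10G cost gives a slightly larger saving.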
Problem framing for production deployment design, API endpoint specification, architecture stack selection, monitoring metric definition, maintenance plan authoring, final document review and integration.