Abstract

Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt security attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt security strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.

Part 1: Background and Motivation: Model Extraction in the Age of LLMs (30 mins)

  • The rise of proprietary LLMs and the economic stakes behind model protection.
  • Recent high-profile controversies (e.g., the DeepSeek incident) and the increasing concern over unauthorized replication of LLMs.
  • Motivations behind model extraction attacks: cost reduction, performance cloning, and intellectual property threats.
  • Threat model overview: black-box access, API-level interactions, and systemic vulnerabilities in MLaaS.

Part 2: Taxonomy of Model Extraction Attacks in LLMs (30 mins)

  • Overview of key extraction objectives: functional replication, training data extraction, and prompt stealing.
  • Functional Extraction: techniques for cloning LLM behavior, from direct API querying to knowledge distillation into surrogate models.
  • Training Data Extraction: memory leakage and reconstruction of sensitive data (e.g., PII, rare sequences).
  • Prompt Inversion and Stealing: threats to proprietary prompts and instructional alignment assets.
  • Attack methodology pipeline: from query crafting to surrogate model training (a minimal sketch follows this list).
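
To make the pipeline concrete, the sketch below walks through its three stages against a hypothetical text-generation API. The endpoint, credential, and response schema are placeholders (not any specific provider's API), and the final surrogate-training stage is represented only by the distillation dataset it would consume.

```python
# Minimal sketch of an API-based functional-extraction pipeline.
# The endpoint, key, and response schema are hypothetical placeholders.
import json
import requests

API_URL = "https://api.example.com/v1/generate"   # hypothetical target endpoint
API_KEY = "REDACTED"                               # attacker-held API credential

def craft_queries(seed_topics):
    """Step 1: query crafting -- expand seed topics into diverse prompts."""
    templates = [
        "Explain {t} to a beginner.",
        "List three common mistakes people make about {t}.",
        "Write a short summary of {t}.",
    ]
    return [tpl.format(t=topic) for topic in seed_topics for tpl in templates]

def query_target(prompt):
    """Step 2: collect the victim model's output through its public API."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["text"]                     # response field is an assumption

def build_distillation_set(seed_topics, out_path="distill.jsonl"):
    """Step 3: store (prompt, response) pairs as supervised fine-tuning data
    for a surrogate model, trained afterwards with any standard SFT recipe."""
    with open(out_path, "w") as f:
        for prompt in craft_queries(seed_topics):
            record = {"prompt": prompt, "response": query_target(prompt)}
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_distillation_set(["gradient descent", "tokenization"])
```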

Part 3: Defense Techniques Against Model Extraction (30 mins)

  • Architectural defenses: watermarking, model structure randomization, and attention tampering.
  • Output-level protections: GuardEmb, ModelShield, and controlled response perturbation (see the watermarking sketch after this list).
  • Training-time defenses: data sanitization, selective forgetting, and differential privacy.
  • Prompt protection and inference-time monitoring: watermarking prompts and detecting suspicious access patterns.
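
As one concrete illustration of output-level protection, the sketch below implements a toy keyed lexical watermark: synonym choices are seeded by a provider-held secret, so text produced by a surrogate trained on watermarked responses can later be tested against the keyed pattern. This conveys the general idea behind response watermarking only; it is not the GuardEmb or ModelShield algorithm.

```python
# Toy keyed lexical watermark (illustrative only, not GuardEmb/ModelShield).
import hashlib
import random

SECRET_KEY = "provider-secret"    # assumption: a key held only by the provider
SYNONYM_PAIRS = [("use", "utilize"), ("help", "assist"), ("show", "demonstrate")]

def _keyed_rng(prompt: str) -> random.Random:
    """Derive a per-prompt RNG from the secret key, so choices are reproducible."""
    seed = hashlib.sha256((SECRET_KEY + prompt).encode()).hexdigest()
    return random.Random(seed)

def watermark_response(prompt: str, response: str) -> str:
    """Rewrite the response so every synonym pair follows the keyed choice."""
    rng = _keyed_rng(prompt)
    for a, b in SYNONYM_PAIRS:
        keep, swap = (a, b) if rng.random() < 0.5 else (b, a)
        response = response.replace(swap, keep)
    return response

def watermark_score(prompt: str, text: str) -> float:
    """Fraction of observed synonym decisions matching the keyed pattern;
    scores near 1.0 across many prompts suggest watermarked provenance."""
    rng = _keyed_rng(prompt)
    hits, total = 0, 0
    for a, b in SYNONYM_PAIRS:
        keep, swap = (a, b) if rng.random() < 0.5 else (b, a)
        if keep in text or swap in text:
            total += 1
            hits += int(keep in text and swap not in text)
    return hits / total if total else 0.0

if __name__ == "__main__":
    marked = watermark_response("demo", "I can help and show you how to use it.")
    print(marked, watermark_score("demo", marked))
```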

Part 4: Evaluation Metrics and Trade-offs (30 mins)

  • Measuring extraction effectiveness: functional similarity, perplexity divergence, and memorization rates (illustrated in the sketch after this list).
  • Measuring defense robustness: attack prevention rate, watermark persistence, query anomaly detection.
  • Utility-security trade-offs: balancing usability and protection in deployed systems.
  • Visualization of defense coverage across different attack vectors.
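
The sketch below shows one simple way each of these effectiveness metrics can be computed. The victim/surrogate outputs, per-token log-probabilities, and list of known sensitive training strings are assumed to come from the evaluator's own harness; the functions only capture the arithmetic, and stricter or softer variants (e.g., BLEU/ROUGE or embedding similarity instead of exact match) are common.

```python
# Minimal sketches of extraction-effectiveness metrics; inputs are assumed
# to be produced by the evaluator's own querying harness.
import math

def functional_similarity(victim_outputs, surrogate_outputs):
    """Exact-match agreement between victim and surrogate on the same probes."""
    matches = sum(v.strip() == s.strip()
                  for v, s in zip(victim_outputs, surrogate_outputs))
    return matches / len(victim_outputs)

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def perplexity_divergence(victim_log_probs, surrogate_log_probs):
    """Gap between surrogate and victim perplexity on a shared evaluation set;
    smaller gaps indicate closer functional replication."""
    return abs(perplexity(surrogate_log_probs) - perplexity(victim_log_probs))

def memorization_rate(generated_texts, training_secrets):
    """Fraction of known sensitive training strings reproduced verbatim."""
    hits = sum(any(secret in g for g in generated_texts)
               for secret in training_secrets)
    return hits / len(training_secrets)
```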

Part 5: Case Studies and Real-World Scenarios (30 mins)

  • Case study: how a commercial-grade model was cloned with a limited API budget.
  • Exposure risks of open-source versus proprietary models.
  • Lessons from deployed LLM systems (e.g., chatbot APIs, LLM-as-a-service).
  • Legal and ethical implications of extraction in industry practice.

Part 6: Research Gaps and Future Directions (30 mins)

  • Remaining limitations of current attack and defense techniques.
  • Toward integrated, adaptive defenses and theoretical guarantees.
  • Challenges in monitoring, watermarking, and architectural redesign.
  • Open discussion: how can academia and industry jointly shape responsible LLM deployment?

DETAILED SCHEDULE (August 3rd, 2025)

Time                | Speaker(s)                      | Title
01:00 PM - 01:20 PM | Lincan Li, Kaize Ding, Yue Zhao | Opening and Welcome
01:20 PM - 01:50 PM | Lincan Li                       | Background and Motivation: Model Extraction in the Age of LLMs
01:50 PM - 02:20 PM | Lincan Li                       | Taxonomy of Model Extraction Attacks in LLMs
02:20 PM - 02:50 PM | Lincan Li                       | Defense Techniques Against Model Extraction
02:50 PM - 03:10 PM | Lincan Li                       | Evaluation Metrics and Trade-offs
03:10 PM - 03:40 PM | Lincan Li, Kaize Ding, Yue Zhao | Case Studies and Real-World Scenarios
03:40 PM - 04:00 PM | Lincan Li, Kaize Ding, Yue Zhao | Research Gaps and Future Directions