
Abstract

Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt security attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt security strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.

Part 1: Background and Motivation: Model Extraction in the Age of LLMs (30 mins)

  • The rise of proprietary LLMs and the economic stakes behind model protection.
  • Recent high-profile controversies (e.g., the DeepSeek incident) and the increasing concern over unauthorized replication of LLMs.
  • Motivations behind model extraction attacks: cost reduction, performance cloning, and intellectual property threats.
  • Threat model overview: black-box access, API-level interactions, and systemic vulnerabilities in MLaaS.

Part 2: Taxonomy of Model Extraction Attacks in LLMs (30 mins)

  • Overview of key extraction objectives: functional replication, training data extraction, and prompt stealing.
  • Functional Extraction: techniques for cloning LLM behavior using API queries or distilled models.
  • Training Data Extraction: memory leakage and reconstruction of sensitive data (e.g., PII, rare sequences).
  • Prompt Inversion and Stealing: threats to proprietary prompts and instructional alignment assets.
  • Attack methodology pipeline: from query crafting to surrogate model training (see the sketch below).
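
A minimal sketch of the pipeline named in the last bullet, under stated assumptions: the API endpoint, the key, the toy query list, and the choice of GPT-2 as the surrogate are illustrative placeholders, not any particular victim system or published attack. The three stages are query crafting, transfer-set collection from the black-box API, and supervised fine-tuning of a local surrogate on the victim's responses.

```python
# Sketch of an API-based extraction pipeline (query crafting -> transfer-set
# collection -> surrogate fine-tuning). Endpoint, key, and models are placeholders.

import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

API_URL = "https://api.example.com/v1/generate"  # hypothetical victim endpoint
API_KEY = "..."                                  # placeholder credential

def query_victim(prompt: str) -> str:
    """Send one crafted query to the black-box victim API and return its text output."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 128},
        timeout=30,
    )
    return resp.json()["text"]

# 1) Query crafting: real attacks draw on task-specific seed corpora or
#    LLM-generated paraphrases; a toy list stands in here.
prompts = ["Summarize: ...", "Translate to French: ...", "Explain step by step: ..."]

# 2) Transfer-set collection: pair each crafted query with the victim's response.
transfer_set = [(p, query_victim(p)) for p in prompts]

# 3) Surrogate training: fine-tune a small open model on the (query, response)
#    pairs with the standard causal language-modeling loss.
tok = AutoTokenizer.from_pretrained("gpt2")
surrogate = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(surrogate.parameters(), lr=5e-5)

surrogate.train()
for prompt, response in transfer_set:
    batch = tok(prompt + " " + response, return_tensors="pt",
                truncation=True, max_length=512)
    loss = surrogate(**batch, labels=batch["input_ids"]).loss  # imitate the victim's outputs
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```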

Part 3: Defense Techniques Against Model Extraction (30 mins)

  • Architectural defenses: watermarking, model structure randomization, and attention tampering.
  • Output-level protections: GuardEmb, ModelShield, and controlled response perturbation (a generic watermarking sketch follows this list).
  • Training-time defenses: data sanitization, selective forgetting, and differential privacy.
  • Prompt protection and inference-time monitoring: watermarking prompts and detecting suspicious access patterns.
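
As a concrete illustration of output-level protection, the sketch below applies a keyed "green-list" logit bias during decoding and later tests suspect text for the resulting statistical signal. This is a generic decoding-time watermark, not the GuardEmb or ModelShield schemes cited in the references; the secret key, bias strength, and the use of GPT-2 are assumptions made purely for illustration.

```python
# Illustrative "green-list" output watermark: a keyed hash of the previous token
# selects a pseudo-random subset of the vocabulary whose logits are boosted before
# sampling; a detector later counts how many tokens fall in their green lists.
# Generic decoding-time scheme for illustration only (not GuardEmb / ModelShield).

import hashlib
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

SECRET_KEY = "tutorial-demo-key"  # assumption: held privately by the model provider
GREEN_FRACTION = 0.5              # share of the vocabulary boosted at each step
BIAS = 2.0                        # logit bonus added to green tokens

def green_ids(prev_token: int, vocab_size: int) -> torch.Tensor:
    """Keyed, pseudo-random green list seeded by the previous token."""
    digest = hashlib.sha256(f"{SECRET_KEY}:{prev_token}".encode()).hexdigest()
    gen = torch.Generator().manual_seed(int(digest, 16) % (2**31))
    return torch.randperm(vocab_size, generator=gen)[: int(GREEN_FRACTION * vocab_size)]

class GreenListWatermark(LogitsProcessor):
    """Boost green-list logits at every decoding step (embeds the watermark)."""
    def __call__(self, input_ids, scores):
        for i in range(scores.shape[0]):
            scores[i, green_ids(int(input_ids[i, -1]), scores.shape[-1])] += BIAS
        return scores

def green_fraction(token_ids, vocab_size: int) -> float:
    """Detection: share of tokens in their keyed green list (~0.5 if unwatermarked)."""
    hits = sum(int(t in set(green_ids(p, vocab_size).tolist()))
               for p, t in zip(token_ids, token_ids[1:]))
    return hits / max(len(token_ids) - 1, 1)

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Model extraction defenses often rely on", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                     logits_processor=LogitsProcessorList([GreenListWatermark()]))

# Score only the generated continuation, keeping the last prompt token as the
# seed for the first generated token's green list.
gen_ids = out[0, inputs["input_ids"].shape[1] - 1:].tolist()
print(f"green-token fraction: {green_fraction(gen_ids, model.config.vocab_size):.2f}")
```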

Part 4: Evaluation Metrics and Trade-offs (20 mins)

  • Measuring extraction effectiveness: functional similarity, perplexity divergence, memorization rates (see the metric sketch after this list).
  • Measuring defense robustness: attack prevention rate, watermark persistence, query anomaly detection.
  • Utility-security trade-offs: balancing usability and protection in deployed systems.
  • Visualization of defense coverage across different attack vectors.
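
Two of the attack-side metrics above can be made concrete with a short sketch: functional similarity measured as exact-match agreement between victim and surrogate outputs on a probe set, and perplexity divergence measured as the gap in per-token perplexity the two models assign to shared reference texts. Here "gpt2" and "distilgpt2" merely stand in for a victim and an extracted surrogate.

```python
# Sketch of two extraction-effectiveness metrics: exact-match agreement (a crude
# functional-similarity score) and perplexity divergence on shared reference text.
# "gpt2" / "distilgpt2" are stand-ins for the victim and the extracted surrogate.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def greedy_answer(model, tok, prompt: str, max_new_tokens: int = 32) -> str:
    """Deterministic continuation used to compare the two models' behavior."""
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

def perplexity(model, tok, text: str) -> float:
    """exp of the mean per-token negative log-likelihood the model assigns to text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**ids, labels=ids["input_ids"]).loss
    return math.exp(loss.item())

tok = AutoTokenizer.from_pretrained("gpt2")                      # shared tokenizer
victim = AutoModelForCausalLM.from_pretrained("gpt2")            # placeholder victim
surrogate = AutoModelForCausalLM.from_pretrained("distilgpt2")   # placeholder surrogate

probes = ["The capital of France is", "Water boils at a temperature of"]
references = ["Model extraction attacks replicate a deployed model's behavior via queries."]

# Functional similarity: fraction of probes on which the surrogate reproduces the
# victim's greedy output exactly (BLEU or embedding similarity are softer alternatives).
agreement = sum(greedy_answer(victim, tok, p) == greedy_answer(surrogate, tok, p)
                for p in probes) / len(probes)

# Perplexity divergence: average absolute gap between the two models' perplexities.
ppl_gap = sum(abs(perplexity(victim, tok, r) - perplexity(surrogate, tok, r))
              for r in references) / len(references)

print(f"functional similarity (exact match): {agreement:.2f}")
print(f"perplexity divergence: {ppl_gap:.2f}")
```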

Part 5: Case Studies and Real-World Scenarios (30 mins)

  • Case study: how a commercial-grade model was cloned with limited API budget.
  • Open-source vs. proprietary model exposure risk.
  • Lessons from deployed LLM systems (e.g., chatbot APIs, LLM-as-a-service).
  • Legal and ethical implications of extraction in industry practice.

Part 6: Research Gaps and Future Directions (20 mins)

  • Remaining limitations of current attack and defense techniques.
  • Toward integrated, adaptive defenses and theoretical guarantees.
  • Challenges in monitoring, watermarking, and architectural redesign.
  • Open discussion: how can academia and industry jointly shape responsible LLM deployment?

DETAILED SCHEDULE (August 3rd, 2025)

Time | Speaker | Title
01:00 PM - 01:20 PM | Lincan Li, Kaize Ding, Yue Zhao | Opening and Welcome
01:20 PM - 01:50 PM | Lincan Li | Background and Motivation: Model Extraction in the Age of LLMs
01:50 PM - 02:20 PM | Lincan Li | Taxonomy of Model Extraction Attacks in LLMs
02:20 PM - 02:50 PM | Lincan Li | Defense Techniques Against Model Extraction
02:50 PM - 03:10 PM | Lincan Li | Evaluation Metrics and Trade-offs
03:10 PM - 03:40 PM | Lincan Li, Kaize Ding, Yue Zhao | Case Studies and Real-World Scenarios
03:40 PM - 04:00 PM | Lincan Li, Kaize Ding, Yue Zhao | Research Gaps and Future Directions

Related Materials


A Survey on Model Extraction Attacks and Defenses for Large Language Models

Authors: Kaixiang Zhao*, Lincan Li*, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, Yushun Dong. Proceedings of the 31st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2025).

  • Present a systematic taxonomy of model extraction attacks targeting LLMs, including functionality extraction, training data extraction, and prompt-targeted attacks.
  • Review state-of-the-art defenses, including model protection, data privacy protection, and prompt protection.
  • Analyze key tradeoffs and introduce evaluation metrics specific to the LLM threat landscape.
  • Identify open research challenges and suggest future directions for building more robust and secure LLMs.
  • A valuable resource for researchers, engineers, and practitioners working to secure LLMs.
  • Paper


PyGIP: a comprehensive Python library focused on model extraction attacks and defenses in Graph Neural Networks.

  • We developed PyGIP with built-in datasets and implementations of popular model extraction attack and defense algorithms.
  • Built on PyTorch, PyTorch Geometric, and DGL, the library offers a robust framework for understanding, implementing, and defending against attacks targeting graph learning models.
  • Core Contributors & Acknowledgements: Bolin Shen, Yuxiang Sun, Lincan Li, Chenxi Zhao, Kaixiang Zhao, Zaiyi Zheng, Yushun Dong.

    GitHub

Key References

  • [1] Birch, L., Hackett, W., Trawicki, S., Suri, N. and Garraghan, P., 2023. Model leeching: An extraction attack targeting llms. arXiv preprint arXiv:2309.10544.
  • [2] Carlini, N., Paleka, D., Dvijotham, K.D., Steinke, T., Hayase, J., Cooper, A.F., Lee, K., Jagielski, M., Nasr, M., Conmy, A. and Yona, I., 2024. Stealing part of a production language model. arXiv preprint arXiv:2403.06634.
  • [3] Krishna, K., Tomar, G.S., Parikh, A.P., Papernot, N. and Iyyer, M., 2019. Thieves on sesame street! model extraction of bert-based apis. arXiv preprint arXiv:1910.12366.
  • [4] Chen, C., He, X., Lyu, L. and Wu, F., 2021. Killing one bird with two stones: model extraction and attribute inference attacks against bert-based apis. arXiv preprint arXiv:2105.10909.
  • [5] He, X., Lyu, L., Xu, Q. and Sun, L., 2021. Model extraction and adversarial transferability, your BERT is vulnerable! arXiv preprint arXiv:2103.10013.
  • [6] Xu, Q., He, X., Lyu, L., Qu, L. and Haffari, G., 2021. Student surpasses teacher: Imitation attack for black-box NLP APIs. arXiv preprint arXiv:2108.13873.
  • [7] Yao, Y., Xiao, Z., Wang, B., Viswanath, B., Zheng, H. and Zhao, B.Y., 2017, November. Complexity vs. performance: empirical analysis of machine learning as a service. In Proceedings of the 2017 Internet Measurement Conference (pp. 384-397).
  • [8] Li, C., Song, Z., Wang, W. and Yang, C., 2023. A theoretical insight into attack and defense of gradient leakage in transformer. arXiv preprint arXiv:2311.13624.
  • [9] Liu, Y., Jia, J., Liu, H. and Gong, N.Z., 2022, November. Stolenencoder: stealing pre-trained encoders in self-supervised learning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (pp. 2115-2128).
  • [10] Nazari, N., Xiang, F., Fang, C., Makrani, H.M., Puri, A., Patwari, K., Sayadi, H., Rafatirad, S., Chuah, C.N. and Homayoun, H., 2024, April. Llm-fin: Large language models fingerprinting attack on edge devices. In 2024 25th International Symposium on Quality Electronic Design (ISQED) (pp. 1-6). IEEE.
  • [11] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U. and Oprea, A., 2021. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21) (pp. 2633-2650).
  • [12] Huang, J., Shao, H. and Chang, K.C.C., 2022. Are large pre-trained language models leaking your personal information?. arXiv preprint arXiv:2205.12628.
  • [13] Wang, J.G., Wang, J., Li, M. and Neel, S., 2024. Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models. arXiv preprint arXiv:2402.17012.
  • [14] Dai, C., Lu, L. and Zhou, P., 2025. Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack. arXiv preprint arXiv:2502.16086.
  • [15] Parikh, R., Dupuy, C. and Gupta, R., 2022. Canary extraction in natural language understanding models. arXiv preprint arXiv:2203.13920.
  • [16] Yang, Z., Zhao, Z., Wang, C., Shi, J., Kim, D., Han, D. and Lo, D., 2024, April. Unveiling memorization in code models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (pp. 1-13).
  • [17] Hui, B., Yuan, H., Gong, N., Burlina, P. and Cao, Y., 2024, December. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (pp. 3600-3614).
  • [18] Sha, Z. and Zhang, Y., 2024. Prompt stealing attacks against large language models. arXiv preprint arXiv:2402.12959.
  • [19] Yang, Y., Li, C., Li, Q., Ma, O., Wang, H., Wang, Z., Gao, Y., Chen, W. and Ji, S., 2025. {PRSA}: Prompt Stealing Attacks against {Real-World} Prompt Services. In 34th USENIX Security Symposium (USENIX Security 25) (pp. 2283-2302).
  • [20] Jiang, Z., Li, M., Yang, G., Wang, J., Huang, Y., Chang, Z. and Wang, Q., 2025. Mimicking the familiar: Dynamic command generation for information theft attacks in llm tool-learning system. arXiv preprint arXiv:2502.11358.
  • [21] Xu, J., Wang, F., Ma, M.D., Koh, P.W., Xiao, C. and Chen, M., 2024. Instructional fingerprinting of large language models. arXiv preprint arXiv:2401.12255.
  • [22] Zhang, C., Morris, J.X. and Shmatikov, V., 2024. Extracting prompts by inverting llm outputs. arXiv preprint arXiv:2405.15012.
  • [23] Li, Q., Shen, Z., Qin, Z., Xie, Y., Zhang, X., Du, T., Cheng, S., Wang, X. and Yin, J., 2024, October. TransLinkGuard: safeguarding Transformer models against model stealing in edge deployment. In Proceedings of the 32nd ACM International Conference on Multimedia (pp. 3479-3488).
  • [24] Li, Q., Xie, Y., Du, T., Shen, Z., Qin, Z., Peng, H., Zhao, X., Zhu, X., Yin, J. and Zhang, X., 2024. Coreguard: Safeguarding foundational capabilities of llms against model stealing in edge deployment. arXiv preprint arXiv:2410.13903.
  • [25] Pang, K., Qi, T., Wu, C., Bai, M., Jiang, M. and Huang, Y., 2025. Modelshield: Adaptive and robust watermark against model extraction attack. IEEE Transactions on Information Forensics and Security.
  • [26] Wang, L. and Cheng, M., 2024, November. GuardEmb: Dynamic Watermark for Safeguarding Large Language Model Embedding Service Against Model Stealing Attack. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 7518-7534).
  • [27] Feng, S. and Tramèr, F., 2024. Privacy backdoors: Stealing data with corrupted pretrained models. arXiv preprint arXiv:2404.00473.
  • [28] Patil, V., Hase, P. and Bansal, M., 2023. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410.
  • [29] Li, Q., Hong, J., Xie, C., Tan, J., Xin, R., Hou, J., Yin, X., Wang, Z., Hendrycks, D., Wang, Z. and Li, B., 2024. Llm-pbe: Assessing data privacy in large language models. arXiv preprint arXiv:2408.12787.
  • [30] Wang, Z., Yang, F., Wang, L., Zhao, P., Wang, H., Chen, L., Lin, Q. and Wong, K.F., 2023. Self-guard: Empower the llm to safeguard itself. arXiv preprint arXiv:2310.15851.
  • [31] He, X., Xu, Q., Zeng, Y., Lyu, L., Wu, F., Li, J. and Jia, R., 2022. Cater: Intellectual property protection on text generation apis via conditional watermarks. Advances in Neural Information Processing Systems, 35, pp.5431-5445.
  • [32] Kim, M., Kwon, T., Shim, K. and Kim, B., 2024, October. Protection of LLM Environment Using Prompt Security. In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC) (pp. 1715-1719). IEEE.

Contact

For any questions regarding this tutorial, please reach out to Lincan Li at ll24bb@fsu.edu.

Powered by Lincan Li in 2025.