A Comprehensive Analysis on Edge Deployment and Optimization of LLMs

DOI: https://doi.org/10.54097/jcb63r28

Keywords: LLMs; edge deployment; quantization; lightweight models; edge-cloud collaboration

Abstract
After years of development, various large language models (LLMs) have demonstrated impressive capabilities across many fields. Traditionally, the cloud has been the preferred deployment option for most applications, since their remarkable reasoning performance relies on large and continuously available computational power. Cloud deployment has costs, however: increased latency, bandwidth consumption overheads, and weak privacy protections make it unsuitable for certain scenarios. To address these challenges, research interest in localized deployment of LLMs has grown, although such deployment introduces issues of its own, including the limited computational capability, memory capacity, and power budget of edge devices. This paper first provides an overview of different LLM architectures and then introduces several approaches proposed to address these issues. Multiple directions are covered: algorithm-level techniques such as quantization, lightweight or distilled models, and MoE-based architectures; hardware acceleration; algorithm-hardware co-design; and edge-cloud collaboration such as split inference strategies. A range of studies is referenced to provide evidence and analysis for each approach. Together, these offer insight into the current research status, difficulties, and potential opportunities in the field.
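Among the directions listed above, quantization is the most mechanically simple. As a rough illustration only (not drawn from the paper, and not any specific scheme it surveys), the sketch below shows symmetric per-tensor INT8 post-training quantization of a weight matrix; the function names and the NumPy setup are assumptions made for demonstration.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch).

    Returns the quantized weights plus the scale needed to map
    them approximately back to float.
    """
    # Scale so the largest absolute weight lands on the INT8 limit 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

# Usage: quantize a random stand-in "weight matrix", check the error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Production schemes refine this basic recipe with per-channel scales, activation-aware calibration, and lower bit widths, but the memory arithmetic (1 byte per parameter instead of 4 for FP32) is what makes fitting an LLM into edge-device memory plausible in the first place.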