|
Guihong Li
I am a Member of Technical Staff (MTS) at AMD, where I focus on efficient generative AI models and large-scale training systems. I received my Ph.D. from UT Austin in May 2024. My research has been published at top-tier venues including NeurIPS, COLM, ICLR, ICML, and CVPR.
Research Focus: Building next-generation efficient LLM architectures and scalable training infrastructure for foundation models. I specialize in:
- Efficient LLM Architectures: Developed hybrid models achieving up to 50x KV cache compression and 3.6x inference speedup through Multi-Latent Attention (MLA), Mamba state-space layers, and distillation strategies.
- Large-Scale Training Systems: Optimized distributed training strategies for dense and MoE models across PyTorch, JAX, and Megatron-LM. Delivered production-ready training systems for enterprise customers.
Email / Google Scholar / Twitter / LinkedIn
|
|
Research Directions
- Efficient LLM Architectures: Developing hybrid models (MLA-Mamba) achieving up to 50x KV cache compression and 3.6x inference speedup (see the sizing sketch after this list). Research on upcycling pre-trained attention mechanisms, autoregressive-to-block-diffusion adaptation, and byte-level LLM distillation.
- Large-Scale Training Systems: Optimizing distributed training with data parallelism (DP), fully sharded data parallelism (FSDP), and expert parallelism (EP) for dense and MoE models across PyTorch, JAX, and Megatron-LM. Delivered production-ready MI300X GPU training systems to enterprise customers (e.g., Cohere).
- Research Output: Published 15+ papers at top-tier venues (NeurIPS, ICLR, ICML, CVPR, COLM) with a focus on efficient architectures and training systems.
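The KV cache compression figures above come down to simple arithmetic: standard attention stores a full key and value vector per token per layer, while latent-style layers store far less. A back-of-the-envelope sizing sketch, with purely illustrative layer counts and dimensions (not the configuration of any specific released model):

```python
# Back-of-the-envelope KV cache sizing that illustrates why latent/linear
# attention helps. Layer counts and dimensions are illustrative assumptions.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Standard attention caches one key and one value vector per token per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

full_cache = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32768)
# An MLA-style layer caches a single low-dimensional latent per token instead.
latent_cache = 32 * 512 * 32768 * 2   # layers * latent_dim * seq_len * bytes
print(f"{full_cache / 2**30:.0f} GiB vs {latent_cache / 2**30:.0f} GiB")  # 16 GiB vs 1 GiB
```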
|
|
News
- 09-2025: One paper accepted to NeurIPS 2025
- 09-2025: We developed, tested, and delivered scalable, reliable large-scale MI300X GPU training systems to our customer (blog)!
- 09-2025: We trained and open-sourced AMD's first hybrid models (combining linear attention and multi-head attention), now highlighted on the AMD website (blog)!
- 05-2025: One paper accepted to COLM 2025
- 03-2025: One paper accepted to CVPR 2025
- 03-2025: AMD released the first version of its unified training docker; I proposed this idea and was deeply involved in its development (blog)
- 10-2024: AMD released the new AMD MI325X GPU and ROCm 6.2; I was deeply involved in both releases (blog)!
- 11-2024: One paper accepted to WACV 2025
- 03-2024: One paper accepted to T-PAMI
- 03-2024: Two papers accepted to CVPR 2024
- 01-2024: Two papers accepted to ICLR 2024
- 09-2023: One paper accepted to NeurIPS 2023
- 05-2023: One paper accepted to ICML 2023
- 02-2023: One paper accepted to CVPR 2023
- 01-2023: One paper accepted to ICLR 2023 as Spotlight
|
|
Member of Technical Staff @ AMD AI Group
Bellevue, WA · June 2024 - Present
- Efficient LLM Architectures: Developed Zebra-Llama (50x KV cache compression, 3.6x speedup), AMD's first hybrid model combining MLA and Mamba, and X-EcoMLA (12.8x compression, 2x speedup), which upcycles pre-trained attention into MLA. Ongoing research on autoregressive-to-block-diffusion adaptation and byte-level LLM distillation.
- Large-Scale Training Optimization: Performance analysis and optimization of distributed training for dense and MoE models in PyTorch and JAX (a minimal FSDP sketch follows this list). Delivered production MI300X training systems to enterprise customers.
- Infrastructure & Tools: Proposed and led development of AMD's unified training docker. Built efficient fine-tuning recipes for MI300X. Contributed to the MI325X GPU and ROCm 6.2 releases.
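For context on the distributed-training work above, a minimal PyTorch FSDP loop is the kind of building block involved. This is a generic sketch on a toy model with random data, not AMD's production training stack:

```python
# Minimal sketch of fully sharded data-parallel (FSDP) training in PyTorch;
# a generic illustration only, not a production configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                      # launch with torchrun, one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a real Transformer; FSDP shards its params, grads, and optimizer state.
model = FSDP(torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024),
).cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(10):                                  # dummy training loop on random data
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```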
Applied Scientist Intern @ JPMorgan Chase & Co
New York, NY · June 2023 - October 2023
Mentors: Dr. Richard Chun-Fu Chen, Dr. Hsiang Hsu
- Trustworthy Generative Models: Controlled the content generated by image generative models.
- Efficient Machine Unlearning: Built an efficient machine unlearning algorithm to quickly remove information from a trained model.
Research Scientist Intern @ ARM ML Tech
San Jose, CA · May 2021 - August 2021
Mentors: Dr. Kartikeya Bhardwaj, Dr. Naveen Suda, Dr. Lingchuan Meng
- Hardware-Aware NAS: Explored neural architecture search techniques to find hardware-efficient models.
- Hardware Performance Evaluation: Built a model to estimate neural network latency on neural accelerators.
|
Zebra-Llama: Towards Extremely Efficient Hybrid Models
Mingyu Yang*, Mehdi Rezagholizadeh*, Guihong Li*, Vikram Appia, and Emad Barsoum. (*Equal contribution)
NeurIPS, 2025, paper
AMD's first hybrid MLA-Mamba LLM, achieving 50x KV cache compression and 3.6x inference speedup through a novel three-stage distillation and layer-selection strategy.
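For readers unfamiliar with the distillation component, a generic teacher-student logit-distillation objective of the kind used when transferring a pre-trained Transformer into a hybrid student looks like the sketch below; the temperature and weighting are illustrative assumptions, and this is not the paper's exact three-stage recipe.

```python
# Generic logit distillation loss: KL on temperature-softened logits plus
# cross-entropy on labels. Illustrative only; not the paper's exact recipe.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T and match them via KL divergence.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard next-token cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kd + (1.0 - alpha) * ce
```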
|
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li*, Mehdi Rezagholizadeh*, Mingyu Yang*, Vikram Appia, and Emad Barsoum. (*Equal contribution)
COLM, 2025, paper
Novel technique for upcycling GQA/MHA modules to Multi-Latent Attention, achieving 12.8x KV cache compression and 2x inference speedup with minimal quality degradation.
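The core idea X-EcoMLA builds on can be pictured as caching one small latent per token and re-expanding it into keys and values at attention time. The module below is a simplified sketch with illustrative dimensions (RoPE handling and other details are omitted), not the released implementation:

```python
# Simplified sketch of MLA-style KV compression: store one low-dimensional
# latent per token and up-project to keys/values at attention time.
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values

    def forward(self, hidden, cache):
        # hidden: [batch, new_tokens, d_model]; cache: [batch, past_tokens, d_latent]
        cache = torch.cat([cache, self.down(hidden)], dim=1)  # only the latent is cached
        return self.up_k(cache), self.up_v(cache), cache

# In this toy setting the cache grows by d_latent (512) values per token instead
# of 2 * n_heads * d_head (8192), a 16x reduction.
```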
|
Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
Guihong Li, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, Radu Marculescu.
IEEE T-PAMI, 2024, paper
|
Machine Unlearning for Image-to-Image Generative Models
Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu
ICLR, 2024, paper
Efficient machine unlearning algorithm to quickly remove information from trained generative models, addressing trustworthiness and data privacy concerns.
|
Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation
Hsiang Hsu, Guihong Li, Shaohan Hu, Chun-Fu Chen
ICLR, 2024, paper
|
Efficient Low-rank Backpropagation for Vision Transformer Adaptation
Yuedong Yang, Hung-Yueh Chiang, Guihong Li, Diana Marculescu, Radu Marculescu
NeurIPS, 2023, paper
|
TIPS: Topologically Important Path Sampling for Anytime Neural Networks
Guihong Li, Kartikeya Bhardwaj, Yuedong Yang, Radu Marculescu
ICML, 2023, paper
|
Efficient On-device Training via Gradient Filtering
Yuedong Yang, Guihong Li, Radu Marculescu
CVPR, 2023, paper
|
ZiCo: Zero-shot NAS via inverse Coefficient of Variation on Gradients
Guihong Li, Yuedong Yang, Kartikeya Bhardwaj, Radu Marculescu
ICLR, 2023 (Spotlight), paper
Zero-shot neural architecture search using gradient variation analysis, enabling efficient model discovery without training.
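As a rough illustration of the gradient-statistics idea, the sketch below scores an untrained network by the mean/std ratio of per-parameter gradient magnitudes over a few mini-batches. It is a simplified proxy in the spirit of ZiCo, not the paper's exact formulation.

```python
# Simplified ZiCo-style zero-shot proxy: rank untrained networks by the
# mean/std ratio of gradient magnitudes across a few mini-batches.
import torch

def gradient_cv_score(model, loss_fn, batches):
    per_param = {n: [] for n, p in model.named_parameters() if p.requires_grad}
    for x, y in batches:                                   # a handful of mini-batches
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                per_param[n].append(p.grad.detach().abs().flatten())
    score = 0.0
    for grads in per_param.values():
        if len(grads) < 2:
            continue
        g = torch.stack(grads)                             # [num_batches, num_params]
        mean, std = g.mean(dim=0), g.std(dim=0)
        score += torch.log((mean / (std + 1e-9)).sum() + 1e-9).item()
    return score                                           # higher = better candidate (proxy)
```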
|
How does topology influence gradient propagation and model performance of deep networks with DenseNet-type skip connections?
Kartikeya Bhardwaj*, Guihong Li*, Radu Marculescu. (*Equal contribution)
CVPR, 2021, paper
|
|
Technical Expertise
LLM Architectures: Multi-Latent Attention (MLA), Mamba, GQA, MHA, Transformers, MoE models, Diffusion models
Distributed Training Frameworks: PyTorch, JAX, Megatron-LM, DeepSpeed
Optimization: Knowledge distillation, Model compression, Quantization, Efficient fine-tuning (LoRA, Adapters)
Hardware: AMD MI300X, MI325X, NVIDIA GPUs, TPUs
Systems: Docker, Kubernetes, SLURM, ROCm, CUDA
|
|