Guihong Li

I'm a Member of Technical Staff (MTS) at AMD, where I focus on efficient Generative AI models and large-scale training systems. I received my Ph.D. from UT Austin in May 2024. My research has been published at top-tier venues including NeurIPS, COLM, ICLR, ICML, and CVPR.

Research Focus: Building next-generation efficient LLM architectures and scalable training infrastructure for foundation models. I specialize in:

  • Efficient LLM Architectures: Developed hybrid models achieving up to 50x KV cache compression and 3.6x inference speedup through Multi-head Latent Attention (MLA), Mamba state-space layers, and distillation strategies.
  • Large-Scale Training Systems: Optimized distributed training (DP, FSDP, EP) for dense and MoE models across PyTorch, JAX, and Megatron-LM. Delivered production-ready training systems for enterprise customers.

Email  /  Google Scholar  /  Twitter  /  LinkedIn

profile photo
Research Directions
  • Efficient LLM Architectures: Developing hybrid models (MLA-Mamba) that achieve up to 50x KV cache compression and 3.6x inference speedup, along with research on upcycling pre-trained attention mechanisms, autoregressive-to-block-diffusion adaptation, and byte-level LLM distillation.
  • Large-Scale Training Systems: Optimizing distributed training with data parallelism (DP), fully sharded data parallelism (FSDP), and expert parallelism (EP) for dense and MoE models across PyTorch, JAX, and Megatron-LM. Delivered production-ready MI300X GPU training systems to enterprise customers (e.g., Cohere). A minimal FSDP sketch follows this list.
  • Research Output: Published 15+ papers at top-tier venues (NeurIPS, ICLR, ICML, CVPR, COLM) with a focus on efficient architectures and training systems.
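
For readers curious what the FSDP setup mentioned above looks like in practice, here is a minimal, illustrative PyTorch sketch; the toy model, batch size, and hyperparameters are placeholders and do not reflect the production MI300X configuration.

  # Minimal FSDP sketch: shard parameters, gradients, and optimizer state across GPUs.
  # Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
  # The model and data below are placeholders, not a production configuration.
  import torch
  import torch.distributed as dist
  from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

  def main():
      dist.init_process_group("nccl")           # one process per GPU
      local_rank = dist.get_rank() % torch.cuda.device_count()
      torch.cuda.set_device(local_rank)

      # Toy stand-in for a transformer / MoE model.
      model = torch.nn.Sequential(
          torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
      ).cuda()
      model = FSDP(model)                        # wrap: parameters are sharded across ranks
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

      for _ in range(10):                        # dummy training loop
          batch = torch.randn(8, 4096, device="cuda")
          loss = model(batch).pow(2).mean()
          loss.backward()
          optimizer.step()
          optimizer.zero_grad()

      dist.destroy_process_group()

  if __name__ == "__main__":
      main()
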
News

  • 09-2025: One paper accepted to NeurIPS 2025
  • 09-2025: We developed, tested, and delivered scalable and reliable large-scale MI300X GPU training systems to our customer (blog)!
  • 09-2025: We trained and open-sourced AMD's first hybrid models (combining linear attention and multi-head attention); they are highlighted on the AMD website (blog)!
  • 05-2025: One paper accepted to COLM 2025
  • 03-2025: One paper accepted to CVPR 2025
  • 03-2025: AMD released the first version of the unified training Docker image; I proposed this idea and was deeply involved throughout its development (blog)
  • 11-2024: One paper accepted to WACV 2025
  • 10-2024: AMD released the new AMD MI325X GPU and ROCm 6.2; I was deeply involved in both releases (blog)!
  • 03-2024: One paper accepted to T-PAMI
  • 03-2024: Two papers accepted to CVPR 2024
  • 01-2024: Two papers accepted to ICLR 2024
  • 09-2023: One paper accepted to NeurIPS 2023
  • 05-2023: One paper accepted to ICML 2023
  • 02-2023: One paper accepted to CVPR 2023
  • 01-2023: One paper accepted to ICLR 2023 as Spotlight

Professional Experience

Member of Technical Staff @ AMD AI Group
Bellevue, WA · June 2024 - Present

  • Efficient LLM Architectures: Developed Zebra-Llama (50x KV cache compression, 3.6x speedup) and X-EcoMLA (12.8x compression, 2x speedup), AMD's first hybrid models combining MLA and Mamba. Ongoing research on autoregressive-to-block-diffusion adaptation and byte-level LLM distillation.
  • Large-Scale Training Optimization: Analyzed and optimized distributed-training performance for dense and MoE models in PyTorch and JAX. Delivered production MI300X training systems to enterprise customers.
  • Infrastructure & Tools: Proposed and led the development of AMD's unified training Docker image. Built efficient fine-tuning recipes for MI300X. Contributed to the MI325X GPU and ROCm 6.2 releases.

Applied Scientist Intern @ JPMorgan Chase & Co.
New York, NY · June 2023 - October 2023
Mentors: Dr. Richard Chun-Fu Chen, Dr. Hsiang Hsu

  • Trustworthy Generative Models: Developed methods to control the content generated by image generative models.
  • Efficient Machine Unlearning: Built an efficient machine unlearning algorithm to quickly remove information from a trained model.

Research Scientist Intern @ ARM ML Tech
San Jose, CA · May 2021 - August 2021
Mentors: Dr. Kartikeya Bhardwaj, Dr. Naveen Suda, Dr. Lingchuan Meng

  • Hardware-Aware NAS: Explored neural architecture search techniques to discover hardware-efficient models.
  • Hardware Performance Evaluation: Built a model to estimate neural networks' latency on neural accelerators.
Publications (Selected)

A full publication list is available on my Google Scholar.

Zebra-Llama: Towards Extremely Efficient Hybrid Models
Mingyu Yang*, Mehdi Rezagholizadeh*, Guihong Li*, Vikram Appia, and Emad Barsoum. (*Equal contribution)
NeurIPS, 2025, paper

AMD's first hybrid MLA-Mamba LLM, achieving 50x KV cache compression and 3.6x inference speedup through a novel 3-stage distillation and layer-selection strategy.
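
As context for the distillation terminology above, here is a generic token-level knowledge-distillation loss (temperature-scaled KL between teacher and student logits plus cross-entropy). It is a textbook recipe for illustration only, not Zebra-Llama's 3-stage pipeline, and the function name and constants are mine.

  # Generic knowledge-distillation loss: soft targets from a teacher plus hard labels.
  # Illustrative only; this is not the Zebra-Llama 3-stage distillation recipe.
  import torch
  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
      # Soft-target term: KL(teacher || student) at temperature T, scaled by T^2.
      kd = F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      # Hard-target term: standard next-token cross-entropy.
      ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
      return alpha * kd + (1.0 - alpha) * ce

  # Tiny usage example with random tensors standing in for model outputs.
  B, L, V = 2, 8, 100
  print(distillation_loss(torch.randn(B, L, V), torch.randn(B, L, V),
                          torch.randint(0, V, (B, L))).item())
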

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li*, Mehdi Rezagholizadeh*, Mingyu Yang*, Vikram Appia, and Emad Barsoum. (*Equal contribution)
COLM, 2025, paper

Novel technique for upcycling GQA/MHA modules into Multi-head Latent Attention (MLA), achieving 12.8x KV cache compression and 2x inference speedup with minimal quality degradation.

Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
Guihong Li, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, Radu Marculescu.
IEEE T-PAMI, 2024, paper

Machine Unlearning for Image-to-Image Generative Models
Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu
ICLR, 2024, paper

Efficient machine unlearning algorithm to quickly remove information from trained generative models, addressing trustworthiness and data privacy concerns.

Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation
Hsiang Hsu, Guihong Li, Shaohan Hu, Chun-Fu Chen
ICLR, 2024, paper

Efficient Low-rank Backpropagation for Vision Transformer Adaptation
Yuedong Yang, Hung-Yueh Chiang, Guihong Li, Diana Marculescu, Radu Marculescu
NeurIPS, 2023, paper

TIPS: Topologically Important Path Sampling for Anytime Neural Networks
Guihong Li, Kartikeya Bhardwaj, Yuedong Yang, Radu Marculescu
ICML, 2023, paper

Efficient On-device Training via Gradient Filtering
Yuedong Yang, Guihong Li, Radu Marculescu
CVPR, 2023, paper

ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients
Guihong Li, Yuedong Yang, Kartikeya Bhardwaj, Radu Marculescu
ICLR, 2023 (Spotlight), paper

Zero-shot neural architecture search using gradient variation analysis, enabling efficient model discovery without training.
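
To give a flavor of how such a training-free proxy can be computed, the sketch below scores an untrained network by the mean-to-standard-deviation ratio of its per-parameter gradients over a few minibatches. It is a rough approximation in the spirit of ZiCo, not the paper's exact formulation; the function name and constants are mine.

  # Rough ZiCo-style proxy sketch: architectures whose gradients are consistent across
  # minibatches (high mean / low std) score higher. Illustrative approximation only.
  import math
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  def gradient_consistency_score(model, batches):
      per_param_grads = {n: [] for n, p in model.named_parameters() if p.requires_grad}
      for x, y in batches:
          model.zero_grad()
          F.cross_entropy(model(x), y).backward()
          for n, p in model.named_parameters():
              if p.grad is not None:
                  per_param_grads[n].append(p.grad.detach().abs().flatten())

      score = 0.0
      for grads in per_param_grads.values():
          if len(grads) < 2:
              continue
          g = torch.stack(grads)                         # [num_batches, num_elements]
          ratio = (g.mean(0) / (g.std(0) + 1e-9)).sum()  # inverse coefficient of variation
          if ratio > 0:
              score += math.log(ratio.item())
      return score

  # Usage on a toy classifier with random data standing in for a real dataset.
  net = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
  data = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))) for _ in range(3)]
  print(gradient_consistency_score(net, data))
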

How Does Topology Influence Gradient Propagation and Model Performance of Deep Networks With DenseNet-Type Skip Connections?
Kartikeya Bhardwaj*, Guihong Li*, Radu Marculescu. (*Equal contribution)
CVPR, 2021, paper

Technical Expertise

LLM Architectures: Multi-head Latent Attention (MLA), Mamba, GQA, MHA, Transformers, MoE models, Diffusion models
Distributed Training Frameworks: PyTorch, JAX, Megatron-LM, DeepSpeed
Optimization: Knowledge distillation, Model compression, Quantization, Efficient fine-tuning (LoRA, Adapters)
Hardware: AMD MI300X, MI325X, NVIDIA GPUs, TPUs
Systems: Docker, Kubernetes, SLURM, ROCm, CUDA


Website template credit to Dr. Jon Barron
Last updated: December 2025