LLM Inference Engines: Optimizing Performance - Detailed Analysis & Overview

LLM Inference Engines: Optimizing Performance

In this AI Research Roundup episode, Alex discusses the paper: 'A Survey on

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

LLM inference

Why Inference is hard..

Follow me: X: https://x.com/calebfoundry LinkedIn: https://www.linkedin.com/in/calebeom/ TikTok: ...

What Is Llama.cpp? The LLM Inference Engine for Local AI

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...

LLM Inference Engines: vLLM, KV Cache, Paged attention and Continuous Batching.

https://cefboud.com/posts/inside-

High Performance LLM Inference in Production

The era of actually open AI is here. We've spent the past year helping leading organizations deploy open models and

Optimizing LLM Training and Inference Performance on GPUs - Faradawn Yang

Connect with Faradawn - https://www.linkedin.com/in/faradawn/ ✓ Connect with

LLM inference optimization

Optimizing LLM inference

Inference Office Hours with SGLang: Performance Optimizations for LLM Serving

Join us to find out the latest

Your local LLM is 10x slower than it should be

Here's the one change that took mine from ~120 tok/s to 1200+ without a new GPU. TryHackMe just launched Cyber Security 101 ...

Faster LLMs: Accelerate Inference with Speculative Decoding

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...

AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA

Video 1 of 6 | Mastering

Maximize LLM Inference Performance + Auto-Profile/Optimize PyTorch/CUDA Code

Talk #1: Everything You Need to Know About Reducing Voice-Agent Latency (by Philip Kiely @ Baseten) Rolling your own ...

Understanding the LLM Inference Workload - Mark Moyou, NVIDIA

Understanding the

Deep Dive into Inference Optimization for LLMs with Philip Kiely

Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI ...

Building an LLM Inference Engine on Apple Silicon - Part 1: How GPT Actually Works

This is Part 1 of a series where I build and

Optimize LLM inference with vLLM

Ready to serve your large language models faster, more efficiently, and at a lower cost? Discover how vLLM, a high-throughput ...

WebLLM: A high-performance in-browser LLM Inference engine

In this talk, Charlie Ruan from MLC will focus on WebLLM, a high-

FriendliAI: High-Performance LLM Serving and Inference Optimization Platform

Friendli AI is a specialized platform focused on delivering high-

Deep Dive: Optimizing LLM inference

In this video, we zoom in on
