AI Notes
Audio-Text