Short answer: uncountably many. You can cook up a new loss by twiddling a temperature, composing with a monotone transform, or duct-taping two others with a Lagrange multiplier. Humans keep inventing them because apparently wheels weren’t enough.
Useful answer: most losses fall into a few big tribes. Here’s the map so you don’t drown.
Core families
Squared-error & friends (Bregman divergences)
MSE, MAE, Huber, Quantile/Pinball, Tukey biweight.
MSE is the textbook Bregman divergence; the others trade that clean convex story for robustness. How hard you punish outliers is just a knob.
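A minimal NumPy sketch of that knob, using Huber as the hand-off between an MSE-style and an MAE-style penalty (the delta threshold and names are just illustrative):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails (robust to outliers)."""
    r = y_pred - y_true
    quad = 0.5 * r**2                         # MSE-like region, |r| <= delta
    lin = delta * (np.abs(r) - 0.5 * delta)   # MAE-like region, |r| > delta
    return np.where(np.abs(r) <= delta, quad, lin).mean()
```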
Proper scoring rules for probabilistic models
Log loss / NLL (cross-entropy), Brier, Continuous Ranked Probability Score (CRPS), Energy score.
“Proper” means you minimize expected loss by telling the truth about your beliefs. Rare concept online.
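To see properness rather than take my word for it, here's a tiny NumPy check with log loss and Brier on synthetic Bernoulli(0.7) outcomes (the 0.7 and the sample size are arbitrary):

```python
import numpy as np

def log_loss(p, y, eps=1e-12):
    """Negative log-likelihood of a Bernoulli forecast p for outcomes y in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def brier(p, y):
    """Brier score: squared error between forecast probability and outcome."""
    return ((p - y) ** 2).mean()

# Properness in action: if outcomes really are Bernoulli(0.7), reporting 0.7
# gives lower average loss than reporting anything else.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=100_000)
for q in (0.5, 0.7, 0.9):
    forecast = np.full(y.shape, q)
    print(q, log_loss(forecast, y), brier(forecast, y))
```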
f-divergences between distributions
KL, reverse KL, Jensen-Shannon, Hellinger, total variation.
GANs implicitly target JS or other f-divergences via variational bounds.
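A quick NumPy sketch of KL and JS on discrete distributions, mostly to show KL's asymmetry (the epsilon clipping is just to dodge log(0)):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; asymmetric, explodes where q ~ 0 but p > 0."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, bounded KL against the mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(kl(p, q), kl(q, p), js(p, q))  # note KL's asymmetry vs. JS's symmetry
```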
Integral Probability Metrics (IPMs)
Wasserstein (Earth-Mover), MMD, Energy distance.
Popular when you want geometry, not just overlap.
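As one concrete IPM, here's a minimal biased MMD² estimate with an RBF kernel; the bandwidth is a real hyperparameter that I'm simply hard-coding for the sketch:

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y under an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(256, 2))
y = rng.normal(0.5, 1.0, size=(256, 2))   # shifted distribution
print(mmd2_rbf(x, y))                     # noticeably > 0 when the samples differ
```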
Margin/surrogate classification losses
Hinge, squared hinge, logistic, exponential, focal, label-smoothing CE.
All of them trade off calibration, margin behavior, and gradient quality in different ways.
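A small sketch of binary focal loss in NumPy, since it makes the gradient-behavior trade-off concrete: gamma = 0 is plain cross-entropy, larger gamma down-weights easy examples:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss: cross-entropy down-weighted on easy, well-classified examples."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    return (-(1 - pt) ** gamma * np.log(pt)).mean()

# gamma = 0 recovers plain cross-entropy; larger gamma focuses training on hard examples.
```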
Ranking/ordinal/structured
Pairwise (BPR), listwise (ListNet/ListMLE), NDCG surrogates, contrastive InfoNCE, triplet, ordinal regression losses.
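As the simplest member of the pairwise camp, a BPR sketch in NumPy (written with logaddexp so large negative margins don't overflow):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR pairwise loss: -log sigmoid(score_pos - score_neg), written stably."""
    margin = pos_scores - neg_scores
    return np.logaddexp(0.0, -margin).mean()
```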
Geometric/metric learning
Contrastive, triplet, N-pair, ArcFace/CosFace, Center loss.
Pull similar things together, push different things apart, like high school lunch tables.
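The lunch-table dynamics in code: a minimal triplet loss with an illustrative margin:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: each anchor should be closer to its positive than to its negative, by a margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```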
Regression beyond L2
Quantile (τ-pinball), expectile, asymmetric Huber, log-cosh, Poisson/NegBin deviance for counts.
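The pinball loss is the one worth internalizing, since its minimizer is the τ-quantile; a tiny NumPy version:

```python
import numpy as np

def pinball(y_true, y_pred, tau=0.9):
    """Quantile (pinball) loss: asymmetric penalty whose minimizer is the tau-quantile."""
    r = y_true - y_pred
    return np.maximum(tau * r, (tau - 1.0) * r).mean()

# tau = 0.5 recovers (half of) MAE; tau = 0.9 penalizes under-prediction 9x harder than over-prediction.
```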
Generative modeling
Likelihood-based: NLL for flows/autoregressive models; ELBO for VAEs (recon + KL).
Implicit: GAN objectives (non-saturating, WGAN + GP, f-GANs).
Score/diffusion: Denoising score matching, v-prediction, ε-prediction, hybrid DSM+CE.
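A sketch of the two ELBO terms for a VAE, assuming a unit-variance Gaussian decoder and a diagonal Gaussian encoder, with constants dropped:

```python
import numpy as np

def elbo_terms(x, x_recon, mu, log_var):
    """Negative ELBO pieces for a Gaussian VAE: reconstruction error + KL(q(z|x) || N(0, I))."""
    recon = 0.5 * ((x - x_recon) ** 2).sum(axis=1)                    # Gaussian NLL up to constants
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1)  # analytic KL to a standard normal
    return (recon + kl).mean(), recon.mean(), kl.mean()
```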
Self-supervised/contrastive
InfoNCE, SimCLR, MoCo, BYOL’s predictor loss, Barlow Twins redundancy reduction, VICReg invariance-variance-covariance.
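InfoNCE in miniature: matched rows of two embedding batches are positives, everything else in the batch is a negative (the temperature is hard-coded for the sketch):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: row i of z1 should match row i of z2 against all other rows."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                           # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                          # true pairs sit on the diagonal
```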
Segmentation/detection
Dice/F1 loss, Tversky, focal Tversky, IoU/GIoU/DIoU/CIoU, Hungarian matching loss for DETR-style models.
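A soft Dice loss sketch, with predicted probabilities standing in for set membership:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), with probabilities in place of sets."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```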
Sequence models
Token CE with label smoothing, CTC loss, RNNT, policy-gradient-style sequence risk (min Bayes risk, expected BLEU).
Reinforcement learning
Policy gradient (REINFORCE), entropy-regularized PG, PPO's clipped objective, TRPO's surrogate, Q-learning TD errors (usually Huber), actor-critic value losses, distributional RL (C51, QR-DQN), offline RL penalties (CQL, BCQ), reward-model CE for RLHF, and DPO's direct preference loss.
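For flavor, the PPO clipped surrogate on its own, given log-probs and advantages you've already computed elsewhere:

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: limit how far the policy ratio can move in one update."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()   # minimize the negative surrogate
```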
Multimodal & retrieval
Symmetric cross-entropy + contrastive (CLIP), matching losses with temperature scaling, MIL/NCE variants.
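A CLIP-style symmetric contrastive sketch: one cross-entropy over rows (image→text) and one over columns (text→image) of the similarity matrix, with an illustrative temperature:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE a la CLIP: cross-entropy over both axes of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def ce_rows(m):
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()           # matched pairs sit on the diagonal

    return 0.5 * (ce_rows(logits) + ce_rows(logits.T))
```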
Physics-informed & constraints
PDE residual MSE, boundary loss, hard/soft constraint penalties, differentiable simulators with task + physics consistency.
Calibration & uncertainty
Temperature scaling CE, ECE/Brier surrogates, NLL with proper priors, Dirichlet calibration losses.
Fairness/causal/robustness
Group DRO worst-case risk, CVaR risk, adversarial training (minimax with perturbation loss), IRM/IRMv1 penalties, counterfactual invariance losses.
Regularization terms that moonlight as “losses”
L1/L2 weight decay, spectral norm penalties, Jacobian/Frobenius penalties, orthogonality, sparsity (L0/Top-k), mutual information bounds, gradient penalties (WGAN-GP), consistency/EMA losses.
Why “infinite” isn’t just snark
Parameterized families: proper scoring rules and Bregman divergences each form a whole continuum. Pick a convex generator, get a loss.
Transforms/compositions: any strictly increasing transform of the objective preserves the argmin; weighted sums give new trade-offs; curricula anneal temperatures and margins.
Task-specific constraints: stick any domain residual or soft constraint onto your objective and congratulations, you invented Loss-XXL-2025.
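To make "pick a generator, get a loss" concrete, here's a Bregman divergence built from its convex generator; the two generators below recover squared Euclidean distance and (generalized) KL:

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 1.5, 2.0])

# Generator ||v||^2 gives squared Euclidean distance ...
sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)

# ... while the negative-entropy generator gives (generalized) KL.
p, q = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])
gen_kl = bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, p, q)

print(sq, gen_kl)
```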
Picking one without losing your mind
Predicting labels? Start with cross-entropy; try focal for class imbalance; consider label smoothing for better calibration.
Predicting numbers? MSE for Gaussian-ish noise, MAE/Huber for robustness, quantile for intervals.
Densities? Use NLL if you can write the likelihood; else try Wasserstein/MMD or a GAN variant.
Matching or retrieval? Go contrastive/InfoNCE with a temperature.
RL? Use the algorithm’s surrogate (PPO/TRPO) plus value loss and entropy.
Segmentation/detection? Mix CE + Dice/Tversky or GIoU for boxes.
Worried about worst-case groups or adversaries? Group DRO or adversarial loss.
Care about calibration? Optimize NLL and post-hoc calibrate.
So yes, infinitely many. But 95% of useful practice lives in a few dozen patterns, and the rest are glam-rock remixes with new hyperparameters and a different arXiv figure style.