Short answer: uncountably many. You can cook up a new loss by twiddling a temperature, composing with a monotone transform, or duct-taping two others with a Lagrange multiplier. Humans keep inventing them because apparently wheels weren’t enough.
Useful answer: most losses fall into a few big tribes. Here’s the map so you don’t drown.
Core families
Squared-error & friends (Bregman divergences)
MSE, MAE, Huber, Quantile/Pinball, Tukey biweight.
MSE is the textbook Bregman divergence; the others trade that clean convex story for robustness. How hard you punish outliers is just a knob.
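A minimal NumPy sketch of that knob, using Huber as the hand-off between an MSE-style and an MAE-style penalty (the delta threshold and names are just illustrative):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails (robust to outliers)."""
    r = y_pred - y_true
    quad = 0.5 * r**2                         # MSE-like region, |r| <= delta
    lin = delta * (np.abs(r) - 0.5 * delta)   # MAE-like region, |r| > delta
    return np.where(np.abs(r) <= delta, quad, lin).mean()
```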
Proper scoring rules for probabilistic models
Log loss / NLL (cross-entropy), Brier, Continuous Ranked Probability Score (CRPS), Energy score.
“Proper” means you minimize expected loss by telling the truth about your beliefs. Rare concept online.
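To see properness rather than take my word for it, here's a tiny NumPy check with log loss and Brier on synthetic Bernoulli(0.7) outcomes (the 0.7 and the sample size are arbitrary):

```python
import numpy as np

def log_loss(p, y, eps=1e-12):
    """Negative log-likelihood of a Bernoulli forecast p for outcomes y in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def brier(p, y):
    """Brier score: squared error between forecast probability and outcome."""
    return ((p - y) ** 2).mean()

# Properness in action: if outcomes really are Bernoulli(0.7), reporting 0.7
# gives lower average loss than reporting anything else.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=100_000)
for q in (0.5, 0.7, 0.9):
    forecast = np.full(y.shape, q)
    print(q, log_loss(forecast, y), brier(forecast, y))
```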
f-divergences between distributions
KL, reverse KL, Jensen-Shannon, Hellinger, total variation.
GANs implicitly target JS or other f-divergences via variational bounds.
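A quick NumPy sketch of KL and JS on discrete distributions, mostly to show KL's asymmetry (the epsilon clipping is just to dodge log(0)):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; asymmetric, explodes where q ~ 0 but p > 0."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, bounded KL against the mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
print(kl(p, q), kl(q, p), js(p, q))  # note KL's asymmetry vs. JS's symmetry
```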
Integral Probability Metrics (IPMs)
Wasserstein (Earth-Mover), MMD, Energy distance.
Popular when you want geometry, not just overlap.
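As one concrete IPM, here's a minimal biased MMD² estimate with an RBF kernel; the bandwidth is a real hyperparameter that I'm simply hard-coding for the sketch:

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y under an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(256, 2))
y = rng.normal(0.5, 1.0, size=(256, 2))   # shifted distribution
print(mmd2_rbf(x, y))                     # noticeably > 0 when the samples differ
```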
Margin/surrogate classification losses
Hinge, squared hinge, logistic, exponential, focal, label-smoothing CE.
All of them trade off calibration, margin behavior, and gradient quality in different ways.
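A small sketch of binary focal loss in NumPy, since it makes the gradient-behavior trade-off concrete: gamma = 0 is plain cross-entropy, larger gamma down-weights easy examples:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss: cross-entropy down-weighted on easy, well-classified examples."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    return (-(1 - pt) ** gamma * np.log(pt)).mean()

# gamma = 0 recovers plain cross-entropy; larger gamma focuses training on hard examples.
```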
Ranking/ordinal/structured
Pairwise (BPR), listwise (ListNet/ListMLE), NDCG surrogates, contrastive InfoNCE, triplet, ordinal regression losses.
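As the simplest member of the pairwise camp, a BPR sketch in NumPy (written with logaddexp so large negative margins don't overflow):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR pairwise loss: -log sigmoid(score_pos - score_neg), written stably."""
    margin = pos_scores - neg_scores
    return np.logaddexp(0.0, -margin).mean()
```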
Geometric/metric learning
Contrastive, triplet, N-pair, ArcFace/CosFace, Center loss.
Pull similar things together, push different things apart, like high school lunch tables.
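The lunch-table dynamics in code: a minimal triplet loss with an illustrative margin:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: each anchor should be closer to its positive than to its negative, by a margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```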
Regression beyond L2
Quantile (τ-pinball), expectile, asymmetric Huber, log-cosh, Poisson/NegBin deviance for counts.
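The pinball loss is the one worth internalizing, since its minimizer is the τ-quantile; a tiny NumPy version:

```python
import numpy as np

def pinball(y_true, y_pred, tau=0.9):
    """Quantile (pinball) loss: asymmetric penalty whose minimizer is the tau-quantile."""
    r = y_true - y_pred
    return np.maximum(tau * r, (tau - 1.0) * r).mean()

# tau = 0.5 recovers (half of) MAE; tau = 0.9 penalizes under-prediction 9x harder than over-prediction.
```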
Generative modeling
Likelihood-based: NLL for flows/autoregressive models; ELBO for VAEs (recon + KL).
Implicit: GAN objectives (non-saturating, WGAN + GP, f-GANs).
Score/diffusion: Denoising score matching, v-prediction, ε-prediction, hybrid DSM+CE.
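A sketch of the two ELBO terms for a VAE, assuming a unit-variance Gaussian decoder and a diagonal Gaussian encoder, with constants dropped:

```python
import numpy as np

def elbo_terms(x, x_recon, mu, log_var):
    """Negative ELBO pieces for a Gaussian VAE: reconstruction error + KL(q(z|x) || N(0, I))."""
    recon = 0.5 * ((x - x_recon) ** 2).sum(axis=1)                    # Gaussian NLL up to constants
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1)  # analytic KL to a standard normal
    return (recon + kl).mean(), recon.mean(), kl.mean()
```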
Self-supervised/contrastive
InfoNCE, SimCLR, MoCo, BYOL’s predictor loss, Barlow Twins redundancy reduction, VICReg invariance-variance-covariance.
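InfoNCE in miniature: matched rows of two embedding batches are positives, everything else in the batch is a negative (the temperature is hard-coded for the sketch):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: row i of z1 should match row i of z2 against all other rows."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                           # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                          # true pairs sit on the diagonal
```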
Segmentation/detection
Dice/F1 loss, Tversky, focal Tversky, IoU/GIoU/DIoU/CIoU, Hungarian matching loss for DETR-style models.
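A soft Dice loss sketch, with predicted probabilities standing in for set membership:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), with probabilities in place of sets."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```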
Sequence models
Token CE with label smoothing, CTC loss, RNNT, policy-gradient-style sequence risk (min Bayes risk, expected BLEU).
Reinforcement learning
Policy gradient (REINFORCE), entropy-regularized PG, PPO's clipped objective, TRPO's surrogate, Q-learning TD errors (usually Huber), actor-critic value losses, distributional RL (C51, QR-DQN), offline RL penalties (CQL, BCQ), reward-model CE for RLHF, and DPO's direct preference loss.
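For flavor, the PPO clipped surrogate on its own, given log-probs and advantages you've already computed elsewhere:

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: limit how far the policy ratio can move in one update."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()   # minimize the negative surrogate
```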
Multimodal & retrieval
Symmetric cross-entropy + contrastive (CLIP), matching losses with temperature scaling, MIL/NCE variants.
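A CLIP-style symmetric contrastive sketch: one cross-entropy over rows (image→text) and one over columns (text→image) of the similarity matrix, with an illustrative temperature:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE a la CLIP: cross-entropy over both axes of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def ce_rows(m):
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()           # matched pairs sit on the diagonal

    return 0.5 * (ce_rows(logits) + ce_rows(logits.T))
```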
Physics-informed & constraints
PDE residual MSE, boundary loss, hard/soft constraint penalties, differentiable simulators with task + physics consistency.
Calibration & uncertainty
Temperature scaling CE, ECE/Brier surrogates, NLL with proper priors, Dirichlet calibration losses.
Fairness/causal/robustness
Group DRO worst-case risk, CVaR risk, adversarial training (minimax with perturbation loss), IRM/IRMv1 penalties, counterfactual invariance losses.
Regularization terms that moonlight as “losses”
L1/L2 weight decay, spectral norm penalties, Jacobian/Frobenius penalties, orthogonality, sparsity (L0/Top-k), mutual information bounds, gradient penalties (WGAN-GP), consistency/EMA losses.
Why “infinite” isn’t just snark
Parameterized families: proper scoring rules and Bregman divergences each form a whole continuum. Pick a convex generator, get a loss.
Transforms/compositions: any strictly increasing transform of the objective preserves the argmin; weighted sums give new trade-offs; curricula anneal temperatures and margins.
Task-specific constraints: stick any domain residual or soft constraint onto your objective and congratulations, you invented Loss-XXL-2025.
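To make "pick a generator, get a loss" concrete, here's a Bregman divergence built from its convex generator; the two generators below recover squared Euclidean distance and (generalized) KL:

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 1.5, 2.0])

# Generator ||v||^2 gives squared Euclidean distance ...
sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)

# ... while the negative-entropy generator gives (generalized) KL.
p, q = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])
gen_kl = bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, p, q)

print(sq, gen_kl)
```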
Picking one without losing your mind
Predicting labels? Start with cross-entropy; try focal for class imbalance; consider label smoothing for better calibration.
Predicting numbers? MSE for Gaussian-ish noise, MAE/Huber for robustness, quantile for intervals.
Densities? Use NLL if you can write the likelihood; else try Wasserstein/MMD or a GAN variant.
Matching or retrieval? Go contrastive/InfoNCE with a temperature.
RL? Use the algorithm’s surrogate (PPO/TRPO) plus value loss and entropy.
Segmentation/detection? Mix CE + Dice/Tversky or GIoU for boxes.
Worried about worst-case groups or adversaries? Group DRO or adversarial loss.
Care about calibration? Optimize NLL and post-hoc calibrate.
So yes, infinitely many. But 95% of useful practice lives in a few dozen patterns, and the rest are glam-rock remixes with new hyperparameters and a different arXiv figure style.