Unbinned Maximum Likelihood (UML) and Extended Maximum Likelihood (EML)¶

We consider a dataset of $N_{\mathrm{obs}}$ independent events $\{ m_i \}_{i=1}^{N_{\mathrm{obs}}}$, each corresponding to a reconstructed invariant mass $m$. The total probability density function (PDF) is modeled as a mixture of a signal and a background component:

\begin{equation} f(m \mid \boldsymbol{\theta}) = r \, f_s(m \mid \mu, \sigma) + (1 - r) \, f_b(m \mid \tau) \end{equation}

where:

  • $f_s(m \mid \mu, \sigma)$ is the normalized signal PDF (typically Gaussian),
  • $f_b(m \mid \tau)$ is the normalized background PDF (e.g., exponential),
  • $r \in [0,1]$ is the signal fraction parameter,
  • $\boldsymbol{\theta} = \{ r, \mu, \sigma, \tau \}$ denotes the full set of parameters.

Unbinned Maximum Likelihood (UML)¶

In the non-extended case, the total number of events $N_{\mathrm{obs}}$ is fixed. The likelihood function is:

\begin{equation} \mathcal{L}_{\mathrm{UML}}(\boldsymbol{\theta}) = \prod_{i=1}^{N_{\mathrm{obs}}} f(m_i \mid \boldsymbol{\theta}) \end{equation}

and the log-likelihood is:

\begin{equation} \ln \mathcal{L}_{\mathrm{UML}} = \sum_{i=1}^{N_{\mathrm{obs}}} \ln f(m_i \mid \boldsymbol{\theta}) \end{equation}

From the fitted signal fraction $\hat{r}$, the estimated signal yield is:

\begin{equation} \hat{N}_s = \hat{r} \, N_{\mathrm{obs}} \end{equation}


Extended Maximum Likelihood (EML)¶

In the extended formulation, the total event count is treated as a Poisson-distributed variable. We introduce explicit signal and background yields, $N_s$ and $N_b$:

\begin{equation} F(m \mid \boldsymbol{\theta}) = N_s f_s(m \mid \mu, \sigma) + N_b f_b(m \mid \tau) \end{equation}

The expected total number of events is: \begin{equation} N_{\mathrm{exp}} = N_s + N_b \end{equation}

The extended likelihood is then written as:

\begin{equation} \mathcal{L}_{\mathrm{EML}}(\boldsymbol{\theta}) = e^{-N_{\mathrm{exp}}} \prod_{i=1}^{N_{\mathrm{obs}}} \bigl[ N_s f_s(m_i \mid \mu, \sigma) + N_b f_b(m_i \mid \tau) \bigr] \end{equation}

and the corresponding log-likelihood is:

\begin{equation} \ln \mathcal{L}_{\mathrm{EML}} = -N_{\mathrm{exp}} + \sum_{i=1}^{N_{\mathrm{obs}}} \ln \bigl[ N_s f_s(m_i \mid \mu, \sigma) + N_b f_b(m_i \mid \tau) \bigr] \end{equation}

The best-fit signal yield $\hat{N}_s$ and background yield $\hat{N}_b$ are obtained by maximizing $\mathcal{L}_{\mathrm{EML}}$ with respect to all free parameters.


Residuals and Pulls¶

For binned visualization, the expected number of events in bin $i$ is:

\begin{equation} N_i^{\mathrm{exp}} = N_{\mathrm{exp}} \int_{m_i}^{m_{i+1}} f(m \mid \hat{\boldsymbol{\theta}}) \, dm \end{equation}

The residual and pull in each bin are defined as:

\begin{align} \mathrm{Residual}_i &= N_i^{\mathrm{obs}} - N_i^{\mathrm{exp}} \\ \mathrm{Pull}_i &= \frac{N_i^{\mathrm{obs}} - N_i^{\mathrm{exp}}} {\sqrt{N_i^{\mathrm{exp}}}} \end{align}

If the model is correct, the pulls should be distributed as a standard normal with mean zero and unit variance.

🧠 When to Use UML vs EML¶

UML (Unbinned Maximum Likelihood) focuses only on the shape of the distribution.
EML (Extended Maximum Likelihood) fits both the shape and the total number of events.
EML is the standard method in physics analyses when signal yield or significance matters.


🔹 UML – Shape-Only Fit¶

UML assumes the total number of observed events is fixed.
It optimizes the likelihood based only on how the events are distributed,
not how many events were produced in total.

Use UML when:

  • You only care about shape parameters (e.g. peak position, width, slope).
  • The total event count has no physical meaning (e.g. normalized spectra).
  • The dataset is large and Poisson fluctuations are negligible.

✅ Advantages: Simple, fast, and stable.
⚠️ Limitation: Cannot estimate event yields or statistical significance.


🔹 EML – Shape + Yield Fit¶

EML treats the total number of observed events as a random variable
and includes it in the likelihood through a Poisson term.
It simultaneously estimates both the distribution shape and event yields.

Use EML when:

  • You need signal yield, cross section, or significance.
  • Event counts fluctuate noticeably (small or medium samples).
  • You want a physically complete statistical model.

✅ Advantages: Physically correct, gives yields and errors directly.
⚠️ Limitation: Slightly more complex and computationally heavier.


🔸 In Summary¶

You can always use EML — it generalizes UML and is valid in all cases.
However, when total normalization is irrelevant and only the shape matters,
UML is simpler and perfectly adequate.

In [7]:
// === EML: Extended ML fit (Jupyter can run directly, no main/function wrapper) ===
#include "TCanvas.h"
#include "TSystem.h"
#include "RooRealVar.h"
#include "RooGaussian.h"
#include "RooExponential.h"
#include "RooAddPdf.h"
#include "RooDataSet.h"
#include "RooPlot.h"
#include "RooFitResult.h"
#include "RooRandom.h"
#include <iostream>
#include <cmath>

using namespace RooFit;

// ----- Unified parameters & variables (consistent with UML naming) -----
double xmin=2.6, xmax=3.8;
int    nbins=80, seed=12345;
int    Nsig_true=100, Nbkg_true=500;

RooRealVar x("x","m (GeV)", xmin, xmax, "GeV");
x.setBins(nbins);

RooRandom::randomGenerator()->SetSeed(seed);

// Shape parameters (same as UML)
RooRealVar mu("mu","mean",   3.10, 3.0, 3.2);
RooRealVar sigma("sigma","sigma", 0.03, 0.005, 0.08);
RooGaussian   sig("sig","signal pdf", x, mu, sigma);

RooRealVar tau("tau","slope", -1.5, -10., -0.1);
RooExponential bkg("bkg","background pdf", x, tau);

// ===================== Data generation (consistent with UML: extended generation) =====================
RooRealVar Nsig_gen("Nsig_gen","signal yield(gen)", Nsig_true);
RooRealVar Nbkg_gen("Nbkg_gen","bkg yield(gen)",    Nbkg_true);
RooAddPdf  model_gen("model_gen","sig+bkg(gen)", RooArgList(sig,bkg), RooArgList(Nsig_gen,Nbkg_gen));

RooDataSet* dataset = model_gen.generate(x, Extended(true));
double Nobs = dataset->numEntries();

// ===================== EML fit (extended: directly fit yields Nsig, Nbkg) =====================
RooRealVar Nsig("Nsig","signal yield", Nsig_true*0.8, 0.0, 1.0e5);
RooRealVar Nbkg("Nbkg","bkg yield",    Nbkg_true*1.2, 0.0, 1.0e6);
RooAddPdf  model("model","sig+bkg (EML)", RooArgList(sig,bkg), RooArgList(Nsig,Nbkg));

// Extended fit: include Extended(true)
// To switch to non-extended fit, remove Extended(true); yields will be treated as relative fractions (normalized to sum=1)
RooFitResult* fitres = model.fitTo(*dataset, Save(true), Extended(true));

std::cout << "\n===== EML (extended) Fit =====\n";
std::cout << "Nobs   = " << Nobs << " +/- " << sqrt(Nobs) << "\n";
std::cout << "Nsig   = " << Nsig.getVal() << " ± " << Nsig.getError() << "\n";
std::cout << "Nbkg   = " << Nbkg.getVal() << " ± " << Nbkg.getError() << "\n";
std::cout << "mu     = " << mu.getVal()    << " ± " << mu.getError()    << "\n";
std::cout << "sigma  = " << sigma.getVal() << " ± " << sigma.getError() << "\n";
std::cout << "tau    = " << tau.getVal()   << " ± " << tau.getError()   << "\n";
std::cout << "==============================\n";

// ===================== Plotting (top: fit + bottom: pull) =====================
if (auto *old = dynamic_cast<TCanvas*>(gROOT->FindObject("c_eml"))) delete old;
TCanvas* c = new TCanvas("c_eml","EML fit + Pull", 900, 800);
c->Divide(1,2);

// Top pad
c->cd(1);
gPad->SetPad(0,0.30,1,1.00);
gPad->SetBottomMargin(0.02);

RooPlot* frame = x.frame(Title("EML: Gaussian + exponential background"));
dataset->plotOn(frame, Binning(nbins), DataError(RooAbsData::Poisson), Name("data"));

// Plot total model curve with expected counts scale (consistent with data dimension)
model.plotOn(frame, Name("model"), LineWidth(2),
             Normalization(dataset->sumEntries(), RooAbsReal::NumEvent));
model.plotOn(frame, Components(sig), LineStyle(kDashed), LineWidth(2));
model.plotOn(frame, Components(bkg), LineStyle(kDotted), LineWidth(2));
frame->Draw();

// Bottom pad: pull (data vs model)
c->cd(2);
gPad->SetPad(0,0.00,1,0.30);
gPad->SetTopMargin(0.02);
gPad->SetBottomMargin(0.30);

RooHist* hpull = frame->pullHist("data","model");
RooPlot* fpull = x.frame(Title("Pull = (data - fit)/#sigma"));
fpull->addPlotable(hpull,"P");
fpull->SetYTitle("Pull");
fpull->SetMinimum(-5); fpull->SetMaximum(5);
fpull->GetYaxis()->SetTitleSize(0.10);
fpull->GetYaxis()->SetLabelSize(0.09);
fpull->GetXaxis()->SetTitleSize(0.10);
fpull->GetXaxis()->SetLabelSize(0.09);
fpull->Draw();

c->Draw();
gSystem->ProcessEvents();
[#1] INFO:Fitting -- RooAbsPdf::fitTo(model) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_model_model_genData) Summation contains a RooNLLVar, using its error level
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: activating const optimization
Minuit2Minimizer: Minimize with max-calls 2500 convergence for edm < 1 strategy 1
Minuit2Minimizer : Valid minimum - status = 0
FVAL  = -3030.61991177998516
Edm   = 7.6773874339533898e-06
Nfcn  = 101
Nbkg	  = 471.826	 +/-  23.7537	(limited)
Nsig	  = 95.1359	 +/-  13.7025	(limited)
mu	  = 3.1079	 +/-  0.00548104	(limited)
sigma	  = 0.0345635	 +/-  0.00517216	(limited)
tau	  = -1.45042	 +/-  0.143884	(limited)
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: deactivating const optimization

===== EML (extended) Fit =====
Nobs   = 567 +/- 23.8118
Nsig   = 95.1359 ± 13.7151
Nbkg   = 471.826 ± 23.7643
mu     = 3.1079 ± 0.00548027
sigma  = 0.0345635 ± 0.00517614
tau    = -1.45042 ± 0.143893
==============================
[#1] INFO:Plotting -- RooAbsPdf::plotOn(model) directly selected PDF components: (sig)
[#1] INFO:Plotting -- RooAbsPdf::plotOn(model) indirectly selected PDF components: ()
[#1] INFO:Plotting -- RooAbsPdf::plotOn(model) directly selected PDF components: (bkg)
[#1] INFO:Plotting -- RooAbsPdf::plotOn(model) indirectly selected PDF components: ()
Info in <Minuit2>: MnSeedGenerator Computing seed using NumericalGradient calculator
Info in <Minuit2>: MnSeedGenerator Initial state: FCN =      -3017.031168 Edm =       13.13816811 NCalls =     21
Info in <Minuit2>: MnSeedGenerator Initial state  
  Minimum value : -3017.031168
  Edm           : 13.13816811
  Internal parameters:	[     -1.521801632     -1.514220239                0    -0.3398369095        0.7997354]	
  Internal gradient  :	[      4526.471847      73.90560598     -23.69216193     0.1505502726     -14.33701614]	
  Internal covariance matrix:
[[  2.3515968e-06              0              0              0              0]
 [              0  3.5553147e-05              0              0              0]
 [              0              0   0.0062395652              0              0]
 [              0              0              0    0.034028568              0]
 [              0              0              0              0   0.0032769746]]]
Info in <Minuit2>: VariableMetricBuilder Start iterating until Edm is < 0.001 with call limit = 2500
Info in <Minuit2>: VariableMetricBuilder    0 - FCN =      -3017.031168 Edm =       13.13816811 NCalls =     21
Info in <Minuit2>: VariableMetricBuilder    1 - FCN =      -3029.514747 Edm =      0.7695529999 NCalls =     33
Info in <Minuit2>: VariableMetricBuilder    2 - FCN =      -3030.538346 Edm =      0.1032587436 NCalls =     45
Info in <Minuit2>: VariableMetricBuilder    3 - FCN =      -3030.618122 Edm =    0.001719106911 NCalls =     57
Info in <Minuit2>: VariableMetricBuilder    4 - FCN =      -3030.619912 Edm =   7.029473007e-06 NCalls =     68
Info in <Minuit2>: VariableMetricBuilder After Hessian
Info in <Minuit2>: VariableMetricBuilder    5 - FCN =      -3030.619912 Edm =   7.677387434e-06 NCalls =    101
Info in <Minuit2>: Minuit2Minimizer::Hesse Using max-calls 2500
Info in <Minuit2>: Minuit2Minimizer::Hesse Hesse is valid - matrix is accurate
In [ ]: