ROOT_528-00b_version: tutorials/roostats/FourBinInstructional.C Source File

00001 //  This example is a generalization of the on/off problem.  
00002 
00003 /*
00004 FourBin Instructional Tutorial:
00005  authors: 
00006  Kyle Cranmer <cranmer@cern.ch>
00007  Tanja Rommerskirchen <tanja.rommerskirchen@cern.ch>
00008 
00009  date: June 1, 2010
00010 
00011  This example is a generalization of the on/off problem.  
00012 It's a common setup for SUSY searches.  Imagine that one has two
00013 variables "x" and "y" (eg. missing ET and SumET), see figure.  
00014 The signal region has high values of both of these variables (top right).
00015 One can see low values of "x" or "y" acting as side-bands.  If we
00016 just used "y" as a sideband, we would have the on/off problem.  
00017  - In the signal region we observe non events and expect s+b events.  
00018  - In the region with low values of "y" (bottom right) 
00019    we observe noff events and expect tau*b events.  
00020 Note the significance of tau.  In the background only case:
00021    tau ~ <expectation off> / <expectation on>
00022 If tau is known, this model is sufficient, but often tau is not known exactly.
00023 So one can use low values of "x" as an additional constraint for tau.  
00024 Note that this technique critically depends on the notion that the 
00025 joint distribution for "x" and "y" can be factorized.  
00026 Generally, these regions have many events, so it the ratio can be
00027 measured very precisely there.  So we extend the model to describe the 
00028 left two boxes... denoted with "bar".
00029   - In the upper left we observe nonbar events and expect bbar events
00030   - In the bottom left we observe noffbar events and expect tau bbar events
00031 Note again we have:
00032    tau ~ <expecation off bar> / <expectation on bar>
00033 One can further expand the model to account for the systematic associated 
00034 to assuming the distribution of "x" and "y" factorizes (eg. that 
00035 tau is the same for off/on and offbar/onbar). This can be done in several
00036 ways, but here we introduce an additional parameter rho, which so that
00037 one set of models will use tau and the other tau*rho. The choice is arbitary,
00038 but it has consequences on the numerical stability of the algorithms.  
00039 The "bar" measurements typically have more events (& smaller relative errors).
00040 If we choose <expectation noffbar> = tau * rho * <expectation noonbar>, the
00041 product tau*rho will be known very precisely (~1/sqrt(bbar)) and the contour 
00042 in those parameters will be narrow and have a non-trivial tau~1/rho shape.
00043 However, if we choose to put rho on the non/noff measurements (where the 
00044 product will have an error ~1/sqrt(b)), the contours will be more ameanable 
00045 to numerical techniques.  Thus, here we choose to define
00046    tau := <expecation off bar> / (<expectation on bar>)
00047    rho := <expecation off> / (<expectation on> * tau)
00048 
00049 ^ y
00050 |
00051 |---------------------------+
00052 |               |           |
00053 |     nonbar    |    non    |
00054 |      bbar     |    s+b    |
00055 |               |           |
00056 |---------------+-----------|
00057 |               |           |
00058 |    noffbar    |    noff   |
00059 |    tau bbar   | tau b rho |
00060 |               |           |
00061 +-----------------------------> x
00062 
00063 
00064 Left in this way, the problem is under-constrained.  However, one may
00065 have some auxiliary measurement (usually based on Monte Carlo) to
00066 constrain rho.  Let us call this auxiliary measurement that gives
00067 the nominal value of rho "rhonom".  Thus, there is a 'constraint' term in 
00068 the full model: P(rhonom | rho).  In this case, we consider a Gaussian
00069 constraint with standard deviation sigma.
00070 
00071 In the example, the initial values of the parameters are
00072   - s    = 40
00073   - b    = 100
00074   - tau  = 5
00075   - bbar = 1000
00076   - rho  = 1
00077   (sigma for rho = 20%)
00078 and in the toy dataset:
00079    - non = 139 
00080    - noff = 528
00081    - nonbar = 993
00082    - noffbar = 4906
00083    - rhonom = 1.27824
00084 
00085 Note, the covariance matrix of the parameters has large off-diagonal terms.  
00086 Clearly s,b are anti-correlated.  Similary, since noffbar >> nonbar, one would
00087 expect bbar,tau to be anti-correlated.  
00088 
00089 This can be seen below.
00090             GLOBAL      b    bbar   rho      s     tau
00091         b  0.96820   1.000  0.191 -0.942 -0.762 -0.209
00092      bbar  0.91191   0.191  1.000  0.000 -0.146 -0.912
00093       rho  0.96348  -0.942  0.000  1.000  0.718 -0.000
00094         s  0.76250  -0.762 -0.146  0.718  1.000  0.160
00095       tau  0.92084  -0.209 -0.912 -0.000  0.160  1.000
00096 
00097 Similarly, since tau*rho appears as a product, we expect rho,tau 
00098 to be anti-correlated. When the error on rho is significantly 
00099 larger than 1/sqrt(bbar), tau is essentially known and the 
00100 correlation is minimal (tau mainly cares about bbar, and rho about b,s).  
00101 In the alternate parametrizaton (bbar* tau * rho) the correlation coefficient 
00102 for rho,tau is large (and negative).
00103 
00104 
00105 The code below uses best-practices for RooFit & RooStats as of 
00106 June 2010.  It proceeds as follows:
00107  - create a workspace to hold the model
00108  - use workspace factory to quickly create the terms of the model
00109  - use workspace factory to define total model (a prod pdf)
00110  - create a RooStats ModelConfig to specify observables, parameters of interest
00111  - add to the ModelConfig a prior on the parameters for Bayesian techniques
00112    - note, the pdf it is factorized for parameters of interest & nuisance params
00113  - visualize the model
00114  - write the workspace to a file
00115  - use several of RooStats IntervalCalculators & compare results
00116 */
00117 
00118 #include "TStopwatch.h"
00119 #include "TCanvas.h"
00120 #include "TROOT.h"
00121 #include "RooPlot.h"
00122 #include "RooAbsPdf.h"
00123 #include "RooWorkspace.h"
00124 #include "RooDataSet.h"
00125 #include "RooGlobalFunc.h"
00126 #include "RooFitResult.h"
00127 #include "RooRandom.h"
00128 #include "RooStats/ProfileLikelihoodCalculator.h"
00129 #include "RooStats/LikelihoodInterval.h"
00130 #include "RooStats/LikelihoodIntervalPlot.h"
00131 #include "RooStats/BayesianCalculator.h"
00132 #include "RooStats/MCMCCalculator.h"
00133 #include "RooStats/MCMCInterval.h"
00134 #include "RooStats/MCMCIntervalPlot.h"
00135 #include "RooStats/ProposalHelper.h"
00136 #include "RooStats/SimpleInterval.h"
00137 #include "RooStats/FeldmanCousins.h"
00138 #include "RooStats/PointSetInterval.h"
00139 
00140 using namespace RooFit;
00141 using namespace RooStats;
00142 
00143 void FourBinInstructional(bool doBayesian=false, bool doFeldmanCousins=false, bool doMCMC=false){
00144   
00145   // let's time this challenging example
00146   TStopwatch t;
00147   t.Start();
00148 
00149   // set RooFit random seed for reproducible results
00150   RooRandom::randomGenerator()->SetSeed(4357);
00151 
00152   // make model
00153   RooWorkspace* wspace = new RooWorkspace("wspace");
00154   wspace->factory("Poisson::on(non[0,1000], sum::splusb(s[40,0,100],b[100,0,300]))");
00155   wspace->factory("Poisson::off(noff[0,5000], prod::taub(b,tau[5,3,7],rho[1,0,2]))");
00156   wspace->factory("Poisson::onbar(nonbar[0,10000], bbar[1000,500,2000])");
00157   wspace->factory("Poisson::offbar(noffbar[0,1000000], prod::lambdaoffbar(bbar, tau))");
00158   wspace->factory("Gaussian::mcCons(rhonom[1.,0,2], rho, sigma[.2])");
00159   wspace->factory("PROD::model(on,off,onbar,offbar,mcCons)");
00160   wspace->defineSet("obs","non,noff,nonbar,noffbar,rhonom");
00161 
00162   wspace->factory("Uniform::prior_poi({s})");
00163   wspace->factory("Uniform::prior_nuis({b,bbar,tau, rho})");
00164   wspace->factory("PROD::prior(prior_poi,prior_nuis)"); 
00165 
00166   ///////////////////////////////////////////
00167   // Control some interesting variations
00168   // define parameers of interest
00169   // for 1-d plots
00170   wspace->defineSet("poi","s");
00171   wspace->defineSet("nuis","b,tau,rho,bbar");
00172   // for 2-d plots to inspect correlations:
00173   //  wspace->defineSet("poi","s,rho");
00174 
00175   // test simpler cases where parameters are known.
00176   //  wspace->var("tau")->setConstant();
00177   //  wspace->var("rho")->setConstant();
00178   //  wspace->var("b")->setConstant();
00179   //  wspace->var("bbar")->setConstant();
00180 
00181   // inspect workspace
00182   //  wspace->Print();
00183 
00184   ////////////////////////////////////////////////////////////
00185   // Generate toy data
00186   // generate toy data assuming current value of the parameters
00187   // import into workspace. 
00188   // add Verbose() to see how it's being generated
00189   RooDataSet* data =   wspace->pdf("model")->generate(*wspace->set("obs"),1);
00190   //  data->Print("v");
00191   wspace->import(*data);
00192 
00193   /////////////////////////////////////////////////////
00194   // Now the statistical tests
00195   // model config
00196   ModelConfig* modelConfig = new ModelConfig("FourBins");
00197   modelConfig->SetWorkspace(*wspace);
00198   modelConfig->SetPdf(*wspace->pdf("model"));
00199   modelConfig->SetPriorPdf(*wspace->pdf("prior"));
00200   modelConfig->SetParametersOfInterest(*wspace->set("poi"));
00201   modelConfig->SetNuisanceParameters(*wspace->set("nuis"));
00202   wspace->import(*modelConfig);
00203   wspace->writeToFile("FourBin.root");
00204 
00205   //////////////////////////////////////////////////
00206   // If you want to see the covariance matrix uncomment
00207   //  wspace->pdf("model")->fitTo(*data);
00208 
00209   // use ProfileLikelihood
00210   ProfileLikelihoodCalculator plc(*data, *modelConfig);
00211   plc.SetConfidenceLevel(0.95);
00212   LikelihoodInterval* plInt = plc.GetInterval();
00213   RooFit::MsgLevel msglevel = RooMsgService::instance().globalKillBelow();
00214   RooMsgService::instance().setGlobalKillBelow(RooFit::FATAL);
00215   plInt->LowerLimit( *wspace->var("s") ); // get ugly print out of the way. Fix.
00216   RooMsgService::instance().setGlobalKillBelow(msglevel);
00217 
00218   // use FeldmaCousins (takes ~20 min)  
00219   FeldmanCousins fc(*data, *modelConfig);
00220   fc.SetConfidenceLevel(0.95);
00221   //number counting: dataset always has 1 entry with N events observed
00222   fc.FluctuateNumDataEntries(false); 
00223   fc.UseAdaptiveSampling(true);
00224   fc.SetNBins(40);
00225   PointSetInterval* fcInt = NULL;
00226   if(doFeldmanCousins){ // takes 7 minutes
00227     fcInt = (PointSetInterval*) fc.GetInterval(); // fix cast
00228   }
00229 
00230 
00231   // use BayesianCalculator (only 1-d parameter of interest, slow for this problem)  
00232   BayesianCalculator bc(*data, *modelConfig);
00233   bc.SetConfidenceLevel(0.95);
00234   SimpleInterval* bInt = NULL;
00235   if(doBayesian && wspace->set("poi")->getSize() == 1)   {
00236     bInt = bc.GetInterval();
00237   } else{
00238     cout << "Bayesian Calc. only supports on parameter of interest" << endl;
00239   }
00240 
00241 
00242   // use MCMCCalculator  (takes about 1 min)
00243   // Want an efficient proposal function, so derive it from covariance
00244   // matrix of fit
00245   RooFitResult* fit = wspace->pdf("model")->fitTo(*data,Save());
00246   ProposalHelper ph;
00247   ph.SetVariables((RooArgSet&)fit->floatParsFinal());
00248   ph.SetCovMatrix(fit->covarianceMatrix());
00249   ph.SetUpdateProposalParameters(kTRUE); // auto-create mean vars and add mappings
00250   ph.SetCacheSize(100);
00251   ProposalFunction* pf = ph.GetProposalFunction();
00252 
00253   MCMCCalculator mc(*data, *modelConfig);
00254   mc.SetConfidenceLevel(0.95);
00255   mc.SetProposalFunction(*pf);
00256   mc.SetNumBurnInSteps(500); // first N steps to be ignored as burn-in
00257   mc.SetNumIters(50000);
00258   mc.SetLeftSideTailFraction(0.5); // make a central interval
00259   MCMCInterval* mcInt = NULL;
00260   if(doMCMC)
00261     mcInt = mc.GetInterval();
00262 
00263   //////////////////////////////////////
00264   // Make some  plots
00265   TCanvas* c1 = (TCanvas*) gROOT->Get("c1");  
00266   if(!c1)
00267     c1 = new TCanvas("c1");
00268 
00269   if(doBayesian && doMCMC){
00270     c1->Divide(3);
00271     c1->cd(1);
00272   }
00273   else if (doBayesian || doMCMC){
00274     c1->Divide(2);
00275     c1->cd(1);
00276   }
00277 
00278   LikelihoodIntervalPlot* lrplot = new LikelihoodIntervalPlot(plInt);
00279   lrplot->Draw();
00280 
00281   if(doBayesian && wspace->set("poi")->getSize() == 1)   {
00282     c1->cd(2);
00283     // the plot takes a long time and print lots of error
00284     // using a scan it is better
00285     bc.SetScanOfPosterior(20);
00286     RooPlot* bplot = bc.GetPosteriorPlot();
00287     bplot->Draw();
00288   } 
00289 
00290   if(doMCMC){
00291     if(doBayesian && wspace->set("poi")->getSize() == 1) 
00292       c1->cd(3);
00293     else 
00294       c1->cd(2);
00295     MCMCIntervalPlot mcPlot(*mcInt); 
00296     mcPlot.Draw();
00297   }
00298 
00299   ////////////////////////////////////
00300   // querry intervals
00301   cout << "Profile Likelihood interval on s = [" << 
00302     plInt->LowerLimit( *wspace->var("s") ) << ", " <<
00303     plInt->UpperLimit( *wspace->var("s") ) << "]" << endl; 
00304   //Profile Likelihood interval on s = [12.1902, 88.6871]
00305 
00306    
00307   if(doBayesian && wspace->set("poi")->getSize() == 1)   {
00308     cout << "Bayesian interval on s = [" << 
00309       bInt->LowerLimit( ) << ", " <<
00310       bInt->UpperLimit( ) << "]" << endl;
00311   }  
00312   
00313   if(doFeldmanCousins){    
00314     cout << "Feldman Cousins interval on s = [" << 
00315       fcInt->LowerLimit( *wspace->var("s") ) << ", " <<
00316       fcInt->UpperLimit( *wspace->var("s") ) << "]" << endl;
00317     //Feldman Cousins interval on s = [18.75 +/- 2.45, 83.75 +/- 2.45]
00318   }
00319 
00320   if(doMCMC){
00321     cout << "MCMC interval on s = [" << 
00322       mcInt->LowerLimit(*wspace->var("s") ) << ", " <<
00323       mcInt->UpperLimit(*wspace->var("s") ) << "]" << endl;
00324     //MCMC interval on s = [15.7628, 84.7266]
00325 
00326   }
00327 
00328   t.Print();
00329    
00330 
00331 }