
 /// \file

 /// \ingroup tutorial_roostats

 /// \notebook

 /// OneSidedFrequentistUpperLimitWithBands

 ///

 /// This is a standard demo that can be used with any ROOT file

 /// prepared in the standard way.  You specify:

 ///  - name for input ROOT file

 ///  - name of workspace inside ROOT file that holds model and data

 ///  - name of ModelConfig that specifies details for calculator tools

 ///  - name of dataset

 ///

 /// With default parameters the macro will attempt to run the

 /// standard hist2workspace example and read the ROOT file

 /// that it produces.

 ///

 /// The macro switches the standard profile likelihood test statistic to its one-sided form, then runs the scan.

 /// You may want to control:

 /// ~~~{.cpp}

 ///   double confidenceLevel=0.95;

 ///   int nPointsToScan = 12;

 ///   int nToyMC = 150;

 /// ~~~

 /// This uses a modified version of the profile likelihood ratio as

 /// a test statistic for upper limits (e.g. test stat = 0 if muhat > mu).
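 ///
 /// As a sketch of that definition, assuming the convention \f$ -\ln\lambda \f$ (no factor of 2) and writing
 /// \f$ \theta \f$ for the nuisance parameters (a symbol introduced here for illustration), the one-sided
 /// statistic reads
 /// \f[
 ///   q_\mu =
 ///   \begin{cases}
 ///     -\ln \dfrac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})} & \hat{\mu} \le \mu \\
 ///     0 & \hat{\mu} > \mu
 ///   \end{cases}
 /// \f]
 /// where \f$ \hat{\hat{\theta}}(\mu) \f$ denotes the conditional MLE of the nuisance parameters at fixed \f$ \mu \f$.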

 ///

 /// Based on the observed data, one defines a set of parameter points

 /// to be tested based on the value of the parameter of interest

 /// and the conditional MLE (i.e. profiled) values of the nuisance parameters.

 ///

 /// At each parameter point, pseudo-experiments are generated using this

 /// fixed reference model and then the test statistic is evaluated.

 /// Note, the nuisance parameters are floating in the fits.  For each point,

 /// the threshold that defines the 95% acceptance region is found.  This

 /// forms a "Confidence Belt".

 ///

 /// After constructing the confidence belt, one can find the confidence

 /// interval for any particular dataset by finding the intersection

 /// of the observed test statistic and the confidence belt.  First

 /// this is done on the observed data to get an observed 1-sided upper limit.
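 ///
 /// In symbols, and only as a sketch (\f$ k_{\alpha}(\mu) \f$ is a name introduced here for the threshold at a
 /// scan point, with \f$ 1-\alpha \f$ the confidence level), the belt and the resulting upper limit can be written as
 /// \f[
 ///   P\left( q_\mu \le k_{\alpha}(\mu) \,\middle|\, \mu, \hat{\hat{\theta}}(\mu) \right) \ge 1 - \alpha ,
 ///   \qquad
 ///   \mu_{up} = \max \left\{ \mu : q_\mu^{obs} \le k_{\alpha}(\mu) \right\}
 /// \f]
 /// where \f$ q_\mu \f$ is the test statistic, \f$ \hat{\hat{\theta}}(\mu) \f$ the profiled nuisance parameters
 /// used to generate the pseudo-experiments, and \f$ q_\mu^{obs} \f$ the test statistic evaluated on the dataset
 /// in question.  The maximum runs over the scanned values of \f$ \mu \f$ only.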

 ///

 /// Finally, the expected limit and bands (from background-only) are

 /// formed by generating background-only data and finding the upper limit.

 /// This is done by hand for now, but will later be part of the RooStats tools.
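 ///
 /// For orientation, the bands correspond to quantiles of the background-only distribution of upper limits at
 /// the Gaussian cumulative probabilities \f$ \Phi(n) \f$ for \f$ n = -2, -1, 0, +1, +2 \f$ (values rounded here):
 /// \f[
 ///   \Phi(-2) \approx 0.023, \quad \Phi(-1) \approx 0.159, \quad \Phi(0) = 0.5, \quad
 ///   \Phi(+1) \approx 0.841, \quad \Phi(+2) \approx 0.977
 /// \f]
 /// so the median expected limit is the 50% quantile and the \f$ \pm 1\sigma \f$ band spans roughly the
 /// 16% to 84% quantiles.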

 ///

 /// On a technical note, this technique is NOT the Feldman-Cousins technique,

 /// because that is a 2-sided interval BY DEFINITION.  However, like the

 /// Feldman-Cousins technique this is a Neyman-Construction.  For technical

 /// reasons the easiest way to implement this right now is to use the

 /// FeldmanCousins tool and then change the test statistic that it is using.

 ///

 /// Building the confidence belt can be computationally expensive.  Once it is built,

 /// one could save it to a file and use it in a separate step.

 ///

 /// We can use PROOF to speed things along in parallel. However,

 /// the test statistic has to be installed on the workers

 /// so either turn off PROOF or include the modified test statistic

 /// in your `$ROOTSYS/roofit/roostats/inc` directory,

 /// add the additional line to the LinkDef.h file,

 /// and recompile root.

 ///

 /// Note, if you have a boundary on the parameter of interest (e.g. cross-section)

 /// the threshold on the one-sided test statistic starts off very small because we

 /// are only including downward fluctuations.  You can see the threshold in these printouts:

 /// ~~~{.cpp}

 /// [#0] PROGRESS:Generation -- generated toys: 500 / 999

 /// NeymanConstruction: Prog: 12/50 total MC = 39 this test stat = 0

 ///  SigXsecOverSM=0.69 alpha_syst1=0.136515 alpha_syst3=0.425415 beta_syst2=1.08496 [-1e+30, 0.011215]  in interval = 1

 /// ~~~

 /// This tells you the values of the parameters being used to generate the pseudo-experiments

 /// and the threshold in this case is 0.011215.  One would expect for 95% that the threshold

 /// would be ~1.35 once the cross-section is far enough away from 0 that it is essentially

 /// unaffected by the boundary.  As one reaches the last points in the scan, the

 /// threshold starts to get artificially high.  This is because the range of the parameter in

 /// the fit is the same as the range in the scan.  In the future, these should be independently

 /// controlled, but they are not now.  As a result the ~50% of pseudo-experiments that have an

 /// upward fluctuation end up with muhat = muMax.  Because of this, the upper range of the

 /// parameter should be well above the expected upper limit... but not too high or one will

 /// need a very large value of nPointsToScan to resolve the relevant region.  This can be

 /// improved, but this is the first version of this script.
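 ///
 /// The ~1.35 quoted above can be checked with a short calculation, assuming the asymptotic regime in which half of
 /// the pseudo-experiments give a test statistic of exactly 0 and the other half follow \f$ \tfrac{1}{2}\chi^2_1 \f$
 /// (the test statistic being \f$ -\ln\lambda \f$, without a factor of 2):
 /// \f[
 ///   P(q_\mu > k) = \tfrac{1}{2} P\left( \chi^2_1 > 2k \right) = 0.05
 ///   \;\Rightarrow\; 2k = (1.645)^2 \approx 2.71
 ///   \;\Rightarrow\; k \approx 1.35
 /// \f]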

 ///

 /// Important note: when the model includes external constraint terms, like a Gaussian

 /// constraint on a nuisance parameter centered around some nominal value, there is

 /// a subtlety.  The asymptotic results are all based on the assumption that all the

 /// measurements fluctuate... including the nominal values from auxiliary measurements.

 /// If these do not fluctuate, this corresponds to a "conditional ensemble".  The

 /// result is that the distribution of the test statistic can become very non-chi^2.

 /// This results in thresholds that become very large. This can be seen in the following

 /// thought experiment.  Say the model is

 /// \f$ Pois(N | s + b) \, G(b_0 | b, \sigma) \f$
 /// where \f$ G(b_0 | b, \sigma) \f$ is the external constraint and \f$ b_0 \f$ is 100.  If N is also 100
 /// then the profiled value of b given s is going to be some trade off between 100-s and \f$ b_0 \f$.
 /// If \f$ \sigma \f$ is \f$ \sqrt{N} \f$, then the profiled value of b is probably 100 - s/2.  So for

 /// s=60 we are going to have a profiled value of b~70.  Now when we generate pseudo-experiments

 /// for s=60, b=70 we will have N~130 and the average shat will be 30, not 60.  In practice,

 /// this is only an issue for values of s that are very excluded.  For values of s near the 95%

 /// limit this should not be a big effect.  This can be avoided if the nominal values of the constraints also fluctuate,

 /// but that requires that those parameters are RooRealVars in the model.

 /// This version does not deal with this issue, but it will be addressed in a future version.
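 ///
 /// The numbers in this thought experiment can be reproduced with a rough calculation, assuming a Gaussian
 /// approximation to the Poisson term with variance N (an approximation made here, not part of the macro):
 /// \f[
 ///   -\ln L(s, b) \approx \frac{(N - s - b)^2}{2N} + \frac{(b_0 - b)^2}{2\sigma^2},
 ///   \qquad \sigma^2 = N
 ///   \;\Rightarrow\; \hat{\hat{b}}(s) = \frac{N + b_0 - s}{2} = 100 - \frac{s}{2}
 /// \f]
 /// For s = 60 this gives b = 70, so pseudo-experiments have \f$ N \sim s + b = 130 \f$ while \f$ b_0 \f$ stays
 /// at 100, and the fitted signal is on average about \f$ N - b_0 = 30 \f$.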

 ///

 /// \macro_image

 /// \macro_output

 /// \macro_code

 ///

 /// \authors Kyle Cranmer Haichen Wang Daniel Whiteson


 #include <cstring>  // strcmp
 #include <iostream> // std::cout, std::endl

 #include "TFile.h"
 #include "TROOT.h"
 #include "TH1F.h"
 #include "TCanvas.h"
 #include "TSystem.h"

 #include "RooWorkspace.h"
 #include "RooSimultaneous.h"
 #include "RooAbsData.h"
 #include "RooDataSet.h"
 #include "RooRealVar.h"

 #include "RooStats/ModelConfig.h"
 #include "RooStats/FeldmanCousins.h"
 #include "RooStats/ToyMCSampler.h"
 #include "RooStats/PointSetInterval.h"
 #include "RooStats/ConfidenceBelt.h"

 #include "RooStats/RooStatsUtils.h"
 #include "RooStats/ProfileLikelihoodTestStat.h"

 using namespace RooFit;
 using namespace RooStats;
 using std::cout;
 using std::endl;


 bool useProof = false; // flag to control whether to use Proof

 int nworkers = 0;      // number of workers (default use all available cores)


 // -------------------------------------------------------

 // The actual macro


 void OneSidedFrequentistUpperLimitWithBands(const char *infile = "", const char *workspaceName = "combined",

                                             const char *modelConfigName = "ModelConfig",

                                             const char *dataName = "obsData")

 {


    double confidenceLevel = 0.95;

    int nPointsToScan = 12;

    int nToyMC = 150;


    // -------------------------------------------------------

    // First part is just to access a user-defined file

    // or create the standard example file if it doesn't exist

    const char *filename = "";

    if (!strcmp(infile, "")) {

       filename = "results/example_combined_GaussExample_model.root";

       bool fileExist = !gSystem->AccessPathName(filename); // note opposite return code

       // if the file does not exist, generate it with HistFactory

       if (!fileExist) {

 #ifdef _WIN32

          cout << "HistFactory file cannot be generated on Windows - exit" << endl;

          return;

 #endif

          // Normally this would be run on the command line

          cout << "will run standard hist2workspace example" << endl;

          gROOT->ProcessLine(".! prepareHistFactory .");

          gROOT->ProcessLine(".! hist2workspace config/example.xml");

          cout << "\n\n---------------------" << endl;

          cout << "Done creating example input" << endl;

          cout << "---------------------\n\n" << endl;

       }


    } else

       filename = infile;


    // Try to open the file

    TFile *file = TFile::Open(filename);


    // if input file was specified but not found, quit

    if (!file) {

       cout << "OneSidedFrequentistUpperLimitWithBands: Input file " << filename << " was not found" << endl;

       return;

    }


    // -------------------------------------------------------

    // Now get the data and workspace


    // get the workspace out of the file

    RooWorkspace *w = (RooWorkspace *)file->Get(workspaceName);

    if (!w) {

       cout << "workspace not found" << endl;

       return;

    }


    // get the modelConfig out of the file

    ModelConfig *mc = (ModelConfig *)w->obj(modelConfigName);


    // get the data out of the workspace

    RooAbsData *data = w->data(dataName);


    // make sure ingredients are found

    if (!data || !mc) {

       w->Print();

       cout << "data or ModelConfig was not found" << endl;

       return;

    }


    // -------------------------------------------------------

    // Now get the POI for convenience

    // you may want to adjust the range of your POI


    RooRealVar *firstPOI = (RooRealVar *)mc->GetParametersOfInterest()->first();

    /*  firstPOI->setMin(0);*/

    /*  firstPOI->setMax(10);*/


    // --------------------------------------------

    // Create and use the FeldmanCousins tool

    // to find and plot the 95% confidence interval

    // on the parameter of interest as specified

    // in the model config

    // REMEMBER, we will change the test statistic

    // so this is NOT a Feldman-Cousins interval

    FeldmanCousins fc(*data, *mc);

    fc.SetConfidenceLevel(confidenceLevel);

    fc.AdditionalNToysFactor(

       0.5); // degrade/improve sampling that defines confidence belt: in this case makes the example faster

    /*  fc.UseAdaptiveSampling(true); // speed it up a bit, don't use for expected limits*/

    fc.SetNBins(nPointsToScan); // set how many points per parameter of interest to scan

    fc.CreateConfBelt(true);    // save the information in the belt for plotting


    // -------------------------------------------------------

    // Feldman-Cousins is a unified limit by definition

    // but the tool takes care of a few things for us like which values

    // of the nuisance parameters should be used to generate toys.

    // so let's just change the test statistic and realize this is

    // no longer "Feldman-Cousins" but is a fully frequentist Neyman-Construction.

    /*  ProfileLikelihoodTestStatModified onesided(*mc->GetPdf());*/

    /*  fc.GetTestStatSampler()->SetTestStatistic(&onesided);*/

    /* ((ToyMCSampler*) fc.GetTestStatSampler())->SetGenerateBinned(true); */

    ToyMCSampler *toymcsampler = (ToyMCSampler *)fc.GetTestStatSampler();

    ProfileLikelihoodTestStat *testStat = dynamic_cast<ProfileLikelihoodTestStat *>(toymcsampler->GetTestStatistic());

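    if (!testStat) {
       // the dynamic_cast fails if the sampler does not hold a ProfileLikelihoodTestStat
       cout << "test statistic is not a ProfileLikelihoodTestStat" << endl;
       return;
    }
    testStat->SetOneSided(true);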


    // Since this tool needs to throw toy MC the PDF needs to be

    // extended or the tool needs to know how many entries in a dataset

    // per pseudo experiment.

    // In the 'number counting form' where the entries in the dataset

    // are counts, and not values of discriminating variables, the

    // datasets typically only have one entry and the PDF is not

    // extended.

    if (!mc->GetPdf()->canBeExtended()) {

       if (data->numEntries() == 1)

          fc.FluctuateNumDataEntries(false);

       else

          cout << "Not sure what to do about this model" << endl;

    }


    // We can use PROOF to speed things along in parallel

    // However, the test statistic has to be installed on the workers

    // so either turn off PROOF or include the modified test statistic

    // in your `$ROOTSYS/roofit/roostats/inc` directory,

    // add the additional line to the LinkDef.h file,

    // and recompile root.

    if (useProof) {

       ProofConfig pc(*w, nworkers, "", false);

       toymcsampler->SetProofConfig(&pc); // enable proof

    }


    if (mc->GetGlobalObservables()) {

       cout << "will use global observables for unconditional ensemble" << endl;

       mc->GetGlobalObservables()->Print();

       toymcsampler->SetGlobalObservables(*mc->GetGlobalObservables());

    }


    // Now get the interval

    PointSetInterval *interval = fc.GetInterval();

    ConfidenceBelt *belt = fc.GetConfidenceBelt();


    // print out the interval on the first Parameter of Interest

    cout << "\n95% interval on " << firstPOI->GetName() << " is : [" << interval->LowerLimit(*firstPOI) << ", "

         << interval->UpperLimit(*firstPOI) << "] " << endl;


    // get observed UL and value of test statistic evaluated there

    RooArgSet tmpPOI(*firstPOI);

    double observedUL = interval->UpperLimit(*firstPOI);

    firstPOI->setVal(observedUL);

    double obsTSatObsUL = fc.GetTestStatSampler()->EvaluateTestStatistic(*data, tmpPOI);


    // Ask the calculator which points were scanned

    RooDataSet *parameterScan = (RooDataSet *)fc.GetPointsToScan();

    RooArgSet *tmpPoint;


    // make a histogram of parameter vs. threshold

    TH1F *histOfThresholds =

       new TH1F("histOfThresholds", "", parameterScan->numEntries(), firstPOI->getMin(), firstPOI->getMax());

    histOfThresholds->GetXaxis()->SetTitle(firstPOI->GetName());

    histOfThresholds->GetYaxis()->SetTitle("Threshold");


    // loop through the points that were tested and ask confidence belt

    // what the upper/lower thresholds were.

    // For FeldmanCousins, the lower cut off is always 0

    for (Int_t i = 0; i < parameterScan->numEntries(); ++i) {

       tmpPoint = (RooArgSet *)parameterScan->get(i)->clone("temp");

       // cout <<"get threshold"<<endl;

       double arMax = belt->GetAcceptanceRegionMax(*tmpPoint);

       double poiVal = tmpPoint->getRealValue(firstPOI->GetName());

       histOfThresholds->Fill(poiVal, arMax);
       delete tmpPoint; // the clone is owned by this loop iteration

    }

    TCanvas *c1 = new TCanvas();

    c1->Divide(2);

    c1->cd(1);

    histOfThresholds->SetMinimum(0);

    histOfThresholds->Draw();

    c1->cd(2);


    // -------------------------------------------------------

    // Now we generate the expected bands and power-constraint


    // First: find parameter point for mu=0, with conditional MLEs for nuisance parameters

    RooAbsReal *nll = mc->GetPdf()->createNLL(*data);

    RooAbsReal *profile = nll->createProfile(*mc->GetParametersOfInterest());

    firstPOI->setVal(0.);

    profile->getVal(); // this will do fit and set nuisance parameters to profiled values

    RooArgSet *poiAndNuisance = new RooArgSet();

    if (mc->GetNuisanceParameters())

       poiAndNuisance->add(*mc->GetNuisanceParameters());

    poiAndNuisance->add(*mc->GetParametersOfInterest());

    w->saveSnapshot("paramsToGenerateData", *poiAndNuisance);

    RooArgSet *paramsToGenerateData = (RooArgSet *)poiAndNuisance->snapshot();

    cout << "\nWill use these parameter points to generate pseudo data for bkg only" << endl;

    paramsToGenerateData->Print("v");


    RooArgSet unconditionalObs;

    unconditionalObs.add(*mc->GetObservables());

    if (mc->GetGlobalObservables())
       unconditionalObs.add(*mc->GetGlobalObservables()); // comment this out for the original conditional ensemble


    double CLb = 0;

    double CLbinclusive = 0;


    // Now we generate background only and find distribution of upper limits

    TH1F *histOfUL = new TH1F("histOfUL", "", 100, 0, firstPOI->getMax());

    histOfUL->GetXaxis()->SetTitle("Upper Limit (background only)");

    histOfUL->GetYaxis()->SetTitle("Entries");

    for (int imc = 0; imc < nToyMC; ++imc) {


       // set parameters back to values for generating pseudo data

       //    cout << "\n get current nuis, set vals, print again" << endl;

       w->loadSnapshot("paramsToGenerateData");

       //    poiAndNuisance->Print("v");


       RooDataSet *toyData = 0;

       // now generate a toy dataset

       if (!mc->GetPdf()->canBeExtended()) {

          if (data->numEntries() == 1)

             toyData = mc->GetPdf()->generate(*mc->GetObservables(), 1);

          else {
             cout << "Not sure what to do about this model" << endl;
             continue; // no toy dataset could be generated, so skip this pseudo-experiment
          }

       } else {

          //      cout << "generating extended dataset"<<endl;

          toyData = mc->GetPdf()->generate(*mc->GetObservables(), Extended());

       }


       // generate global observables

       // need to be careful for simpdf

       //    RooDataSet* globalData = mc->GetPdf()->generate(*mc->GetGlobalObservables(),1);


       RooSimultaneous *simPdf = dynamic_cast<RooSimultaneous *>(mc->GetPdf());

       if (!simPdf) {

          RooDataSet *one = mc->GetPdf()->generate(*mc->GetGlobalObservables(), 1);

          const RooArgSet *values = one->get();

          RooArgSet *allVars = mc->GetPdf()->getVariables();

          *allVars = *values;

          delete allVars;

          // 'values' points into 'one' and is owned by it, so it must not be deleted separately

          delete one;

       } else {


          // try fix for sim pdf

          TIterator *iter = simPdf->indexCat().typeIterator();

          RooCatType *tt = NULL;

          while ((tt = (RooCatType *)iter->Next())) {


             // Get pdf associated with state from simpdf

             RooAbsPdf *pdftmp = simPdf->getPdf(tt->GetName());


             // Generate only global variables defined by the pdf associated with this state

             RooArgSet *globtmp = pdftmp->getObservables(*mc->GetGlobalObservables());

             RooDataSet *tmp = pdftmp->generate(*globtmp, 1);


             // Transfer values to output placeholder

             *globtmp = *tmp->get(0);


             // Cleanup

             delete globtmp;

             delete tmp;

          }
          delete iter; // the iterator returned by typeIterator() is owned by the caller

       }


       //    globalData->Print("v");

       //    unconditionalObs = *globalData->get();

       //    mc->GetGlobalObservables()->Print("v");

       //    delete globalData;

       //    cout << "toy data = " << endl;

       //    toyData->get()->Print("v");


       // get test stat at observed UL in observed data

       firstPOI->setVal(observedUL);

       double toyTSatObsUL = fc.GetTestStatSampler()->EvaluateTestStatistic(*toyData, tmpPOI);

       //    toyData->get()->Print("v");

       //    cout <<"obsTSatObsUL " <<obsTSatObsUL << "toyTS " << toyTSatObsUL << endl;

       if (obsTSatObsUL < toyTSatObsUL) // not sure about <= part yet

          CLb += (1.) / nToyMC;

       if (obsTSatObsUL <= toyTSatObsUL) // not sure about <= part yet

          CLbinclusive += (1.) / nToyMC;


       // loop over points in belt to find upper limit for this toy data

       double thisUL = 0;

       for (Int_t i = 0; i < parameterScan->numEntries(); ++i) {

          tmpPoint = (RooArgSet *)parameterScan->get(i)->clone("temp");

          double arMax = belt->GetAcceptanceRegionMax(*tmpPoint);

          firstPOI->setVal(tmpPoint->getRealValue(firstPOI->GetName()));

          //   double thisTS = profile->getVal();

          double thisTS = fc.GetTestStatSampler()->EvaluateTestStatistic(*toyData, tmpPOI);
          delete tmpPoint; // the clone is no longer needed once the POI value has been copied


          //   cout << "poi = " << firstPOI->getVal()

          // << " max is " << arMax << " this profile = " << thisTS << endl;

          //      cout << "thisTS = " << thisTS<<endl;

          if (thisTS <= arMax) {

             thisUL = firstPOI->getVal();

          } else {

             break;

          }

       }


       /*

       // loop over points in belt to find upper limit for this toy data

       double thisUL = 0;

       for(Int_t i=0; i<histOfThresholds->GetNbinsX(); ++i){

          tmpPoint = (RooArgSet*) parameterScan->get(i)->clone("temp");

          cout <<"----------------  "<<i<<endl;

          tmpPoint->Print("v");

          cout << "from hist " << histOfThresholds->GetBinCenter(i+1) <<endl;

          double arMax = histOfThresholds->GetBinContent(i+1);

          // cout << " threhold from Hist = aMax " << arMax<<endl;

          // double arMax2 = belt->GetAcceptanceRegionMax(*tmpPoint);

          // cout << "from scan arMax2 = "<< arMax2 << endl; // not the same due to TH1F not TH1D

          // cout << "scan - hist" << arMax2-arMax << endl;

          firstPOI->setVal( histOfThresholds->GetBinCenter(i+1));

          //   double thisTS = profile->getVal();

          double thisTS = fc.GetTestStatSampler()->EvaluateTestStatistic(*toyData,tmpPOI);


          //   cout << "poi = " << firstPOI->getVal()

          // << " max is " << arMax << " this profile = " << thisTS << endl;

          //      cout << "thisTS = " << thisTS<<endl;


          // NOTE: need to add a small epsilon term for single precision vs. double precision

          if(thisTS<=arMax + 1e-7){

             thisUL = firstPOI->getVal();

          } else{

             break;

          }

       }

       */


       histOfUL->Fill(thisUL);


       // for few events, data is often the same, and UL is often the same

       //    cout << "thisUL = " << thisUL<<endl;


       delete toyData;

    }

    histOfUL->Draw();

    c1->SaveAs("one-sided_upper_limit_output.pdf");


    // if you want to see a plot of the sampling distribution for a particular scan point:

    /*

    SamplingDistPlot sampPlot;

    int indexInScan = 0;

    tmpPoint = (RooArgSet*) parameterScan->get(indexInScan)->clone("temp");

    firstPOI->setVal( tmpPoint->getRealValue(firstPOI->GetName()) );

    toymcsampler->SetParametersForTestStat(tmpPOI);

    SamplingDistribution* samp = toymcsampler->GetSamplingDistribution(*tmpPoint);

    sampPlot.AddSamplingDistribution(samp);

    sampPlot.Draw();

       */


    // Now find bands and power constraint

    Double_t *bins = histOfUL->GetIntegral();

    TH1F *cumulative = (TH1F *)histOfUL->Clone("cumulative");

    cumulative->SetContent(bins);

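    // initialize to 0 so that a band that is never reached does not print an indeterminate value
    double band2sigDown = 0, band1sigDown = 0, bandMedian = 0, band1sigUp = 0, band2sigUp = 0;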

    for (int i = 1; i <= cumulative->GetNbinsX(); ++i) {

       if (bins[i] < RooStats::SignificanceToPValue(2))

          band2sigDown = cumulative->GetBinCenter(i);

       if (bins[i] < RooStats::SignificanceToPValue(1))

          band1sigDown = cumulative->GetBinCenter(i);

       if (bins[i] < 0.5)

          bandMedian = cumulative->GetBinCenter(i);

       if (bins[i] < RooStats::SignificanceToPValue(-1))

          band1sigUp = cumulative->GetBinCenter(i);

       if (bins[i] < RooStats::SignificanceToPValue(-2))

          band2sigUp = cumulative->GetBinCenter(i);

    }

    cout << "-2 sigma  band " << band2sigDown << endl;

    cout << "-1 sigma  band " << band1sigDown << " [Power Constraint]" << endl;

    cout << "median of band " << bandMedian << endl;

    cout << "+1 sigma  band " << band1sigUp << endl;

    cout << "+2 sigma  band " << band2sigUp << endl;


    // print out the interval on the first Parameter of Interest

    cout << "\nobserved 95% upper-limit " << interval->UpperLimit(*firstPOI) << endl;

    cout << "CLb strict [P(toy>obs|0)] for observed 95% upper-limit " << CLb << endl;

    cout << "CLb inclusive [P(toy>=obs|0)] for observed 95% upper-limit " << CLbinclusive << endl;


    delete profile;

    delete nll;

 }
