Evaluation
The specific target of this challenge is to determine a one-standard-deviation (68.27%) confidence interval for \(\mu\) for the provided pseudo-experiment(s) (see the main “Overview” page).
For this competition, participants’ models will be tested on 10 sets of 100 pseudo-experiments (1000 pseudo-experiments in total). Each set will have a different value of \(\mu\). The overall quantiles score will be based on the total coverage and the average interval width.
Quantiles Score
The participants are requested to provide a method that, for a given pseudo-experiment, returns an interval \([\hat \mu_{16}, \hat \mu_{84}]\). This interval should describe the central 68.27% quantile of the likelihood function of the signal strength \(\mu\). In other words, the interval should contain the true \(\mu\) value of a given data set 68.27% of the time.
Constructing the Interval
Not every uncertainty quantification method is able to return a full likelihood function. Some methods, for example, instead return a predicted central value \(\hat \mu\) and an uncertainty on that value, \(\Delta \hat \mu\). However, constructing the interval does not require the full likelihood function. For ease of use, here are a few suggestions for constructing the interval for various methods (a short code sketch follows):
Methods that return a likelihood function: calculate the central interval containing 68.27% of the total probability mass using numerical integration.
Methods that return a central value and a Gaussian uncertainty: define \(\hat \mu_{16} = \hat \mu - \Delta \hat \mu\) and \(\hat \mu_{84} = \hat \mu + \Delta \hat \mu\). If the Gaussian assumption of a symmetric uncertainty holds, this interval contains 68.27% of the probability, since the one-standard-deviation region of a Gaussian distribution contains 68.27% of the probability mass.
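As a rough illustration, here is a minimal Python sketch of both constructions. The function names, the evaluation grid, and the assumption that the likelihood callable accepts an array of \(\mu\) values are illustrative choices, not part of the challenge API.

```python
import numpy as np

def interval_from_likelihood(likelihood, mu_grid):
    """Central 68.27% interval from a likelihood evaluated on a grid of mu values.

    Assumes `likelihood` can be called on an array of mu values (illustrative).
    """
    density = likelihood(mu_grid)
    density = density / np.trapz(density, mu_grid)  # normalise to unit area
    # Cumulative integral via trapezoids, prepending 0 so it matches mu_grid in length.
    cdf = np.concatenate([[0.0], np.cumsum(0.5 * (density[1:] + density[:-1]) * np.diff(mu_grid))])
    mu_16 = np.interp(0.5 - 0.6827 / 2, cdf, mu_grid)  # 15.865% quantile
    mu_84 = np.interp(0.5 + 0.6827 / 2, cdf, mu_grid)  # 84.135% quantile
    return mu_16, mu_84

def interval_from_gaussian(mu_hat, delta_mu_hat):
    """Central 68.27% interval assuming a symmetric (Gaussian) uncertainty."""
    return mu_hat - delta_mu_hat, mu_hat + delta_mu_hat
```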
Submission requirements
Participants’ submissions must consist of a zip file containing a model.py file (which must not be inside a directory); the zip file may also contain any other necessary files (e.g. a pre-trained model). The model.py file must define a Model class which must satisfy the following criteria.

The Model class must accept two arguments, get_train_set and systematics, when it is initialized. The get_train_set argument will receive a callable which, when called, will return the public dataset. The systematics argument will receive a callable which can be used to apply the systematic effects (adjusting weights and primary features, computing derived features, and applying post-selection cuts) to a dataset.

The Model class must have a fit method, which will be called once when the submission is being evaluated. This method can be used to prepare the model for inference. We encourage participants to submit models which have already been trained, as there is limited compute time for each submission to be evaluated.

The Model class must have a predict method which must accept a test dataset and return the results as a dictionary containing four items:
"mu_hat": the predicted value of mu,
"delta_mu_hat": the uncertainty in the predicted value of mu,
"p16": the 16th percentile of mu (the lower bound of the interval), and
"p84": the 84th percentile of mu (the upper bound of the interval).
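For orientation, the following is a minimal structural sketch of a model.py that satisfies these requirements. Only the Model class name, its constructor arguments, the fit and predict methods, and the keys of the returned dictionary are fixed by the challenge; all internal logic shown here (a constant placeholder prediction) is purely illustrative.

```python
# model.py -- minimal structural sketch of a compliant submission.
# Everything inside the methods is placeholder logic for illustration only.

class Model:
    def __init__(self, get_train_set=None, systematics=None):
        # get_train_set: callable returning the public dataset
        # systematics: callable applying systematic effects to a dataset
        self.get_train_set = get_train_set
        self.systematics = systematics

    def fit(self):
        # Called once before evaluation; use it to load a pre-trained
        # model or perform any lightweight preparation.
        pass

    def predict(self, test_set):
        # Called once per pseudo-experiment by a parallel worker.
        mu_hat = 1.0        # placeholder point estimate
        delta_mu_hat = 0.1  # placeholder uncertainty
        return {
            "mu_hat": mu_hat,
            "delta_mu_hat": delta_mu_hat,
            "p16": mu_hat - delta_mu_hat,
            "p84": mu_hat + delta_mu_hat,
        }
```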
Hardware description
Throughout the competition, participants’ submissions will be run on either the Perlmutter supercomputer at NERSC or an alternative workstation at LBNL, but they will only be run on Perlmutter for the final evaluation. When running on Perlmutter, submissions will be assigned one node, which consists of 1 AMD EPYC 7763 CPU, 256GB of RAM, and 4 NVIDIA A100 GPUs with 40GB of memory each (https://docs.nersc.gov/systems/perlmutter/architecture/#gpu-nodes). The alternative workstation consists of 1 Intel(R) Xeon(R) Gold 6148 CPU, 376GB of RAM, and 3 Tesla V100-SXM2-16GB GPUs. On either system, participants’ submissions will be allotted 2 hours to complete evaluation on all of the pseudo-experiments (10 sets of 100 pseudo-experiments each for the initial phase, and 10 sets of 1000 pseudo-experiments for the final phase). These pseudo-experiments will be parallelized in the following way: each participant’s Model.fit() will be run once, and then each of the pseudo-experiments will be run by one of many parallel workers, with each worker calling Model.predict() once. There will be 30 parallel workers when running on Perlmutter, reduced to 10 parallel workers when running on the alternative workstation.
Scoring
The score consists of two parts, the interval width and the coverage:
Interval width: we define the width as the average size of the interval over the \(N\) pseudo-experiments, \(w = \frac{1}{N} \sum_{i=1}^{N} \left| \hat \mu_{84,i} - \hat \mu_{16,i} \right|\).
Coverage: we define the coverage as the fraction of pseudo-experiments in which the true \(\mu\) is contained within the respective interval, \(c = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\mu_{\mathrm{true},i} \in [\hat \mu_{16,i}, \hat \mu_{84,i}]\right]\).
The coverage function is meant to penalise a deviation of the coverage from the desired 68.27% (see graph). We define the coverage scoring function \(f\) as
\(f(x) = 1\) if \(0.68 - 2\sigma_{68} \le x \le 0.68 + 2\sigma_{68}\),
\(f(x) = 1 + \left|\frac{x - (0.68 - 2\sigma_{68})}{\sigma_{68}}\right|^{4}\) if \(x < 0.68 - 2\sigma_{68}\),
\(f(x) = 1 + \left|\frac{x - (0.68 + 2\sigma_{68})}{\sigma_{68}}\right|^{3}\) if \(x > 0.68 + 2\sigma_{68}\),
with \(\sigma_{68} = \frac{\sqrt{(1-0.68)\,0.68\,N}}{N}\).
Full score: finally, we combine the two parts into the full score
\(s = -\ln{((w+\epsilon) f(c))}\)
So the best models will maximise their score by minimising the average width of the interval while maintaining the expected 68.27% coverage.
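To make the scoring concrete, here is a minimal Python sketch of the computation for one set of pseudo-experiments. The function names, the array-based interface, and the value of epsilon are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def coverage_score(c, n):
    """Coverage penalty f(c): equal to 1 within +/- 2 sigma_68 of 0.68,
    growing polynomially outside that band."""
    sigma_68 = np.sqrt((1 - 0.68) * 0.68 * n) / n
    if c < 0.68 - 2 * sigma_68:
        return 1 + abs((c - (0.68 - 2 * sigma_68)) / sigma_68) ** 4
    if c > 0.68 + 2 * sigma_68:
        return 1 + abs((c - (0.68 + 2 * sigma_68)) / sigma_68) ** 3
    return 1.0

def quantile_score(mu_16, mu_84, mu_true, epsilon=1e-10):
    """Full score s = -ln((w + epsilon) * f(c)); epsilon here is an assumed small constant."""
    mu_16, mu_84, mu_true = map(np.asarray, (mu_16, mu_84, mu_true))
    n = len(mu_true)
    w = np.mean(np.abs(mu_84 - mu_16))                    # average interval width
    c = np.mean((mu_true >= mu_16) & (mu_true <= mu_84))  # coverage
    return -np.log((w + epsilon) * coverage_score(c, n))
```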