Only the sub-benchmarks that are scored as an accuracy (i.e., from 0-100%) will be included (I believe that's all of them, but I'm not sure)
It must be a single model. If Model A achieves 75% on half of the tasks and Model B achieves 75% on the other half, that does not resolve the question YES.
Ensemble models are fine, but something like "run Model A on this benchmark and Model B on that other benchmark" is not. If there is model selection, it must be learned, and it cannot use the current benchmark as an input.
For this and the related BIG-bench markets: it seems like most groups have stopped publishing metrics on the individual tasks (as opposed to the average score), and that they're mostly publishing on BIG-bench hard. If that's the case, my current plan is to resolve these markets N/A, and I'll make new ones asking about the average score on BIG-bench hard.
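For concreteness, here's a minimal sketch of how I'd compute the score under these rules, assuming a simple unweighted mean over the accuracy-scored sub-benchmarks for a single model (the task names and numbers below are placeholders, not real results):

```python
# Sketch of the scoring rule: keep only accuracy-scored (0-100%) sub-benchmarks,
# require that all scores come from the same single model, and take the unweighted mean.
# Hypothetical data for illustration only.

def average_accuracy(scores: dict[str, float]) -> float:
    """Unweighted mean of per-task accuracies (0-100) for one model."""
    accuracy_scores = [s for s in scores.values() if 0.0 <= s <= 100.0]
    if not accuracy_scores:
        raise ValueError("no accuracy-scored sub-benchmarks found")
    return sum(accuracy_scores) / len(accuracy_scores)

# Placeholder example: three sub-benchmark accuracies from a single model.
example = {"task_a": 71.2, "task_b": 80.5, "task_c": 74.0}
print(average_accuracy(example))  # 75.23...
```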