By the end of 2025, will a model achieve video action recognition accuracy equivalent to human-level accuracy on Something-Something V2?

Action recognition is a computer vision task that involves recognizing human actions in videos or images.

The goal is to classify and categorize the actions being performed in the video or image into a predefined set of action classes.

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, on the order of 10k videos.

(I'll accept a result within 2% of human performance.)
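
The 2% tolerance can be read in two ways, and a small sketch may help make the difference concrete. The function below is only an illustration: the name `within_tolerance`, the `relative` flag, and the example numbers are all hypothetical and are not measured results for any model or for humans. It assumes accuracies are given in percent, and it treats "within 2%" as 2 percentage points by default, with a relative reading available as an alternative.

```python
def within_tolerance(model_acc: float, human_acc: float,
                     tol: float = 2.0, relative: bool = False) -> bool:
    """Check whether a model's accuracy is within `tol` of human accuracy.

    Accuracies are in percent (0-100). Both readings are assumptions:
      relative=False: model may trail humans by at most `tol` percentage points.
      relative=True:  model must reach at least (100 - tol)% of human accuracy.
    A model that matches or exceeds human accuracy passes under either reading.
    """
    if relative:
        threshold = human_acc * (1.0 - tol / 100.0)
    else:
        threshold = human_acc - tol
    return model_acc >= threshold


# Placeholder numbers for illustration only, not real benchmark results.
print(within_tolerance(88.1, 90.0))                 # absolute reading
print(within_tolerance(88.1, 90.0, relative=True))  # relative reading
```

With a hypothetical human accuracy of 90.0, the absolute reading sets the bar at 88.0 and the relative reading sets it at about 88.2, so a borderline result can pass under one reading and fail under the other.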

