
Performance evaluation of PySceneDetect in terms of both latency and accuracy #481

Open
awkrail opened this issue Feb 11, 2025 · 4 comments


awkrail commented Feb 11, 2025

Problem/Use Case

Currently, PySceneDetect does not support evaluation. However, evaluating its performance is crucial for further development. I propose a feature to integrate evaluation code into PySceneDetect. This issue describes the procedure.

Solutions

Datasets

To evaluate the performance, we need datasets that consist of videos and manually annotated shots. I surveyed shot detection on Google Scholar and found that the following datasets have been proposed. I think BCC and RAI are a good starting point because they are frequently used in the shot detection literature and are small, so they are easy to download. In addition, Kinetics-GEBD, ClipShot, and AutoShot collected their videos from YouTube, so using them in our evaluation protocol may violate YouTube's policy.

| Dataset | Conference | Domain | #Videos | Avg. video length (s) | #Citations | Paper title |
| --- | --- | --- | --- | --- | --- | --- |
| BCC | ACMMM15 | Broadcast | 11 | 2,945 | 133 | A deep siamese network for scene detection in broadcast videos |
| RAI | CAIP15 | Broadcast | 10 | 591 | 86 | Shot and scene detection via clustering for re-using broadcast video |
| Kinetics-GEBD | ICCV21 | General | 55,351 | n/a | 81 | Generic Event Boundary Detection: A Benchmark for Event Detection |
| ClipShot | ACCV18 | General | 4,039 | 237 | 54 | Fast Video Shot Transition Localization with Deep Structured Models |
| AutoShot | CVPR Workshop 23 | General | 853 | 39 | 13 | AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection |

Metrics

The previous literature uses recall, precision, and F1 scores to evaluate shot detection methods. Let $\hat{Y}=(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_k, \cdots, \hat{y}_K)$ be the predicted shot boundary frame numbers and $Y=(y_1, y_2, \cdots, y_l, \cdots, y_L)$ be the manually annotated shot boundary frame numbers.
Recall and precision are computed as in the following Python code:

```python
def compute_f1(hat_ys, ys):
    """Compute recall, precision, and F1 for predicted boundaries (hat_ys)
    against manually annotated boundaries (ys), both given as frame numbers."""
    threshold = 5  # a prediction is correct if abs(hat_y - y) <= threshold
    correct = 0
    for hat_y in hat_ys:
        if min(abs(hat_y - y) for y in ys) <= threshold:
            correct += 1
    recall = correct / len(ys)
    precision = correct / len(hat_ys)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

Note that this code is only a rough sketch of the evaluation process. For a precise implementation I will need to handle edge cases, e.g. two predictions $\hat{y}$ falling near the same annotated boundary $y$ (the many-to-one case).
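
As one way the many-to-one case could be handled (my suggestion here, not necessarily how the papers above evaluate), each annotated boundary could be matched to at most one prediction:

```python
def compute_f1_matched(hat_ys, ys, threshold=5):
    """Variant of compute_f1 where each annotated boundary is matched to at
    most one prediction, so duplicate predictions count against precision."""
    matched = set()  # indices of annotated boundaries already claimed
    correct = 0
    for hat_y in sorted(hat_ys):
        # Distance to every annotated boundary that is still unmatched.
        candidates = [(abs(hat_y - y), i) for i, y in enumerate(ys)
                      if i not in matched]
        if not candidates:
            break
        dist, idx = min(candidates)
        if dist <= threshold:
            matched.add(idx)
            correct += 1
    recall = correct / len(ys) if ys else 0.0
    precision = correct / len(hat_ys) if hat_ys else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1
```

With this variant, a second prediction near an already-matched boundary no longer inflates the correct count; it lowers precision instead.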

Implementation

I believe two evaluation modes are necessary: a local mode and a CI mode.
For local mode, I created an evaluation/ directory at the top level of the repository and wrote Python scripts to run the evaluation on a local laptop.
For CI mode, building on the evaluation/ directory, we would set up GitHub Actions to automatically run the evaluation commands whenever new commits are pushed. A rough sketch of a local evaluation script is included below.
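
To make the local mode concrete, here is a sketch of what such a script could look like. The dataset layout, annotations.json format, and function names are assumptions for illustration only; it reuses compute_f1 from the metrics snippet above and PySceneDetect's detect()/ContentDetector API, and it also reports per-video latency since this issue covers both accuracy and speed.

```python
import json
import time
from pathlib import Path

from scenedetect import ContentDetector, detect

# Hypothetical: assumes the metric snippet above lives in evaluation/metrics.py.
from evaluation.metrics import compute_f1


def predicted_boundaries(video_path):
    """Run PySceneDetect on one video and return predicted cut frame numbers."""
    scene_list = detect(str(video_path), ContentDetector())
    # Each cut is the first frame of every scene except the first one.
    return [start.get_frames() for start, _ in scene_list[1:]]


def run_benchmark(dataset_dir):
    """Evaluate every video in a dataset directory.

    Assumed (hypothetical) layout: <dataset_dir>/videos/*.mp4 plus an
    annotations.json mapping each video file name to its list of
    ground-truth boundary frame numbers.
    """
    dataset_dir = Path(dataset_dir)
    annotations = json.loads((dataset_dir / "annotations.json").read_text())
    for video_path in sorted((dataset_dir / "videos").glob("*.mp4")):
        ys = annotations[video_path.name]
        start_time = time.perf_counter()
        hat_ys = predicted_boundaries(video_path)
        latency = time.perf_counter() - start_time
        recall, precision, f1 = compute_f1(hat_ys, ys)
        print(f"{video_path.name}: recall={recall:.3f} precision={precision:.3f} "
              f"f1={f1:.3f} latency={latency:.1f}s")


if __name__ == "__main__":
    run_benchmark("datasets/RAI")
```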

Questions

How do we store the RAI and BCC video datasets? Because the video files are larger than GitHub's file size limit (100 MB), we need a storage service.
Zenodo is one candidate because it allows us to store datasets for academic purposes and to download them in a CLI-friendly manner (e.g. with curl or wget). A hypothetical download helper is sketched below.
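
As an illustration of that download path, a fetch-and-verify step could look like the following. The Zenodo record URL, file name, and checksum are placeholders, since no deposit exists yet.

```python
import hashlib
import urllib.request

# Placeholder values: a real Zenodo deposit would be created for the
# benchmark data and its URL/checksum pinned here.
DATASET_URL = "https://zenodo.org/records/<record-id>/files/rai_dataset.zip"
DATASET_SHA256 = "<expected-sha256>"


def download_dataset(url=DATASET_URL, out_path="rai_dataset.zip"):
    """Download a dataset archive and verify its checksum before use."""
    urllib.request.urlretrieve(url, out_path)
    with open(out_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != DATASET_SHA256:
        raise RuntimeError(f"Checksum mismatch for {out_path}: {digest}")
    return out_path
```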

@Breakthrough (Owner) commented

This would be fantastic to have, thanks for writing this up. Have you been able to run any evaluations locally? Feel free to upload a pull request with any scripts you might have. Even if they can only be run locally by a developer, or if they need to download files, that's okay for now.

Once we have a workflow that's easy enough to run locally, I don't mind looking into the other issues you raised about how to do this with Github Actions. Thanks for the link to Zenodo as well by the way, that looks super useful. We might be able to find other existing datasets on there as well in the future.

Note that we could also use Git LFS on GitHub and store the artifacts in a separate repository. I registered a GitHub organization called PySceneDetect, so we could set up a repo there for this purpose.


awkrail commented Feb 12, 2025

@Breakthrough Thank you for your reply.

> Have you been able to run any evaluations locally? Feel free to upload a pull request with any scripts you might have. Even if they can only be run locally by a developer, or if they need to download files, that's okay for now.

I just started writing code for it. Which directory structure do you prefer: PySceneDetect/scenedetect/evaluation or PySceneDetect/evaluation? I’m wondering if I should place my Python code in PySceneDetect/scenedetect/evaluation, since the current code is located in PySceneDetect/scenedetect.

> Note that we could also use Git LFS on GitHub and store the artifacts in a separate repository. I registered a GitHub organization called PySceneDetect, so we could set up a repo there for this purpose.

GitHub LFS seems like a good option. Because the video sizes are limited, LFS might be better if you allow me to use it.
In any case, I will first create a PR for the local evaluation commands, and then a follow-up PR for GitHub Actions (CI mode).


Breakthrough commented Feb 13, 2025

> Which directory structure do you prefer: PySceneDetect/scenedetect/evaluation or PySceneDetect/evaluation? I’m wondering if I should place my Python code in PySceneDetect/scenedetect/evaluation, since the current code is located in PySceneDetect/scenedetect.

There are actually issues with placing sub-folders under the scenedetect/ folder, since it acts as the Python package. Could you create a new folder called benchmarks in the root of the repo and use that?

> GitHub LFS seems like a good option. Because the video sizes are limited, LFS might be better if you allow me to use it. In any case, I will first create a PR for the local evaluation commands, and then a follow-up PR for GitHub Actions (CI mode).

Hmm, I did some more digging and using Github for this might not be tenable - the bandwidth limit for free accounts is 1 GiB. Pricing is $0.07/GiB of storage and ~$0.09/GiB for bandwidth. The ClipShot dataset alone is around 45 GiB, which works out to $3.15, plus say ~$4 per download of the entire thing, so that would add up fast. We unfortunately would have to pay those bandwidth costs for Github Actions too each time we run this here.

Let me do some more research into this aspect. I think we definitely need to have some kind of backup mirror for the project's purposes, but we shouldn't need to be blocked on that. Running it locally is also fine for now. Maybe we can choose a small sub-set of the full data that will be good enough for most purposes. E.g. we all choose ~100 videos or so, and limit the dataset size used in the CI actions so we don't hit bandwidth limits. If we can keep the costs under control, I'm happy to cover them.

For now, could we just include links to where to download the datasets, and instructions on where to put them to run the benchmarks?


awkrail commented Feb 13, 2025

> There are actually issues with placing sub-folders under the scenedetect/ folder, since it acts as the Python package. Could you create a new folder called benchmarks in the root of the repo and use that?

> For now, could we just include links to where to download the datasets, and instructions on where to put them to run the benchmarks?

Got it! Thanks.

> Hmm, I did some more digging and using Github for this might not be tenable - the bandwidth limit for free accounts is 1 GiB. Pricing is $0.07/GiB of storage and ~$0.09/GiB for bandwidth. The ClipShot dataset alone is around 45 GiB, which works out to $3.15, plus say ~$4 per download of the entire thing, so that would add up fast. We unfortunately would have to pay those bandwidth costs for Github Actions too each time we run this here.
> Let me do some more research into this aspect. I think we definitely need to have some kind of backup mirror for the project's purposes, but we shouldn't need to be blocked on that. Running it locally is also fine for now. Maybe we can choose a small sub-set of the full data that will be good enough for most purposes. E.g. we all choose ~100 videos or so, and limit the dataset size used in the CI actions so we don't hit bandwidth limits. If we can keep the costs under control, I'm happy to cover them.

I agree with selecting a subset of videos from the datasets to reduce the cost. RAI and BCC are broadcast videos, so I want to pick diverse videos from ClipShot or AutoShot as well. In addition, I think each video's license is also important; Creative Commons or other free-to-use licenses are desirable.
