This repo contains all the models (including their variants) developed in the project Multi-Stage Hybrid-CNN Transformer Model for Human Intent Prediction. As an overview, the Multi-Stage Hybrid-CNN Transformer Classifier System is composed of two key components: the Gazed Object Detector and the Intent Classifier.
- The Gazed Object Detector is in the `gazed_object_detectors` folder, which contains the three different variations of the model.
- The Intent Classifier is in the `intent_classifier` folder.
- The overall system for inference is in the `multi-stage_human_intent_classifier_system` folder.

Each folder has its own `readme.md` for guidance.
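The two-stage flow described above (gazed-object detection feeding intent classification) can be sketched as follows. This is a minimal illustration only: the class and function names (`GazeDetection`, `detect_gazed_object`, `classify_intent`) and the placeholder logic are assumptions, not the repo's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GazeDetection:
    object_label: str   # label of the gazed object ("None" if not looking at any object)
    probability: float  # confidence of the gaze prediction


def detect_gazed_object(frame: List[float]) -> GazeDetection:
    # Stage 1 placeholder: the real detector runs a hybrid CNN-Transformer
    # over the frame; here we just return a dummy detection.
    return GazeDetection(object_label="cup", probability=0.9)


def classify_intent(detections: List[GazeDetection]) -> str:
    # Stage 2 placeholder: the real classifier aggregates the per-frame
    # gaze sequence; here we simply vote on the most frequent label.
    labels = [d.object_label for d in detections if d.object_label != "None"]
    if not labels:
        return "no_intent"
    return f"reach_for_{max(set(labels), key=labels.count)}"


# Example: run the pipeline over a short dummy frame sequence.
frames = [[0.0], [0.0], [0.0]]
detections = [detect_gazed_object(f) for f in frames]
print(classify_intent(detections))  # -> reach_for_cup
```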
The dataset used in this project can be accessed here. The generators and statistics for the train-test split are in the `split` folder.
- Add more video samples to balance the gaze distribution (the "None" class, i.e., not looking at any object, is currently overrepresented)
- Train the Gazed Object Detector weights from scratch to tailor the model to object-gaze classification
- Feed the gaze probabilities of all objects in a given frame, rather than only the most probable gaze, as input to the human intent classifier
- Explore other human pose estimation techniques (motivated by the performance gain from the additional head information)
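The idea of using the full gaze-probability distribution rather than only the most probable gaze can be sketched as below. The array shapes, object count, and variable names are illustrative assumptions, not the repo's actual interface.

```python
import numpy as np

# Hypothetical per-frame gaze probabilities over 3 candidate objects;
# shape (num_frames, num_objects). Each row sums to 1.
gaze_probs = np.array([
    [0.70, 0.20, 0.10],  # frame 1
    [0.55, 0.35, 0.10],  # frame 2
    [0.40, 0.45, 0.15],  # frame 3
])

# Current approach: keep only the index of the most probable gaze per frame,
# discarding how uncertain the detector was.
argmax_input = gaze_probs.argmax(axis=1)  # shape (num_frames,)

# Proposed approach: pass the full distribution to the intent classifier,
# preserving per-object uncertainty (e.g., frame 3 is nearly a tie).
full_input = gaze_probs.flatten()  # shape (num_frames * num_objects,)

print(argmax_input)       # [0 0 1]
print(full_input.shape)   # (9,)
```

The argmax version hides near-ties between objects, whereas the flattened distribution lets the intent classifier weigh competing gaze targets directly.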
We are extremely grateful for the work on DETR and MGTR, on which the gazed object detector was heavily based.