The goal of this project is to detect objects from a number of visual object classes in realistic scenes. There are 7 object classes:
- Car, Van, Truck, Tram
- Pedestrian, Person_sitting
- Cyclist
The training and test data are ~6GB each (12GB in total). The data can be downloaded at http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark.

Each KITTI label corresponds to a particular image. The labels also include 3D data, which is out of scope for this project; the fields used here are listed below.
Key | Values | Description |
---|---|---|
type | 1 | String describing the type of object: [Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc or DontCare] |
truncated | 1 | Float from 0 (non-truncated) to 1 (truncated), where truncated refers to the object leaving image boundaries |
occluded | 1 | Integer (0, 1, 2, 3) indicating occlusion state: 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown |
alpha | 1 | Observation angle of the object, in the range [-pi, pi] |
bbox | 4 | 2D bounding box of object in the image (0-based index): contains left, top, right, bottom pixel coordinates |
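As a rough illustration of the table above, the sketch below reads the 2D fields from one label line. It assumes the fields appear in the order listed, separated by whitespace, with the 3D values trailing after `bbox`. The names `KittiObject` and `parse_kitti_line` and the sample line are hypothetical and only serve the example.

```python
from typing import NamedTuple, Tuple


class KittiObject(NamedTuple):
    """2D part of one KITTI label line (3D fields are ignored)."""
    obj_type: str
    truncated: float
    occluded: int
    alpha: float
    bbox: Tuple[float, float, float, float]  # left, top, right, bottom


def parse_kitti_line(line: str) -> KittiObject:
    """Parse the first 8 whitespace-separated values of a label line."""
    parts = line.split()
    return KittiObject(
        obj_type=parts[0],
        truncated=float(parts[1]),
        occluded=int(parts[2]),
        alpha=float(parts[3]),
        bbox=tuple(float(v) for v in parts[4:8]),
    )


# Illustrative line in the assumed layout; the trailing values are 3D data.
example = ("Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 "
           "1.89 0.48 1.20 1.84 1.47 8.41 0.01")
obj = parse_kitti_line(example)
print(obj.obj_type, obj.bbox)
```

Everything after the eighth value is dropped, which matches the decision to leave 3D data out of scope.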
Since the dataset has only 7481 labelled images, it is essential to incorporate data augmentations to create more variability in the available data. The following list provides the types of image augmentations performed.
- Image Embossing
- Blur (Gaussian, Average, Median)
- Brightness variation with per-channel probability
- Adding Gaussian Noise with per-channel probability
- Random dropout of pixels
All of these augmentations change pixel values only, so the bounding boxes stay valid. Geometric augmentations are harder to perform, since they require modifying every bounding box coordinate and change the aspect ratio of the images. We plan to implement geometric augmentations in the next release. Examples of image embossing, brightness/color jitter and dropout are shown below.
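To make the idea concrete, here is a minimal NumPy sketch of three of the listed augmentations (Gaussian noise, brightness variation and pixel dropout). It is not the project's implementation: the function names, the noise level, the brightness range, the dropout rate and the way the per-channel probability is applied are all assumptions chosen for illustration.

```python
import numpy as np


def add_gaussian_noise(img, rng, sigma=10.0, per_channel_prob=0.5):
    """Add Gaussian noise, independently per channel with some probability."""
    if rng.random() < per_channel_prob:
        noise = rng.normal(0.0, sigma, size=img.shape)             # per channel
    else:
        noise = rng.normal(0.0, sigma, size=img.shape[:2] + (1,))  # shared
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)


def vary_brightness(img, rng, low=0.7, high=1.3, per_channel_prob=0.5):
    """Scale brightness by one factor, or by one factor per channel."""
    if rng.random() < per_channel_prob:
        factor = rng.uniform(low, high, size=(1, 1, img.shape[2]))
    else:
        factor = rng.uniform(low, high)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)


def dropout_pixels(img, rng, drop_prob=0.05):
    """Set a random subset of pixels to zero across all channels."""
    keep = rng.random(img.shape[:2]) >= drop_prob
    return img * keep[..., None].astype(img.dtype)


rng = np.random.default_rng(0)
# Random stand-in for an H x W x 3 uint8 image.
image = rng.integers(0, 256, size=(64, 128, 3), dtype=np.uint8)
augmented = dropout_pixels(vary_brightness(add_gaussian_noise(image, rng), rng), rng)
print(augmented.shape, augmented.dtype)
```

Because none of these functions moves or resizes pixels, the labels can be reused unchanged for the augmented image.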
Adding Label Noise
To make the model robust to noisy labels, we cropped each side of the image by a number of pixels drawn from a uniform distribution over [-5px, 5px], where values less than 0 correspond to no crop.
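The cropping step could be sketched as follows. This is an interpretation under stated assumptions: the offsets are sampled as integers, one per side, and the hypothetical helper `random_side_crop` returns the crop amounts so the caller can decide how to treat the bounding boxes, which the text above does not specify.

```python
import numpy as np


def random_side_crop(img, rng, max_px=5):
    """Crop each side by an integer drawn uniformly from [-max_px, max_px].

    Negative draws are clamped to 0, i.e. that side is not cropped.
    """
    offsets = rng.integers(-max_px, max_px + 1, size=4)
    top, bottom, left, right = (int(v) for v in np.maximum(offsets, 0))
    h, w = img.shape[:2]
    cropped = img[top:h - bottom, left:w - right]
    return cropped, (top, bottom, left, right)


rng = np.random.default_rng(0)
image = np.zeros((100, 200, 3), dtype=np.uint8)  # stand-in image
cropped, (top, bottom, left, right) = random_side_crop(image, rng)
print(cropped.shape, (top, bottom, left, right))
```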
We used an 80/20 split for the training and validation sets, respectively, since a separate test set is provided.
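One simple way to produce such a split is a seeded shuffle of the image IDs. The helper name `train_val_split`, the seed and the six-digit ID format below are assumptions made for this sketch, not details taken from the codebase.

```python
import random


def train_val_split(image_ids, val_fraction=0.2, seed=0):
    """Shuffle IDs reproducibly and hold out a fraction for validation."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# Hypothetical IDs for the 7481 labelled images.
all_ids = [f"{i:06d}" for i in range(7481)]
train_ids, val_ids = train_val_split(all_ids)
print(len(train_ids), len(val_ids))  # 5985 1496
```

Fixing the seed keeps the validation set identical across runs, so results from different experiments stay comparable.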
We use mean average precision (mAP) as the performance metric.
Average Precision (AP): the precision averaged over multiple IoU (intersection over union) values.
mAP: the average of AP over all the object categories.
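For reference, the two building blocks of the metric can be sketched as below: IoU between two boxes in the (left, top, right, bottom) format used by the labels, and mAP as the mean of per-class AP. The function names are hypothetical, and the AP values in the usage lines are placeholders rather than results from this project.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))  # half of each box overlaps
placeholder_map = mean_average_precision({"Car": 0.8, "Cyclist": 0.6})
print(overlap, placeholder_map)
```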
We experimented with Faster R-CNN, SSD (single shot detector) and YOLO networks. We chose YOLO V3 as the network architecture for the following reasons:
- YOLO V3 is relatively lightweight compared to both SSD and Faster R-CNN, allowing us to iterate faster.
- Costs associated with GPUs encouraged us to stick to YOLO V3.
- We wanted to evaluate performance in real time, which requires very fast inference, and hence we chose the YOLO V3 architecture.
We implemented YOLO V3 with a Darknet backbone using the PyTorch deep learning framework.
Use the detect.py script to test the model on sample images at /data/samples. Feel free to put your own test images here. The results are saved in the /output directory. Some inference results are shown below.
- Install dependencies: pip install -r requirements.txt
- Directory structure
  - /src: contains source code
  - /data: data directory for KITTI 2D dataset
    - samples/
    - train/
      - images/ (Place all training images here)
      - yolo_labels/ (This is included in the repo)
    - test/
      - images/ (Place all test images here)
    - names.txt (Contains the object categories)
    - readme.txt (Official KITTI Data Documentation)
  - /config: contains yolo configuration file
  - /readme_resources:
- Run the main function in main.py with the required arguments. The codebase is documented with details on how to execute each function. You need to interface only with this function to reproduce the results.
- Pre-trained weights can be obtained at https://drive.google.com/open?id=1qvv5j59Vx3rg9GZCYW1WwlvQxWg4aPlL
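The labels in train/yolo_labels/ are already included in the repo, so no conversion is needed to get started. For readers who want to see how a KITTI `bbox` relates to a YOLO-style box, here is a minimal sketch. It assumes the common YOLO convention of a box centre plus width and height, all normalized by the image size; the function name is hypothetical, and this is not necessarily how the included labels were generated.

```python
def kitti_bbox_to_yolo(bbox, img_w, img_h):
    """Convert (left, top, right, bottom) pixels to normalized (cx, cy, w, h)."""
    left, top, right, bottom = bbox
    cx = (left + right) / 2 / img_w
    cy = (top + bottom) / 2 / img_h
    w = (right - left) / img_w
    h = (bottom - top) / img_h
    return cx, cy, w, h


# A 100 x 50 px box in the top-left corner of a 200 x 100 px image.
print(kitti_bbox_to_yolo((0, 0, 100, 50), img_w=200, img_h=100))
```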