Notes on debugging, fixing, and extending the original project

Debugging and fixes

Numerical overflow

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

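Cause: predictions were run with randomly initialized weights. When the input passes through ResNet-101 with training=False passed to tf.keras.layers.BatchNormalization (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch), many of the values output by C4 and C5 fall in the range (10e6, 10e8), so many of the values in P2 ~ P5 fall in the same range. The rpn_deltas computed from values of this magnitude are also very large. The RPN layer uses rpn_deltas to refine the initial proposal boxes; the width update is Wnew = Wold * exp(dw), so Wnew exceeds the representable range and ends up as nan. The boxes containing nan are then passed to the roi_align layer, which uses tf.image.crop_and_resize. That function does not accept boxes containing nan and fails with the error above.
Fix: remove the prediction step that ran before the training code.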
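
The overflow in the width update can be reproduced in isolation. The sketch below is not taken from the project: apply_width_delta and its max_dw argument are hypothetical names, and clamping dw (here to log(1000/16), a value assumed for illustration) is an alternative mitigation, not the fix described above. In float32, exp(dw) overflows to inf once dw exceeds roughly 88.7, and later box arithmetic on inf (for example inf - inf) produces nan.

```python
import numpy as np


def apply_width_delta(w_old, dw, max_dw=None):
    """Apply Wnew = Wold * exp(dw) in float32, optionally clamping dw first."""
    w_old = np.asarray(w_old, dtype=np.float32)
    dw = np.asarray(dw, dtype=np.float32)
    if max_dw is not None:
        # Assumption: cap the delta so exp() stays within float32 range.
        dw = np.minimum(dw, np.float32(max_dw))
    with np.errstate(over="ignore"):  # silence the overflow warning for the demo
        return w_old * np.exp(dw)


w_old = np.array([32.0, 64.0], dtype=np.float32)
dw = np.array([0.1, 1e6], dtype=np.float32)  # second delta mimics an exploding rpn_delta

unclipped = apply_width_delta(w_old, dw)
clipped = apply_width_delta(w_old, dw, max_dw=np.log(1000.0 / 16.0))

print(unclipped)  # second width is inf
print(clipped)    # both widths stay finite
```

Whether clamping is appropriate depends on the model; with trained weights the deltas are normally small and the clamp never triggers.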

Out of memory

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

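Cause: the explanations found online all come down to insufficient RAM, CPU, or GPU memory. In the end the cause turned out to be running out of memory at run time. The same code does not raise this error on Colab, so the versions of some libraries are suspected to play a part.
Fix: reducing the number of preloaded images resolved the problem.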
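
One way to cap how many images are held in memory at once is a small least-recently-used cache. This is only a sketch under stated assumptions: BoundedImageCache, load_fn, and fake_load are hypothetical names, and the project's actual data loader is not shown in these notes.

```python
from collections import OrderedDict


class BoundedImageCache:
    """Keeps at most max_items loaded images in memory (hypothetical helper)."""

    def __init__(self, load_fn, max_items):
        self._load_fn = load_fn        # assumed: callable mapping an index to image data
        self._max_items = max_items
        self._items = OrderedDict()

    def get(self, idx):
        if idx in self._items:
            self._items.move_to_end(idx)     # mark as most recently used
            return self._items[idx]
        image = self._load_fn(idx)
        self._items[idx] = image
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used image
        return image

    def __len__(self):
        return len(self._items)


load_calls = []


def fake_load(idx):
    """Stand-in for reading and decoding an image from disk."""
    load_calls.append(idx)
    return [idx] * 4


cache = BoundedImageCache(fake_load, max_items=2)
for idx in [0, 1, 0, 2, 1]:
    cache.get(idx)
```

Lowering max_items trades extra disk reads for a smaller memory footprint.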

Mismatched number of model parameters

ValueError: Layer #0 (named "res_net") expects 0 weight(s), but the saved weights have 318 element(s).

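Cause: the message says the model expects 0 weights while the loaded h5 file contains more than 0. The error did not occur on Colab, so it is probably related to the TensorFlow version.
Fix: before loading the weights, run one complete training pass on any single image. The weights then load without error.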

batch_imgs, batch_metas, batch_bboxes, batch_labels = train_dataset[0]
with tf.GradientTape() as tape:
    _, _, _, _ = model((np.array([batch_imgs]), np.array([batch_metas]),
                        np.array([batch_bboxes]), np.array([batch_labels])), training=True)

model.load_weights(CHECKPOINT, by_name=True)

Threshold too high

In detection/models/detector/faster_rcnn.py, the parameter

self.RCNN_MIN_CONFIDENCE

was set too high. It should be lowered; otherwise every box is filtered out.

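P.S. In the models provided by the most-starred Mask RCNN repository on GitHub, the minimum confidence is 0.9 for balloon detection, 0 for cell detection, and 0 for object detection on the COCO dataset. This suggests that the minimum confidence should be chosen per detection target. For the cell case, the code author explains the value of 0 as follows: "Don't exclude based on confidence. Since we have two classes then 0.5 is the minimum anyway as it picks between nucleus and BG."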
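
The effect of the threshold can be illustrated with a few made-up scores. The helper filter_by_confidence below is hypothetical and is not the project's filtering code; it only shows how a threshold above every score leaves no boxes.

```python
import numpy as np


def filter_by_confidence(scores, min_confidence):
    """Return the indices of detections whose score reaches min_confidence."""
    scores = np.asarray(scores, dtype=np.float32)
    return np.flatnonzero(scores >= min_confidence)


scores = [0.42, 0.61, 0.58]  # illustrative values, not measured results

kept_strict = filter_by_confidence(scores, 0.9)  # no score reaches 0.9
kept_loose = filter_by_confidence(scores, 0.5)   # the two scores above 0.5 survive

print(len(kept_strict), kept_loose.tolist())
```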

Exception handling

In detection/models/bbox_head/bbox_head.py, nms_keep can be empty (length 0). In that case the line

nms_keep = tf.concat(nms_keep, axis=0)

raises an error, so a guard is needed:

if len(nms_keep) != 0:
    nms_keep = tf.concat(nms_keep, axis=0)
else:
    nms_keep = tf.zeros([0, ], tf.int64)
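
The same guard pattern can be tried without TensorFlow. The sketch below is a NumPy analogue, not the project's code: np.concatenate also refuses an empty list, and concat_keep_indices is a hypothetical helper name.

```python
import numpy as np


def concat_keep_indices(nms_keep):
    """Concatenate per-class keep indices, returning an empty int64 array if there are none."""
    if len(nms_keep) != 0:
        return np.concatenate(nms_keep, axis=0)
    return np.zeros([0], dtype=np.int64)


empty = concat_keep_indices([])
merged = concat_keep_indices([
    np.array([0, 2], dtype=np.int64),
    np.array([5], dtype=np.int64),
])

# Without the guard, concatenating an empty list raises ValueError.
try:
    np.concatenate([], axis=0)
    raised = False
except ValueError:
    raised = True
```

Returning an empty int64 array keeps the dtype consistent with the non-empty case, so downstream indexing code does not need a special branch.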

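Modifications and additions

Added configuration options for the number of epochs, batch size, whether to fine-tune, learning rate, evaluation frequency, and normalization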

img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)

epochs = 100
batch_size = 2
learning_rate = 1e-4
checkpoint = 500
finetune = 0

opts, args = getopt.getopt(sys.argv[1:], "b:f:l:e:c:n:")

for opt, arg in opts:
    if opt == '-b':
        batch_size = int(arg)
    elif opt == '-f':
        finetune = int(arg)
    elif opt == '-l':
        learning_rate = float(arg)
    elif opt == '-e':
        epochs = int(arg)
    elif opt == '-c':
        checkpoint = int(arg)
    elif opt == '-n':
        if int(arg) == 0:
            img_mean = (0., 0., 0.)
            img_std = (1., 1., 1.)
        elif int(arg) == 1:
            # Company Articles Dataset
            img_mean = (0.9684, 0.9683, 0.9683)
            img_std = (0.1502, 0.1505, 0.1505)
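
For comparison, the same flags could be declared with argparse, which handles type conversion and defaults in one place. This is a hedged alternative sketch, not the project's code: parse_args and the dest names are chosen here for illustration, and the defaults simply mirror the values listed above.

```python
import argparse


def parse_args(argv):
    """Parse the training flags; defaults mirror the values used in these notes."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("-b", dest="batch_size", type=int, default=2)
    parser.add_argument("-f", dest="finetune", type=int, default=0)
    parser.add_argument("-l", dest="learning_rate", type=float, default=1e-4)
    parser.add_argument("-e", dest="epochs", type=int, default=100)
    parser.add_argument("-c", dest="checkpoint", type=int, default=500)
    # None keeps the default (ImageNet) statistics; 0 and 1 select the alternatives.
    parser.add_argument("-n", dest="normalization", type=int, choices=[0, 1], default=None)
    return parser.parse_args(argv)


args = parse_args(["-b", "4", "-l", "0.001", "-n", "0"])
print(args.batch_size, args.learning_rate, args.epochs, args.normalization)
```

In a real script the call would be parse_args(sys.argv[1:]).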

Normalization can be set to no normalization, ImageNet normalization, or normalization with the Company Articles dataset statistics

img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)

    elif opt == '-n':
        if int(arg) == 0:
            img_mean = (0., 0., 0.)
            img_std = (1., 1., 1.)
        elif int(arg) == 1:
            # Company Articles Dataset
            img_mean = (0.9684, 0.9683, 0.9683)
            img_std = (0.1502, 0.1505, 0.1505)
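
How the mean and std tuples are typically applied is sketched below. The normalize helper is hypothetical; the project's own preprocessing code is not shown in these notes, so per-channel (img - mean) / std is an assumption based on common practice.

```python
import numpy as np


def normalize(img, mean, std):
    """Per-channel normalization of an H x W x 3 image: (img - mean) / std."""
    img = np.asarray(img, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (img - mean) / std


imagenet_mean = (123.675, 116.28, 103.53)
imagenet_std = (58.395, 57.12, 57.375)

# A 2 x 2 image whose every pixel equals the ImageNet mean normalizes to zeros.
img = np.tile(np.asarray(imagenet_mean, dtype=np.float32), (2, 2, 1))
zeros = normalize(img, imagenet_mean, imagenet_std)

# Mean 0 and std 1 leave the image unchanged ("no normalization").
unchanged = normalize(img, (0., 0., 0.), (1., 1., 1.))
```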

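Split the training code into separate .py files that run prediction and save the model on a per-batch or per-epoch basis
Added output of the classification and localization losses for both the RPN and RCNN parts of the network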

print('Epoch:', epoch, 'Batch:', batch, 'Loss:', loss_value.numpy(),
      'RPN Class Loss:', rpn_class_loss.numpy(),
      'RPN Bbox Loss:', rpn_bbox_loss.numpy(),
      'RCNN Class Loss:', rcnn_class_loss.numpy(),
      'RCNN Bbox Loss:', rcnn_bbox_loss.numpy())

Added model saving

if batch % checkpoint == 0 and not batch == 0:
    model.save_weights('./model/epoch_' + str(epoch) + '_batch_' + str(batch) + '.h5')

Added model loading

model.load_weights('model/faster_rcnn.h5', by_name=True)

During training, added prediction on the validation set, computation of AP and AR, and saving of the prediction results

dataset_results = []
imgIds = []

for idx in range(len(test_dataset)):
    if idx % 10 == 9 or idx + 1 == len(test_dataset):
        print(str(idx + 1) + ' / ' + str(len(test_dataset)))

    img, img_meta, _, _ = test_dataset[idx]

    proposals = model.simple_test_rpn(img, img_meta)
    res = model.simple_test_bboxes(img, img_meta, proposals)

    # visualize.display_instances(ori_img, res['rois'], res['class_ids'],
    #                             test_dataset.get_categories(), scores=res['scores'])

    image_id = test_dataset.img_ids[idx]
    imgIds.append(image_id)

    for pos in range(res['class_ids'].shape[0]):
        results = dict()
        results['score'] = float(res['scores'][pos])
        results['category_id'] = test_dataset.label2cat[int(res['class_ids'][pos])]
        y1, x1, y2, x2 = [float(num) for num in list(res['rois'][pos])]
        results['bbox'] = [x1, y1, x2 - x1 + 1, y2 - y1 + 1]
        results['image_id'] = image_id
        dataset_results.append(results)

if not dataset_results == []:
    with open('result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json', 'w') as f:
        f.write(json.dumps(dataset_results))

    coco_dt = test_dataset.coco.loadRes(
        'result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json')
    cocoEval = COCOeval(test_dataset.coco, coco_dt, 'bbox')
    cocoEval.params.imgIds = imgIds

    cocoEval.evaluate()
    cocoEval.accumulate()
    cocoEval.summarize()

    with open('result/evaluation.txt', 'a+') as f:
        content = 'Epoch: ' + str(epoch) + ' Batch: ' + str(batch) \
                  + '\n' + str(cocoEval.stats) + '\n'
        f.write(content)
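
The box conversion inside the loop above can be checked on its own. The sketch below isolates it in a hypothetical helper, to_coco_bbox: the model's [y1, x1, y2, x2] corners become the [x, y, width, height] layout used in COCO result files, with the same +1 on width and height as in the code above.

```python
def to_coco_bbox(roi):
    """Convert [y1, x1, y2, x2] corners to COCO-style [x, y, width, height]."""
    y1, x1, y2, x2 = [float(num) for num in roi]
    return [x1, y1, x2 - x1 + 1, y2 - y1 + 1]


bbox = to_coco_bbox([10, 20, 30, 60])
print(bbox)  # [20.0, 10.0, 41.0, 21.0]
```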
