Record of debugging, fixes, and additions made to the original project
NoBa1anc3 edited this page May 11, 2020
·
2 revisions
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
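- Cause: predictions were run with randomly initialized weights. As the input passes through ResNet-101, tf.keras.layers.BatchNormalization is called with training=False (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch). In practice, many of the values output by C4 and C5 have magnitudes in the range (10e6, 10e8), so many of the values in P2 to P5 fall in that range as well. The rpn_deltas computed from these large values are also very large, and the RPN layer uses rpn_deltas to refine the initial proposal boxes. The width update is Wnew = Wold * exp(dw), so Wnew goes out of range and is marked as nan. The boxes containing nan then reach the roi_align layer, which uses tf.image.crop_and_resize; that function does not accept bounding boxes containing nan values and fails with the error above.
- Fix: remove the prediction step that ran before the training code.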
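The overflow described above can be reproduced in isolation. The following is a minimal NumPy sketch, not code from the project: the function name `apply_width_delta`, the sample values, and the clamp bound `log(1000 / 16)` (a bound often used in Faster R-CNN style implementations) are assumptions made for illustration.

```python
import math

import numpy as np

# Assumed clamp bound for dw; not taken from the project.
MAX_DELTA = math.log(1000.0 / 16.0)


def apply_width_delta(w_old, dw, max_delta=None):
    """Return w_old * exp(dw) in float32, optionally clamping dw first."""
    w_old = np.asarray(w_old, dtype=np.float32)
    dw = np.asarray(dw, dtype=np.float32)
    if max_delta is not None:
        dw = np.clip(dw, -max_delta, max_delta)
    # exp() of a huge delta overflows float32 to inf; silence the warning here.
    with np.errstate(over='ignore'):
        return w_old * np.exp(dw)


w_old = np.array([32.0, 64.0], dtype=np.float32)
dw = np.array([0.1, 1e6], dtype=np.float32)  # second delta is unreasonably large

w_raw = apply_width_delta(w_old, dw)                 # second width is non-finite
w_clipped = apply_width_delta(w_old, dw, MAX_DELTA)  # all widths stay finite
finite_mask = np.isfinite(w_raw)                     # marks boxes that are safe to crop
```

Once a width is inf, later arithmetic such as inf - inf produces nan. A mask like `finite_mask` is one way to drop such boxes before they are cropped; clamping the deltas is another.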
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
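- Cause: the explanations found online all come down to insufficient memory, CPU, or GPU memory, and the cause here did turn out to be a resource shortage at run time. The same code does not raise this error on Colab, however, so a difference in some library version is the suspected trigger.
- Fix: reducing the number of preloaded images resolved the problem.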
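One way to cap how many images are held in memory at once is a small bounded cache in front of the image loader. This is a hedged sketch only: `BoundedImageCache` and its `load_fn` argument are hypothetical names, and the project itself may simply lower a preload count instead.

```python
from collections import OrderedDict


class BoundedImageCache:
    """Keep at most max_items loaded images, evicting the least recently used."""

    def __init__(self, load_fn, max_items):
        self._load_fn = load_fn      # hypothetical callable: key -> image
        self._max_items = max_items
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            # Mark as most recently used and return the cached image.
            self._cache.move_to_end(key)
            return self._cache[key]
        value = self._load_fn(key)
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            # Drop the least recently used entry to bound memory use.
            self._cache.popitem(last=False)
        return value

    def __len__(self):
        return len(self._cache)
```

With `max_items` set to a small number, memory use stays roughly constant regardless of dataset size, at the cost of reloading images that were evicted.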
ValueError: Layer #0 (named "res_net") expects 0 weight(s), but the saved weights have 318 element(s).
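- Cause: the message says the model expects 0 weights, while the loaded h5 file contains more than 0. The error did not occur on Colab, so it is probably a TensorFlow version issue.
- Fix: run one complete training pass on any single image before loading the weights; after that, loading no longer fails.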
# Run one training-mode forward pass so the model creates its variables
batch_imgs, batch_metas, batch_bboxes, batch_labels = train_dataset[0]
with tf.GradientTape() as tape:
    _, _, _, _ = model((np.array([batch_imgs]), np.array([batch_metas]),
                        np.array([batch_bboxes]), np.array([batch_labels])), training=True)
# The weights can now be loaded into the built model
model.load_weights(CHECKPOINT, by_name=True)
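A likely explanation for why the warm-up pass helps is that a subclassed Keras model creates its variables only when it is first called, so before that call it reports no weights. The sketch below illustrates this with a hypothetical `TinyModel`; it is not the project's ResNet, and the layer sizes are arbitrary.

```python
import numpy as np
import tensorflow as tf


class TinyModel(tf.keras.Model):
    """Hypothetical stand-in for a subclassed model such as the project's backbone."""

    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(4)

    def call(self, inputs, training=False):
        return self.dense(inputs)


model = TinyModel()
weights_before = len(model.weights)   # no variables exist yet

# The first call builds the layers and creates kernel and bias.
_ = model(np.zeros((1, 8), dtype=np.float32), training=True)
weights_after = len(model.weights)
```

Loading weights after the first call gives the saved arrays existing variables to be assigned to.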
In detection/models/detector/faster_rcnn.py, the parameter
self.RCNN_MIN_CONFIDENCE
is set too high and should be lowered; otherwise every box is filtered out.
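PS: in the models provided with the most-starred Mask RCNN code on GitHub, the minimum confidence is 0.9 for balloon detection, 0 for cell detection, and 0 for object detection on the COCO dataset. This suggests that the minimum confidence should be set differently for different detection targets. For the cell setting of 0, the code author explains: "Don't exclude based on confidence. Since we have two classes then 0.5 is the minimum anyway as it picks between nucleus and BG."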
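To show why a high minimum confidence can remove every box, here is a minimal NumPy sketch. The function name `filter_by_confidence` and the scores are made up for illustration and are not taken from faster_rcnn.py.

```python
import numpy as np


def filter_by_confidence(scores, min_confidence):
    """Return the indices of detections whose score is at least min_confidence."""
    scores = np.asarray(scores, dtype=np.float32)
    return np.where(scores >= min_confidence)[0]


# Hypothetical scores from a model that is not yet well trained.
scores = [0.35, 0.62, 0.80]

keep_strict = filter_by_confidence(scores, 0.9)  # nothing survives
keep_loose = filter_by_confidence(scores, 0.5)   # two detections survive
```

With a threshold of 0.9 the early-training scores above leave no detections at all, while 0.5 keeps the two strongest ones.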
In detection/models/bbox_head/bbox_head.py, if nms_keep is empty, the line
nms_keep = tf.concat(nms_keep, axis=0)
raises an error, so a guard is needed:
if len(nms_keep) != 0:
    nms_keep = tf.concat(nms_keep, axis=0)
else:
    nms_keep = tf.zeros([0, ], tf.int64)
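The same guard pattern can be tried without TensorFlow. The sketch below is a NumPy analogue, not the project's code: `np.concatenate` stands in for `tf.concat` (it also rejects an empty list), and `concat_keep_indices` is a hypothetical helper name.

```python
import numpy as np


def concat_keep_indices(nms_keep):
    """Concatenate per-class index arrays, or return an empty int64 array if there are none."""
    if len(nms_keep) != 0:
        return np.concatenate(nms_keep, axis=0)
    # Nothing was kept: return a well-typed empty result instead of raising.
    return np.zeros([0], dtype=np.int64)


empty = concat_keep_indices([])
merged = concat_keep_indices([np.array([0, 3], dtype=np.int64),
                              np.array([5], dtype=np.int64)])
```

Returning an empty array with the expected dtype lets the code that consumes the indices run unchanged when no box survives NMS.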
- Added configuration options for the number of epochs, batch size, whether to fine-tune, learning rate, evaluation frequency, and normalization
import getopt
import sys

# Defaults (ImageNet normalization)
img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)
epochs = 100
batch_size = 2
learning_rate = 1e-4
checkpoint = 500
finetune = 0

# Each option letter takes an argument, hence the trailing colons
opts, args = getopt.getopt(sys.argv[1:], "b:f:l:e:c:n:")
for opt, arg in opts:
    if opt == '-b':
        batch_size = int(arg)
    elif opt == '-f':
        finetune = int(arg)
    elif opt == '-l':
        learning_rate = float(arg)
    elif opt == '-e':
        epochs = int(arg)
    elif opt == '-c':
        checkpoint = int(arg)
    elif opt == '-n':
        if int(arg) == 0:
            # No normalization
            img_mean = (0., 0., 0.)
            img_std = (1., 1., 1.)
        elif int(arg) == 1:
            # Company Articles Dataset
            img_mean = (0.9684, 0.9683, 0.9683)
            img_std = (0.1502, 0.1505, 0.1505)
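The same options could also be expressed with the standard-library argparse module, which adds type checking and a help message. This is an alternative sketch, not the project's code: the `parse_args` helper, the `dest` names, and the use of `-n 2` to select ImageNet normalization explicitly are assumptions (in the original, ImageNet is simply the default when `-n` is 0 or 1 is not given).

```python
import argparse

# Normalization presets: id -> (mean, std). Id 2 is an assumed label for the default.
NORMALIZATION = {
    0: ((0., 0., 0.), (1., 1., 1.)),                              # no normalization
    1: ((0.9684, 0.9683, 0.9683), (0.1502, 0.1505, 0.1505)),      # Company Articles Dataset
    2: ((123.675, 116.28, 103.53), (58.395, 57.12, 57.375)),      # ImageNet
}


def parse_args(argv=None):
    """Parse training options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description='Training options (sketch)')
    parser.add_argument('-b', dest='batch_size', type=int, default=2)
    parser.add_argument('-f', dest='finetune', type=int, default=0)
    parser.add_argument('-l', dest='learning_rate', type=float, default=1e-4)
    parser.add_argument('-e', dest='epochs', type=int, default=100)
    parser.add_argument('-c', dest='checkpoint', type=int, default=500)
    parser.add_argument('-n', dest='normalization', type=int,
                        choices=sorted(NORMALIZATION), default=2)
    args = parser.parse_args(argv)
    args.img_mean, args.img_std = NORMALIZATION[args.normalization]
    return args
```

For example, `parse_args(['-b', '4', '-n', '0'])` yields a batch size of 4 with normalization disabled.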
- Normalization can be set to no normalization, ImageNet normalization, or Company Articles dataset normalization
# Default: ImageNet normalization
img_mean = (123.675, 116.28, 103.53)
img_std = (58.395, 57.12, 57.375)
# Excerpt from the option-parsing loop
elif opt == '-n':
    if int(arg) == 0:
        img_mean = (0., 0., 0.)
        img_std = (1., 1., 1.)
    elif int(arg) == 1:
        # Company Articles Dataset
        img_mean = (0.9684, 0.9683, 0.9683)
        img_std = (0.1502, 0.1505, 0.1505)
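- Split the training scripts that predict and save the model on a per-batch or per-epoch basis into separate .py files
- Added output of the classification and localization losses for both the RPN and RCNN parts of the network
print('Epoch:', epoch, 'Batch:', batch, 'Loss:', loss_value.numpy(),
      'RPN Class Loss:', rpn_class_loss.numpy(),
      'RPN Bbox Loss:', rpn_bbox_loss.numpy(),
      'RCNN Class Loss:', rcnn_class_loss.numpy(),
      'RCNN Bbox Loss:', rcnn_bbox_loss.numpy())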
- Added model saving
# Save the weights every `checkpoint` batches, skipping batch 0
if batch % checkpoint == 0 and batch != 0:
    model.save_weights('./model/epoch_' + str(epoch) + '_batch_' + str(batch) + '.h5')
- Added model loading
model.load_weights('model/faster_rcnn.h5', by_name=True)
- Added validation-set prediction during training, with AP and AR computed and the prediction results saved
dataset_results = []
imgIds = []
for idx in range(len(test_dataset)):
    # Report progress every 10 images and on the last image
    if idx % 10 == 9 or idx + 1 == len(test_dataset):
        print(str(idx + 1) + ' / ' + str(len(test_dataset)))
    img, img_meta, _, _ = test_dataset[idx]
    proposals = model.simple_test_rpn(img, img_meta)
    res = model.simple_test_bboxes(img, img_meta, proposals)
    # visualize.display_instances(ori_img, res['rois'], res['class_ids'],
    #                             test_dataset.get_categories(), scores=res['scores'])
    image_id = test_dataset.img_ids[idx]
    imgIds.append(image_id)
    # Convert each detection into a COCO-style result entry
    for pos in range(res['class_ids'].shape[0]):
        results = dict()
        results['score'] = float(res['scores'][pos])
        results['category_id'] = test_dataset.label2cat[int(res['class_ids'][pos])]
        y1, x1, y2, x2 = [float(num) for num in list(res['rois'][pos])]
        results['bbox'] = [x1, y1, x2 - x1 + 1, y2 - y1 + 1]
        results['image_id'] = image_id
        dataset_results.append(results)
# Only evaluate when at least one detection was produced
if dataset_results:
    with open('result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json', 'w') as f:
        f.write(json.dumps(dataset_results))
    coco_dt = test_dataset.coco.loadRes(
        'result/epoch_' + str(epoch) + '_batch_' + str(batch) + '.json')
    cocoEval = COCOeval(test_dataset.coco, coco_dt, 'bbox')
    cocoEval.params.imgIds = imgIds
    cocoEval.evaluate()
    cocoEval.accumulate()
    cocoEval.summarize()
    # Append the AP/AR statistics to a running log
    with open('result/evaluation.txt', 'a+') as f:
        content = 'Epoch: ' + str(epoch) + ' Batch: ' + str(batch) \
                  + '\n' + str(cocoEval.stats) + '\n'
        f.write(content)
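The box conversion inside the loop above can be isolated as a small helper. This is a sketch that mirrors the arithmetic in the snippet (corner order y1, x1, y2, x2 and the `+ 1` width and height convention); the helper names `roi_to_coco_bbox` and `make_result` are hypothetical and do not appear in the project.

```python
def roi_to_coco_bbox(roi):
    """Convert [y1, x1, y2, x2] corners to a COCO-style [x, y, width, height] box."""
    y1, x1, y2, x2 = [float(v) for v in roi]
    # Same inclusive-pixel convention as the evaluation snippet: size = max - min + 1.
    return [x1, y1, x2 - x1 + 1, y2 - y1 + 1]


def make_result(image_id, category_id, roi, score):
    """Build one detection entry in the layout written to the results JSON."""
    return {
        'image_id': image_id,
        'category_id': category_id,
        'bbox': roi_to_coco_bbox(roi),
        'score': float(score),
    }


entry = make_result(image_id=7, category_id=1, roi=[10, 20, 30, 60], score=0.9)
```

Here `entry['bbox']` is `[20.0, 10.0, 41.0, 21.0]`: x and y come from the top-left corner, and width and height include both end pixels.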