Training consistently stalling #2089
Comments
Hi @gomesteixeira,

I believe the issue here is that the centered instance model is expecting a
We fixed this in #2054, but this didn't make it into the newest release (v1.4.1). While you could install from source to get the latest changes, I think you could just fix your issue by setting the input scaling to 1.0 in your centered instance model. You're already cropping to a decently small size, so I don't think you want to be scaling it down any further.

Let us know if that works!

Cheers,

Talmo
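For anyone hitting the same thing and editing configs by hand, here is a minimal sketch of where that setting lives. The key path data -> preprocessing -> input_scaling and the profile filename are assumptions about the SLEAP 1.x training profile layout, not something confirmed in this thread, so verify them against your own exported profile:

    # Rough sketch (assumed key path): make the centered instance model train
    # on full-resolution crops by setting input scaling to 1.0 in a saved profile.
    import json

    profile_path = "centered_instance.json"  # hypothetical profile filename

    with open(profile_path) as f:
        cfg = json.load(f)

    # Assumed location of the input scaling field in SLEAP 1.x profiles.
    cfg["data"]["preprocessing"]["input_scaling"] = 1.0

    with open(profile_path, "w") as f:
        json.dump(cfg, f, indent=2)

In the GUI, the equivalent is the Input Scaling field in the centered instance model's training settings.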
Hi talmo,

Thanks for your quick response! That makes sense. To be honest, I feel a little silly now, because that means I did change something else about the hyperparameters compared to what I was doing before, when it was working. I'm happy to hear this will be fixed in a future release. I would also suggest explaining more clearly in the docs/tutorial how these crops work and interact with one another (although maybe I just misunderstood/misinterpreted it).

In any case, my centered instance model was still not working (the centroid model still trains just fine), although now it was throwing an error, so I guess that is an improvement! The error dialog box points to the console log, which I'm including below.

Console Log
From reading the log, I felt this error was caused by a lack of memory on my PC (which I don't think I experienced before), so I halved the batch size for the centered instance model training and it is working now.

I should mention that when this error was thrown I used the centroid model I had just trained, instead of training a new one. Do you think it's a better idea to train each one at a time? I don't see why training the centroid model just beforehand would contribute to the issue, but to be honest I didn't test that. Let me know if it would be useful to test it!

In any case, a lower batch size might lead to overfitting, which, given the goals of my implementation, I'm very adamant about preventing, so I would love to hear your insights on other ways I could address this (apart from buying more memory, which I will probably have to do in the future anyway).

Either way, I think you can close this issue. Thank you again for helping!
Hi @gomesteixeira,

Ok, great!
Yeah, I think with the fix it won't really matter anymore, but maybe we should note this somewhere anyhow. Regarding the new issue: yes, it's related to running out of memory on your GPU. Here's the culprit:
In top-down models, SLEAP will automatically calculate the size of the bounding box to crop around the animals based on the sizes present in your labels. It seems like there is at least one instance that is quite large (1000x1000 px or larger). I can't tell what the original size of your full frames is from the logs, but I'm guessing that that size is way too big no matter what. This often happens when you have a stray annotation where maybe you forgot to mark a couple of nodes as "not visible", so the bounding box is huge.

You should definitely look through your labels to make sure one of them isn't messed up, but either way, you can circumvent this problem by just setting the bounding box crop size manually. From the original logs, it looks like they were

Let us know if that works for you :)

Cheers,

Talmo
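A rough way to act on the "look through your labels" advice is to scan the labels file for instances with oversized bounding boxes. This sketch assumes the SLEAP 1.x Python API (sleap.load_file, Instance.numpy()); the filename and the 512 px threshold are placeholders, not values from this thread:

    # Rough sketch: flag instances whose bounding box is suspiciously large,
    # e.g. because a stray node was left placed instead of marked "not visible".
    import numpy as np
    import sleap

    labels = sleap.load_file("labels.v001.slp")  # placeholder filename
    MAX_SIDE = 512  # placeholder threshold (px); pick what is plausible for one rat

    for lf in labels.labeled_frames:
        for inst in lf.instances:
            pts = inst.numpy()  # (n_nodes, 2) array, NaN for invisible nodes
            if np.all(np.isnan(pts)):
                continue
            width = np.nanmax(pts[:, 0]) - np.nanmin(pts[:, 0])
            height = np.nanmax(pts[:, 1]) - np.nanmin(pts[:, 1])
            if max(width, height) > MAX_SIDE:
                print(f"{lf.video.filename} frame {lf.frame_idx}: "
                      f"bbox ~{width:.0f} x {height:.0f} px")

Any frame it prints is worth opening in the GUI to check for a misplaced node; after cleaning up (or instead of it), the crop size can be set manually in the training configuration rather than left on auto, as Talmo suggests.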
Bug description
Hi everyone!
I’m encountering an issue where model training stalls consistently at the same point. I posted about this in a thread where a similar (if not the same) problem was discussed; however, I’m now raising an issue because I have tried more troubleshooting steps and can add more details.
Expected behaviour
My goal is to track two (for now) rats in a large environment, recorded from above, so I am using the multi-animal top-down model. Initially I had defined a quite complex skeleton with 11 keypoints/body parts. I made sure to use an adequate anchor point, defined a sufficiently large crop size, and used -180° to 180° for the rotation augmentation (on both the centroid and centered instance models). I didn’t really touch any of the other hyperparameters.
That was working OK-ish, meaning that I could train the model on some 80 labeled frames and it would distinguish the two rats, but it would not properly identify the body parts. Because of that, I made a new project in which I removed some body parts that were not visible most of the time, and this time I also set the maximum number of instances (to 2). I didn't change anything else from what I was doing before, and this is when the issue showed up.
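For context, here is a sketch of where the settings described above sit in an exported SLEAP training profile; the key paths, the "thorax" anchor node, and the 256 px crop size are assumptions/placeholders based on my reading of the 1.x profile layout, not values taken from this report:

    # Rough sketch (assumed key paths): anchor part, crop size, and rotation
    # augmentation in a centered-instance training profile JSON.
    import json

    with open("centered_instance.json") as f:  # hypothetical profile filename
        cfg = json.load(f)

    cfg["data"]["instance_cropping"]["center_on_part"] = "thorax"  # hypothetical anchor node
    cfg["data"]["instance_cropping"]["crop_size"] = 256            # placeholder crop size (px)
    cfg["optimization"]["augmentation_config"]["rotate"] = True
    cfg["optimization"]["augmentation_config"]["rotation_min_angle"] = -180
    cfg["optimization"]["augmentation_config"]["rotation_max_angle"] = 180

    with open("centered_instance.json", "w") as f:
        json.dump(cfg, f, indent=2)

(The maximum number of instances is, as far as I understand, an inference-time setting rather than part of the training profile.)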
Actual behaviour
The training for the centroid model apparently works fine, but the training for the centered instance model always stalls after 198 or 199 batches in training epoch 1. By stalling I mean that nothing happens: it just keeps running without advancing. When I press the ‘Stop Early’ button, nothing happens either, but ‘Cancel Training’ does work. I have left it running for two days (whereas before it would be done in a couple of hours) to make sure it never advances.
By inspecting the conda console, I see ‘Finished training centroid’, then it starts the centered instance training, and throws the following:
On different troubleshooting runs (I'll describe my troubleshooting attempts below) I have seen different numbers for the shape; however, this specific error always comes up. I'm attaching the console logs that pertain only to the centered instance model training, but please let me know if it would be useful to attach the whole console output.
Your personal set up
OS: Windows-10-10.0.22621-SP0
Version(s): SLEAP v1.3.3, python 3.7.12
SLEAP installation method (listed here):
Environment packages
Logs
Troubleshooting attempts / other thoughts
I have tried labelling more data, and also reducing the plateau patience and the batch size for the centered instance training (in this case, it stalls at 195 instead of 199 batches). I have also tried both the baseline and different previously trained models (which show up in the dropdown list).
Because of this, I concluded the issue is probably with my machine, so I tried trivial things like rebooting, freeing up disk space, and defragmenting the disk (which probably have nothing to do with it anyway, but are always worth a try). I also updated my Nvidia drivers.
Since I am using the same data, the same machine, and the same SLEAP version, I went back to my initial project (which, as I mentioned, was working, in the sense that training would conclude and I would get predictions from the model) and ran the training again, and now I encounter the same issue there as well. Hence, I am very confident the problem somehow results from my machine and not from model specifics, but I can’t understand what it could be. Between running the initial project and this new one where the issue came up, there was the Christmas break, so now I’m wondering whether some sort of Windows update could have messed it up? I’ve had that experience in the past with other software.
I’ll be happy to hear your suggestions/insights about how I can resolve this issue.
Screenshots
How to reproduce
I just go to Predict > Run Training and click 'Run' after setting the desired hyperparameters.
I see the graph for the centroid training, showing its progression. I also see the graph for the centered instance model, but it only shows light-blue points (as opposed to the previous graph). I don't think I can provide more details about how to reproduce this issue, unfortunately.