Index 339 out of range for dataset of size 339 <-- save_to_file() #6389

Open
jaggzh opened this issue Nov 8, 2023 · 2 comments

jaggzh commented Nov 8, 2023

Describe the bug

This occurs when saving out some Audio() data. The data is audio recordings with associated 'sentences'. (They use the audio 'bytes' approach because they're clips within larger audio files.) The code is below the traceback (I can't upload the voice audio/text; it isn't even mine).

Traceback (most recent call last):
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 156, in <module>
    create_dataset(args)
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 138, in create_dataset
    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1531, in save_to_disk
    for kwargs in kwargs_per_job:
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1508, in <genexpr>
    "shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 4609, in shard
    return self.select(
           ^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3797, in select
    return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3857, in _select_contiguous
    _check_valid_indices_value(start, len(self))
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 648, in _check_valid_indices_value
    raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 339 out of range for dataset of size 339.
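For context on where the failing index comes from: `save_to_disk` derives `num_shards` from `max_shard_size` and then takes contiguous shards. If the derived shard count ever exceeds the number of rows, the start offset of the last shard equals the dataset length, which is exactly the check that raises here. This is a hedged sketch of the contiguous-shard arithmetic (the helper name `contiguous_shard_start` and the shard count of 340 are illustrative assumptions, not the actual values computed inside `datasets`):

```python
def contiguous_shard_start(num_rows: int, num_shards: int, index: int) -> int:
    # Contiguous sharding splits rows so the first (num_rows % num_shards)
    # shards each get one extra row; this returns the start row of a shard.
    div, mod = divmod(num_rows, num_shards)
    return index * div + min(index, mod)

num_rows = 339
num_shards = 340  # hypothetical: shard count larger than row count
start = contiguous_shard_start(num_rows, num_shards, index=339)
print(start)  # 339 == num_rows, so a start-index check would fail
```

If something like this is what happens, the small rows (each shard capped at 50MB of audio bytes) combined with a size-based shard count could push `num_shards` past 339, making the last shard start at row 339 of a 339-row dataset.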

Steps to reproduce the bug

(I had to set the default max batch size down due to a different bug... or maybe it's related: #5717)

#!/usr/bin/env python3
import argparse
import os
from pathlib import Path
import soundfile as sf

import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE=35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio

import numpy as np
import librosa
import sys
import io
import logging

logging.basicConfig(level=logging.DEBUG, filename='debug.log', filemode='w',
                    format='%(name)s - %(levelname)s - %(message)s')

# Define the arguments for the command-line interface
def parse_args():
    parser = argparse.ArgumentParser(description="Create a Huggingface dataset from labeled audio files.")
    parser.add_argument("--indir_labeled", action="append", help="Directory containing labeled audio files.", required=True)
    parser.add_argument("--outds", help="Path to save the dataset file.", required=True)
    parser.add_argument("--max_clips", type=int, help="Max count of audio samples to add to the dataset.", default=None)
    parser.add_argument("-r", "--sr", type=int, help="Sample rate for the audio files.", default=16000)
    parser.add_argument("--no-resample", action="store_true", help="Disable resampling of the audio files.")
    parser.add_argument("--max_clip_secs", type=float, help="Max length of audio clips in seconds.", default=3.0)
    parser.add_argument("-v", "--verbose", action='count', default=1, help="Increase verbosity")
    return parser.parse_args()

# Convert the NumPy arrays to audio bytes in WAV format
def numpy_to_bytes(audio_array, sampling_rate=16000):
    with io.BytesIO() as bytes_io:
        sf.write(bytes_io, audio_array, samplerate=sampling_rate,
                 format='wav', subtype='FLOAT') # float32
        return bytes_io.getvalue()

# Function to find audio and label files in a directory
def find_audio_label_pairs(indir_labeled):
    audio_label_pairs = []
    for root, _, files in os.walk(indir_labeled):
        for file in files:
            if file.endswith(('.mp3', '.wav', '.aac', '.flac')):
                audio_path = Path(root) / file
                if args.verbose>1:
                    print(f'File: {audio_path}')
                label_path = audio_path.with_suffix('.labels.txt')
                if label_path.exists():
                    if args.verbose>0:
                        print(f'  Pair: {audio_path}')
                    audio_label_pairs.append((audio_path, label_path))
    return audio_label_pairs

def process_audio_label_pair(audio_path, label_path, sampling_rate, no_resample, max_clip_secs):
    # Read the label file
    with open(label_path, 'r') as label_file:
        labels = label_file.readlines()

    # Load the full audio file
    full_audio, current_sr = sf.read(audio_path)
    if not no_resample and current_sr != sampling_rate:
        # You can use librosa.resample here if librosa is available
        full_audio = librosa.resample(full_audio, orig_sr=current_sr, target_sr=sampling_rate)

    audio_segments = []
    sentences = []

    # Process each label
    for label in labels:
        start_secs, end_secs, label_text = label.strip().split('\t')
        start_sample = int(float(start_secs) * sampling_rate)
        end_sample = int(float(end_secs) * sampling_rate)

        # Extract segment and truncate or pad to max_clip_secs
        audio_segment = full_audio[start_sample:end_sample]
        max_samples = int(max_clip_secs * sampling_rate)
        if len(audio_segment) > max_samples: # Truncate
            audio_segment = audio_segment[:max_samples]
        elif len(audio_segment) < max_samples: # Pad
            padding = np.zeros(max_samples - len(audio_segment), dtype=audio_segment.dtype)
            audio_segment = np.concatenate((audio_segment, padding))

        audio_segment = numpy_to_bytes(audio_segment)

        audio_data = {
            'path': str(audio_path),
            'bytes': audio_segment,
        }

        audio_segments.append(audio_data)
        sentences.append(label_text)

    return audio_segments, sentences

# Main function to create the dataset
def create_dataset(args):
    audio_label_pairs = []
    for indir in args.indir_labeled:
        audio_label_pairs.extend(find_audio_label_pairs(indir))

    # Initialize our dataset data
    dataset_data = {
        'path': [],        # This will be a list of strings
        'audio': [],       # This will be a list of dictionaries
        'sentence': [],    # This will be a list of strings
    }

    # Process each audio-label pair and add the data to the dataset
    for audio_path, label_path in audio_label_pairs[:args.max_clips]:
        audio_segments, sentences = process_audio_label_pair(audio_path, label_path, args.sr, args.no_resample, args.max_clip_secs)
        if audio_segments and sentences:
            for audio_data, sentence in zip(audio_segments, sentences):
                if args.verbose>1:
                    print(f'Appending {audio_data["path"]}')
                dataset_data['path'].append(audio_data['path'])
                dataset_data['audio'].append({
                    'path': audio_data['path'],
                    'bytes': audio_data['bytes'],
                })
                dataset_data['sentence'].append(sentence)

    features = Features({
      'path': Value('string'), # Path is redundant in common voice set also
      'audio': Audio(sampling_rate=16000),
      'sentence': Value('string'),
    })
    hf_dataset = Dataset.from_dict(dataset_data, features=features)

    for key in dataset_data:
        for i, item in enumerate(dataset_data[key]):
            if item is None or (isinstance(item, bytes) and len(item) == 0):
                logging.error(f"Invalid {key} at index {i}: {item}")
                import ipdb; ipdb.set_trace(context=16); pass

    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')
    # try:
    #     hf_dataset.save_to_disk(args.outds)
    # except TypeError as e:
    #     # If there's a TypeError, log the exception and the dataset data that might have caused it
    #     logging.exception("An error occurred while saving the dataset.")
    #     import ipdb; ipdb.set_trace(context=16); pass
    #     for key in dataset_data:
    #         logging.debug(f"{key} length: {len(dataset_data[key])}")
    #         if key == 'audio':
    #             # Log the first 100 bytes of the audio data to avoid huge log files
    #             for i, audio in enumerate(dataset_data[key]):
    #                 logging.debug(f"Audio {i}: {audio['bytes'][:100]}")
    #     raise

# Run the script
if __name__ == "__main__":
    args = parse_args()
    create_dataset(args)

Expected behavior

It shouldn't fail.

Environment info

  • datasets version: 2.14.7.dev0
  • Platform: Linux-6.1.0-13-amd64-x86_64-with-glibc2.36
  • Python version: 3.11.2
  • huggingface_hub version: 0.17.3
  • PyArrow version: 13.0.0
  • Pandas version: 2.1.2
  • fsspec version: 2023.9.2
@mariosasko (Collaborator)
Hi! Can you make the above reproducer self-contained by adding code that generates the data?
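(Not the original data, but one way to make the reproducer self-contained would be a synthetic generator producing the layout the script expects: an audio file plus a tab-separated `.labels.txt` of `start<TAB>end<TAB>sentence` rows. A minimal stdlib-only sketch; the file names, tone, and label times are made up for illustration:)

```python
import math
import struct
import wave
from pathlib import Path

def make_synthetic_pair(outdir: str, sr: int = 16000, secs: float = 5.0) -> Path:
    """Write a mono 16-bit sine-wave WAV plus a matching .labels.txt next to it."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    wav_path = out / "clip0.wav"
    n = int(sr * secs)
    # 440 Hz tone at ~30% amplitude, packed as little-endian 16-bit PCM
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / sr)))
        for i in range(n)
    )
    with wave.open(str(wav_path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(frames)
    # Label rows: start_secs<TAB>end_secs<TAB>sentence
    labels = [(0.0, 1.5, "first sentence"), (1.5, 3.0, "second sentence")]
    label_path = wav_path.with_suffix(".labels.txt")
    label_path.write_text("".join(f"{a}\t{b}\t{t}\n" for a, b, t in labels))
    return wav_path
```

(With that, the script above could be run against purely synthetic data, e.g. `./dataset.py --indir_labeled <outdir> --outds out_ds`.)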

jaggzh commented Nov 24, 2023

I managed a workaround eventually, but I don't know what it was (I made a lot of changes to my seq2seq code). I'll try to include generating code in the future. (If I close this, I don't know if you'll see it. Feel free to close; I'll re-open if I encounter it again, if I can.)
