Index 339 out of range for dataset of size 339 <-- save_to_file() #6389

Open
jaggzh opened this issue Nov 8, 2023 · 2 comments

jaggzh commented Nov 8, 2023

Describe the bug

This occurs when saving out some Audio() data. The data is audio recordings with associated 'sentences'. (They use the audio 'bytes' approach because they're clips within larger audio files.) The code is below the traceback (I can't upload the voice audio/text; it isn't even mine).

Traceback (most recent call last):
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 156, in <module>
    create_dataset(args)
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 138, in create_dataset
    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1531, in save_to_disk
    for kwargs in kwargs_per_job:
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1508, in <genexpr>
    "shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 4609, in shard
    return self.select(
           ^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3797, in select
    return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3857, in _select_contiguous
    _check_valid_indices_value(start, len(self))
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 648, in _check_valid_indices_value
    raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 339 out of range for dataset of size 339.
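For context on where the failing index comes from: `save_to_disk` derives `num_shards` from `max_shard_size` and then takes contiguous shards. If the derived shard count ever exceeds the number of rows, the start offset of the last shard equals the dataset length, which is exactly the check that raises here. This is a hedged sketch of the contiguous-shard arithmetic (the helper name `contiguous_shard_start` and the shard count of 340 are illustrative assumptions, not the actual values computed inside `datasets`):

```python
def contiguous_shard_start(num_rows: int, num_shards: int, index: int) -> int:
    # Contiguous sharding splits rows so the first (num_rows % num_shards)
    # shards each get one extra row; this returns the start row of a shard.
    div, mod = divmod(num_rows, num_shards)
    return index * div + min(index, mod)

num_rows = 339
num_shards = 340  # hypothetical: shard count larger than row count
start = contiguous_shard_start(num_rows, num_shards, index=339)
print(start)  # 339 == num_rows, so a start-index check would fail
```

If something like this is what happens, the small rows (each shard capped at 50MB of audio bytes) combined with a size-based shard count could push `num_shards` past 339, making the last shard start at row 339 of a 339-row dataset.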

Steps to reproduce the bug

(I had to set the default max batch size down due to a different bug... or maybe it's related: #5717)

#!/usr/bin/env python3
import argparse
import os
from pathlib import Path
import soundfile as sf

import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE=35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio

import numpy as np
import librosa
import sys
import io
import logging

logging.basicConfig(level=logging.DEBUG, filename='debug.log', filemode='w',
                    format='%(name)s - %(levelname)s - %(message)s')

# Define the arguments for the command-line interface
def parse_args():
    parser = argparse.ArgumentParser(description="Create a Huggingface dataset from labeled audio files.")
    parser.add_argument("--indir_labeled", action="append", help="Directory containing labeled audio files.", required=True)
    parser.add_argument("--outds", help="Path to save the dataset file.", required=True)
    parser.add_argument("--max_clips", type=int, help="Max count of audio samples to add to the dataset.", default=None)
    parser.add_argument("-r", "--sr", type=int, help="Sample rate for the audio files.", default=16000)
    parser.add_argument("--no-resample", action="store_true", help="Disable resampling of the audio files.")
    parser.add_argument("--max_clip_secs", type=float, help="Max length of audio clips in seconds.", default=3.0)
    parser.add_argument("-v", "--verbose", action='count', default=1, help="Increase verbosity")
    return parser.parse_args()

# Convert the NumPy arrays to audio bytes in WAV format
def numpy_to_bytes(audio_array, sampling_rate=16000):
    with io.BytesIO() as bytes_io:
        sf.write(bytes_io, audio_array, samplerate=sampling_rate,
                 format='wav', subtype='FLOAT') # float32
        return bytes_io.getvalue()

# Function to find audio and label files in a directory
def find_audio_label_pairs(indir_labeled):
    audio_label_pairs = []
    for root, _, files in os.walk(indir_labeled):
        for file in files:
            if file.endswith(('.mp3', '.wav', '.aac', '.flac')):
                audio_path = Path(root) / file
                if args.verbose>1:
                    print(f'File: {audio_path}')
                label_path = audio_path.with_suffix('.labels.txt')
                if label_path.exists():
                    if args.verbose>0:
                        print(f'  Pair: {audio_path}')
                    audio_label_pairs.append((audio_path, label_path))
    return audio_label_pairs

def process_audio_label_pair(audio_path, label_path, sampling_rate, no_resample, max_clip_secs):
    # Read the label file
    with open(label_path, 'r') as label_file:
        labels = label_file.readlines()

    # Load the full audio file
    full_audio, current_sr = sf.read(audio_path)
    if not no_resample and current_sr != sampling_rate:
        # You can use librosa.resample here if librosa is available
        full_audio = librosa.resample(full_audio, orig_sr=current_sr, target_sr=sampling_rate)

    audio_segments = []
    sentences = []

    # Process each label
    for label in labels:
        start_secs, end_secs, label_text = label.strip().split('\t')
        start_sample = int(float(start_secs) * sampling_rate)
        end_sample = int(float(end_secs) * sampling_rate)

        # Extract segment and truncate or pad to max_clip_secs
        audio_segment = full_audio[start_sample:end_sample]
        max_samples = int(max_clip_secs * sampling_rate)
        if len(audio_segment) > max_samples: # Truncate
            audio_segment = audio_segment[:max_samples]
        elif len(audio_segment) < max_samples: # Pad
            padding = np.zeros(max_samples - len(audio_segment), dtype=audio_segment.dtype)
            audio_segment = np.concatenate((audio_segment, padding))

        audio_segment = numpy_to_bytes(audio_segment)

        audio_data = {
            'path': str(audio_path),
            'bytes': audio_segment,
        }

        audio_segments.append(audio_data)
        sentences.append(label_text)

    return audio_segments, sentences

# Main function to create the dataset
def create_dataset(args):
    audio_label_pairs = []
    for indir in args.indir_labeled:
        audio_label_pairs.extend(find_audio_label_pairs(indir))

    # Initialize our dataset data
    dataset_data = {
        'path': [],        # This will be a list of strings
        'audio': [],       # This will be a list of dictionaries
        'sentence': [],    # This will be a list of strings
    }

    # Process each audio-label pair and add the data to the dataset
    for audio_path, label_path in audio_label_pairs[:args.max_clips]:
        audio_segments, sentences = process_audio_label_pair(audio_path, label_path, args.sr, args.no_resample, args.max_clip_secs)
        if audio_segments and sentences:
            for audio_data, sentence in zip(audio_segments, sentences):
                if args.verbose>1:
                    print(f'Appending {audio_data["path"]}')
                dataset_data['path'].append(audio_data['path'])
                dataset_data['audio'].append({
                    'path': audio_data['path'],
                    'bytes': audio_data['bytes'],
                })
                dataset_data['sentence'].append(sentence)

    features = Features({
      'path': Value('string'), # Path is redundant in common voice set also
      'audio': Audio(sampling_rate=16000),
      'sentence': Value('string'),
    })
    hf_dataset = Dataset.from_dict(dataset_data, features=features)

    for key in dataset_data:
        for i, item in enumerate(dataset_data[key]):
            if item is None or (isinstance(item, bytes) and len(item) == 0):
                logging.error(f"Invalid {key} at index {i}: {item}")
                import ipdb; ipdb.set_trace(context=16); pass

    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')
    # try:
    #     hf_dataset.save_to_disk(args.outds)
    # except TypeError as e:
    #     # If there's a TypeError, log the exception and the dataset data that might have caused it
    #     logging.exception("An error occurred while saving the dataset.")
    #     import ipdb; ipdb.set_trace(context=16); pass
    #     for key in dataset_data:
    #         logging.debug(f"{key} length: {len(dataset_data[key])}")
    #         if key == 'audio':
    #             # Log the first 100 bytes of the audio data to avoid huge log files
    #             for i, audio in enumerate(dataset_data[key]):
    #                 logging.debug(f"Audio {i}: {audio['bytes'][:100]}")
    #     raise

# Run the script
if __name__ == "__main__":
    args = parse_args()
    create_dataset(args)

Expected behavior

It shouldn't fail.

Environment info

  • datasets version: 2.14.7.dev0
  • Platform: Linux-6.1.0-13-amd64-x86_64-with-glibc2.36
  • Python version: 3.11.2
  • huggingface_hub version: 0.17.3
  • PyArrow version: 13.0.0
  • Pandas version: 2.1.2
  • fsspec version: 2023.9.2
@mariosasko (Collaborator)
Hi! Can you make the above reproducer self-contained by adding code that generates the data?
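(Not the original data, but one way to make the reproducer self-contained would be a synthetic generator producing the layout the script expects: an audio file plus a tab-separated `.labels.txt` of `start<TAB>end<TAB>sentence` rows. A minimal stdlib-only sketch; the file names, tone, and label times are made up for illustration:)

```python
import math
import struct
import wave
from pathlib import Path

def make_synthetic_pair(outdir: str, sr: int = 16000, secs: float = 5.0) -> Path:
    """Write a mono 16-bit sine-wave WAV plus a matching .labels.txt next to it."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    wav_path = out / "clip0.wav"
    n = int(sr * secs)
    # 440 Hz tone at ~30% amplitude, packed as little-endian 16-bit PCM
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / sr)))
        for i in range(n)
    )
    with wave.open(str(wav_path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(frames)
    # Label rows: start_secs<TAB>end_secs<TAB>sentence
    labels = [(0.0, 1.5, "first sentence"), (1.5, 3.0, "second sentence")]
    label_path = wav_path.with_suffix(".labels.txt")
    label_path.write_text("".join(f"{a}\t{b}\t{t}\n" for a, b, t in labels))
    return wav_path
```

(With that, the script above could be run against purely synthetic data, e.g. `./dataset.py --indir_labeled <outdir> --outds out_ds`.)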

jaggzh commented Nov 24, 2023

I managed a workaround eventually, but I don't know what it was (I made a lot of changes to my seq2seq code). I'll try to include generating code in the future. (If I close this, I don't know if you'll see it. Feel free to close; I'll re-open if I encounter it again, if I can.)
