Describe the bug

`Dataset.save_to_disk` fails with an `IndexError` when saving out some `Audio()` data. The data is audio recordings with associated 'sentences'. (The rows use the audio 'bytes' approach because each clip is a segment within a larger audio file.) Code is below the traceback. (I can't upload the voice audio/text — it isn't even mine.)
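For illustration, here is roughly what one row of the 'bytes' approach looks like. The file name and the silent clip below are made up, and this sketch uses only the stdlib `wave` module rather than `soundfile`:

```python
import io
import wave

def pcm16_wav_bytes(frames, sampling_rate=16000):
    """Encode raw mono 16-bit PCM frames as in-memory WAV bytes (stdlib only)."""
    with io.BytesIO() as bio:
        with wave.open(bio, 'wb') as w:
            w.setnchannels(1)
            w.setsampwidth(2)              # 16-bit samples
            w.setframerate(sampling_rate)
            w.writeframes(frames)
        return bio.getvalue()

# One second of silence, then the dict shape an Audio() column accepts
# when you pass encoded bytes instead of a file path.
clip = b'\x00\x00' * 16000
audio_row = {'path': 'some_recording.wav', 'bytes': pcm16_wav_bytes(clip)}
print(audio_row['bytes'][:4])  # b'RIFF'
```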
```
Traceback (most recent call last):
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 156, in <module>
    create_dataset(args)
  File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 138, in create_dataset
    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1531, in save_to_disk
    for kwargs in kwargs_per_job:
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1508, in <genexpr>
    "shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 4609, in shard
    return self.select(
           ^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3797, in select
    return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 556, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 3857, in _select_contiguous
    _check_valid_indices_value(start, len(self))
  File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 648, in _check_valid_indices_value
    raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 339 out of range for dataset of size 339.
```
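For what it's worth, the failing start index can be reproduced from the shard arithmetic alone. The helper below is my approximation of what `shard(..., contiguous=True)` appears to compute in the traceback — the function name is mine, not the library's, and this is a guess at the failure mode rather than a confirmed diagnosis. If `num_shards` ever exceeds the row count, the trailing shards get `start == len(dataset)`, which is exactly the out-of-range index above:

```python
def contiguous_shard_bounds(n_rows, num_shards, index):
    # Approximation of contiguous sharding: each shard gets n_rows // num_shards
    # rows, and the first (n_rows % num_shards) shards get one extra row.
    div, mod = divmod(n_rows, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return start, end

# 339 rows but (hypothetically) 340 shards: the last shard starts at index 339,
# which is out of range for a dataset of size 339 -- matching the traceback.
print(contiguous_shard_bounds(339, 340, 339))  # (339, 339)
```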
Steps to reproduce the bug
(I had to set the default max batch size down due to a different bug... or maybe it's related: #5717)
```python
#!/usr/bin/env python3
import argparse
import os
from pathlib import Path
import soundfile as sf
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
import numpy as np
import librosa
import sys
import io
import logging

logging.basicConfig(level=logging.DEBUG, filename='debug.log', filemode='w',
                    format='%(name)s - %(levelname)s - %(message)s')

# Define the arguments for the command-line interface
def parse_args():
    parser = argparse.ArgumentParser(description="Create a Huggingface dataset from labeled audio files.")
    parser.add_argument("--indir_labeled", action="append", help="Directory containing labeled audio files.", required=True)
    parser.add_argument("--outds", help="Path to save the dataset file.", required=True)
    parser.add_argument("--max_clips", type=int, help="Max count of audio samples to add to the dataset.", default=None)
    parser.add_argument("-r", "--sr", type=int, help="Sample rate for the audio files.", default=16000)
    parser.add_argument("--no-resample", action="store_true", help="Disable resampling of the audio files.")
    parser.add_argument("--max_clip_secs", type=float, help="Max length of audio clips in seconds.", default=3.0)
    parser.add_argument("-v", "--verbose", action='count', default=1, help="Increase verbosity")
    return parser.parse_args()

# Convert the NumPy arrays to audio bytes in WAV format
def numpy_to_bytes(audio_array, sampling_rate=16000):
    with io.BytesIO() as bytes_io:
        sf.write(bytes_io, audio_array, samplerate=sampling_rate,
                 format='wav', subtype='FLOAT')  # float32
        return bytes_io.getvalue()

# Function to find audio and label files in a directory
def find_audio_label_pairs(indir_labeled):
    audio_label_pairs = []
    for root, _, files in os.walk(indir_labeled):
        for file in files:
            if file.endswith(('.mp3', '.wav', '.aac', '.flac')):
                audio_path = Path(root) / file
                if args.verbose > 1:
                    print(f'File: {audio_path}')
                label_path = audio_path.with_suffix('.labels.txt')
                if label_path.exists():
                    if args.verbose > 0:
                        print(f'  Pair: {audio_path}')
                    audio_label_pairs.append((audio_path, label_path))
    return audio_label_pairs

def process_audio_label_pair(audio_path, label_path, sampling_rate, no_resample, max_clip_secs):
    # Read the label file
    with open(label_path, 'r') as label_file:
        labels = label_file.readlines()

    # Load the full audio file
    full_audio, current_sr = sf.read(audio_path)
    if not no_resample and current_sr != sampling_rate:
        # You can use librosa.resample here if librosa is available
        full_audio = librosa.resample(full_audio, orig_sr=current_sr, target_sr=sampling_rate)

    audio_segments = []
    sentences = []

    # Process each label
    for label in labels:
        start_secs, end_secs, label_text = label.strip().split('\t')
        start_sample = int(float(start_secs) * sampling_rate)
        end_sample = int(float(end_secs) * sampling_rate)

        # Extract segment and truncate or pad to max_clip_secs
        audio_segment = full_audio[start_sample:end_sample]
        max_samples = int(max_clip_secs * sampling_rate)
        if len(audio_segment) > max_samples:    # Truncate
            audio_segment = audio_segment[:max_samples]
        elif len(audio_segment) < max_samples:  # Pad
            padding = np.zeros(max_samples - len(audio_segment), dtype=audio_segment.dtype)
            audio_segment = np.concatenate((audio_segment, padding))

        audio_segment = numpy_to_bytes(audio_segment)
        audio_data = {
            'path': str(audio_path),
            'bytes': audio_segment,
        }
        audio_segments.append(audio_data)
        sentences.append(label_text)

    return audio_segments, sentences

# Main function to create the dataset
def create_dataset(args):
    audio_label_pairs = []
    for indir in args.indir_labeled:
        audio_label_pairs.extend(find_audio_label_pairs(indir))

    # Initialize our dataset data
    dataset_data = {
        'path': [],      # This will be a list of strings
        'audio': [],     # This will be a list of dictionaries
        'sentence': [],  # This will be a list of strings
    }

    # Process each audio-label pair and add the data to the dataset
    for audio_path, label_path in audio_label_pairs[:args.max_clips]:
        audio_segments, sentences = process_audio_label_pair(audio_path, label_path, args.sr,
                                                             args.no_resample, args.max_clip_secs)
        if audio_segments and sentences:
            for audio_data, sentence in zip(audio_segments, sentences):
                if args.verbose > 1:
                    print(f'Appending {audio_data["path"]}')
                dataset_data['path'].append(audio_data['path'])
                dataset_data['audio'].append({
                    'path': audio_data['path'],
                    'bytes': audio_data['bytes'],
                })
                dataset_data['sentence'].append(sentence)

    features = Features({
        'path': Value('string'),  # Path is redundant in common voice set also
        'audio': Audio(sampling_rate=16000),
        'sentence': Value('string'),
    })

    hf_dataset = Dataset.from_dict(dataset_data, features=features)

    for key in dataset_data:
        for i, item in enumerate(dataset_data[key]):
            if item is None or (isinstance(item, bytes) and len(item) == 0):
                logging.error(f"Invalid {key} at index {i}: {item}")
                import ipdb; ipdb.set_trace(context=16); pass

    hf_dataset.save_to_disk(args.outds, max_shard_size='50MB')

    # try:
    #     hf_dataset.save_to_disk(args.outds)
    # except TypeError as e:
    #     # If there's a TypeError, log the exception and the dataset data that might have caused it
    #     logging.exception("An error occurred while saving the dataset.")
    #     import ipdb; ipdb.set_trace(context=16); pass
    #     for key in dataset_data:
    #         logging.debug(f"{key} length: {len(dataset_data[key])}")
    #         if key == 'audio':
    #             # Log the first 100 bytes of the audio data to avoid huge log files
    #             for i, audio in enumerate(dataset_data[key]):
    #                 logging.debug(f"Audio {i}: {audio['bytes'][:100]}")
    #     raise

# Run the script
if __name__ == "__main__":
    args = parse_args()
    create_dataset(args)
```
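One mitigation I would guess at (not the workaround I actually applied, which I've lost track of): cap the shard count at the row count before saving. With `max_shard_size='50MB'` and only 339 rows of large audio bytes, the requested shard count could exceed the number of rows. The helper below is hypothetical, not a `datasets` API; it just shows the capping arithmetic, and its result could then be passed as `save_to_disk`'s `num_shards` argument instead of `max_shard_size`:

```python
def safe_num_shards(total_nbytes, max_shard_nbytes, n_rows):
    """Hypothetical helper: pick a shard count that never exceeds the row count."""
    wanted = max(1, -(-total_nbytes // max_shard_nbytes))  # ceil division
    return min(wanted, max(n_rows, 1))

# ~100MB of data in 50MB shards -> 2 shards; ~1TB would want ~19074 shards,
# but the cap keeps it at the 339 available rows so every shard start is valid.
print(safe_num_shards(100 * 2**20, 50 * 2**20, 339))  # 2
print(safe_num_shards(10**12, 50 * 2**20, 339))       # 339
```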
I eventually managed a workaround, but I'm not sure which change did it (I made a lot of changes to my seq2seq script). I'll try to include generating code in the future. (If I close this myself, I don't know whether you'd see it. Feel free to close; I'll re-open if I encounter it again — if I can.)
Expected behavior
It shouldn't fail.
Environment info
- `datasets` version: 2.14.7.dev0
- `huggingface_hub` version: 0.17.3
- `fsspec` version: 2023.9.2