Project Tutorial Proposal - Saurav #20 Draft #45

Open · wants to merge 5 commits into `main`
182 changes: 182 additions & 0 deletions projects/caption-generator-from-audio/caption-generator-from-audio.mdx
@@ -0,0 +1,182 @@
---
title: Caption Generator From Audio
author: Saurav Singh Rauthan
datePublished: 2022-10-31
description: Learn how to generate captions from an audio file using the Wav2vec 2.0 model in this project tutorial
header:
tags:
- intermediate
- python
---

# Caption Generator From Audio

<AuthorAvatar author_name="Saurav Singh Rauthan" author_avatar="/images/projects/authors/saurav_singh_rauthan.jpg" />

![Header image](URL)

**Prerequisites:** Python fundamentals, Code editor
**Versions:** Python 3.10, Wav2vec 2.0
**Read Time:** 40 minutes

## [#](#-introduction) Introduction

Whenever we watch a video on YouTube or a movie on an OTT platform, we have the option of enabling captions, which give us the **transcript** of the video, i.e., a written copy of what is being spoken.

The fun thing is that with the help of Python, we can generate this **transcript** automatically 🤯.

By the end of this tutorial, we'll be able to generate captions/transcripts like this:

![result](output.gif)

Exciting 😎? Let's code it!

## [#](#-folder-structure) Folder Structure

To be on the same page, let's set up our project with a basic folder structure.

Create a `.py` or `.ipynb` file that will contain our code and name it accordingly. I'll name my `.ipynb` file `captionGenerator.ipynb`.

Next, we need to create a folder called `audio` in our root directory, which will hold the audio files we feed to the model later in this tutorial.
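
If you prefer to create the folder from code rather than by hand, a minimal sketch (assuming you run it from the project root) looks like this:

```py
import os

# create the audio directory in the project root if it doesn't exist yet
os.makedirs('audio', exist_ok=True)
```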

After creating these files, the folder structure will look something like this:

![folder-structure](folder-structure.jpg)


_Note: the `.vscode` folder contains settings for the VS Code editor, such as indentation settings and the Prettier formatter, and is not required for this tutorial._

## [#](#-installing-packages) Installing Packages

Now we will install the packages required for this tutorial. You can read more about them in the [More Resources](#-more-resources) section.

Since I am using an `.ipynb` file, I will install the packages with the following code:

```py
! pip install pydub
! pip install static_ffmpeg
! pip install pyaudio
! pip install librosa
! pip install torch
! pip install transformers
```

If you are facing errors while installing packages, you can follow [this tutorial](https://packaging.python.org/en/latest/tutorials/installing-packages/).

## [#](#-audio-preprocessing) Audio Pre-processing

Audio comes in many formats (e.g., `.mp3`, `.wav`, `.aac`, `.ogg`), and our input could be in any of them, but our model only accepts a `.wav` file. Also, if the input file is very long, say a 20-minute recording, just imagine the computational power required to process it in one go 🤯. So in this section we'll perform some audio pre-processing.

Let's start by splitting the audio into sub-parts so that we can process them one at a time instead of feeding the entire 20-minute audio to the model.

Imports:

```py
from pydub import AudioSegment
from pydub.playback import play
import static_ffmpeg
static_ffmpeg.add_paths()
```

Importing the audio file:

```py
audio = AudioSegment.from_file(r"<Path to the input file>")

# if you want to convert the whole audio to .wav in one step, uncomment the line below
# audio.export("audio export path here", format="wav")
```

Splitting the audio into 3-minute parts:

```py
import math

# we'll split the audio into 3-minute chunks to minimize the computational power used by the model
audio_length = len(audio) / (60 * 1000)  # length in minutes

split_marker = 180
split_audio = [audio[:180 * 1000]]

# the first chunk is already in the list, so append the remaining ones
for i in range(math.ceil(audio_length / 3) - 1):
    split_audio.append(audio[split_marker * 1000:(split_marker + 180) * 1000])
    split_marker += 180

# this will create the files in the audio dir, so make sure the audio dir exists in the root folder
for count, audio_sample in enumerate(split_audio, start=1):
    with open(f'audio/{count}_audio_file.wav', 'wb') as out_f:
        audio_sample.export(out_f, format='wav')
```

Remember the `audio` directory we made in the [Folder Structure](#-folder-structure) section? It will contain all the split audio files generated from our input file. We also need these files to be in `.wav` format so that we can feed them to the model, which is done by this line of code: `audio_sample.export(out_f, format='wav')`.
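
As an optional sanity check, you can list the chunk files that were just written to `audio/` (a minimal sketch, assuming the directory sits next to your notebook):

```py
import os

# print the generated .wav chunks
for name in sorted(os.listdir('audio')):
    print(name)
```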

## [#](#-loading-model) Loading Model

Now we have our audio file split and converted into `.wav` format, so let's load our model.

Imports:

```py
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
```

Loading the model and tokenizer:

```py
# loading model and tokenizer

tokenizer = Wav2Vec2Tokenizer.from_pretrained('facebook/wav2vec2-base-960h')
model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')

print(f'Model loaded, working on audio of length {audio_length:.2f} minutes')
```

Feeding the split audio to the model:

```py
text_arr = []
for i in range(len(split_audio)):
    # load each audio chunk from the audio dir, resampled to 16 kHz as the model expects
    speech, rate = librosa.load(f'audio/{i+1}_audio_file.wav', sr=16000)

    input_values = tokenizer(speech, return_tensors='pt').input_values
    with torch.no_grad():
        logits = model(input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    text_arr.append(transcription)
```

And finally, generating the transcript:

```py
# join the chunk transcripts with a space so words at chunk boundaries don't run together
final_speech = ' '.join(text_arr)
print(final_speech)
```

Tada 🧙🎉 the transcript is ready.
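
If you'd like to keep the result around, a small optional addition writes it to a text file (the `transcript.txt` name here is just an example):

```py
# save the generated transcript next to the notebook (filename is arbitrary)
with open('transcript.txt', 'w') as f:
    f.write(final_speech)
```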

## [#](#-conclusion) Conclusion

In this tutorial, we explored how to generate captions (a transcript) from an audio file by following the steps above.

As a next step, we can take this transcript and extract the important keywords from it using models and libraries like BERT or spaCy, applying natural language processing to the produced transcript 👨‍💻 (see the sketch below).
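
For instance, here is a minimal keyword-extraction sketch with spaCy. It assumes you have installed spaCy and downloaded the small English model (`pip install spacy` and `python -m spacy download en_core_web_sm`), and it simply keeps non-stopword nouns and proper nouns:

```py
import spacy

# load a small English pipeline (must be downloaded beforehand)
nlp = spacy.load('en_core_web_sm')

# Wav2vec output is uppercase, so lowercase it before tagging
doc = nlp(final_speech.lower())

# keep lemmas of nouns and proper nouns that aren't stop words
keywords = {tok.lemma_ for tok in doc if tok.pos_ in ('NOUN', 'PROPN') and not tok.is_stop}
print(keywords)
```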

You can also explore models other than Wav2vec, or even implement your own model for the same task.
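
Swapping in another pretrained checkpoint is usually a small change; for example (assuming the larger checkpoint fits in your memory budget):

```py
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# the large 960h checkpoint generally trades memory and speed for better accuracy
tokenizer = Wav2Vec2Tokenizer.from_pretrained('facebook/wav2vec2-large-960h')
model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-960h')
```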

## [#](#-more-resources) More Resources

- [Solution on GitHub](finalCode.ipynb)
- [Documentation: Wav2vec 2.0](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
- [Documentation: pydub](https://pypi.org/project/pydub/)
- [Documentation: librosa](https://librosa.org/doc/latest/index.html)
- [Documentation: pyaudio](https://pypi.org/project/PyAudio/)
- [Documentation: transformers](https://pypi.org/project/transformers/)
- [Documentation: static_ffmpeg](https://pypi.org/project/static-ffmpeg/)
- [Documentation: torch](https://pypi.org/project/torch/)
128 changes: 128 additions & 0 deletions projects/caption-generator-from-audio/finalCode.ipynb
@@ -0,0 +1,128 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "2603103e",
"metadata": {},
"outputs": [],
"source": [
"import os \n",
"print(os.path.isfile(r\"input file path\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88eb6f4a",
"metadata": {},
"outputs": [],
"source": [
"! pip install pydub\n",
"! pip install static_ffmpeg\n",
"! pip install pyaudio\n",
"! pip install librosa\n",
"! pip install torch\n",
"! pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cd0d97c7",
"metadata": {},
"outputs": [],
"source": [
"# since the model take only wav file as input , so we'll convert the audio format to wav using pydub\n",
"\n",
"from pydub import AudioSegment\n",
"from pydub.playback import play\n",
"import static_ffmpeg\n",
"\n",
"static_ffmpeg.add_paths()\n",
"\n",
"\n",
"audio = AudioSegment.from_file(r\"<input file path>\")\n",
"\n",
"# if you want to convert the audio as whole uncomment the code below\n",
"# audio.export(\"audio export path here\", format=\"wav\")\n",
"\n",
"# we'll split the audio in 3 minutes each to minimize the computational power used by model\n",
"audio_length = len(audio) / (60 * 1000)\n",
"\n",
"split_marker = 180\n",
"split_audio = [audio[:180 * 1000]]\n",
"\n",
"for i in range(round(audio_length / (180 * 1000))):\n",
" split_audio.append(audio[split_marker * 1000:(split_marker + 180) * 1000])\n",
" split_marker += 180\n",
"\n",
"#it will create the file in audio dir, make sure to create audio dir in root folder\n",
"\n",
"count = 0\n",
"for count, audio_sample in enumerate(split_audio):\n",
" count += 1\n",
" with open(f'audio/{count}_audi_file.wav', 'wb') as out_f:\n",
" audio_sample.export(out_f, format='wav')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c24da00f",
"metadata": {},
"outputs": [],
"source": [
"import librosa\n",
"import torch\n",
"from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer\n",
"import os\n",
"\n",
"# loading model and tokenizer\n",
"tokenizer = Wav2Vec2Tokenizer.from_pretrained('facebook/wav2vec2-base-960h')\n",
"model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')\n",
"\n",
"print('model loaded working on audio with length :', f'{\"{0:.2f}\".format(audio_length)}s')\n",
"\n",
"text_arr = []\n",
"for i in range(len(split_audio)):\n",
" # loading audio in model from the audio dir\n",
" speech, rate = librosa.load(f'audio/{i+1}_audi_file.wav', sr=16000) \n",
" \n",
" input_values = tokenizer(speech, return_tensors='pt').input_values\n",
" with torch.no_grad():\n",
" logits = model(input_values).logits\n",
" \n",
" predicted_ids = torch.argmax(logits, dim=-1)\n",
" transcription = tokenizer.batch_decode(predicted_ids)[0]\n",
" text_arr.append(transcription)\n",
" \n",
"final_speech = ''\n",
"for speech in text_arr:\n",
" final_speech += speech\n",
"print(final_speech)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file added projects/caption-generator-from-audio/output.gif