There are two options for the neural codec. If you want to use the pretrained 24kHz Encodec, just create an Encodec object as follows:
```python
from audiolm_pytorch import EncodecWrapper
encodec = EncodecWrapper()
# Now you can use the encodec variable in the same way you'd use the soundstream variables below.
```
Otherwise, to stay closer to the original paper, you can use `SoundStream`. First, `SoundStream` needs to be trained on a large corpus of audio data.
**Note**: do NOT type "y" to overwrite previous experiments/checkpoints when running through the cells here unless you're ready to lose the entire results folder! Otherwise you will end up erasing things (e.g. if you train SoundStream first and then choose "overwrite" when you train SemanticTransformer, you lose the SoundStream checkpoint).
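
One way to guard against an accidental "y" is to copy the results folder somewhere safe before starting each trainer. The following is a minimal sketch, not part of `audiolm_pytorch`: the helper name `backup_results` and the `results_backups` folder are hypothetical, and it assumes checkpoints are written to a folder named `results` in the working directory, as the training output in this notebook suggests.

```python
import shutil
import time
from pathlib import Path


def backup_results(results_dir="results", backup_root="results_backups"):
    """Copy results_dir into a timestamped folder under backup_root.

    Hypothetical helper. Returns the backup path, or None when
    results_dir does not exist yet (nothing to protect).
    """
    src = Path(results_dir)
    if not src.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates missing parent folders and fails if dest already exists,
    # so an existing backup is never silently overwritten.
    shutil.copytree(src, dest)
    return dest


# Usage (run in a cell before each trainer.train()):
# backup_results()
```

Copying rather than moving keeps the checkpoints where the later cells expect to load them from.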
%% Cell type:markdown id: tags:
### SoundStream
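
The demo settings in the cell below are easier to read with the arithmetic spelled out. This is a small sketch, not library code: the two helper functions are hypothetical, and the checkpoint rule (steps counted from 0, a save whenever the step is a multiple of `save_model_every`) is an assumption that matches the `0: saving model to results` line in the output.

```python
def effective_batch_size(batch_size, grad_accum_every):
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * grad_accum_every


def checkpoint_steps(num_train_steps, save_model_every):
    """Steps at which a checkpoint is written.

    ASSUMPTION: steps run from 0 to num_train_steps - 1 and a save
    happens whenever step % save_model_every == 0.
    """
    return [s for s in range(num_train_steps) if s % save_model_every == 0]


print(effective_batch_size(4, 8))  # 32
print(checkpoint_steps(9, 4))      # [0, 4, 8]
print(320 * 32)                    # 10240, the data_max_length value
```

Under that assumption, 9 steps (8 + 1) is the smallest count that still reaches the save at step 8.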
%% Cell type:code id: tags:
```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size=1024,
    rq_num_quantizers=8,
)

trainer = SoundStreamTrainer(
    soundstream,
    folder=dataset_folder,
    batch_size=4,
    grad_accum_every=8,  # effective batch size of 32
    data_max_length=320 * 32,
    save_results_every=2,
    save_model_every=4,
    num_train_steps=9
).cuda()
# NOTE: I changed num_train_steps to 9 (aka 8 + 1) from 10000 to make things go faster for demo purposes
# adjusting save_*_every variables for the same reason

trainer.train()
```
%% Output
training with dataset of 2 samples and validating with randomly splitted 1 samples
/usr/local/lib/python3.8/dist-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator MiniBatchKMeans from version 0.24.0 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
training with dataset of 2 samples and validating with randomly splitted 1 samples
do you want to clear previous experiment checkpoints and results? (y/n) n
0: loss: 6.648584365844727
0: valid loss 5.763116359710693
0: saving model to results
training complete
%% Cell type:markdown id: tags:
### CoarseTransformer
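
The SoundStream above has 8 residual quantizers, while the coarse stage below is configured with `num_coarse_quantizers=3`. In AudioLM the coarse stage models only the first few quantizer levels and leaves the remaining levels to the fine stage. A minimal sketch of that split, using plain Python lists and made-up token values (the function name `split_coarse_fine` is hypothetical and not part of the library):

```python
def split_coarse_fine(codes, num_coarse_quantizers=3):
    """Split per-frame residual codes into coarse and fine parts.

    codes: list of frames, each a list with one codebook index per quantizer level.
    """
    coarse = [frame[:num_coarse_quantizers] for frame in codes]
    fine = [frame[num_coarse_quantizers:] for frame in codes]
    return coarse, fine


# Two hypothetical frames, each with 8 indices in [0, 1024)
frames = [
    [12, 900, 3, 77, 512, 8, 1023, 40],
    [5, 17, 256, 0, 31, 999, 64, 2],
]
coarse, fine = split_coarse_fine(frames)
print(coarse)        # [[12, 900, 3], [5, 17, 256]]
print(len(fine[0]))  # 5
```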
%% Cell type:code id: tags:
```python
from audiolm_pytorch import (
    HubertWithKmeans,
    SoundStream,
    CoarseTransformer,
    CoarseTransformerTrainer,
)

wav2vec = HubertWithKmeans(
    checkpoint_path=f'./{hubert_ckpt}',
    kmeans_path=f'./{hubert_quantizer}'
)

# Rebuild SoundStream with the same settings used for training, then load its checkpoint
soundstream = SoundStream(
    codebook_size=1024,
    rq_num_quantizers=8,
)
soundstream.load(f"./{soundstream_ckpt}")

coarse_transformer = CoarseTransformer(
    num_semantic_tokens=wav2vec.codebook_size,
    codebook_size=1024,
    num_coarse_quantizers=3,
    dim=512,
    depth=6
)

trainer = CoarseTransformerTrainer(
    transformer=coarse_transformer,
    codec=soundstream,
    wav2vec=wav2vec,
    folder=dataset_folder,
    batch_size=1,
    data_max_length=320 * 32,
    save_results_every=2,
    save_model_every=4,
    num_train_steps=9
)
# NOTE: I changed num_train_steps to 9 (aka 8 + 1) from 10000 to make things go faster for demo purposes
# adjusting save_*_every variables for the same reason

trainer.train()
```
%% Output
/usr/local/lib/python3.8/dist-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator MiniBatchKMeans from version 0.24.0 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: