it is done (e96259b7) · Commits · school / Capstone Design / 01 / MusicLM

README.md

+33 −1

Original line number	Diff line number	Diff line
		@@ -79,7 +79,39 @@ wavs = torch.randn(2, 1024)
		conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers
		```

		-After much training, you will pass your finetuned or trained-from-scratch `AudioLM` and `MuLaN` wrapped in `MuLaNEmbedQuantizer` to the `MusicLM`
		+To train (or finetune) the three transformers that are part of `AudioLM`, follow the training instructions at `audiolm-pytorch`, but pass the `MuLaNEmbedQuantizer` instance to the training classes under the keyword `audio_conditioner`.
		+
		+For example, with `SemanticTransformerTrainer`:
		+
		+```python
		+import torch
		+from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer
		+
		+wav2vec = HubertWithKmeans(
		+    checkpoint_path = './hubert/hubert_base_ls960.pt',
		+    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
		+)
		+
		+semantic_transformer = SemanticTransformer(
		+    num_semantic_tokens = wav2vec.codebook_size,
		+    dim = 1024,
		+    depth = 6
		+).cuda()
		+
		+trainer = SemanticTransformerTrainer(
		+    transformer = semantic_transformer,
		+    wav2vec = wav2vec,
		+    audio_conditioner = quantizer, # pass in the MuLaNEmbedQuantizer instance above
		+    folder = '/path/to/audio/files',
		+    batch_size = 1,
		+    data_max_length = 320 * 32,
		+    num_train_steps = 1
		+)
		+
		+trainer.train()
		+```
		+
		+After training all three transformers (semantic, coarse, fine), pass your finetuned or trained-from-scratch `AudioLM`, together with `MuLaN` wrapped in `MuLaNEmbedQuantizer`, to `MusicLM`:

		```python
		musiclm = MusicLM(

+2 −0

		@@ -541,6 +541,8 @@ class MusicLM(nn.Module):
		         mulan_embed_quantizer: MuLaNEmbedQuantizer
		     ):
		         super().__init__()
		+        assert not exists(audio_lm.audio_conditioner), 'mulan must not have been passed into AudioLM. it will be managed externally now, embedding the text into the joint embedding space for text-to-audio synthesis'
		+
		         self.mulan_embed_quantizer = mulan_embed_quantizer
		         self.audio_lm = audio_lm
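
Review note: the assertion added in this hunk rejects an `AudioLM` that already carries its own `audio_conditioner`, because `MusicLM` now owns the `MuLaNEmbedQuantizer`. The behavior can be sketched without any model weights. In the sketch below, `MusicLMSketch` and the `SimpleNamespace` objects are hypothetical stand-ins rather than the library's classes, and `exists` is assumed to be a plain `is not None` check, since its definition is not part of this diff.

```python
from types import SimpleNamespace

def exists(val):
    # Assumed helper: the diff calls exists() but does not show its definition.
    return val is not None

class MusicLMSketch:
    """Hypothetical stand-in that models only the constructor guard from the diff."""

    def __init__(self, audio_lm, mulan_embed_quantizer):
        # Same condition as the added assert: AudioLM must not hold a conditioner.
        assert not exists(audio_lm.audio_conditioner), \
            'mulan must not have been passed into AudioLM'
        self.mulan_embed_quantizer = mulan_embed_quantizer
        self.audio_lm = audio_lm

quantizer = object()  # placeholder for a MuLaNEmbedQuantizer instance

# An AudioLM-like object assembled without a conditioner is accepted.
clean_audio_lm = SimpleNamespace(audio_conditioner = None)
wrapper = MusicLMSketch(clean_audio_lm, quantizer)

# One that already holds the quantizer trips the assertion.
conditioned_audio_lm = SimpleNamespace(audio_conditioner = quantizer)
try:
    MusicLMSketch(conditioned_audio_lm, quantizer)
    rejected = False
except AssertionError:
    rejected = True
```

Because the guard is an `assert`, it is skipped when Python runs with `-O`; that is a property of the statement itself, not something specific to this commit.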

+2 −2

		@@ -3,7 +3,7 @@ from setuptools import setup, find_packages
		 setup(
		   name = 'musiclm-pytorch',
		   packages = find_packages(exclude=[]),
		-  version = '0.0.6',
		+  version = '0.0.7',
		   license='MIT',
		   description = 'MusicLM - AudioLM + Audio CLIP to text to music synthesis',
		   author = 'Phil Wang',
		@@ -19,7 +19,7 @@ setup(
		     'contrastive learning'
		   ],
		   install_requires=[
		-    'audiolm-pytorch>=0.9.0',
		+    'audiolm-pytorch>=0.9.2',
		     'beartype',
		     'einops>=0.4',
		     'vector-quantize-pytorch>=1.0.0',
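
Review note: the last hunk raises the minimum `audiolm-pytorch` version from 0.9.0 to 0.9.2. As a rough illustration of what that `>=` constraint means for an existing environment, here is a simplified check. `parse_version` and `satisfies_minimum` are hypothetical helpers written for this note; they handle only plain dotted numeric versions, whereas real installers apply the full PEP 440 rules.

```python
def parse_version(text):
    # Split '0.9.2' into (0, 9, 2) so components compare numerically,
    # not as strings (as strings, '0.10.0' would sort before '0.9.2').
    return tuple(int(part) for part in text.split('.'))

def satisfies_minimum(installed, minimum):
    # True when the installed version is at least the required minimum.
    return parse_version(installed) >= parse_version(minimum)

NEW_MINIMUM = '0.9.2'  # taken from the updated install_requires entry

old_pin_ok = satisfies_minimum('0.9.0', NEW_MINIMUM)   # previous lower bound
exact_ok = satisfies_minimum('0.9.2', NEW_MINIMUM)     # new lower bound
later_ok = satisfies_minimum('0.10.0', NEW_MINIMUM)    # a later release number
```

An environment still on 0.9.0 no longer satisfies the requirement, so installing this version of `musiclm-pytorch` would require upgrading `audiolm-pytorch`.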