The Science of Singability: From Syllables to Synchronized Performance
Understanding prosody, phoneme mapping, and the technical challenges of creating truly singable AI-generated lyrics with perfect audio-visual synchronization. Explore the granularity cascade from text to embodied performance.
The Granularity Cascade
Generating coherent text is only the first step in creating a song. For lyrics to be truly effective, they must be "singable"—aligned with the rhythmic, melodic, and phonetic structure of music. This requirement pushes the task into computational linguistics, phonetics, and computer animation.
The Four Levels of Singability
- Structural Level: Syllable-note alignment and verse-chorus structure
- Prosodic Level: Stress patterns and rhythmic flow matching
- Phonetic Level: Grapheme-to-phoneme conversion for vocal synthesis
- Visual Level: Phoneme-to-viseme mapping for lip-sync animation
Modeling Prosody: Rhythm and Stress
Syllabic Alignment
The number of syllables in a lyrical phrase must match the number of notes in the corresponding musical phrase.
```
Melody: C4 D4 E4 F4          (4 notes)
Lyrics: "Hel-lo my friend"   (4 syllables) ✓
Lyrics: "Greetings"          (2 syllables) ✗
```
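As a minimal sketch, this check can be automated with the CMU Pronouncing Dictionary via the `pronouncing` package; the package and its calls are real, but the helper below and its whitespace word-splitting are illustrative assumptions.

```python
import pronouncing  # thin wrapper around the CMU Pronouncing Dictionary

def syllables_match(lyric: str, n_notes: int) -> bool:
    """Check that a lyric phrase carries exactly one syllable per note."""
    total = 0
    for word in lyric.lower().split():
        phones = pronouncing.phones_for_word(word)
        if not phones:
            raise ValueError(f"out-of-vocabulary word: {word!r}")
        # Each ARPAbet vowel carries a stress digit, so the digit count
        # equals the syllable count.
        total += pronouncing.syllable_count(phones[0])
    return total == n_notes

print(syllables_match("hello my friend", 4))  # True:  hel-lo + my + friend = 4
print(syllables_match("greetings", 4))        # False: gree-tings = 2
```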
Stress Alignment
Linguistically stressed syllables must align with musically strong beats or longer notes.
```
Beat: STRONG weak STRONG weak STRONG
Good: "MU-sic FLOWS through ME" ✓
Poor: "mu-SIC flows THROUGH me" ✗
```
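A rough stress-alignment check can compare CMU lexical stress markers against a beat template, as sketched below. CMU marks monosyllabic function words as stressed, so the demotion list here is a simplifying assumption tuned to this example.

```python
import pronouncing

FUNCTION_WORDS = {"a", "an", "the", "through", "of", "to", "and"}

def stress_pattern(lyric: str) -> str:
    """One character per syllable: 'S' for stressed, 'w' for unstressed."""
    out = []
    for word in lyric.lower().split():
        phones = pronouncing.phones_for_word(word)[0]  # first dictionary variant
        digits = pronouncing.stresses(phones)          # e.g. "10" for "music"
        if word in FUNCTION_WORDS and len(digits) == 1:
            out.append("w")  # demote monosyllabic function words
        else:
            out += ["S" if d in "12" else "w" for d in digits]
    return "".join(out)

beat = "SwSwS"  # STRONG weak STRONG weak STRONG
print(stress_pattern("music flows through me") == beat)  # True
```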
Modern Approaches to Prosody
- LYRA: Multi-task learning to predict syllable counts alongside text generation
- Scansion-based: Formal poetic meter analysis to create rhythmic templates
- Neural attention: Learning implicit alignment through attention mechanisms
Grapheme-to-Phoneme (G2P) Conversion
To move from text to acoustic synthesis, systems must understand how words sound. This requires converting written text (graphemes) into fundamental units of sound (phonemes).
English has complex and inconsistent spelling-to-sound rules, making G2P particularly challenging:
| Word | Graphemes | Phonemes (IPA) |
|---|---|---|
| through | t-h-r-o-u-g-h | /θruː/ |
| though | t-h-o-u-g-h | /ðoʊ/ |
| thought | t-h-o-u-g-h-t | /θɔːt/ |
Approaches to G2P range from classical to neural; a minimal dictionary-lookup sketch follows the list.
- Rule-based systems
- Dictionary lookup (CMU Pronouncing Dictionary)
- Hidden Markov Models
- Transformer seq2seq models
- Character-level CNNs
- Phoneme-aware embeddings
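As a sketch of the dictionary-lookup approach using the `pronouncing` package (assumed available, as above), the three look-alike words from the table resolve to distinct ARPAbet strings; out-of-vocabulary words are exactly where the neural approaches take over.

```python
import pronouncing  # CMU Pronouncing Dictionary lookup

for word in ["through", "though", "thought"]:
    phones = pronouncing.phones_for_word(word)  # list of dictionary variants
    print(word, "->", phones[0] if phones else "<OOV: fall back to neural G2P>")

# Expected ARPAbet output, mirroring the IPA table above:
# through -> TH R UW1
# though  -> DH OW1
# thought -> TH AO1 T
```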
Phoneme-to-Viseme Mapping for Lip-Sync
The final step in creating a full visual performance is animating a character to appear as if singing the generated lyrics. This relies on mapping phoneme sequences to visual mouth shapes (visemes).
A viseme is the visual counterpart to a phoneme. Multiple phonemes can map to the same viseme since different sounds can produce identical lip shapes:
- Bilabial: /p/, /b/, /m/
- Rounded: /w/, /u/, /o/
- Spread: /i/, /e/
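In code, this is a many-to-one lookup table. The sketch below uses only the three illustrative classes above; production systems typically use larger inventories (on the order of 15-20 visemes), and the neutral fallback class is an assumption.

```python
# Simplified many-to-one phoneme-to-viseme table (illustrative classes only).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "w": "rounded",  "u": "rounded",  "o": "rounded",
    "i": "spread",   "e": "spread",
}

def to_visemes(phonemes):
    # Unlisted phonemes fall back to a neutral mouth shape.
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(to_visemes(["m", "u", "i", "t"]))  # ['bilabial', 'rounded', 'spread', 'neutral']
```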
Modern Lip-Sync Pipeline
1. Extract audio and text/phoneme transcript
2. Analyze audio for precise phoneme timing
3. Map timed phonemes to a viseme sequence
4. Generate keyframes for mouth animation
5. Interpolate between keyframes for smooth motion
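Steps 3-5 might look like the following sketch: timed phonemes become viseme keyframes, and the viseme track is sampled at the video frame rate with a short linear crossfade between shapes. The timing values, blend window, and return convention are all illustrative assumptions.

```python
from bisect import bisect_right

VISEME = {"m": "bilabial", "u": "rounded", "i": "spread"}  # tiny inline table

# (phoneme, onset in seconds), e.g. from a forced aligner (step 2).
timed_phonemes = [("m", 0.00), ("u", 0.12), ("i", 0.30)]

# Step 3: timed phonemes -> timed viseme keyframes.
keyframes = [(t, VISEME.get(p, "neutral")) for p, t in timed_phonemes]

# Steps 4-5: sample the track per video frame, crossfading near transitions.
def viseme_at(t, keyframes, blend=0.04):
    """Return (current_viseme, next_viseme, blend_weight) at time t."""
    times = [kt for kt, _ in keyframes]
    i = max(bisect_right(times, t) - 1, 0)
    current = keyframes[i][1]
    if i + 1 < len(keyframes) and keyframes[i + 1][0] - t < blend:
        weight = 1.0 - (keyframes[i + 1][0] - t) / blend
        return current, keyframes[i + 1][1], weight
    return current, current, 0.0

for frame in range(8):  # sample at 25 fps
    t = frame / 25.0
    print(f"{t:.2f}s", viseme_at(t, keyframes))
```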
End-to-End Performance Synthesis
The ultimate goal is integrating lyric generation, singing voice synthesis, and facial animation into unified end-to-end models capable of generating complete, synchronized audio-visual performances.
Representative systems illustrate the design space:
- A unified framework built on a Multimodal Diffusion Transformer (MMDiT) for synchronized audio-video generation. Key innovation: a lyrics-transcription encoder that maps graphemes and phonemes to frame-level representations for tight synchronization.
- A model that animates 4D talking avatars directly from text, synthesizing speech and visual performance jointly. Architecture: dual diffusion transformers with "highway" connections for audio-visual correlation learning.
- An approach that constructs an explicit bridge between text and lip motion through structured viseme sequences. Advantage: robust generation even in audio-free scenarios using linguistic priors.
Evaluation Metrics for Singability
- Syllable-Note Error Rate: Mismatch between syllable and note counts
- Stress Alignment Score: Correlation between linguistic and musical stress
- CLAP/CLaMP Score: Multimodal embedding correlation metrics
- LSE-C/LSE-D: Lip Sync Error Confidence/Distance
- LMD: Mouth Landmark Distance
- Naturalness Score: Human perceptual evaluation
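As one concrete instance, the syllable-note error rate in the first bullet can be computed as a normalized count mismatch averaged over phrases; this exact formula is an assumed, straightforward instantiation rather than a standardized definition.

```python
def syllable_note_error_rate(pairs):
    """pairs: (syllable_count, note_count) per phrase.
    Returns the mean normalized mismatch; 0.0 means perfect alignment."""
    return sum(abs(s - n) / n for s, n in pairs) / len(pairs)

# Two aligned phrases plus the 2-syllable "Greetings" set against 4 notes:
print(syllable_note_error_rate([(4, 4), (4, 4), (2, 4)]))  # ≈ 0.167
```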
Key Takeaways
Beyond Text Generation
True singability requires understanding and modeling multiple levels of linguistic and acoustic structure, from syllables to visual performance.
The Multimodal Future
The field is moving toward unified models that generate complete performances, integrating text, audio, and visual modalities seamlessly.
Evaluation Complexity
Assessing singability requires specialized metrics that go beyond traditional NLP evaluation, considering musical, phonetic, and visual alignment.