Mathematical Foundations
Post #17

Digital Signal Processing: From Analog to Digital Music

Explore the mathematical foundations of digital music - from the Nyquist-Shannon sampling theorem to the Short-Time Fourier Transform. Learn how continuous sound waves become discrete digital signals.

JewelMusic Research Team
January 20, 2025
15 min read
[Figure: Digital signal processing visualization]

The Digital Revolution in Music

Digital Signal Processing (DSP) forms the mathematical backbone of modern music technology. From recording studios to streaming services, every piece of digital music has undergone the fundamental transformations described by DSP theory. Understanding these principles is essential for anyone working with AI music systems.

The Fundamental Challenge

Music exists as continuous pressure waves in air, but computers can only work with discrete numbers. DSP provides the mathematical framework to bridge this gap without losing the essence of the sound.

The Sampling Theorem: Foundation of Digital Audio

The Nyquist-Shannon sampling theorem is the cornerstone of digital audio. It states that a continuous signal can be perfectly reconstructed from its samples if the sampling rate is at least twice the highest frequency present in the signal.

Mathematical Formulation

For a band-limited signal $x(t)$ with maximum frequency $f_{max}$, the sampling rate must satisfy:

$$f_s \geq 2 \cdot f_{max}$$

Half the sampling rate, $f_s/2$, is called the Nyquist frequency. For CD-quality audio with $f_s = 44{,}100$ Hz, the maximum representable frequency is 22,050 Hz.

Why 44.1 kHz?

Human hearing typically extends to 20 kHz. The 44.1 kHz sampling rate provides a 2.05 kHz guard band to accommodate practical anti-aliasing filter designs.

Aliasing: When Sampling Goes Wrong

If the sampling theorem is violated, high frequencies "fold back" into the audible spectrum:

$$f_{alias} = |f_{signal} - n \cdot f_s|$$

where $n$ is the integer that minimizes the absolute value.
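
A quick numerical check makes the folding concrete. The sketch below (plain NumPy, no audio involved) samples a 30 kHz tone at 44.1 kHz and confirms that its samples coincide with those of a 14.1 kHz tone, exactly as the formula predicts with $n = 1$:

```python
import numpy as np

fs = 44100                        # sampling rate (Hz)
f_signal = 30000                  # tone above the 22,050 Hz Nyquist frequency
f_alias = abs(f_signal - 1 * fs)  # |30,000 - 44,100| = 14,100 Hz

n = np.arange(1000)               # sample indices
high_tone = np.sin(2 * np.pi * f_signal * n / fs)
alias_tone = np.sin(2 * np.pi * f_alias * n / fs)

# The sampled 30 kHz tone is indistinguishable from a 14.1 kHz tone
# (up to a sign flip, since 30,000 - 44,100 is negative):
print(np.allclose(high_tone, -alias_tone, atol=1e-9))  # True
```

This is why an anti-aliasing low-pass filter must run before the sampler: once the fold-back has happened, the aliased 14.1 kHz component is indistinguishable from a genuine one.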

Quantization: From Continuous to Discrete Amplitude

While sampling discretizes time, quantization discretizes amplitude. Each sample must be represented by a finite number of bits, introducing quantization error.

Quantization Step Size

For $B$ bits with range $[-V, V]$:

$$\Delta = \frac{2V}{2^B}$$

  • 16-bit audio: $2^{16} = 65{,}536$ levels
  • 24-bit audio: $2^{24} \approx 16.7$ million levels
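
A two-line calculation (taking full-scale range $[-1, 1]$, so $V = 1$) shows how quickly the step size shrinks with bit depth:

```python
# Quantization step size: delta = 2V / 2^B, with V = 1 (full scale [-1, 1])
for B in (16, 24):
    delta = 2.0 / 2**B
    print(f"{B}-bit: {2**B:,} levels, step size = {delta:.2e}")
# 16-bit: 65,536 levels, step size = 3.05e-05
# 24-bit: 16,777,216 levels, step size = 1.19e-07
```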

Signal-to-Noise Ratio

Theoretical maximum SNR:

$$SNR_{dB} = 6.02 \cdot B + 1.76$$

  • 16-bit: ~98 dB SNR
  • 24-bit: ~146 dB SNR

Quantization Error

The quantization error $e[n]$ is bounded by:

$$-\frac{\Delta}{2} \leq e[n] \leq \frac{\Delta}{2}$$

This error manifests as noise, with power:

$$P_{noise} = \frac{\Delta^2}{12}$$
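
All three results (the error bound, the noise power, and the SNR formula above) are easy to verify empirically. A minimal sketch, assuming a full-scale sine as the test signal:

```python
import numpy as np

B = 16
V = 1.0
delta = 2 * V / 2**B

# Full-scale test sine; the frequency is chosen incommensurate with the
# sample grid so the quantization error behaves like white noise
n = np.arange(1_000_000)
x = V * np.sin(2 * np.pi * 0.12345 * n)

x_q = np.round(x / delta) * delta   # uniform mid-tread quantizer
e = x_q - x                         # quantization error

snr_db = 10 * np.log10(np.mean(x**2) / np.mean(e**2))
print(f"max |error|  : {np.abs(e).max():.2e}  (bound Δ/2 = {delta / 2:.2e})")
print(f"noise power  : {np.mean(e**2):.2e}  (Δ²/12 = {delta**2 / 12:.2e})")
print(f"measured SNR : {snr_db:.1f} dB  (theory: {6.02 * B + 1.76:.1f} dB)")
```

With $B = 16$, the measured SNR should land within a fraction of a dB of the predicted 98.1 dB.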

The Short-Time Fourier Transform (STFT)

Music is inherently time-varying - notes start, stop, and change. The STFT provides a time-frequency representation by applying the Fourier transform to short, overlapping windows of the signal.

STFT Mathematical Definition

For signal $x[n]$ and window function $w[n]$:

$$X(m, k) = \sum_{n=0}^{N-1} x[n + mH] \cdot w[n] \cdot e^{-j 2\pi k n / N}$$

Where:

  • $m$ = frame index (time)
  • $k$ = frequency bin index
  • $N$ = FFT size
  • $H$ = hop size (frame shift)

Window Functions

Common windows and their properties:

Hann Window:

$$w[n] = 0.5 - 0.5 \cos\left(\frac{2\pi n}{N-1}\right)$$

Hamming Window:

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$$
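
As a sanity check, the sketch below evaluates both formulas directly and confirms they match SciPy's symmetric window implementations:

```python
import numpy as np
from scipy.signal import windows

N = 2048
n = np.arange(N)

hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

print(np.allclose(hann, windows.hann(N, sym=True)))        # True
print(np.allclose(hamming, windows.hamming(N, sym=True)))  # True

# Hann tapers to exactly zero at the endpoints; Hamming stops at 0.08,
# trading an endpoint discontinuity for a lower first sidelobe.
print(hann[0], hamming[0])  # 0.0  ~0.08
```
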
Time-Frequency Trade-off

The uncertainty principle limits resolution:

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$
  • Long window: good frequency resolution
  • Short window: good time resolution
  • Cannot have both simultaneously (the sketch below quantifies the trade-off)
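
For an $N$-sample window at sampling rate $f_s$, the window spans $N/f_s$ seconds while the FFT bins are $f_s/N$ Hz apart, so improving one resolution by some factor worsens the other by the same factor:

```python
fs = 44100
for N in (256, 2048, 16384):
    dt = N / fs   # window duration: time resolution (s)
    df = fs / N   # FFT bin spacing: frequency resolution (Hz)
    print(f"N={N:5d}: dt = {dt * 1000:6.1f} ms, df = {df:7.2f} Hz")
# N=  256: dt =    5.8 ms, df =  172.27 Hz
# N= 2048: dt =   46.4 ms, df =   21.53 Hz
# N=16384: dt =  371.5 ms, df =    2.69 Hz
```
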
STFT Implementation in Python
```python
import numpy as np
import scipy.signal

def stft(signal, fs=44100, window='hann', nperseg=2048, noverlap=512):
    """
    Compute Short-Time Fourier Transform

    Parameters:
    - signal: Input audio signal
    - fs: Sampling rate
    - window: Window function
    - nperseg: Length of each segment
    - noverlap: Number of points to overlap
    """
    f, t, Zxx = scipy.signal.stft(
        signal,
        fs=fs,
        window=window,
        nperseg=nperseg,
        noverlap=noverlap
    )

    # Convert to magnitude spectrogram (dB); the small offset avoids log(0)
    magnitude = np.abs(Zxx)
    magnitude_db = 20 * np.log10(magnitude + 1e-10)

    return f, t, magnitude_db

# Example usage
duration = 2.0  # seconds
fs = 44100
t = np.linspace(0, duration, int(fs * duration))

# Create a chirp signal (frequency sweep from 100 Hz to 1 kHz)
signal = scipy.signal.chirp(t, f0=100, f1=1000, t1=duration, method='linear')

# Compute STFT
frequencies, time_bins, spectrogram = stft(signal)
print(f"Spectrogram shape: {spectrogram.shape}")
print(f"Time resolution: {time_bins[1] - time_bins[0]:.3f} seconds")
print(f"Frequency resolution: {frequencies[1] - frequencies[0]:.1f} Hz")
```

Spectrograms: Visualizing Time-Frequency Content

The magnitude of the STFT creates a spectrogram - a visual representation showing how the frequency content of a signal changes over time. This representation is fundamental to many music information retrieval (MIR) tasks.

Spectrogram Types

Linear Spectrogram

Direct magnitude of STFT:

$$S_{linear}(m, k) = |X(m, k)|$$

Log-Magnitude Spectrogram

Better matches human perception:

$$S_{log}(m, k) = 20 \cdot \log_{10}(|X(m, k)| + \epsilon)$$

Mel Spectrogram

Frequency axis warped to mel scale:

$$mel(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$$
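
The formula is easy to evaluate directly. The sketch below shows how equal musical intervals (octaves) get compressed into progressively fewer mels at higher frequencies. Note that this is the HTK-style mel scale; librosa's mel filterbanks default to the slightly different Slaney variant (pass htk=True to match this formula).

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Each step below is one octave, yet the mel increments keep shrinking:
for f in (110, 220, 440, 880, 1760, 3520):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
# e.g. 440 Hz -> ~549.6 mel, 880 Hz -> ~917.5 mel
```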

Phase: The Often Overlooked Component

While magnitude spectrograms are widely used, phase information is crucial for signal reconstruction and contains important perceptual information.

Phase Representation

The complex STFT can be decomposed into magnitude and phase:

$$X(m, k) = |X(m, k)| \cdot e^{j\phi(m, k)}$$

Phase unwrapping for continuity:

$$\phi_{unwrapped}[n] = \phi[n] + 2\pi M[n]$$

where $M[n]$ is chosen to minimize phase discontinuities.
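
NumPy implements exactly this correction as np.unwrap. A minimal sketch:

```python
import numpy as np

# A steadily increasing phase, observed only modulo 2*pi
true_phase = np.linspace(0, 6 * np.pi, 100)
wrapped = np.angle(np.exp(1j * true_phase))  # folded into (-pi, pi]

unwrapped = np.unwrap(wrapped)  # adds the 2*pi*M[n] correction
print(np.allclose(unwrapped, true_phase))    # True
```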

Practical DSP Pipeline for Music Analysis

Complete DSP Pipeline
```python
import numpy as np
import librosa

class MusicDSPPipeline:
    def __init__(self, sr=22050, n_fft=2048, hop_length=512):
        self.sr = sr
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.window = 'hann'

    def load_audio(self, filepath):
        """Load and normalize audio file"""
        audio, _ = librosa.load(filepath, sr=self.sr, mono=True)
        # Normalize to [-1, 1], guarding against silent (all-zero) input
        peak = np.max(np.abs(audio))
        if peak > 0:
            audio = audio / peak
        return audio

    def compute_stft(self, audio):
        """Compute complex STFT"""
        D = librosa.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=self.window
        )
        return D

    def compute_features(self, audio):
        """Extract various spectral features"""
        D = self.compute_stft(audio)

        features = {
            'magnitude': np.abs(D),
            'phase': np.angle(D),
            'power': np.abs(D) ** 2,
            'log_magnitude': librosa.amplitude_to_db(np.abs(D)),
            'mel_spectrogram': librosa.feature.melspectrogram(
                S=np.abs(D) ** 2,
                sr=self.sr,
                n_mels=128
            ),
            'mfcc': librosa.feature.mfcc(
                S=librosa.power_to_db(np.abs(D) ** 2),
                sr=self.sr,
                n_mfcc=13
            )
        }

        return features

    def reconstruct_audio(self, magnitude, phase):
        """Reconstruct audio from magnitude and phase"""
        D_reconstructed = magnitude * np.exp(1j * phase)
        audio_reconstructed = librosa.istft(
            D_reconstructed,
            hop_length=self.hop_length,
            window=self.window
        )
        return audio_reconstructed

    def griffin_lim(self, magnitude, n_iter=32):
        """Phase reconstruction using the Griffin-Lim algorithm"""
        return librosa.griffinlim(
            magnitude,
            n_iter=n_iter,
            hop_length=self.hop_length,
            window=self.window
        )

# Example usage
pipeline = MusicDSPPipeline()

# Process audio
audio = pipeline.load_audio('song.wav')
features = pipeline.compute_features(audio)

# Reconstruct from magnitude only (Griffin-Lim estimates the phase)
reconstructed = pipeline.griffin_lim(features['magnitude'])

print(f"Original length: {len(audio)} samples")
print(f"Spectrogram shape: {features['magnitude'].shape}")
print(f"Mel-spectrogram shape: {features['mel_spectrogram'].shape}")
print(f"MFCC shape: {features['mfcc'].shape}")
```

Advanced Topics: Beyond Basic DSP

Constant-Q Transform (CQT)

Logarithmic frequency spacing matching musical scales:

$$f_k = f_{min} \cdot 2^{k/B}$$

where $B$ is the number of bins per octave (typically 12, matching the chromatic scale).
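
A quick check that this spacing lands on musical pitches: starting from C1 (about 32.70 Hz) with 12 bins per octave, consecutive bins are one equal-tempered semitone apart and every twelfth bin is an octave:

```python
import numpy as np

f_min = 32.70   # C1 in Hz (A4 = 440 Hz equal temperament)
B = 12          # bins per octave
k = np.arange(25)

f_k = f_min * 2.0 ** (k / B)
print(f_k[:13].round(2))
# [32.7  34.64 36.7 ... 65.4]  -> bin 12 is ~65.4 Hz, i.e. C2, one octave up
```

In practice, librosa.cqt(y, sr=sr, fmin=32.70, bins_per_octave=12) computes the transform itself with exactly this bin layout.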

Wavelet Transform

Multi-resolution analysis with variable time-frequency resolution:

$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^*\left(\frac{t-b}{a}\right) dt$$
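
A minimal sketch of a continuous wavelet transform, assuming the PyWavelets package (pywt) is available; 'morl' is its continuous Morlet wavelet:

```python
import numpy as np
import pywt  # PyWavelets

fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)

scales = np.arange(1, 64)   # small scale = high frequency
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)

print(coeffs.shape)  # (63, 1000): one row of coefficients per scale
print(f"{freqs.max():.0f} Hz down to {freqs.min():.0f} Hz")
```
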
Non-Negative Matrix Factorization

Decompose spectrogram into basis and activation:

$$V \approx W \cdot H$$

NMF is widely used for source separation and feature learning.
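
As a sketch of how this looks in code, here is scikit-learn's NMF applied to a toy non-negative matrix standing in for a magnitude spectrogram (in a real pipeline, V would be features['magnitude'] from the class above):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a magnitude spectrogram: n_freq x n_frames, non-negative
rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(1025, 200)))

model = NMF(n_components=8, init='nndsvd', max_iter=500, random_state=0)
W = model.fit_transform(V)   # spectral basis vectors, shape (1025, 8)
H = model.components_        # per-frame activations, shape (8, 200)

# Relative reconstruction error of the rank-8 approximation
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

Each column of W tends to capture one recurring spectral shape (a note or an instrument timbre), and the corresponding row of H indicates when it is active.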

Key Takeaways

  • Sampling theorem is fundamental: Must sample at least 2× the highest frequency to avoid aliasing. This sets the foundation for all digital audio.
  • Quantization introduces noise: The bit depth determines dynamic range with approximately 6 dB per bit of SNR improvement.
  • STFT enables time-frequency analysis: Essential for understanding how musical content evolves over time.
  • Trade-offs are inevitable: The uncertainty principle limits simultaneous time and frequency resolution.


Continue Reading

Next Article
Matrix Factorization Methods for Music Analysis
Explore ICA and NMF techniques for source separation and feature extraction in music signals.