Mathematical Foundations
Post #17

Digital Signal Processing: From Analog to Digital Music

Explore the mathematical foundations of digital music - from the Nyquist-Shannon sampling theorem to the Short-Time Fourier Transform. Learn how continuous sound waves become discrete digital signals.

JewelMusic Research Team
January 20, 2025
15 min read
[Figure: Digital signal processing visualization]

The Digital Revolution in Music

Digital Signal Processing (DSP) forms the mathematical backbone of modern music technology. From recording studios to streaming services, every piece of digital music has undergone the fundamental transformations described by DSP theory. Understanding these principles is essential for anyone working with AI music systems.

The Fundamental Challenge

Music exists as continuous pressure waves in air, but computers can only work with discrete numbers. DSP provides the mathematical framework to bridge this gap without losing the essence of the sound.

The Sampling Theorem: Foundation of Digital Audio

The Nyquist-Shannon sampling theorem is the cornerstone of digital audio. It states that a continuous signal can be perfectly reconstructed from its samples if the sampling rate is at least twice the highest frequency present in the signal.

Mathematical Formulation

For a band-limited signal $x(t)$ with maximum frequency $f_{max}$, the sampling rate must satisfy:

$$f_s \geq 2 \cdot f_{max}$$

Half the sampling rate, $f_s/2$, is called the Nyquist frequency. For CD-quality audio with $f_s = 44{,}100$ Hz, the maximum representable frequency is 22,050 Hz.

Why 44.1 kHz?

Human hearing typically extends to 20 kHz. The 44.1 kHz sampling rate provides a 2.05 kHz guard band to accommodate practical anti-aliasing filter designs.

Aliasing: When Sampling Goes Wrong

If the sampling theorem is violated, high frequencies "fold back" into the audible spectrum:

$$f_{alias} = |f_{signal} - n \cdot f_s|$$

where $n$ is the integer that minimizes the absolute value.
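
A quick numerical check makes the folding concrete. The sketch below (plain NumPy, no audio involved) samples a 30 kHz tone at 44.1 kHz and confirms that its samples coincide with those of a 14.1 kHz tone, exactly as the formula predicts with $n = 1$:

```python
import numpy as np

fs = 44100                        # sampling rate (Hz)
f_signal = 30000                  # tone above the 22,050 Hz Nyquist frequency
f_alias = abs(f_signal - 1 * fs)  # |30,000 - 44,100| = 14,100 Hz

n = np.arange(1000)               # sample indices
high_tone = np.sin(2 * np.pi * f_signal * n / fs)
alias_tone = np.sin(2 * np.pi * f_alias * n / fs)

# The sampled 30 kHz tone is indistinguishable from a 14.1 kHz tone
# (up to a sign flip, since 30,000 - 44,100 is negative):
print(np.allclose(high_tone, -alias_tone, atol=1e-9))  # True
```

This is why an anti-aliasing low-pass filter must run before the sampler: once the fold-back has happened, the aliased 14.1 kHz component is indistinguishable from a genuine one.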

Quantization: From Continuous to Discrete Amplitude

While sampling discretizes time, quantization discretizes amplitude. Each sample must be represented by a finite number of bits, introducing quantization error.

Quantization Step Size

For $B$ bits with range $[-V, V]$:

$$\Delta = \frac{2V}{2^B}$$

  • 16-bit audio: $2^{16} = 65{,}536$ levels
  • 24-bit audio: $2^{24} \approx 16.7$ million levels
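
A two-line calculation (taking full-scale range $[-1, 1]$, so $V = 1$) shows how quickly the step size shrinks with bit depth:

```python
# Quantization step size: delta = 2V / 2^B, with V = 1 (full scale [-1, 1])
for B in (16, 24):
    delta = 2.0 / 2**B
    print(f"{B}-bit: {2**B:,} levels, step size = {delta:.2e}")
# 16-bit: 65,536 levels, step size = 3.05e-05
# 24-bit: 16,777,216 levels, step size = 1.19e-07
```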

Signal-to-Noise Ratio

Theoretical maximum SNR:

$$SNR_{dB} = 6.02 \cdot B + 1.76$$

  • 16-bit: ~98 dB SNR
  • 24-bit: ~146 dB SNR

Quantization Error

The quantization error $e[n]$ is bounded by:

$$-\frac{\Delta}{2} \leq e[n] \leq \frac{\Delta}{2}$$

This error manifests as noise, with power:

$$P_{noise} = \frac{\Delta^2}{12}$$
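
All three results (the error bound, the noise power, and the SNR formula above) are easy to verify empirically. A minimal sketch, assuming a full-scale sine as the test signal:

```python
import numpy as np

B = 16
V = 1.0
delta = 2 * V / 2**B

# Full-scale test sine; the frequency is chosen incommensurate with the
# sample grid so the quantization error behaves like white noise
n = np.arange(1_000_000)
x = V * np.sin(2 * np.pi * 0.12345 * n)

x_q = np.round(x / delta) * delta   # uniform mid-tread quantizer
e = x_q - x                         # quantization error

snr_db = 10 * np.log10(np.mean(x**2) / np.mean(e**2))
print(f"max |error|  : {np.abs(e).max():.2e}  (bound Δ/2 = {delta / 2:.2e})")
print(f"noise power  : {np.mean(e**2):.2e}  (Δ²/12 = {delta**2 / 12:.2e})")
print(f"measured SNR : {snr_db:.1f} dB  (theory: {6.02 * B + 1.76:.1f} dB)")
```

With $B = 16$, the measured SNR should land within a fraction of a dB of the predicted 98.1 dB.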

The Short-Time Fourier Transform (STFT)

Music is inherently time-varying - notes start, stop, and change. The STFT provides a time-frequency representation by applying the Fourier transform to short, overlapping windows of the signal.

STFT Mathematical Definition

For signal $x[n]$ and window function $w[n]$:

$$X(m, k) = \sum_{n=0}^{N-1} x[n + mH] \cdot w[n] \cdot e^{-j 2\pi k n / N}$$

Where:

  • $m$ = frame index (time)
  • $k$ = frequency bin index
  • $N$ = FFT size
  • $H$ = hop size (frame shift)

Window Functions

Common windows and their properties:

Hann Window:

$$w[n] = 0.5 - 0.5 \cos\left(\frac{2\pi n}{N-1}\right)$$

Hamming Window:

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$$
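
As a sanity check, the sketch below evaluates both formulas directly and confirms they match SciPy's symmetric window implementations:

```python
import numpy as np
from scipy.signal import windows

N = 2048
n = np.arange(N)

hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

print(np.allclose(hann, windows.hann(N, sym=True)))        # True
print(np.allclose(hamming, windows.hamming(N, sym=True)))  # True

# Hann tapers to exactly zero at the endpoints; Hamming stops at 0.08,
# trading an endpoint discontinuity for a lower first sidelobe.
print(hann[0], hamming[0])  # 0.0  ~0.08
```
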
Time-Frequency Trade-off

The uncertainty principle limits resolution:

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$
  • Long window: good frequency resolution
  • Short window: good time resolution
  • Cannot have both simultaneously (the sketch below quantifies the trade-off)
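
For an $N$-sample window at sampling rate $f_s$, the window spans $N/f_s$ seconds while the FFT bins are $f_s/N$ Hz apart, so improving one resolution by some factor worsens the other by the same factor:

```python
fs = 44100
for N in (256, 2048, 16384):
    dt = N / fs   # window duration: time resolution (s)
    df = fs / N   # FFT bin spacing: frequency resolution (Hz)
    print(f"N={N:5d}: dt = {dt * 1000:6.1f} ms, df = {df:7.2f} Hz")
# N=  256: dt =    5.8 ms, df =  172.27 Hz
# N= 2048: dt =   46.4 ms, df =   21.53 Hz
# N=16384: dt =  371.5 ms, df =    2.69 Hz
```
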
STFT Implementation in Python
```python
import numpy as np
import scipy.signal

def stft(signal, fs=44100, window='hann', nperseg=2048, noverlap=512):
    """
    Compute Short-Time Fourier Transform

    Parameters:
    - signal: Input audio signal
    - fs: Sampling rate
    - window: Window function
    - nperseg: Length of each segment
    - noverlap: Number of points to overlap
    """
    f, t, Zxx = scipy.signal.stft(
        signal,
        fs=fs,
        window=window,
        nperseg=nperseg,
        noverlap=noverlap
    )

    # Convert to magnitude spectrogram (dB); the small offset avoids log(0)
    magnitude = np.abs(Zxx)
    magnitude_db = 20 * np.log10(magnitude + 1e-10)

    return f, t, magnitude_db

# Example usage
duration = 2.0  # seconds
fs = 44100
t = np.linspace(0, duration, int(fs * duration))

# Create a chirp signal (frequency sweep from 100 Hz to 1 kHz)
signal = scipy.signal.chirp(t, f0=100, f1=1000, t1=duration, method='linear')

# Compute STFT
frequencies, time_bins, spectrogram = stft(signal)
print(f"Spectrogram shape: {spectrogram.shape}")
print(f"Time resolution: {time_bins[1] - time_bins[0]:.3f} seconds")
print(f"Frequency resolution: {frequencies[1] - frequencies[0]:.1f} Hz")
```

Spectrograms: Visualizing Time-Frequency Content

The magnitude of the STFT creates a spectrogram - a visual representation showing how the frequency content of a signal changes over time. This representation is fundamental to many music information retrieval (MIR) tasks.

Spectrogram Types

Linear Spectrogram

Direct magnitude of STFT:

$$S_{linear}(m, k) = |X(m, k)|$$

Log-Magnitude Spectrogram

Better matches human perception:

$$S_{log}(m, k) = 20 \cdot \log_{10}(|X(m, k)| + \epsilon)$$

Mel Spectrogram

Frequency axis warped to mel scale:

$$mel(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$$
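
The formula is easy to evaluate directly. The sketch below shows how equal musical intervals (octaves) get compressed into progressively fewer mels at higher frequencies. Note that this is the HTK-style mel scale; librosa's mel filterbanks default to the slightly different Slaney variant (pass htk=True to match this formula).

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Each step below is one octave, yet the mel increments keep shrinking:
for f in (110, 220, 440, 880, 1760, 3520):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
# e.g. 440 Hz -> ~549.6 mel, 880 Hz -> ~917.5 mel
```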

Phase: The Often Overlooked Component

While magnitude spectrograms are widely used, phase information is crucial for signal reconstruction and contains important perceptual information.

Phase Representation

The complex STFT can be decomposed into magnitude and phase:

$$X(m, k) = |X(m, k)| \cdot e^{j\phi(m, k)}$$

Phase unwrapping for continuity:

$$\phi_{unwrapped}[n] = \phi[n] + 2\pi M[n]$$

where $M[n]$ is chosen to minimize phase discontinuities.
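
NumPy implements exactly this correction as np.unwrap. A minimal sketch:

```python
import numpy as np

# A steadily increasing phase, observed only modulo 2*pi
true_phase = np.linspace(0, 6 * np.pi, 100)
wrapped = np.angle(np.exp(1j * true_phase))  # folded into (-pi, pi]

unwrapped = np.unwrap(wrapped)  # adds the 2*pi*M[n] correction
print(np.allclose(unwrapped, true_phase))    # True
```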

Practical DSP Pipeline for Music Analysis

Complete DSP Pipeline
```python
import numpy as np
import librosa

class MusicDSPPipeline:
    def __init__(self, sr=22050, n_fft=2048, hop_length=512):
        self.sr = sr
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.window = 'hann'

    def load_audio(self, filepath):
        """Load and normalize audio file"""
        audio, _ = librosa.load(filepath, sr=self.sr, mono=True)
        # Normalize to [-1, 1], guarding against silent (all-zero) input
        peak = np.max(np.abs(audio))
        if peak > 0:
            audio = audio / peak
        return audio

    def compute_stft(self, audio):
        """Compute complex STFT"""
        D = librosa.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=self.window
        )
        return D

    def compute_features(self, audio):
        """Extract various spectral features"""
        D = self.compute_stft(audio)

        features = {
            'magnitude': np.abs(D),
            'phase': np.angle(D),
            'power': np.abs(D) ** 2,
            'log_magnitude': librosa.amplitude_to_db(np.abs(D)),
            'mel_spectrogram': librosa.feature.melspectrogram(
                S=np.abs(D) ** 2,
                sr=self.sr,
                n_mels=128
            ),
            'mfcc': librosa.feature.mfcc(
                S=librosa.power_to_db(np.abs(D) ** 2),
                sr=self.sr,
                n_mfcc=13
            )
        }

        return features

    def reconstruct_audio(self, magnitude, phase):
        """Reconstruct audio from magnitude and phase"""
        D_reconstructed = magnitude * np.exp(1j * phase)
        audio_reconstructed = librosa.istft(
            D_reconstructed,
            hop_length=self.hop_length,
            window=self.window
        )
        return audio_reconstructed

    def griffin_lim(self, magnitude, n_iter=32):
        """Phase reconstruction using the Griffin-Lim algorithm"""
        return librosa.griffinlim(
            magnitude,
            n_iter=n_iter,
            hop_length=self.hop_length,
            window=self.window
        )

# Example usage
pipeline = MusicDSPPipeline()

# Process audio
audio = pipeline.load_audio('song.wav')
features = pipeline.compute_features(audio)

# Reconstruct from magnitude only (Griffin-Lim estimates the phase)
reconstructed = pipeline.griffin_lim(features['magnitude'])

print(f"Original length: {len(audio)} samples")
print(f"Spectrogram shape: {features['magnitude'].shape}")
print(f"Mel-spectrogram shape: {features['mel_spectrogram'].shape}")
print(f"MFCC shape: {features['mfcc'].shape}")
```

Advanced Topics: Beyond Basic DSP

Constant-Q Transform (CQT)

Logarithmic frequency spacing matching musical scales:

$$f_k = f_{min} \cdot 2^{k/B}$$

where $B$ is the number of bins per octave (typically 12, matching the chromatic scale).
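
A quick check that this spacing lands on musical pitches: starting from C1 (about 32.70 Hz) with 12 bins per octave, consecutive bins are one equal-tempered semitone apart and every twelfth bin is an octave:

```python
import numpy as np

f_min = 32.70   # C1 in Hz (A4 = 440 Hz equal temperament)
B = 12          # bins per octave
k = np.arange(25)

f_k = f_min * 2.0 ** (k / B)
print(f_k[:13].round(2))
# [32.7  34.64 36.7 ... 65.4]  -> bin 12 is ~65.4 Hz, i.e. C2, one octave up
```

In practice, librosa.cqt(y, sr=sr, fmin=32.70, bins_per_octave=12) computes the transform itself with exactly this bin layout.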

Wavelet Transform

Multi-resolution analysis with variable time-frequency resolution:

$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^*\left(\frac{t-b}{a}\right) dt$$
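
A minimal sketch of a continuous wavelet transform, assuming the PyWavelets package (pywt) is available; 'morl' is its continuous Morlet wavelet:

```python
import numpy as np
import pywt  # PyWavelets

fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 200 * t)

scales = np.arange(1, 64)   # small scale = high frequency
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)

print(coeffs.shape)  # (63, 1000): one row of coefficients per scale
print(f"{freqs.max():.0f} Hz down to {freqs.min():.0f} Hz")
```
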
Non-Negative Matrix Factorization

Decompose spectrogram into basis and activation:

$$V \approx W \cdot H$$

NMF is widely used for source separation and feature learning.
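
As a sketch of how this looks in code, here is scikit-learn's NMF applied to a toy non-negative matrix standing in for a magnitude spectrogram (in a real pipeline, V would be features['magnitude'] from the class above):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a magnitude spectrogram: n_freq x n_frames, non-negative
rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(1025, 200)))

model = NMF(n_components=8, init='nndsvd', max_iter=500, random_state=0)
W = model.fit_transform(V)   # spectral basis vectors, shape (1025, 8)
H = model.components_        # per-frame activations, shape (8, 200)

# Relative reconstruction error of the rank-8 approximation
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

Each column of W tends to capture one recurring spectral shape (a note or an instrument timbre), and the corresponding row of H indicates when it is active.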

Key Takeaways

  • Sampling theorem is fundamental: Must sample at least 2× the highest frequency to avoid aliasing. This sets the foundation for all digital audio.
  • Quantization introduces noise: The bit depth determines dynamic range with approximately 6 dB per bit of SNR improvement.
  • STFT enables time-frequency analysis: Essential for understanding how musical content evolves over time.
  • Trade-offs are inevitable: The uncertainty principle limits simultaneous time and frequency resolution.


Continue Reading

Next Article
Matrix Factorization Methods for Music Analysis
Explore ICA and NMF techniques for source separation and feature extraction in music signals.