Digital Signal Processing: From Analog to Digital Music
Explore the mathematical foundations of digital music - from the Nyquist-Shannon sampling theorem to the Short-Time Fourier Transform. Learn how continuous sound waves become discrete digital signals.
The Digital Revolution in Music
Digital Signal Processing (DSP) forms the mathematical backbone of modern music technology. From recording studios to streaming services, every piece of digital music has undergone the fundamental transformations described by DSP theory. Understanding these principles is essential for anyone working with AI music systems.
Music exists as continuous pressure waves in air, but computers can only work with discrete numbers. DSP provides the mathematical framework to bridge this gap without losing the essence of the sound.
The Sampling Theorem: Foundation of Digital Audio
The Nyquist-Shannon sampling theorem is the cornerstone of digital audio. It states that a continuous signal can be perfectly reconstructed from its samples if the sampling rate is at least twice the highest frequency present in the signal.
For a band-limited signal with maximum frequency $f_{\max}$, the minimum sampling rate must satisfy:

$$f_s \geq 2 f_{\max}$$
Half the sampling rate, $f_s / 2$, is called the Nyquist frequency. For CD-quality audio with $f_s = 44{,}100$ Hz, the maximum representable frequency is 22,050 Hz.
Why 44.1 kHz?
Human hearing typically extends to 20 kHz. The 44.1 kHz sampling rate provides a 2.05 kHz guard band to accommodate practical anti-aliasing filter designs.
If the sampling theorem is violated, high frequencies "fold back" into the audible spectrum:
$$f_{\text{alias}} = \lvert f - k \cdot f_s \rvert$$

where $k$ is the integer that minimizes the absolute value.
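A quick numeric check of this folding (a minimal sketch; the 30 kHz test tone is an arbitrary choice above the CD Nyquist frequency):

```python
import numpy as np

fs = 44100          # CD sampling rate
f_tone = 30000      # above the 22,050 Hz Nyquist frequency
n = np.arange(4096)

# The sampled 30 kHz tone is indistinguishable from one at
# |30000 - 1 * 44100| = 14,100 Hz
x = np.sin(2 * np.pi * f_tone * n / fs)

# Locate the spectral peak of the sampled signal
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(f"Peak near {freqs[np.argmax(spectrum)]:.0f} Hz")  # ~14,100 Hz
```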
Quantization: From Continuous to Discrete Amplitude
While sampling discretizes time, quantization discretizes amplitude. Each sample must be represented by a finite number of bits, introducing quantization error.
For $B$ bits spanning the range $[-1, 1]$, the quantizer step size is:

$$\Delta = \frac{2}{2^B} = 2^{1-B}$$

giving $2^B$ representable levels:

- 16-bit audio: 65,536 levels
- 24-bit audio: ~16.7 million levels
The theoretical maximum SNR for a full-scale sinusoid is:

$$\text{SNR} = 6.02B + 1.76 \text{ dB}$$

- 16-bit: ~98 dB SNR
- 24-bit: ~146 dB SNR
The quantization error is bounded by:

$$\lvert e[n] \rvert \leq \frac{\Delta}{2}$$

This error manifests as noise; modeled as uniformly distributed over $[-\Delta/2, \Delta/2]$, its power is:

$$\sigma_e^2 = \frac{\Delta^2}{12}$$
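These bounds are easy to verify empirically. The sketch below quantizes a full-scale sine wave (the test frequency and sample count are arbitrary choices) and compares the measured SNR against the $6.02B + 1.76$ dB rule:

```python
import numpy as np

def quantization_snr(bits, n_samples=100_000):
    """Quantize a full-scale sine to `bits` and return the SNR in dB."""
    n = np.arange(n_samples)
    x = np.sin(2 * np.pi * 0.1237 * n)   # full-scale test tone
    step = 2.0 / (2 ** bits)             # Delta for the range [-1, 1]
    xq = np.round(x / step) * step       # uniform quantizer
    noise = x - xq
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

for b in (8, 16, 24):
    print(f"{b}-bit: {quantization_snr(b):5.1f} dB "
          f"(theory: {6.02 * b + 1.76:.1f} dB)")
```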
The Short-Time Fourier Transform (STFT)
Music is inherently time-varying - notes start, stop, and change. The STFT provides a time-frequency representation by applying the Fourier transform to short, overlapping windows of the signal.
For a signal $x[n]$ and window function $w[n]$:

$$X[m, k] = \sum_{n=0}^{N-1} x[n + mH] \, w[n] \, e^{-j 2\pi k n / N}$$

Where:
- $m$ = frame index (time)
- $k$ = frequency bin index
- $N$ = FFT size
- $H$ = hop size (frame shift)
Common windows and their properties:
Hann Window:

$$w[n] = 0.5 \left(1 - \cos\frac{2\pi n}{N - 1}\right)$$

Hamming Window:

$$w[n] = 0.54 - 0.46 \cos\frac{2\pi n}{N - 1}$$
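Both formulas can be checked against scipy's reference implementations (a minimal sketch; `sym=True` selects the symmetric variant with the $N-1$ denominator used above):

```python
import numpy as np
from scipy.signal import windows

N = 2048
n = np.arange(N)

hann_manual = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))
hamming_manual = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

print(np.allclose(hann_manual, windows.hann(N, sym=True)))        # True
print(np.allclose(hamming_manual, windows.hamming(N, sym=True)))  # True
```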
The uncertainty principle limits resolution:
- Long window: good frequency resolution, poor time resolution
- Short window: good time resolution, poor frequency resolution
- Cannot have both simultaneously (see the sketch after this list)
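For a window of $N$ samples at sampling rate $f_s$, the frequency resolution is $f_s / N$ and the time span of one window is $N / f_s$. A quick sketch of the trade-off (the window lengths are arbitrary examples):

```python
fs = 44100  # sampling rate (Hz)

# Frequency resolution fs/N vs. window duration N/fs
for n_fft in (256, 2048, 16384):
    print(f"N = {n_fft:5d}: "
          f"delta_f = {fs / n_fft:7.2f} Hz, "
          f"window = {1000 * n_fft / fs:7.2f} ms")
```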
```python
import numpy as np
import scipy.signal

def stft(signal, fs=44100, window='hann', nperseg=2048, noverlap=512):
    """
    Compute the Short-Time Fourier Transform.

    Parameters:
    - signal: Input audio signal
    - fs: Sampling rate
    - window: Window function
    - nperseg: Length of each segment
    - noverlap: Number of points to overlap
    """
    f, t, Zxx = scipy.signal.stft(
        signal,
        fs=fs,
        window=window,
        nperseg=nperseg,
        noverlap=noverlap
    )

    # Convert to magnitude spectrogram (dB); the small offset avoids log(0)
    magnitude = np.abs(Zxx)
    magnitude_db = 20 * np.log10(magnitude + 1e-10)

    return f, t, magnitude_db

# Example usage
duration = 2.0  # seconds
fs = 44100
# endpoint=False keeps the sample spacing exactly 1/fs
t = np.linspace(0, duration, int(fs * duration), endpoint=False)

# Create a chirp signal (frequency sweep)
signal = scipy.signal.chirp(t, f0=100, f1=1000, t1=duration, method='linear')

# Compute STFT
frequencies, time_bins, spectrogram = stft(signal)
print(f"Spectrogram shape: {spectrogram.shape}")
print(f"Time resolution: {time_bins[1] - time_bins[0]:.3f} seconds")
print(f"Frequency resolution: {frequencies[1] - frequencies[0]:.1f} Hz")
```
Spectrograms: Visualizing Time-Frequency Content
The magnitude of the STFT creates a spectrogram - a visual representation showing how the frequency content of a signal changes over time. This is fundamental to many MIR tasks.
Linear Spectrogram
Direct magnitude of the STFT:

$$S[m, k] = \lvert X[m, k] \rvert$$
Log-Magnitude Spectrogram
Better matches human perception:

$$S_{\text{dB}}[m, k] = 20 \log_{10}\left(\lvert X[m, k] \rvert + \epsilon\right)$$

where $\epsilon$ is a small constant that avoids $\log 0$.
Mel Spectrogram
Frequency axis warped to the mel scale:

$$\text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
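The constants above are the HTK-style mel formula; librosa defaults to the Slaney variant, so `htk=True` is needed to match it (a minimal sketch):

```python
import numpy as np
import librosa

def hz_to_mel(f):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595 * np.log10(1 + np.asarray(f) / 700)

# Equal ratios in Hz are not equal distances in mel
print(hz_to_mel([440, 880, 1760]))
print(librosa.hz_to_mel(440, htk=True))  # matches hz_to_mel(440)
```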
Phase: The Often Overlooked Component
While magnitude spectrograms are widely used, phase information is crucial for signal reconstruction and contains important perceptual information.
The complex STFT can be decomposed into magnitude and phase:

$$X[m, k] = \lvert X[m, k] \rvert \, e^{j \phi[m, k]}$$

Phase unwrapping for continuity:

$$\phi_{\text{unwrapped}}[m, k] = \phi[m, k] + 2\pi n$$

where $n$ is chosen to minimize phase discontinuities between consecutive frames.
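numpy ships this correction as `np.unwrap`; a minimal sketch (the synthetic phase ramp is an arbitrary test signal, and the axis choice for a real STFT matrix is illustrative):

```python
import numpy as np

# A linear phase ramp, wrapped into (-pi, pi] by angle()
phase = np.angle(np.exp(1j * np.linspace(0, 20, 50)))
unwrapped = np.unwrap(phase)  # adds 2*pi*n where needed to restore continuity
print(f"wrapped max: {phase.max():.2f}, unwrapped max: {unwrapped.max():.2f}")

# For an STFT phase matrix (bins x frames), unwrap along the frame axis:
# phase_unwrapped = np.unwrap(np.angle(D), axis=1)
```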
Practical DSP Pipeline for Music Analysis
```python
import numpy as np
import librosa

class MusicDSPPipeline:
    def __init__(self, sr=22050, n_fft=2048, hop_length=512):
        self.sr = sr
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.window = 'hann'

    def load_audio(self, filepath):
        """Load and normalize audio file"""
        audio, _ = librosa.load(filepath, sr=self.sr, mono=True)
        # Normalize to [-1, 1], guarding against silent input
        peak = np.max(np.abs(audio))
        if peak > 0:
            audio = audio / peak
        return audio

    def compute_stft(self, audio):
        """Compute complex STFT"""
        return librosa.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=self.window
        )

    def compute_features(self, audio):
        """Extract various spectral features"""
        D = self.compute_stft(audio)
        power = np.abs(D) ** 2

        # Mel spectrogram computed once, then reused for the MFCCs
        mel = librosa.feature.melspectrogram(S=power, sr=self.sr, n_mels=128)

        features = {
            'magnitude': np.abs(D),
            'phase': np.angle(D),
            'power': power,
            'log_magnitude': librosa.amplitude_to_db(np.abs(D)),
            'mel_spectrogram': mel,
            # MFCCs are the DCT of the log-mel spectrogram, so the mel
            # spectrogram (not the linear-frequency one) is passed here
            'mfcc': librosa.feature.mfcc(
                S=librosa.power_to_db(mel),
                sr=self.sr,
                n_mfcc=13
            )
        }

        return features

    def reconstruct_audio(self, magnitude, phase):
        """Reconstruct audio from magnitude and phase"""
        D_reconstructed = magnitude * np.exp(1j * phase)
        return librosa.istft(
            D_reconstructed,
            hop_length=self.hop_length,
            window=self.window
        )

    def griffin_lim(self, magnitude, n_iter=32):
        """Phase reconstruction using the Griffin-Lim algorithm"""
        return librosa.griffinlim(
            magnitude,
            n_iter=n_iter,
            hop_length=self.hop_length,
            window=self.window
        )

# Example usage
pipeline = MusicDSPPipeline()

# Process audio
audio = pipeline.load_audio('song.wav')
features = pipeline.compute_features(audio)

# Reconstruct from magnitude only (Griffin-Lim)
reconstructed = pipeline.griffin_lim(features['magnitude'])

print(f"Original length: {len(audio)} samples")
print(f"Spectrogram shape: {features['magnitude'].shape}")
print(f"Mel-spectrogram shape: {features['mel_spectrogram'].shape}")
print(f"MFCC shape: {features['mfcc'].shape}")
```
Advanced Topics: Beyond Basic DSP
Constant-Q Transform (CQT)

Logarithmic frequency spacing matching musical scales:

$$f_k = f_{\min} \cdot 2^{k/b}$$

where $b$ is the number of bins per octave (typically 12 for the chromatic scale).
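librosa provides this transform directly; a minimal sketch on a synthetic tone (the tone, frequency range, and bin count are arbitrary choices):

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)  # A4 test tone

# 12 bins per octave -> one CQT bin per semitone, 7 octaves from C1
C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz('C1'),
                       n_bins=84, bins_per_octave=12))
print(C.shape)                     # (84, n_frames)
print(np.argmax(C.mean(axis=1)))  # bin index of the 440 Hz peak
```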
Wavelet Transform

Multi-resolution analysis with variable time-frequency resolution:

$$W(a, \tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t) \, \psi^*\!\left(\frac{t - \tau}{a}\right) dt$$

where $a$ is the scale and $\tau$ the time shift.
Non-negative Matrix Factorization (NMF)

Decompose the spectrogram into basis and activation matrices:

$$V \approx W H, \qquad V, W, H \geq 0$$

where $V$ is the magnitude spectrogram, the columns of $W$ are spectral templates, and the rows of $H$ are their activations over time. Used for source separation and feature learning.
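A minimal sketch using librosa.decompose.decompose (a thin wrapper around scikit-learn's NMF); the two-tone mixture and the component count are arbitrary choices:

```python
import numpy as np
import librosa

# Synthetic two-tone mixture so the example is self-contained
sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
W, H = librosa.decompose.decompose(S, n_components=2, sort=True)

print(W.shape, H.shape)  # (1025, 2) basis vectors, (2, n_frames) activations
print(f"relative error: {np.linalg.norm(S - W @ H) / np.linalg.norm(S):.3f}")
```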
Key Takeaways
- Sampling theorem is fundamental: Must sample at least 2× the highest frequency to avoid aliasing. This sets the foundation for all digital audio.
- Quantization introduces noise: The bit depth determines dynamic range with approximately 6 dB per bit of SNR improvement.
- STFT enables time-frequency analysis: Essential for understanding how musical content evolves over time.
- Trade-offs are inevitable: The uncertainty principle limits simultaneous time and frequency resolution.