Machine Learning & Signals Learning
17 Signal classification
• Direct approach: distance metric and classifier. Distance metric examples: MSE, DTW.
• Feature approach: feature extraction and classifier.
• Shapelets: discriminative sub-sequences of time-series data, used with a classifier.
17.1 Dynamic Time Warping (DTW)
Goal: “elastic” distance metric between two signals.
• Time-domain, non-linear metric that can stretch and compress the time axis to find the best possible match between sequences.
• Applied with k-NN and other classifiers.
• Historically developed to compare the same or different words spoken at different speeds.
Definition
Two signals are defined [?],
\(\seteqnumber{0}{}{0}\)\begin{equation} \begin{aligned} X = \left (x_1,\ldots ,x_m,\ldots ,x_M\right )\\ Y = \left (y_1,\ldots ,y_n,\ldots ,y_N\right )\\ \end{aligned} \end{equation}
The dissimilarity function is the input to the DTW algorithm,
\(\seteqnumber{0}{}{1}\)\begin{equation} d_{mn} = f(x_m,y_n)\ge 0 \end{equation}
For example, it can be the (1D) Euclidean distance, \(d_{mn} = \abs {x_m-y_n}\). The selected path is subject to the following restrictions: the start and end points are \(d_{11}\) and \(d_{MN}\), and the possible moves are monotonic,
• Vertical: \((m,n)\rightarrow (m+1,n)\)
• Horizontal: \((m,n)\rightarrow (m,n+1)\)
• Diagonal: \((m,n)\rightarrow (m+1,n+1)\)
No skips or backward turns are allowed. The path minimizing the total cost \(\sum d_{mn}\) is computed; this minimum cost serves as the “distance” between the two signals after warping.
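The minimum-cost path can be computed with the standard dynamic program. Below is a minimal NumPy sketch (no windowing or slope constraints, which practical implementations often add):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum accumulated cost over all monotonic warping paths,
    with local cost d_mn = |x_m - y_n|."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    M, N = len(x), len(y)
    d = np.abs(np.subtract.outer(x, y))      # cost matrix, shape (M, N)
    D = np.full((M + 1, N + 1), np.inf)      # accumulated cost, padded with inf
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            # only vertical, horizontal and diagonal moves are allowed
            D[m, n] = d[m - 1, n - 1] + min(D[m - 1, n], D[m, n - 1], D[m - 1, n - 1])
    return D[M, N]
```

Note that `dtw_distance(x, x)` is exactly zero, and that signals of different lengths can be compared directly, unlike with MSE.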
Example 17.1: Two signals of different lengths: \(X\) is a chirp of \(M=1000\) samples and \(Y\) is a sinusoid of \(N=399\) samples (Fig. 17.1). The cost matrix \(d_{mn}=\abs {x_m-y_n}\) and the optimal warping path are shown in Fig. 17.2. After alignment along the optimal path, the two signals match closely (Fig. 17.3), demonstrating how DTW stretches the time axis to find the best correspondence.
Multivariate DTW
Same as the univariate case, except for the definition of the distance. For example, for complex signals of dimension \(K\), the Euclidean distance is
\(\seteqnumber{0}{}{2}\)\begin{equation} d_{mn} = \sqrt {\sum _{k=1}^K \left (x_{k,m}-y_{k,n}\right )\left (x_{k,m}-y_{k,n}\right )^*} \end{equation}
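As a sketch, the local distance above can be evaluated directly in NumPy (`local_dist` is an illustrative helper name, not a library function):

```python
import numpy as np

def local_dist(xm, yn):
    """Euclidean local distance d_mn for K-dimensional, possibly complex samples:
    sqrt(sum_k (x_k - y_k)(x_k - y_k)^*)."""
    diff = np.asarray(xm) - np.asarray(yn)
    # (x - y)(x - y)^* is real and non-negative; np.real drops the zero imaginary part
    return float(np.sqrt(np.real(np.sum(diff * np.conj(diff)))))
```

The rest of the DTW algorithm (path restrictions, accumulated-cost recursion) is unchanged.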
17.2 Shapelets
A shapelet is a sub-sequence \(S = (s_1,\ldots ,s_l)\) of length \(l\) that is “maximally representative of a class” [?].
Shapelet distance
Given a shapelet \(S\) of length \(l\) and a signal \(X = (x_1,\ldots ,x_M)\), the shapelet distance is the minimum distance over all possible alignments:
\(\seteqnumber{0}{}{3}\)\begin{equation} d(S, X) = \min _{i=1,\ldots ,M-l+1} \sqrt {\frac {1}{l}\sum _{j=1}^{l}(s_j - x_{i+j-1})^2} \end{equation}
This sliding-window RMSE measures how well the shapelet matches anywhere in the signal (Fig. 17.4).
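The definition translates directly into a short sliding-window routine (a minimal NumPy sketch):

```python
import numpy as np

def shapelet_distance(S, X):
    """Minimum RMSE between shapelet S and every length-l window of signal X."""
    S, X = np.asarray(S, dtype=float), np.asarray(X, dtype=float)
    l = len(S)
    return min(np.sqrt(np.mean((S - X[i:i + l]) ** 2))
               for i in range(len(X) - l + 1))
```

If the shapelet occurs exactly somewhere in the signal, the distance is zero.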
Classification
1. Extract shapelet candidates from the training set (sub-sequences of varying lengths).
2. For each candidate, compute its distance to every training signal.
3. Select the shapelet(s) whose distances best separate the classes (e.g. by information gain).
4. Use the shapelet distances as features for a classifier (e.g. decision tree, SVM).
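A minimal end-to-end sketch on synthetic data: the candidate search of steps 1–3 is skipped by using the known embedded bump as the shapelet, and its distance serves as a one-dimensional feature for a decision stump. The data generator and all names here are illustrative assumptions, not library APIs:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_signal(has_bump, M=100):
    """Noise signal; class 1 additionally contains a bump at a random position."""
    x = 0.1 * rng.standard_normal(M)
    if has_bump:
        i = rng.integers(0, M - 10)
        x[i:i + 10] += np.hanning(10)        # the discriminative pattern
    return x

def shapelet_distance(S, X):
    """Minimum sliding-window RMSE between shapelet S and signal X."""
    l = len(S)
    return min(np.sqrt(np.mean((S - X[i:i + l]) ** 2)) for i in range(len(X) - l + 1))

S = np.hanning(10)                            # shapelet (here: the known bump shape)
X_train = [make_signal(c) for c in [True] * 20 + [False] * 20]
y_train = [1] * 20 + [0] * 20
feats = np.array([[shapelet_distance(S, x)] for x in X_train])   # step 4: feature
clf = DecisionTreeClassifier(max_depth=1).fit(feats, y_train)    # decision stump
```

Signals containing the bump have a near-zero shapelet distance, so a single threshold separates the classes.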
Properties
• A signal may contain more than one discriminative shapelet; multiple shapelets can be combined as a feature vector.
• Shapelets are interpretable: they correspond to physically meaningful signal patterns.
• Optimized for time-domain analysis; less effective for signals whose discriminative information lies in repeated patterns or the frequency domain.
• Shapelet discovery can be computationally expensive (\(O(M^2 \cdot l)\) per candidate); approximate and learned methods exist to reduce cost.
In Python: tslearn (shapelet learning), sktime (shapelet transform classifier).
17.3 Time-series forest
Time-series forest (TSF) is an interval-based method. Instead of operating on raw samples or shapelets, it extracts simple statistics from randomly chosen sub-intervals of the signal.
Algorithm
1. For each tree in the forest, randomly select \(\sqrt {M}\) intervals \([a_i, b_i] \subset \{1,\ldots ,M\}\).
2. From each interval, extract three summary features: mean, standard deviation, and slope.
3. Train a decision tree on the resulting feature vector (of length \(3\sqrt {M}\)).
4. Repeat for all trees and aggregate predictions by majority vote.
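The steps above can be sketched as follows. For brevity, this simplified version shares one random interval set across a scikit-learn random forest instead of drawing fresh intervals per tree as the original TSF does; `interval_features` and the toy data are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def interval_features(x, intervals):
    """Mean, standard deviation and slope per interval -> 3*len(intervals) features."""
    feats = []
    for a, b in intervals:
        seg = x[a:b]
        t = np.arange(len(seg))
        slope = np.polyfit(t, seg, 1)[0] if len(seg) > 1 else 0.0
        feats += [seg.mean(), seg.std(), slope]
    return feats

M = 64
n_int = int(np.sqrt(M))                                  # sqrt(M) random intervals
intervals = [tuple(sorted(rng.choice(M, 2, replace=False))) for _ in range(n_int)]

# toy data: rising vs. falling trends with additive noise
X = [np.linspace(0, s, M) + 0.1 * rng.standard_normal(M) for s in [1] * 20 + [-1] * 20]
y = [1] * 20 + [0] * 20
F = np.array([interval_features(x, intervals) for x in X])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(F, y)
```

The slope feature alone already separates rising from falling trends here, which illustrates why such simple interval statistics are often sufficient.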
Properties
• Somewhat interpretable: each split corresponds to a statistic computed over a specific time interval.
• Computationally efficient compared to shapelet-based methods.
• Captures interval-level (phase) information rather than point-level or subsequence-level patterns.
Variants
Random Interval Spectral Ensemble (RISE) replaces the time-domain summary statistics with spectral features (e.g. periodogram coefficients), making it effective for signals whose discriminative information lies in the frequency domain.
In Python: sktime (interval-based classifiers including TSF and RISE).
17.4 Symbolic aggregate approximation (SAX)
Algorithm
1. Piecewise Aggregate Approximation (PAA): divide the signal of length \(M\) into \(w\) equal-length segments and replace each segment by its mean, producing a reduced representation of length \(w\).
2. Symbolization: map each mean value to a symbol from an alphabet of size \(\alpha \) using breakpoints derived from the standard normal distribution (assuming z-normalized data). The result is a string of length \(w\) over the alphabet \(\{a, b, \ldots \}\).
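Both steps fit in a few lines of NumPy, using the standard library's `NormalDist` for the Gaussian breakpoints (a sketch that assumes the signal length is a multiple of \(w\)):

```python
import numpy as np
from statistics import NormalDist

def sax(x, w, alpha):
    """PAA to w segments, then map each mean to one of alpha symbols using
    standard-normal breakpoints (the input is z-normalized first)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                     # z-normalization
    paa = x.reshape(w, -1).mean(axis=1)              # step 1: PAA (len(x) % w == 0)
    breakpoints = [NormalDist().inv_cdf(i / alpha) for i in range(1, alpha)]
    symbols = np.searchsorted(breakpoints, paa)      # step 2: symbolization
    return "".join(chr(ord("a") + int(s)) for s in symbols)
```

The breakpoints split the standard normal density into \(\alpha \) equiprobable regions, so for z-normalized data all symbols are approximately equally likely.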
Example 17.2: A signal of \(M=128\) samples is converted to SAX with \(w=8\) segments and alphabet size \(\alpha =4\). The PAA reduces the signal to 8 mean values (Fig. 17.5, left). Each mean is mapped to a symbol based on the breakpoints of the standard normal distribution (center). The resulting SAX string is shown on the right.
SAX-VSM
SAX combined with the Vector Space Model (SAX-VSM) [?] applies text-classification ideas to time-series classification:
1. Convert each training signal to a SAX string using a sliding window, producing a “bag of words.”
2. Build a tf-idf weighted term-frequency vector for each class.
3. Classify a new signal by computing the cosine similarity of its bag-of-words vector to each class vector.
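A compact sketch of the pipeline on toy rising/falling ramps, reusing scikit-learn's text tooling for the tf-idf and cosine-similarity steps. The window sizes, `sax_word`/`bag_of_words` helpers, and the data are all illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sax_word(seg, w=4, alpha=4):
    """One SAX word for a window: z-normalize, PAA to w segments, symbolize."""
    seg = (seg - seg.mean()) / (seg.std() + 1e-12)
    paa = seg.reshape(w, -1).mean(axis=1)
    bp = [NormalDist().inv_cdf(i / alpha) for i in range(1, alpha)]
    return "".join(chr(ord("a") + int(s)) for s in np.searchsorted(bp, paa))

def bag_of_words(x, win=16, step=8):
    """Step 1: sliding-window SAX words joined into one text 'document'."""
    return " ".join(sax_word(x[i:i + win]) for i in range(0, len(x) - win + 1, step))

rng = np.random.default_rng(0)
up = [np.linspace(0, 1, 64) + 0.01 * rng.standard_normal(64) for _ in range(5)]
down = [np.linspace(1, 0, 64) + 0.01 * rng.standard_normal(64) for _ in range(5)]

# steps 2-3: one tf-idf class vector per class, cosine similarity for a query
vec = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
class_docs = [" ".join(bag_of_words(x) for x in up),
              " ".join(bag_of_words(x) for x in down)]
C = vec.fit_transform(class_docs)
query = vec.transform([bag_of_words(np.linspace(0, 1, 64))])
pred = int(np.argmax(cosine_similarity(query, C)))    # 0 = rising, 1 = falling
```

Rising windows all map to the same symbolic word, so the query's bag of words aligns with the rising class vector.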
Properties
• Interpretable: discriminative patterns correspond to readable symbolic words.
• Dimensionality reduction is controlled by two parameters: number of segments \(w\) and alphabet size \(\alpha \).
• Sensitive to the choice of \(w\) and \(\alpha \); typically selected by cross-validation.
In Python: tslearn (PAA, SAX), sktime (SAX-based classifiers).
17.5 ROCKET
The RandOm Convolutional KErnel Transform (ROCKET) [?] classifies signals using features extracted by a large set of random convolutional kernels.
Algorithm
1. Generate a large number (e.g. \(10{,}000\)) of random convolutional kernels with random length, weights, bias, dilation, and padding.
2. Convolve each kernel with the input signal.
3. From each convolution output, extract two features: the maximum value and the proportion of positive values (ppv).
4. Train a linear classifier (e.g. ridge regression) on the resulting feature vector.
The key insight is that a sufficiently large collection of random kernels captures diverse patterns (trends, spikes, oscillations at various scales) without any learning or search over the kernel space.
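A stripped-down sketch of the idea (dilation and padding omitted; the kernel count and the toy data are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)

def rocket_features(x, kernels):
    """Two features per kernel: maximum of the convolution output and the
    proportion of positive values (ppv)."""
    feats = []
    for w, b in kernels:
        out = np.convolve(x, w, mode="valid") + b
        feats += [out.max(), (out > 0).mean()]
    return feats

# random kernels: random length and weights plus a random bias
kernels = [(rng.standard_normal(int(rng.choice([7, 9, 11]))), rng.uniform(-1, 1))
           for _ in range(100)]

# toy data: slow vs. fast sinusoids with additive noise
X = [np.sin(np.linspace(0, f, 128)) + 0.1 * rng.standard_normal(128)
     for f in [10] * 20 + [40] * 20]
y = [0] * 20 + [1] * 20
F = np.array([rocket_features(x, kernels) for x in X])
clf = RidgeClassifier().fit(F, y)
```

No kernel is learned or searched for; the linear classifier simply weights whichever random kernels happen to respond differently to the two classes.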
Properties
• Fast: training scales linearly with the number of signals and kernels.
• Accuracy comparable to deep learning and ensemble methods (e.g. HIVE-COTE).
• Not interpretable: random kernels have no direct physical meaning.
MINIROCKET
MINIROCKET [?] is a faster, near-deterministic variant. The main differences are summarized below.
| | ROCKET | MINIROCKET |
| Kernel weights | Random (continuous) | Fixed \(\{-1,\,2\}\) |
| Features | Max value + ppv | ppv only |
| Determinism | Stochastic | Near-deterministic |
| Speed | Fast | Up to \(75\times \) faster |
In Python: sktime (RocketClassifier, MiniRocketClassifier).
17.6 HIVE-COTE
The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [?] is a meta-ensemble that fuses the outputs of several diverse classifiers, each operating on a different signal domain:
• Time domain (e.g. shapelet transform, TSF).
• Frequency domain (e.g. RISE).
• Dictionary-based representation (e.g. SAX-based methods).
• Interval-based and whole-series distance-based classifiers.
Each component classifier produces a probability distribution over the classes. HIVE-COTE combines them using a weighted vote, where the weights reflect each component’s estimated accuracy.
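The weighted vote can be sketched in a few lines (a minimal illustration; the actual method derives the weights from cross-validated accuracy estimates of each component):

```python
import numpy as np

def weighted_vote(probas, weights):
    """Accuracy-weighted combination of per-component class-probability estimates.
    probas: shape (n_components, n_classes); weights: shape (n_components,)."""
    probas = np.asarray(probas, dtype=float)
    w = np.asarray(weights, dtype=float)
    combined = w @ probas / w.sum()          # weighted average of the distributions
    return int(np.argmax(combined))          # predicted class index
```

For example, with two components weighted 0.9 and 0.5, a confident minority component can still flip the decision if its probability estimate is extreme enough.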
Properties
• Among the most accurate non-DL classifiers on standard benchmarks.
• Computationally expensive: trains multiple full classifiers.
• Modular: individual components can be replaced or upgraded independently.
17.7 Summary
Method comparison
The methods presented in this chapter are compared below. The “domain” column indicates the signal representation used by each method. Accuracy and speed are qualitative rankings on standard benchmarks.
| Method | Domain | Accuracy | Speed | Interpretable |
| DTW + \(k\)-NN | Time (distance) | Moderate | Slow | No |
| Shapelets | Time (sub-sequence) | Moderate | Slow | Yes |
| TSF / RISE | Time / Frequency (interval) | Moderate | Fast | Partially |
| SAX-VSM | Symbolic | Moderate | Fast | Yes |
| ROCKET | Time (random convolution) | High | Fast | No |
| HIVE-COTE | Multi-domain (ensemble) | Highest | Very slow | No |
Classical vs. DL approaches
| | Classical (non-DL) | DL |
| Signal modeling | Not required | Not required |
| Multivariate signals | Less effective | Naturally supported |
| Interpretability | Limited (except shapelets, SAX) | Limited |
| Generalization | Low | High |
| Complexity | High compared to model-based methods | High |
| Feature engineering | Required (hand-crafted) | Learned automatically |
| Data requirements | Moderate | Large (orders of magnitude more) |
| Hyper-parameters | Few | Many; optimization required |
| Non-linear signals | Supported | Supported |