Commit 5d438ab

Update Scaling preprocessors (#69)

* Renamed ZNormalizer to StandardScaler
* Implemented RobustScaler.py
* Included more tests

1 parent c8c81eb, commit 5d438ab

20 files changed: +730 additions, -471 deletions

docs/additional_information/changelog.rst
Lines changed: 2 additions & 0 deletions

@@ -12,6 +12,7 @@ Added
 - Implemented ``ClusterBasedLocalOutlierFactor`` (CBLOF) anomaly detector.
 - Implemented ``KMeansAnomalyDetector`` anomaly detector.
 - Implemented ``CopulaBasedOutlierDetector`` (COPOD) anomaly detector.
+- Implemented ``RobustScaler`` preprocessor.

 Changed
 ^^^^^^^
@@ -25,6 +26,7 @@ Changed

 Fixed
 ^^^^^
+- Renamed ``ZNormalizer`` to ``StandardScaler``, to align with the scikit-learn naming.


 [0.2.3] - 2024-12-02

docs/api/preprocessing.rst
Lines changed: 2 additions & 1 deletion

@@ -12,7 +12,8 @@ Preprocessing module
 .. autoclass:: dtaianomaly.preprocessing.ChainedPreprocessor
 .. autoclass:: dtaianomaly.preprocessing.Identity
 .. autoclass:: dtaianomaly.preprocessing.MinMaxScaler
-.. autoclass:: dtaianomaly.preprocessing.ZNormalizer
+.. autoclass:: dtaianomaly.preprocessing.StandardScaler
+.. autoclass:: dtaianomaly.preprocessing.RobustScaler
 .. autoclass:: dtaianomaly.preprocessing.MovingAverage
 .. autoclass:: dtaianomaly.preprocessing.ExponentialMovingAverage
 .. autoclass:: dtaianomaly.preprocessing.SamplingRateUnderSampler

docs/getting_started/examples/quantitative_evaluation.rst
Lines changed: 4 additions & 4 deletions

@@ -45,9 +45,9 @@ is applied.

     preprocessors = [
         Identity(),
-        ZNormalizer(),
-        ChainedPreprocessor([MovingAverage(10), ZNormalizer()]),
-        ChainedPreprocessor([ExponentialMovingAverage(0.8), ZNormalizer()])
+        StandardScaler(),
+        ChainedPreprocessor([MovingAverage(10), StandardScaler()]),
+        ChainedPreprocessor([ExponentialMovingAverage(0.8), StandardScaler()])
     ]

 We will now initialize our anomaly detectors. Each anomaly detector will be combined with each
@@ -124,7 +124,7 @@ as follows:

     { 'type': <name-of-component>, 'optional-param': <value-optional-parameter>}

 The ``'type'`` equals the name of the component, for example ``'LocalOutlierFactor'``
-or ``'ZNormalizer'``. This string must exactly match the object name of the component
+or ``'StandardScaler'``. This string must exactly match the object name of the component
 you want to add to the workflow. In addition, it is possible to define hyperparameters
 of each component. For example for ``'LocalOutlierFactor'``, you must define a
 ``'window_size'``, but can optionally also define a ``'stride'``. An error will be
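The entry convention described above can be illustrated with two hypothetical entries (the parameter values here are made up for illustration):

```python
# Hypothetical workflow entries following the {'type': ..., <params>...} convention:
# the 'type' key names the component, remaining keys are its hyperparameters.
preprocessor_entry = {'type': 'StandardScaler'}
detector_entry = {'type': 'LocalOutlierFactor', 'window_size': 50, 'stride': 1}
```

Passing a `'type'` that does not exactly match a component's class name, or omitting a required hyperparameter such as `'window_size'`, raises an error.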
dtaianomaly/preprocessing/RobustScaler.py (new file)
Lines changed: 97 additions & 0 deletions

import numpy as np
from typing import Optional, Tuple

from sklearn.exceptions import NotFittedError

from dtaianomaly.utils import get_dimension
from dtaianomaly.preprocessing.Preprocessor import Preprocessor


class RobustScaler(Preprocessor):
    """
    Scale the time series using robust statistics.

    The :py:class:`~dtaianomaly.preprocessing.RobustScaler` is similar to
    :py:class:`~dtaianomaly.preprocessing.StandardScaler`, but uses robust
    statistics rather than the mean and standard deviation. The center of the
    data is computed via the median, and the scale is computed as the range
    between two quantiles (by default the IQR). This makes the scaling less
    sensitive to outliers.

    For a time series :math:`x`, center :math:`c` and scale :math:`s`, observation
    :math:`x_i` is scaled to observation :math:`y_i` using the following equation:

    .. math::

       y_i = \\frac{x_i - c}{s}

    Notice the similarity with the formula for standard scaling. For multivariate
    time series, each attribute is scaled independently, with its own center
    and scale.

    Parameters
    ----------
    quantile_range: tuple of (float, float), default = (25.0, 75.0)
        Quantile range used to compute the ``scale_`` of the robust scaler.
        By default, this equals the interquartile range (IQR). The first
        value of the quantile range corresponds to the smaller quantile, the
        second value to the larger quantile. If the first value is not smaller
        than the second value, an error is raised. Both values must be in the
        range [0, 100].

    Attributes
    ----------
    center_: array-like of shape (n_attributes)
        The median value of each attribute of the training data.
    scale_: array-like of shape (n_attributes)
        The quantile range of each attribute of the training data.

    Raises
    ------
    NotFittedError
        If the `transform` method is called before fitting this RobustScaler.
    """
    quantile_range: Tuple[float, float]
    center_: np.array
    scale_: np.array

    def __init__(self, quantile_range: Tuple[float, float] = (25.0, 75.0)):
        if not isinstance(quantile_range, tuple):
            raise TypeError("`quantile_range` should be a tuple")
        if len(quantile_range) != 2:
            raise ValueError("`quantile_range` should consist of exactly two values (length of 2)")
        if not isinstance(quantile_range[0], (float, int)) or isinstance(quantile_range[0], bool):
            raise TypeError("The first element of `quantile_range` should be a float or int")
        if not isinstance(quantile_range[1], (float, int)) or isinstance(quantile_range[1], bool):
            raise TypeError("The second element of `quantile_range` should be a float or int")
        if quantile_range[0] < 0.0:
            raise ValueError("The first element of `quantile_range` must be at least 0.0")
        if quantile_range[1] > 100.0:
            raise ValueError("The second element of `quantile_range` must be at most 100.0")
        if not quantile_range[0] < quantile_range[1]:
            raise ValueError("The first element of `quantile_range` must be smaller than the second element")
        self.quantile_range = quantile_range

    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'RobustScaler':
        # Use the NaN-aware percentile to stay consistent with np.nanmedian.
        if get_dimension(X) == 1:
            # Univariate case
            self.center_ = np.array([np.nanmedian(X)])
            q_min = np.nanpercentile(X, q=self.quantile_range[0])
            q_max = np.nanpercentile(X, q=self.quantile_range[1])
            self.scale_ = np.array([q_max - q_min])
        else:
            # Multivariate case
            self.center_ = np.nanmedian(X, axis=0)
            q_min = np.nanpercentile(X, q=self.quantile_range[0], axis=0)
            q_max = np.nanpercentile(X, q=self.quantile_range[1], axis=0)
            self.scale_ = q_max - q_min
        return self

    def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.ndarray, Optional[np.ndarray]]:
        if not (hasattr(self, 'center_') and hasattr(self, 'scale_')):
            raise NotFittedError(f'Call `fit` before using transform on {str(self)}')
        if not ((len(X.shape) == 1 and self.center_.shape[0] == 1) or X.shape[1] == self.center_.shape[0]):
            nb_attributes = X.shape[1] if X.ndim > 1 else 1
            raise AttributeError(f'Trying to robust scale a time series with {nb_attributes} attributes while it was fitted on {self.center_.shape[0]} attributes!')

        X_ = (X - self.center_) / self.scale_
        # Keep the original values wherever scaling produced NaNs (e.g., a zero scale).
        return np.where(np.isnan(X_), X, X_), y
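To see why the robust statistics matter, here is a small standalone sketch of the scaling math above in plain NumPy (not the `dtaianomaly` API; the series values are made up for illustration):

```python
import numpy as np

# Hypothetical univariate series with one outlier (100.0).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, scale by the IQR.
center = np.nanmedian(x)                           # 3.0
q_min, q_max = np.nanpercentile(x, [25.0, 75.0])   # 2.0, 4.0
scale = q_max - q_min                              # 2.0
robust = (x - center) / scale

# Standard scaling for comparison: the outlier inflates the mean and
# standard deviation, squashing the normal observations toward zero.
standard = (x - np.nanmean(x)) / np.nanstd(x)

print(robust)  # [-1.  -0.5  0.   0.5 48.5]
```

With robust scaling the four normal observations keep a spread of about one unit while the outlier stands far out; with standard scaling all four normal observations collapse into a narrow band around -0.5.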

dtaianomaly/preprocessing/ZNormalizer.py renamed to dtaianomaly/preprocessing/StandardScaler.py
Lines changed: 5 additions & 5 deletions

@@ -6,9 +6,9 @@
 from dtaianomaly.preprocessing.Preprocessor import Preprocessor


-class ZNormalizer(Preprocessor):
+class StandardScaler(Preprocessor):
     """
-    Rescale to zero mean, unit variance.
+    Standard scale the data: rescale to zero mean, unit variance.

     Rescale to zero mean and unit variance. A mean value and standard
     deviation is computed on a training set, after which these values
@@ -37,7 +37,7 @@ class ZNormalizer(Preprocessor):
     Raises
     ------
     NotFittedError
-        If the `transform` method is called before fitting this MinMaxScaler.
+        If the `transform` method is called before fitting this StandardScaler.
     """
     min_std: float
     mean_: np.array
@@ -46,7 +46,7 @@ class ZNormalizer(Preprocessor):
     def __init__(self, min_std: float = 1e-9):
         self.min_std = min_std

-    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'ZNormalizer':
+    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'StandardScaler':
         if len(X.shape) == 1 or X.shape[1] == 1:
             # univariate case
             self.mean_ = np.array([np.nanmean(X)])
@@ -62,7 +62,7 @@ def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.
         if not (hasattr(self, 'mean_') and hasattr(self, 'std_')):
             raise NotFittedError(f'Call `fit` before using transform on {str(self)}')
         if not ((len(X.shape) == 1 and self.mean_.shape[0] == 1) or X.shape[1] == self.mean_.shape[0]):
-            raise AttributeError(f'Trying to z-normalize a time series with {X.shape[0]} attributes while it was fitted on {self.min_.shape[0]} attributes!')
+            raise AttributeError(f'Trying to standard scale a time series with {X.shape[0]} attributes while it was fitted on {self.mean_.shape[0]} attributes!')

         # If the std of all attributes is 0, then no transformation happens
         if np.all((self.std_ < self.min_std)):
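The renamed `StandardScaler` computes the familiar z-score, with the `min_std` guard skipping the transformation for (near-)constant series. A minimal NumPy sketch of that logic (illustrative values, not the `dtaianomaly` implementation):

```python
import numpy as np

# Hypothetical series; values assumed for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])

mean_ = np.nanmean(x)   # 5.0
std_ = np.nanstd(x)     # sqrt(5) ~ 2.236

min_std = 1e-9          # guard against dividing by a (near-)zero std
if np.all(std_ < min_std):
    scaled = x          # (near-)constant series: leave unchanged
else:
    scaled = (x - mean_) / std_

print(scaled.mean(), scaled.std())  # ~0.0, ~1.0
```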

dtaianomaly/preprocessing/__init__.py
Lines changed: 5 additions & 3 deletions

@@ -8,24 +8,26 @@
 from .Preprocessor import Preprocessor, check_preprocessing_inputs, Identity
 from .ChainedPreprocessor import ChainedPreprocessor
 from .MinMaxScaler import MinMaxScaler
-from .ZNormalizer import ZNormalizer
+from .StandardScaler import StandardScaler
 from .MovingAverage import MovingAverage
 from .ExponentialMovingAverage import ExponentialMovingAverage
 from .UnderSampler import SamplingRateUnderSampler, NbSamplesUnderSampler
 from .Differencing import Differencing
 from .PiecewiseAggregateApproximation import PiecewiseAggregateApproximation
+from .RobustScaler import RobustScaler

 __all__ = [
     'Preprocessor',
     'check_preprocessing_inputs',
     'Identity',
     'ChainedPreprocessor',
     'MinMaxScaler',
-    'ZNormalizer',
+    'StandardScaler',
     'MovingAverage',
     'ExponentialMovingAverage',
     'SamplingRateUnderSampler',
     'NbSamplesUnderSampler',
     'Differencing',
-    'PiecewiseAggregateApproximation'
+    'PiecewiseAggregateApproximation',
+    'RobustScaler'
 ]

dtaianomaly/workflow/workflow_from_config.py
Lines changed: 5 additions & 2 deletions

@@ -345,10 +345,10 @@ def preprocessing_entry(entry):
             raise TypeError(f'Too many parameters given for entry: {entry}')
         return preprocessing.MinMaxScaler()

-    elif processing_type == 'ZNormalizer':
+    elif processing_type == 'StandardScaler':
         if len(entry_without_type) > 0:
             raise TypeError(f'Too many parameters given for entry: {entry}')
-        return preprocessing.ZNormalizer()
+        return preprocessing.StandardScaler()

     elif processing_type == 'MovingAverage':
         return preprocessing.MovingAverage(**entry_without_type)
@@ -368,6 +368,9 @@ def preprocessing_entry(entry):
     elif processing_type == 'PiecewiseAggregateApproximation':
         return preprocessing.PiecewiseAggregateApproximation(**entry_without_type)

+    elif processing_type == 'RobustScaler':
+        return preprocessing.RobustScaler(**entry_without_type)
+
     elif processing_type == 'ChainedPreprocessor':
         if len(entry_without_type) != 1:
             raise TypeError(f'ChainedPreprocessor must have base_preprocessors as key: {entry}')
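The dispatch pattern in `preprocessing_entry` can be sketched in a few lines (a simplified stand-in, not the actual `dtaianomaly` code): the `'type'` key selects a class and the remaining keys are forwarded as constructor keyword arguments.

```python
# Simplified, hypothetical sketch of the config-to-object dispatch.
class RobustScaler:
    def __init__(self, quantile_range=(25.0, 75.0)):
        self.quantile_range = quantile_range

# Map 'type' strings to constructors (the real code uses an elif chain).
REGISTRY = {'RobustScaler': RobustScaler}

def preprocessing_entry(entry):
    # Everything except 'type' becomes a keyword argument.
    params = {k: v for k, v in entry.items() if k != 'type'}
    return REGISTRY[entry['type']](**params)

scaler = preprocessing_entry({'type': 'RobustScaler', 'quantile_range': (10.0, 90.0)})
print(scaler.quantile_range)  # (10.0, 90.0)
```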

notebooks/Config.json
Lines changed: 3 additions & 3 deletions

@@ -20,14 +20,14 @@
   ],
   "preprocessors": [
     {"type": "Identity"},
-    {"type": "ZNormalizer"},
+    {"type": "StandardScaler"},
     {"type": "ChainedPreprocessor", "base_preprocessors": [
       {"type": "MovingAverage", "window_size": 10},
-      {"type": "ZNormalizer"}
+      {"type": "StandardScaler"}
     ]},
     {"type": "ChainedPreprocessor", "base_preprocessors": [
       {"type": "ExponentialMovingAverage", "alpha": 0.8},
-      {"type": "ZNormalizer"}
+      {"type": "StandardScaler"}
     ]}
   ],
   "detectors": [

notebooks/Industrial-anomaly-detection.ipynb
Lines changed: 1 addition & 1 deletion

@@ -313,7 +313,7 @@
 "source": [
  "##### (2) Preprocessors\n",
  "\n",
- "Next, we can define zero, one or multiple preprocessors to process the data. ``dtaianomaly`` already offers a number of preprocessors (e.g., ``MinMaxScaler``, ``ZNormalizer``, ``MovingAverage``, ``ChainedPreprocessor``, etc.), but it is also possible to develop a custom preprocessor. For example, the wind turbine data has missing values, which typically cannot be handled by anomaly detectors. To cope with these, we define an ``Imputer`` preprocessor as below. All we need to do for this is add ``Preprocessor`` as a parent of the class and implement the ``._fit()`` and ``._transform()`` methods. For the ``Imputer``, no fitting is required, and the missing values are replaced by the previous observed value. Note that more complex imputation strategies could be implemented as well. "
+ "Next, we can define zero, one or multiple preprocessors to process the data. ``dtaianomaly`` already offers a number of preprocessors (e.g., ``MinMaxScaler``, ``StandardScaler``, ``MovingAverage``, ``ChainedPreprocessor``, etc.), but it is also possible to develop a custom preprocessor. For example, the wind turbine data has missing values, which typically cannot be handled by anomaly detectors. To cope with these, we define an ``Imputer`` preprocessor as below. All we need to do for this is add ``Preprocessor`` as a parent of the class and implement the ``._fit()`` and ``._transform()`` methods. For the ``Imputer``, no fitting is required, and the missing values are replaced by the previous observed value. Note that more complex imputation strategies could be implemented as well. "
 ],
 "id": "6d23ff059c6f7832"
 },
