FitCrawler
FitCrawler is derived from SequentialStreamCrawler. The intent is that you customize it not necessarily by subclassing, but by providing, at construction time, a means of sequentially fitting a distribution.
MicroCrawler
|
SequentialStreamCrawler
|
FitCrawler
SequentialStreamCrawler is useful if the process is well approximated by independent increments, or when each prediction is essentially a prediction of a single increment. In addition, FitCrawler helps if you expect to read parameters from an offline fit, or at least to try to do so.
Prerequisite knowledge: distribution machines
To use FitCrawler effectively one must understand the FitDist class. It builds on the DistMachine, which is intended to abstract the notion of a walking, talking cumulative distribution function that informs itself as it receives data one point at a time. A DistMachine is characterized by:
def update(self, value=None, dt=None, **kwargs):
    # Incorporate new value, time passing, or both
    # Typically will update the state but not params
    raise NotImplementedError

def inv_cdf(self, p: float) -> float:
    # Something like StatsConventions.norminv(p)
    raise NotImplementedError
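As a rough illustration of the two methods above, here is a minimal sketch of a machine that satisfies the same interface. The class name RunningNormalMachine and its state layout are hypothetical and not part of the library; it tracks a running mean and variance and answers quantile queries with a normal distribution.

```python
from statistics import NormalDist


class RunningNormalMachine:
    """Hypothetical toy machine: running mean and variance (Welford's method)."""

    def __init__(self):
        # State changes with every data point; this toy has no fitted params.
        self.state = {"n": 0, "mean": 0.0, "m2": 0.0}

    def update(self, value=None, dt=None, **kwargs):
        # Incorporate a new value; dt is accepted but ignored in this sketch.
        if value is None:
            return
        s = self.state
        s["n"] += 1
        delta = value - s["mean"]
        s["mean"] += delta / s["n"]
        s["m2"] += delta * (value - s["mean"])

    def inv_cdf(self, p: float) -> float:
        s = self.state
        # Fall back to unit variance until two points have been seen.
        var = s["m2"] / (s["n"] - 1) if s["n"] > 1 else 1.0
        sigma = max(var, 1e-12) ** 0.5
        return NormalDist(mu=s["mean"], sigma=sigma).inv_cdf(p)


machine = RunningNormalMachine()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    machine.update(value=x)
median = machine.inv_cdf(0.5)
```

After the five updates the machine's median sits at the sample mean, and higher percentiles map to larger values, which is all a crawler needs in order to turn the machine into samples.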
A LossMachine is a DistMachine that also has the notion of a loss function (which might be log-likelihood). Its contribution is wrapping up the ability to run itself over past data and report aggregate scores that can be used for estimation. The FitDist provides some additional functionality over a LossMachine, because it provides a default way of fitting the distribution (or we might say hyper-fitting it, by default using the HyperOpt package).
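The division of labour between loss and fitting might be sketched as follows. Everything here is an assumption made for illustration: the class EwmaNormalMachine, its params and state dictionaries, and the helper grid_fit are hypothetical names, and a plain grid search stands in for HyperOpt.

```python
import math
import random
from statistics import NormalDist


class EwmaNormalMachine:
    """Hypothetical machine: exponentially weighted mean and variance."""

    def __init__(self, alpha=0.1):
        self.params = {"alpha": alpha}          # chosen by an offline fit
        self.state = {"mean": 0.0, "var": 1.0}  # changes with every update

    def _dist(self):
        sigma = max(self.state["var"], 1e-12) ** 0.5
        return NormalDist(mu=self.state["mean"], sigma=sigma)

    def update(self, value=None, dt=None, **kwargs):
        if value is None:
            return
        a = self.params["alpha"]
        err = value - self.state["mean"]
        self.state["mean"] += a * err
        self.state["var"] = (1 - a) * (self.state["var"] + a * err * err)

    def inv_cdf(self, p):
        return self._dist().inv_cdf(p)

    def loss(self, value):
        # Negative log-likelihood of value under the current state.
        return -math.log(max(self._dist().pdf(value), 1e-300))

    def run(self, values):
        # Score each point before the machine sees it, then update.
        total = 0.0
        for v in values:
            total += self.loss(v)
            self.update(value=v)
        return total / len(values)


def grid_fit(values, alphas):
    """Crude stand-in for a hyper-parameter search: try each alpha in turn."""
    scores = {a: EwmaNormalMachine(alpha=a).run(values) for a in alphas}
    return min(scores, key=scores.get), scores


rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(200)]
alphas = [0.01, 0.05, 0.2, 0.5]
best, scores = grid_fit(values, alphas)
```

The point of the design is that the machine scores itself out of sample, one point at a time, so any optimizer that can minimize a scalar can choose the params.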
An example of a non-trivial FitDist is provided by expnormdist which, as the name suggests, attempts to maintain an exponential normal distribution (actually two of them).
To summarize (and see /univariate):
DistMachine
|
LossMachine
|
FitDist
SequentialStreamCrawler
With that out of the way, let us return to SequentialStreamCrawler and observe that by default it provides remotely sensible update_state and sample_using_state methods. That's why SequentialStreamCrawler can be used without subclassing. Instead you change the default distribution machine that is passed to it at construction time (the default is the t-Digest, which you can read about in the article Live Online Distribution Estimation Using t-Digests).
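To make the independent-increments idea concrete, here is a minimal sketch of what the two methods might do. The function signatures, the state dictionary, and the EmpiricalMachine class are assumptions for illustration only, not the crawler's actual code; a sorted list of increments stands in for the t-Digest, and the history is assumed to be in chronological order.

```python
import bisect


class EmpiricalMachine:
    """Hypothetical stand-in for a t-digest: keeps every increment, sorted."""

    def __init__(self):
        self.sorted_values = []

    def update(self, value=None, dt=None, **kwargs):
        if value is not None:
            bisect.insort(self.sorted_values, value)

    def inv_cdf(self, p):
        xs = self.sorted_values
        if not xs:
            return 0.0
        idx = min(int(p * len(xs)), len(xs) - 1)
        return xs[idx]


def update_state(state, lagged_values):
    """Feed the newest increment to the machine (values are chronological)."""
    if len(lagged_values) >= 2:
        state["machine"].update(value=lagged_values[-1] - lagged_values[-2])
    state["last_value"] = lagged_values[-1]
    return state


def sample_using_state(state, num_predictions):
    """Last value plus evenly spaced quantiles of the increment distribution."""
    percentiles = [(i + 0.5) / num_predictions for i in range(num_predictions)]
    return [state["last_value"] + state["machine"].inv_cdf(p) for p in percentiles]


state = {"machine": EmpiricalMachine(), "last_value": None}
history = [10.0, 10.5, 10.2, 10.9, 11.0]
for t in range(1, len(history) + 1):
    state = update_state(state, history[:t])
samples = sample_using_state(state, num_predictions=5)
```

Swapping the machine changes the shape of the predicted increments without touching either function, which is why the choice of machine is the natural point of customization.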
FitCrawler
Backing up one more step, we can now say that you’d want to use FitCrawler if the following is true:
- You are okay modeling the process as independent increments (or you care to override the default sample_using_state method)
- You like the idea of fitting parameters offline, and reading them from a URL
If all of this sounds complicated, the usage isn't. Indeed you can see from comal_cheetah that FitCrawler has taken away all of the trouble one would normally go to. All Comal Cheetah does is pick a distribution machine, instantiate itself with some parameters, including a URL where stored parameters are expected to be found, and run. The model parameters are to be found here.
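The "try to read stored parameters" step can be sketched with the standard library alone. The helper names load_params and fetch_text, the JSON format, and the example URL are all hypothetical; the library's own loading logic may differ.

```python
import json
from urllib.request import urlopen


def fetch_text(url, timeout=5.0):
    """Download the body of url as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def load_params(url, defaults, fetch=fetch_text):
    """Try to read a JSON dict of params from url; fall back to defaults."""
    try:
        stored = json.loads(fetch(url))
    except Exception:
        # Network failure or malformed content: keep the defaults.
        return dict(defaults)
    if not isinstance(stored, dict):
        return dict(defaults)
    return {**defaults, **stored}


def fake_fetch(url):
    # Stands in for a successful download in this example.
    return '{"alpha": 0.05}'


def broken_fetch(url):
    # Stands in for an unreachable URL.
    raise OSError("unreachable")


defaults = {"alpha": 0.1, "sigma": 1.0}
url = "https://example.com/params.json"  # hypothetical location
fitted = load_params(url, defaults, fetch=fake_fetch)
fallback = load_params(url, defaults, fetch=broken_fetch)
```

Passing the fetch function as an argument keeps the example testable offline; in real use the default fetch_text would perform the download.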
Video help
There is a video about the use of FitCrawler.