[R] fitting Markov chains

Thu Oct 2 10:54:10 CEST 2003

>>>>> "Tamas" == Tamas Papp <tpapp at axelero.hu>
>>>>>     on Wed, 1 Oct 2003 17:14:51 +0200 writes:

    Tamas> I need to find a computationally simple process for the movement of
    Tamas> interest rates. In this simplified model, an interest rate can have
    Tamas> 3--5 possible values, and its movement is characterized by a matrix of
    Tamas> transition probabilities (ie, it is a Markov process).

    Tamas> I would like to estimate this process from a given set of data. 

    Tamas> For example, let the interest rate time series be:

    Tamas> 7 3 8 2 5 9 6

    Tamas> Assume that the discretized variable can take the following values:
    Tamas> (3, 5, 8), then we find the nearest discrete point and give its index:

    Tamas> 3 1 3 1 2 3 2

    Tamas> Then estimate the transition probabilities.

    Tamas> I have the following questions:

    Tamas> - how should I select the discrete set of values that the variable can
    Tamas> assume? Eg simply get the maximum and minimum, and divide this
    Tamas> interval into, say, three pieces? Or estimate the mean, and make the
    Tamas> other two values mean plus-minus one standard deviation?

(see below)

    Tamas> - once the variable is discretized, how do I
    Tamas>   transform each data point to its discretized value
    Tamas>   (its index)?

that's the trivial part of answering your questions:  Use  cut()

    Tamas> - the most important: how should I estimate the transition
    Tamas> probabilities?

We (mainly Peter Buhlmann and his coworkers) have had good experiences
with discretizing such financial time series and then fitting so
called "Variable Length Markov Chains" (VLMCs).

To your 1st question: When doing so, one easy (and robust!)
approach was to use the sample quantiles (of the marginal
distribution), e.g. use the three quartiles,
 quantile(x, prob = (1:3)/4)  as cut points {in cut(), above}.
Much less easy is to determine how *many* values you want to
chose for cutting.  I think they have rarely used many pieces,
sometimes even had good results with binary discretization ---
iff used with VLMCs.

So this comes back to the question of model estimation.
A VLMC does *not* assume first-order Markov {as you do above},
but rather allows to look further back into the past, something
quite realistic in some cases. Now the "VL" part, "Variable
Length" is the novel idea of a past the length of which
*depends* on your current state (context).

There's the CRAN package "VLMC" 
  (http://cran.r-project.org/src/contrib/PACKAGES.html#VLMC)
and we have an applied paper 
    Mächler, M. and Bühlmann, P. (2003). 
    Variable Length Markov Chains: Methodology, Computing and Software.
    To appear in Journal of Computational and Graphical Statistics.

a pre-version of which you can get from
  http://stat.ethz.ch/Research-Reports/104.html

The paper contains references to the fundamental (mathematical) 
research on this.

    Tamas> References to introductory literature on estimating Markov chains like
    Tamas> this would be welcome. Most importantly, I need to know how robust an
    Tamas> estimation is to selecting the discrete points, or is there a simple
    Tamas> "goodness of fit" estimation.

Using  quantiles  instead of  mean/sd  for chosing cut points is
certainly robust.
one goodness of fit (GOF) measure we (also use with VLMCs) is 
`the'  AIC (-2 log likelihood). However you can only use it to compare
nested models; which doesn't help you here (I think) for chosing
the number of cutpoints. 
More applied is to use predictions {from a fitted model} and
compare them with the actual values.  Because of discretization
you get a confusion matrix on which you can consider quite a few
different GOF measures.  For realistic GOF measurement you also
need to do something like crossvalidation which in its simplest
case would be to work with "training" and "testing" data.

The whole topic is not at all trivial.
We had one Ph.D. thesis on this (plus extensions..).
Hoping this helps.