How useful is the Sequence Clustering algorithm?

Discussion in 'microsoft.public.sqlserver.datamining' started by Bostjan Kozuh, Apr 4, 2007.

  1. Hello!

    I'm having serious problems with the performance of the Sequence Clustering algorithm, and I would like to know whether I am doing something wrong or whether these are limitations of the algorithm.

    Here is the situation:

    I need to examine clickstream data in order to predict users' actions for my research project. According to the book "Data Mining with SQL Server 2005" and all the posts I have read, this is the typical application for the Sequence Clustering algorithm. So I have constructed a case table (Customers) and a nested table (Clickpath) according to the instructions. I have identified 36,296 users who have visited around 192,000 pages in total (cleaned records). The maximum number of visited pages per user is 64, and the number of distinct URLCategories is 122 (pages are already converted to categories).

    I use BI Development Studio for DM modelling. In the "Create DM Structure Wizard" I select the case and nested tables and then set the columns as follows:
    CustomerGuid - KEY (Key, Text)
    URLCategory - INPUT, PREDICTABLE (Key Sequence, Long)
    SequenceID - KEY, INPUT (Discrete, Text)
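
    For comparison, a model of this shape can also be declared directly in DMX. The following is only a sketch, not the wizard's output: the column types are assumed from the singleton query later in this post (numeric SequenceID, text URLCategory), which differs from the type list above, and the column names are simply reused from the wizard settings.

    ```sql
    -- Sketch only: types are assumed from the values used in the singleton
    -- query (1, 2, 3 for SequenceID; 'CategoryA' etc. for URLCategory).
    CREATE MINING MODEL [Test1]
    (
        CustomerGuid TEXT KEY,               -- case key
        ClickPath TABLE PREDICT              -- nested table, input and predictable
        (
            SequenceID  LONG KEY SEQUENCE,   -- position of the click in the path
            URLCategory TEXT DISCRETE PREDICT
        )
    )
    USING Microsoft_Sequence_Clustering (MAXIMUM_STATES = 1000)
    ```

    If the structure really has the sequence key on URLCategory rather than on SequenceID, that alone might be worth checking before looking at the algorithm itself.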

    Then I process the model (MAXIMUM_STATES=1000, other settings at their defaults) and the viewers show pretty reasonable results. The algorithm finds 16 different clusters, and some of them are quite distinct from the others, which is OK. The problems start in the prediction phase. I use the following singleton DMX query (Test1 is the model name) in the prediction tab:

    SELECT Cluster(),
      (SELECT $Sequence, URLCategory, PredictProbability(URLCategory) AS Prob
       FROM PredictSequence(ClickPath, 5)) AS Sequences
    FROM Test1
    NATURAL PREDICTION JOIN
    (SELECT (SELECT 1 AS SequenceID, 'CategoryA' AS URLCategory UNION
             SELECT 2 AS SequenceID, 'CategoryB' AS URLCategory UNION
             SELECT 3 AS SequenceID, 'CategoryC' AS URLCategory) AS ClickPath) AS t

    The results are not what I expect. Very often I get recommendations with probability 1E-14, and sometimes I get "Internal error: An unexpected exception occurred." I tried to use URL sequences that are very distinctive for a particular cluster, but the results do not differ much: the predicted cluster is often correct (i.e., as expected), but the predicted pages often do not occur in that cluster at all.

    I would appreciate your help in determining how useful the Sequence Clustering algorithm is for personalization of websites.

    Bostjan Kozuh, Apr 4, 2007

  2. Bostjan Kozuh

    Dejan Sarka Guest

    I would say that this smells like a bug, especially the "internal error"
    part. Maybe you could test the algorithm a bit more: create a couple of
    models with different numbers of clusters, and check whether the
    membership prediction gets better with some of the new models. In addition,
    as this seems to be a bug or a feature with room for improvement, maybe you
    could post this issue to Microsoft Connect -
    Dejan Sarka, Apr 9, 2007

  3. Thanks for your answer.

    I have indeed tested the algorithm in more detail and have identified the circumstances in which it produces recommendations with probability 1E-14 or the internal error. This occurs whenever I set the MAXIMUM_SEQUENCE_STATES parameter to a value less than the number of distinct pages (non-sequence attributes).
    I was under the assumption that the MAXIMUM_SEQUENCE_STATES parameter tells the algorithm the maximum value of the sequence attribute (and that it should be below 100 for meaningful models), while MAXIMUM_STATES is the number of distinct values for the non-sequence attribute. Am I getting this wrong?

    Anyway, now I set the value of both parameters well above the actual numbers and the algorithm seems to predict pretty well.
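
    As a rough sketch of that workaround in DMX, a second model with both limits raised could be added to the existing structure. The structure name ClickStructure and the model name Test2 are hypothetical, and the value 200 is only an illustrative choice above the 122 distinct URLCategory values mentioned earlier in the thread.

    ```sql
    -- Sketch only: [ClickStructure] and [Test2] are placeholder names,
    -- and the parameter values are illustrative, not recommendations.
    ALTER MINING STRUCTURE [ClickStructure]
    ADD MINING MODEL [Test2]
    (
        CustomerGuid,
        ClickPath PREDICT
        (
            SequenceID,
            URLCategory PREDICT
        )
    )
    USING Microsoft_Sequence_Clustering
    (
        MAXIMUM_STATES = 1000,
        MAXIMUM_SEQUENCE_STATES = 200
    )
    ```

    The new model still has to be processed before it can be queried.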

    I have one follow-up question, though: is there any way to optimize the performance of AS for singleton predictions? It takes a couple of seconds on average to get a recommendation from the model I described above (Intel Pentium 4, 3.2 GHz, 4 GB RAM), and this time rises to 2 minutes or more with models that have around 2,000 distinct values of the non-sequence attribute.

    Thanks very much,
    Bostjan Kozuh, Apr 11, 2007
  4. Bostjan Kozuh

    Dejan Sarka Guest

    AFAIK you are completely right about the parameters. If the
    MAXIMUM_SEQUENCE_STATES parameter is lower than the actual number of states,
    then feature selection is invoked to use only the most representative
    states. It seems like you are hitting a bug in this feature selection. Try
    to report this at MS Connect, as it seems you have narrowed down the problem.
    You can try to warm up the AS cache by reading the model content in
    advance, before you start predictions, with a DMX query like
    Dejan Sarka, Apr 19, 2007
