Training Decision Tree

Discussion in 'microsoft.public.sqlserver.datamining' started by anonymous_user, Mar 20, 2007.

  1. I have a data set where in the last column is the name of the class where each data row belongs. I want keep an amount of records for validation and the rest for training the decision tree. How can this be done?
     
    anonymous_user, Mar 20, 2007
    #1
    1. Advertisements

  2. anonymous_user

    Dejan Sarka Guest

    I have a data set where in the last column is the name of the class where
    You can use SQL Server Integration Services, Percentage Sampling and Row
    Sampling transformations. If you use SQL Server 2005, you can also use the
    new TABLESAMPLE clause of the SELECT statement. However, this clause works
    on page level, not on row level; you should use it for large tables only.
    Example:

    -- Use 80% of data to train
    SELECT *
    INTO vTM_Train
    FROM vTargetMail
    TABLESAMPLE (80 PERCENT)
    GO

    -- Select other 20% to test
    SELECT vTM_table.*, 1 AS TrainTest
    INTO vTM_Test
    FROM vTargetMail
    EXCEPT
    SELECT *
    FROM vTM_Train
    GO

    You can also create manual sampling query, works perfectly even on small
    sets (kudos to Steve Kass):

    -- Manual Bernoulli sampling
    SELECT RAND(CHECKSUM(NEWID())%1000000000+CustomerKey),*
    FROM vTargetMail
    WHERE RAND(CHECKSUM(NEWID())%1000000000+CustomerKey)< 0.1
    -- 0.1 is the desired probability of choosing a row, change as needed.
    GO
     
    Dejan Sarka, Mar 20, 2007
    #2
    1. Advertisements

  3. Thank you very much :)

    But if I want SQL Server to take a specific amount of data for validation, is there a way for that?
     
    anonymous_user, Mar 23, 2007
    #3
  4. anonymous_user

    Dejan Sarka Guest

    But if I want SQL Server to take a specific amount of data for validation,
    I am not sure if I understand what you mean. In all examples from my
    previous post you can easily specify any amount you need in percentages;
    with SSIS Row Sampling transform, you can also specify number of rows you
    need. If you want to get a fixed number of rows in T-SQL, you can use TOP
    operator, like

    SELECT TOP 1000 *
    FROM table
    ORDER BY NEWID()

    By using NEWID() function for ordering you get approximately random sample,
    usually good enough for analyses; however, SSIS Row Sampling transform gives
    you statisticaly random rows.
     
    Dejan Sarka, Mar 23, 2007
    #4
  5. Thank you :) What I mean is that I have two datasets onw to train the decision tree and one for its evaluation. I want to use that specific dataset to evaluate the decision tree and not one selected randomply form SQL Server. I am using C4.5 algorithm and I have a training dataset ans a test dataset. I want to use these two in SQL Server so I can compare the C4.5 algorithm with the decision tree algorithm of SQL Server(using entropy because C4.5 is an entropy based classifier).
     
    anonymous_user, Mar 28, 2007
    #5
  6. Thank you :) What I mean is that I have two datasets one to train the decision tree and one for its evaluation. I want to use that specific dataset to evaluate the decision tree and not one selected randomply from SQL Server. I am using C4.5 algorithm and I have a training dataset ans a test dataset. I want to use these two in SQL Server so I can compare the C4.5 algorithm with the decision tree algorithm of SQL Server(using entropy because C4.5 is an entropy based classifier).
     
    anonymous_user, Mar 28, 2007
    #6
  7. Thank you :) What I mean is that I have two datasets one to train the decision tree and one for its evaluation. I want to use that specific dataset to evaluate the decision tree and not one selected randomly from SQL Server. I am using C4.5 algorithm and I have a training dataset ans a test dataset. I want to use these two in SQL Server so I can compare the C4.5 algorithm with the decision tree algorithm of SQL Server(using entropy because C4.5 is an entropy based classifier).
     
    anonymous_user, Mar 28, 2007
    #7
  8. anonymous_user

    Dejan Sarka Guest

    Thank you :) What I mean is that I have two datasets onw to train the
    Ok, so you have two datasets. You can use Lift Chart and / or Calssification
    Matrix to measure the performance of the algorithm based on predictions on
    test data set. Check the Lift Chart topic in Books OnLine if this suits you
    (ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/uas9/html/ab77eca1-bd48-4fef-b27f-ff5b648e0501.htm).
    In MS Decision Trees, you can control couple of things through algorithm
    parameters. You can use entropy as the score method of a split if you set
    the
    SCORE_METHOD parameter to 1. Check the Microsoft Decision Trees Algorithm
    topic in Books OnLine
    (ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/uas9/html/95ffe66f-c261-4dc5-ad57-14d2d73205ff.htm).
     
    Dejan Sarka, Mar 28, 2007
    #8
  9. This is what i tried to do. First i created a mining structure using the test data set. I changed the parameters and i check the performance in lift chart.
    Then i created a SSIS package with OLE DB Source the validation data set and i used it as an input to a data mining query. I run the package and browse the decision tree. The problem is that i got the same results as the first time (when i used the test data set), something that it is not possible.
    What am i doing wrong?
     
    anonymous_user, Mar 29, 2007
    #9
  10. anonymous_user

    Dejan Sarka Guest

    This is what i tried to do. First i created a mining structure using the
    I am having a problem with understanding what you are doing. You create a
    model, then you use this model in SSIS. Ok, if you did not reprocess or
    change the model anyhow, why would you expect different results? And first
    of all, what results are you talking about? You say you browse the model in
    SSIS. What does this mean? You are using a CONTENT DMX query to browse it?
    Can you explain in details what are you trying to achieve and what exactly
    are you doing?
     
    Dejan Sarka, Apr 1, 2007
    #10
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.