"Missing" Results Values in Decision Tree with Probability > 0

Discussion in 'microsoft.public.sqlserver.datamining' started by Gordon Linoff, Dec 22, 2005.

  1. I am running the decision tree algorithm on a binary response variable that only takes on the values 0 and 1. There are no null values.

    In some leaves, I am getting results like:

    Value     Cases   Probability
    False       949        96.24%
    Missing       0         0.16%
    True         34         3.60%

    How is the probability greater than 0 if the number of cases = 0? If it makes a difference, this is using binary splits and split method 4.

    Gordon Linoff, Dec 22, 2005

  2. It is because we add prior support to each tree node.

    ZhaoHui Tang, Dec 23, 2005

  3. Hi, Gordon

    The probabilities are smoothed. As a result, a state with an existing support of 0 (and, implicitly, an existing probability of 0) is considered highly improbable rather than impossible. This does not change the outcome of the predictor, and it allows better modeling of examples such as "How likely is it that a US president is female?" While highly improbable based on historical facts, it is not impossible.

    With this smoothing, all the possible states of an attribute end up with nonzero probabilities. Besides, we assume that an attribute always has a Missing state, which explains the nonzero probability you got for Missing.

    Bogdan Crivat, Dec 23, 2005
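
The smoothing described above can be sketched as simple additive smoothing: give every state, including Missing, the same small pseudo-count ("prior support") before normalizing. This is only an illustration under stated assumptions. The thread does not say what prior Analysis Services actually uses, the function name `smoothed_probabilities` is hypothetical, and the value 1.58 is back-fitted to the percentages in the question rather than taken from any documentation.

```python
def smoothed_probabilities(counts, prior_support=1.0):
    """Add the same pseudo-count to every state, then normalize.

    `prior_support` is an assumed uniform pseudo-count per state; the real
    prior used by the Microsoft Decision Trees algorithm may differ.
    """
    total = sum(counts.values()) + prior_support * len(counts)
    return {state: (n + prior_support) / total for state, n in counts.items()}


# Leaf from the question: 949 False cases, 0 Missing cases, 34 True cases.
leaf = {"False": 949, "Missing": 0, "True": 34}

# 1.58 is back-fitted to the question's figures, not a documented value.
probs = smoothed_probabilities(leaf, prior_support=1.58)

for state, p in probs.items():
    print(f"{state:8s}{leaf[state]:5d}{p:9.2%}")
```

With this assumed pseudo-count, the sketch gives 96.24% for False, 0.16% for Missing, and 3.60% for True, which matches the leaf in the question. A state with 0 cases gets a small but nonzero share simply because its pseudo-count is still in the numerator.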
  4. This makes sense and I even remember reading about it.

    My next question is how to prevent this from happening for a binary response variable.

    I think I want to set the "ALLOWNULL" property on the datasource to "FALSE" instead of "TRUE", but I'm not allowed to do this.

    Alternatively, I might want to use the SQL cast function, but "cast(clicker as int not NULL)" generates a SQL error.

    Do you have any ideas?

    anonymous_user, Dec 27, 2005
  5. You may also try the MODEL_EXISTENCE_ONLY flag on the mining model predictable column. This will model your variable with only two states, Missing and Existing, effectively combining all the non-missing states into a single one. MODEL_EXISTENCE_ONLY is supported, if I remember correctly, by all the algorithms, and definitely by Decision Trees.

    To try MODEL_EXISTENCE_ONLY, select your predictable column in the model designer, go to Properties, and check the Model Existence Only flag.

    Bogdan Crivat, Dec 27, 2005
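
As a conceptual illustration of the two-state view described above (not how Analysis Services implements the flag), the mapping can be sketched as follows. The helper `existence_state` and the sample `clicker` values are hypothetical.

```python
def existence_state(value):
    """Collapse a raw value into one of two states: Missing or Existing."""
    return "Missing" if value is None else "Existing"


# Hypothetical sample of the binary response column; None stands for a null.
clicker = [0, 1, None, 1, 0]

states = [existence_state(v) for v in clicker]
print(states)
```

Note that both 0 and 1 land in the Existing state here, so under this reading the distinction between the two response values is no longer modeled.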
