Decision Trees

Decision tree learning is a method of inductive inference for approximating discrete-valued functions. It is robust to noise in the data and searches a complete hypothesis space. Applications range from medical diagnosis to credit risk assessment.

Representation:
- A decision tree (DT) can be converted to a set of if-then-else rules.
- In general, a DT represents a disjunction of conjunctions of constraints on the attribute values of instances.
- A path from the root to a leaf represents an AND (conjunction) of attribute-value constraints; the separate paths are combined with OR (disjunction). In other words, values of attributes are connected using ANDs and ORs.
- Example: (Sunny AND Normal) OR (Overcast) OR (Rain AND Weak)

The Algorithm:
1. Find the attribute that best classifies the training instances.
2. Make a node for this attribute.
3. Make a branch for each value of this attribute.
4. Sort the training instances into the branches: branch B_i gets all instances with value V_i for this attribute.
5. Recur/repeat on each branch, using the set of instances sorted to that branch.
(A minimal code sketch of this procedure is given at the end of this section.)

What is the best attribute?
We use a statistical measure called Information Gain, defined in terms of Entropy.

Entropy (all logarithms are base 2):

Given S = [9+, 5-]:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
           = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Suppose S = [100+, 0-]:
Entropy(S) = -(100/100) log2(100/100) - (0/100) log2(0/100) = 0

Suppose S = [0+, 100-]:
Entropy(S) = -(0/100) log2(0/100) - (100/100) log2(100/100) = 0

Suppose S = [50+, 50-]:
Entropy(S) = -(50/100) log2(50/100) - (50/100) log2(50/100)
           = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1

(By convention, 0 log2(0) is taken to be 0.)

In general, for c target attribute values:
Entropy(S) = sum_{i=1..c} -p_i log2(p_i)

Information Gain: the expected reduction in entropy if we partition the examples on attribute A.
Gain(S, A), the gain of an attribute A relative to a collection of training examples S, is defined as:
Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
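To make the entropy and gain formulas concrete, here is a minimal Python sketch (not part of the original notes). The helper names entropy() and information_gain() are illustrative; the final lines reproduce the Entropy(S) = 0.940 calculation for S = [9+, 5-].

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits (log base 2)."""
    n = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / n
        ent -= p * math.log2(p)  # 0*log2(0) terms never arise: Counter only holds counts > 0
    return ent

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v).

    labels[i] is the target class of instance i;
    attribute_values[i] is instance i's value for attribute A.
    """
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [lab for lab, av in zip(labels, attribute_values) if av == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# S = [9+, 5-] from the notes
labels = ["+"] * 9 + ["-"] * 5
print(round(entropy(labels), 3))  # 0.94
```

Building on that, the following sketch of the recursive procedure described under "The Algorithm" assumes, purely for illustration, that training instances are Python dicts mapping attribute names to values, with one key holding the target label; it reuses information_gain() and Counter from the sketch above. This representation and the function name id3 are assumptions, not prescribed by the notes.

```python
def id3(examples, target, attributes):
    """Recursively build a decision tree.

    examples: list of dicts (attribute name -> value, plus the target key)
    target: the key holding the class label
    attributes: list of candidate attribute names still available for splitting
    """
    labels = [ex[target] for ex in examples]
    # Base cases: all instances share one class, or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class
    # Find the attribute that best classifies the instances (highest gain).
    best = max(attributes,
               key=lambda a: information_gain(labels, [ex[a] for ex in examples]))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # Make a branch for each value of the best attribute and recur on the
    # instances sorted into that branch.
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, target, remaining)
    return tree
```

The returned structure is a nested dict: each internal node is keyed by the chosen attribute, with one sub-tree per observed value of that attribute, and each leaf is a class label.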