Copyright © Andrew W. Moore
Slide 1: Probability Densities in Data Mining
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
Slide 2: Contents
Why they are important. Notation and fundamentals of continuous PDFs. Continuous multivariate PDFs. Combining discrete and continuous random variables.

Slide 3: Why are they important?
Real numbers occur in at least 50% of database records. We can't always quantize them, so we need to understand how to describe where they come from. A great way of saying what's a reasonable range of values. A great way of saying how multiple attributes should reasonably co-occur.

Slide 4: Why are they important?
They can immediately get us Bayes classifiers that are sensible with real-valued data. You'll need to intimately understand PDFs in order to do kernel methods, clustering with mixture models, analysis of variance, time series and many other things. They will introduce us to linear and non-linear regression.
Slide 5: A PDF of American Ages in 2000

Slide 6: Population of PR by age group (table with columns: group, freq, midpoint, freq.rela)

Slide 7: (figure)
Slide 8: A PDF of American Ages in 2000
Let X be a continuous random variable. If p(x) is a Probability Density Function for X then P(a ≤ X ≤ b) = ∫_a^b p(x) dx. In the figure, the shaded area between two ages is 0.36.

Slide 9: Properties of PDFs
That means P(a ≤ X ≤ b) = ∫_a^b p(x) dx and, taking a → −∞ and b → ∞, ∫_{−∞}^{∞} p(x) dx = 1.
Slide 10: Where x − h/2 < w < x + h/2. By the mean value theorem, P(x − h/2 < X < x + h/2) = ∫_{x−h/2}^{x+h/2} p(u) du = h·p(w) for some such w. Thus p(w) tends to p(x) as h tends to zero.

Slide 11: Note that p(w) tends to p(x) as h tends to 0. Cumulative distribution function: F(x) = P(X ≤ x). This is a non-decreasing function, and it has been shown that the derivative of the distribution function gives the density function.

Slide 12: Properties of PDFs
Therefore p(x) = dF(x)/dx ≥ 0: the derivative of a non-decreasing function is greater than or equal to zero.

Slide 13: What is the meaning of p(x)?
If p(5.31) = 0.06 and p(5.92) = 0.03, then when a value of X is sampled from the distribution, X is twice as likely to be close to 5.31 as to 5.92.
Slide 14: Yet another way to view a PDF
A recipe for sampling a random age:
1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age, d).
2. If d < p(age), stop and return age.
3. Else try again: go to Step 1.
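The recipe above is rejection sampling. A minimal sketch in Python, using the triangular density p(x) = 2(1 − x) on [0, 1] as a stand-in for the age PDF (which is not given here):

```python
import random

random.seed(0)

def p(x):
    # Stand-in density: p(x) = 2*(1 - x) on [0, 1]; its peak value is 2.
    return 2.0 * (1.0 - x)

def rejection_sample(pdf, lo, hi, pdf_max):
    """Draw one sample from pdf by the dot-in-rectangle recipe."""
    while True:
        x = random.uniform(lo, hi)        # Step 1: random dot (x, d)
        d = random.uniform(0.0, pdf_max)  # ...uniform over the rectangle
        if d < pdf(x):                    # Step 2: under the curve -> accept
            return x                      # Step 3 (reject) is the loop itself

samples = [rejection_sample(p, 0.0, 1.0, 2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be near E[X] = 1/3 for this density
```

Accepted dots are uniform under the curve, so their x-coordinates follow the density; a tighter rectangle (smaller pdf_max) means fewer rejections.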
Slide 15: Test your understanding
True or False:

Slide 16: Expectations
E[X] = the expected value of random variable X = the average value we'd see if we took a very large number of random samples of X
Slide 17: Expectations
E[X] = ∫ x p(x) dx = the expected value of random variable X = the average value we'd see if we took a very large number of random samples of X = the first moment of the shape formed by the axes and the blue curve = the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error. E[age] = 35.897

Slide 18: Expectation of a function
E[f(X)] = the expected value of f(X) where x is drawn from X's distribution = the average value we'd see if we took a very large number of random samples of f(X). Note that in general E[f(X)] ≠ f(E[X]).

Slide 19: Variance
σ² = Var[X] = E[(X − E[X])²] = the expected squared difference between x and E[X] = the amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, assuming you play optimally.
Slide 20: Standard Deviation
σ² = Var[X] = the expected squared difference between x and E[X] = the amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, assuming you play optimally. σ = √Var[X] = standard deviation = "typical" deviation of X from its mean.
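E[X], Var[X] and σ can be sketched numerically for any density on a grid. A sketch using the stand-in triangular density p(x) = 2(1 − x) on [0, 1] (the actual age PDF is not given):

```python
# Numerical moments of a density by midpoint Riemann sum on a fine grid.
N = 100_000
dx = 1.0 / N
xs = [(i + 0.5) * dx for i in range(N)]

def p(x):
    return 2.0 * (1.0 - x)  # stand-in density on [0, 1]

mean = sum(x * p(x) * dx for x in xs)               # E[X] = integral of x p(x) dx
var = sum((x - mean) ** 2 * p(x) * dx for x in xs)  # Var[X] = E[(X - E[X])^2]
std = var ** 0.5                                    # sigma = sqrt(Var[X])
print(mean, var, std)  # for this density: E[X] = 1/3, Var[X] = 1/18
```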
Slide 21: Statistics for PR
E(age) = 35.17, Var(age) = , SD(age) = 22.38

Slide 22: The best-known density functions

Slide 23: The best-known density functions
The uniform (rectangular) density. The triangular density. The exponential density. The Gamma and chi-square densities. The Beta density. The Normal (Gaussian) density. The t and F densities.
Slide 24: The rectangular (uniform) distribution
p(x) = 1/w for −w/2 < x < w/2, and 0 otherwise.

Slide 25: The triangular distribution

Slide 26: The Exponential distribution

Slide 27: The standard Normal distribution

Slide 28: The general Normal distribution
μ = 100, σ = 15
Slide 29: General Gaussian
μ = 100, σ = 15. Shorthand: We say X ~ N(μ, σ²) to mean "X is distributed as a Gaussian with parameters μ and σ²". In the above figure, X ~ N(100, 15²). Also known as the normal distribution or bell-shaped curve.
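A sketch of the general Gaussian density p(x) = exp(−(x − μ)²/(2σ²)) / (σ√(2π)), with the slide's parameters μ = 100, σ = 15:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# The peak is at x = mu, with height 1 / (sigma * sqrt(2 pi)).
print(gaussian_pdf(100, 100, 15))  # ~0.0266
```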
Slide 30: The Error Function
Assume X ~ N(0,1). Define ERF(x) = P(X < x) = the cumulative distribution of X.
Slide 31: Using the Error Function
Assume X ~ N(μ, σ²). Then P(X < x | μ, σ²) = ERF((x − μ)/σ).
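A sketch in Python. Note that the slides' ERF is the N(0,1) cumulative distribution Φ, which relates to the standard mathematical erf by Φ(z) = (1 + erf(z/√2))/2:

```python
import math

def ERF(z):
    # The slides' ERF is the N(0,1) CDF, Phi(z); in terms of the
    # mathematical erf: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    # P(X < x | mu, sigma^2) = ERF((x - mu) / sigma)
    return ERF((x - mu) / sigma)

# With the slides' running example X ~ N(100, 15^2):
print(normal_cdf(100, 100, 15))  # 0.5 (half the mass lies below the mean)
print(normal_cdf(115, 100, 15))  # ~0.841 (one sigma above the mean)
```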
Slide 32: The Central Limit Theorem
If (X_1, X_2, …, X_n) are i.i.d. continuous random variables, define z = (1/n) Σ_i X_i. As n → ∞, p(z) → Gaussian with mean E[X_i] and variance Var[X_i]/n. Somewhat of a justification for assuming Gaussian noise is common.
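A quick Monte Carlo sketch of the CLT, with averages of n uniform draws as a hypothetical choice of X_i: the averages have mean ≈ E[X_i] = 0.5 and variance ≈ Var[X_i]/n = (1/12)/n:

```python
import random

random.seed(1)
n, trials = 30, 20_000
# Each z is the mean of n i.i.d. Uniform(0, 1) draws.
zs = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

mean_z = sum(zs) / trials
var_z = sum((z - mean_z) ** 2 for z in zs) / trials
print(mean_z)  # ~0.5, which is E[X_i]
print(var_z)   # ~(1/12)/30 ~= 0.00278, which is Var[X_i]/n
```

Plotting a histogram of zs would show the bell shape, even though each X_i is flat.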
Slide 33: Density function estimators
Histograms. K-nearest neighbors. Kernel density estimators. (h = bin width; d_k is the distance to the k-th nearest neighbor.)

Slide 34:
> x = c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
> hist(x, freq=F, main="Estimacion de funcion de densidad-histograma")
> rug(x, col=2)

Slide 35: Density estimation by knn at 20 points with k = 1, 3, 5, 7

Slide 36: Kernel estimation of a univariate density function
In the univariate case, the kernel estimator of the density f(x) is obtained as follows. Suppose x_1, …, x_n is a sample from a random variable X with density f(x), and define the empirical distribution function by F_n(x) = #{x_i ≤ x}/n, which is an estimator of the cumulative distribution function F(x) of X. Since the density f(x) is the derivative of the distribution function F, using a difference approximation to the derivative we obtain
Slide 37: f̂(x) = [F_n(x + h) − F_n(x − h)]/(2h), where h is a positive value close to zero. This is the proportion of points in the interval (x − h, x + h) divided by 2h. The equation above can be written as f̂(x) = (1/(nh)) Σ_i K((x − x_i)/h), where the weight function K is defined by K(z) = 1/2 if |z| ≤ 1, and 0 if |z| > 1.

Slide 38: Sample: 6, 8, 9, 12, 20, 25, 18, 31

Slide 39: This is called the uniform kernel, and h is called the bandwidth, a smoothing parameter that indicates how much each sample point contributes to the estimate at the point x. In general, K and h must satisfy certain regularity conditions, such as: K(z) must be bounded and absolutely integrable on (−∞, ∞). Usually, but not always, K(z) ≥ 0 and symmetric, so any symmetric density function can be used as a kernel.
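A sketch of the uniform-kernel estimator f̂(x) = (1/(nh)) Σ_i K((x − x_i)/h) on the slide's sample; the bandwidth h = 5 is an arbitrary illustrative choice:

```python
def K_uniform(z):
    # Uniform (rectangular) kernel: 1/2 on |z| <= 1, else 0.
    return 0.5 if abs(z) <= 1.0 else 0.0

def kde(x, data, h, K=K_uniform):
    # f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)
    n = len(data)
    return sum(K((x - xi) / h) for xi in data) / (n * h)

data = [6, 8, 9, 12, 20, 25, 18, 31]
# 4 of the 8 points fall in (5, 15), so this prints 0.5 / (2*5) = 0.05:
print(kde(10, data, h=5))
```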
Slide 40: Choosing the bandwidth h
A common normal-reference choice is h = 1.06 s n^(−1/5), where n is the number of data points and s is the sample standard deviation.
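A sketch of one common bandwidth rule, the normal-reference rule h = 1.06 s n^(−1/5) (an assumption: the slide's exact formula is not shown, only that it uses n and s):

```python
def bandwidth_normal_reference(data):
    # h = 1.06 * s * n^(-1/5), with s the sample standard deviation.
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return 1.06 * s * n ** (-0.2)

data = [6, 8, 9, 12, 20, 25, 18, 31]
print(bandwidth_normal_reference(data))
```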
Slide 41: THE GAUSSIAN KERNEL
In this case the kernel is a smoother weight function in which all sample points contribute to the estimate of f(x) at x. That is, K(z) = (1/√(2π)) e^(−z²/2).

Slide 42: THE TRIANGULAR KERNEL
K(z) = 1 − |z| for |z| < 1, and 0 otherwise.
THE "BIWEIGHT" KERNEL
K(z) = (15/16)(1 − z²)² for |z| < 1, and 0 otherwise.
Slide 43: THE EPANECHNIKOV KERNEL
K(z) = (3/4)(1 − z²) for |z| < 1, and 0 otherwise.
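These kernels can all be plugged into f̂(x) = (1/(nh)) Σ_i K((x − x_i)/h). A sketch using the common unit-support forms (the (3/4)(1 − z²) normalization of the Epanechnikov kernel is an assumption; some texts use support |z| < √5):

```python
import math

def K_triangular(z):
    return max(0.0, 1.0 - abs(z))  # 1 - |z| on |z| < 1, else 0

def K_biweight(z):
    return (15.0 / 16.0) * (1.0 - z * z) ** 2 if abs(z) < 1.0 else 0.0

def K_epanechnikov(z):
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0

def K_gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, K):
    # f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

data = [6, 8, 9, 12, 20, 25, 18, 31]
for K in (K_triangular, K_biweight, K_epanechnikov, K_gaussian):
    print(K.__name__, kde(10, data, h=5, K=K))
```

Each K integrates to 1, so every resulting f̂ is itself a valid density.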
Slide 44: Density estimation at 20 points using a Gaussian kernel with h = .5, "opt1", "opt2", 4

Slide 45: Bivariate random variables
p(x, y) = probability density of random variables (X, Y) at location (x, y)

Slide 46: Bivariate density function estimators
Histograms. K-nearest neighbors. Kernel density estimators. (A = area of the bin; A_k is the area containing up to the k-th nearest neighbor.)

Slide 47: Bivariate kernel estimation
Let x_i = (x, y) be the observed values and t = (t_1, t_2) a point of the plane where the joint density is to be estimated. If h_1 = h_2 = h, …
Slide 48: (a_1, a_2) H⁻¹, where …
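A sketch of a bivariate kernel estimate with a product Gaussian kernel and a common bandwidth h_1 = h_2 = h (a simplification of the general H-matrix form; the data points here are hypothetical):

```python
import math

def gauss(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde2(t1, t2, data, h):
    # f_hat(t) = (1 / (n h^2)) * sum_i K((t1 - x_i)/h) * K((t2 - y_i)/h)
    n = len(data)
    return sum(gauss((t1 - x) / h) * gauss((t2 - y) / h)
               for (x, y) in data) / (n * h * h)

# Hypothetical (x, y) observations:
data = [(1.0, 2.0), (1.5, 2.2), (2.0, 3.1), (3.0, 4.0), (2.5, 3.5)]
print(kde2(2.0, 3.0, data, h=0.8))
```

This is the same computation R's kde2d performs on a grid before persp or contour plots it.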
Slide 49: Density estimation with a bivariate Gaussian kernel

Slide 50:
> f1 = kde2d(autompg1$V1, autompg1$V5, n=100)
> persp(f1$x, f1$y, f1$z)

Slide 51:
> contour(f1, levels=c(8e-6, 2e-5, 2.8e-5), col=c(2,3,4), xlab="mpg", ylab="weight")
Slide 52: In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space. P((X, Y) ∈ R) = ∬_R p(x, y) dy dx.

Slide 53: In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space. P(20 < mpg < 30 and 2500 < weight < 3000) = the volume under the 2-d surface within the red rectangle.

Slide 54: In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space. P([(mpg − 25)/10]² + [(weight − 3300)/1500]² < 1) = the volume under the 2-d surface within the red oval.

Slide 55: In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space. Take the special case of region R = "everywhere". Remember that with probability 1, (X, Y) will be drawn from "somewhere". So ∬ p(x, y) dy dx = 1.

Slide 56: In 2 dimensions
Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space…

Slide 57: In m dimensions
Let (X_1, X_2, …, X_m) be an m-tuple of continuous random variables, and let R be some region of R^m…
Slide 58: Independence
If X and Y are independent then knowing the value of X does not help predict the value of Y. (mpg, weight: NOT independent.)

Slide 59: Independence
If X and Y are independent then knowing the value of X does not help predict the value of Y. (The contours say that acceleration and weight are independent.)

Slide 60: Multivariate Expectation
E[mpg, weight] = (24.5, 2600): the centroid of the cloud.
Slide 61: Multivariate Expectation
> f1 = kde2d(autompg1$mpg, autompg1$weight, n=100)
> dx = f1$x[2] - f1$x[1]
> dy = f1$y[2] - f1$y[1]
> meanmpg = sum(f1$x*f1$z)*dx*dy
> meanweight = sum(f1$y*f1$z)*dx*dy
> # estimated mean
> mean(autompg1$weight)
> mean(autompg1$mpg)
Slide 62: Multivariate Expectation
Slide 63: Test your understanding
All the time? Always.
Only when X and Y are independent?
It can fail even if X and Y are independent?
Slide 64: Bivariate Expectation
Slide 65: Bivariate Covariance
σ_xy = Cov[X, Y] = E[(X − μ_x)(Y − μ_y)]
Slide 66: Bivariate Covariance
Slide 67: Estimated covariance and standard deviations between mpg and weight
> cov(autompg1[,c(1,5)])
        mpg   weight
mpg
weight
> sd(autompg1$mpg)
> sd(autompg1$weight)
Slide 68: Covariance Intuition
E[mpg, weight] = (24.5, 2600)

Slide 69: Covariance Intuition
E[mpg, weight] = (24.5, 2600). Principal eigenvector of the covariance matrix.

Slide 70: Regression Line
Notice that the regression line passes through (x̄, ȳ).
Slide 71: Regression Line
> l1 = lm(weight ~ mpg, data=autompg1)
> l1
Call:
lm(formula = weight ~ mpg, data = autompg1)
Coefficients:
(Intercept)        mpg
> # slope of regression line
> slope =

Slide 72: First principal component
> a = cov(autompg1[,c(1,5)])
> eigen(a)
$values
$vectors
        [,1]   [,2]
[1,]
[2,]
> # slope of the first principal component
> .99997 /
Slide 73: Covariance Fun Facts
True or False: If σ_xy = 0 then X and Y are independent. False.
True or False: If X and Y are independent then σ_xy = 0. True.
True or False: If σ_xy = σ_x σ_y then X and Y are deterministically related. True.
True or False: If X and Y are deterministically related then σ_xy = σ_x σ_y. False.
How could you prove or disprove these?
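One way to disprove the first fact is a Monte Carlo counterexample sketch: with X uniform on (−1, 1) and Y = X², the covariance is zero yet Y is a deterministic function of X (which also disproves the last fact):

```python
import random

random.seed(2)
n = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y is completely determined by X

mx = sum(xs) / n
my = sum(ys) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
print(cov_xy)  # ~0: uncorrelated, but clearly not independent
```

Analytically: Cov[X, X²] = E[X³] − E[X]·E[X²] = 0 − 0 = 0 because X is symmetric about 0.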
Slide 74: Test your understanding
All the time?
Only when X and Y are independent? True.
It can fail even if X and Y are independent?

Slide 75: Marginal Distributions
p(x) = ∫ p(x, y) dy

Slide 76: Conditional Distributions

Slide 77: Conditional Distributions
p(y|x) = p(x, y) / p(x). Why?
Slide 78: Independence Revisited
It's easy to prove that these statements are equivalent: p(x, y) = p(x) p(y) ⇔ p(x|y) = p(x) ⇔ p(y|x) = p(y).

Slide 79: More useful stuff
Bayes Rule: p(x|y) = p(y|x) p(x) / p(y). (These can all be proved from the definitions on previous slides.)
Slide 80: Mixing discrete and continuous variables
Bayes Rule, in both directions between a discrete and a continuous variable.
Slide 81: P(education, salary > 50k)

Slide 82: Estimation of the posterior P(Class | Education)
Slide 83: Mixing discrete and continuous variables
P(EduYears, Wealthy)

Slide 84: Mixing discrete and continuous variables
P(EduYears, Wealthy). P(Wealthy | EduYears)

Slide 85: Mixing discrete and continuous variables
Renormalized axes. P(EduYears, Wealthy). P(Wealthy | EduYears). P(EduYears | Wealthy)
Slide 86: Exercises
Suppose X and Y are independent real-valued random variables distributed between 0 and 1:
What is p[min(X, Y)]?
What is E[min(X, Y)]?
Prove that E[X] is the value u that minimizes E[(X − u)²].
What is the value u that minimizes E[|X − u|]?
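A Monte Carlo sketch for checking an answer to the second exercise numerically, assuming X, Y ~ Uniform(0, 1) and independent (the simplest reading of "distributed between 0 and 1"):

```python
import random

random.seed(3)
n = 200_000
# Assume X, Y ~ Uniform(0, 1), independent.
mins = [min(random.random(), random.random()) for _ in range(n)]

est = sum(mins) / n
print(est)  # for the uniform case, E[min(X, Y)] = 1/3
```

This matches the closed form: P(min > m) = (1 − m)², so p(m) = 2(1 − m) and E[min] = 1/3.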