Transcription

Introduction to R for Multivariate Data AnalysisFernando MiguezJuly 9, 2007email: [email protected]: N-211 Turner Halloffice hours: Wednesday 12pm or by appointment1IntroductionThis material is intended as an introduction to the study of multivariate statisticsand no previous knowledge of the subject or software is assumed. My main objectiveis that you become familiar with the software and by the end of the class you shouldhave an intuitive understanding of the subject. I hope that you also enjoy the learningprocess as much as I do. I believe that for this purpose R and GGobi will be excellentresources. R is free, open source, software for data analysis, graphics and statistics.Although GGobi can be used independently of R, I encourage you to use GGobi as anextension of R. GGobi is a free software for data visualization (www.ggobi.org). Thissoftware is under constant development and it still has occasional problems. However,it has excellent capabilities for learning from big datasets, especially multivariate data.I believe that the best way to learn how to use it is by using it because it is quite userfriendly. A good way of learning GGobi is by reading the book offered on-line andthe GGobi manual, which is also available from the webpage. It is worth saying thatGGobi can be used from R directly by a very simple call once the rggobi packageis installed. In the class we will also show examples in SAS which is the leadingcommercial software for statistics and data management.This manual is a work in progress. I encourage you to read it and provide mewith feedback. Let me know which sections are unclear, where should I expand thecontent, which sections should I remove, etc. If you can contribute in any way it willbe greatly appreciated and we will all benefit in the process. Also, if you want todirectly contribute, you can send me texts and figures and I will be happy to includeit in this manual.Let us get to work now and I hope you have fun!

1.1Installing Rhttp://www.r-project.org/Select a mirror and go to "Download and Install R"These are the steps you need to follow to install R and GGobi.1) Install R.2) type in the command r")3) In this step make sure you install the GTk environment and youbasically just choose all the defaults.4) You install the rggobi package from within R by going toPackages . Install packages . and so on.1.2First session and BasicsWhen R is started a prompt indicates that it is ready to receive input. In a questionand answer manner we can interrogate in different ways and receive an answer. Forexample, one of the simplest tasks in R is to enter an arithmetic expression and receivea result. 5 7[1] 12 # This is the result from RProbably the most important operation in R is the assignment, -. This is usedto create objects and to store information, data or functions in these objects. Anexample:res.1 - 5 7Now we have created a new object, res.1, which has stored the result of thearithmetic operation 5 7. The name res.1 stands for “result 1”, but virtually anyother name could have been used. A note of caution is that there are some conventionsfor naming objects in R. Names for objects can be built from letters, digits and theperiod, as long as the first symbol is not a period or a number. R is case sensitive.Additionally, some names should be avoided because they play a special role. Someexamples are: c, q, t, C, D, F, I, T, diff, df, pt.1.31.3.1Reading and creating data framesCreating a data frameThere are many ways of importing and/or creating data in R. We will look at justtwo of them for simplicity. The following code will prompt a spreadsheet where youcan manually input the data and, if you want, rename columns,2

dat - data.frame(a 0,b 0,c 0)fix(dat)# These numbers should be entered 2 5 4 ; 4 9 5 ; 8 7 9The previous code created an object called dat which contains three columnscalled a, b, and c. This object can now be manipulated as a matrix if all the entriesare numeric (see Matrix operations below).1.3.2Importing dataOne way of indicating the directory where the data file is located is by going to "File"and then to "Change dir.". Here you can browse to the location where you areimporting the data file from. In order to create a data frame we can use the followingcode for the data set COOKIE.txt from the textbook.cook - read.table("COOKIE.txt",header T,sep "",na.strings ".")cook[1:3,]pairs(cook[,5:9])The second line of code selects the first three rows to be printed (the columns werenot specified, so all of them are printed). The header option indicates that the firstrow contains the column names. The sep option indicates that entries are separatedby a space. Comma separated files can be read in this way by indicating a commain the sep option or by using the function read.csv. You can learn more about thisby using help (e.g. ?read.table, ?read.csv). The third line of code produces ascatterplot matrix which contains columns 5 through 9 only. This is a good way ofstarting the exploration of the data.1.4Matrix operationsWe will create a matrix A (R is case sensitive!) using the as.matrix function fromthe data frame dat. The numeral, #, is used to add comments to the code.A - as.matrix(dat)Ap - t(A)ApA - Ap %*% AAinv - solve(A)library(MASS)#####the function t() returns the transpose of Athe function %*% applies matrix multiplicationthe function solve can be used to obtain the inversethe function ginv from the package MASS calculatesa Moore-Penrose generalized inverse of AAginv - ginv(A)Identity - A %*% Ainv # checking the properties of the inverseround(Identity)# use the function round to avoid rounding errorsum(diag(A))# sums the diagonal elements of A; same as tracedet(A)# determinant of ANow we will create two matrices. The first one, A, has two rows and two columns.The second one, B, has two rows and only one column.3

A - matrix(c(9,4,7,2),2,2)B - matrix(c(5,3),2,1)C - A 5D - B*2DDD - A %*% B# matrix multiplicationDDDD - A * A# elementwise multiplicationE - cbind(B,c(7,8))# adds a columnF - rbind(A,c(7,8))# adds a rowdim(A)# dimensions of a matrixO - nrow(A)# number of rowsP - ncol(A)# number of columnsrowSums(A)# sums for each columnMean - colSums(A)/O# mean for each columnFor additional introductory material and further details about the R language see[7].1.5Some important definitionsSome concepts are important for understanding statistics and multivariate statisticsin particular. These will be defined informally here: Random Variable: the technical definition will not be given here but it is important to notice that unlike the common practice with other mathematicalvariables, a random variable cannot be assigned a value; a random variabledoes not describe the actual outcome of a particular experiment, but ratherdescribes the possible, as-yet-undetermined outcomes in terms of real numbers.As a convention we normally use X for a random variable and x as a realizationof that random variable. So x is an actual number even if it is unknown. Expectation: the expectation of a continuous random variable is defined as,ZE(X) µ xf (x)dxYou can think of the expectation as a mean. The integration is done over thedomain of the function, which in the case of the normal goes from to . Mean vector : A mean vector is the elementwise application of the Expectationto a column vector of random variables. E(X) E(X1 )E(X2 ).E(Xp )4 µ1µ2.µp µ

covariance matrix :Σ E(X µ )(X µ)0or Σ Cov(X) 1.6σ11 σ12 · · · σ1pσ21 σ22 · · · σ2p. . .σp1 σp2 · · · σpp Eigenvalues and eigenvectorsIn multivariate data analysis many methods use different types of decompositions withthe aim of describing, or explaining the data matrix (or, more typically the variancecovariance or correlation matrix). Eigenvalues and eigenvectors play an importantrole in the decomposition of a matrix. The definition of these terms and the theorycan be found in the notes or the textbook. Here we will see the functions in R usedto carry out these computations. Try the following,A - matrix(c(1,1,0,3),2,2)eigen(A)dm - scan()# Insert 13 -4 2 -4 13 -2 2 -2 10B - matrix(dm,3,3)eigen(B)C - edit(data.frame())# enter these numbers1615201061510610875106189815Cm - as.matrix(C)# coerce the data frame into a matrixmean(Cm)# mean vectorvar(Cm)# variance-covariance matrixcor(Cm)# correlation matrix1.7DistanceThe concept of distance can be defined as a spatial difference between two points.So, in a general sense distance is a difference, and it is a more abstract conceptthat mere spatial difference. For example, we could be interested in the difference inweight of two items. Some methods in this class will involve distance between points in5

multidimensional space. Most datasets involve variables that suggest spatial distance,others temporal distance and some might not imply a distance that can be measuredquantitatively. These later variables are usually named categorical (or classification)variables.More formally, if we consider the point P (x1 , x2 ) in the plane, the Euclideandistance, d(O, P ), from P to the origin O (0, 0) is, according to the Pythagoreantheorem,qd(O, P ) x21 x21(1)In general if the point P has p coordinates so that P (x1 , x2 , . . . , xp ), the Euclideandistance from P to the origin O (0, 0, . . . , 0) isqd(O, P ) x21 x22 · · · x2p(2)All points (x1 , x2 , . . . , xp ) that lie a constant squared distance, such as c2 , fromthe origin satisfy the equationd2 (O, P ) x21 x22 · · · x2p c2(3)Because this is the equation of a hypershpere (a circle if p 2), points equidistantfrom the origin lie on a hypershpere. The Euclidean distance between two arbitrarypoints P and Q with coordinates P (x1 , x2 , . . . , xp ) and Q (y1 , y2 , . . . , yp ) is givenbyqd(P, Q) (x1 y1 )2 (x2 y2 )2 . . . (xp yp )2(4)This measure of distance is not enough for most statistical methods. It is necessary, therefore, to find a ‘statistical’ measure of distance. Later we will see themultivariate normal distribution and interpret the statistical distance in that context.For further details about this section see [5].1.8Univariate graphics and summariesA good way of exploring data is by using graphics complemented by numerical summaries. For example, we can obtain summaries for columns 5 through 9 from theCOOKIE data sity(cook[,5]))# 5 numbers summary# boxplots for 5 variables# density plot for variables in column 5At this point it is clear that for a large dataset, this type of univariate exploration isonly marginally useful and we should probably look at the data with more powerfulgraphical approaches. For this, I suggest turning to a different software: GGobi.While GGobi is useful for data exploration, R is preferred for performing the numericalpart of multivariate data analysis.6

121086420CELLSIZE CELLUNIF SURF ROLOOSEFIRMFigure 1: Boxplots for 5 variables in COOKIE dataset7

2Multivariate data setsTypical multivariate data sets can be arranged into a data matrix with rows andcolumns. The rows indicate ‘experimental units’, ‘subjects’ or ‘individuals’, whichwill be referred as units from now on. This terminology can be applied to animals,plants, human subjects, places, etc. The columns are typically the variables measuredon the units. For most of the multivariate methods treated here we assume that thesevariables are continuous and it is also often convenient to assume that they behaveas random events from a normal distribution. For mathematical purpose, we arrangethe information in a matrix, X, of N rows and p columns, x11 x12 · · · x1p x21 x22 · · · x2p X . . ···. xN 1 xN 2 · · · xN pwhere N is the total number of ‘units’, p is the total number of variables, and xijdenotes the entry of the j th variable on the ith unit. In R you can find example datasets that are arranged in this way. For example,dim(iris)[1] 150 21.41.5SpeciessetosaversicolorvirginicaWe have asked R how many dimensions the iris dataset has using the dim function. The dataset has 150 rows and 5 columns. The rows are the experimental units,which in this case are plants. The first four columns are the response variables measured on the plants and the last column is the Species. The next command requestsprinting rows 1,51 and 120. Each of these is an example of the three species in thedataset. A nice visual approach to explore this dataset is the pairs command, whichproduces Figure 2. The code needed to produce this plot can be found in R by typing?pairs. Investigate!3Multivariate Normal DensityMany real world problems fall naturally within the framework of normal theory. Theimportance of the normal distribution rests on its dual role as both population modelfor certain natural phenomena and approximate sampling distribution for many statistics [5].8

Anderson's Iris Data 3 species2.03.04.0 2.5 1.50.5 5.5 Petal.Length 7.5 1234567Figure 2: Scatterplot matrix for iris dataset97.5