MA429 Algorithmic Techniques for Data Mining

Exercises 9 Homework is summative

Submission

Submit the summative solutions by 11.59pm of Thursday 2nd April (London time) to

the Moodle submission link. This homework is worth 10% of the final course mark. (This

is increased from the 5% planned, in order to decrease the weight on the final exam due to

coronavirus disruption.)

Submit a single pdf that contains the solutions to all questions. (This can include scanned

versions of handwritten arguments, although typesetting is preferred.) Submit your R code for

the programming question as an R script or Rmd file; if Rmd, please include the report that

includes the code’s output.

Your files should be named with your candidate exam number, which should also appear

within the files to be sure. Do not show your name or registration number anywhere: these

submissions are anonymous.

Details of whether you need to submit a plagiarism/receipt form, and how, are under dis-

cussion within Mathematics and will follow in due course.

You are to do this homework individually. You are welcome to use the course materials

provided and recommended, including all textbooks mentioned. You may also use generic online

reference sources such as Wikipedia pages, but (though I doubt that it would help anyway),

you must not search specifically for solutions to these problems.

1. The figure below shows some data points in two dimensions.

(a) What are the two principal component (PC) directions?

(b) If the coordinates of the red point are (3,−1), roughly what would be its coordinates

in the principal component space?

2. You wish to cluster the points shown below into two clusters, hoping for the clusters

indicated here by colour. We compare K-means with K = 2, DBSCAN, and spectral

clustering. What clusters would you expect each to produce? Which method(s) would

find the right clusters, as colour-coded here?

Gregory Sorkin (version: 2020.03.24 10:37) c©LSE 2020

MA429 Algorithmic Techniques for Data Mining Exercises 9 Homework is summative

3. You begin with standardised data with n data points in p dimensions.

(a) What is the sum of squares of all n× p scores, and why?

(b) You perform PCA on this data, getting p score vectors, each of dimension n. What

do they represent?

(c) What is the sum of squares of all pn entries of these PCA score vectors, and why?

(d) Is the variance of entries of the first score vector more or less than 1, and why?

(e) What is the interpretation of the variance of the first score vector divided by p?

(f) The biplot below is for the iris dataset, with 4 features. Explain the meaning of the

“42” near the top left corner, near coordinates (−2, 2.3).

(e) What does K-means give for K = 2 in the two cases? For both r = 1 and r = 2,

answer the following question: Is it true that you will always obtain the two natural

clusters for any choice of the initial centres?

6. 10 marks. Spectral clustering uses K-means on a different representation of the point

set via eigenvectors obtained from the normalised Laplacian matrix. Instead, we could try

using the principal components obtained by PCA and run K-means clustering on them;

let us call this method PCA clustering. Consider the example in the picture:

Let K = 3. We have seen in the lecture that spectral clustering would successfully identify

the three clusters corresponding to the three rings. Would PCA clustering also succeed

identifying these rings? Justify your answer. (Hint: What are the two principal component

directions? What would the points look like in that coordinate system?)

7. 15 marks.

(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60

observations total), and 50 variables. The three classes should be clearly separated

Page 5 of 6

MA429 Algorithmic Techniques for Data Mining Exercises 9 Homework is summative

from one another. (Hint: there are many ways to obtain such a data set. For

example you can generate the variables randomly, and then add different shifts to

certain variables for the different classes.)

(b) Perform PCA on the 60 observations and plot the first two principal component score

vectors. Use a different color to indicate the observations in each of the three classes.

If the three classes appear separated in this plot, then continue on to part (c). If not,

then return to part (a) and modify the simulation so that there is greater separation

between the three classes.

(c) Perform K-means clustering of the observations with K = 3. How well do the clusters

that you obtained in K-means clustering compare to the true class labels?

(d) Now perform PCA clustering: K-means clustering with K = 3 on the first two

principal component score vectors, rather than on the raw data. That is, perform K-

means clustering on the 60× 2 matrix of which the first column is the first principal

component score vector, and the second column is the second principal component

score vector. How do the clusters compare to the true class labels?

版权所有：留学生编程辅导网 2020,All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。