CSI 991

Seminar in Computational Statistics:

Data Mining

Spring/Summer, 2004

Fridays 3:00pm -- 5:00pm, Innovation Hall, Room 131

For Spring, the course is CSI 991, Section 004.
For Summer, the course is CSI 991, Section X01.

Contacts:
csutton@gmu.edu
jgentle@gmu.edu


Some links on this page are available only to members of the seminar group.



We will first continue working through some of the material in Principles of Data Mining by David J. Hand, Heikki Mannila, and Padhraic Smyth (HMS), beginning in Chapter 11. We will also explore various software packages for data mining.

As soon as we finish this book, we will begin working through The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (HTF).

Other plans include some competitive analyses of datasets.

Some datasets of interest can be found at the following sites:

And some more:


The CART datamining contest for students should also be of interest (see the "Competition" button).

Jim Shine provided a list of topographic datasets.

Yasmin Said provided some data from various sources in a zip file.

Charles Perry provided some nutrition data in a SAS dataset covering 1309 dietary supplement products. Each record contains the product name followed by values for 26 nutrients, each expressed as a percentage of its recommended daily value.
The objective is to cluster the data into n groups that a panel of expert nutritionists would consider most logical. Since many products contain only a single nutrient, it seems reasonable to first remove these products from the data set and group them by the nutrient they contain, and then to cluster the remaining products into n groups with similar nutrient values.
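
The two-step approach just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed method for the seminar: it assumes the SAS dataset has already been read into a list of (name, values) pairs, that a nutrient a product lacks is recorded as 0, and that plain k-means on the percent-of-daily-value figures is an acceptable first clustering method (the page does not name one). The function names and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def split_single_nutrient(products):
    """Separate products with exactly one nonzero nutrient from the rest.

    products: list of (name, values) pairs, values being a list of
    percent-of-daily-value numbers (assumed 0 when the nutrient is absent).
    Returns (singles, multi): singles maps nutrient index -> product names,
    multi is the list of remaining (name, values) pairs.
    """
    singles = defaultdict(list)
    multi = []
    for name, values in products:
        nonzero = [i for i, v in enumerate(values) if v > 0]
        if len(nonzero) == 1:
            singles[nonzero[0]].append(name)
        else:
            multi.append((name, values))
    return dict(singles), multi


def kmeans(points, n, iters=50, seed=0):
    """Basic Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Start from n distinct data points chosen at random (requires n <= len(points)).
    centers = [list(p) for p in rng.sample(points, n)]

    def nearest(p):
        # Index of the center with the smallest squared Euclidean distance.
        return min(range(n),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))

    labels = None
    for _ in range(iters):
        new_labels = [nearest(p) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(n):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy records with 3 nutrients instead of 26 (made-up values, not the real data).
products = [
    ("A", [100, 0, 0]), ("B", [0, 50, 0]), ("C", [25, 0, 0]),
    ("D", [100, 100, 0]), ("E", [90, 110, 0]),
    ("F", [0, 10, 200]), ("G", [0, 20, 190]),
]
singles, multi = split_single_nutrient(products)
labels = kmeans([values for _, values in multi], n=2)
for (name, _), label in zip(multi, labels):
    print(name, label)
```

K-means with Euclidean distance is sensitive to nutrients that have very large percentage values, so scaling or a different distance may be worth trying before comparing the result with the nutritionists' groupings.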

We also want to develop some test datasets and possibly hold contests among separate teams in the class.


Schedule


Here's a review of an interesting introductory statistics book, All of Statistics, by Larry Wasserman. (The review is in the typically silly style of the Technometrics book review editor, but it does indicate the type of material covered -- and the link has the full bibliographic info.)