For Spring, the course is CSI 991, Section 004.
For Summer, the course is CSI 991, Section X01.
Contacts:
csutton@gmu.edu
jgentle@gmu.edu
As soon as we finish this book we will begin going through The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (HTF).
Other plans include some competitive analyses of datasets.
Some datasets of interest can be found at the following sites:
And some more:
Jim Shine provided a list of topographic datasets.
Yasmin Said provided some data from various sources in a zip file.
Charles Perry provided some nutrition data in a SAS dataset.
There are 26 nutrients for 1309 dietary supplement products.
Each record contains the product name followed by values for 26 nutrients.
Each nutrient value is expressed as a percentage of its recommended
daily value.
The objective is to cluster the data into n groups that match the grouping
a set of expert nutritionists would consider most logical.
Because many of the products contain only a single nutrient, it seems
reasonable to first remove these products from the data set and group
them by the nutrient they contain, and then to cluster the remaining
products into n groups, each containing products with similar nutrient
values.
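The two-step approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy records use 3 nutrients rather than 26, the product names and values are made up, the helper names (`split_single_nutrient`, `kmeans`) are hypothetical, and plain k-means with squared Euclidean distance is just one possible choice of clustering method, not one the group settled on.

```python
import random

def split_single_nutrient(products):
    """Separate products with exactly one nonzero nutrient from the rest.

    Returns (single, multi): `single` maps a nutrient index to the names of
    products containing only that nutrient; `multi` holds all other records.
    """
    single, multi = {}, []
    for name, values in products:
        nonzero = [i for i, v in enumerate(values) if v > 0]
        if len(nonzero) == 1:
            single.setdefault(nonzero[0], []).append(name)
        else:
            multi.append((name, values))
    return single, multi

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, n, iters=50, seed=0):
    """Plain k-means (an assumed method); returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, n)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(n), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for k in range(n):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up records: (product name, percent of daily value per nutrient).
products = [
    ("VitC-500", [0, 500, 0]),
    ("VitC-1000", [0, 1000, 0]),
    ("Iron", [100, 0, 0]),
    ("MultiA", [100, 100, 100]),
    ("MultiB", [110, 90, 100]),
    ("MegaA", [1000, 900, 1000]),
    ("MegaB", [900, 1000, 1100]),
]

single, multi = split_single_nutrient(products)
labels = kmeans([values for _, values in multi], n=2)
print(single)
print([(name, lab) for (name, _), lab in zip(multi, labels)])
```

With real data, the choice of distance and any rescaling of the percentages would probably matter more than the clustering algorithm itself, since a few very large values can dominate a Euclidean distance.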
The other thing we want to do is to develop some test datasets, and possibly have contests among separate teams in the class.
Review of some general topics in regression: projection pursuit, projection pursuit regression, IRWLS, etc.
Firehouse
Discuss material in Chapter 12 on data organization and databases.
Presentation on data organization and databases by Pragyansmita Nayak.
Fat's
Presentation on visualization of data structure and Crystal Vision demo by Mark Lukens.
Here's the Eureka dataset in S-Plus format or in Crystal Vision (text) format.
Another example, together with Matlab code, may be of interest.
Presentation on rule-based regression by Kyle Caudle. An example using Cubist on the MPG data in the csv file yielded this output text file. The data, from US DoE and the US EPA, are described here. Some more information about rule-based regression is available in a paper by Luis Torgo, and information about Cubist is available from their site.
Other stuff in Chapter 13?
Fat's
Presentation on bump-hunting by Li Li, based on, among other things, the paper by Friedman and Fisher.
Further analysis of the MPG data in Cubist by Kyle Caudle.
Other stuff in Chapter 14?
Presentation on Chapter 14 by Hong Chai.
Presentation on BOAT by Phil Sage.
See the paper and more info about the paper.
More on the bump-hunting algorithm.
More on the MPG data: run through CART. (We are going to eliminate the hybrid car from the data set.)
General exploratory analysis of the MPG data by Yaru Li.
TT's
General discussion of some left-over topics.
Presentation on use of entropy in categorical clustering by Daniel Barbara.
Carlos O'Kelly's
General discussion of Breiman's 2001 paper on random forests (pp. 5-32 of Vol. 45 (Oct. 2001) of Machine Learning). To get a copy of this paper, first go to the GMU Library, click on E-journal finder, and then do a search for Machine Learning.
This paper refers to another interesting paper, by Dietterich. The correct reference is Machine Learning 40(2) 137--157 (2000), and it is available online at the same site.
Jim Shine will lead the discussion, but everyone should read the paper.
The question of what "meta-learning" is came up. Michael Johnson referred the group to the website http://www1.cs.columbia.edu/~sal/JAM/PROJECT/, where it is stated:
"Once derived local classifier agents or models are produced at some site(s), two or more such agents may be composed into a new classifier agent by a meta-learning agent. Meta-learning is a general strategy that provides the means of learning how to combine and integrate a number of separately learned classifiers or models, (each of which in the current context is a remote agent)."
and later,
"The manner in which we learn the relationship between classifiers is to learn a new classifier (a 'meta-level classifier') whose input is the set of predictions of two or more classifiers on common data. It is this latter view that we call meta-learning."
In using "ensembles" of classifiers, which is how Breiman describes in a general way what is being done in bagging, boosting, and randomization, there is an averaging or voting. In meta-learning, either arbiters or combiners are used to try to allow unequal influence from the base classifiers depending on some differential assessment of their performance.
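The contrast between voting and a learned combiner can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three threshold rules standing in for base classifiers, the made-up one-dimensional data, and the lookup-table combiner, which is only one very simple way to build a "meta-level classifier" from base predictions and is not taken from the JAM project.

```python
from collections import Counter, defaultdict

# Hypothetical base classifiers: one accurate rule and two poor ones.
base_classifiers = [
    lambda x: int(x >= 5),
    lambda x: int(x >= 8),
    lambda x: int(x >= 9),
]

def true_label(x):
    """Made-up ground truth for the toy problem."""
    return int(x >= 5)

def base_predictions(x):
    """Tuple of base-classifier predictions; this is the meta-level input."""
    return tuple(clf(x) for clf in base_classifiers)

def majority_vote(preds):
    """Equal-influence ensemble: predict 1 if more than half the votes are 1."""
    return int(sum(preds) * 2 > len(preds))

def fit_combiner(xs, ys):
    """Learn a table from each prediction tuple to its most common true label."""
    counts = defaultdict(Counter)
    for x, y in zip(xs, ys):
        counts[base_predictions(x)][y] += 1
    return {preds: c.most_common(1)[0][0] for preds, c in counts.items()}

def combine(table, preds):
    """Meta-level prediction; fall back to voting for unseen tuples."""
    return table.get(preds, majority_vote(preds))

def accuracy(predict, xs):
    """Fraction of points in xs whose prediction matches the true label."""
    return sum(predict(base_predictions(x)) == true_label(x) for x in xs) / len(xs)

train_xs = list(range(10))             # common data seen by all classifiers
test_xs = [x + 0.5 for x in range(10)]

table = fit_combiner(train_xs, [true_label(x) for x in train_xs])
vote_acc = accuracy(majority_vote, test_xs)
combiner_acc = accuracy(lambda preds: combine(table, preds), test_xs)
print("majority vote accuracy:", vote_acc)
print("combiner accuracy:", combiner_acc)
```

In this toy case the two weak rules outvote the accurate one whenever x lies between 5 and 8, while the combiner learns that the pattern (1, 0, 0) should be labeled 1, which is the kind of unequal influence described above.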
Firehouse
Follow-up presentation on bump-hunting by Li Li. She used some of Friedman's programs.
Discussion of Breiman's Section 2.
LaTeX file that produced it.
Presentation on relational probability trees (RPTs) by Daniel Barbara, discussing, among other things, the paper by Neville et al. and the paper by Jensen et al. Following the presentation on Friday there was some discussion about the differences between RPTs and an approach of just flattening the space and using the features that RPTs use. Dr. Barbara added a comment about this issue in the pdf file linked above.
Permutation tests, presentation by Clifton Sutton.
Analyses of the Boston housing data by Jill McCracken.
Further analyses of the Boston housing data.
Discussion of material in first 21 pages of HTF.
How to simulate data on page 17.
Duplication of the experiment mentioned on page 17 by Li Li.
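As a rough sketch of such a simulation, and assuming the page 17 reference is to the two-class Gaussian mixture example in HTF Section 2.3.3: 10 means are drawn for each class (from N((1,0), I) for BLUE and N((0,1), I) for ORANGE), and each of 100 observations per class is drawn from N(m_k, I/5) around a randomly chosen mean m_k. The function name, defaults, and seed below are hypothetical.

```python
import random

def simulate_mixture(n_per_class=100, n_means=10, seed=0):
    """Simulate two-class data from mixtures of bivariate Gaussians.

    Returns a list of (x1, x2, label) tuples.
    """
    rng = random.Random(seed)
    class_centers = {"BLUE": (1.0, 0.0), "ORANGE": (0.0, 1.0)}
    sd_obs = (1.0 / 5.0) ** 0.5  # covariance I/5, so per-axis variance 1/5
    data = []
    for label, (cx, cy) in class_centers.items():
        # Step 1: draw the component means for this class from N(center, I).
        means = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
                 for _ in range(n_means)]
        # Step 2: pick a mean uniformly at random, then draw around it.
        for _ in range(n_per_class):
            mx, my = rng.choice(means)
            data.append((rng.gauss(mx, sd_obs), rng.gauss(my, sd_obs), label))
    return data

data = simulate_mixture()
print(len(data), data[0])
```

Fixing the seed makes a run repeatable, which helps when several people try to duplicate the same experiment and compare results.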
Modeling the prostate cancer data. Presentation by Mark Lukens.
Linear discriminant analysis for classification. Presentation by Jim Shine.
Use of generalized linear models in classification.
Data in example (cancer.dat).
Logistic regression. Presentation by Jill McCracken.
Assignment: Read through page 143 and the appendix at the end of Ch. 5; work exercises 4.5, 5.1, 5.4, and the supplemental exercise by Clifton Sutton.
Modeling the prostate cancer data. Presentation by Mark Lukens.
A simulation study of variable selection in regression modeling. Presentation by Mark Lukens.
More discussion on splines.
Applications of principal component analysis in multivariate time series models. Presentation by Irsal Imran.
Explorations with MARS (the program).