
Washington Statistical Society Seminars


January, 2007
9
Tues.
ROC Analysis of the Multiple-Biomarker Classifier Training and Testing Problem: The Influence Function and Specification of Uncertainties in ROC Summary Measures
12
Fri.
Georgetown University Seminar
Parameter Estimation for the Exponential-Normal Convolution Model for Background Correction of Affymetrix GeneChip Data
19
Fri.
Georgetown University Seminar
Considerations in Adapting Clinical Trial Design for Drug Development
29
Mon.
Economic Turbulence in the U.S. Economy
February, 2007
2
Fri.
Georgetown University Seminar
Adaptive "Simon" Designs for Heterogeneous Patient Populations in Phase II Cancer Trials
6
Tues.
Mortality in Iraq
March, 2007
3
Thur.
Georgetown University Seminar
Sequential Monitoring of Randomization Tests
7
Wed.
New Methods and Satellites: A Program Update on the NASS Cropland Data Layer Acreage Program
8
Thur.
Measurement and Statistical Analysis of Human Rights: A Model
8
Thur.
University of Maryland
Statistics Program Seminar
The Dominance Order
16
Fri.
Georgetown University Seminar
Use of a Visual Programming Environment for Creating and Optimizing Mass Spectrometry Diagnostic Workflows
23
Fri.
Bayesian Diagnostics for Detecting Hierarchical Structure
27
Tues.
President's Invited Panel Discussion on Finite Population Correction Factors
27
Tues.
U.S. Bureau Of Census
Statistical Research Division Seminar
The Role of Context in the Recall of Minimally Counterintuitive Concepts
28
Wed.
Applications of the Johnson SB Distribution to Environmental Data
29
Thur.
U.S. Bureau Of Census
Statistical Research Division Seminar
A Test of Association of a Two-Way Categorical Table for Correlated Counts
April, 2007
10
Tues.
Introduction to Data Mining Methodology for Statisticians
13
Fri.
University of Maryland
Statistics Program Seminar
Wait! Should We Use the Survey Weights to Weight?
16
Mon.
2006 Roger Herriot Award Bridging: Roger Herriot's Time to the Present
23
Mon.
American Community Survey Weighting and Estimation: ACS Family Equalization
May, 2007
2
Wed.
An Overview of the Semi-Competing Risk Problem
4
Fri.
Georgetown University Seminar
Systems Pharmacology of Type 2 Diabetes: A Case Study for Pharmaceutical Development
8
Tues.
The STATCOM Network: A Role for Students in Pro Bono Statistical Consulting to the Community
10
Thur.
Using the t-distribution to Deal with Outliers in Small Area Estimation
15
Tues.
Confidence Interval Coverage in Model-Based Estimation
17
Thur.
The Role of Statistics and Statisticians in Human Rights
22
Tues.
Characterization, Modeling and Management of Inferential Risk, Data Quality Risk and Operational Risk in Survey Procedures
June, 2007
12
Tues.
Book Signing and Wine Tasting
13
Wed.
The Role of Fringe Benefits in Employer and Workforce Dynamics
21
Thurs.
Spatial Association Between Speciated Fine Particles and Mortality
25
Mon.
National Health Interview Survey's 50th Anniversary Commemorative Conference
27
Wed.
BLS Statistical Seminar
Robust Prediction of Small Area Means and Distributions
July, 2007
11
Wed.
Estimation under Ignorable Response Mechanism and Unweighted Imputation
18
Wed.
Assessment of Coverage and Utility of Residential Address Lists
24
Tues.
Imputation Using Empirical Likelihood
September, 2007
4
Tues.
A Geostatistical Approach to Linking Geographically-Aggregated Data/A System for Detecting Arbitrarily Shaped Hotspots
7
Fri.
Modeling Multiple-Response Categorical Data From Complex Surveys
7
Fri.
Georgetown University Seminar
Bayesian Methods for Proteomic Biomarker Discovery Using Functional Mixed Models
7
Fri.
George Mason University
CDS/CCDS/Statistics Colloquium Series
Experiences with Congressional Testimony: Statistics and The Hockey Stick
12
Wed.
An Introduction to the Key National Indicators Initiative: the State of the USA
12
Wed.
New Experiments on the Design of Complex Survey Questions
18
Tues.
U.S. Bureau Of Census
Statistical Research Division Seminar
Unduplicating the 2010 Census
19
Wed.
Survey Methodology for Assessing Geographically Isolated Wetlands Map Accuracy
21
Fri.
Georgetown University Seminar
A Geometric Approach to Comparing Treatments for Rapidly Fatal Diseases
25
Tues.
American University
Department of Mathematics and Statistics Colloquium
A Bayesian IRT Model for the Comparison of Survey Item Characteristics under Dual Modes of Administration
26
Wed.
U.S. Bureau Of Census
Statistical Research Division Seminar
Alternative Survey Sample Designs, Seminar #1: Network, Spatial, and Adaptive Sampling
28
Fri.
Small Area Estimation: An Empirical Best Linear Unbiased Prediction Approach
28
Fri.
George Washington University
Department of Statistics Seminar
Multi-Stage Sampling for Genetic Studies
28
Fri.
George Mason University
CDS/CCDS/Statistics Colloquium Series
Text Data Mining in Defense Applications
October, 2007
4
Thur.
University of Maryland
Statistics Program Seminar
Two for the Price of One: Statistics in Natural Language Processing and Information Retrieval
5
Fri.
Georgetown University Seminar
The Statistical Challenge of Studies with Errors-in-Covariates When Only the Means are Modelled
12
Fri.
George Mason University
CDS/CCDS/Statistics Colloquium Series
Finding the Fittest Curve for the Binary Classification Problem
16
Tues.
Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata
19
Fri.
Georgetown University Seminar
Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies
19
Fri.
George Washington University
Department of Statistics Seminar
Limitations of the Non-homogeneous Poisson Process (NHPP) Model for Analyzing Software Reliability Data
24
Wed.
Estimating the Measurement Error in the Current Population Survey Labor Force - A Mixture Markov Latent Class Analysis Approach
25
Thur.
Statistical Issues and Challenges Arising from Analysis of Genome-Wide Association Studies
30
Tues.
17th Annual Morris Hansen Lecture
Assessing the Value of Bayesian Methods for Inference About Finite Population Quantities
November, 2007
2
Fri.
Georgetown University Seminar
Multilevel Functional Principal Component Analysis
2
Fri.
George Washington University
Department of Statistics Seminar
Multiphase Regression Models for Assessing Highly Multivariate Measurement Systems
7
Wed.
Cell Lines, Microarrays, Drugs and Disease: Trying to Predict Response to Chemotherapy
8
Thur.
Introduction to Number Theory and Modeling the Average Running Time of Computer Programs
9
Fri.
George Washington University
Department of Statistics Seminar
Evaluation of Trace Evidence in the Form of Multivariate Data and Sample Size Estimation in a Consignment
9
Fri.
George Mason University
CDS/CCDS/Statistics Colloquium Series
Multi-modal Data and Text Mining
15
Thur.
University of Maryland
Statistics Program Seminar
An MM Algorithm for Multicategory Vertex Discriminant Analysis
16
Fri.
Georgetown University Seminar
Ranges of Association Measures for Dependent Binary Variables
16
Fri.
George Washington University
Department of Statistics Seminar
Sensitivity Analysis for Instrumental Variables Regression with Overidentifying Restrictions
16
Fri.
George Mason University
CDS/CCDS/Statistics Colloquium Series
Handwriting Identification: Identifying the Writer of a Questioned Document Using Statistical Analysis
28
Wed.
The Effects of Active Duty on the Income of Reservists and the Labor Market Participation of Spouses
29
Thur.
Tests of Unit Roots in Time Series Data
29
Thur.
Analyzing Forced Unfolding of Protein Tandems via Order Statistics
December, 2007
5
Wed.
Evaluating Alternative One-Sided Coverage Intervals for an Extreme Binomial Proportion
6
Thur.
Evaluating Continuous Training Programs Using the Generalized Propensity Score
7
Fri.
Disparate Modes of Survey Data Collection
10
Mon.
Empirical Likelihood Based Calibration Method in Missing Data Problems
12
Wed.
Approaches to Reducing and Evaluating Nonresponse Bias, With Applications to Adult Literacy Surveys





Title: ROC Analysis of the Multiple-Biomarker Classifier Training and Testing Problem: The Influence Function and Specification of Uncertainties in ROC Summary Measures

Abstract:

One of the central biomedical issues for our time is the identification and fusion of multiple biomarkers for a specified diagnostic task. The fusion stage can be recognized immediately as a special case of the problem of statistical learning. That is, one trains a statistical learning machine (SLM) with cases whose health status or outcome is already known and then tests the learning machine on cases previously unseen. Almost all investigators of SLMs are familiar with early optimism, tempered by later experience. Assessment methods are needed that provide estimates not only of mean performance, but also of uncertainties associated with the finite size of the training and testing samples. Taking the work of Efron and Tibshirani as a point of departure, we have developed methods for calculating the statistical influence function for figures of merit based not only on probability of misclassification but also on the full receiver operating characteristic (ROC) or true-positive versus false-positive rate and several of its summary measures and their uncertainties. These methods have broad applicability across most diagnostic fields that plan to use multiple biomarkers and, in particular, are useful for designing a target database size based on a pilot study.
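As a rough illustration of attaching an uncertainty to an ROC summary measure, the sketch below computes the empirical area under the ROC curve (the Mann-Whitney estimate) and a bootstrap standard error from a finite test sample. It is not the influence-function method described in the abstract: it captures only test-sample variability, and the function names and default settings are assumptions made for this example.

```python
import random


def empirical_auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the area under the ROC curve.

    Counts the fraction of (diseased, non-diseased) pairs in which the
    diseased case has the higher score; ties count one half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def bootstrap_auc_se(scores_pos, scores_neg, n_boot=500, seed=0):
    """Bootstrap standard error of the AUC (test-sample variability only)."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_boot):
        # Resample each class separately, with replacement.
        boot_pos = [rng.choice(scores_pos) for _ in scores_pos]
        boot_neg = [rng.choice(scores_neg) for _ in scores_neg]
        aucs.append(empirical_auc(boot_pos, boot_neg))
    mean = sum(aucs) / n_boot
    variance = sum((a - mean) ** 2 for a in aucs) / (n_boot - 1)
    return variance ** 0.5
```

Accounting for the finite training sample as well, as the abstract proposes, would require retraining the classifier within each resample or using the influence-function approach itself.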

GEORGETOWN UNIVERSITY SEMINAR

Title: Parameter Estimation for the Exponential-Normal Convolution Model for Background Correction of Affymetrix GeneChip Data

Abstract:

There are many methods of correcting microarray data for non-biological sources of error. Authors routinely supply software or code so that interested analysts can implement their methods. Even with a thorough reading of associated references, it is not always clear how requisite parts of the method are calculated in the software packages. However, it is important to have an understanding of such details, as this understanding is necessary for proper use of the output, or for implementing extensions to the model.

In this paper, the calculation of parameter estimates used in Robust Multichip Average (RMA), a popular preprocessing algorithm for Affymetrix GeneChip brand microarrays, is elucidated. The background correction method for RMA assumes that the perfect match (PM) intensities observed result from a convolution of the true signal, assumed to be exponentially distributed, and a background noise component, assumed to have a normal distribution. A conditional expectation is calculated to estimate signal. Estimates of the mean and variance of the normal distribution and the rate parameter of the exponential distribution are needed to calculate this expectation. Simulation studies show that the current estimates are flawed; therefore, new ones are suggested. We examine the performance of preprocessing under the exponential-normal convolution model using several different methods to estimate the parameters.
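The conditional expectation mentioned above can be sketched under a simplifying assumption: the observed intensity is O = S + Y, with S exponential with rate alpha and Y normal with mean mu and standard deviation sigma, and the normal noise is not truncated at zero. Under that assumption S given O = o is a normal distribution with mean a = o - mu - alpha * sigma^2 and standard deviation sigma, truncated to S > 0. The function and parameter names below are illustrative and are not taken from any RMA software package.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function (erfc keeps lower-tail precision)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def expected_signal(observed, mu, sigma, rate):
    """E[S | O = observed] for O = S + Y, S ~ Exp(rate), Y ~ N(mu, sigma^2).

    Assumes untruncated normal noise, so S | O is N(a, sigma^2) truncated
    to S > 0, whose mean is a + sigma * pdf(a / sigma) / cdf(a / sigma).
    For a / sigma far below about -35 the cdf underflows to zero, so this
    sketch is only suitable for moderate arguments.
    """
    a = observed - mu - rate * sigma ** 2
    z = a / sigma
    return a + sigma * normal_pdf(z) / normal_cdf(z)
```

The estimate is always positive, and for intensities well above the background mean it is close to the observed value minus mu and the small shift alpha * sigma^2. How mu, sigma and alpha are estimated is the subject of the talk and is not addressed here.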

GEORGETOWN UNIVERSITY SEMINAR

Title: Considerations in Adapting Clinical Trial Design for Drug Development

Abstract:

Enhancing the flexibility of clinical trial designs is currently a topic of great interest. Proper adaptation of the trial design is one way to achieve this goal and has drawn much attention from clinical trialists. In past decades, the classical design has been improved to allow the trial to be terminated early if the experimental treatment is proven effective or deemed harmful or futile, based on the data accumulating during the course of the trial. The statistical validity of such an enhanced design, in terms of type I error, is maintained. The operational aspects of this design can still be an issue but, by and large, there are many good models for how to deal with them. As the flexibility of trial design is enhanced further, the risk that the resulting trial may not be interpretable increases. In this presentation we shall share our review experience, discuss the many issues arising from the use of more flexible designs and, we hope, stimulate further research in this area.

Title: Economic Turbulence in the U.S. Economy

Abstract:

Turbulent change is the hallmark of the U.S. economy, and one of the reasons for its success. Every week, in every part of the economy, and in every corner of the country, some firms are shutting down and others are starting up, some jobs are being created and others are being destroyed, some workers are being hired and others are quitting or being laid off.

The presentation will summarize the analysis from a new book "Economic Turbulence" derived from the use of the LEHD data at the Census Bureau, as well as from interviews with firms and workers in each industry.

Three key topics will be discussed:

  1. Firm performance and survival: What is the relationship between workforce quality, turnover, and firm survival?
  2. Worker career paths: What impact do firms have on workers' career paths? What is the long run impact of firm stability and instability on a worker's earnings growth?
  3. Wage distribution: What has happened to worker earnings over time? What has happened to middle, low, and high income jobs? Do new firms pay more or less than old?

GEORGETOWN UNIVERSITY SEMINAR

Title: Adaptive "Simon" Designs for Heterogeneous Patient Populations in Phase II Cancer Trials

Abstract:

In a Phase II cancer trial it may be advantageous to open enrollment to several patient populations, each with a very different null probability of response. For example, in a trial of a novel therapeutic agent for relapsed Acute Myelogenous Leukemia (AML), patients in first relapse may have a 30% probability of response under standard treatment, while patients in second relapse or higher may have only a 10% probability of response. These Phase II trials are generally uncontrolled (they often use "historical controls"), and the experimental agent may be expected to induce certain Grade 3 toxicities which would not be considered dose limiting. Furthermore, historically most of these Phase II trials can be expected to prove no better than standard-of-care. Phase II trials with these characteristics are usually designed with an early stopping rule which checks for initial evidence of efficacy after a first stage enrollment target is met. If there is insufficient evidence, the trial stops for futility. We discuss the standard two-stage optimal designs in this situation, and describe their operating characteristics under heterogeneous patient enrollment. These are compared to other approaches in the literature. Simple, approximately optimal designs which account for heterogeneity are presented. We recommend a practical adaptive design strategy which we have implemented at Moores UCSD Cancer Center.
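The operating characteristics of a standard two-stage design with a futility stop can be computed directly from binomial probabilities. The sketch below assumes a single homogeneous response probability p and the usual rule: enroll n1 patients, stop for futility if at most r1 respond, otherwise enroll up to n patients in total and declare the agent promising if more than r respond. The design values in the usage note (r1/n1 = 1/10, r/n = 5/29) are an illustrative choice, not a design taken from the talk.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); zero when k is negative."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(0, min(k, n) + 1))


def two_stage_characteristics(r1, n1, r, n, p):
    """Return (P(early stop), P(declare promising), expected sample size)."""
    n2 = n - n1
    prob_early_stop = binom_cdf(r1, n1, p)
    prob_promising = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # The trial continues; success requires x1 + x2 > r in total.
        prob_promising += binom_pmf(x1, n1, p) * (1.0 - binom_cdf(r - x1, n2, p))
    expected_n = n1 + (1.0 - prob_early_stop) * n2
    return prob_early_stop, prob_promising, expected_n
```

Calling two_stage_characteristics(1, 10, 5, 29, 0.10) and again with p = 0.30 shows how differently one fixed design behaves when the true response probability is 10% versus 30%, which is the difficulty that arises when populations with different null response rates are enrolled in the same trial.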

Title: Mortality in Iraq

Abstract:

In unstable situations, population based data are the most reliable method of estimating mortality and other health indicators. In many conflicts and fragile state settings, however, collecting such data is difficult to do. Aside from the physical dangers, there is often an incomplete understanding of population numbers, population locations, migration patterns, and health status of the population. That lack of understanding contributes to many methodological challenges. However, population based data are increasingly important in planning protection of and assistance to affected populations, as well as for reconstruction policy.

In Iraq we have undertaken two population-based national surveys of mortality related to conflict using a cluster survey approach. The first covered the period from January 2002 until July 2004, using 33 clusters with 988 households and 7,868 persons. That survey estimated an excess mortality of over \,000 persons following the March 2003 invasion. The second survey covered the period from January 2002 until July 2006. That survey included 47 clusters containing 1,849 households and 12,801 persons. From that survey an excess mortality of 654,965 (CI 392 797-942 636) was estimated, with 601,027 deaths attributed to violent causes.

The presentations will discuss the methodological and ethical issues involved in conducting our research in Iraq.

GEORGETOWN UNIVERSITY SEMINAR

Title: Sequential Monitoring of Randomization Tests

Abstract:

Randomization provides a basis for inference, but it is rarely taken advantage of. We discuss randomization tests based on the family of linear rank tests in the context of sequential monitoring of clinical trials. Such tests are applicable for categorical, continuous, and survival time outcomes. We prove the asymptotic joint normality of sequentially monitored test statistics, which allows the computation of sequential monitoring critical values under the Lan-DeMets procedure. Since randomization tests are not based on likelihoods, the concept of information is murky. We give an alternate definition of information and show how to compute it for different randomization procedures. The randomization procedures we discuss are the permuted block design, stratified block design, and stratified urn design. We illustrate these results by reanalyzing a clinical trial in retinopathy.

Title: New Methods and Satellites: A Program Update on the NASS Cropland Data Layer Acreage Program

Abstract:

The USDA/National Agricultural Statistics Service (NASS) annually produces remote sensing based crop specific classifications and acreage estimates over the major growing regions of the United States using medium resolution satellite imagery. The classifications are published in the public domain as the Cropland Data Layer (CDL) after the publication of the official release of county estimates. This program has mapped 24 total states since 1997 and is currently mapping 11 states annually (AR, IA, IL, IN, LA, MO, MS, ND, NE, WA and WI). This program previously used Landsat TM and ETM+ satellite imagery, the NASS June Agricultural Survey (JAS) segments for the ground truth information, and NASS public domain Peditor software for producing the classification and regression estimates. The unpredictability of the aging Landsat program assets, the labor intensive nature of digitizing June Agricultural Survey input for the Cropland Data Layer program, and the potential efficiency gains using commercial software warranted the need to investigate new program methods.

In 2004, NASS investigated alternative sensors to the Landsat platform, annually acquiring ResourceSat-1 Advanced Wide Field Sensor (AWiFS) data over the active Cropland Data Layer states. Additionally, evaluations were carried out on alternative ground truth methodologies to the June Agricultural Survey, using data collected through the USDA/Farm Service Agency (FSA) Common Land Unit (CLU) program. Testing and comparisons with regression tree See5 software against Peditor began in 2006 to produce the Cropland Data Layer. The goal was to determine which application was more efficient and delivered the most accurate estimates.

Accuracy assessments and acreage indications determined that the AWiFS significantly reduced the statistical variance of acreage indications from using the June Agricultural Survey area sampling frame, delivering a potential successor to the Landsat platform. In 2006, pilot testing was complete and the AWiFS sensor was selected as the exclusive source of imagery for the production of the Cropland Data Layer and acreage estimates. The Farm Service Agency Common Land Unit program provides a comprehensive national digitized and attributed GIS dataset collected annually for inclusion into programs like the Cropland Data Layer. Commercial image processing programs such as See5 were tested in 2006 against the AWiFS imagery and Common Land Unit datasets, providing evidence of efficiency gains in statistical accuracy, scope of coverage, and time of delivery.

Title: Measurement and Statistical Analysis of Human Rights: A Model

Abstract:

The study of human rights violations and the development of statistical models that can offer explanations are severely handicapped by a lack of adequate data. Most information on human rights is embedded in qualitative reports. Quantitative data that do exist tend to be limited to rough counts of violations or numeric indexes with little if any methodological transparency. This presentation will describe an extensive and rigorous coding project which uses the annual U.S. State Department's International Religious Freedom Reports as the primary information source and the procedures developed to check the coded data against alternative sources. The usefulness of these coded data will be demonstrated by testing an explanatory theory of religious persecution using structural equation modeling. The presentation will conclude with a discussion of how this research could be extended to the measurement and statistical analysis of other human rights.

GEORGETOWN UNIVERSITY SEMINAR

Title: Use of a Visual Programming Environment for Creating and Optimizing Mass Spectrometry Diagnostic Workflows

Abstract:

The use of mass spectrometry for clinical applications has extraordinary potential for accurate, early, and minimally invasive diagnoses of complex diseases, such as cancer, which require sensitive diagnostic tools for prognosis and development of flexible treatment strategies. Unfortunately, current mass spectrometry data analysis options available to researchers often require improvised combinations of tools provided by instrument manufacturers, third-parties, and in-house development. The lack of unified interfaces to access existing resources presents a significant bottleneck in the research and discovery process.

In this seminar, we present a modular software tool for the analysis of mass spectrometry profiling data that aims to address this bottleneck. The modules that comprise the analysis workflows can be broadly classified into three categories: signal processing tools, variable selection algorithms, and classification utilities. The software tool provides a platform that allows researchers to construct, validate, and optimize classification workflows of serum samples analyzed with time-of-flight mass spectrometry. Our work suggests that this type of flexible and interactive architecture is highly useful for 1) the development of mass spectrometry workflows and 2) biomarker discovery and validation in clinical environments.

Title: Bayesian Diagnostics for Detecting Hierarchical Structure

Abstract:

Motivated by an increasing number of Bayesian hierarchical model applications, we investigate several diagnostic techniques when the fitted model includes some hierarchical structure, but the data are from a model with additional, unknown hierarchical structure. We start by studying the simple situation where the data come from a normal model with two-stage hierarchical structure while the fitted model does not have any hierarchical structure, and then extend this to the case where the fitted model has two-stage normal hierarchical structure while the data come from a model with three-stage normal structure. Our investigation suggests two promising techniques: distribution of individual posterior predictive p values and the conventional posterior predictive p value with the F statistic as a checking function. Finally, we apply these two techniques to examine the fit of a model for data from the Patterns of Care Study, a two-stage cluster sample of cancer patients undergoing radiation therapy.

Title: President's Invited Panel Discussion on Finite Population Correction Factors

Abstract:

It is common practice to use finite population correction factors (fpc) in estimating variances when sampling from a finite population. Various approximate fpcs are sometimes used with more complex designs. When the interest is in a wider population than the specific finite sampling frame, many argue that it suffices to drop the fpc from the variance estimates, but others maintain this is appropriate only in a limited number of contexts.
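For the simplest case, simple random sampling without replacement of n units from a population of N, the estimated variance of the sample mean is (1 - n/N) * s^2 / n, where s^2 is the sample variance and 1 - n/N is the fpc. The sketch below contrasts the estimate with and without the correction; the function name and its arguments are assumptions made for this example, and more complex designs need design-specific formulas.

```python
import statistics


def variance_of_mean(sample, population_size=None):
    """Estimated variance of the sample mean under simple random sampling.

    With population_size given, applies the finite population correction
    (1 - n/N); with population_size=None, the correction is dropped, as is
    often argued for when inference targets a wider population.
    """
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance with n - 1 divisor
    var = s2 / n
    if population_size is not None:
        var *= 1.0 - n / population_size
    return var
```

With the sample [1, 2, 3, 4, 5] and N = 50, the sampling fraction is 10%, so the corrected variance is 90% of the uncorrected one.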

U.S. BUREAU OF CENSUS
STATISTICAL RESEARCH DIVISION SEMINAR

Title: The Role of Context in the Recall of Minimally Counterintuitive Concepts

Abstract:

Counterintuitive concepts have been identified as major aspects of religious belief, and have been used to explain the retention and transmission of such beliefs. To resolve inconsistencies within this literature, three experiments were conducted to study the effect of context on recall. Context was found to be the key element affecting recall and the discrepancy among prior studies was resolved. The results imply that the nature of the surrounding context must be included in any account of the formation and transmission of religious concepts. A recent extension of this work involving type of context (science or religion) will also be introduced.

This seminar is physically accessible to persons with disabilities. For TTY callers, please use the Federal Relay Service at 1-800-877-8339. This is a free and confidential service. To obtain Sign Language Interpreting services/CART (captioning real time) or auxiliary aids, please send your requests via e-mail to EEO Interpreting & CART: eeo.interpreting.&.CART@census.gov or TTY 301-457-2540, or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.

Title: Applications of the Johnson SB Distribution to Environmental Data

Abstract:

In analyzing environmental data, it is common practice to assume that such data are from a 2-parameter lognormal distribution if right-skewed and from a normal distribution if symmetrical. It is not generally recognized that the Johnson SB Distribution provides a continuum of distributions between the normal and lognormal distributions that constitute SB asymptotes. The Johnson SB transforms experimental data bounded by a minimum value (Xmin) and a maximum value (Xmax) into a normally distributed variable Y = ln [(x - Xmin) / (Xmax - x)] which is bounded as -infinity < Y < +infinity. As Xmax goes to +infinity and Xmin goes to 0, the distribution is asymptotically 2-parameter lognormal. As Xmax goes to +infinity and Xmin goes to -infinity the distribution is asymptotically normal.
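The transformation Y = ln [(x - Xmin) / (Xmax - x)] and its inverse are simple to write down. The sketch below assumes Xmin and Xmax are already known; estimating them is the difficult part of fitting the SB distribution. The function names are illustrative.

```python
import math


def sb_transform(x, xmin, xmax):
    """Map x in the open interval (xmin, xmax) to Y = ln((x - xmin) / (xmax - x))."""
    if not xmin < x < xmax:
        raise ValueError("x must lie strictly between xmin and xmax")
    return math.log((x - xmin) / (xmax - x))


def sb_inverse(y, xmin, xmax):
    """Map Y on the real line back to the original bounded scale."""
    return xmin + (xmax - xmin) / (1.0 + math.exp(-y))
```

The midpoint of the interval maps to Y = 0, and values near either bound map to large negative or large positive Y, which is how bounded data are stretched onto the whole real line.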

Methods of objectively determining 4 optimal parameters for the SB distribution (Xmin, Xmax, mu, sigma) by maximum likelihood estimation procedures are reviewed. Bruce Hill (1963) showed that the maximum likelihood solution for the three parameter lognormal yields degenerate and absurd solutions as Xmin goes in the limit to the minimum observation; the likelihood of the minimum observation tends to infinity, as the likelihood of all other observations tends to zero. Although somewhat surprising, Hill's result conforms with known general problems with likelihood methods when the support points of the probability distribution are a function of the parameters of the distribution, in this case the parameters Xmin and Xmax. Several modifications of the maximum likelihood methods are proposed. It is also shown that for the standard likelihood function a local maximum occurs within natural parameter space.

Different methods of resolving this problem are discussed along with other methods of obtaining the SB parameters, by fitting to 4 percentiles, by method of moments, and by a graphical technique that plots the data and minimizes the Kolmogorov-Smirnov statistic.

U.S. BUREAU OF CENSUS
STATISTICAL RESEARCH DIVISION SEMINAR

Title: A Test of Association of a Two-Way Categorical Table for Correlated Counts

Abstract:

When the counts in a two-way categorical table are formed from the correlated members of a cluster, the common chi-squared test no longer applies. There are several approximate adjustments to the common chi-squared test. For example, Choi and McHugh (1989, Biometrics, 45) showed how to adjust the chi-squared statistic for clustered and weighted data. However, our main contribution is the construction and analysis of a Bayesian model that removes analytical approximation, especially when the expected cell count is zero or small. This is an extension of a standard multinomial Dirichlet model to include the intra-class correlation associated with the individuals within a cluster. We have used the formula described by Altham (1976, Biometrika, 63) to incorporate the intra-class correlation. This intra-cluster correlation varies with the size of the cluster, but we assume that it is the same for all clusters of the same size for the same variable. We use MCMC to fit our model, and to make posterior inference about the intra-class correlation and the cell probabilities. Also, using Monte Carlo integration with a binomial importance function, we obtain the Bayes factor for a test of no association. To demonstrate the performance of the alternative test and estimation procedure, we have used data on activity limitation status and age from the National Health Interview Survey and a simulation study.


Title: Introduction to Data Mining Methodology for Statisticians

Abstract:

This presentation will introduce data mining methodology and address some of the questions statisticians commonly ask about data mining. Sample questions include what data mining and statistical analysis have in common, what role the statistician plays in the analysis and interpretation of data mining results, how results are validated, how data mining came to be, which datasets are appropriate for data mining, and why computer scientists have led so much of the data mining development. Data mining is considered to be the application of modern, highly automated nonparametric analytical methods to recognize enduring patterns in data. Several of the major tools of data mining will be discussed, including decision trees (CART), artificial neural networks, multivariate adaptive regression splines (MARS), rule induction, RandomForests, Multiple additive regression trees (TreeNet/MART Stochastic Gradient Boosting) and several others. Finally, case study examples will be provided for a variety of data mining methods.

THE UNIVERSITY OF MARYLAND
STATISTICS PROGRAM SEMINAR

Title: Wait! Should We Use the Survey Weights to Weight?

Abstract:

The lecture will discuss the use of weights in survey inference. A fundamental idea in survey sampling is to weight cases by the inverse of their probabilities of inclusion, when deriving survey inferences. The weight indicates the number of population units the included case represents, and thus can be seen as a fundamental feature of the design-based survey inference. Modelers, on the other hand, seem more ambivalent about weighting, and argue that (at least in some settings) weighting is unnecessary. Dr. Little will discuss various perspectives and myths about survey weights. He will argue that, from a robust Bayesian perspective, weights are a key feature of the data that cannot be ignored, but weighting may not be the best way to use them.
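The design-based idea of weighting each case by the inverse of its inclusion probability can be shown in a few lines. The sketch below computes the Horvitz-Thompson estimate of a population total and the corresponding weighted (ratio-type) mean; the function names and the numbers in the usage note are made up for illustration and do not come from the lecture.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total: each case counts 1/p times."""
    return sum(y / p for y, p in zip(values, inclusion_probs))


def weighted_mean(values, inclusion_probs):
    """Weighted mean using inverse-probability weights."""
    weights = [1.0 / p for p in inclusion_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)
```

For example, with values [10, 20] and inclusion probabilities [0.5, 0.25], the two cases represent 2 and 4 population units, so the estimated total is 100 and the weighted mean is 100 / 6. A model-based analysis might instead include the design variables as covariates, which is the contrast the lecture explores.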

2006 ROGER HERRIOT AWARD

Title: Bridging: Roger Herriot's Time to the Present

Abstract:

In the 1980s, the Census Bureau and outside collaborators bridged the transition from the industry and occupation coding system for the 1970 census to that for the 1980 census by creating multiple imputations of 1980-system codes for 1970 census public-use samples. The imputation models were fitted using a relatively small "double-coded" (both 1970 and 1980 systems) sample from the 1970 census. This project had roots in the Population Division at the Bureau. Roger Herriot, as Chief of the Division, was very supportive of the project and contributed ideas to it, and the project was described in William Butz's 1995 ASA Proceedings article in memory of Herriot (http://www.amstat.org/sections/sgovt/outofbox.htm) as one of Herriot's major innovations at the Bureau. This talk will discuss the industry and occupation code project and statistical lessons learned from it. Two recent bridging projects at the National Center for Health Statistics, one addressing the transition from single-race to multiple-race reporting in Federal data collections and one adjusting for differences between self-reported and clinical data in surveys, will be discussed as well. The talk will highlight similarities and differences among the three bridging projects and will point out some outstanding methodological issues.

Title: American Community Survey Weighting and Estimation: ACS Family Equalization

Abstract:

Historically, the American Community Survey (ACS) has produced inconsistent estimates of households and householders and inconsistent estimates of husbands and wives in married-couple households, even though logically these estimates should be equal. In the 2005 ACS, the size of these inconsistencies at the national level was approximately 3.7 million more householders than households and approximately 1.8 million more spouses than married-couple households. Likewise, for unmarried-partner households there are approximately 176,000 more unmarried partners than unmarried-partner households. These data inconsistencies were rooted in the current person weighting methodology, which was independent of the housing unit weighting and did not consider relationship to the householder. This paper describes the current weighting methodology and the changes introduced to reduce these data inconsistencies while having a minimal impact on other estimates and on the variances of the estimates. A three-dimensional raking methodology is used in which the marginal control totals for the first two dimensions, which relate to equalizing spouses and householders, are derived from the survey itself rather than from an independent source. Changes in the estimation of housing unit characteristics are also discussed. Empirical results from the implementation of this new methodology are presented based on the 2004 and 2005 ACS data.
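The raking step mentioned in this abstract can be illustrated with a simplified sketch. The Python code below rakes a two-dimensional table of weighted counts to fixed row and column control totals by iterative proportional fitting. It is an assumption-laden simplification: the ACS methodology described above uses three dimensions and derives some control totals from the survey itself, and the function name, tolerance, and toy table here are hypothetical.

```python
def rake_two_way(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Adjust a two-way table so its margins match the given control totals."""
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Step 1: scale each row to its row control total.
        for i, target in enumerate(row_targets):
            row_sum = sum(cells[i])
            if row_sum > 0:
                factor = target / row_sum
                cells[i] = [c * factor for c in cells[i]]
        # Step 2: scale each column to its column control total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            if col_sum > 0:
                factor = target / col_sum
                for row in cells:
                    row[j] *= factor
        # Stop once the row margins still agree after the column step.
        gap = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return cells


# Hypothetical weighted counts and control totals (both sets of margins sum to 100).
start = [[10.0, 20.0], [30.0, 40.0]]
raked = rake_two_way(start, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
```

Each pass rescales the cells along one dimension at a time, so the adjusted weights stay as close as possible (in a multiplicative sense) to the starting weights while matching the controls.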

Title: An Overview of the Semi-Competing Risk Problem

Abstract:

The semi-competing risks problem refers to a special bivariate time-to-event data structure in which one event is terminal and the other is non-terminal. Since the terminal event may censor the non-terminal event, we may only observe both events if the non-terminal event occurs earlier. This type of data frequently arises in studies of human health and behavior, as multiple event times from subjects are routinely studied. The association between the two times and their marginal distributions may be of interest. In a clinical trial setting or a heterogeneous study population, the covariate effect on either event time may be the focus. However, inference based on semi-competing risks data is often complicated by administrative censoring and by potentially dependent censoring of the non-terminal event by the terminal event if the two event times are associated. This talk will describe the unique features of semi-competing risks data by comparing them with bivariate right-censored time-to-event data and competing risks data, discuss identifiability issues, and review recent methodological advances for making inferences based on semi-competing risks data.

More Information about this and other talks sponsored by the Division of Cancer Prevention: http://www3.cancer.gov/prevention/pob/fellowship/colloquia.html

GEORGETOWN UNIVERSITY SEMINAR

Title: Systems Pharmacology of Type 2 Diabetes: A Case Study for Pharmaceutical Development

Abstract:

A key issue in drug discovery is the appropriate use of animal models to study human disease and therapeutic drug response. Animal models have, in general, been used to mimic human disease based upon relatively few points of analogy. The richness of open discovery "omics" platforms allows a comprehensive measurement of disease and drug response across a range of analyte classes, allowing investigators to better understand the predictive value of animal models for human disease. In a study conducted by GlaxoSmithKline, we compared disease effects and treatment response in two mouse models of type 2 diabetes and in a parallel human study. Three registered medicines for diabetes (rosiglitazone, metformin, and glyburide) were studied, and detailed measurements of transcripts, lipids, metabolites, and proteins were obtained in tissues and biofluids. Integrated data analysis using various multivariate techniques allowed for the generation of predictive fingerprints which shorten the time required to demonstrate treatment efficacy in diabetes trials, as well as allowing the identification of patients most likely to respond to a particular therapy from baseline measurements. In addition, analysis uncovered a previously unsuspected mechanism for rosiglitazone activity in diabetic adipose. The use of Systems Biology approaches with large "omic" datasets holds great promise for deeper understanding of disease biology and pharmacology.

Title: The STATCOM Network: A Role for Students in Pro Bono Statistical Consulting to the Community

Abstract:

The Statistics in the Community (STATCOM) Network is a graduate student-run consulting service that provides free statistical consulting to local governmental and nonprofit community groups. A need for statistical expertise in the local community was identified by a graduate student at Purdue University who founded STATCOM in 2001. Students who participate in STATCOM work in teams on community projects, while applying classroom knowledge and gaining marketable skills.

STATCOM also has a P-12 Outreach component, which aims to increase interest and achievement in statistics among pre-college students through involvement in community events and classrooms. STATCOM, through a Strategic Initiatives Grant from the American Statistical Association, is currently developing a network across institutions of students devoting time to pro bono statistical consulting. This talk will cover the structure of the STATCOM Network at both the national and local levels. In addition, this talk will address how the STATCOM Network can help fill a niche in pro bono statistical efforts and be supported by professional statisticians.

This is joint work with Alexander E. Lipka, Amy E. Watkins and Nilupa S. Gunaratna.

Cherie A. Ochsenfeld

Cherie A. Ochsenfeld received a B.S. in Mathematics/Economics, an M.A. in Teacher Education, and a California Teaching Credential in Mathematics from the University of California, Los Angeles. She received an M.S. in Applied Statistics from California State University, Hayward, and an M.S. in Mathematical Statistics, and is currently a Ph.D. student in Statistics at Purdue University. Her research interests include statistical genetics, nonparametric statistics, and QTL analysis. Cherie is the current Director of STATCOM at Purdue University and has served within the organization for three years.

Gayla R. Olbricht

Gayla R. Olbricht received a B.S. in Mathematics from Missouri State University. She received an M.S. in Applied Statistics and is currently a Ph.D. student in Statistics at Purdue University. Her research interests include statistical genetics, hidden Markov models, and epigenomics. Gayla is the current Student Advisor of STATCOM at Purdue University and has served within the organization for four years.

Shail Butani

Ms. Butani is Chief of the Statistical Methods Staff in the Office of Employment and Unemployment Statistics, U.S. Bureau of Labor Statistics (BLS). She received both her B.A. and M.A. in mathematical statistics from George Washington University. Last year, she was one of the organizers of the ASA Special Interest Group for Volunteers.

From the early to mid-1990s, she led a very successful quantitative literacy (QL) effort for the Washington Statistical Society, particularly in Fairfax County, VA. Major activities were: 1) Conducted and organized speakers and materials for career days for over 100 math classes each year. 2) Participated in and provided consultants for QL workshops conducted by ASA for local teachers. 3) Provided statisticians to assist in developing math curricula for Fairfax County Public Schools. 4) Conducted and provided statisticians for elementary school teachers' workshops. 5) Presented materials at the Female Achieving Mathematics Equity (FAME) project. 6) Provided speakers for Girls Excelling in Math and Science (GEMS) programs. 7) Conducted and provided consultants for Girl Scouts' workshops.

Title: Using the t-distribution to Deal with Outliers in Small Area Estimation

Abstract:

Small area estimation using linear area level models typically assumes normality of the area level random effects (model errors) and of the survey errors of the direct survey estimates. Outlying observations can be a concern, and can arise from outliers in either the model errors or the survey errors, two possibilities with very different implications. We consider both possibilities here and investigate empirically how use of a Bayesian approach with a t-distribution assumed for one of the error components can address potential outliers. The empirical examples use models for U.S. state poverty ratios from the U.S. Census Bureau's Small Area Income and Poverty Estimates program, extending the usual Gaussian models to assume a t-distribution for the model error or survey error. Results are examined to see how they are affected by varying the number of degrees of freedom (assumed known) of the t-distribution. We find that using a t-distribution with low degrees of freedom can diminish the effects of outliers, but in the examples discussed the results do not go as far as approaching outright rejection of observations.

Title: Confidence Interval Coverage in Model-Based Estimation

Abstract:

When there is a strongly related auxiliary variable, model-based estimation can yield more precise estimates from smaller samples. Assumptions are made to build the models, produce estimates, and calculate confidence intervals. The first talk explores confidence interval coverage with deep stratification under scenarios in which the assumptions are not quite correct, such as failing to assume correct scedasticity, recognize curvature, or incorporate an intercept. In these settings confidence interval coverage can be poor, robust, or ultra conservative. The second talk explores confidence interval coverage and Satterthwaite's approximation to the degrees of freedom when two or more model-based estimates are summed in complex sample designs.
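For readers unfamiliar with Satterthwaite's approximation, the following Python sketch shows the standard textbook formula for the effective degrees of freedom of a sum of independent variance estimates. It is offered only as background: the function name and inputs are hypothetical, and the talk's treatment of complex sample designs may differ from this simple independent-components case.

```python
def satterthwaite_df(variance_estimates, degrees_of_freedom):
    """Approximate degrees of freedom for the sum of independent variance estimates.

    Uses df = (sum v_i)^2 / sum(v_i^2 / df_i), where v_i is the estimated
    variance of the i-th summed estimate and df_i its degrees of freedom.
    """
    if len(variance_estimates) != len(degrees_of_freedom):
        raise ValueError("inputs must have the same length")
    numerator = sum(variance_estimates) ** 2
    denominator = sum(
        v ** 2 / d for v, d in zip(variance_estimates, degrees_of_freedom)
    )
    return numerator / denominator


# Two hypothetical estimates with equal variances and 10 degrees of freedom each.
combined_df = satterthwaite_df([4.0, 4.0], [10, 10])
```

With equal variances the degrees of freedom simply add; when one component dominates the total variance, the approximation moves toward that component's degrees of freedom.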

Title: The Role of Statistics and Statisticians in Human Rights

Abstract:

This seminar, designed with human rights practitioners in mind, outlines some examples of situations in which statisticians were asked to contribute to human rights projects. Our hope is to allow networking between the statistical community and the human rights community so that the unique contributions that statisticians can make towards human rights advocacy will be utilized in the future.

David Banks - A Katrina Experience

In 2005 the NSF sponsored a number of research projects on the aftermath of Katrina. This talk describes a survey led by Duke, UNC-Charlotte, and Tulane to study the factors that affected whether or not New Orleans residents chose to evacuate in advance of the storm, and what factors affected their post-Katrina experience. As part of this effort we found that some aspects of classic survey methodology do not work well with unsettled populations, and we developed workarounds that often were surprisingly successful.

Gary Shapiro - Guatemala Police Records

Several warehouses were discovered in Guatemala that contain millions of documents belonging to the National Police prior to 1996. The documents are of interest because some provide information on instances of police violence. The Human Rights Data Analysis Group at Benetech was asked to provide technical assistance for understanding and analyzing the archives. In turn, a group of ASA members provided assistance to Benetech on how sampling of these documents could be done. This talk discusses the complex structure of the archives, the sampling that is now being done, and the type of assistance provided to Benetech.

Paul Zador - Darfur What Could Have Been

Several estimates of deaths during the Darfur crisis will be summarized. The methods used to derive them, and their reliability, will be reviewed and critiqued based in part on comments recently published in GAO's report on the Darfur crisis. The question will be raised: How do we determine what practical difference precise disaster estimates of deaths, hunger, injuries, etc. might make? A volunteer group designed a survey of refugee camps in Chad, but the survey was never conducted. We will describe the survey's design and discuss why it never happened.

Title: Characterization, Modeling and Management of Inferential Risk, Data Quality Risk and Operational Risk in Survey Procedures

Abstract:

This paper explores some conceptual and methodological issues that are important in the design, operation and analysis of large-scale government surveys. We view the design of survey procedures (including initial planning, sample design, data collection, inference and dissemination) as a mixture of optimization and risk management efforts in the presence of constraints and incomplete information. This in turn suggests several potentially rich areas for research in mathematical and applied statistics.

Five topics receive principal attention, beginning with some relatively well-defined technical issues and then expanding to several broader topics related to data quality and risk management. First, a review of the goals, constraints and risk profiles of survey practice suggests a spectrum of potential approaches to survey work, ranging from rigorously predetermined survey procedures at one extreme to highly exploratory analyses of previously collected data at the other extreme. Classical randomization-based procedures are arguably compatible with a mandate for predetermined methodology. Nonetheless, these procedures have limitations arising from efficiency issues, the presence of nonsampling error, and prospective inferential interest beyond the finite population that was sampled. These limitations lead to review of a second class of approaches to the analysis of survey data, based on models for survey variables, auxiliary variables and nonsampling error processes. Third, we use the framework of risk management to explore six dimensions of survey data quality suggested in Brackstone (1999): accuracy (incorporating all of the components of error considered in standard models for total survey error), timeliness, relevance, interpretability, accessibility and coherence. Fourth, we expand our discussion of risk management by considering operational risk, i.e., the risk that one or more steps in a survey procedure may not be carried out as specified. Finally, we note that work with large-scale surveys will involve a mixture of statistical science and statistical technology, and we suggest that the literature on adoption and diffusion of technology can offer important insights into the distribution of expectations, utility functions and behaviors of large survey organizations, data analysts and other data users.

Special WSS Session: Book Signing and Wine Tasting

They invite all of you to celebrate with them. All the authors are longtime WSS members, and each will say a few words about the book and sign copies if requested.

Reiter's Book Store, a Washington landmark for over 60 years, is hosting this special event. There will be wine and cheese for as long as it lasts.

Easy to get to, Reiter's is on 20th Street at the southwest corner of 20th and K, just two blocks along K Street from the Farragut West Metro or four blocks down 20th from the Metro at Dupont Circle.

Title: The Role of Fringe Benefits in Employer and Workforce Dynamics

Abstract:

This paper examines how the evolution of a firm's human capital stock is related to firms' benefit choices using integrated data on firms, their employees, and their benefit offerings from the Census Bureau's Longitudinal Employer-Household Dynamics Program and from IRS Form 5500. It then estimates the relationship between compensation packages and firm productivity and survival, controlling for workforce characteristics. The authors find that firms that offer benefits have significantly lower turnover rates and faster growth rates. Benefit-offering firms have higher labor productivity and higher survival rates, even when controlling for firm and workforce characteristics and the level of wage compensation. Greater labor productivity explains some but not all of the differences in survival rates.

Title: Spatial Association Between Speciated Fine Particles and Mortality

Abstract:

Particulate matter (PM) has been linked to a range of serious cardiovascular and respiratory health problems, including premature mortality. The main objective of our research is to quantify uncertainties about the impacts of fine PM exposure on mortality. A multivariate spatial regression model is developed for the estimation of the risk of mortality associated with fine PM and its components across all counties of the coterminous United States. Different sources of uncertainty in the data and model are explored using the spatial structure of the mortality data and the speciated fine PM. A flexible Bayesian hierarchical model is proposed for a space-time series of counts (mortality) by constructing a likelihood-based version of a generalized Poisson regression model that combines methods for point-level misaligned data and change of support regression. Our results seem to suggest an increase by a factor of two in the risk of mortality due to fine particles with respect to coarse particles. Our study also shows that in the Western United States, the nitrate and crustal components of the speciated fine PM seem to have more impact on mortality than the other components. On the other hand, in the Eastern United States, sulfate and ammonium explain most of the fine PM effect.

BLS STATISTICAL SEMINAR

Title: Robust Prediction of Small Area Means and Distributions

Abstract:

Small area estimation techniques typically rely on mixed models containing random area effects to characterise between area variability. In contrast, Chambers and Tzavidis (2006) describe an approach to small area estimation based on regression M-quantiles. This approach avoids conventional Gaussian assumptions and problems associated with specification of random effects, allowing between area differences to be characterized by the variation of area-specific M-quantile coefficients. However, the resulting M-quantile predictors of small area means can be biased. In this talk I will describe a general framework for robust bias adjusted small area prediction that corrects this problem, and is based on representing a small area predictor as a functional of the Chambers and Dunstan (1986) predictor of the within area distribution function of the target variable. An important advantage of this framework is that it allows integrated prediction of small area means and quantiles. I will demonstrate the usefulness of this framework through both model-based as well as design-based simulation, with the latter based on two realistic survey data sets containing small area information. The talk also includes an application of the bias adjusted M-quantile approach to predicting key percentiles of district level distributions of per-capita household consumption expenditure in Albania in 2002.

Title: Estimation under Ignorable Response Mechanism and Unweighted Imputation

Abstract:

In many surveys, unweighted imputation methods are employed because survey weights are unavailable at the time missing survey data are imputed. In such situations, it is well known that certain customary design-based estimators with imputed data are generally biased even under the usual uniform response mechanism assumption. In this paper, we present an expression for the bias of a design-based estimator under a more realistic ignorable response mechanism and then use this expression to propose a bias-corrected estimator. The second part of the paper deals with a variance estimator that captures different sources of uncertainty. Both theory and results from a Monte Carlo simulation study are presented to justify our approach.

Keywords: ratio imputation, bias-adjusted estimator, variance estimation, small area estimation

Title: Assessment of Coverage and Utility of Residential Address Lists

Abstract:

Coverage and Utility of Purchased Residential Address Lists: A Detailed Review of Selected Local Areas. Sylvia Dohrmann

Recently there has been much interest in using address lists originating from the United States Postal Service (USPS) as area sampling frames in place of on-site enumerations of dwelling units. While it has become clear that purchased USPS lists are less costly than the process of on-site enumeration, it is still unclear whether these lists are adequate substitutes for it. In this presentation, we compare the coverage of purchased lists for a selection of PSUs (Primary Sampling Units), differing in size and composition, with that of area sample frames created using on-site enumeration. We will examine the coverage of the USPS lists by comparing them to enumerated lists and review which types of areas are more completely covered by the USPS lists. We will also demonstrate how the extent to which the addresses on the purchased lists can be geocoded relates to their usefulness as the basis for area sampling frames.

Suitability of the USPS Delivery Sequence File as a Commercial-Building Frame. Stephanie Eckman, Michael Colicchia, Colm O'Muircheartaigh, NORC.

The USPS Delivery Sequence File (DSF) has proven to be an accurate and low-cost frame for household surveys. However, no research organization has evaluated the use of the DSF as a frame of non-residential buildings. Given the success that we and other organizations have had using the DSF as a household frame, we are optimistic that the database will provide good coverage of non-residential buildings as well. But we must assess its accuracy and coverage. We have conducted such an assessment in eleven segments across the county. For each segment, we have both a recent field listing of commercial buildings and the DSF database of non-residential delivery points. We will compare the two frames, presenting match rates and maps showing the discrepancies between the frames.
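The frame comparison described in this abstract can be pictured with a small sketch. The Python code below computes match rates between two address lists after a crude normalization step. Everything here is hypothetical: the function names, the normalization rule, and the toy addresses are illustrative assumptions and do not reflect the matching procedure the authors actually used.

```python
def normalize_address(address):
    """Crude normalization: upper-case, drop periods and commas, collapse spaces."""
    cleaned = address.upper().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def frame_match_rates(field_listing, delivery_file):
    """Compare two address frames and report match rates and discrepancies."""
    field = {normalize_address(a) for a in field_listing}
    dsf = {normalize_address(a) for a in delivery_file}
    matched = field & dsf
    return {
        # Share of field-listed addresses that also appear in the delivery file.
        "field_match_rate": len(matched) / len(field) if field else 0.0,
        # Share of delivery-file addresses that also appear in the field listing.
        "dsf_match_rate": len(matched) / len(dsf) if dsf else 0.0,
        "field_only": sorted(field - dsf),
        "dsf_only": sorted(dsf - field),
    }


# Hypothetical listings for one segment.
listing = ["12 Main St.", "14 Main St", "20 Oak Ave"]
delivery = ["12 main st", "20 Oak Ave.", "22 Oak Ave", "30 Elm Rd"]
rates = frame_match_rates(listing, delivery)
```

The two "only" lists are the discrepancies that a map of the segment would display: buildings found in the field but missing from the file, and delivery points with no field-listed counterpart.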

BLS STATISTICAL SEMINAR

Title: Imputation Using Empirical Likelihood

Abstract:

Imputation is one of the most popular methods for dealing with nonrespondents in survey problems. In this presentation I focus on the use of the empirical likelihood method in imputation, which leads to more efficient and/or robust imputation than other methods such as parametric regression imputation, nonparametric kernel imputation, and random hot deck imputation. More specifically, (1) an empirical likelihood imputation method using information provided by covariates and the propensity function is introduced to produce efficient and doubly robust estimators of population means; (2) an empirical likelihood method is introduced for creating imputation cells in hot deck random imputation where imputation cells are constructed using a categorical covariate; (3) an empirical likelihood method is studied in the case of non-ignorable nonrespondents with either categorical or continuous covariates. Simulation results are presented to show the efficiency and robustness properties of the proposed methods.

The work of Jun Shao was generously supported by grant DMS-0404535 from the National Science Foundation: Methodology, Measurement, and Statistics Program in the Division of Social and Economic Sciences.

Title: A Geostatistical Approach to Linking Geographically-Aggregated Data/A System for Detecting Arbitrarily Shaped Hotspots

Abstracts:

1. A Geostatistical Approach to Linking Geographically-Aggregated Data From Different Sources
Carol A. Gotway Crawford, Office of Workforce and Career Development, CDC; and Linda J. Young, Department of Statistics, University of Florida, Gainesville, FL USA

The widespread availability of digital spatial data and the capabilities of Geographic Information Systems (GIS) make it possible to easily synthesize spatial data from a variety of sources. More often than not, data have been collected at different geographic scales, and each of the scales may be different from the one of interest. Geographic information systems effortlessly handle these types of problems through raster and geoprocessing operations based on proportional allocation and centroid smoothing techniques. However, these techniques do not provide a measure of uncertainty in the estimates and lack the ability to incorporate important covariate information that may be used to improve the estimates. They also often ignore the different spatial supports (e.g., shape and orientation) of the data. On the other hand, statistical solutions to change of support problems are rather specific and difficult to implement. In this presentation, we present a general geostatistical framework for linking geographic data from different sources. This framework incorporates aggregation and disaggregation of spatial data, as well as prediction problems involving overlapping geographic units. It explicitly incorporates the supports of the data, can adjust for covariate values measured on different spatial units at different scales, provides a measure of uncertainty for the resulting predictions, and is computationally feasible within a GIS. The new framework we develop also includes a new approach for simultaneous estimation of mean and covariance functions from aggregated data using generalized estimating equations.

2. Upper Level Set Scan Statistic System for Detecting Arbitrarily Shaped Hotspots
Reza Modarres, Professor and Chair, Department of Statistics, GWU

The Upper Level Set Scan Statistic (ULS), its theory, design and implementation, and its extension to bivariate data are discussed. We provide the ULS-Hotspot algorithm, which maintains a list of connected components of the rate surface at each level of the ULS tree. The tree is grown in the immediate successor list, which provides a computationally efficient method for likelihood evaluation, visualization and storage. An example shows how the zones are formed and how the likelihood function is developed for each candidate zone. The general theory of bivariate hotspot detection is discussed, including the bivariate binomial and Poisson models and the multivariate exceedance approach. We propose the joint and intersection methods for detecting bivariate hotspots and study the sensitivity of the joint hotspots to the degree of association between the variables. We investigate the hotspots in two diverse applications, one in microbial risk assessment and the other in the mapping of crime hotspots.

Title: Modeling Multiple-Response Categorical Data From Complex Surveys

Abstract:

Although "choose all that apply" questions are common in modern surveys, methods for analyzing associations among responses to such questions have only recently been developed. These methods are generally valid only for simple random sampling, but many "choose all that apply" and related questions appear in surveys conducted under more complex sampling plans. The purpose of this talk is to provide statistical analysis methods that can be applied to "choose all that apply" questions in complex survey sampling situations. Loglinear models fit to marginal data are used to describe associations among the multiple responses that occur with this type of data. Model comparison test statistics along with their asymptotic distributions are presented in order to choose a good fitting model. Estimates of odds ratios and their corresponding standard errors are provided in order to measure associations among responses.

GEORGETOWN UNIVERSITY SEMINAR

Title: Bayesian Methods for Proteomic Biomarker Discovery Using Functional Mixed Models

Abstract:

Various proteomic assays yield spiky functional data. For example, MALDI-TOF and SELDI-TOF yield one-dimensional spectra with many peaks, and 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample. In this talk, I will discuss how to identify candidate biomarkers for various types of proteomic data using methods based on Bayesian wavelet-based functional mixed models. This approach models the functions in their entirety, and so avoids reliance on peak or spot detection methods. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while adjusting for clinical or experimental covariates that may affect both the intensities and locations of the peaks and spots in the data. I will demonstrate how to identify regions of the functions that are differentially expressed across experimental conditions, in a way that takes both statistical and clinical significance into account and controls the Bayesian false discovery rate at a pre-specified level. Time allowing, I will also demonstrate how to use this framework as the basis for classifying future samples based on their proteomic profiles in a way that can also combine information across multiple sources of data, including proteomic, genomic, and clinical, and may also discuss improvements of the modeling framework that result in more robust inference. These methods will be applied to a series of proteomic data sets from cancer-related studies.
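One common way to control a Bayesian false discovery rate at a pre-specified level can be sketched as follows. The Python code assumes that each location already has a posterior probability of being differentially expressed, then flags the largest set of locations whose average posterior probability of being a false discovery stays at or below the target. This is a generic illustration under those assumptions, with hypothetical names and numbers, and is not necessarily the rule used in the talk.

```python
def select_by_bayesian_fdr(posterior_probs, alpha=0.05):
    """Return indices flagged so that the expected false discovery rate is <= alpha.

    posterior_probs[i] is the (assumed given) posterior probability that
    location i is truly differentially expressed.
    """
    # Rank locations from most to least likely to be a true discovery.
    order = sorted(
        range(len(posterior_probs)),
        key=lambda i: posterior_probs[i],
        reverse=True,
    )
    selected = []
    expected_false = 0.0
    for rank, index in enumerate(order, start=1):
        # Each flagged location adds its probability of being a false discovery.
        expected_false += 1.0 - posterior_probs[index]
        if expected_false / rank <= alpha:
            selected = order[:rank]
    return sorted(selected)


# Hypothetical posterior probabilities for five locations.
probs = [0.99, 0.97, 0.90, 0.50, 0.10]
flagged = select_by_bayesian_fdr(probs, alpha=0.05)
```

In this toy example the first three locations are flagged: their average false discovery probability is about 0.047, while adding the fourth would raise it to 0.16.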

GEORGE MASON UNIVERSITY CDS/CCDS/STATISTICS COLLOQUIUM SERIES

Title: Experiences with Congressional Testimony: Statistics and The Hockey Stick

Abstract:

Rarely does the federal government need advice on theoretical statistics. I would like to talk about one exception. Efforts to persuade Congress to enact legislation that affects public policy are constantly being made by lobbyists who are paid by special interests. While this mode of operation is frequently extremely effective for achieving the goals of the special interest groups, it often does not serve the public interests in the best possible way. As counterpoint to this mode of operation, pro bono interaction with individual legislators and especially testimony in Congressional hearings can be remarkably effective in presenting a balanced picture. The debate on anthropogenic global warming has in many ways left scientific discourse and landed in political polemic. In this talk I will discuss our positive and negative experiences in formulating testimony on this topic.

Title: An Introduction to the Key National Indicators Initiative: the State of the USA

Abstract:

Several countries around the world have developed organized systems of statistical indicators that are used to inform civil discourse and to track change in the basic economic, social, and environmental status of the country. The audiences for these key national indicator systems include both policy makers in central and local governments and interested citizens.

The State of the USA is envisioned to be a web-based resource permitting user-friendly presentation of key indicators at national and subnational levels. It will have explicit quality criteria and interest thresholds that inform what indicators are contained in the system. It will include official government statistics, private sector statistics, and academic statistics.

The State of the USA is currently funded by grants from several private foundations and is being incubated in the National Academies.

This WSS session will provide an introduction to the inception and development of the State of the USA, its basic goals, and its emergent organization. A demonstration of a test web site, illustrating some of the features of the indicator presentation, will be given.

Title: New Experiments on the Design of Complex Survey Questions

Abstract:

Survey researchers often need their questions to convey very specific information to respondents; for example, questions may include complex definitions, instructions to include or exclude various considerations while answering, and a particular set of closed-ended responses. Although questionnaire design principles provide some advice on constructing complex questions, little empirical evidence demonstrates the superiority of certain decisions over others. For example, in some questions, important respondent instructions "dangle" after the core question has been asked; one alternative is to provide such definitions before asking the core question.

We have conducted several rounds of RDD telephone surveys with split-ballot experiments to explore such issues. This seminar reports on the latest round of 425 interviews conducted via an RDD telephone survey, in which respondents received alternative versions of various survey questions. For example, in some experiments, alternative questions used the same words but were structured differently. Other experiments compared the use of examples vs. definitions to explain complex concepts, compared the use of one vs. two questions to measure the same phenomenon, and compared questions before and after cognitive interviews had been used to clarify key concepts. With permission, interviews were tape recorded and behavior-coded, making it possible to compare various interviewer and respondent difficulties across question versions, in addition to comparing differences in response distributions.

Taken in conjunction with findings from previous rounds of experiments, the results begin to suggest some general design principles for complex questions. For example, the disadvantages of "dangling qualifiers" are becoming clear, as are the advantages of using multiple questions to disentangle certain complex concepts. The seminar will report results of these and other experimental comparisons, with an eye toward providing more systematic questionnaire design guidance.

Following the seminar, all are welcome and invited to attend a social hour at Capitol City Brewing Company, located in the same building as the talk.

U.S. BUREAU OF CENSUS
STATISTICAL RESEARCH DIVISION SEMINAR

Topic: Unduplicating the 2010 Census

Abstract:

The current plan for the 2010 Census includes a nationwide unduplication operation. One potential problem is the possibility of large numbers of false positives. To help evaluate the extent of this problem, the unduplication procedures have been run on the data from the 2000 Census.

The first section of the talk describes a simple approach to take full advantage of multiple processors through writing C programs and using basic UNIX commands. This section also describes the programming concepts used in unduplicating the entire country using BigMatch and the SRD Matcher. It also describes metaprogramming techniques used in this large system and documents some of the errors and problems made during development. One example involves keeping track of all files so that multiple runs do not interfere with each other. A similar system is currently expected to be used for unduplication of the 2010 census on production machines.

The second section of the talk gives an overview of the results of our analysis. Most of the problem with apparent false matches seems to be concentrated in the most common surnames and the most common Hispanic surnames, especially for matches outside the state. Name frequency does not seem to have much effect when there are multiple links of reasonable quality between housing units or when the phone number matches.

This event is accessible to persons with disabilities. Please direct all requests for sign language interpreting services, Computer Aided Real-time Translation (CART), or other accommodation needs, to HRD.Disability.Program@census.gov. If you have any questions concerning accommodations, please contact the Disability Program Office at 301-763-4060 (Voice), 301-763-0376 (TTY), or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.

Title: Survey Methodology for Assessing Geographically Isolated Wetlands Map Accuracy

Abstract:

Wetlands provide significant environmental benefits such as assimilation of pollutants, flood water storage, water recharge, and fish and wildlife habitat. Geographically isolated wetlands (GIWs) can provide the same benefits as wetlands in general, and are particularly vulnerable to losses from urbanization and agriculture precisely because they are geographically isolated and have varying amounts of regulatory protection. Currently, there is no dependable and cost-effective method to generate an accurate GIW map without sending a field scientist to perform surveys or requiring image technicians to perform heads-up digitizing of aerial photography. By using statistically valid estimates of accuracy rates, one can evaluate the quality of the information contained in GIW maps. Accuracy rates are used to describe the misclassification errors of the maps. A probability sampling survey methodology that balances statistical considerations, expert opinion, and operational considerations is proposed for assessing the accuracy of GIW maps. The proposed sampling design is based on a stratified multi-stage sampling design that addresses sample size requirements for the different strata and types of GIWs and also recognizes the need for spatial coverage while minimizing operational efforts. Expressions for design-based accuracy estimates and an estimate of the number of GIWs, as well as their corresponding variances, are also provided.

A simulation exercise is used to illustrate the proposed sampling methodology. A GIW map for Brunswick County in North Carolina, created using historical data, was used as the sampling frame. The GIW map was created from a combination of satellite imagery, classification tools to process the imagery, and auxiliary information. The sampling methodology was used to randomly select sites from this GIW map. An updated GIW map for the same county, showing the exact locations of GIWs, was used to provide "ground-truth" observations from wetland delineations approved by the US Army Corps of Engineers. Accuracy estimates were calculated by comparing site classification differences obtained by using both the original and updated GIW maps. Survey-based accuracy estimates and their corresponding variance estimates were calculated.
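
To make the idea of a design-based accuracy estimate concrete, here is a minimal sketch for a much simpler design than the one proposed in the talk: single-stage stratified simple random sampling without replacement, using the standard textbook estimator of a proportion and its variance. The function name and the input layout are assumptions made for illustration, and the multi-stage structure of the actual design is deliberately ignored.

```python
def stratified_accuracy(strata):
    """Estimate overall map accuracy and its variance from a stratified sample.

    strata: list of (N_h, n_h, correct_h) tuples (hypothetical layout), where
    N_h is the number of mapped sites in stratum h, n_h the number sampled
    (at least 2), and correct_h the number of sampled sites whose map class
    agrees with the reference ("ground-truth") class.
    """
    total = sum(N for N, _, _ in strata)
    estimate = 0.0
    variance = 0.0
    for N, n, correct in strata:
        weight = N / total   # stratum share of the sampling frame
        p = correct / n      # within-stratum accuracy rate
        fpc = 1.0 - n / N    # finite population correction
        estimate += weight * p
        variance += weight ** 2 * fpc * p * (1.0 - p) / (n - 1)
    return estimate, variance


if __name__ == "__main__":
    est, var = stratified_accuracy([(100, 10, 9), (300, 20, 10)])
    print(f"accuracy = {est:.3f}, standard error = {var ** 0.5:.3f}")
```

The stratum weights keep a heavily sampled but small stratum from dominating the overall accuracy rate.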

GEORGETOWN UNIVERSITY SEMINAR

Title: A Geometric Approach to Comparing Treatments for Rapidly Fatal Diseases

Abstract:

In therapy of rapidly fatal diseases, early treatment efficacy often is characterized by an event, "response," which is observed relatively quickly. Since the risk of death decreases at the time of response, it is desirable not only to achieve a response, but to do so as rapidly as possible. We propose a Bayesian method for comparing treatments in this setting based on a competing risks model for response and death without response. Treatment effect is characterized by a two-dimensional parameter consisting of the probability of response within a specified time and the mean time to response. Several target parameter pairs are elicited from the physician so that, for a reference covariate vector, all elicited pairs embody the same improvement in treatment efficacy compared to a fixed standard. A curve is fit to the elicited pairs and used to determine a two-dimensional parameter set in which a new treatment is considered superior to the standard. Posterior probabilities of this set are used to construct rules for the treatment comparison and safety monitoring. The method is illustrated by a randomized trial comparing two cord blood transplantation methods.

AMERICAN UNIVERSITY
DEPARTMENT OF MATHEMATICS AND STATISTICS COLLOQUIUM

Title: A Bayesian IRT Model for the Comparison of Survey Item Characteristics under Dual Modes of Administration

Abstract:

Ordinal scale survey response items are often used in quantifying a latent trait. When the survey is offered in multiple modes of administration, e.g., telephone interview or self-administered questionnaire, the mode of administration may affect the characteristics of the survey items, such that an individual's responses may differ depending on the mode. Using a mental health survey as a case study, the Bayesian Differential Mode Effects Model (BDMEM) is introduced as an Item Response Theory (IRT) model-based solution for the detection, quantification and reconciliation of mode of administration effects at the item, response category, and scale levels. The BDMEM is compared to the popular approach of differential item functioning (DIF), and its advantages over DIF are highlighted, including the optimal use of repeated measures, the detection of differences in categorical response probabilities, and the automatic equating of results under different modes.

U.S. BUREAU OF CENSUS
STATISTICAL RESEARCH DIVISION SEMINAR

Topic: Alternative Survey Sample Designs, Seminar #1: Network, Spatial, and Adaptive Sampling

Abstract:

The Census Bureau's Demographic Survey Sample Redesign Program, among other things, is responsible for research into improving the designs of demographic surveys, particularly focused on the design of survey sampling. Historically, the research into improving sample design has been restricted to "mainstream" methods like basic stratification, multi-stage designs, systematic sampling, probability-proportional-to-size sampling, clustering, and simple random sampling. Over the past thirty years or more, we have increasingly faced reduced response rates and higher costs coupled with an increasing demand for more data on all types of populations. More recently, dramatic increases in computing power and availability of auxiliary data from administrative records have indicated that we may have more options than we did when we established our current methodology.

This seminar series is the beginning of an exploration into alternative methods of sampling. In this first seminar, from 9:30 to 10:30, we will hear about Professor Thompson's work on network, spatial, and adaptive sampling. He will discuss various alternative approaches and their statistical properties. Following Professor Thompson's presentation, there will be a 15-minute break, and then from 10:45 - 11:30, Professor Jean Opsomer will provide discussion about the methods and their potential in demographic surveys, particularly focusing on impact on estimation. The seminar will conclude with an open discussion session from 11:30 - 11:45 with 15 additional minutes available if necessary.

Seminar #2 is currently slated for December 10, 2007 and will feature Professor Sharon Lohr of Arizona State University discussing multiple overlapping frame designs.

This event is accessible to persons with disabilities. Please direct all requests for sign language interpreting services, Computer Aided Real-time Translation (CART), or other accommodation needs, to HRD.Disability.Program@census.gov. If you have any questions concerning accommodations, please contact the Disability Program Office at 301-763-4060 (Voice), 301-763-0376 (TTY) or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.

Title: Small Area Estimation: An Empirical Best Linear Unbiased Prediction Approach

Abstract:

In this paper, based on the general Fay-Herriot model, we evaluate the performance of different variance component estimation methods in model-based point estimates and interval predictions. Following Morris' comments, we propose a new approach to estimating the model variance that always produces positive estimates. Its positiveness and consistency are also established. We also propose a parametric bootstrap prediction interval method using the weighted least squares estimator and the ADM estimator under the general Fay-Herriot model, and obtain coverage accuracy of O(m^{-3/2}). Extensive simulations and real-life data analyses are conducted. Our results suggest that this new approach performs better.
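
For readers unfamiliar with the setting, the sketch below shows how an estimate of the model variance enters an empirical best linear unbiased predictor. It assumes the simplest special case of the Fay-Herriot model, with an intercept only and known sampling variances, and it takes the model variance estimate as an input rather than reproducing the estimator proposed in the paper. The function and argument names are hypothetical.

```python
def fay_herriot_eblup(y, D, A):
    """EBLUPs for an intercept-only Fay-Herriot model (simplifying assumption).

    y[i]: direct survey estimate for area i
    D[i]: known sampling variance of y[i]
    A:    estimate of the model variance, supplied by the caller (positive)
    """
    weights = [1.0 / (A + d) for d in D]
    # Weighted least squares estimate of the common mean.
    mu = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    predictions = []
    for yi, d in zip(y, D):
        shrink = d / (A + d)  # weight placed on the synthetic estimate mu
        predictions.append((1.0 - shrink) * yi + shrink * mu)
    return mu, predictions


if __name__ == "__main__":
    mu, preds = fay_herriot_eblup([10.0, 20.0], [1.0, 1.0], A=1.0)
    print(mu, preds)
```

If the supplied model variance were zero, every shrinkage weight would equal one and all areas would receive the same prediction, which is one reason a strictly positive variance estimate is attractive.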

GEORGE WASHINGTON UNIVERSITY
DEPARTMENT OF STATISTICS SEMINAR

Title: Multi-Stage Sampling for Genetic Studies

Abstract:

In the first part of the talk, I will review various multi-stage sampling designs in classical genetic linkage and association studies. This part does not involve much statistics. In the second part, I will focus on a cost-effective two-stage design for genome-wide case-control association studies. Some test statistics for this two-stage design will also be discussed. Most of the talk is based on an article with Robert Elston and Danyu Lin to appear in Annual Review of Genomics and Human Genetics (Sept 2007).

For a complete listing of our current seminars, visit http://www.gwu.edu/~stat/seminar.htm. For more information about the George Washington University Department of Statistics Seminars, contact:
Efstathia Bura, Department of Statistics
E-mail: ebura@gwu.edu, Phone: 202-994-6358
Joseph L. Gastwirth, Department of Statistics
E-mail: jlgast@gwu.edu, Phone: 202-994-6548

GEORGE MASON UNIVERSITY
CDS/CCDS/STATISTICS COLLOQUIUM SERIES

Title: Text Data Mining in Defense Applications

Abstract:

This