Abstract:
One of the central biomedical issues for our time is the identification and fusion of multiple biomarkers for a specified diagnostic task. The fusion stage can be recognized immediately as a special case of the problem of statistical learning. That is, one trains a statistical learning machine (SLM) with cases whose health status or outcome is already known and then tests the learning machine on cases previously unseen. Almost all investigators of SLMs are familiar with early optimism, tempered by later experience. Assessment methods are needed that provide estimates not only of mean performance, but also of uncertainties associated with the finite size of the training and testing samples. Taking the work of Efron and Tibshirani as a point of departure, we have developed methods for calculating the statistical influence function for figures of merit based not only on probability of misclassification but also on the full receiver operating characteristic (ROC) or true-positive versus false-positive rate and several of its summary measures and their uncertainties. These methods have broad applicability across most diagnostic fields that plan to use multiple biomarkers and, in particular, are useful for designing a target database size based on a pilot study.
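As a rough illustration of reporting a figure of merit together with its finite-sample uncertainty, the sketch below computes the area under the ROC curve (AUC) and a simple bootstrap standard error. It is not the influence-function method described in the abstract: it resamples only the test scores, so it ignores training-set variability. The function names and the example scores are hypothetical.

```python
import random

def auc(neg_scores, pos_scores):
    """AUC as the Mann-Whitney statistic: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_auc_se(neg_scores, pos_scores, n_boot=1000, seed=0):
    """Bootstrap standard error of the AUC, resampling test cases within each class."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        neg_b = [rng.choice(neg_scores) for _ in neg_scores]
        pos_b = [rng.choice(pos_scores) for _ in pos_scores]
        reps.append(auc(neg_b, pos_b))
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5

# Hypothetical classifier scores for non-diseased and diseased test cases.
neg = [0.10, 0.40, 0.35, 0.80]
pos = [0.30, 0.60, 0.70, 0.90]
estimate = auc(neg, pos)
std_err = bootstrap_auc_se(neg, pos)
```

With so few test cases the standard error is large relative to the estimate, which is the kind of information a pilot study needs when choosing a target database size.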
Abstract:
There are many methods of correcting microarray data for non-biological sources of error. Authors routinely supply software or code so that interested analysts can implement their methods. Even with a thorough reading of associated references, it is not always clear how requisite parts of the method are calculated in the software packages. However, it is important to have an understanding of such details, as this understanding is necessary for proper use of the output, or for implementing extensions to the model.
In this paper, the calculation of parameter estimates used in Robust Multichip Average (RMA), a popular preprocessing algorithm for Affymetrix GeneChip brand microarrays, is elucidated. The background correction method for RMA assumes that the perfect match (PM) intensities observed result from a convolution of the true signal, assumed to be exponentially distributed, and a background noise component, assumed to have a normal distribution. A conditional expectation is calculated to estimate signal. Estimates of the mean and variance of the normal distribution and the rate parameter of the exponential distribution are needed to calculate this expectation. Simulation studies show that the current estimates are flawed; therefore, new ones are suggested. We examine the performance of preprocessing under the exponential-normal convolution model using several different methods to estimate the parameters.
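The conditional expectation can be sketched directly from the model as stated. If the observed intensity is O = S + B with S exponential (rate alpha) and B normal (mean mu, standard deviation sigma), then S given O = o is a normal with mean a = o - mu - alpha * sigma^2 and standard deviation sigma, truncated to be non-negative, so E[S | O = o] = a + sigma * phi(a / sigma) / Phi(a / sigma). The code below assumes the three parameters are already estimated; how to estimate them is the subject of the paper and is not shown. Function and parameter names are illustrative, and software implementations may use a variant of this expression.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(o, mu, sigma, alpha):
    """E[S | O = o] under the exponential-normal convolution model.

    Assumes mu, sigma, alpha are given. For extremely negative a / sigma the
    normal CDF underflows to zero, so this sketch is not safe in that regime.
    """
    a = o - mu - alpha * sigma ** 2
    z = a / sigma
    return a + sigma * norm_pdf(z) / norm_cdf(z)

# Hypothetical parameter values, for illustration only.
low = expected_signal(100.0, mu=100.0, sigma=10.0, alpha=0.01)
high = expected_signal(1000.0, mu=100.0, sigma=10.0, alpha=0.01)
```

The corrected value is always positive, and for intensities far above the background mean it is close to a, that is, the observed intensity minus the background mean and a small shift alpha * sigma^2.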
Abstract:
Enhancing the flexibility of clinical trial designs is currently a topic of great interest. Proper adaptation of the trial design is one way to achieve this goal and has drawn much attention from clinical trialists. In past decades, the classical design has been improved to allow the trial to be terminated early if the experimental treatment is proven effective or deemed harmful or futile, based on the data accumulating during the course of the trial. The statistical validity of such an enhanced design, in terms of type I error, is maintained. The operational aspects of this design can still be an issue but, by and large, there are many good models for dealing with them. As the flexibility of trial design is enhanced further, the risk that the resulting trial may not be interpretable increases. In this presentation we shall share our review experience, discuss the many issues arising from the use of more flexible designs, and hopefully stimulate further research in this area.
Abstract:
Turbulent change is the hallmark of the U.S. economy, and one of the reasons for its success. Every week, in every part of the economy, and in every corner of the country, some firms are shutting down and others are starting up, some jobs are being created and others are being destroyed, some workers are being hired and others are quitting or being laid off.
The presentation will summarize the analysis from a new book "Economic Turbulence" derived from the use of the LEHD data at the Census Bureau, as well as from interviews with firms and workers in each industry.
Three key topics will be discussed:
Abstract:
In a Phase II cancer trial it may be advantageous to open enrollment to several patient populations, each with a very different null probability of response. For example in a trial of a novel therapeutic agent for relapsed Acute Myelogenous Leukemia (AML), patients in a first relapse may have a 30% probability of response under standard treatment, while patients in second relapse or higher may have only a 10% probability of response. These Phase II trials are generally uncontrolled (they often use "historical controls"), and the experimental agent may be expected to induce certain Grade 3 toxicities which would not be considered dose limiting. Furthermore, historically most of these Phase II trials can be expected to prove no better than standard-of-care. Phase II trials with these characteristics are usually designed with an early stopping rule which checks for initial evidence of efficacy after a first stage enrollment target is met. If there is insufficient evidence, the trial stops for futility. We discuss the standard two-stage optimal designs in this situation, and describe their operating characteristics under heterogeneous patient enrollment. These are compared to other approaches in the literature. Simple, approximately optimal designs which account for heterogeneity are presented. We recommend a practical adaptive design strategy which we have implemented at Moores UCSD Cancer Center.
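The operating characteristics of a two-stage futility design can be computed from binomial probabilities. The sketch below assumes the usual form of such a design: enroll n1 patients, stop for futility if at most r1 respond, otherwise enroll up to n patients in total and declare the agent promising if more than r respond. It treats all patients as sharing a single response probability p, so it does not address the heterogeneous enrollment that the talk is about. The design numbers in the example are illustrative, not taken from the talk.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); zero when k is negative."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))

def two_stage_characteristics(r1, n1, r, n, p):
    """Return (P(early termination), P(declare promising), expected sample size)."""
    n2 = n - n1
    pet = binom_cdf(r1, n1, p)
    # Continue only if stage-1 responses x exceed r1; then need total > r.
    promising = sum(
        binom_pmf(x, n1, p) * (1.0 - binom_cdf(r - x, n2, p))
        for x in range(r1 + 1, n1 + 1)
    )
    expected_n = n1 + (1.0 - pet) * n2
    return pet, promising, expected_n

# Illustrative design for a 10% null response rate and a 30% target rate.
pet0, alpha_hat, en0 = two_stage_characteristics(1, 10, 5, 29, p=0.10)
_, power_hat, _ = two_stage_characteristics(1, 10, 5, 29, p=0.30)
```

Running the same design at a different null probability, for example 30% for first-relapse patients, shows how quickly the error rates drift when one design is applied to a mixed population.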
Abstract:
In unstable situations, population based data are the most reliable method of estimating mortality and other health indicators. In many conflicts and fragile state settings, however, collecting such data is difficult to do. Aside from the physical dangers, there is often an incomplete understanding of population numbers, population locations, migration patterns, and health status of the population. That lack of understanding contributes to many methodological challenges. However, population based data are increasingly important in planning protection of and assistance to affected populations, as well as for reconstruction policy.
In Iraq we have undertaken two population-based national surveys of mortality related to conflict using a cluster survey approach. The first covered the period from January 2002 until July 2004, using 33 clusters with 988 households and 7,868 persons. That survey estimated an excess mortality of over \,000 persons following the March 2003 invasion. The second survey covered the period from January 2002 until July 2006. That survey included 47 clusters containing 1,849 households and 12,801 persons. From that survey an excess mortality of 654,965 (CI 392,797-942,636) was estimated, with 601,027 deaths attributed to violent causes.
The presentations will discuss the methodological and ethical issues involved in conducting our research in Iraq.
Abstract:
Randomization provides a basis for inference, but it is rarely taken advantage of. We discuss randomization tests based on the family of linear rank tests in the context of sequential monitoring of clinical trials. Such tests are applicable for categorical, continuous, and survival time outcomes. We prove the asymptotic joint normality of sequentially monitored test statistics, which allows the computation of sequential monitoring critical values under the Lan-DeMets procedure. Since randomization tests are not based on likelihoods, the concept of information is murky. We give an alternate definition of information and show how to compute it for different randomization procedures. The randomization procedures we discuss are the permuted block design, stratified block design, and stratified urn design. We illustrate these results by reanalyzing a clinical trial in retinopathy.
Abstract:
The USDA/National Agricultural Statistics Service (NASS) annually produces remote sensing based crop specific classifications and acreage estimates over the major growing regions of the United States using medium resolution satellite imagery. The classifications are published in the public domain as the Cropland Data Layer (CDL) after the publication of the official release of county estimates. This program has mapped 24 total states since 1997 and is currently mapping 11 states annually (AR, IA, IL, IN, LA, MO, MS, ND, NE, WA and WI). This program previously used Landsat TM and ETM+ satellite imagery, the NASS June Agricultural Survey (JAS) segments for the ground truth information, and NASS public domain Peditor software for producing the classification and regression estimates. The unpredictability of the aging Landsat program assets, the labor intensive nature of digitizing June Agricultural Survey input for the Cropland Data Layer program, and the potential efficiency gains using commercial software warranted the need to investigate new program methods.
In 2004, NASS investigated alternative sensors to the Landsat platform, annually acquiring ResourceSat-1 Advanced Wide Field Sensor (AWiFS) data over the active Cropland Data Layer states. Additionally, evaluations were carried out on alternative ground truth methodologies to the June Agricultural Survey, using data collected through the USDA/Farm Service Agency (FSA) Common Land Unit (CLU) program. Testing and comparisons with regression tree See5 software against Peditor began in 2006 to produce the Cropland Data Layer. The goal was to determine which application was more efficient and delivered the most accurate estimates.
Accuracy assessments and acreage indications determined that the AWiFS significantly reduced the statistical variance of acreage indications from using the June Agricultural Survey area sampling frame, delivering a potential successor to the Landsat platform. In 2006, pilot testing was complete and the AWiFS sensor was selected as the exclusive source of imagery for the production of the Cropland Data Layer and acreage estimates. The Farm Service Agency Common Land Unit program provides a comprehensive national digitized and attributed GIS dataset collected annually for inclusion into programs like the Cropland Data Layer. Commercial image processing programs such as See5 were tested in 2006 against the AWiFS imagery and Common Land Unit datasets, providing evidence of efficiency gains in statistical accuracy, scope of coverage, and time of delivery.
Abstract:
The study of human rights violations and the development of statistical models that can offer explanations are severely handicapped by a lack of adequate data. Most information on human rights is embedded in qualitative reports. Quantitative data that do exist tend to be limited to rough counts of violations or numeric indexes with little if any methodological transparency. This presentation will describe an extensive and rigorous coding project which uses the annual U.S. State Department's International Religious Freedom Reports as the primary information source and the procedures developed to check the coded data against alternative sources. The usefulness of these coded data will be demonstrated by testing an explanatory theory of religious persecution using structural equation modeling. The presentation will conclude with a discussion of how this research could be extended to the measurement and statistical analysis of other human rights.
Abstract:
The use of mass spectrometry for clinical applications has extraordinary potential for accurate, early, and minimally invasive diagnoses of complex diseases, such as cancer, which require sensitive diagnostic tools for prognosis and development of flexible treatment strategies. Unfortunately, current mass spectrometry data analysis options available to researchers often require improvised combinations of tools provided by instrument manufacturers, third-parties, and in-house development. The lack of unified interfaces to access existing resources presents a significant bottleneck in the research and discovery process.
In this seminar, we present a modular software tool for the analysis of mass spectrometry profiling data that aims to address this bottleneck. The modules that comprise the analysis workflows can be broadly classified into three categories: signal processing tools, variable selection algorithms, and classification utilities. The software tool provides a platform that allows researchers to construct, validate, and optimize classification workflows of serum samples analyzed with time-of-flight mass spectrometry. Our work suggests that this type of flexible and interactive architecture is highly useful for 1) the development of mass spectrometry workflows and 2) biomarker discovery and validation in clinical environments.
Abstract:
Motivated by an increasing number of Bayesian hierarchical model applications, we investigate several diagnostic techniques when the fitted model includes some hierarchical structure, but the data are from a model with additional, unknown hierarchical structure. We start by studying the simple situation where the data come from a normal model with two-stage hierarchical structure while the fitted model does not have any hierarchical structure, and then extend this to the case where the fitted model has two-stage normal hierarchical structure while the data come from a model with three-stage normal structure. Our investigation suggests two promising techniques: distribution of individual posterior predictive p values and the conventional posterior predictive p value with the F statistic as a checking function. Finally, we apply these two techniques to examine the fit of a model for data from the Patterns of Care Study, a two-stage cluster sample of cancer patients undergoing radiation therapy.
Abstract:
It is common practice to use finite population correction factors (fpcs) in estimating variances when sampling from a finite population. Various approximate fpcs are sometimes used with more complex designs. When the interest is in a wider population than the specific finite sampling frame, many argue that it suffices to drop the fpc from the variance estimates, but others maintain this is appropriate only in a limited number of contexts.
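For the simplest case, simple random sampling without replacement of n units from N, the estimated variance of the sample mean is (1 - n/N) * s^2 / n, and dropping the fpc just removes the factor (1 - n/N). The sketch below shows both versions; the function name and data are hypothetical, and more complex designs need their own (often approximate) corrections.

```python
from statistics import variance

def srs_mean_variance(sample, population_size=None):
    """Estimated variance of the sample mean under simple random sampling.

    With population_size given, applies the fpc (1 - n/N); with None, the
    fpc is dropped, as when inference targets a wider population.
    """
    n = len(sample)
    s2 = variance(sample)  # sample variance with divisor n - 1
    fpc = 1.0 if population_size is None else 1.0 - n / population_size
    return fpc * s2 / n

sample = [2.0, 4.0, 6.0, 8.0]
with_fpc = srs_mean_variance(sample, population_size=8)
without_fpc = srs_mean_variance(sample)
```

Here half the frame is sampled, so the fpc halves the variance estimate; when n/N is small the two versions are nearly identical.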
Abstract:
Counterintuitive concepts have been identified as major aspects of religious belief, and have been used to explain the retention and transmission of such beliefs. To resolve inconsistencies within this literature, three experiments were conducted to study the effect of context on recall. Context was found to be the key element affecting recall and the discrepancy among prior studies was resolved. The results imply that the nature of the surrounding context must be included in any account of the formation and transmission of religious concepts. A recent extension of this work involving type of context (science or religion) will also be introduced.
This seminar is physically accessible to persons with disabilities. For TTY callers, please use the Federal Relay Service at 1-800-877-8339. This is a free and confidential service. To obtain Sign Language Interpreting services/CART (captioning real time) or auxiliary aids, please send your requests via e-mail to EEO Interpreting & CART: eeo.interpreting.&.CART@census.gov or TTY 301-457-2540, or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.
Abstract:
In analyzing environmental data, it is common practice to assume that such data are from a 2-parameter lognormal distribution if right-skewed and from a normal distribution if symmetrical. It is not generally recognized that the Johnson SB distribution provides a continuum of distributions between the normal and lognormal distributions, which constitute the SB asymptotes. The Johnson SB transforms experimental data bounded by a minimum value (Xmin) and a maximum value (Xmax) into a normally distributed variable Y = ln [(x - Xmin) / (Xmax - x)], which is bounded as -infinity < Y < +infinity. As Xmax goes to +infinity and Xmin goes to 0, the distribution is asymptotically 2-parameter lognormal. As Xmax goes to +infinity and Xmin goes to -infinity, the distribution is asymptotically normal.
Methods of objectively determining the 4 optimal parameters of the SB distribution (Xmin, Xmax, mu, sigma) by maximum likelihood estimation procedures are reviewed. Bruce Hill (1963) showed that the maximum likelihood solution for the three-parameter lognormal yields degenerate and absurd solutions as Xmin goes in the limit to the minimum observation; the likelihood of the minimum observation tends to infinity, as the likelihood of all other observations tends to zero. Although somewhat surprising, Hill's result conforms with known general problems with likelihood methods when the support points of the probability distribution are a function of the parameters of the distribution, in this case the parameters Xmin and Xmax. Several modifications of the maximum likelihood methods are proposed. It is also shown that for the standard likelihood function a local maximum occurs within the natural parameter space.
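The transformation Y = ln [(x - Xmin) / (Xmax - x)] and its inverse are easy to state in code. The sketch below assumes Xmin and Xmax are already known; estimating them is the difficult part the talk addresses and is not shown. Function names are illustrative.

```python
import math

def johnson_sb_transform(x, xmin, xmax):
    """Map x in (xmin, xmax) to Y = ln((x - xmin) / (xmax - x)) on the real line."""
    if not xmin < x < xmax:
        raise ValueError("x must lie strictly between xmin and xmax")
    return math.log((x - xmin) / (xmax - x))

def johnson_sb_inverse(y, xmin, xmax):
    """Map Y back to the bounded scale: x = xmin + (xmax - xmin) / (1 + exp(-y))."""
    return xmin + (xmax - xmin) / (1.0 + math.exp(-y))

y_mid = johnson_sb_transform(5.0, 0.0, 10.0)   # midpoint of the range maps to 0
x_back = johnson_sb_inverse(johnson_sb_transform(2.5, 0.0, 10.0), 0.0, 10.0)
```

Values near Xmin map to large negative Y and values near Xmax to large positive Y, which is why observations at or beyond the assumed bounds cannot be transformed.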
Different methods of resolving this problem are discussed along with other methods of obtaining the SB parameters, by fitting to 4 percentiles, by method of moments, and by a graphical technique that plots the data and minimizes the Kolmogorov-Smirnov statistic.
Abstract:
When the counts in a two-way categorical table are formed from the correlated members of a cluster, the common chi-squared test no longer applies. There are several approximate adjustments to the common chi-squared test. For example, Choi and McHugh (1989, Biometrics, 45) showed how to adjust the chi-squared statistic for clustered and weighted data. Our main contribution, however, is the construction and analysis of a Bayesian model that removes the analytical approximation, especially when a cell is empty or its expected count is small. This is an extension of a standard multinomial-Dirichlet model to include the intra-class correlation associated with the individuals within a cluster. We have used the formula described by Altham (1976, Biometrika, 63) to incorporate the intra-class correlation. This intra-cluster correlation varies with the size of the cluster, but we assume that it is the same for all clusters of the same size for the same variable. We use MCMC to fit our model and to make posterior inference about the intra-class correlation and the cell probabilities. Also, using Monte Carlo integration with a binomial importance function, we obtain the Bayes factor for a test of no association. To demonstrate the performance of the alternative test and estimation procedure, we have used data on activity limitation status and age from the National Health Interview Survey and a simulation study.
Abstract:
This presentation will introduce data mining methodology and address some of the common questions statisticians ask about data mining. Sample questions include what data mining and statistical analysis have in common, what role the statistician plays in the analysis and interpretation of data mining results, how results are validated, how data mining came to be, which datasets are appropriate for data mining, and why computer scientists have led so much of the development of data mining. Data mining is considered to be the application of modern, highly automated nonparametric analytical methods to recognize enduring patterns in data. Several of the major tools of data mining will be discussed, including decision trees (CART), artificial neural networks, multivariate adaptive regression splines (MARS), rule induction, RandomForests, multiple additive regression trees (TreeNet/MART Stochastic Gradient Boosting) and several others. Finally, case study examples will be provided for a variety of data mining methods.
Abstract:
The lecture will discuss the use of weights in survey inference. A fundamental idea in survey sampling is to weight cases by the inverse of their probabilities of inclusion, when deriving survey inferences. The weight indicates the number of population units the included case represents, and thus can be seen as a fundamental feature of the design-based survey inference. Modelers, on the other hand, seem more ambivalent about weighting, and argue that (at least in some settings) weighting is unnecessary. Dr. Little will discuss various perspectives and myths about survey weights. He will argue that, from a robust Bayesian perspective, weights are a key feature of the data that cannot be ignored, but weighting may not be the best way to use them.
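The basic design-based use of weights can be sketched as follows: each sampled case gets weight 1/pi, where pi is its inclusion probability, and the weighted sum estimates the population total (the Horvitz-Thompson estimator). The weighted mean shown alongside it divides by the sum of the weights. This illustrates only the design-based idea described above, not the Bayesian alternatives the lecture argues for; names and data are hypothetical.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total by weighting each case by 1 / pi."""
    return sum(y / pi for y, pi in zip(values, inclusion_probs))

def weighted_mean(values, inclusion_probs):
    """Estimate a population mean as a ratio of weighted sums."""
    weights = [1.0 / pi for pi in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

values = [10.0, 20.0, 30.0]
probs = [0.5, 0.25, 0.1]   # a case with pi = 0.1 represents 10 population units
total_hat = horvitz_thompson_total(values, probs)
mean_hat = weighted_mean(values, probs)
```

The case with the smallest inclusion probability dominates both estimates, which is one reason highly variable weights can inflate variance.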
Abstract:
In the 1980s, the Census Bureau and outside collaborators bridged the transition from the industry and occupation coding system for the 1970 census to that for the 1980 census by creating multiple imputations of 1980-system codes for 1970 census public-use samples. The imputation models were fitted using a relatively small "double-coded" (both 1970 and 1980 systems) sample from the 1970 census. This project had roots in the Population Division at the Bureau. Roger Herriot, as Chief of the Division, was very supportive of the project and contributed ideas to it, and the project was described in William Butz's 1995 ASA Proceedings article in memory of Herriot (http://www.amstat.org/sections/sgovt/outofbox.htm) as one of Herriot's major innovations at the Bureau. This talk will discuss the industry and occupation code project and statistical lessons learned from it. Two recent bridging projects at the National Center for Health Statistics, one addressing the transition from single-race to multiple-race reporting in Federal data collections and one adjusting for differences between self-reported and clinical data in surveys, will be discussed as well. The talk will highlight similarities and differences among the three bridging projects and will point out some outstanding methodological issues.
Abstract:
Historically, the American Community Survey (ACS) has produced inconsistent estimates of households and householders and inconsistent estimates of husbands and wives in married-couple households, even though logically these estimates should be equal. In the 2005 ACS, the size of these inconsistencies at the national level was approximately 3.7 million more householders than households and approximately 1.8 million more spouses than married-couple households. Likewise, for unmarried-partner households there were approximately 176,000 more unmarried partners than unmarried-partner households. The cause of these data inconsistencies was rooted in the current person weighting methodology, which was independent of the housing unit weighting and did not consider relationship to the householder. This paper describes the current weighting methodology and the changes introduced to reduce these data inconsistencies while having a minimal impact on other estimates and on the variances of the estimates. A three-dimensional raking methodology is used in which, for the first two dimensions related to equalizing spouses and householders, the marginal control totals are derived from the survey itself rather than from an independent source. Changes in the estimation of housing unit characteristics are also discussed. Empirical results from the implementation of this new methodology are presented based on the 2004 and 2005 ACS data.
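Raking adjusts weighted cell totals so that they match a set of marginal control totals, cycling through the dimensions until the margins agree. The sketch below is a generic two-dimensional version for brevity; the ACS procedure described above uses three dimensions and survey-derived controls, and none of its specifics are reproduced here. The function name, table, and targets are hypothetical.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Two-dimensional raking (iterative proportional fitting) of a table of weighted counts."""
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Scale each row to its control total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [c * target / total for c in cells[i]]
        # Scale each column to its control total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns now match exactly; stop once rows also match.
        if all(abs(sum(cells[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return cells

raked = rake([[10.0, 20.0], [30.0, 40.0]], row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
```

The row and column targets must sum to the same grand total, otherwise the procedure cannot satisfy both sets of margins.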
Abstract:
The semi-competing risks problem refers to a special bivariate time-to-event data structure in which one event is terminal and the other is non-terminal. Since the terminal event may censor the non-terminal event, we can observe both events only if the non-terminal event occurs first. This type of data frequently arises in studies of human health and behavior, as multiple event times from subjects are routinely studied. The association between the two times and their marginal distributions may be of interest. In a clinical trial setting or a heterogeneous study population, the covariate effect on either event time may be the focus. However, inference based on semi-competing risks data is often complicated by administrative censoring and by potentially dependent censoring of the non-terminal event by the terminal event if the two event times are associated. This talk will describe the unique features of semi-competing risks data by comparing them with bivariate right-censored time-to-event data and competing risks data, discuss identifiability issues, and review recent methodological advances for making inferences based on semi-competing risks data.
More Information about this and other talks sponsored by the Division of Cancer Prevention: http://www3.cancer.gov/prevention/pob/fellowship/colloquia.html
Abstract:
A key issue in drug discovery is the appropriate use of animal models to study human disease and therapeutic drug response. Animal models have, in general, been used to mimic human disease based upon relatively few points of analogy. The richness of open discovery "omics" platforms allows a comprehensive measurement of disease and drug response across a range of analyte classes, allowing investigators to better understand the predictive value of animal models for human disease. In a study conducted by GlaxoSmithKline, we compared disease effects and treatment response in two mouse models of type 2 diabetes and in a parallel human study. Three registered medicines for diabetes (rosiglitazone, metformin, and glyburide) were studied, and detailed measurements of transcripts, lipids, metabolites, and proteins were obtained in tissues and biofluids. Integrated data analysis using various multivariate techniques allowed for the generation of predictive fingerprints which shorten the time required to demonstrate treatment efficacy in diabetes trials, as well as allowing the identification of patients most likely to respond to a particular therapy from baseline measurements. In addition, the analysis uncovered a previously unsuspected mechanism for rosiglitazone activity in diabetic adipose. The use of Systems Biology approaches with large "omic" datasets holds great promise for a deeper understanding of disease biology and pharmacology.
Abstract:
The Statistics in the Community (STATCOM) Network is a graduate student-run consulting service that provides free statistical consulting to local governmental and nonprofit community groups. A need for statistical expertise in the local community was identified by a graduate student at Purdue University who founded STATCOM in 2001. Students who participate in STATCOM work in teams on community projects, while applying classroom knowledge and gaining marketable skills.
STATCOM also has a P-12 Outreach component, which aims to increase interest and achievement in statistics among pre-college students through involvement in community events and classrooms. STATCOM, through a Strategic Initiatives Grant from the American Statistical Association, is currently developing a cross-institutional network of students who devote time to pro bono statistical consulting. This talk will cover the structure of the STATCOM Network at both the national and local levels. In addition, this talk will address how the STATCOM Network can help fill a niche in pro bono statistical efforts and be supported by professional statisticians.
This is joint work with Alexander E. Lipka, Amy E. Watkins and Nilupa S. Gunaratna.
Cherie A. Ochsenfeld
Cherie A. Ochsenfeld received a B.S. in Mathematics/Economics, a M.A. in Teacher Education, and a California Teaching Credential in Mathematics from the University of California, Los Angeles. She received a M.S. in Applied Statistics from California State University, Hayward, a M.S. in Mathematical Statistics and is currently a Ph.D. student in Statistics at Purdue University. Her research interests include statistical genetics, nonparametric statistics, and QTL analysis. Cherie is the current Director of STATCOM at Purdue University and has served within the organization for three years.
Gayla R. Olbricht
Gayla R. Olbricht received a B.S. in Mathematics from Missouri State University. She received a M.S. in Applied Statistics and is currently a Ph.D. student in Statistics at Purdue University. Her research interests include statistical genetics, hidden Markov models, and epigenomics. Gayla is the current Student Advisor of STATCOM at Purdue University and has served within the organization for four years.
Shail Butani
Ms. Butani is Chief of the Statistical Methods Staff in the Office of Employment and Unemployment Statistics, U.S. Bureau of Labor Statistics (BLS). She received both her B.A. and M.A. in mathematical statistics from George Washington University. Last year, she was one of the organizers of the ASA Special Interest Group for Volunteers.
In the early to mid-1990s, she led a very successful quantitative literacy (QL) effort for the Washington Statistical Society, particularly in Fairfax County, VA. Major activities were: 1) Conducted and organized speakers and materials for career days for over 100 math classes each year. 2) Participated in and provided consultants for QL workshops conducted by ASA for local teachers. 3) Provided statisticians to assist in developing math curricula for Fairfax County Public Schools. 4) Conducted and provided statisticians for elementary school teachers' workshops. 5) Presented materials at the Female Achieving Mathematics Equity (FAME) project. 6) Provided speakers for Girls Excelling in Math and Science (GEMS) programs. 7) Conducted and provided consultants for Girl Scouts' workshops.
Abstract:
Small area estimation using linear area level models typically assumes normality of the area level random effects (model errors) and of the survey errors of the direct survey estimates. Outlying observations can be a concern, and can arise from outliers in either the model errors or the survey errors, two possibilities with very different implications. We consider both possibilities here and investigate empirically how use of a Bayesian approach with a t-distribution assumed for one of the error components can address potential outliers. The empirical examples use models for U.S. state poverty ratios from the U.S. Census Bureau's Small Area Income and Poverty Estimates program, extending the usual Gaussian models to assume a t-distribution for the model error or survey error. Results are examined to see how they are affected by varying the number of degrees of freedom (assumed known) of the t-distribution. We find that using a t-distribution with low degrees of freedom can diminish the effects of outliers, but in the examples discussed the results do not go as far as approaching outright rejection of observations.
Abstract:
When there is a strongly related auxiliary variable, model-based estimation can yield more precise estimates from smaller samples. Assumptions are made to build the models, produce estimates, and calculate confidence intervals. The first talk explores confidence interval coverage with deep stratification under scenarios when the assumptions are not quite correct, such as failing to assume correct scedasticity, recognize curvature, or incorporate an intercept. In these settings confidence interval coverage can be poor, robust, or ultra conservative. The second talk explores confidence interval coverage and Satterthwaite's approximation to the degrees of freedom when two or more model based estimates are summed in complex sample designs.
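Satterthwaite's approximation gives an effective degrees of freedom for a sum of independent variance estimates: df = (sum of v_i)^2 / sum of (v_i^2 / df_i), where v_i is the i-th variance estimate and df_i its degrees of freedom. The sketch below implements only this textbook formula; how the component variances and their degrees of freedom arise from a complex sample design is the subject of the talk and is not shown. The function name and inputs are hypothetical.

```python
def satterthwaite_df(variances, dfs):
    """Effective degrees of freedom for the sum of independent variance estimates."""
    total = sum(variances)
    denom = sum(v ** 2 / d for v, d in zip(variances, dfs))
    return total ** 2 / denom

# Two model-based estimates with equal variance, each with 10 degrees of freedom.
equal_case = satterthwaite_df([1.0, 1.0], [10, 10])
# One component dominating the total variance pulls df toward its own df.
dominated_case = satterthwaite_df([100.0, 1.0], [5, 50])
```

The result never exceeds the sum of the component degrees of freedom, and it approaches the degrees of freedom of the dominant component when one variance is much larger than the others.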
Abstract:
This seminar, designed with human rights practitioners in mind, outlines some examples of situations in which statisticians were asked to contribute to human rights projects. Our hope is to allow networking between the statistical community and the human rights community so that the unique contributions that statisticians can make towards human rights advocacy will be utilized in the future.
David Banks - A Katrina Experience
In 2005 the NSF sponsored a number of research projects on the aftermath of Katrina. This talk describes a survey led by Duke, UNC-Charlotte, and Tulane to study the factors that affected whether or not New Orleans residents chose to evacuate in advance of the storm, and what factors affected their post-Katrina experience. As part of this effort we found that some aspects of classic survey methodology do not work well with unsettled populations, and we developed workarounds that often were surprisingly successful.
Gary Shapiro - Guatemala Police Records
Several warehouses were discovered in Guatemala that contain millions of documents belonging to the National Police prior to 1996. The documents are of interest because some provide information on instances of police violence. The Human Rights Data Analysis Group at Benetech was asked to provide technical assistance for understanding and analyzing the archives. In turn, a group of ASA members provided assistance to Benetech on how sampling of these documents could be done. This talk discusses the complex structure of the archives, the sampling that is now being done, and the type of assistance provided to Benetech.
Paul Zador - Darfur: What Could Have Been
Several estimates of deaths during the Darfur crisis will be summarized. The methods used to derive them, and their reliability, will be reviewed and critiqued based in part on comments recently published in GAO's report on the Darfur crisis. The question will be raised: How do we determine the practical difference having precise disaster estimates of deaths, hunger, injuries, etc. might make? A volunteer group designed a survey of refugee camps in Chad, but the survey was never conducted. We will describe the survey's design, and discuss why it never happened.
Abstract:
This paper explores some conceptual and methodological issues that are important in the design, operation and analysis of large-scale government surveys. We view the design of survey procedures (including initial planning, sample design, data collection, inference and dissemination) as a mixture of optimization and risk management efforts in the presence of constraints and incomplete information. This in turn suggests several potentially rich areas for research in mathematical and applied statistics.
Five topics receive principal attention, beginning with some relatively well-defined technical issues and then expanding to several broader topics related to data quality and risk management. First, a review of the goals, constraints and risk profiles of survey practice suggests a spectrum of potential approaches to survey work, ranging from rigorously predetermined survey procedures at one extreme to highly exploratory analyses of previously collected data at the other extreme. Classical randomization-based procedures are arguably compatible with a mandate for predetermined methodology. Nonetheless, these procedures have limitations arising from efficiency issues, the presence of nonsampling error, and prospective inferential interest beyond the finite population that was sampled. These limitations lead to review of a second class of approaches to the analysis of survey data, based on models for survey variables, auxiliary variables and nonsampling error processes. Third, we use the framework of risk management to explore six dimensions of survey data quality suggested in Brackstone (1999): accuracy (incorporating all of the components of error considered in standard models for total survey error), timeliness, relevance, interpretability, accessibility and coherence. Fourth, we expand our discussion of risk management by considering operational risk, i.e., the risk that one or more steps in a survey procedure may not be carried out as specified. Finally, we note that work with large-scale surveys will involve a mixture of statistical science and statistical technology, and we suggest that the literature on adoption and diffusion of technology can offer important insights into the distribution of expectations, utility functions and behaviors of large survey organizations, data analysts and other data users.
They invite all of you to celebrate with them. All the authors are longtime WSS members, and each will say a few words about the book and sign copies if requested.
Reiter's Book Store, a Washington landmark for over 60 years, is hosting this special event. There will be wine and cheese for as long as it lasts.
Reiter's is easy to get to: it is on 20th Street at the southwest corner of 20th and K, just two blocks along K Street from the Farragut West Metro or four blocks down 20th Street from the Metro at Dupont Circle.
Abstract:
This paper examines how the evolution of a firm's human capital stock is related to firms' benefit choices using integrated data on firms, their employees, and their benefit offerings from the Census Bureau's Longitudinal Employer-Household Dynamics Program and from IRS Form 5500. It then estimates the relationship between compensation packages and firm productivity and survival, controlling for workforce characteristics. The authors find that firms that offer benefits have significantly lower turnover rates and faster growth rates. Benefit-offering firms have higher labor productivity and higher survival rates, even when controlling for firm and workforce characteristics and the level of wage compensation. Greater labor productivity explains some but not all of the differences in survival rates.
Abstract:
Particulate matter (PM) has been linked to a range of serious cardiovascular and respiratory health problems, including premature mortality. The main objective of our research is to quantify uncertainties about the impacts of fine PM exposure on mortality. A multivariate spatial regression model is developed for estimating the risk of mortality associated with fine PM and its components across all counties of the coterminous United States. Different sources of uncertainty in the data and model are explored using the spatial structure of the mortality data and the speciated fine PM. A flexible Bayesian hierarchical model is proposed for a space-time series of counts (mortality) by constructing a likelihood-based version of a generalized Poisson regression model that combines methods for point-level misaligned data and change of support regression. Our results seem to suggest an increase by a factor of two in the risk of mortality due to fine particles with respect to coarse particles. Our study also shows that in the Western United States, the nitrate and crustal components of the speciated fine PM seem to have more impact on mortality than the other components. On the other hand, in the Eastern United States, sulfate and ammonium explain most of the fine PM effect.
Abstract:
Small area estimation techniques typically rely on mixed models containing random area effects to characterise between area variability. In contrast, Chambers and Tzavidis (2006) describe an approach to small area estimation based on regression M-quantiles. This approach avoids conventional Gaussian assumptions and problems associated with specification of random effects, allowing between area differences to be characterized by the variation of area-specific M-quantile coefficients. However, the resulting M-quantile predictors of small area means can be biased. In this talk I will describe a general framework for robust bias adjusted small area prediction that corrects this problem, and is based on representing a small area predictor as a functional of the Chambers and Dunstan (1986) predictor of the within area distribution function of the target variable. An important advantage of this framework is that it allows integrated prediction of small area means and quantiles. I will demonstrate the usefulness of this framework through both model-based as well as design-based simulation, with the latter based on two realistic survey data sets containing small area information. The talk also includes an application of the bias adjusted M-quantile approach to predicting key percentiles of district level distributions of per-capita household consumption expenditure in Albania in 2002.
Abstract:
In many surveys, unweighted imputation methods are employed because survey weights are unavailable at the time missing survey data are imputed. In such situations, it is well known that certain customary design-based estimators with imputed data generally are biased even under the usual uniform response mechanism assumption. In this paper, we present the expression for the bias of a design-based estimator under a more realistic ignorable response mechanism and then use this expression to propose a bias-corrected estimator. The second part of the paper deals with a variance estimator that captures different sources of uncertainty. Both theory and results from a Monte Carlo simulation study are presented to justify our approach.
Keywords: ratio imputation, bias-adjusted estimator, variance estimation, small area estimation
Abstract:
Coverage and Utility of Purchased Residential Address Lists: A Detailed Review of Selected Local Areas. Sylvia Dohrmann
Recently there has been much interest in using address lists originating from the United States Postal Service (USPS) as area sampling frames in place of on-site enumerations of dwelling units. While it has become clear that purchased USPS lists are less costly than the process of on-site enumeration, it is still unclear as to whether these lists are adequate as substitutes for them. In this presentation, we compare the coverage of purchased lists for a selection of PSUs (Primary Sampling Units), differing in size and composition, compared to area sample frames created using on-site enumeration. We will examine the coverage of the USPS lists by comparing them to enumerated lists and review what type of areas are more completely covered by the USPS lists. We will also demonstrate how the extent to which the addresses on the purchased lists can be geocoded relates to their usefulness as the basis for area sampling frames.
Suitability of the USPS Delivery Sequence File as a Commercial-Building Frame. Stephanie Eckman, Michael Colicchia, Colm O'Muircheartaigh, NORC.
The USPS Delivery Sequence File (DSF) has proven to be an accurate and low-cost frame for household surveys. However, no research organization has evaluated the use of the DSF as a frame of non-residential buildings. Given the success that we and other organizations have had using the DSF as a household frame, we are optimistic that the database will provide good coverage of non-residential buildings as well. But we must assess its accuracy and coverage. We have conducted such an assessment in eleven segments across the county. For each segment, we have both a recent field listing of commercial buildings as well as the DSF database of non-residential delivery points. We will compare the two frames, presenting match rates and maps showing the discrepancies between the frames.
Abstract:
Imputation is one of the most popular methods for dealing with nonrespondents in survey problems. In this presentation I focus on the use of the empirical likelihood method in imputation, which leads to more efficient and/or robust imputation than other methods such as parametric regression imputation, nonparametric kernel imputation, and random hot deck imputation. More specifically, (1) an empirical likelihood imputation method using information provided by covariates and the propensity function is introduced to produce efficient and doubly robust estimators of population means; (2) an empirical likelihood method is introduced for creating imputation cells in hot deck random imputation where imputation cells are constructed using a categorical covariate; (3) an empirical likelihood method is studied in the case of non-ignorable nonrespondents with either categorical or continuous covariates. Simulation results are presented to show the efficiency and robustness properties of the proposed methods.
The work of Jun Shao was generously supported by grant DMS-0404535 from the National Science Foundation: Methodology, Measurement, and Statistics Program in the Division of Social and Economic Sciences.
Abstracts:
1. A Geostatistical Approach to Linking Geographically-Aggregated Data From Different Sources
Carol A. Gotway Crawford, Office of Workforce and Career Development, CDC; and Linda J. Young, Department of Statistics, University of Florida, Gainesville, FL USA
The widespread availability of digital spatial data and the capabilities of Geographic Information Systems (GIS) make it possible to easily synthesize spatial data from a variety of sources. More often than not, data have been collected at different geographic scales, and each of the scales may be different from the one of interest. Geographic information systems effortlessly handle these types of problems through raster and geoprocessing operations based on proportional allocation and centroid smoothing techniques. However, these techniques do not provide a measure of uncertainty in the estimates and lack the ability to incorporate important covariate information that may be used to improve the estimates. They also often ignore the different spatial supports (e.g., shape and orientation) of the data. On the other hand, statistical solutions to change of support problems are rather specific and difficult to implement. In this presentation, we present a general geostatistical framework for linking geographic data from different sources. This framework incorporates aggregation and disaggregation of spatial data, as well as prediction problems involving overlapping geographic units. It explicitly incorporates the supports of the data, can adjust for covariate values measured on different spatial units at different scales, provides a measure of uncertainty for the resulting predictions, and is computationally feasible within a GIS. The new framework we develop also includes a new approach for simultaneous estimation of mean and covariance functions from aggregated data using generalized estimating equations.
2. Upper Level Set Scan Statistic System for Detecting Arbitrarily Shaped Hotspots
Reza Modarres, Professor and Chair, Department of Statistics, GWU
The Upper Level Set Scan Statistic (ULS), its theory, design and implementation, and its extension to bivariate data are discussed. We provide the ULS-Hotspot algorithm that maintains a list of connected components of the rate surface at each level of the ULS tree. The tree is grown in the immediate successor list, which provides a computationally efficient method for likelihood evaluation, visualization and storage. An example shows how the zones are formed and the likelihood function is developed for each candidate zone. The general theory of bivariate hotspot detection is discussed, including the bivariate binomial and Poisson models and the multivariate exceedance approach. We propose the joint and intersection methods for detecting bivariate hotspots and study the sensitivity of the joint hotspots to the degree of association between the variables. We investigate the hotspots in two diverse applications, one in microbial risk assessment and the other in mapping of crime hotspots.
Abstract:
Although "choose all that apply" questions are common in modern surveys, methods for analyzing associations among responses to such questions have only recently been developed. These methods are generally valid only for simple random sampling, but many "choose all that apply" and related questions appear in surveys conducted under more complex sampling plans. The purpose of this talk is to provide statistical analysis methods that can be applied to "choose all that apply" questions in complex survey sampling situations. Loglinear models fit to marginal data are used to describe associations among the multiple responses that occur with this type of data. Model comparison test statistics along with their asymptotic distributions are presented in order to choose a good fitting model. Estimates of odds ratios and their corresponding standard errors are provided in order to measure associations among responses.
Abstract:
Various proteomic assays yield spiky functional data. For example, MALDI-TOF and SELDI-TOF yield one-dimensional spectra with many peaks, and 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample. In this talk, I will discuss how to identify candidate biomarkers for various types of proteomic data using methods based on Bayesian wavelet-based functional mixed models. This approach models the functions in their entirety, and so avoids reliance on peak or spot detection methods. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while adjusting for clinical or experimental covariates that may affect both the intensities and locations of the peaks and spots in the data. I will demonstrate how to identify regions of the functions that are differentially expressed across experimental conditions, in a way that takes both statistical and clinical significance into account and controls the Bayesian false discovery rate to a pre-specified level. Time allowing, I will also demonstrate how to use this framework as the basis for classifying future samples based on their proteomic profiles in a way that can also combine information across multiple sources of data, including proteomic, genomic, and clinical, and may also discuss improvements of the modeling framework that result in more robust inference. These methods will be applied to a series of proteomic data sets from cancer-related studies.
Abstract:
Rarely does the federal government need advice on theoretical statistics. I would like to talk about one exception. Efforts to persuade Congress to enact legislation that affects public policy are constantly being made by lobbyists who are paid by special interests. While this mode of operation is frequently extremely effective for achieving the goals of the special interest groups, it often does not serve the public interests in the best possible way. As counterpoint to this mode of operation, pro bono interaction with individual legislators and especially testimony in Congressional hearings can be remarkably effective in presenting a balanced picture. The debate on anthropogenic global warming has in many ways left scientific discourse and landed in political polemic. In this talk I will discuss our positive and negative experiences in formulating testimony on this topic.
Abstract:
Several countries around the world have developed organized systems of statistical indicators that are used to inform civil discourse and to track change in the basic economic, social, and environmental status of the country. The audiences for these key national indicator systems include both policy makers in central and local governments and interested citizens.
The State of the USA is envisioned to be a web-based resource permitting user-friendly presentation of key indicators at national and subnational levels. It will have explicit quality criteria and interest thresholds that inform what indicators are contained in the system. It will include official government statistics, private sector statistics, and academic statistics.
The State of the USA is currently funded by grants from several private foundations and is being incubated in the National Academies.
This WSS session will provide an introduction to the inception and development of the State of the USA, its basic goals, and its emergent organization. A demonstration of a test web site, illustrating some of the features of the indicator presentation, will be given.
Abstract:
Survey researchers often need their questions to convey very specific information to respondents; for example, questions may include complex definitions, instructions to include or exclude various considerations while answering, and a particular set of closed-ended responses. Although questionnaire design principles provide some advice on constructing complex questions, little empirical evidence demonstrates the superiority of certain decisions over others. For example, in some questions, important respondent instructions "dangle" after the core question has been asked; one alternative is to provide such definitions before asking the core question.
We have conducted several rounds of RDD telephone surveys with split-ballot experiments to explore such issues. This seminar reports on the latest round of 425 interviews conducted via an RDD telephone survey, in which respondents received alternative versions of various survey questions. For example, in some experiments, alternative questions used the same words but were structured differently. Other experiments compared the use of examples vs. definitions to explain complex concepts, compared the use of one vs. two questions to measure the same phenomenon, and compared questions before and after cognitive interviews had been used to clarify key concepts. With permission, interviews were tape recorded and behavior-coded, making it possible to compare various interviewer and respondent difficulties across question versions, in addition to comparing differences in response distributions.
Taken in conjunction with findings from previous rounds of experiments, the results begin to suggest some general design principles for complex questions. For example, the disadvantages of "dangling qualifiers" are becoming clear, as are the advantages of using multiple questions to disentangle certain complex concepts. The seminar will report results of these and other experimental comparisons, with an eye toward providing more systematic questionnaire design guidance.
Following the seminar, all are welcome and invited to attend a social hour at Capitol City Brewing Company, located in the same building as the talk.
Abstract:
The current plan for the 2010 Census includes a nationwide unduplication operation. One potential problem is the possibility of large numbers of false positives. To help evaluate the extent of this problem, the unduplication procedures have been run on the data from the 2000 Census.
The first section of the talk describes a simple approach to take full advantage of multiple processors through writing C programs and using basic UNIX commands. This section also describes the programming concepts used in unduplicating the entire country using BigMatch and the SRD Matcher. It also describes metaprogramming techniques used in this large system and documents some of the errors and problems made during development. One example involves keeping track of all files so that multiple runs do not interfere with each other. A similar system is currently expected to be used for unduplication of the 2010 census on production machines.
The second section of the talk gives an overview of the results of our analysis. Most of the problem with apparent false matches seems to be concentrated in the most common surnames and the most common Hispanic surnames, especially for matches outside the state. Name frequency does not seem to have much effect when there are multiple links of reasonable quality between housing units or when the phone number matches.
This event is accessible to persons with disabilities. Please direct all requests for sign language interpreting services, Computer Aided Real-time Translation (CART), or other accommodation needs, to HRD.Disability.Program@census.gov. If you have any questions concerning accommodations, please contact the Disability Program Office at 301-763-4060 (Voice), 301-763-0376 (TTY), or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.
Abstract:
Wetlands provide significant environmental benefits such as assimilation of pollutants, flood water storage, water recharge and fish and wildlife habitat. Geographically isolated wetlands (GIW) can provide the same benefits as wetlands in general, and are particularly vulnerable to losses from urbanization and agriculture precisely because they are geographically isolated and have varying amounts of regulatory protection. Currently, there is not a dependable and cost-effective method to generate an accurate GIW map without sending a field scientist to perform surveys or requiring image technicians to perform heads-up digitalization of aerial photography. By using statistically valid estimates of accuracy rates one can evaluate the quality of the information contained in GIW maps. Accuracy rates are used to describe the misclassification errors of the maps. A probability sampling survey methodology that balances statistical considerations, expert opinion and operational considerations is proposed for assessing the accuracy of GIW maps. The proposed sampling design is based on a stratified multi-stage sampling design that addresses sampling size requirements for the different strata and types of GIWs and also recognizes the need for spatial coverage while minimizing operational efforts. Expressions for design-based accuracy estimates and an estimate of the number of GIW, as well as their corresponding variances are also provided.
A simulation exercise is used to illustrate the proposed sampling methodology. A GIW map for Brunswick County in North Carolina, created using historical data, was used as the sampling frame. The GIW map was created from a combination of satellite imagery, classification tools to process the imagery, and auxiliary information. The sampling methodology was used to randomly select sites from this GIW map. An updated GIW map for the same county showing the exact location of GIW was used to provide "ground-truth" observations from wetland delineations approved by the US Army Corps of Engineers. Accuracy estimates were calculated by comparing site classification differences obtained by using both the original and updated GIW maps. Survey-based accuracy estimates and their corresponding variance estimates were calculated.
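To make the idea of a design-based accuracy estimate concrete, here is a minimal sketch under a deliberately simplified design: a single-stage stratified simple random sample of map sites, each checked against ground truth. The design described above is stratified and multi-stage, so this is not the authors' estimator; the function name and the stratum counts are hypothetical.

```python
def stratified_accuracy(strata):
    """Estimate overall map accuracy and its variance.

    strata: list of (N_h, n_h, correct_h) tuples, where N_h is the number
    of sites in the stratum, n_h the number sampled (at least 2), and
    correct_h the number of sampled sites classified correctly.
    """
    total_sites = sum(N_h for N_h, _, _ in strata)
    estimate = 0.0
    variance = 0.0
    for N_h, n_h, correct_h in strata:
        if n_h < 2 or n_h > N_h:
            raise ValueError("each stratum needs 2 <= n_h <= N_h")
        weight = N_h / total_sites          # stratum share of the frame
        p_h = correct_h / n_h               # within-stratum accuracy rate
        fpc = 1.0 - n_h / N_h               # finite population correction
        estimate += weight * p_h
        variance += weight ** 2 * fpc * p_h * (1.0 - p_h) / (n_h - 1)
    return estimate, variance


# Hypothetical counts for two strata of mapped sites.
acc, var = stratified_accuracy([(400, 40, 36), (100, 20, 12)])
print(acc, var ** 0.5)
```

The misclassification rate is one minus the accuracy estimate and has the same variance under this simplified design.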
Abstract:
In therapy of rapidly fatal diseases, early treatment efficacy often is characterized by an event, "response," which is observed relatively quickly. Since the risk of death decreases at the time of response, it is desirable not only to achieve a response, but to do so as rapidly as possible. We propose a Bayesian method for comparing treatments in this setting based on a competing risks model for response and death without response. Treatment effect is characterized by a two-dimensional parameter consisting of the probability of response within a specified time and the mean time to response. Several target parameter pairs are elicited from the physician so that, for a reference covariate vector, all elicited pairs embody the same improvement in treatment efficacy compared to a fixed standard. A curve is fit to the elicited pairs and used to determine a two-dimensional parameter set in which a new treatment is considered superior to the standard. Posterior probabilities of this set are used to construct rules for the treatment comparison and safety monitoring. The method is illustrated by a randomized trial comparing two cord blood transplantation methods.
Abstract:
Ordinal scale survey response items are often used in quantifying a latent trait. When the survey is offered in multiple modes of administration, e.g., telephone interview or self-administered questionnaire, the mode of administration may affect the characteristics of the survey items, such that an individual's responses may differ depending on the mode. Using a mental health survey as a case study, the Bayesian Differential Mode Effects Model (BDMEM) is introduced as an Item Response Theory (IRT) model-based solution for the detection, quantification and reconciliation of mode of administration effects at the item, response category, and scale levels. The BDMEM is compared to the popular approach of differential item functioning (DIF), and its advantages over DIF are highlighted, including the optimal use of repeated measures, the detection of differences in categorical response probabilities, and the automatic equating of results under different modes.
Abstract:
The Census Bureau's Demographic Survey Sample Redesign Program is responsible, among other things, for research into improving the designs of demographic surveys, with a particular focus on survey sampling. Historically, research into improving sample design has been restricted to "mainstream" methods such as basic stratification, multi-stage designs, systematic sampling, probability-proportional-to-size sampling, clustering, and simple random sampling. Over the past thirty years or more, we have increasingly faced reduced response rates and higher costs coupled with an increasing demand for more data on all types of populations. More recently, dramatic increases in computing power and the availability of auxiliary data from administrative records have indicated that we may have more options than we did when we established our current methodology.
This seminar series is the beginning of an exploration into alternative methods of sampling. In this first seminar, from 9:30 to 10:30, we will hear about Professor Thompson's work on network, spatial, and adaptive sampling. He will discuss various alternative approaches and their statistical properties. Following Professor Thompson's presentation, there will be a 15-minute break, and then from 10:45 - 11:30, Professor Jean Opsomer will provide discussion about the methods and their potential in demographic surveys, particularly focusing on impact on estimation. The seminar will conclude with an open discussion session from 11:30 - 11:45 with 15 additional minutes available if necessary.
Seminar #2 is currently slated for December 10, 2007 and will feature Professor Sharon Lohr of Arizona State University discussing multiple overlapping frame designs.
This event is accessible to persons with disabilities. Please direct all requests for sign language interpreting services, Computer Aided Real-time Translation (CART), or other accommodation needs, to HRD.Disability.Program@census.gov. If you have any questions concerning accommodations, please contact the Disability Program Office at 301-763-4060 (Voice), 301-763-0376 (TTY) or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.
Abstract:
In this paper, based on the general Fay-Herriot model, we evaluate the performance of different variance component estimation methods in model-based point estimation and interval prediction. Following Morris' comments, we propose a new approach to estimating the model variance that always produces positive estimates. Its positivity and consistency are also established. A parametric bootstrap prediction interval method using the weighted least squares estimator and the ADM estimator under the general Fay-Herriot model is also proposed, and it achieves coverage accuracy of O(m^{-3/2}). Extensive simulations and real-life data analyses are conducted. Our results suggest that this new approach performs better.
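The sketch below illustrates why a strictly positive estimate of the model variance matters in a Fay-Herriot setting. It assumes the simplest case, an intercept-only model with known sampling variances, and takes the model variance as given rather than estimating it; the symbols A (model variance), D (sampling variances) and gamma (shrinkage weight) follow common textbook notation and do not appear in the abstract. It is not the estimator proposed in the paper.

```python
def fay_herriot_predict(y, D, A):
    """Shrinkage predictor for an intercept-only Fay-Herriot model.

    y: direct survey estimates, one per area
    D: known sampling variances of the direct estimates (all > 0)
    A: model variance, assumed already estimated (must be >= 0)
    """
    if A < 0:
        raise ValueError("model variance A must be non-negative")
    # Weighted least squares estimate of the common mean,
    # with weights 1 / (A + D_i).
    weights = [1.0 / (A + d) for d in D]
    beta = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    # gamma_i = A / (A + D_i) is the weight given to the direct estimate.
    gammas = [A / (A + d) for d in D]
    return [g * yi + (1.0 - g) * beta for g, yi in zip(gammas, y)]


# Hypothetical direct estimates for two areas with equal sampling variance.
print(fay_herriot_predict([1.0, 3.0], [1.0, 1.0], A=1.0))  # partial shrinkage
print(fay_herriot_predict([1.0, 3.0], [1.0, 1.0], A=0.0))  # full shrinkage
```

When the variance estimate is zero, every area is assigned the same synthetic value regardless of its direct estimate, which is the degenerate behavior that a positive variance estimator is meant to avoid.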
Abstract:
In the first part of the talk, I will review various multi-stage sampling designs in classical genetic linkage and association studies. This part does not involve much statistics. In the second part, I will focus on a cost-effective two-stage design for genome-wide case-control association studies. Some test statistics for this two-stage design will also be discussed. Most of the talk is based on an article with Robert Elston and Danyu Lin to appear in Annual Review of Genomics and Human Genetics (Sept 2007).
For a complete listing of our current seminars, visit http://www.gwu.edu/~stat/seminar.htm. For more information about the George Washington University Department of Statistics Seminars, contact:
Efstathia Bura, Department of Statistics
E-mail: ebura@gwu.edu, Phone: 202-994-6358
Joseph L. Gastwirth, Department of Statistics
E-mail: jlgast@gwu.edu, Phone: 202-994-6548
Abstract:
This talk will discuss the role of text data mining in defense applications. Discussions will include, but not be limited to, the role of text data mining in the characterization of country capabilities, its role in the characterization of the state of the art of a discipline area, and its role in discovery. Discussion will focus on the speaker's experiences in this area and his knowledge of the state of the text data mining literature. We also will explore who the customers might be for these techniques and where the future lies, both in the technology and in the important problems that have not yet been addressed.
Abstract:
Interesting problems in statistics arise in several areas of natural language processing and information retrieval. Broadly, we might divide these into (1) estimating useful distributions for language use and (2) designing insightful and affordable evaluation methods. In this talk, we will provide a broad overview of these two closely related fields, focusing first on the consequences of what has been called the "evaluation guided research paradigm" that now dominates both fields. We'll then each drill down to describe one or two problems from our recent work where it seems to us that our worlds and yours [the statisticians'] might intersect. Our goal in this seminar is to start a discussion about the kinds of problems we might productively work on together.
Abstract:
Given the recent advances in convenient, flexible and powerful computer-intensive methods to analyze data, it is natural to wonder about the relevance of the `classical' theory of statistical inference. Here we discuss an application, namely studies with a covariate measured with error, that poses a severe statistical challenge when only the means of the observations are modelled. In this setting, standard methods of data analysis typically yield dramatically biased results -- even if computer-intensive methods are used. We draw upon the theory of bias reduction of profile estimating functions to arrive at inferences that are substantially less biased. We apply the proposed method to a study examining whether a biomarker measured with error (long-term alanine aminotransferase level) is related to length of hospital stay in patients treated for herpes zoster infections.
Abstract:
Solving the binary classification problem for an application involves solving a data driven modeling problem. Such problems entail multiple and coupled sources of errors. Two communities of practice have approached this problem with different sets of assumptions and resulting limitations. The statistical community assumes that data is generated by a given stochastic model with parameter estimates based on the given class of models. On the other hand, the machine learning or algorithmic modeling community uses algorithmic modeling methods that treat data mechanisms as unknown. Machine learning methods have been successfully used on large data sets and offer a more accurate alternative to data modeling on small data sets. In this talk we consider the hard margin support vector algorithm applied to several bivariate Gaussian data sets with common covariance matrices.
Abstract:
Users of statistical tables released by the Economic Directorate of the U.S. Census Bureau have raised the issue of whether an alternative to cell suppression can be used to protect the confidentiality of such tables. These users would like to have access to at least an approximate value for each cell, except possibly for those cells that are the most sensitive. An alternative method was developed several years ago by researchers at the Census Bureau that successfully meets that goal. This method uses a carefully calibrated noise distribution to generate noise which is then added to the microdata values of a magnitude variable requiring protection. These noisy microdata values are then tabulated to form the cell values for all the tables in a statistical program that describe that variable (e.g., receipts for Non-Employer Statistics). This method is conceptually simple and easy to implement; in particular, it is much simpler than cell suppression. The main concerns are whether noise protected tables are fully protected and whether the noisy cell values are as or more useful to users than the combination of exact and suppressed values provided by cell suppression. The seminal paper by Evans-Zayatz-Slanta (J. Official Statistics, 1998) showed that this was clearly true for the survey analyzed in that paper. The work presented in this paper provides analysis for additional surveys with different features than the survey described in the earlier paper. We present general protection arguments that involve ways of relating the uncertainty provided by noisy values to the required amount of protection. We present graphs which show the different distributions of net noise on the set of sensitive cells versus that for the non-sensitive cells. We also discuss some ways to fine-tune the algorithm to a particular table, taking advantage of its special characteristics. We call this new variation 'balanced EZS noise'.
Our conclusion is that when EZS noise is appropriately applied, it fully protects tables while usually releasing more useful data than cell suppression. The possible application of EZS noise to a variety of statistical programs within the Economic Directorate is currently being researched.
This talk is an expanded version of an invited talk presented at the Third International Conference on Establishment Surveys (ICES 2007 in Montreal) session called "Advances in Disclosure Protection: Releasing More Business and Farm Data to the Public". That paper was co-authored with Jeremy Funk. Paul and Jeremy are members of the Disclosure Avoidance Research Group in the Statistical Research Division of the U.S. Census Bureau. Laura Zayatz is the head of that group and provided guidance on this project.
Abstract:
Some case-control genome-wide association studies (CCGWASs) select promising single nucleotide polymorphisms (SNPs) by ranking corresponding p-values, rather than by applying the same p-value threshold to each SNP. For such a study, we define the detection probability (DP) for a specific disease-associated SNP as the probability that the SNP will be "T-selected", namely have one of the top T largest chi-square values (or smallest p-values) for trend tests of association. The corresponding proportion positive (PP) is the fraction of selected SNPs that are true disease-associated SNPs. We study DP and PP analytically and via simulations, both for fixed and for random effects models of genetic risk, that allow for heterogeneity in genetic risk. DP increases with genetic effect size and case-control sample size, and decreases with the number of non-disease-associated SNPs, mainly through the ratio of T to N, the total number of SNPs. We show that DP increases very slowly with T, and the increment in DP per unit increase in T declines rapidly with T. DP is also diminished if the number of true disease SNPs exceeds T. For a genetic odds ratio per minor disease allele of 1.2 or less, even a CCGWAS with 1000 cases and 1000 controls requires T to be impractically large to achieve an acceptable DP, leading to PP values so low as to make the study futile and misleading. We further calculate the sample size of the initial CCGWAS that is required to minimize the total cost of a research program that also includes follow-up studies to examine the T selected SNPs. A large initial CCGWAS is desirable if genetic effects are small or if the cost of a follow-up study is large.
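The detection probability defined above can be illustrated with a small Monte Carlo sketch. This is not the authors' code. It assumes a single disease SNP whose 1-df statistic is (Z + mu)^2 with Z standard normal, independent null SNPs with statistic Z^2, and hypothetical values for the number of SNPs, T, and the noncentrality mu.

```python
import random


def detection_probability(n_snps, top_t, mu, reps=200, seed=0):
    """Estimate P(disease SNP is among the top_t largest chi-square values).

    Illustrative assumptions (not from the abstract): one disease SNP with
    statistic (Z + mu)**2 and n_snps - 1 independent null SNPs with Z**2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        disease_stat = (rng.gauss(0.0, 1.0) + mu) ** 2
        # Count the null SNPs whose statistic exceeds the disease SNP's.
        larger = sum(
            1
            for _ in range(n_snps - 1)
            if rng.gauss(0.0, 1.0) ** 2 > disease_stat
        )
        # The disease SNP is T-selected if fewer than top_t nulls outrank it.
        if larger < top_t:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for mu in (0.0, 2.0, 4.0):
        print(mu, detection_probability(n_snps=1000, top_t=10, mu=mu))
```

With mu = 0 the disease SNP is exchangeable with the nulls, so the estimate should be close to T/N; larger mu plays the role of a larger genetic effect or sample size.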
Abstract:
Software failure data can be analyzed to provide statistical estimates of the reliability of software, which are useful for assessing its quality and for determining the release date of a software package. The non-homogeneous Poisson process (NHPP) model is one of the models most widely used for describing and analyzing software failure processes. NHPP models in which the expected number of errors over infinite observation time is finite are called NHPP-I models.
Our research proves a key statistical limitation of NHPP-I models, namely inconsistency of parameter estimates. In other words, even if the process is observed for an arbitrarily long time one cannot estimate unknown parameters of the model very accurately. The inconsistency feature is a consequence of a representation of an NHPP-I model as a mixture of General Order Statistics or GOS models (Raftery, 1987) and holds more generally for mixture distributions in broader settings, and not just for the NHPP model for software failures. This result also has implications for a Bayesian analysis of NHPP models.
We show that optimal unbiased estimation of any parametric function in an NHPP-I model essentially reduces to estimating related parametric functions of the underlying GOS model. We discuss other known features of an NHPP model that are not consistent with certain intuitive features of software failure processes and reliability growth.
This talk is based on joint research with my departmental colleagues, Professors Tapan Nayak and Subrata Kundu.
The Demographic Statistical Methods Division (DSMD) of the Census Bureau, among other things, is responsible for conducting research to implement more timely and less costly methods to estimate and prevent measurement error in demographic surveys. Latent Class Analysis (LCA) is an alternative approach to achieve this goal in contrast to the current reinterview methodology. Historically, at the Census Bureau, the research into LCA (First-order Markov Latent Class Model) was subject to non-complex sample designs. The DSMD has continued its research to improve the use of LCA for estimating response error. Through the most recent partnership with Westat and Statistical Innovations, the DSMD was able to accomplish this goal by conducting a thorough violation study that incorporates complex sample design with weighting and heterogeneity across latent classes. In addition, the research also incorporated an aspect to modify existing software to estimate the models.
This symposium will provide the research results of that partnership, as well as a session on how to apply the enhanced model to estimate measurement error in current surveys.
This event is accessible to persons with disabilities. Please direct all requests for sign language interpreting services, Computer Aided Real-time Translation (CART), or other accommodation needs, to HRD.Disability.Program@census.gov. If you have any questions concerning accommodations, please contact the Disability Program Office at 301-763-4060 (Voice), 301-763-0376 (TTY).
About the speaker: Dr. Vermunt is a professor in the Department of Statistics Research and Methodology at the University of Tilburg, Netherlands. Dr. Vermunt is the first recipient of the Leo Goodman Award of the ASA Methodology Section (2005). Dr. Vermunt's primary methodological contributions are in the area of categorical data analysis, with particular attention to latent heterogeneity. Using a latent class analysis approach, he has incorporated into log-linear event history analysis methods for handling missing data, unobserved heterogeneity, censoring, and measurement error. He has also successfully applied the same approach to classification and clustering analysis, and multi-level and random coefficient models for categorical data. In his recent work, he has made original and important contributions to the analysis of ordered data with different flexible constraints. Now, with the Census Bureau, Dr. Vermunt showed that the mixture Markov latent class model has a better fit (than previous models) in estimating the Current Population Survey labor force classification errors.
Agenda
| 8:45 am | Refreshments |
| 9:15 am | Introductory Remarks: Ruth Ann Killion, Chief, Demographic Statistical Methods Division; Candice Barnes, Chief, Survey Response Analysis Branch |
| 9:30 am | Research results to enhance the LCA model for complex sample design, weighting and the modification of software: Dr. Jeroen Vermunt, Tilburg University, Netherlands; Dr. Jay Magidson, President, Statistical Innovations |
| 11:00 am | Question/Answer |
| 12:00 | Lunch |
| 1:30 pm | Workshop -- Application of Mixed LCA models to provide measurement error for current surveys: Dr. Jeroen Vermunt, Tilburg University, Netherlands; Dr. Jay Magidson, President, Statistical Innovations |
| 3:30 pm | Wrap Up |
| 4:00 pm | Adjourn |
Abstract:
With the advance of biotechnology and the reduction of genotyping cost, a genome-wide association study testing association between a disease and 100,000 to 500,000 genetic markers (single nucleotide polymorphisms: SNPs) is feasible. Such a study consists of several stages, from quality control and genome-wide single-marker analysis to more powerful regional analysis and replication studies. Statisticians face challenges in each of these stages. Consequently, many statistical issues arise from the analyses. We will review and discuss these statistical issues and controversies.
Joe Sedransk, Professor of Statistics (Case Western Reserve University, Cleveland, Ohio), will give the 17th Annual Morris Hansen Lecture, "Assessing the Value of Bayesian Methods for Inference About Finite Population Quantities," on Tuesday, October 30, at 3:30 P.M. in the Jefferson Auditorium of the Department of Agriculture's South Building (Independence Avenue SW, between 12th and 14th Streets). The Hansen Lecture Series is sponsored by the Washington Statistical Society, Westat, and the National Agricultural Statistics Service (NASS).
The USDA South Building (Independence Avenue SW) is between 12th and 14th Streets at the Smithsonian Metro Stop (Blue Line). Enter through Wing 5 or Wing 7 from Independence Ave. (The special assistance entrance is at 12th & Independence). A photo ID is required.
Please pre-register for this event to help facilitate access to the building. After September 1, pre-register on line at http://www.nass.usda.gov/morrishansen/. Additional information will appear in the October issue.
Abstract:
Bayesian methodology is well developed and there are successful applications in many areas of substantive research. However, the use of such methodology in making inferences about finite population quantities is limited. I will describe several types of application where greater use of Bayesian methods is likely to be profitable and some where they are not. In addition, I will describe research whose successful completion should lead to improved analysis. The illustrations will come, primarily, from establishment surveys and a related area, providing public "report cards" for providers of medical care.
Abstract:
Modern research data have become increasingly complex, raising non-traditional modeling and inferential challenges. In particular, advancements in technology and computation have made recording and processing of functional data possible. Examples of functional data are time series of electroencephalographic (EEG) activity, anatomical shape, and functional MRI. The purpose of this talk is to describe statistical models for feature extraction from single-level (one or multiple functions per subject at one visit) and clustered or longitudinal (one or multiple functions per subject at multiple visits) functional data having a large number of subjects and large within- and between-subject heterogeneity. We introduce the framework and inferential tools for multilevel functional data (MFD) obtained by recording of functional characteristics at multiple visits. Though motivated by a novel experimental setting, the proposed methodology is general, with potential broad applicability to many high-throughput scientific studies. A prototypical example of MFD is the Sleep Heart Health Study (SHHS), which contains electroencephalographic (EEG) signals for each subject at two visits.
Abstract:
While there exist some nice models for the measurement process of scalar and small-scale analytical chemistry experiments, there is a lack of understanding and tools for establishing the standards and performance of high-throughput measurement systems, such as mRNA microarray measurements. An ongoing program at NIST on gene expression microarray experiments has demonstrated some potential approaches, including some performance metrics for scanner microarray measurement and the use of spike-in experiments in calibration and validation. I will describe a class of multiphase and nonlinear regression models used in these studies, and show how these general measurement models can accommodate the wide exponential range of signal variation while accounting for background error, multiplicative signal error, and instrument saturation at high intensity, and how they can be adapted to model the highly parallel and multivariate nature of modern biochemical experiments.
Abstract:
Over the past few years, microarray experiments have supplied much information about the dysregulation of biological pathways associated with various types of cancer. Many studies focus on identifying subgroups of patients with particularly aggressive forms of disease, so that we know whom to treat. A corresponding question is how to treat them. Given the treatment options available today, this means trying to predict which chemotherapeutic regimens will be most effective. We can try to predict response to chemo with microarrays by defining signatures of drug sensitivity. In establishing such signatures, we would really like to use samples from cell lines, as these can be (a) grown in abundance, (b) tested with the agents under controlled conditions, and (c) assayed without poisoning patients.
Recent studies have suggested how this approach might work using a widely-used panel of cell lines, the NCI60, to assemble the response signatures for several drugs. Unfortunately, ambiguities associated with analyzing the data have made these results difficult to reproduce. In this talk, we will discuss the steps involved in attacking response prediction, and describe how we have analyzed the data. We will cover some specific ambiguities we have encountered, and in some cases how these can be resolved. Finally, we will describe methods for making such analyses more reproducible, so that progress can be made more steadily.
For Additional Information contact Lisa Poe at the Office of Preventive Oncology (cpfpcoordinator@mail.nih.gov) or (301) 496-8640
Abstract:
One of the basic aspects of number theory concerns the divisibility of integers. Of particular interest are d(n), the number of divisors of n, and sigma(n), the sum of the divisors of n. We shall begin with a discussion of these functions and their respective generating functions. In the latter portion of the talk, we shall look at a graph-theoretic probability model used to estimate the average running time of a large class of computer programs. Seemingly out of nowhere, we will wind up back with the divisor function.
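As a concrete illustration of the two functions named in the abstract, the following minimal sketch computes d(n) and sigma(n) by trial division. The function name is hypothetical and the approach is chosen for clarity, not for speed.

```python
def divisor_count_and_sum(n):
    """Return (d(n), sigma(n)) for a positive integer n by trial division."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    count, total = 0, 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            j = n // i  # divisors come in pairs (i, n // i)
            count += 1
            total += i
            if j != i:  # do not double-count the square root of a perfect square
                count += 1
                total += j
        i += 1
    return count, total


if __name__ == "__main__":
    print(divisor_count_and_sum(12))
```

For example, 12 has the divisors 1, 2, 3, 4, 6, and 12, so d(12) = 6 and sigma(12) = 28.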
This seminar is physically accessible to persons with disabilities. For TTY callers, please use the Federal Relay Service at 1-800-877-8339. This is a free and confidential service. To obtain Sign Language Interpreting services/CART (captioning real time) or auxiliary aids, please send your requests via e-mail to EEO Interpreting & CART: eeo.interpreting.&.CART@census.gov or TTY 301-457-2540, or by voice mail at 301-763-2853, then select #2 for EEO Program Assistance.
Abstract:
Two issues of importance to forensic scientists in which statistics has a role to play will be discussed. Multivariate data occur often in forensic science, and the example used for illustration is that of the elemental composition of glass. Measurements are made of fragments of glass at a crime scene and of fragments of glass found on a suspect. An approach to the evaluation of evidence is described that takes account of variation in the measurements between different sources and within different sources.
The second issue is that of determination of the size of a sample that needs to be taken from a consignment of drugs in order to make an inferential statement about the proportion of the consignment that is illicit.
Abstract:
There are many attributes in text analysis: words, documents, bigrams, trigrams, n-grams, contextual relationships, latent semantics, and many others. This paper covers a spectral graph method for co-clustering multiple attributes at the same time. Co-clustering is very useful not only because it turns a two-step process into a one-step process, but also because it shows the relationships between different sets of attributes. This paper goes beyond normal two-mode co-clustering (i.e., words and documents) into the area of co-clustering multiple modes (i.e., words, documents, bigrams, trigrams, etc.) all at the same time.
Abstract:
This talk introduces a new method of supervised learning based on linear discrimination among the vertices of a regular simplex in Euclidean space. Each vertex represents a different category. Discrimination is phrased as a regression problem involving ε-insensitive residuals and a quadratic penalty on the coefficients of the linear predictors. The objective function can be minimized by a primal MM (majorization-minimization) algorithm that (a) relies on quadratic majorization and iteratively reweighted least squares, (b) is simpler to program than algorithms that pass to the dual of the original optimization problem, and (c) can be accelerated by step doubling. Limited comparisons on real and simulated data suggest that the MM algorithm is competitive in statistical accuracy and computational speed with the best currently available algorithms for discriminant analysis.
Note: For a complete list of upcoming seminars check the department's seminar web site: http://www.math.umd.edu/statistics/seminar.shtml.
Abstract:
Analysis of longitudinal and clustered binary data is important in biomedical research. Numerous measures of association have been proposed in the literature for the study of dependence between binary variables. These measures include correlations, odds ratios, kappa statistics and relative risks. In this talk I will discuss permissible ranges of these measures of association. Knowledge of these ranges is crucial for developing efficient estimation methods for real-life data. I will show that moment-based methods such as generalized estimating equations, which ignore these ranges, could result in misleading p-values and incorrect conclusions.
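To make the idea of a permissible range concrete, here is a minimal sketch for one of the measures listed, the correlation between two binary variables. It is not taken from the talk. It uses the standard Fréchet bounds on the joint probability P(X=1, Y=1), and the function name is hypothetical.

```python
import math


def binary_correlation_range(p1, p2):
    """Return the (min, max) correlation attainable by two Bernoulli
    variables with success probabilities p1 and p2 (each in (0, 1))."""
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("p1 and p2 must lie strictly between 0 and 1")
    # Frechet bounds on the joint probability P(X=1, Y=1).
    p11_min = max(0.0, p1 + p2 - 1.0)
    p11_max = min(p1, p2)
    # Correlation = (P(X=1, Y=1) - p1*p2) / (sd(X) * sd(Y)).
    scale = math.sqrt(p1 * (1.0 - p1) * p2 * (1.0 - p2))
    return (p11_min - p1 * p2) / scale, (p11_max - p1 * p2) / scale


if __name__ == "__main__":
    print(binary_correlation_range(0.5, 0.5))
    print(binary_correlation_range(0.1, 0.5))
```

When the marginal probabilities differ, the range is strictly narrower than [-1, 1]; for p1 = 0.1 and p2 = 0.5 the correlation cannot exceed 1/3 in absolute value.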
Abstract:
Instrumental variables (IV) regression is a method for making causal inferences about the effect of a treatment based on an observational study in which there are unmeasured confounding variables. The method requires one or more valid IVs; a valid IV is a variable that is associated with the treatment, is independent of unmeasured confounding variables and has no direct effect on the outcome. Often there is uncertainty about the validity of the proposed IVs. When a researcher proposes more than one IV, the validity of the IVs can be tested via the "overidentifying restrictions test." Although the overidentifying restrictions test does provide some information, the test has no power versus certain alternatives and can have low power versus many alternatives due to its omnibus nature. To fully address uncertainty about the validity of the proposed IVs, we argue that a sensitivity analysis is needed. A sensitivity analysis examines the impact of plausible amounts of invalidity of the proposed IVs on inferences for the parameters of interest. We develop a method of sensitivity analysis for IV regression with overidentifying restrictions that makes full use of the information provided by the overidentifying restrictions test, but provides more information than the test by exploring sensitivity to violations of the validity of the proposed IVs in directions for which the test has low power. Our sensitivity analysis uses interpretable parameters that can be discussed with subject matter experts. We illustrate our method using a study of food demand among rural households in the Philippines.
Abstract:
The speakers are co-principal investigators in the Volgenau School's Document Forensics Laboratory. One of the technologies they have developed involves the identification of the unknown writer of a questioned handwritten document from among a population of writers, who have handwriting samples in a database. They will explain how they have designed a system based on applying discriminant analysis in a novel manner to solve this handwriting identification problem. Graph theory is used to quantify handwritten characters yielding high-dimensional feature vectors capturing physical information for the characters. The statistical methodology selects and utilizes a small number of discriminating measurements from the high-dimensional feature vector. They will demonstrate the surprising writer identification power possible using very few lower-case letters of the alphabet.
Abstract:
Using data provided by the Department of Defense merged with Unemployment Insurance wage records, we first examine the effect of being called to active duty on the income of reservists and members of the National Guard. We examine how effects on reservists' income vary by income before being called to active duty as well as by the industry the reservists were employed in prior to active duty.
Furthermore, the data allow us to identify an unanticipated shock that entails both a short-run component (the effect on the reservist's income) and a long-run component in the form of increased risk of death or injury. We can then clearly identify the spouse's labor market response to this shock as well as the overall effect on family income.
In contrast to a traditional displaced worker problem, being called to active duty makes the reservists less available for household production than they were prior to being called up. If a reservist's income falls (or if expected lifetime income falls) the spouse's labor market participation may or may not increase. If a reservist's income rises when called to active duty (due to combat pay, etc.), the spouse's labor market participation will likely fall, unless the increase in income is balanced out by a lower expectation of future income. A reservist's income and the income of the reservist's family, therefore, will not necessarily move in the same direction when the reservist is called up.
Abstract:
Unit root tests in time series analysis have received considerable attention since the seminal work of Dickey and Fuller (1976). In this talk, some of the existing unit root test criteria will be reviewed. Size, power and robustness to model misspecification of various unit root test criteria will be discussed. More recent work on unit root tests where the alternative hypothesis is a unit root process will be discussed. Tests for trend stationary versus difference stationary models will be discussed briefly. Current work on unit root test criteria on random coefficient models and seasonal series will also be discussed. Examples of unit root time series and future directions in unit root hypothesis testing will be presented.
Abstract:
Mechanically active proteins are typically organized as homogeneous or heterogeneous tandems of protein domains. A large number of proteins perform important biological functions in their unfolded state. In current force-clamp atomic force microscopy (AFM), mechanical unfolding of protein tandems is studied by using constant stretching force and recording the unfolding transitions of individual domains. The main goals of these experiments are (a) to obtain the distributions of unfolding times for individual domains and (b) to probe interdomain interactions. Existing statistical methodology offers limited information gain as it ignores the complexities of the data. By the very method of AFM instrumentation, the observable quantities are the ordered forced unfolding times. Extending the existing theoretical approaches and statistical tools for the analysis of ordered unfolding transitions, and developing new ones, is the aim of this collaborative project. In this talk, order-statistics-based methodology will be presented for analyzing the unfolding times of protein tandems and for inferring the parent unfolding time distributions of individual domains from ordered unfolding times. Statistical tests for independence of the unfolding times and equality of their (parent) distributions, which use ordered data as their input, will be presented. The proposed tests will enable experimentalists and theoreticians to detect the presence of interdomain interaction. This presentation is based on a collaborative research project with biophysicists Prof. Barsegov (University of Massachusetts at Lowell) and Prof. Klimov (George Mason University).
Abstract:
The interval estimation of a binomial proportion is difficult, especially when the proportion is extreme (very small or very large). Most of the methods discussed in the literature implicitly assume simple random sampling. These interval-estimation methods are not immediately applicable to data derived from a complex sample design. Some recent papers have addressed this problem, proposing modifications for complex samples. Matters are further complicated when a one-sided coverage interval is desired. This paper provides an extensive review of existing methods for constructing coverage intervals for a binomial proportion under both simple random and complex sample designs. It also evaluates the empirical performances of different one-sided coverage intervals under both a simple random and a stratified random sample design.
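The abstract does not name the specific intervals it reviews. As one well-known example under simple random sampling only, the sketch below computes a one-sided upper Wilson score bound. The choice of the Wilson method and the function name are assumptions made for illustration, and no complex-design adjustment is attempted.

```python
import math
from statistics import NormalDist


def wilson_upper_bound(successes, n, alpha=0.05):
    """One-sided upper Wilson score bound for a binomial proportion.

    Assumes simple random sampling; the nominal coverage is 1 - alpha.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1.0 - alpha)  # one-sided normal quantile
    p_hat = successes / n
    centre = p_hat + z * z / (2.0 * n)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return (centre + half_width) / (1.0 + z * z / n)


if __name__ == "__main__":
    # Extreme case: no successes observed in 100 trials.
    print(wilson_upper_bound(0, 100))
```

Unlike the simple Wald bound, which collapses to zero when no successes are observed, this bound stays strictly positive, which is one reason score-type intervals are often preferred for extreme proportions.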
Abstract:
This paper assesses the dynamics of treatment effects arising from variation in the duration of training. We use German administrative data that have the extraordinary feature that the amount of treatment varies continuously from 1 day to 720 days (i.e. 2 years). This feature allows us to estimate a continuous dose-response function that relates each value of the dose, i.e. days of training, to the individual post-treatment employment probability (the response). The dose-response function is estimated after adjusting for covariate imbalance using the generalized propensity score, a recently developed method for covariate adjustment under continuous treatment regimes. Our results indicate an increasing dose-response function for treatments of up to 360 days, and a similarly steady decline afterwards.
Abstract:
Multimode surveys are increasingly fielded in an effort to reduce costs, increase response rates, and accelerate data collection. However, the essential survey-taking process of posing a question, formulating an answer, and communicating and recording a response occurs differently in each mode. For example, in Web and paper modes the survey presentation is visual and the respondent is solely responsible for understanding the question and providing an answer. On the other hand, in CAPI and CATI modes, the survey presentation is aural and providing an answer involves an interviewer.
This seminar reviews the concept of disparate modes. Survey modes are disparate for a survey item when they result in a different optimal question form in each mode. The intrinsic aspects of each mode are reviewed for their influence on disparity, taking into account the specific kinds of items the survey uses.
This presentation uses examples of multimode surveys conducted by Mathematica Policy Research, Inc. It reviews the methods used to investigate this topic, where and why disparity occurs, and how some kinds of items are more prone to disparate presentation across modes. It also notes that different question forms for an item across modes can result from the survey design and survey operations environment rather than from intrinsic disparity. Much of this material was presented at the International Statistical Institute conference in August 2007 in Lisbon, Portugal.
Abstract:
Calibration estimation has developed into an important field of research in survey sampling during the last decade. It is now an important methodological instrument in the production of statistics. A few national statistical agencies have developed software designed to compute calibrated weights based on auxiliary information available in population registers and other sources. However, its application in general statistics outside of survey sampling is limited. In this paper we find that the simple calibration method is a powerful tool for handling the general missing data problem when the parameters of interest are defined by unbiased estimating equations. Unlike the traditional calibration method, in which the calibrated weights do not depend on any unknown parameters, our calibration weights depend on the unknown parameters of interest and must be estimated by the calibration estimating equations. Large sample results and simulations are included. All results show that, in general, the proposed empirical likelihood calibration method produces improved estimation over its competitors. This talk is based on joint work with some of my colleagues.
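For readers unfamiliar with the traditional calibration method mentioned above, here is a minimal sketch of linear (chi-square distance) calibration with a single auxiliary variable. It is not the empirical likelihood method proposed in the talk. The function name, argument names, and the toy numbers in the usage note are assumptions chosen for illustration.

```python
def linear_calibration_weights(design_weights, x, total_x):
    """Adjust design weights d_i to w_i = d_i * (1 + lam * x_i) so that
    sum(w_i * x_i) equals the known population total of x.

    This is the classical linear calibration with one auxiliary variable;
    the weights depend only on the auxiliary data, not on unknown parameters.
    """
    weighted_total = sum(d * xi for d, xi in zip(design_weights, x))
    weighted_squares = sum(d * xi * xi for d, xi in zip(design_weights, x))
    if weighted_squares == 0:
        raise ValueError("auxiliary variable is identically zero")
    lam = (total_x - weighted_total) / weighted_squares
    return [d * (1 + lam * xi) for d, xi in zip(design_weights, x)]
```

For example, with four units each carrying design weight 10, auxiliary values 1, 2, 3, 4, and a known population total of 110, the design-weighted total is 100; the calibrated weights reproduce the total of 110 exactly.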
Abstract:
Almost all surveys are subject to some level of nonresponse. Nonresponse bias can be substantial when two conditions hold: 1) the response rate is relatively low, and 2) the difference between the characteristics of respondents and nonrespondents is relatively large. As addressed in the most recent OMB guidelines, approaches to reducing and evaluating nonresponse bias should consider both components. This presentation describes several approaches for reducing and evaluating nonresponse bias in surveys aimed at assessing adult literacy. Several bias-reduction approaches will be presented relating to data collection, weighting, and imputation for outcome-related nonresponse. In addition, an evaluation of nonresponse bias will be shown that extends the standard demographic comparison of respondents and nonrespondents to incorporate key survey estimates.
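The two conditions above correspond to the two factors in the standard deterministic expression for the bias of an unadjusted respondent mean: the nonrespondent share times the respondent-nonrespondent difference. The sketch below assumes a population split into fixed respondent and nonrespondent groups; the function name and the numbers in the usage note are hypothetical and are not taken from the presentation.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias of the unadjusted respondent mean relative to the full-population mean.

    Deterministic view: bias = (1 - response_rate) * (respondent mean - nonrespondent mean).
    """
    if not 0 <= response_rate <= 1:
        raise ValueError("response_rate must lie between 0 and 1")
    return (1 - response_rate) * (mean_respondents - mean_nonrespondents)
```

With a hypothetical 60% response rate and mean scores of 280 for respondents and 260 for nonrespondents, the respondent mean overstates the population mean by 0.4 × 20 = 8 points. The bias vanishes if either the response rate is 100% or the two groups have the same mean.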