Abstract:
The literature offers two distinct reasons for incorporating sample weights into the estimation of linear regression coefficients from a model-based point of view. Either the sample design is informative or the model is incomplete. The traditional sample-weighted least-squares estimator can be improved upon when the sample design is informative, but not when the standard linear model fails and needs to be extended.
It is often assumed that the realized sample derives from a two-phase process. In the first phase, the finite population is drawn from a hypothetical superpopulation via simple random (cluster) sampling. In the second phase, the actual sample is drawn from the finite population. Many think that the standard practice of treating the (cluster) sample as if it were drawn with replacement from the finite population is roughly equivalent to the full two-phase process. That is not always the case.
This program is physically accessible to persons with disabilities. For interpreting services, contact Yvonne Moore at TTY 301-457-2540 or 301-457-2853 (voice mail), or by e-mail at Sherry.Y.Moore@census.gov.
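As a rough illustration of the sample-weighted least-squares estimator this abstract refers to, the sketch below fits a straight line by minimizing the weighted sum of squared residuals. It is a minimal example for a single covariate, not the speaker's method; the function name `weighted_least_squares` and the data are hypothetical, and the weights are assumed to be sampling weights (for example, inverse selection probabilities).

```python
def weighted_least_squares(x, y, w):
    """Return (intercept, slope) minimizing sum_i w[i] * (y[i] - a - b * x[i]) ** 2."""
    total_weight = sum(w)
    # Weighted means of the covariate and the response.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_weight
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_weight
    # Weighted sum of squares and cross-products about the weighted means.
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample: the last two units carry larger sampling weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
w = [10.0, 10.0, 40.0, 40.0]
intercept, slope = weighted_least_squares(x, y, w)
print(intercept, slope)
```

With equal weights the function reduces to ordinary least squares, which makes it easy to see how much the weights move the fitted line.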
Abstract:
The United States Department of Transportation's Bureau of Transportation Statistics (BTS) is developing its confidentiality policy which is based on its legislative mandate (49 U.S.C. 111(i)) to protect individually identifiable information. Because the field of statistical disclosure limitation (SDL) research is still evolving, BTS wants to take advantage of the latest SDL research in updating its confidentiality policy and practices. To this end, BTS initiated a project to develop, demonstrate, and implement new, state-of-the-art SDL methods for complex, multi-dimensional (up to five) tables that contain a hierarchical structure.
After reviewing a wide variety of SDL methods described in the literature, the project team selected the Synthetic Data Substitution (SDS) method proposed by Dandekar and Cox (2002), which evaluated well for the BTS requirements. This method was subsequently enhanced to efficiently manipulate large tables. A modified version of this SDS method was implemented into prototype computer software for demonstration and testing. The explication of extremely efficient algorithms (capable of processing multi-dimensional tables with hundreds of thousands of entries) will be discussed, along with the software functionality and a demonstration of the prototype software using examples of agency tabular data.
Abstract:
USDA's National Agricultural Statistics Service (NASS) conducts hundreds of surveys annually on the nation's farmers and agribusinesses. For most surveys multiple questionnaire versions are needed to address differences in agriculture between states. The questionnaires also need to be developed for multiple data collection modes: mail, telephone/CATI, and most recently the World Wide Web. In order to efficiently create the numerous questionnaires needed for all of the survey, state and mode combinations, NASS is developing a client-server based Question Repository System (QRS). The QRS includes a user interface to build properly formatted questions for the various modes; these questions are then stored in a database. The stored questions are then retrieved and used to build questionnaires, which may be saved, printed or ported to a Web server. This seminar discusses the capabilities and some technical details of the QRS.
Abstract:
The purpose of this presentation is to document the scope of administrative records use at the Census Bureau both historically and currently, and to describe the Statistical Administrative Records System (StARS) and Administrative Records Experiment in 2000 (AREX 2000). This presentation is an introduction to these two attempts to simulate an "administrative records census", and serves as an introduction to the following presentation, which will focus on evaluation. I first describe the StARS design for 1999 and 2000, describing challenges in using administrative records data, then describe the specific aspects of the administrative records experiment. I conclude by describing how the demographic results of the StARS and AREX experiments compare to national, state, and county level Census 2000 data.
Abstract:
In December 2002, President Bush signed into law HR 2458, the E-Government Act of 2002. Title V of this Act, the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002, provides uniform safeguards to protect the confidentiality of information provided by the public for statistical purposes regardless of which agency collects the data. The speaker will give an overview of CIPSEA, describe the impact of CIPSEA on EIA, and discuss the questions that EIA sent to OMB about CIPSEA.
Dr. Kirkendall will also provide an introduction to statistical disclosure limitation methodology. OMB's Statistical Policy Working Paper (SPWP) #22, "Report on Statistical Disclosure Limitation Methodology," will serve as the foundation. The speaker chaired the subcommittee that authored SPWP #22.
Abstract:
In the early 1980s, a new approach to seasonal adjustment was suggested, namely, the ARIMA-model-based (AMB) method (Burman, 1980; Hillmer and Tiao, 1982; among others). The method consisted of, first, identifying the ARIMA model for the observed series; second, decomposing the model into (unobserved) trend-cycle, seasonal, and irregular components, which also follow ARIMA-type models; and third, estimating the components by means of the Wiener-Kolmogorov filter extended to non-stationary series.
Twenty years later, the approach seems to have come of age. The presentation will describe some of its main features and illustrate how it can answer questions relevant for the analyst, and for economy-watchers and policy-makers.
Two extensions of the approach to fields different from seasonal adjustment will also be presented. One, to business-cycle estimation, will illustrate the complementarity of "ad-hoc" and AMB filtering. The second, to quality control of data, will illustrate application of the automatic regression-ARIMA model identification procedure on a very large scale (perhaps millions of series).
Abstract:
A relatively recent statistical development is the important class of models known as generalized linear models (GLM) that was introduced by Nelder and Wedderburn (1972), and which provides under some conditions a unified regression theory suitable for continuous, binary, categorical, and count data. The theory of GLM was originally intended for independent data, but it can be extended to dependent data under various assumptions. The extension to time series will be presented accompanied by some real data examples.
Note: For a complete list of upcoming seminars check the department's seminar web site: http://www.gwu.edu/~stat/seminars/Spring2003.htm. The campus map is at: http://www.gwu.edu/Map/. The contact person is Reza Modarres at Reza@gwu.edu or 202-994-6359.
Abstract:
Dr. Goel will present results of his investigation into Bayesian and non-Bayesian strategies to improve AADT estimation by exploiting the inherent underlying correlations between link flows. These correlations arise partially because inflows and outflows to a node are always constrained. In addition, when the network has a large number of O-D zones and a relatively smaller number of links, the correlation between the link flows can be large. The traditional AADT estimation procedure ignores these correlations completely, and amounts to using an ordinary least squares estimate after adjusting the coverage counts by daily and monthly factors. Simulation results will be presented, pointing out some network scenarios under which the traditional estimates can be improved upon.
Note: This seminar will be held in a wheelchair-accessible location. Attendees who require sign language interpretation, other auxiliary aids or alternate accessible formats should advise the program coordinator at least three business days prior to the date of the seminar.
Abstract:
We present a method for estimating the mean vector from a multivariate skew distribution that includes some unobserved data below the detection limits. To estimate the mean vector and the covariance matrix we develop an EM algorithm solution and use it to maximize the likelihood. We obtain expressions for the mean vector, covariance matrix, and the asymptotic covariance of the vector of means in the original scale. The performance of the MLE method in selecting the correct power transformation and the coverage rate of the confidence region under several conditions are investigated with Monte Carlo simulation.
The Box-Cox transformation system produces the power normal (PN) family, whose members include normal and log-normal distributions. We study the moments of PN and obtain expressions for its mean and variance. The quantile functions and a quantile measure of skewness are discussed to show that the PN family is ordered with respect to the transformation parameter. The conditional distributions are studied and shown to belong to the PN family. We obtain expressions for the mean, median and modal regressions. Chebyshev-Hermite polynomials are used to obtain an expression for the correlation coefficient and to prove that correlation is smaller in the PN scale than the original scale. Frechet bounds are used to obtain expressions for the lower and upper bounds of the correlation coefficient. An algorithm is given to compute the bounds.
We also investigate the efficiency of tests after a power transformation. In particular, we consider the one-sample test of location and study the gains in efficiency for the one-sample t-test following a Box-Cox transformation. We prove that the asymptotic relative efficiency of the transformed univariate t-test and Hotelling test of multivariate location with respect to the same statistics based on untransformed data is at least one. We also study the efficiency of the correlation coefficient following a Box-Cox transformation. We prove that much stronger conclusions can be reached about the independence of the margins of bivariate normal variates once they have been transformed with a Box-Cox transformation.
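For readers unfamiliar with the Box-Cox transformation underlying the power normal family, the following minimal sketch shows the usual one-parameter form and its inverse. The function names `box_cox` and `inverse_box_cox` are illustrative and not taken from the talk; the parameter `lam` is the transformation parameter, with `lam = 0` giving the logarithm (the log-normal case) and `lam = 1` a simple shift (the normal case).

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(y)  # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    base = lam * z + 1.0
    if base <= 0:
        raise ValueError("value lies outside the range of the transformation")
    return base ** (1.0 / lam)


# Round trip on a hypothetical observation with a square-root-type transformation.
z = box_cox(4.0, 0.5)
print(z, inverse_box_cox(z, 0.5))
```

The range check in `inverse_box_cox` reflects the fact that, for nonzero `lam`, the transformed values are bounded on one side, so the transformed variable can only be approximately (truncated) normal.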
Abstract:
Population and data user attributes which should be considered in determining the most proper statistical suppression techniques will be outlined. A general overview of methods used by the National Agricultural Statistics Service and the Agricultural Marketing Service to protect tabular data will be presented. Certain methods will be highlighted and explored in more detail.
Abstract:
Racial and ethnic disparities in health care are real and unacceptable. They occur across a wide range of medical conditions and health care services, and exist independently of insurance status, income, and other access-related factors. At the level of health systems, minorities are likely to get poorer care because of several factors, including resource allocation policies that are less favorable to minorities, linguistic and cultural barriers, and the disproportionate representation of minorities in restrictive health plans. Minority patients, for a variety of historic and socioeconomic reasons, are more likely to refuse treatment, or fail to adhere to treatment due to misunderstanding or mistrust. These patient and system-level factors, however, don't fully explain the consistency of racial and ethnic gaps in treatment. Prejudice, bias, and stereotyping by providers, as well as clinical uncertainty, contribute to disparities in health care. This is a major conclusion of an expert panel of the Institute of Medicine, summarized in a report called Unequal Treatment: Confronting Racial and Ethnic Disparities in Healthcare. Brian Smedley and Adrienne Stith, program officers at the IOM, will discuss this and other conclusions of the report, along with the report's recommendations for strategies to eliminate these disparities.
Abstract:
There have been a number of conferences in the last few years focusing on improving the quality of surveys (Stockholm and Ottawa in 2001, Copenhagen in 2002). There are a great many ways to improve the quality of surveys, too many to be covered in one presentation. This keynote presentation from the International Conference on Improving Surveys will focus its discussion on three general topics: response rates, technological changes, and continuous improvement, particularly through communications. Improving response rates has been the topic of numerous papers and conferences. Surveys are being dramatically altered by changes in technology. We discuss three types of technological change: the Internet, mobile telephones, and handheld computers. Communication has not commonly been discussed, but is fundamental to successfully improving surveys.
Abstract:
Health Related Quality of Life surveys generally deal with two kinds of data: data recorded during an exploratory or validation step, in order to help with the construction (definition) of variables and indicators, and data recorded during an analysis step, in order to investigate the evolution of the distribution of the previously constructed variables across various populations, times and areas.
These are generally two well-separated steps in the research process of a scientist in the field of Health Related Quality of Life, Environment, or any other. The first step generally deals with measurement, calibration, and metrology of variables; the most used statistical methods are multivariate exploratory analysis and structural models, such as factor analysis models or item response theory models. The second step is certainly better known to inferential statisticians. Linear, generalized linear, time series, and survival methods (and models) are very useful in this step. The variables constructed in the first step are incorporated in this second step, and their joint distribution with the other analysis variables (treatment group, time, duration of life, etc.) is investigated. In this talk, I will compare the simple strategy of separating the two steps with that of defining and analysing a global model including both the measurement and the analysis steps. I will illustrate the issue with a real example in oncology, where the main goal is the analysis of the joint distribution of Survival and Quality of Life of cancer patients randomized in two treatment groups during a clinical trial.
Abstract:
The Administrative Records Experiment in 2000 (AREX 2000) was an attempt to simulate an administrative records census using data from seven federal databases, supplemented by field and processing operations. This presentation describes the results of the AREX 2000 evaluations, comparing county, tract, block, and household tallies between administrative records and Census 2000 results. The evaluations focused on two key issues important to the Census Bureau and federal program administrators: Can administrative records data be used to develop small area, intercensal estimates of the population and its composition? To what extent can administrative records data substitute for costly non-response followup operations in a decennial census?
The evaluations compare alternative enumeration methods and assess our field, processing, and imputation operations. We present tabular, model, and geospatial support that administrative records provide good estimates of Census counts at larger geographies, with greater accuracy using the "bottom-up" enumeration method. The results also suggest that administrative records addresses and households have potential use in the nonresponse followup or imputation phase of a traditional census. AREX processing deficiencies are investigated and confirm known problems in identifying selected demographic groups. Identifying these deficiencies has allowed us to revise and improve our methodology.
Abstract:
The talk describes the effect of statistical dependence on tests and confidence intervals for the parameter p, the success probability in a binomial random variable. The problem was motivated by a jury discrimination case, Moultrie v. Martin, in which half the grand jurors served a second year. Hence, the racial compositions of the grand juries in consecutive years were no longer statistically independent. The first part of the talk concentrates on the effect of the dependence on hypothesis testing. It will be shown that ignoring dependence not only made the statistical evidence of discrimination appear stronger than it truly was but also exaggerated the power of the test used to determine the possible discrimination. Both the exact distribution of the number of "successes" and its normal approximation are compared in order to provide a practical condition for the use of the approximation. The second part of the talk focuses on the effect of dependence on confidence intervals for a population proportion. When observations are dependent, even slightly, the coverage probability of virtually all the confidence intervals in the literature can deviate noticeably from their nominal level. We proposed and examined several modified confidence intervals. Our results showed that the modified Wilson interval performs well and can be recommended for general use.
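As background for the interval mentioned above, the sketch below computes the standard Wilson score interval for a binomial proportion. This is the textbook version, which assumes independent trials; the abstract does not give the form of the speakers' modification for dependent observations, so that modification is not attempted here. The function name `wilson_interval` and the default critical value `z = 1.96` (roughly a 95% level) are illustrative choices.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Standard Wilson score interval for a binomial proportion (independent trials)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    # The centre is pulled from p_hat toward 1/2.
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: 50 successes in 100 independent trials.
lower, upper = wilson_interval(50, 100)
print(round(lower, 4), round(upper, 4))
```

Unlike the simple Wald interval, the Wilson limits always stay inside [0, 1], up to floating-point rounding.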
Abstract:
This seminar demonstrates ways of enabling our data customers to freely and effectively access and analyze NASS data using a web browser connected to the Internet. We focus on using data from the 1997 Census of Agriculture to demonstrate methods of display and analysis that give a better understanding of inherent patterns and structure in the data. Specifically, we provide the ability to view, analyze, and dynamically interact with summary data at the state and county level. Historically, our customers have had access to this data in tabular form only.
We discuss the relevant concepts and technologies that we considered, and the selection of specific solutions that we currently implement on the NASS web site to give data customers the abilities discussed above. We also discuss our attempt to design a web site so that information of interest to large numbers of customers is easy to access.
While we focus on data from the 1997 Census of Agriculture, the principles and methods discussed can be applied to other sources of survey and census data.
This seminar is physically accessible to persons with disabilities. For TTY callers, please use the Federal Relay Service at 1-800-877-8339. This is a free and confidential service. Requests for sign language interpreting services or other auxiliary aids should be directed to Yvonne Moore at (301) 457-2540 text telephone (TTY), 301-763-5113 (voice mail), or by e-mail to S.Yvonne.Moore@census.gov.
Abstract:
The Office of Research Integrity of the Department of Health and Human Services recently ruled that interviewer falsification was an act of scientific misconduct. A recent meeting of survey researchers concerned with this issue reviewed the literature on interviewer falsification, and reviewed alternative personnel actions reacting to falsification. It drafted a proposed statement on current best methods for dealing with interviewer falsification. This talk reviews the ingredients of the statement and seeks input from participants.
Abstract:
Sand traps and sanity checks in sampling will be discussed from creating a sampling population file, through sample design, selection, and estimation. Samplers heavily rely upon electronic data files to produce their sampling frames but how reliable is that data? What are some simple yet effective checks to avoid pitfalls along the way? When sample results are returned to a statistician for estimation, what could have gone wrong while non-statisticians handled the data? There are errors that are far too easy to make. How can they be caught or even avoided? The Ernst & Young Quantitative Economics and Statistics Group's quality review checks for sampling engagements will be presented.
Abstract:
Disclosure limitation has often been viewed by statistical agencies solely as a mechanism for "protecting" confidentiality, and not in terms of providing data that are useful for statistical analysis. A true statistical approach to disclosure limitation needs to assess the tradeoff between preserving confidentiality and the usefulness of the released data, especially for inferential purposes. In this presentation we discuss these issues, illustrate them with some recent methods for categorical data, and describe some of the research challenges that remain.
Abstract:
An overview of methods used to protect microdata will be presented. Certain methods will be highlighted and explored in more detail as they are applied to Census Bureau microdata.
Abstract:
The "Data Quality Act" is a recent law in the United States requiring every federal agency in the U.S. Government to produce information quality guidelines. At the U.S. Department of Transportation, we wrote guidelines to comply with the Act, improve data quality, and create consistency across our data systems. In the process of doing this, we encountered some realities of implementation that we had to address if we were to have a realistic chance of achieving the data quality goals. This presentation is about some of the choices we made and the tool we used to help make the choices.
Abstract:
Collecting sensitive information on, for example, illegal immigration or elder abuse can provide important inputs to the policy process, but is challenging because of privacy issues and the potential for response bias.
The three-card method is a survey-based indirect estimation technique that assures absolute privacy of response. No one (not the interviewer, data analyst, principal investigator, or anyone else) can ever know whether a respondent is in the sensitive category, based on his or her responses. Yet when all data are combined, an estimate of the proportion of individuals in the sensitive category is possible. This method was initially designed to: 1) avoid the "mind-boggling" procedures in a randomized response interview; 2) allow follow-up questions; and 3) estimate all answer categories, including the sensitive category, for the total population and major subgroups. The three-card method was originally designed to estimate all categories of immigration status, including the sensitive illegal category. Recent developments include (1) separate estimation of visa overstays within the illegal immigrant category (as this group is of special interest in the post-9/11 environment), and (2) elder abuse (a topic of growing concern as baby boomers age).
The first paper (by Judith Droitcour) discusses the general method, estimation of visa overstays, and the variance associated with a three-card estimate. The second paper (by Nathan Anderson) will present a potential application in the area of elder abuse and plans for new work. Fritz Scheuren will discuss the presentations.
Abstract:
This seminar describes how statistical disclosure limitation methods are implemented at the National Center for Health Statistics. The roles that a Disclosure Review Board and a Confidentiality Officer play in a Federal statistical agency will be highlighted.
Abstract:
Hoover and Perez (1999, Econometrics Journal) advocate a constructive approach to data mining. The current paper identifies four pejorative senses of data mining and shows how Hoover and Perez's approach counters each. To assess the benefits of constructive data mining, the current paper applies a data mining algorithm (PcGets) similar to Hoover and Perez's to a dataset for Venezuelan consumers' expenditure. The selected model is economically sensible and statistically satisfactory; and it illustrates how data can be highly informative, even with relatively few observations. Limitations to algorithmically based data mining provide opportunities for the researcher to contribute value added in the empirical analysis.
Abstract:
The Census Bureau has used ethnographic research for over thirty years to provide an understanding of unit and item non-response and of data error in the decennial census and in demographic surveys. The methods used comprise a wide range of qualitative techniques. Participant observation has been used in some studies, but in-depth interviewing has been used more frequently, often combined with a variety of other techniques, including card sorts, vignettes, focus groups and debriefings. Probes about specific wording or concepts from surveys, using techniques appropriate to cognitive interviewing, are also frequently included. The latter demonstrates that the border between ethnographic research and cognitive pre-testing is not clear-cut. Both ethnographic research and cognitive interviewing are forms of qualitative research that are aimed at understanding the response process and use a mixture of observational and interviewing techniques.
In this presentation I will discuss methodological criteria for evaluating the different qualitative techniques that are commonly used. I will discuss how different techniques yield different results and why this matters. One of the main criteria I will discuss is 'ecological validity' which refers to the similarity or commonality (in relevant respects) between the research context (such as the interview situation, the context of observation, etc.) and the research topic (i.e., in this case, the response process). I will show how different techniques can be ranked according to this criterion and to other methodological criteria (such as reliability and generalizability) and practical concerns (such as cost-effectiveness). Finally I will address the trade-offs involved in balancing these practical and methodological criteria.
Abstract:
Household survey data are often used to estimate the participation of different population groups (e.g., preschool and school-age children, adults) in a wide range of programs and activities. Yet, quite often, the validity of the survey data has not been directly assessed because of the cost of conducting studies to verify survey responses and the difficulties of implementing such studies. This paper examines two approaches for evaluating the quality of household survey data. Both approaches use a follow up survey with the individual or organization identified by the household respondent as providing a program or service. One approach uses a follow up survey in combination with an on-line directory of service providers built from administrative records data.
The two approaches were implemented by the U.S. Department of Education, National Center for Education Statistics in its newest longitudinal study of young children - the Early Childhood Longitudinal Study. The paper will use the findings from these approaches 1) to assess the quality of the data provided by household respondents and 2) to evaluate the promising features and shortcomings of each approach for verifying household survey responses. The implications of these findings for other surveys will be discussed.
Abstract:
In 1998, a consortium of 12 Federal statistical agencies in collaboration with the Methodology, Measurement and Statistics Program, National Science Foundation, and with the support of the Federal Committee on Statistical Methodology initiated a grants program to fund basic survey and statistical research oriented to the needs of Federal agencies. Reports of the principal investigators of the 4 research projects funded during cycle 1 of the Program in 1999 were featured at a first Funding Opportunity Seminar held in Washington during June 2001.
The Second Funding Opportunity Seminar will feature the reports of principal investigators of the 4 projects that were funded in 2001, cycle 2 of the program: 1. "Bayesian Methodology for Disclosure Limitation and Statistical Analysis of Large Government Surveys" by Rod Little and Trivellore Raghunathan; 2. "Visual and Interactive Issues in the Design of Web Surveys" by Roger Tourangeau, Mick Couper, Reginald Baker, and Fred Conrad; 3. "Robust Small Area Estimation Based on a Survey Weighted MCMC Solution for the Generalized Linear Mixed Model" by Ralph Folsom and Avinash Singh; and 4. "Small Area and Longitudinal Estimation Using Information from Multiple Surveys" by Sharon Lohr. Federal agency statisticians and survey methodologists will be discussants at each session.
There will be 3 morning and 3 afternoon sessions. The Introductory Session will be "The Origins of the Funding Opportunity", and the Concluding Session will be "The Benefits and Challenges of the Funding Opportunity."
There will be a continental breakfast, and refreshments at midmorning and afternoon breaks so we need preliminary counts of the number of attendees. Please call Pat Drummond at 301-458-4193 if you plan to attend.
If planning to attend, contact Pat Drummond by May 5, 2003:
Summary:
This presentation will provide an overview of the Planning Database (PDB). The PDB and associated Hard-to-Count Scores were successfully used in the planning, implementation, and evaluation of Census 2000. We will give some specific examples of PDB applications for 2000. The PDB has proved to be a highly effective targeting tool and this capability can be exploited in ongoing Census Bureau programs, including planning for the 2010 Census. First on the agenda is updating the PDB with Census 2000 results.
Background:
Using 1990 census data at the tract level, the PDB assembled a range of housing, demographic, and socioeconomic variables that are correlated with nonresponse and undercounting. The database provided a systematic way to identify potentially difficult-to-enumerate areas that were flagged for special attention in Census 2000. The PDB was provided to all regional offices and Local Census Offices in Census 2000, and the "Hard-to-Count" scores were used in planning the areas to place Questionnaire Assistance Centers and Be Counted Forms. The PDB was used for other purposes also, such as the "real-time" demographic analysis of mail response rates during the critical mail phase in 2000. We also illustrated the potential of the PDB for targeting areas with concentrations of non-English speakers and profiling the specific languages.
The variables included in the Planning Database were guided by extensive research conducted at the Census Bureau and by other researchers to measure the undercount and to identify reasons why people are missed. These variables include housing indicators (percent renters, multiunits, crowded housing, lack of telephones, vacancy), person indicators (poverty, not high school graduate, unemployed, complex household, mobility, language isolation), and other operational and demographic data (such as nonresponse rates and race/ethnic distributions). The PDB contains Hard-to-Count (HTC) scores which provide a systematic way to summarize the attributes of each tract in terms of enumeration difficulty. A set of algorithms is used to derive the HTC score. The comparative standing of areas provides indicators of the degree of difficulty: areas with the highest scores are likely to be the areas with relatively high nonresponse and undercount, while areas with the lowest scores are likely to be areas with low rates. The high correlation of HTC scores and nonresponse rates is empirically illustrated for both 1990 and 2000.
In short, the PDB and associated HTC scores provided good predictions of difficult-to-enumerate areas in 2000. Our vision is an innovative PDB which would merge census, survey/ACS, and administrative data to provide a current and highly defined targeting database for use in ongoing surveys and 2010 census planning.
This seminar is physically accessible to persons with disabilities. For TTY callers, please use the Federal Relay Service at 1-800-877-8339. This is a free and confidential service. Requests for sign language interpreting services or other auxiliary aids should be directed to Yvonne Moore at (301) 457-2540 text telephone (TTY), 301-763-5113 (voice mail), or by e-mail to Sherry.Y.Moore@census.gov.
Abstract:
The activities of CDAC will be described. The presentation will highlight products created by CDAC members, including its "Checklist on the Disclosure Potential of Proposed Data Releases" and the auditing software for tabular products.
Abstract:
A number of different methods for sampling rare populations have been considered, but in random digit dial (RDD) telephone surveys these methods usually are costly or have serious statistical problems. In the 2001 California Health Interview Survey (CHIS), separate estimates were desired for the following rare populations: five Asian subgroups (Asian Indian, Cambodian, Japanese, Korean, and Vietnamese), American Indian and Alaska Natives, and Latinos in a particular county. Each of the subgroups posed operational obstacles, including interviewing language and culturally appropriate interviewing techniques. This paper describes the methods used to oversample these groups, the operational procedures used to deal with interviewing members of these groups, and the statistical estimation schemes that were used to provide estimates for each group. The key idea is to supplement the RDD sample with samples drawn from special lists. We evaluate the efficiency of the method and related procedures and provide some general suggestions for oversampling rare groups in RDD surveys.
Abstract:
We consider the problem of multi-step ahead prediction in time series analysis using nonparametric smoothing techniques. Forecasting is always one of the main objectives in time series analysis. Research has shown that nonlinear time series models have certain advantages in multi-step ahead forecasting. Traditionally, nonparametric k-step ahead least squares prediction for nonlinear AR(d) models is done by forecasting X_{t+k} via nonparametric smoothing of X_{t+k} on the variables (X_{t},...,X_{t-d+1}) directly. In this paper we propose a multi-stage nonparametric predictor. We show that the new predictor has smaller asymptotic mean squared error than the direct smoother, though the convergence rate is the same. Hence, the proposed predictor is more efficient. Some simulation results, advice for practical bandwidth selection and a real data example are provided.
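For orientation, the direct smoother that the abstract calls traditional can be sketched in a few lines. This is only an illustration of the baseline, not the proposed multi-stage predictor, whose construction the abstract does not give. The Gaussian product kernel, the single shared bandwidth, and the function name `direct_k_step_forecast` are assumptions made for the example.

```python
import math


def direct_k_step_forecast(series, d, k, bandwidth):
    """Nadaraya-Watson estimate of E[X_{t+k} | X_t, ..., X_{t-d+1}],
    evaluated at the most recent lag vector of `series`.

    Assumptions (not from the abstract): Gaussian product kernel and one
    common bandwidth for all d lags.
    """
    n = len(series)
    if d < 1 or k < 1 or n < d + k:
        raise ValueError("need at least d + k observations")
    target = series[n - d:]  # latest lag vector, oldest value first
    num = 0.0
    den = 0.0
    # t indexes the last element of each historical lag vector;
    # the response paired with it is series[t + k].
    for t in range(d - 1, n - k):
        lags = series[t - d + 1:t + 1]
        dist2 = sum((a - b) ** 2 for a, b in zip(lags, target))
        weight = math.exp(-0.5 * dist2 / bandwidth ** 2)
        num += weight * series[t + k]
        den += weight
    if den == 0.0:
        raise ValueError("all kernel weights underflowed; increase bandwidth")
    return num / den
```

Because the forecast is a weighted average of observed values, it always lies between the smallest and largest value in the series; bandwidth choice in practice is the subject of the talk and is not addressed here.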
Abstract:
While there has been considerable research empirically quantifying and simulating the role of borrowing constraints on homeownership rates, the primary focus of this work has been on measuring the relative importance of income and wealth constraints with respect to ownership outcomes. A lack of data on household credit ratings has precluded evaluation of credit quality as a potential barrier to homeownership. This paper overcomes the data problem by deriving a pseudo credit score for each respondent in the Survey of Consumer Finances. This is accomplished utilizing a separate, special sample of individual credit records from which we develop a score imputation equation. Thus, we empirically estimate tenure outcome equations including estimates of household credit quality along with other financial constraints to advance our understanding of how and why such constraints matter in homeownership.
The role of financing constraints also is of interest to academic researchers and policy analysts seeking to understand recent homeownership trends and design policies that may influence future trends. Although homeownership rates increased over the 1990s (from 64% to an historic high of 67%), there is policy interest in further expanding access to homeownership. A second contribution of this paper is to examine the changing role of financial constraints over time, drawing inferences about the possible impact of recent institutional changes in the mortgage market.
This year's Washington Statistical Society President's Invited Address is in memory of Charles (Chip) Alexander Jr. It will start at 3:00 p.m. and end at 4:30 p.m. on Wednesday, June 25, 2003. The venue is the Bureau of Labor Statistics Conference Center Rooms 1 and 2. A reception will follow.
The speakers are Graham Kalton of Westat and Cynthia Z.F. Clark of the U.S. Census Bureau. The title of Dr. Kalton's presentation is "Small Domain Estimates: Challenges and Solutions" and the title of Cynthia Z.F. Clark's presentation is "Tribute to Charles (Chip) Alexander Jr.: Chip's Contributions to the Federal Statistical System and Sample Survey Methods." The chair is Nancy M. Gordon of the U.S. Census Bureau and the organizer is Alan R. Tupek of the U.S. Census Bureau.
The abstract for the presentation on small domain estimates is as follows:
The continually increasing demand for timely estimates for small geographic and other domains presents survey statisticians with significant challenges. This talk will review possible solutions, including censuses and large-scale surveys, rolling samples, combining data across time and across surveys, methods for oversampling small domains, and statistical modeling methods.
Abstract:
Do different electronic questionnaire designs affect data accuracy and respondent burden? This seminar presents results from a mock-business survey experiment conducted jointly by the U.S. Census Bureau's Usability Laboratory and the University of Maryland's Laboratory for Automation Psychology and Decision Processes. In this experiment, we compared respondent accuracy and burden for different questionnaire designs within an electronic survey.
We investigated the following design issues:
Each of these issues arises in designing actual Census Bureau economic surveys and censuses. We will discuss the findings and their implications for designing electronic questionnaires.
Abstract:
Colin Mallows' paper "Parity: Implementing the Telecommunications Act of 1996," which just appeared with discussion in Statistical Science (2003), mentions that the implementation of the Telecommunications Act of 1996 has given rise to many challenging statistical questions. The Mallows article then goes on to describe how several of these problems were successfully attacked. As is often the case in hard real-life situations, however, many issues remained open and have continued to be given attention.
One of the continuing issues is the possibility that aggregation over service subgroups can lead to increased heterogeneity and may, as a result, mask potentially important differences in performance. By heterogeneity we mean a systematic tendency for relative performance to be better for one subset of transactions than for another subset. In this talk we focus on methodological issues and provide novel analytic and graphical displays that allow the issue of performance heterogeneity to be addressed in a telecommunication context.
The analysis results are clearly presented in "interval plots." These plots are carefully designed to expose when heterogeneity and masking are present and the consistency with which they occur from month to month. We see clearly how visualization can help in data analysis. We also present ways to visually demonstrate the variability in the data.
Abstract:
A number of federal agencies have changed the way they ask about race and ethnicity in ongoing and new surveys of children and families. These changes are in response to new standards issued by OMB's Office of Information and Regulatory Affairs. The Early Childhood Longitudinal Study, Kindergarten Class of 1998-99, which began tracing the educational experiences and outcomes of a national sample of kindergartners in the fall of 1998, asks parents to report separately about their own race and ethnicity, as well as their children's race and ethnicity, using one or more racial designations. This paper examines some of the following topics:
Abstract:
The role that RDCs play in providing access to confidential data at federal statistical agencies is discussed. This seminar will review the practices of two agencies that have RDCs. The first speaker will describe the development of the RDCs at the Census Bureau. Then the RDC at NCHS will be described, and the agency's remote access procedures for confidential data will be highlighted.
Abstract:
This talk describes a simple method for settings where one has clustered data, but statistical methods are only available for independent data. We assume the statistical method provides us with a normally distributed estimate and an estimate of its variance. We randomly select one data point from each cluster and apply our statistical method to this independent data. We repeat this multiple times, and use the average of the estimates as our overall estimate. An estimate of the variance is given by the average of the variance estimates minus the sample variance of the estimates. We call this procedure multiple outputation, as all "excess" data within each cluster is thrown out multiple times. Hoffman, Sen, and Weinberg (2001) introduced this approach for generalized linear models when the cluster size is related to outcome. In this talk we demonstrate the broad applicability of the approach. Applications to angular data, p-values, vector parameters, Bayesian inference, genetics data, and random cluster sizes are discussed.
In addition, asymptotic normality of estimates based on all possible outputations as well as a finite number of outputations is proven given weak conditions.
Multiple outputation provides a simple and broadly applicable method for analyzing clustered data. It is especially suited to settings where methods for clustered data are impractical, but can also be applied generally as a quick and simple tool.
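The procedure described above is short enough to sketch directly. The following is a minimal illustration, assuming a scalar estimate; the function names, the default number of outputations, and the sample-mean estimator used as the "method for independent data" are choices made for the example rather than details from the talk.

```python
import random
import statistics


def mean_and_variance(sample):
    """Example method for independent data: the sample mean and the
    estimated variance of that mean (an illustrative choice)."""
    return statistics.fmean(sample), statistics.variance(sample) / len(sample)


def multiple_outputation(clusters, estimator, n_outputations=500, seed=0):
    """Average an independent-data estimator over random one-per-cluster samples.

    clusters: list of non-empty lists of observations.
    estimator: callable returning (estimate, variance_estimate) for a sample.
    """
    rng = random.Random(seed)
    estimates = []
    variances = []
    for _ in range(n_outputations):
        # Keep one randomly chosen observation per cluster and "output" the rest.
        sample = [rng.choice(cluster) for cluster in clusters]
        estimate, variance = estimator(sample)
        estimates.append(estimate)
        variances.append(variance)
    overall = statistics.fmean(estimates)
    # Average of the variance estimates minus the sample variance of the estimates.
    overall_variance = statistics.fmean(variances) - statistics.variance(estimates)
    return overall, overall_variance


# Small made-up data set: four clusters of unequal size.
example_clusters = [[1.0, 1.2, 0.9], [2.1, 2.0], [3.3], [0.5, 0.7, 0.6, 0.4]]
example_estimate, example_variance = multiple_outputation(
    example_clusters, mean_and_variance
)
```

With a small number of outputations the subtraction can produce a negative variance estimate, so in practice one would increase `n_outputations` until the result stabilizes.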
This work is done jointly with Michael Proschan and Eric Leifer.
Abstract:
Continuing advancements in technology have the potential to improve the efficiency and quality of survey data collection procedures and operations. In this presentation, RTI staff will discuss ways in which they have operationalized some recent technological advancements to enhance current procedures in survey research. Specifically, they will focus on the use of digital recordings to develop Computer Audio Recorded Interviewing (CARI) for use in field verification, Global Positioning Systems (GPS), and the development of automated testing procedures for Computer-Assisted Instruments (CAI). Examples of how these technologies have been incorporated into RTI's survey operations, results from initial studies, as well as plans for continuing research will be discussed. There will be time designated for questions and answers on these topics as well as other technologies and methodologies currently being used or developed by RTI.
Abstract:
We propose a method termed 'MASSC' for statistical disclosure limitation (SDL) of categorical or continuous micro data, while limiting the information loss in the treated database defined in a suitable sense. The new SDL methodology exploits the analogy between (1) taking a sample (instead of a census), along with some adjustments via imputation for missing information, and (2) releasing a subset, instead of the original data set, along with some adjustments via perturbation for records still at disclosure risk. Survey sampling reduces monetary cost in comparison to a census, but entails some loss of information. Similarly, releasing a subset reduces disclosure cost in comparison to the full database, but entails some loss of information. Thus, optimal survey sampling methods for minimizing cost subject to bias and precision constraints can be used for SDL in providing simultaneous control on disclosure cost and information loss. The method consists of steps of Micro Agglomeration for partitioning the database into risk strata, optimal probabilistic Substitution for perturbation, optimal probabilistic Subsampling for suppression, and optimal sampling weight Calibration for preserving estimates for key outcomes in the treated database.
The proposed method represents a paradigm shift in the practice of disclosure limitation in that the original database itself is viewed as the population and the problem of disclosure by inside intruders is considered. (Inside intruders, in contrast to outside intruders, know that their targets are present in the database.) This new framework has two main features: first, it focuses on the more difficult problem of protecting against inside intruders and, as a result, also protects against outside intruders; second, it provides, in a suitable sense, model-free measures of both information loss and disclosure risk when disclosure treatment is performed by employing known random selection mechanisms for substitution and subsampling. Empirical results are presented to illustrate computation of measures of information loss and the associated disclosure risk for a small data set.
Abstract:
Standard methods for statistical disclosure limitation (SDL) in tabular data either abbreviate, modify or suppress from publication the true (original) values of tabular cells. All of these methods are based on satisfying an analytical rule selected by the statistical office to distinguish cells and cell combinations exhibiting unacceptable risk of disclosure (the sensitive cells) from those that do not. The impact of these SDL methods on data analytic outcomes is not well-studied but can be shown to be subtle or severe in particular cases. Dandekar and Cox (2002) introduced a method for tabular SDL called controlled tabular adjustment (CTA). CTA replaces the value of each cell failing the analytical rule by a safe value, viz., a value satisfying the rule, and then uses linear programming to adjust the values of the nonsensitive cells to restore additivity of detail to totals throughout the tabular system. The linear programming framework allows adjustments to be selected so as to minimize any of a variety of linear measures of overall distortion to the data, e.g., total of absolute adjustments, total percent of absolute adjustments, etc. Cox and Dandekar (2003) provide further techniques for preserving data quality. While worthwhile, none of these techniques directly addresses the overarching issue: Will statistical analysis of original and disclosure-limited data sets yield comparable results? We provide a mathematical programming framework and algorithms, introduced in Cox and Kelly (2003), that begin to address this issue. Specifically, we demonstrate how to approximately preserve mean values, variances and correlations when original data are subjected to CTA, and how to ensure an approximately intercept=zero, slope=one simple linear regression between original and adjusted data.
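The linear programming step of CTA described above can be written out for the "total of absolute adjustments" criterion. The notation here is assumed for illustration and is not taken from the cited papers: a_i is the original value of cell i, S is the set of sensitive cells, s_i is the safe value chosen for a sensitive cell, M is the matrix encoding the additive relations of the tabular system (so that M a = 0), y_i^+ and y_i^- are the upward and downward adjustments to a nonsensitive cell, and U_i is a cap on the size of an adjustment.

```latex
\begin{aligned}
\min_{y^{+},\,y^{-}} \quad & \sum_{i \notin S} \left( y_i^{+} + y_i^{-} \right) \\
\text{subject to} \quad & M x = 0, \\
& x_i = s_i, && i \in S, \\
& x_i = a_i + y_i^{+} - y_i^{-}, && i \notin S, \\
& 0 \le y_i^{+},\, y_i^{-} \le U_i, && i \notin S.
\end{aligned}
```

Weighting each term of the objective by 1/a_i (for a_i > 0) gives the "total percent of absolute adjustments" criterion instead.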
Abstract:
The U.S. Census Bureau has developed SPEER software that applies the Fellegi-Holt editing method to economic establishment surveys under ratio edits and a limited form of balancing. If implicit edits are available, then Fellegi-Holt methods have the advantage that they determine the minimal number of fields to change (error localize) so that a record satisfies all edits in one pass through the data. In most situations, implicit edits are not generated because the generation requires days to months of computation. In some situations, when implicit edits are not available, Fellegi-Holt systems use pure integer programming methods to solve the error localization problem directly and slowly (1-100 seconds per record). With only a small subset of the needed implicit edits, the current version of SPEER (Draper and Winkler 1997, upwards of 1000 records per second) applies ad hoc heuristics that find error-localization solutions that are not optimal for as many as five percent of the edit-failing records. This talk will have two parts. In the first part we will describe new SAS® software and corresponding methodology for generating the complete set of implicit ratio edits for a given set of explicit ratio edits. The new software implements a shortest path algorithm and borrows ideas from the Generate Edits portion currently used in the Census Bureau's Plain Vanilla Ratio Module. In the second part of this talk we present recent modifications to the SPEER editing system that maintain its exceptional speed and do a better job of error localization. The new SPEER uses the Fourier-Motzkin elimination method to generate a large subset of the implied edits prior to error localization. We describe the theory, computational algorithms, and results from evaluating the feasibility of this approach.
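To see why a shortest path algorithm can generate implicit ratio edits, note that the edits x_i / x_j <= u_ij and x_j / x_k <= u_jk together imply x_i / x_k <= u_ij * u_jk. The sketch below applies a Floyd-Warshall pass in multiplicative form to find the tightest implied bounds. It is an illustration of the idea only and is not the Census Bureau's software; the function name, the input layout, and the requirement of strictly positive bounds are assumptions.

```python
import math


def implied_ratio_bounds(n_fields, explicit_edits):
    """Tightest bounds implied by a set of explicit ratio edits.

    explicit_edits: dict mapping (i, j) to (lower, upper), read as
        lower <= x_i / x_j <= upper, with 0 < lower <= upper.
    Returns a dict mapping (i, j), i < j, to the implied (lower, upper)
    for every pair of fields linked by some chain of edits.
    """
    n = n_fields
    # upper[i][j] is the best known upper bound on x_i / x_j.
    upper = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        upper[i][i] = 1.0
    for (i, j), (low, high) in explicit_edits.items():
        if not 0 < low <= high:
            raise ValueError("bounds must satisfy 0 < lower <= upper")
        upper[i][j] = min(upper[i][j], high)
        upper[j][i] = min(upper[j][i], 1.0 / low)  # x_j / x_i <= 1 / lower

    # Floyd-Warshall, with products of bounds in place of sums of lengths.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = upper[i][m] * upper[m][j]
                if via < upper[i][j]:
                    upper[i][j] = via

    # A bound x_i / x_i < 1 means the explicit edits contradict each other.
    if any(upper[i][i] < 1.0 for i in range(n)):
        raise ValueError("explicit ratio edits are inconsistent")

    return {
        (i, j): (1.0 / upper[j][i], upper[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if math.isfinite(upper[i][j]) and math.isfinite(upper[j][i])
    }
```

For example, the explicit edits 0.5 <= x_0 / x_1 <= 2 and 1 <= x_1 / x_2 <= 4 yield the implicit edit 0.5 <= x_0 / x_2 <= 8.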