Data Concepts 101
Types of Data
Forsyth Futures Making Progress reports and special studies generally use two different types of data: total count data and survey data.
Total Count Data
Some of the data used in the Making Progress reports and special issue studies is an actual count of every person or event being studied, usually drawn from administrative records. The data used for Making Progress: Infant Health is a good example of this kind of data because it is based on a count from the state’s records of every birth that occurred in a given year. The decennial census conducted by the US Census Bureau, in which every single resident is counted, is also an example of this type of data.
Sample data, also referred to here as survey data, is different from total count data because instead of counting every single person or event, a smaller sample is measured and used to make estimates about the larger population. A good example of this type of data is data from the American Community Survey, which replaced the “long form” of the Census and is now sent out to a small group of households every year. In the American Community Survey, the Census Bureau randomly selects a sample of households in the community each year and uses those households to make estimates about things like the poverty rate and education levels for the whole community.
Variance is a measure of how spread out the numbers in a particular dataset are. Higher variance means that the data in that dataset are more spread out than in datasets with lower variance, and a variance of zero would indicate that all of the numbers in a dataset were exactly the same. For some datasets, different types of diversity in our community result in measurement variance. For example, Forsyth County 11th grade students earned a wide range of scores on the ACT; some students scored close to the average score while others scored much higher or lower. Variance also occurs in sample data, and high levels of variance among survey respondents can make survey estimates less precise.
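As a minimal sketch of the idea, the Python snippet below computes variance as the average squared distance of each value from the mean. The function name and both lists of scores are hypothetical and invented for illustration; they are not actual ACT results.

```python
def population_variance(values):
    """Average squared distance of each value from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical scores, invented for illustration (not actual ACT results).
spread_out_scores = [12, 16, 20, 24, 28]
identical_scores = [20, 20, 20, 20, 20]

print(population_variance(spread_out_scores))  # 32.0
print(population_variance(identical_scores))   # 0.0
```

The second list shows the zero-variance case described above: every number is exactly the same, so there is no spread to measure.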
Standard error is a measurement that statisticians use to measure variance, with higher values indicating more variance and lower values indicating less. As is the case with variance, a high standard error makes estimates based on sample data less precise.
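The text does not say which standard error formula is used, so the sketch below assumes one common form: the standard error of a sample mean, calculated as the sample standard deviation divided by the square root of the sample size. The function name and the survey responses are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical survey responses, invented for illustration.
small_sample = [12, 16, 20, 24, 28]
# The same spread of answers, but from four times as many respondents.
large_sample = small_sample * 4

print(standard_error_of_mean(small_sample))
print(standard_error_of_mean(large_sample))  # smaller than the value above
```

Under this assumed formula, the larger sample produces a smaller standard error even though the answers are just as spread out, which is one reason larger samples tend to give more precise estimates.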
Coefficient of Variation
Coefficient of variation (CV) is a measurement that statisticians use to compare the variance in an estimate to the size of the estimate itself, and it is calculated using the standard error. Estimates with high CVs are less reliable than estimates with low CVs. The table below shows the levels of reliability that Forsyth Futures assigns data by CV. For the Making Progress reports and special issue studies, Forsyth Futures generally only uses estimates with CVs below 12 and indicates when the CV for an estimate is between 12 and 30. Forsyth Futures does not use estimates with CVs higher than 30 in the Making Progress reports.
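The sketch below illustrates how a CV might be calculated and sorted into the ranges described above. It assumes the CV is expressed as a percentage (the standard error divided by the estimate, times 100), which is consistent with the thresholds of 12 and 30. The function names, the labels, the treatment of values exactly equal to 12 or 30, and the example figures are all hypothetical.

```python
def coefficient_of_variation(estimate, standard_error):
    """CV as a percentage: standard error relative to the size of the estimate."""
    return 100 * standard_error / estimate


def reliability_label(cv):
    """Hypothetical labels mirroring the CV thresholds described in the text."""
    if cv < 12:
        return "used"
    if cv <= 30:
        return "used, with the CV range indicated"
    return "not used"


# Hypothetical estimates and standard errors, invented for illustration.
for estimate, standard_error in [(1000, 50), (200, 40), (100, 40)]:
    cv = coefficient_of_variation(estimate, standard_error)
    print(estimate, standard_error, cv, reliability_label(cv))
```

The three hypothetical estimates have CVs of 5, 20, and 40, so they fall into the three ranges in turn.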
When using survey data, researchers cannot be 100% sure what the measure they are studying would look like if they could count every person or event in the entire community. However, researchers can use statistics to create a range in which they are relatively certain that the true count would fall. In reports using sample data, Forsyth Futures uses a 95% confidence interval as that range, which means that analysts are 95% sure that the true number for the whole community falls within that range. For example, when using data from the American Community Survey to estimate the number of Forsyth County residents living below the poverty threshold, researchers cannot know the exact number of residents in poverty, but they can be 95% certain that the actual number falls within a specific range.
Total count data does not have uncertainty introduced by how a sample was chosen, but some variation is still introduced by random chance. For example, whether a low-birthweight infant is born in December of one year or January of the next is a matter of chance, but it could have a slight effect on the rate of low-birthweight births for that year.
Typically, rarer events and community characteristics are more sensitive to these random fluctuations than more common ones. For example, because the number of suicides in Forsyth County is much lower than the number of babies born at a low birthweight, a fluctuation of one or two suicides a year due to random chance can have a bigger impact on the suicide rate than a random fluctuation of one or two low-birthweight births has on the low-birthweight rate.
When using total count data, Forsyth Futures analysts calculate a 95% confidence interval for each statistic. This confidence interval represents the estimated range of natural fluctuation around the statistic. For example, in 2014 Forsyth County had a low-birthweight rate of 9.54% with a 95% confidence interval of 8.65%-10.44%. At face value, this appears to be a decrease from the 2013 rate of 10.39%, but because the 2013 rate falls within the confidence interval of the 2014 rate, analysts cannot be at least 95% sure that the difference is not due to random chance.
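The method used to calculate these intervals is not stated here, so the sketch below assumes a simple normal approximation for a rate. It illustrates two points from the discussion above: a rare event has a much wider interval relative to its rate than a common event, and the 2013 rate quoted above falls inside the 2014 interval. The function names and the event counts are hypothetical; only the 8.65, 10.44, and 10.39 figures come from the text.

```python
import math


def rate_confidence_interval(events, total, z=1.96):
    """Normal-approximation 95% interval for a rate (an assumed method)."""
    rate = events / total
    margin = z * math.sqrt(rate * (1 - rate) / total)
    return rate - margin, rate + margin


def relative_margin(events, total, z=1.96):
    """Half-width of the interval as a share of the rate itself."""
    rate = events / total
    lower, upper = rate_confidence_interval(events, total, z)
    return (upper - lower) / 2 / rate


# Hypothetical counts: a common event and a rare event in the same population.
print(relative_margin(500, 5000))  # common event, 10% rate
print(relative_margin(5, 5000))    # rare event, 0.1% rate: much larger share

# Figures quoted in the text: is the 2013 rate inside the 2014 interval?
lower_2014, upper_2014 = 8.65, 10.44
rate_2013 = 10.39
print(lower_2014 <= rate_2013 <= upper_2014)  # True
```

The normal approximation is known to be rough for very small counts, so other interval methods may be more appropriate for rare events; this sketch is only meant to show the general pattern.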
Margin of Error
The difference between the estimate of a measure and the upper or lower bound of its confidence interval is called the margin of error, which Forsyth Futures also reports at the 95% confidence level.
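As a rough illustration of how the margin of error relates to the confidence interval, the sketch below assumes the common normal-based calculation in which the 95% margin of error is 1.96 times the standard error. The function names, the multiplier, and the example estimate are assumptions for illustration, not a description of the exact calculation Forsyth Futures uses.

```python
def margin_of_error(standard_error, z=1.96):
    """Margin of error at the 95% confidence level (normal-based assumption)."""
    return z * standard_error


def confidence_interval(estimate, standard_error, z=1.96):
    """Lower and upper bounds: the estimate minus and plus the margin of error."""
    moe = margin_of_error(standard_error, z)
    return estimate - moe, estimate + moe


# Hypothetical estimate and standard error, invented for illustration.
estimate, standard_error = 50_000, 2_000
moe = margin_of_error(standard_error)
lower, upper = confidence_interval(estimate, standard_error)
print(round(moe), round(lower), round(upper))  # 3920 46080 53920
```

In this hypothetical case, the estimate of 50,000 would be reported with a margin of error of about 3,920, giving an interval from about 46,080 to 53,920.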
When Forsyth Futures makes comparisons between communities or across years, it conducts statistical tests to make sure that analysts are at least 95% sure that the observed differences are a result of actual differences between the two measures, not a result of random chance or the sampling method used. Statisticians refer to these kinds of differences as statistically significant differences. This is why readers may notice that Forsyth Futures reports refer to communities as having similar measures even when the estimates for those measures appear different at face value. Forsyth Futures only identifies measures as different if there is a statistically significant difference between them at the 95% confidence level.
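The specific statistical tests are not named here. As one possible illustration, the sketch below uses a simple z-test for the difference between two independent estimates, which compares the size of the difference to the combined standard error. The choice of test, the function names, and the poverty-rate figures are all assumptions made for this example.

```python
import math


def z_statistic(estimate_a, se_a, estimate_b, se_b):
    """Difference between two independent estimates in standard-error units."""
    return (estimate_a - estimate_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def is_significant(estimate_a, se_a, estimate_b, se_b, critical_value=1.96):
    """True when the difference is significant at the 95% confidence level."""
    return abs(z_statistic(estimate_a, se_a, estimate_b, se_b)) > critical_value


# Hypothetical poverty-rate estimates (in percent) with standard errors.
print(is_significant(18.0, 1.0, 15.0, 1.0))  # True
print(is_significant(18.0, 1.0, 16.0, 1.0))  # False
```

In the second hypothetical comparison the estimates differ by two percentage points at face value, yet the difference is not large enough relative to the standard errors to be called statistically significant, so the two communities would be described as similar.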
When users are analyzing datasets themselves, they can use the 95% confidence interval described above as a rough estimate of statistical significance. If the 95% confidence intervals for two measures do not overlap, then any difference between the measures is likely to be statistically significant. However, the differences can be statistically significant even if the confidence intervals overlap.
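The rough check described above can be sketched in a few lines of Python. The function names and result labels are hypothetical, and the two comparison intervals are invented for illustration; the first interval is the 2014 low-birthweight interval quoted earlier in the text.

```python
def intervals_overlap(interval_a, interval_b):
    """True when two (lower, upper) confidence intervals share any values."""
    lower_a, upper_a = interval_a
    lower_b, upper_b = interval_b
    return lower_a <= upper_b and lower_b <= upper_a


def rough_comparison(interval_a, interval_b):
    """Rough screen only: overlapping intervals do not rule out significance."""
    if intervals_overlap(interval_a, interval_b):
        return "inconclusive: a formal test is needed"
    return "likely statistically significant"


interval_2014 = (8.65, 10.44)  # quoted in the text
# The two intervals below are hypothetical.
print(rough_comparison(interval_2014, (10.80, 12.60)))  # likely statistically significant
print(rough_comparison(interval_2014, (9.90, 11.70)))   # inconclusive: a formal test is needed
```

The overlapping case is labeled inconclusive rather than "not significant" because, as noted above, a difference can still be statistically significant when the intervals overlap.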
In reports, Forsyth Futures generally only calculates trends for indicators with at least four years of data. Data are considered to be trending upwards or downwards if later years of data are consistently higher or lower than earlier years of data at a statistically significant level.
Major Urban Counties
When making comparisons with other communities, Forsyth Futures reports typically compare Forsyth County to Durham, Guilford, Mecklenburg, and Wake counties because they are the four other heavily populated counties in North Carolina with mid- to large-sized cities.