Having previously demonstrated what the two groups of a binary look like when they are separated by 6 standard deviations, here I demonstrate what they look like when separated by 4 standard deviations. Such a binary has an overlapping coefficient of 4.55%, as the output below shows. The value is computed by numerically integrating the area shared by the two distributions, following Weitzman's (1970) measure of overlap.
## 0.04550026 with absolute error < 3.8e-05
## [1] "4.55%"
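As a minimal sketch of how such a figure might be computed (not necessarily the code that produced the output above), the overlapping coefficient can be taken as the integral of the pointwise minimum of the two densities. The function name `overlap_coefficient` and the choice of unit-variance normal groups with means at 0 and 4 are assumptions made for illustration.

```r
# Overlapping coefficient of two normal densities, by numerical integration.
# Assumption: both groups share the same sd, with means 0 and `separation` * sd.
overlap_coefficient <- function(separation, sd = 1) {
  # Height of the lower of the two density curves at each x
  min_density <- function(x) {
    pmin(dnorm(x, mean = 0, sd = sd),
         dnorm(x, mean = separation * sd, sd = sd))
  }
  # Area under the lower curve is the shared (overlapping) area
  integrate(min_density, lower = -Inf, upper = Inf)
}

ovl_4sd <- overlap_coefficient(4)
ovl_4sd                                   # value plus integration error bound
sprintf("%.2f%%", 100 * ovl_4sd$value)    # formatted as a percentage
```

Calling `overlap_coefficient(2)` or `overlap_coefficient(6)` in the same way would give the overlap for the other separations discussed in this series.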
This is what such data looks like graphed in a density curve.
The overlap range is now much larger, as can be seen in the scatterplot below.
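Plots of this kind could be drawn along the following lines. This is only a sketch in base R: the sample size, the seed, the colors, and the use of random vertical jitter for the scatterplot are all assumptions, not details taken from the original figures.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 500       # hypothetical sample size per group

# Two unit-variance normal groups whose means are 4 SD apart
group_a <- rnorm(n, mean = 0, sd = 1)
group_b <- rnorm(n, mean = 4, sd = 1)

# Density curves for both groups on shared axes
dens_a <- density(group_a)
dens_b <- density(group_b)
plot(dens_a, col = "blue", main = "Groups separated by 4 SD", xlab = "Value",
     xlim = range(dens_a$x, dens_b$x), ylim = range(0, dens_a$y, dens_b$y))
lines(dens_b, col = "red")

# Scatterplot: the vertical position is random jitter and carries no meaning
values <- c(group_a, group_b)
group_colors <- rep(c("blue", "red"), each = n)
plot(values, runif(2 * n), col = group_colors, pch = 16,
     xlab = "Value", ylab = "Jitter")
```

Changing the mean of `group_b` to 2 or 6 gives the corresponding plots for the other separations.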
Now let’s look at a separation of 2 standard deviations.
## 0.3173105 with absolute error < 4.7e-05
## [1] "31.73%"
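These integrals can be cross-checked against a closed form. For two normal groups with equal standard deviations, the density curves cross at the midpoint between the means, so the overlap equals twice the tail area beyond that midpoint: $2\,\Phi(-d/2)$, where $d$ is the separation in standard deviations. The sketch below assumes exactly that equal-variance normal setup; `closed_form_overlap` is a name chosen here for illustration.

```r
# Closed-form overlap for two equal-variance normal groups whose means are
# `separation` standard deviations apart: OVL = 2 * pnorm(-separation / 2).
closed_form_overlap <- function(separation) 2 * pnorm(-separation / 2)

separations <- c(6, 4, 2)
overlaps <- closed_form_overlap(separations)

# Tabulate each separation next to its overlap, formatted as a percentage
data.frame(separation_sd = separations,
           overlap = sprintf("%.2f%%", 100 * overlaps))
```

The rows for 4 and 2 standard deviations reproduce the 4.55% and 31.73% figures obtained by integration.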
The two density curves now overlap substantially: nearly a third of their area is shared.
And this is what the scatterplot looks like.
Now look at the scatterplot without color differences. At this point there is the barest of hints that there might be a binary in this system at all.
Let us compare that to the initial binary, separated by 6 standard deviations, now in gray.
With this data, the binary remains visible and obvious even when both samples are gray.
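The uncolored comparison can be sketched by pooling both groups and plotting every point in the same shade. As before, this is an illustration under assumptions (sample size, seed, jitter on the vertical axis, and the helper name `pooled_sample` are all hypothetical), not the code behind the original figures.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 500       # hypothetical sample size per group

# Pool two unit-variance groups whose means are `separation` SD apart,
# discarding the group labels
pooled_sample <- function(separation) {
  c(rnorm(n, mean = 0, sd = 1), rnorm(n, mean = separation, sd = 1))
}

old_par <- par(mfrow = c(1, 2))    # two panels side by side
for (separation in c(6, 2)) {
  plot(pooled_sample(separation), runif(2 * n), col = "grey40", pch = 16,
       xlab = "Value", ylab = "Jitter",
       main = paste("Separation:", separation, "SD"))
}
par(old_par)                       # restore the previous plot layout
```

With the labels removed, the gap in the 6 SD panel should still be plain to see, while the 2 SD panel reads as a single cloud.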
However, even if you cannot see the categories by eye, there are tools that can help identify n-ary categories in what looks to us like gradient data: the tools of unsupervised cluster analysis, which I will discuss in the next tutorial.
The RMarkdown file used to generate this post can be found here. Some of the code was modified from code on this site.
References:
Weitzman, M. S. (1970). Measures of overlap of income distributions of white and Negro families in the United States. Washington: U.S. Bureau of the Census.