Explaining QQ Plots - Learning and Coding

To start with, the qq stands for quantile-quantile: a comparison of quantiles between two populations. A quantile is a value below which a given fraction of the data points fall. To simplify the discussion, I will use percentiles, which divide a data set into 100 equally sized groups. A percentile tells us what percent of the data points fall below the given value.
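As a small illustration of the percentile idea, here is a sketch using only Python's standard library. The sample values are made up for the example, and `fraction_below` is a helper name I am introducing here; it is not part of any library.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]

def fraction_below(values, x):
    """Fraction of the data points that are at or below x."""
    return sum(v <= x for v in values) / len(values)

# With n=100, statistics.quantiles returns the 99 cut points that split
# the data into 100 equally sized groups, i.e. the percentiles.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
median = percentiles[49]  # index 49 holds the 50th percentile

print("50th percentile:", median)
print("fraction of data at or below 7:", fraction_below(data, 7))
```

The two directions are mirror images: a percentile goes from a percent to a value, and `fraction_below` goes from a value back to a percent.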

Primarily, a qq plot is used to compare the distribution of a data set to a standard type of distribution. Often, it is used to answer the question: is the data set in question normally distributed? This is important, as many kinds of data analysis assume normality and can give misleading results if the data is not normal.

QQ plots are extremely technical. While they are very useful for checking our assumptions about data (and thus, whether a particular kind of analysis will work), it is very difficult to use them to communicate knowledge in any non-technical capacity, and they do not give us information about the relationships between variables.

To explain, let’s set up an example. We’ll use percentiles as the kind of quantile, and compare a data set’s distribution to a true normal distribution. Each point represents a certain percentage, where both populations (data set and normal) have that percent of data values at or below that value. The x-axis value is the value of the normal distribution at or below which that percent of the distribution falls, and the y-axis value is the same, but for the data set rather than the true normal.
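The construction described above can be sketched by hand in a few lines. This is a minimal version under my own assumptions: `qq_points` is a hypothetical helper, it compares against the standard normal (mean 0, standard deviation 1), and it uses simple linear interpolation for the data quantiles. Plotting libraries may use slightly different conventions.

```python
from statistics import NormalDist

def qq_points(data, n_points=99):
    """Return (x, y) pairs for a qq plot against the standard normal.

    x is the normal quantile and y is the data quantile for the same percent.
    """
    ordered = sorted(data)
    normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for k in range(1, n_points + 1):
        p = k / (n_points + 1)           # 1%, 2%, ..., 99% by default
        x = normal.inv_cdf(p)            # value with fraction p of the normal below it
        pos = p * (len(ordered) - 1)     # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        # Linear interpolation between the two neighbouring data values.
        y = ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])
        points.append((x, y))
    return points

points = qq_points([1, 2, 3, 4, 5])
print(points[49])  # the 50% point: normal median on x, data median on y
```

Each tuple is one dot on the qq plot, so passing the list to any scatter plot function would draw it.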

This is what allows us to compare the distributions of the two populations. If the distributions are the same, then the cumulative distribution functions will increase at about the same rate, which will make all the points lie very close to a 45 degree line.
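To check the 45 degree claim numerically, here is a sketch that builds an idealized "perfectly normal" sample and measures how far its percentiles stray from the normal percentiles. The sample size and the way the ideal sample is built are my own choices for illustration.

```python
from statistics import NormalDist, quantiles

normal = NormalDist()  # standard normal: mean 0, standard deviation 1
n = 1000

# An idealized normal sample: the normal quantiles at evenly spaced probabilities.
ideal_sample = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]

# y-values of the qq plot (data percentiles) and x-values (normal percentiles).
data_percentiles = quantiles(ideal_sample, n=100, method="inclusive")
normal_percentiles = [normal.inv_cdf(k / 100) for k in range(1, 100)]

# Vertical distance of each point from the line y = x.
largest_gap = max(abs(y - x) for x, y in zip(normal_percentiles, data_percentiles))
print("largest distance from the line y = x:", round(largest_gap, 3))
```

The gap is not exactly zero because the sample is finite and the percentiles are interpolated, but it stays small compared to the range of the axes.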

For an example, here is a data set which combines two different normal random sets. One has a much smaller standard deviation than the other, creating a narrower peak. This dataset is represented in blue, and the true normal in orange.
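A data set of this kind could be generated roughly as follows. The means, standard deviations, sample sizes, and seed below are assumptions for illustration, not the values behind the plots in this post.

```python
import random
from statistics import pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Assumed parameters: both components share mean 0; only the spread differs.
wide = [rng.gauss(0.0, 1.0) for _ in range(2000)]
narrow = [rng.gauss(0.0, 0.2) for _ in range(2000)]
mixed = wide + narrow  # the combined, non-normal data set

print("wide sd:  ", round(pstdev(wide), 2))
print("narrow sd:", round(pstdev(narrow), 2))
print("mixed sd: ", round(pstdev(mixed), 2))
```

The narrow component piles extra points near the center, which is what produces the sharper peak compared to a single normal distribution with the same overall spread.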

Here are the cumulative distribution plots, which are closely related to what a qq plot shows:

The dotted red line represents 20% of the data: 20% of the data points fall below it, and the other 80% above it. The vertical blue line marks where that 20% level meets the dataset; this is the y-value of the 20% point on the qq plot. The orange line marks where 20% meets the normal distribution; this is the x-value of the 20% point on the qq plot.
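The same lookup for a single point can be sketched in code. The data values here are hypothetical, and the normal distribution is assumed to be the standard normal.

```python
from statistics import NormalDist, quantiles

# Hypothetical data set, used only to show the lookup.
data = list(range(1, 12))  # 1, 2, ..., 11

# x-value: where the normal distribution reaches 20%.
x_20 = NormalDist().inv_cdf(0.20)

# y-value: where the data set reaches 20%. With n=5 the first cut point
# is the 20% quantile.
y_20 = quantiles(data, n=5, method="inclusive")[0]

print("20% point on the qq plot:", (round(x_20, 3), y_20))
```

Repeating this for every percent from 1 to 99 gives the full set of points on the qq plot.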

Here is the qq plot for this dataset:

The orange spot represents the 20% value highlighted in the previous paragraph.

The actual values that are plotted in the qq plot (and thus, what the ticks on the axes represent) are numbers of standard deviations away from the mean.
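Converting raw values into "standard deviations away from the mean" is a short calculation. In this sketch, `standardize` is a hypothetical helper and the sample values are made up; I use the population standard deviation, which is one of two common choices.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = fmean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]

# Hypothetical sample with mean 5 and standard deviation 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```

After this step the data set and the standard normal share the same units, which is why a matching distribution lands on the 45 degree line rather than on some other straight line.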

Obviously, the plot does not lie perfectly on the red line, because the data set is not normally distributed. While we can easily see that it is not normal, it is hard to tell in what way it isn’t normal. This is what makes a qq plot so technical: it is very difficult to build an intuitive understanding of what it represents.

To demonstrate what shapes produce what qq plots, I’ll provide a few sample histograms and resulting qq plots.

A combination of a uniform distribution and a normal distribution results in a distribution that looks like this:

And produces a qq plot like this:

Data with a positive skew:

Data with a negative skew:
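For readers who want numbers to go with the skewed cases, here is a sketch that compares a few percentiles of skewed data to the normal ones. The exponential distribution, sample size, and seed are my own stand-ins for "data with a positive skew", and `standardized_percentiles` is a hypothetical helper.

```python
import random
from statistics import NormalDist, fmean, pstdev, quantiles

def standardized_percentiles(values):
    """Percentiles of the data after converting to standard deviations from the mean."""
    mean, sd = fmean(values), pstdev(values)
    z = [(v - mean) / sd for v in values]
    return quantiles(z, n=100, method="inclusive")

rng = random.Random(0)  # fixed seed so the sketch is repeatable
positive_skew = [rng.expovariate(1.0) for _ in range(5000)]  # long right tail
negative_skew = [-v for v in positive_skew]                  # mirrored: long left tail

normal_percentiles = [NormalDist().inv_cdf(k / 100) for k in range(1, 100)]
pos = standardized_percentiles(positive_skew)
neg = standardized_percentiles(negative_skew)

# Columns: percent, normal (x), positive skew (y), negative skew (y).
for label, idx in (("1%", 0), ("50%", 49), ("99%", 98)):
    print(label, round(normal_percentiles[idx], 2), round(pos[idx], 2), round(neg[idx], 2))
```

With positive skew both tails sit above the 45 degree line while the middle dips below it, so the points bend upward like a bowl. Negative skew gives the mirror image.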

Which group of data points has more points in a particular range has a much more noticeable impact on the slope of the qq plot than on which side of the 45 degree line the plot falls. This is one of the main reasons that an intuitive understanding of a qq plot is hard to develop. It basically requires an intuitive understanding of calculus, and even as someone who has a high level of technical understanding of calculus, I still don’t find it very intuitive. With my technical understanding of the math behind these things, I believe that my intuitive understanding will grow with time.