Furthermore, many people confuse correlation with causation; that is, they interpret a non-zero correlation as implying a causal relationship. Examples of this can be found even in the scientific literature. Our aim here is not to discuss all these misinterpretations, but to look at another aspect of correlation that is also related to causation in some sense.
In day-to-day life, causation is transitive: if one thing, say A, causes another, say B, and B in turn causes a third, say C, then we readily accept that A causes C. This idea is familiar to most of us and is often called indirect causation. Such a relationship is said to have the property of transitivity.
But what about correlation? More precisely, if A and B are positively correlated and B and C are positively correlated, are A and C also positively correlated? Is positive correlation also transitive? For example, suppose the price of stock B increases along with the price of stock A, and the price of stock C increases with the price of stock B. Is it then always the case that the price of stock C increases with the price of stock A?
One may well jump to the conclusion that it is. This is usually not because one is thinking about Pearson’s correlation coefficient, but because one is thinking along the lines of causation. In fact, someone who knows Pearson’s correlation coefficient well may not conclude so. An article by Langford, Schwertman and Owens in The American Statistician takes a rather deep look at the problem. Thinking in terms of causation, one often concludes that A and C are always positively correlated as well, i.e., that the price of stock C always increases with that of stock A. But since causation and correlation are two different things, there is no assurance that a property of causation also holds for correlation.
Let’s return to the article mentioned above. The authors use data for all New York Yankees players with at least 300 “at bats” as of the end of the 2000 regular baseball season. An “at bat” is a batter’s turn facing a pitcher, so each listed batter faced pitchers at least 300 times. As the authors describe, the variables X, Y and Z represent the numbers of triples, base hits and home runs, respectively, where each row corresponds to one player, whose name is also shown. In baseball, a hit is called a base hit, a triple is a hit on which the batter safely reaches third base, and a home run is a run the team scores when the batter is able to reach home safely in a single play. The data are shown in the following table, and graphs are drawn for each pair of variables to show how they vary together; for example, the graph in the middle shows how the number of home runs (Z) changes with the number of base hits (Y). The two tend to increase together roughly linearly, but you may notice that the relationship between them is not a strong one.
The Pearson correlation coefficient between X and Y is 0.526 and that between Y and Z is 0.293, but that between X and Z is -0.096, a negative correlation. That is, the first two correlations are positive but the third is not, so positive correlation is not transitive here. However, if one treats the fourth player as an outlier (his Z value is rather high relative to his X value, compared with the others) and removes him, the correlations become 0.559, 0.321 and 0.067. That is, when the first two correlations are made somewhat more positive, the third also becomes positive, and transitivity is restored.
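The same pattern can be reproduced on a very small data set. The sketch below uses made-up numbers (not the baseball data, which is not reproduced here) and a hand-written helper, `pearson`, that is introduced only for this illustration. It assumes nothing beyond the textbook definition of Pearson's correlation coefficient.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

# Made-up values for illustration only (NOT the Yankees data).
x = [0, 1, 2, 3]
z = [2, 3, 0, 1]                        # tends to fall as x rises
y = [xi + zi for xi, zi in zip(x, z)]   # y shares something with both x and z

r_xy = pearson(x, y)
r_yz = pearson(y, z)
r_xz = pearson(x, z)
print(f"r_xy = {r_xy:.3f}, r_yz = {r_yz:.3f}, r_xz = {r_xz:.3f}")
```

Here the first two coefficients come out positive (about 0.45 each) while the third is -0.6, so two positive correlations do not force the third to be positive.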
So one may think that every such correlation triplet can be made to satisfy transitivity, at least when there are enough data points that suspected outliers can be deleted so as to strengthen the first two correlations. Indeed, the third correlation is guaranteed to be positive when the linear relationships in the first two pairs, namely X and Y, and Y and Z, are adequately strong, i.e., when both are rather high positive correlations. The authors prove a theorem stating that $\rho_{xy}^2 + \rho_{yz}^2 + \rho_{xz}^2 \le 1 + 2\rho_{xy}\rho_{yz}\rho_{xz}$, where $\rho_{xy}$ stands for the correlation coefficient between X and Y, and similarly for the others.
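One way to read the inequality, offered here as a sketch rather than as the authors' own argument, is as a condition on the third correlation once the first two are fixed. Writing $\rho_{xy}$, $\rho_{yz}$ and $\rho_{xz}$ for the three correlation coefficients and completing the square in $\rho_{xz}$ gives:

```latex
\begin{gather*}
\rho_{xy}^2 + \rho_{yz}^2 + \rho_{xz}^2 \le 1 + 2\rho_{xy}\rho_{yz}\rho_{xz} \\
\iff \left(\rho_{xz} - \rho_{xy}\rho_{yz}\right)^2 \le \left(1-\rho_{xy}^2\right)\left(1-\rho_{yz}^2\right) \\
\iff \rho_{xy}\rho_{yz} - \sqrt{\left(1-\rho_{xy}^2\right)\left(1-\rho_{yz}^2\right)}
  \;\le\; \rho_{xz} \;\le\;
  \rho_{xy}\rho_{yz} + \sqrt{\left(1-\rho_{xy}^2\right)\left(1-\rho_{yz}^2\right)}
\end{gather*}
```

When $\rho_{xy}$ and $\rho_{yz}$ are both positive, the lower bound is itself positive exactly when $\rho_{xy}^2\rho_{yz}^2 > (1-\rho_{xy}^2)(1-\rho_{yz}^2)$, which simplifies to $\rho_{xy}^2 + \rho_{yz}^2 > 1$. In that case the third correlation cannot be negative, which is one precise sense of "adequately strong".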
Note that the correlation coefficient between two quantities measures how closely the data points for the two quantities follow a hypothesized linear relationship between them. When you have three quantities and two of the correlations are positive but rather weak, meaning that the data points are not “that” close to their hypothesized linear relationships, the third correlation can be negative.
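To see how much room weak correlations leave, the theorem's inequality can be solved for the third correlation: given the first two, it must lie within a distance of sqrt((1 - r_xy^2)(1 - r_yz^2)) of the product r_xy * r_yz. The sketch below computes that range. The function name `third_correlation_range` is hypothetical, and the "strong" pair (0.9, 0.8) is an invented example; only the pair (0.526, 0.293) comes from the text.

```python
from math import sqrt

def third_correlation_range(r_xy, r_yz):
    """Range of r_xz values consistent with
    r_xy^2 + r_yz^2 + r_xz^2 <= 1 + 2 * r_xy * r_yz * r_xz."""
    centre = r_xy * r_yz
    half_width = sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))
    return centre - half_width, centre + half_width

# Correlations quoted in the text for the full data set.
low, high = third_correlation_range(0.526, 0.293)
print(f"weak pair:   {low:.3f} <= r_xz <= {high:.3f}")

# Hypothetical stronger pair, chosen so that r_xy^2 + r_yz^2 > 1.
low_s, high_s = third_correlation_range(0.9, 0.8)
print(f"strong pair: {low_s:.3f} <= r_xz <= {high_s:.3f}")
```

For the weak pair the lower end of the range is negative, so the observed value of -0.096 is perfectly admissible. For the stronger pair the whole range is positive, so the third correlation has no choice but to be positive.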