You might be tempted to do PCA after a statistical test. Read more to discover why this is a bad idea.

If you do a statistical test **before** a dimensional reduction method like PCA, the highest source of variance is likely to be whatever you tested statistically.

Let me describe the situation. You’ve done an `-omics`

level analysis on your system of interest. You run a t-test (or ANOVA, etc) on each of the features in your data (gene, protein, metabolite, etc). Filter down to those things that were statistically significant, and then finally, you decide to look at the data using a dimensionality reduction method such as *principal components analysis* (PCA) so you can **see** what is going on.

I have seen this published at least once (in a Diabetes metabolomics paper, if anyone knows it, please send it to me so I can link it), and have seen collaborators do this after coaching from others in non-statistical departments.

The problem is that PCA is just looking at either feature-feature covariances or sample-sample covariances. If you have trimmed the data to those things that have statistically significant differences, then you have completely modified the covariances, and PCA is likely to pick up on that.

Let’s actually do an example where there are no differences initially, and then see if we can introduce an artificial difference.

We start with completely random data, 10000 features, and 100 samples.

Now we will do a t-test on each row, taking the first 50 samples as class 1 and the other 50 samples as class 2.

How many are significant at a p-value of 0.05?

Obviously, these are false positives, but they are enough for us to illustrate the problem.

First, lets do PCA on the whole data set of 10000 features.

```
sample_classes = data.frame(class = c(rep("A", 50), rep("B", 50)))
all_pca = prcomp(t(random_data), center = TRUE, scale. = FALSE)
pca_scores = cbind(as.data.frame(all_pca$x), sample_classes)
ggplot(pca_scores, aes(x = PC1, y = PC2, color = class)) + geom_point()
```

Obviously, there is no difference in the groups, and the % explained variance is very low.

Second, lets do it on just those things that were significant:

```
sig_pca = prcomp(t(random_data[which(t_test_res$p.value <= 0.05), ]), center = TRUE,
scale. = FALSE)
sig_scores = cbind(as.data.frame(sig_pca$x), sample_classes)
ggplot(sig_scores, aes(x = PC1, y = PC2, color = class)) + geom_point()
```

And look at that! We have separation of the two groups! But …., this is completely random data, that didn’t have any separation, **until we did the statistical test**!

Be careful of the order in which you do things. If you want to do dimensionality reduction to look for issues with the samples, then do that **before** any statistical testing on the individual features.

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

Flight (2018, Sept. 14). Deciphering Life: One Bit at a Time: Don't do PCA After Statistical Testing!. Retrieved from https://rmflight.github.io/posts/2018-09-14-don-t-do-pca-after-statistical-testing/

BibTeX citation

@misc{flight2018don't, author = {Flight, Robert M}, title = {Deciphering Life: One Bit at a Time: Don't do PCA After Statistical Testing!}, url = {https://rmflight.github.io/posts/2018-09-14-don-t-do-pca-after-statistical-testing/}, year = {2018} }