Principal Component Analysis

Abstract

This paper is an essay about “Principal Component Analysis” written by Sofiene Khiari as an end-term project for the course “Applied Mathematics and Informatics In Drug Discovery” at the University of Basel. It presents the basic principles of the method after introducing those of “Machine Learning” and “Dimensionality Reduction”, which form its foundation. A second section then explains the different steps of the method using an example that shows the benefit of applying it to data analysis, whatever the domain of expertise. This essay tries to simplify the concept so that it is understandable even to readers who may not be familiar with the subject.

Introduction

In this essay, we will briefly introduce Machine Learning and Principal Component Analysis in a way that is hopefully comprehensible even to readers who are not familiar with the subject. The essay comprises two main sections:

  • Basic Principles: Introduction to the basic principles of Machine Learning and Dimensionality Reduction with Principal Component Analysis taken as an example.

  • Analysis Conduct: Description of the Principal Component Analysis’ steps and application thereof through a basic example.

Basic Principles

Machine Learning

Machine learning is about learning from massive amounts of evolving data to come up with algorithms that help us understand and solve the complicated problems humanity is facing. It has found applications in a wide range of areas.

The need for such a technology has arisen from humanity's continuing progress, which allows us to tackle problems of ever-growing complexity. Simple algorithms are no longer good enough to solve such problems, and the solutions needed are too complicated for us to design directly, mainly because of a lack of knowledge; what we often do have is enough data to describe such problems with good accuracy. That is why we learn from all that data, using the immense computational power computers provide us with: we can distill the process that explains the data we observe and construct a good and useful approximation of it by detecting patterns or regularities that can help us predict the near future [1].

Sometimes, however, the data we need to analyse is very complicated and has many dimensions. In such cases, the techniques of Dimensionality Reduction come in very handy.

Dimensionality Reduction

Dimensionality in Machine Learning refers to the number of attributes or fields a structured dataset has [2]. In an academic setting, the dimensions could for example be a set of different exams that evaluate the knowledge of a group of students in certain domains, as we are going to see in our example analysis. Complicated real-life datasets may contain hundreds of dimensions [2], and this high dimensionality may cause some issues:

  • It is very challenging to plot, visualise and analyse data that has a very high dimensionality [3], as humans are unable to visualise data beyond 3D.

  • It may require a lot of computational power to analyse such data, and the result may not show good accuracy [2].

Dimensionality Reduction comprises various techniques that can lower the dimensionality of data without losing key information [2]. Ideally, we should not need to perform such an analysis, as the classifier should be able to use whichever features are necessary and discard the irrelevant ones [1], but it may be preferable to perform it separately anyway, for the following reasons [1, 2]:

  • They reduce the time, memory and computation needed for the training.

  • They decrease the complexity of the resulting algorithm, rendering it clearer and more robust.

  • The reduced data can be plotted and analysed visually.

There are two main methods for reducing dimensionality: feature selection and feature extraction [1]. Principal Component Analysis is the best-known and most widely used feature extraction method and, as such, consists in finding a new set of k dimensions that are combinations of the original d dimensions [1]. Even though there are multiple techniques we could use, this essay will focus only on Principal Component Analysis.

Principal Component Analysis

In a nutshell, Principal Component Analysis is an unsupervised method of Dimensionality Reduction “interested in finding a mapping from the inputs in the original d-dimensional space to a new (k < d)-dimensional space, with minimum loss of information” [1], while also eliminating redundancy in the dataset [2]. It is generally computed using the Singular Value Decomposition [3]. During the analysis, we aim to maximise the variance by choosing the eigenvector with the largest eigenvalue (more details in the Analysis Conduct section). The first component has the largest variance, meaning that it holds the most information about our data's clustering potential, followed by the second component, and so on. We can use the principal components we get to draw a PCA plot, in which we can visualise how potentially similar individuals cluster [3]. In the following section, we will look into the different steps needed to conduct a Principal Component Analysis using a specific example.
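To make the eigenvalue idea more concrete, here is a minimal NumPy sketch (the small data array X is made up purely for illustration and is not part of this essay's dataset): it centres the data, computes the covariance matrix, and sorts the eigenvectors by decreasing eigenvalue, so that the first eigenvector plays the role of principal component 1.

    import numpy as np

    # Made-up two-dimensional data, purely for illustration
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                  [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

    X_centered = X - X.mean(axis=0)          # shift the centre of the data to 0
    cov = np.cov(X_centered, rowvar=False)   # covariance matrix of the variables

    # eigh handles symmetric matrices such as a covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort by decreasing eigenvalue: the first eigenvector then corresponds
    # to principal component 1, the second to principal component 2, and so on
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    print(eigenvalues / eigenvalues.sum())   # share of the variance per component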

Analysis Conduct

To better explain the analysis steps, we will work through an example. For this purpose, we are going to analyse the results of 30 students in 8 different exams and try to see whether those students have homogeneous results or whether they can be sorted into categories solely based on said results.

Import the data

First, we start by importing the data from a csv file present in the same folder as our notebook. This data doesn't come from a real-life situation but was rather created for this essay purely for demonstration purposes. Fig. 1 shows the first five rows of the table representing the generated dataframe; a minimal code sketch of this step follows the table.

      E1   E2   E3   E4   E5   E6   E7   E8
S01  5.8  6.0  5.7  5.7  5.5  5.7  5.9  5.7
S02  4.4  4.5  4.4  4.4  4.5  4.4  4.0  4.0
S03  5.9  5.9  5.7  6.0  6.0  5.5  5.8  5.6
S04  2.0  1.3  1.7  1.2  1.1  1.2  1.7  1.8
S05  4.0  4.1  4.0  4.5  4.0  4.4  4.2  4.2

Fig. 1 First five rows of the table representing the generated dataframe
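Such an import could look like the following sketch (the file name exam_results.csv is our own assumption, not taken from the original notebook):

    import pandas as pd

    # Hypothetical file name; the csv is assumed to sit next to the notebook,
    # with the students (S01, S02, ...) as rows and the exams E1-E8 as columns
    df = pd.read_csv("exam_results.csv", index_col=0)
    print(df.head())  # first five rows, as shown in Fig. 1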

Run the analysis

Our analysis was carried out in a single line of code using the sklearn library, which abstracts away all the mathematical calculations behind the principal component analysis' transformation of the data. With this library, we only need to provide the number of principal components we want our model to have [2]. At a high level, the principal component analysis follows the steps below [2, 3] (a code sketch is given after the list):

  • Standardize the dataset (prerequisite for every principal component analysis).

  • Compute the covariance matrix:

    • Calculate the average value of each variable to locate the center of the data.

    • Shift the values so that the center of the data corresponds to 0 (while keeping the relative positions of the points intact).

  • Calculate the Eigenvalues and Eigenvectors to be able to identify principal components:

    • Fit a line through the data in a way that maximises the corresponding Eigenvalue (the sum of squared distances between the projected points and the origin), while making sure that it includes the center of the data. This line is principal component 1.

    • Draw a line that is perpendicular to principal component 1 and also goes through the origin. This line is the principal component 2.

  • We use the proportion of the total variation accounted for by each principal component to decide which one(s) is (are) the most important for the clustering of the data. We can use a Scree plot to visualise these proportions (not done in this example, though). The final selection of the principal component(s) depends on how many principal components we decided we want our model to have.

  • Transform the original data matrix by using the points projected onto each selected principal component. The resulting matrix is used to draw the plot shown in the results' plotting section.
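In code, all of the steps above reduce to very little with sklearn. The following sketch is our own minimal version of this step (the variable names are assumptions, and df is the dataframe from the import sketch above): it standardises the data, fits a two-component model and projects the data onto the two principal components.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Standardise the dataset (prerequisite for every principal component analysis)
    scaled = StandardScaler().fit_transform(df)

    # Ask the model for two principal components and project the data onto them;
    # fit_transform does the covariance, eigenvalue and projection work for us
    pca = PCA(n_components=2)
    components = pca.fit_transform(scaled)

    pca_df = pd.DataFrame(
        components,
        columns=["Principal Component 1", "Principal Component 2"],
        index=df.index,
    )
    print(pca_df.head())  # first five rows, as shown in Fig. 2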

Tip

The StatQuest PCA step-by-step video [3] provides a very good visual representation of the different steps involved in the Principal Component Analysis.

The results of the principal component analysis in our case are displayed in Fig. 2. The total variation around principal component 1 is equal to 0.983 and the one around principal component 2 to 0.005.

Principal Component 1    Principal Component 2
            -3.031184                -0.048137
            -0.766471                -0.036151
            -3.112037                -0.083159
             3.720344                -0.340983
            -0.525484                 0.169394

Fig. 2 First five rows of the generated Principal Component Analysis’ results
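The two total variation values quoted above can be read directly off the fitted model. Assuming pca is the fitted object from the sketch in the previous section:

    # Share of the total variation captured by each principal component;
    # for our dataset this yielded roughly 0.983 and 0.005
    print(pca.explained_variance_ratio_)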

Plot the results

We can plot the results by creating a scatter plot that has principal component 1 on the x-axis and principal component 2 on the y-axis. We can then color-code the points depending on their position on the graph relative to principal component 1; a minimal sketch of this step is given below. The result is shown in Fig. 3. We can see in the resulting graph that some students cluster on the right side, some almost in the middle and some on the left side. This suggests that three categories can be made for our students solely based on their exam results, which may be helpful for further analysis. We can also see in the graph that the total variation around principal component 1 is much higher than the one around principal component 2, which corresponds well to the values of total variation we got from our analysis in the Analysis Conduct section.
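A minimal plotting sketch, assuming pca_df from the analysis sketch above and a simple threshold on principal component 1 for the color-coding (the cut-off values -1 and 1 are our own illustration, not taken from the original notebook):

    import matplotlib.pyplot as plt

    pc1 = pca_df["Principal Component 1"]
    pc2 = pca_df["Principal Component 2"]

    # Color the points by their position along principal component 1;
    # the thresholds -1 and 1 are arbitrary illustrative cut-offs
    colors = ["tab:red" if v < -1 else "tab:green" if v > 1 else "tab:blue"
              for v in pc1]

    plt.scatter(pc1, pc2, c=colors)
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.title("Two Components Principal Component Analysis")
    plt.show()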


Fig. 3 Two Components Principal Component Analysis’ Scatter Plot

Bibliography

[1] Ethem Alpaydın. Introduction to Machine Learning. The MIT Press, second edition, 2010.

[2] Veer Kumar. Complete Tutorial of PCA in Python Sklearn with Example. machinelearningknowledge.ai, October 2021. URL: https://machinelearningknowledge.ai/complete-tutorial-for-pca-in-python-sklearn-with-example/.

[3] Josh Starmer. StatQuest: Principal Component Analysis (PCA), Step-by-Step. YouTube, April 2018. URL: https://www.youtube.com/watch?v=FgakZw6K1QQ.