PCA

Aug 02, 2023

By Admin


Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving the most important information. It achieves this by finding new uncorrelated variables, called principal components, which are linear combinations of the original features. The first principal component captures the most significant variance in the data, and subsequent components capture the remaining variance in decreasing order.

In the world of photography, capturing the essence of a moment is an art that requires skill and creativity. Just like a photographer at a football game, who explores various angles and views to find that perfect shot that encapsulates the true spirit of the game, Principal Component Analysis (PCA) in data analysis seeks to unveil the intrinsic patterns and features hidden within complex datasets.

Much like a photographer experimenting with different perspectives, PCA delves into the multidimensional landscape of data, where each data point is akin to a unique moment frozen in time. The photographer's goal is to distill the essence of the football game into a single captivating photo, while PCA's aim is to identify the critical aspects of the data that drive its variability and reveal the underlying structure.

Steps for PCA

Step 1: Data Standardization PCA assumes that the data is centered around the origin. Therefore, the first step is to standardize the data by subtracting the mean of each feature and dividing by its standard deviation. This ensures that all features have the same scale.
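Standardization can be sketched in a few lines of NumPy (the article names no library, so NumPy and the toy dataset below are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy dataset: 5 samples, 3 features on very different scales.
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 55.0, 25.0],
    [180.0, 80.0, 35.0],
    [175.0, 70.0, 28.0],
    [165.0, 60.0, 32.0],
])

# Standardize: subtract each feature's mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step every column of `X_std` has mean 0 and standard deviation 1, so no feature dominates the covariance computation purely because of its units.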
Step 2: Covariance Matrix Calculation The covariance matrix is computed from the standardized data to capture the relationships between the features. The covariance between two features Xi and Xj is given by:

Cov(Xi, Xj) = (1 / (n − 1)) ⋅ Σₖ (Xki − X̄i)(Xkj − X̄j)

Where,
● n is the number of data points
● Xki is the i-th feature of the k-th data point
● X̄i is the mean of the i-th feature
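The covariance computation can be sketched in NumPy (an assumed library choice; the random toy data is also an assumption). After standardization the feature means are zero, so the sample covariance reduces to XᵀX / (n − 1):

```python
import numpy as np

# Standardized toy data: 50 samples, 3 features (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

n = X_std.shape[0]
# Sample covariance: (1/(n-1)) * sum_k (X_ki - mean_i)(X_kj - mean_j).
# The means are zero after standardization, so this is just X^T X / (n-1).
cov_matrix = (X_std.T @ X_std) / (n - 1)
```

`np.cov(X_std, rowvar=False)` computes the same matrix and is the idiomatic shortcut in practice.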

Step 3: Eigenvector-Eigenvalue Decomposition The next step is to perform an eigenvector-eigenvalue decomposition of the covariance matrix. The eigenvectors represent the principal components, and the corresponding eigenvalues represent the amount of variance captured by each principal component.

The eigenvalue equation is:

Covariance Matrix ⋅ Eigenvector = Eigenvalue ⋅ Eigenvector (i.e., C v = λ v for each eigenpair)
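A minimal sketch of this decomposition, assuming NumPy and the same illustrative random data as above:

```python
import numpy as np

# Standardized toy data and its covariance matrix (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)

# eigh is the right routine for symmetric matrices such as a covariance
# matrix; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Each column v of `eigenvectors` satisfies C v = lambda v.
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov_matrix @ v, eigenvalues[i] * v)
```

Because the covariance matrix is symmetric, `np.linalg.eigh` is preferred over the general `np.linalg.eig`: it is faster, numerically stabler, and guarantees real eigenvalues.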

Step 4: Selecting Principal Components The principal components are sorted in descending order based on their corresponding eigenvalues. The first principal component is the direction along which the data varies the most (highest variance). The second principal component is the direction orthogonal to the first, capturing the second highest variance, and so on.

Step 5: Reducing Dimensionality The final step is to choose the top k principal components to reduce the dimensionality of the data. Typically, k is determined based on the amount of variance explained or the desired dimensionality reduction. The transformed data is obtained by projecting the original data onto the selected principal components:

Transformed Data = Original Data ⋅ Selected Principal Components

Where the transformed data will have k columns (representing the k principal components) instead of the original number of columns.
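Steps 4 and 5 can be sketched together in NumPy (library and toy data are assumptions; k = 2 is chosen arbitrarily for illustration):

```python
import numpy as np

# Standardized toy data and eigendecomposition (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: sort components by descending eigenvalue, keep the top k = 2.
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:2]]

# Step 5: project the data onto the selected components (matrix product).
X_reduced = X_std @ W  # shape (50, 2): k columns instead of the original 4
```

The projection is a matrix multiplication, so the transformed data has one column per retained principal component.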

In summary, PCA helps in reducing the dimensionality of high-dimensional data while preserving the most significant information by finding new orthogonal variables (principal components) that capture the most variance. The steps include data standardization, computing the covariance matrix, performing eigenvector-eigenvalue decomposition, selecting principal components, and reducing the dimensionality of the data.
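The five steps above can be combined into one small function, shown here as a NumPy sketch under the same assumptions as the earlier snippets (in practice a library implementation such as scikit-learn's PCA would typically be used instead):

```python
import numpy as np

def pca(X, k):
    """Reduce X of shape (n_samples, n_features) to k dimensions via PCA."""
    # Step 1: standardize each feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized data.
    C = np.cov(X_std, rowvar=False)
    # Step 3: eigendecomposition (eigh: symmetric matrix).
    vals, vecs = np.linalg.eigh(C)
    # Step 4: sort components by descending eigenvalue.
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Step 5: project onto the top-k components.
    return X_std @ vecs[:, :k], vals

# Illustrative data: 100 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced, eigenvalues = pca(X, k=2)
print(X_reduced.shape)  # (100, 2)
```

The returned eigenvalues can be inspected (e.g., as a cumulative fraction of their sum) to decide how large k should be for a target amount of explained variance.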

Interview Questions:

1. What is a confusion matrix?

2. What are the terms in a confusion matrix?

3. How do you calculate evaluation metrics?