Implementation of Principal Component Analysis

In this blog, we will discuss the implementation of Principal Component Analysis (PCA). PCA is a statistical procedure used to extract relationships between variables in order to simplify data. It is commonly used to reduce the dimensionality of data sets and is also a useful tool for data visualization. PCA is performed by first computing the covariance matrix of the data and then computing the eigenvectors and eigenvalues of that matrix.

The eigenvectors with the largest eigenvalues correspond to the directions of maximum variance in the data, and they define a new set of variables that are uncorrelated with each other. PCA is a powerful tool for simplifying data sets and uncovering relationships between variables. However, it is important to remember that PCA is a linear transformation, so it cannot capture non-linear relationships between variables.
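
To make this concrete, here is a minimal sketch of the eigenvalue side of the story, using NumPy with a small random matrix standing in for real data. It computes the covariance matrix, takes its eigendecomposition, and reports the fraction of the total variance each component explains:

import numpy as np

# Toy data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

cov_mat = np.cov(X, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

# eigh returns eigenvalues in ascending order, so reverse them
eigen_values = eigen_values[::-1]

# Each eigenvalue's share of the total is the fraction of variance
# captured by the corresponding principal component
explained_variance_ratio = eigen_values / eigen_values.sum()
print(explained_variance_ratio)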

PCA is often used to speed up machine learning algorithms and to make data easier to visualize. It is a linear transformation that projects data onto a lower-dimensional space in a way that preserves the maximum variance. The axes of the resulting space are called principal components, and the first principal component is the direction along which the data varies the most. This makes PCA a useful tool for data visualization, noise reduction, and dimensionality reduction.
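
As a quick illustration of what "preserves the maximum variance" means, the sketch below (a toy example with made-up correlated data, not part of the implementation that follows) compares the variance of the data projected onto the first principal component with the variance along a random direction; the first component should typically retain more:

import numpy as np

rng = np.random.default_rng(1)
# Correlated toy data so that one direction clearly dominates
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X_meaned = X - X.mean(axis=0)

# eigh returns eigenvalues in ascending order, so the last column
# of eigen_vectors is the first principal component
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_meaned, rowvar=False))
pc1 = eigen_vectors[:, -1]

random_dir = rng.normal(size=2)
random_dir /= np.linalg.norm(random_dir)

print(np.var(X_meaned @ pc1))         # variance along the first component
print(np.var(X_meaned @ random_dir))  # variance along a random direction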

Implementation of Principal Component Analysis

import numpy as np

def PCA(X, num_components):

    # Step 1: center the data so that each feature has zero mean
    X_meaned = X - np.mean(X, axis=0)

    # Step 2: covariance matrix of the features
    # (rowvar=False because each column, not row, is a variable)
    cov_mat = np.cov(X_meaned, rowvar=False)

    # Step 3: eigendecomposition; eigh is used because the covariance
    # matrix is symmetric, and it returns eigenvalues in ascending order
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

    # Step 4: sort the eigenvectors by eigenvalue, largest first
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]

    # Step 5: keep the top num_components directions
    eigenvector_subset = sorted_eigenvectors[:, 0:num_components]

    # Step 6: project the centered data onto those directions
    X_reduced = np.dot(eigenvector_subset.transpose(), X_meaned.transpose()).transpose()

    return X_reduced

import pandas as pd

# Get the Iris dataset from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
data = pd.read_csv(url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])

# Prepare the features
x = data.iloc[:, 0:4]

# Prepare the target
target = data.iloc[:, 4]

# Apply the PCA function defined above, keeping two components
mat_reduced = PCA(x, 2)

# Create a pandas DataFrame of the reduced dataset
principal_df = pd.DataFrame(mat_reduced, columns=['PC1', 'PC2'])

# Concatenate it with the target variable to create the complete dataset
principal_df = pd.concat([principal_df, pd.DataFrame(target)], axis=1)

import seaborn as sb
import matplotlib.pyplot as plt

# Scatter plot of the two principal components, colored by species
plt.figure(figsize=(6, 6))
sb.scatterplot(data=principal_df, x='PC1', y='PC2', hue='target', s=60, palette='icefire')
plt.show()

Reduced-dimension plot: the Iris data projected onto PC1 and PC2, colored by target class.
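
As an optional sanity check (assuming scikit-learn is installed, which the post does not otherwise require), the hand-rolled result can be compared against sklearn.decomposition.PCA. Note that individual components may come out with flipped signs, which is why the comparison below uses absolute values:

from sklearn.decomposition import PCA as SklearnPCA

sk_pca = SklearnPCA(n_components=2)
sk_reduced = sk_pca.fit_transform(x)

# The columns should match mat_reduced up to a possible sign flip per component
print(np.allclose(np.abs(sk_reduced), np.abs(mat_reduced), atol=1e-6))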

Advantages of Principal Component Analysis

There are many advantages to using Principal Component Analysis (PCA). One major advantage is that it reduces the dimensionality of data, which is extremely helpful when working with high-dimensional data sets. Additionally, PCA can improve the performance of machine learning algorithms, because it reduces noise in the data and makes patterns more visible. Finally, PCA is a relatively simple technique that is easy to implement.
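
As a rough illustration of that last point (this sketch assumes scikit-learn is available and is not part of the original implementation), the code below trains the same classifier on the full four Iris features and on the two principal components computed earlier; accuracy is often similar even though the input has half the dimensions:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(max_iter=1000)

# Cross-validated accuracy with all four original features
print(cross_val_score(clf, x, target, cv=5).mean())

# Cross-validated accuracy with only the two principal components
print(cross_val_score(clf, mat_reduced, target, cv=5).mean())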

Another advantage is that PCA produces new variables that are linear combinations of the original variables. These new variables are the "principal components," and they often make data easier to visualize and patterns easier to identify than the raw features do.
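
For the Iris example above, these linear combinations can be read off the retained eigenvectors. The sketch below recomputes them the same way the PCA function does (the eigenvector subset is local to that function, so it is rebuilt here) and prints each component's coefficient for each original feature:

# Recompute the sorted eigenvectors exactly as in the PCA function
X_meaned = x - np.mean(x, axis=0)
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_meaned, rowvar=False))
order = np.argsort(eigen_values)[::-1]
loadings = eigen_vectors[:, order[:2]]

# Each column shows how a principal component mixes the original features
print(pd.DataFrame(loadings, index=x.columns, columns=['PC1', 'PC2']))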

In many cases, the first few principal components explain a greater share of the variability in the data than any individual original variable. PCA is also useful for data compression: a small set of components can approximate the original data set, so the data can be stored more efficiently and transmitted more quickly. Another advantage is that transforming the data with PCA can make it easier for machine learning algorithms to find patterns.
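
The compression idea can be demonstrated directly on the Iris features: project onto the top two components, map back to the original four dimensions, and measure what is lost. A minimal sketch, reusing x from above:

X = x.to_numpy()
mean = X.mean(axis=0)
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X - mean, rowvar=False))
W = eigen_vectors[:, np.argsort(eigen_values)[::-1][:2]]  # top-2 components

X_reduced = (X - mean) @ W          # compress: 4 columns down to 2
X_approx = X_reduced @ W.T + mean   # decompress back to 4 columns

# Mean squared reconstruction error; small if two components
# capture most of the variance
print(np.mean((X - X_approx) ** 2))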

Disadvantages of Principal Component Analysis

There are a few potential disadvantages of PCA that should be considered before using this technique:

1. PCA can be sensitive to outliers in the data. Outliers can have a significant impact on the results of PCA, so it’s important to identify and remove them from the dataset before running the analysis.

2. PCA can be affected by the scaling of the data. This means that variables that are on a larger scale (e.g. age) can have a greater influence on the results than variables on a smaller scale (e.g. height). To account for this, it is often recommended to standardize the data before running PCA, as shown in the sketch after this list.

3. PCA assumes that the relationships in the data are linear. It is a linear method, so if there are non-linear relationships present in your data, PCA will not be able to represent them accurately and may not be a suitable choice for such datasets.

4. PCA can be computationally intensive, which can make it impractical to run on very large datasets.

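To address the scaling issue from point 2, each feature can be standardized to zero mean and unit variance before running PCA. A minimal sketch in plain NumPy, reusing x and the PCA function from above (scikit-learn's StandardScaler would do the same job):

X = x.to_numpy()

# z-score each column so that no feature dominates purely due to its scale
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

mat_reduced_scaled = PCA(X_standardized, 2)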
