Prediction using Logistic Regression in Python
Introduction
In this blog, we discuss how to implement the logistic regression algorithm to predict whether people buy a product. From the previous blog, we know that logistic regression predicts a categorical dependent variable and is used to solve classification problems. For prediction, we use the built-in LogisticRegression class from the sklearn library.
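As a quick reminder of what the model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a probability between 0 and 1, and the class is chosen by comparing that probability with a cutoff (0.5 here). The sketch below is only an illustration; the weights, bias, and sample values are made up and are not taken from the dataset used in this post.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
weights = np.array([0.8, 1.2])
bias = -0.5

# One made-up, already standardized sample (for example age and salary).
x = np.array([0.3, -0.1])

# Weighted sum of the features, squashed into a probability.
probability = sigmoid(np.dot(weights, x) + bias)

# Class 1 ("buys") if the probability reaches the 0.5 cutoff, else class 0.
predicted_class = int(probability >= 0.5)

print(probability, predicted_class)
```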
Libraries Required
- NumPy
- Pandas
- Matplotlib
- Seaborn
- sklearn
Implementation of logistic regression to predict whether people buy a product
Step-1: Importing Libraries and dataset
To start predicting with logistic regression, the first step is to get the dataset. It records whether each person bought a product, along with their age, gender, and estimated annual salary. The dataset can be downloaded at Dataset.
# import the required libraries
import numpy as np
import pandas as pd

# import the visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# read the dataset
df = pd.read_csv('path for your file')
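If you want to try the later steps before downloading the file, a synthetic stand-in table can help. The sketch below builds one with the column names used in this post (User ID, Gender, Age, EstimatedSalary, Purchased) and takes a first look at it. The helper name make_synthetic_purchases, the value ranges, and the rule that generates Purchased are all assumptions made for illustration; the values are random and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def make_synthetic_purchases(n_rows=200, seed=0):
    """Build a random stand-in table with the column names the post uses.

    The values are synthetic and carry no real-world meaning.
    """
    rng = np.random.default_rng(seed)
    age = rng.integers(18, 61, size=n_rows)             # ages 18..60
    salary = rng.integers(15_000, 150_001, size=n_rows)
    # Assumed relationship: older, better-paid people buy more often.
    score = 0.08 * (age - 38) + 0.00003 * (salary - 70_000)
    purchased = (score + rng.normal(0, 1, size=n_rows) > 0).astype(int)
    return pd.DataFrame({
        "User ID": np.arange(1, n_rows + 1),
        "Gender": rng.choice(["Male", "Female"], size=n_rows),
        "Age": age,
        "EstimatedSalary": salary,
        "Purchased": purchased,
    })

df = make_synthetic_purchases()

# First look: size, column types, and the first few rows.
print(df.shape)
print(df.dtypes)
print(df.head())
```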
Step-2: Preprocessing the data
First, we check for missing values in the data. If a column has many missing values, we remove it.
# counting the null values in each column
df.isnull().sum()
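The rule "remove a column if it has many missing values" can be made concrete with a threshold. The sketch below uses a tiny made-up table and a hypothetical cutoff of 50% missing; both are assumptions for illustration, and the right cutoff depends on your data.

```python
import numpy as np
import pandas as pd

# Tiny made-up table with some gaps, for illustration only.
demo = pd.DataFrame({
    "Age": [25, np.nan, 47, 35],
    "EstimatedSalary": [np.nan, np.nan, np.nan, 52000],
    "Purchased": [0, 1, 1, 0],
})

# Fraction of missing values in each column.
missing_share = demo.isnull().mean()

# Hypothetical threshold: drop columns that are more than half empty.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = demo.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```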
Now, check for any outliers in the data.
# checking for outliers in the Age column
fig, axes = plt.subplots(1, 2, figsize=(15, 10))
sns.boxplot(ax=axes[0], x=df['Age'], palette="Set1")
# setting the title of the first plot to Age Of People
axes[0].set_title('Age Of People')

# checking for outliers in the EstimatedSalary column
sns.boxplot(ax=axes[1], data=df['EstimatedSalary'])
# setting the title of the second plot to Estimated Salary Of People
axes[1].set_title('Estimated Salary Of People')
plt.show()
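A box plot marks as outliers the points that fall beyond the whiskers, which by default sit 1.5 times the interquartile range (IQR) away from the quartiles. The same rule can be applied numerically to list the outlying values. This is a minimal sketch: the helper name iqr_outliers and the sample ages are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up ages; 95 is deliberately extreme.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95])
outliers = iqr_outliers(ages)

print(outliers.tolist())
```

With the real data, the same helper could be called on df['Age'] and df['EstimatedSalary'].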
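In the next step, check the correlation between the columns. A column with very low correlation to the target contributes little to the prediction and can be dropped.
# correlation matrix of the numeric columns
# (numeric_only requires pandas 1.5 or newer; Gender is still text at this point)
corr = df.corr(numeric_only=True)

f, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, linewidths=0.5, linecolor="red", fmt='.1f', cmap='viridis', ax=ax)
plt.show()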
We drop the User ID column because it doesn't contribute to the prediction, and we encode Gender as numbers so the model can use it.
# preparing data
df.drop('User ID', axis=1, inplace=True)

# encode Gender as 0/1
label = {'Male': 0, 'Female': 1}
df['Gender'] = df['Gender'].map(label)
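When encoding a text column with a dictionary, it can be worth checking that every value was actually covered, since a misspelled or differently cased entry would otherwise slip through. The sketch below uses pandas' Series.map, which returns NaN for any value missing from the dictionary. The small table, including the lowercase "male" entry, is made up for illustration.

```python
import pandas as pd

# Made-up column; the last entry is deliberately inconsistent.
demo = pd.DataFrame({"Gender": ["Male", "Female", "Female", "male"]})

label = {"Male": 0, "Female": 1}

# map() yields NaN for values that are not keys of the dictionary.
encoded = demo["Gender"].map(label)

# Collect the original values that could not be encoded.
unmapped = demo.loc[encoded.isna(), "Gender"].unique().tolist()

print(encoded.tolist())
print("unmapped values:", unmapped)
```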
Step-3: Splitting the dataset.
X is set to all the columns except the Purchased column, and y is set to the Purchased column, which is what we are going to predict.
# set inputs and outputs
X = df.drop('Purchased', axis=1)
y = df['Purchased']
With the help of the StandardScaler, data can be standardized so that the transformed feature has a mean of 0 and a standard deviation of 1.
# we import the StandardScaler from sklearn library
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# we fit the data to standard scaler
d_scaled = scaler.fit_transform(X)
data_scaled1 = pd.DataFrame(d_scaled)
data_scaled1.head()
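To see what standardization does, the sketch below scales a small made-up array whose two columns live on very different scales, then checks that each scaled column has mean 0 and standard deviation 1. It also reproduces the result by hand as (x - mean) / std. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages and salaries on very different scales.
X_demo = np.array([
    [25, 20000.0],
    [35, 50000.0],
    [45, 80000.0],
    [55, 110000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same transformation written out by hand.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(col_means, col_stds)
```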
Next, we split the data into training and testing sets. A common choice is to keep 70% of the data for training and the remaining 30% for testing.
# we import train_test_split from sklearn
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(d_scaled, y, test_size=0.30, random_state=4)
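A variation worth knowing, shown here as a sketch rather than as the method used in this post: split first, then fit the scaler on the training rows only, so that no statistics from the test rows influence preprocessing. Passing stratify also keeps the class proportions similar in both parts. The data below comes from sklearn's make_classification and is a synthetic stand-in, not the purchase dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three features; not the blog's dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=0,
    weights=[0.65, 0.35], random_state=4,
)

# 70/30 split that preserves the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=4, stratify=y_demo
)

# Learn the mean and std from the training rows only,
# then apply the same transformation to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

train_share = y_tr.mean()  # share of class 1 in the training part
test_share = y_te.mean()   # share of class 1 in the test part
print(X_tr.shape, X_te.shape, train_share, test_share)
```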
Step-4: Building the model for Logistic regression and testing it on test data.
We use an inbuilt model from the sklearn library to build the Logistic regression model and fit our training data to it.
# we import LogisticRegression module from sklearn library
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=0.1, max_iter=500)
model.fit(X_train, y_train)
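In sklearn, C is the inverse of the regularization strength, so a smaller C means stronger regularization and, with the default L2 penalty, smaller coefficients. The sketch below fits two models on synthetic data and compares the size of their coefficient vectors. The data and the two C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the blog's dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)

# Small C = strong regularization, large C = weak regularization.
strong = LogisticRegression(C=0.01, max_iter=500).fit(X_demo, y_demo)
weak = LogisticRegression(C=100.0, max_iter=500).fit(X_demo, y_demo)

# coef_ has one row per decision function and one column per feature.
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)

print("coefficients (C=0.01):", strong.coef_, "norm:", strong_norm)
print("coefficients (C=100): ", weak.coef_, "norm:", weak_norm)
```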
We then use the trained model to predict, for the test data, whether people buy the product.
# Predicting values of test data using the above trained model
y_pred = model.predict(X_test)
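Besides hard class labels, LogisticRegression can return the estimated probability of each class through predict_proba. For a binary problem, predict corresponds to a 0.5 cutoff on the probability of class 1, and a different cutoff can be applied by hand. The sketch below uses synthetic data, and the 0.7 cutoff is a hypothetical choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the blog's dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)

model = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

# Column 1 holds the estimated probability of class 1.
proba = model.predict_proba(X_demo)[:, 1]

default_pred = model.predict(X_demo)
manual_pred = (proba > 0.5).astype(int)   # same decision as predict()

# Hypothetical stricter cutoff: predict 1 only when fairly sure.
stricter_pred = (proba >= 0.7).astype(int)

print(default_pred.sum(), stricter_pred.sum())
```

Raising the cutoff can only reduce the number of predicted buyers, which usually trades recall for precision.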
Step-5: Evaluation Metrics
Evaluation metrics show how well your model is performing. We use accuracy, precision, recall, and F1 score.
# we import accuracy, precision, recall and f1 score using sklearn library
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
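All four metrics can be derived from the confusion matrix, which counts true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). Accuracy is (TP + TN) / total, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. The sketch below computes them by hand on ten made-up labels and compares the results with sklearn's functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score,
)

# Made-up true labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("by hand:", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```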