Logistic Regression
Introduction
Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. In this lecture, we will learn how to perform logistic regression in R, including model building, evaluation, and interpretation.
Key Concepts
1. What is Logistic Regression?
Logistic regression models the probability that a given input point belongs to a certain class. The logistic (sigmoid) function maps the linear combination of the independent variables, which can be any real number, to a probability between 0 and 1:
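\[ P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}} \]
where:
\( P(Y = 1 \mid X) \) is the probability that the dependent variable \( Y \) equals 1 given the independent variables \( X = (X_1, \dots, X_n) \).
\( \beta_0, \beta_1, \dots, \beta_n \) are the model coefficients (\( \beta_0 \) is the intercept).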
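To make the logistic function concrete, here is a minimal sketch that evaluates it for a single predictor. The coefficient values `beta0` and `beta1` and the helper name `sigmoid` are hypothetical, chosen only for illustration; `plogis()` is the equivalent function that ships with base R.

```r
# Logistic (sigmoid) function: maps any real number to the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta0 <- -1.0  # intercept
beta1 <- 2.0   # slope for a single predictor x

x <- c(-2, 0, 0.5, 2)
linear_predictor <- beta0 + beta1 * x
probabilities <- sigmoid(linear_predictor)
print(round(probabilities, 3))

# Base R provides the same function as plogis()
print(all.equal(probabilities, plogis(linear_predictor)))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is a natural default classification threshold.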
2. Assumptions of Logistic Regression
For logistic regression to provide reliable results, the following assumptions should be met:
The dependent variable is binary.
The log-odds (logit) of the outcome is a linear function of the independent variables.
Observations are independent of each other.
There is little to no multicollinearity among the independent variables.
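The multicollinearity assumption can be checked before fitting. The sketch below uses only base R and a hypothetical data frame `predictors`, in which `x3` is deliberately built to be nearly collinear with `x1`. It prints the correlation matrix and computes a variance inflation factor (VIF) by hand. A VIF above roughly 5 to 10 is a commonly quoted rule of thumb for concern, not a hard cutoff.

```r
set.seed(123)
# Hypothetical predictors; x3 is built to be nearly collinear with x1
predictors <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
predictors$x3 <- predictors$x1 + rnorm(100, sd = 0.1)

# Pairwise correlations: values near +1 or -1 flag potential collinearity
print(round(cor(predictors), 2))

# Variance inflation factor for each predictor: 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on all the others
vif <- sapply(names(predictors), function(name) {
  others <- setdiff(names(predictors), name)
  fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

In this constructed example `x1` and `x3` should show large VIF values while `x2` stays close to 1.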
Performing Logistic Regression in R
1. Building the Model
You can build a logistic regression model using the glm() function in R with the family parameter set to binomial.
# Creating a sample dataset
set.seed(123)
data <- data.frame(
  x = rnorm(100),
  y = rbinom(100, 1, 0.5)
)
# Building the logistic regression model
model <- glm(y ~ x, data = data, family = binomial)
summary(model)
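The coefficients reported by summary() are on the log-odds scale, which is hard to read directly. As a sketch of one common way to interpret them, the block below rebuilds the same sample model and exponentiates the coefficients to obtain odds ratios with Wald 95% confidence intervals. The names `odds_ratios` and `ci` are illustrative.

```r
set.seed(123)
data <- data.frame(
  x = rnorm(100),
  y = rbinom(100, 1, 0.5)
)
model <- glm(y ~ x, data = data, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(model))

# Wald 95% confidence intervals, also moved to the odds-ratio scale
ci <- exp(confint.default(model))

print(round(cbind(OddsRatio = odds_ratios, ci), 3))
```

An odds ratio of 1 means the predictor does not change the odds of y = 1. Because y is generated independently of x here, the interval for x would be expected to include 1.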
2. Evaluating the Model
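You can evaluate the model’s performance with a confusion matrix, accuracy, and the ROC curve. The code below scores the same data the model was fitted on, so the results will be optimistic; the comprehensive example later in this lecture evaluates on a held-out test set instead.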
# Making predictions
predictions <- predict(model, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Confusion Matrix
# (fixed factor levels keep the table 2 x 2 even if one class is never predicted)
confusion_matrix <- table(
  Predicted = factor(predicted_classes, levels = c(0, 1)),
  Actual = factor(data$y, levels = c(0, 1))
)
print(confusion_matrix)
# Calculating accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
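Accuracy alone hides which kind of error the model makes. As a minimal sketch, the block below rebuilds the sample model and reads sensitivity and specificity off the confusion matrix. The variable names (`cm`, `tp`, `tn`, `fp`, `fn`) are illustrative, and the 0.5 threshold is the same one used above.

```r
set.seed(123)
data <- data.frame(
  x = rnorm(100),
  y = rbinom(100, 1, 0.5)
)
model <- glm(y ~ x, data = data, family = binomial)
predicted_classes <- ifelse(predict(model, type = "response") > 0.5, 1, 0)

# Fixed factor levels guarantee a 2 x 2 table
cm <- table(
  Predicted = factor(predicted_classes, levels = c(0, 1)),
  Actual = factor(data$y, levels = c(0, 1))
)

tp <- cm["1", "1"]  # true positives
tn <- cm["0", "0"]  # true negatives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives

sensitivity <- tp / (tp + fn)  # share of actual 1s predicted as 1
specificity <- tn / (tn + fp)  # share of actual 0s predicted as 0
print(c(sensitivity = sensitivity, specificity = specificity))
```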
3. Plotting the ROC Curve
You can plot the ROC curve using the pROC package.
# Installing (only needed once) and loading the pROC package
install.packages("pROC")
library(pROC)
# Plotting the ROC curve
roc_curve <- roc(data$y, predictions)
plot(roc_curve, main = "ROC Curve")
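The ROC curve is usually summarized by the area under it (AUC). The sketch below computes the AUC by hand in base R using the rank-based (Mann-Whitney) formulation, so it runs without extra packages. The name `auc_manual` is illustrative. pROC also offers an auc() function, though its automatic choice of comparison direction means its result may not always match this calculation.

```r
set.seed(123)
data <- data.frame(
  x = rnorm(100),
  y = rbinom(100, 1, 0.5)
)
model <- glm(y ~ x, data = data, family = binomial)
predictions <- predict(model, type = "response")

# AUC as the probability that a randomly chosen positive case receives a
# higher predicted probability than a randomly chosen negative case
n_pos <- sum(data$y == 1)
n_neg <- sum(data$y == 0)
ranks <- rank(predictions)  # ties receive average ranks
auc_manual <- (sum(ranks[data$y == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
print(auc_manual)
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation of the two classes.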
Example: Comprehensive Logistic Regression Analysis
Here’s a comprehensive example of performing logistic regression analysis in R.
# Creating a sample dataset
set.seed(123)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = rbinom(100, 1, 0.5)
)
# Splitting the data into training and testing sets
library(caret)
trainIndex <- createDataPartition(data$y, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
# Building the logistic regression model
model <- glm(y ~ x1 + x2, data = train_data, family = binomial)
summary(model)
# Making predictions on the test set
predictions <- predict(model, newdata = test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Evaluating the model
# (fixed factor levels keep the table 2 x 2 even if one class is never predicted)
confusion_matrix <- table(
  Predicted = factor(predicted_classes, levels = c(0, 1)),
  Actual = factor(test_data$y, levels = c(0, 1))
)
print(confusion_matrix)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
# Plotting the ROC curve
library(pROC)
roc_curve <- roc(test_data$y, predictions)
plot(roc_curve, main = "ROC Curve")
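The sample data above draw y independently of x1 and x2, so a test accuracy near 0.5 is the expected outcome rather than a sign of a mistake. To see the same workflow on data that does contain signal, here is a sketch that simulates y from a logistic model with hypothetical "true" coefficients (`true_beta`) and uses a simple base R random split in place of caret. All names and values are assumptions made for illustration.

```r
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)

# Hypothetical "true" coefficients used to generate the outcome
true_beta <- c(intercept = -0.5, x1 = 1.5, x2 = -1.0)
p <- plogis(true_beta[["intercept"]] +
              true_beta[["x1"]] * x1 +
              true_beta[["x2"]] * x2)
sim_data <- data.frame(x1 = x1, x2 = x2, y = rbinom(n, 1, p))

# Hold out 30% of the rows for testing (simple random split)
train_rows <- sample(n, size = 0.7 * n)
train_data <- sim_data[train_rows, ]
test_data <- sim_data[-train_rows, ]

# Fit on the training rows; estimates should land near true_beta
model <- glm(y ~ x1 + x2, data = train_data, family = binomial)
print(round(coef(model), 2))

# Score the held-out rows with a 0.5 threshold
test_prob <- predict(model, newdata = test_data, type = "response")
test_accuracy <- mean((test_prob > 0.5) == test_data$y)
print(test_accuracy)
```

Comparing the fitted coefficients with `true_beta` is a quick sanity check that the modeling code behaves as intended before it is applied to real data.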
Summary
In this lecture, we covered how to perform logistic regression in R, including building the model, evaluating its performance, making predictions, and visualizing the results. Logistic regression is a powerful tool for modeling binary outcomes and making predictions based on those models.
Call to Action
If you found this lecture helpful, make sure to check out the other lectures in the ML R series. Happy coding!