Segmented Regression On Estimated Probabilities Vs. Raw Binary Outcome
Introduction
Regression analysis is a fundamental tool in statistics used to model the relationship between a dependent variable and one or more independent variables. In the context of binary data, logistic regression is a popular choice for predicting the probability of a binary outcome. However, when dealing with segmented regression, it is essential to consider whether using estimated probabilities or the raw binary outcomes is the more appropriate approach. In this article, we examine segmented regression and the implications of using estimated probabilities versus raw binary outcomes.
What is Segmented Regression?
Segmented regression, also known as piecewise regression, is a type of regression analysis that involves fitting multiple regression lines to a dataset, with each line representing a different segment or interval of the data. This approach is particularly useful when the relationship between the dependent and independent variables changes at specific points or intervals. Segmented regression can help identify these changes and provide a more accurate representation of the underlying relationship.
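To make the idea concrete, here is a minimal base-R sketch of a piecewise (segmented) linear fit; the single breakpoint at x = 5 and the simulated data are illustrative assumptions rather than part of any real analysis.
# Minimal piecewise linear regression sketch with one assumed breakpoint at x = 5
set.seed(1)
x <- runif(200, min = 0, max = 10)
y <- 2 + 0.5 * x + 1.5 * pmax(x - 5, 0) + rnorm(200)   # slope changes at x = 5

# pmax(x - 5, 0) is a hinge term: zero before the breakpoint and linear after it,
# so its coefficient estimates the change in slope at the breakpoint
fit <- lm(y ~ x + pmax(x - 5, 0))
summary(fit)
In practice the breakpoint is usually unknown and must itself be estimated (for example by profiling over candidate locations), but the hinge-term construction above is the basic building block.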
Logistic Regression with Natural Splines
When natural splines are used in a logistic regression, we can model non-linear relationships between the independent variables and the log-odds of the binary outcome. The natural spline approach allows us to capture complex relationships and to identify potential changes in the relationship at specific points or intervals.
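As a rough sketch of what this looks like in R (the simulated data and the choice of four degrees of freedom are illustrative assumptions), a natural spline basis from the splines package can be passed directly to glm(), and predictions can be taken on either the log-odds or the probability scale:
# Logistic regression with a natural spline for x
library(splines)
set.seed(2)
x <- runif(300, min = 0, max = 10)
y <- rbinom(300, size = 1, prob = plogis(-3 + 0.8 * x))   # true relationship is linear on the log-odds scale

fit <- glm(y ~ ns(x, df = 4), family = binomial)   # natural spline with 4 degrees of freedom
log_odds <- predict(fit, type = "link")            # fitted values on the log-odds scale
p_hat    <- predict(fit, type = "response")        # fitted values on the probability scale
head(cbind(x, log_odds, p_hat))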
Estimated Probabilities vs. Raw Binary Outcomes
When performing segmented regression, we have two options: using estimated probabilities or raw binary outcomes. Estimated probabilities are obtained by applying the logistic regression model to the data, resulting in a predicted probability of the binary outcome for each observation. Raw binary outcomes, on the other hand, are the actual binary values (0 or 1) observed in the data.
Using Estimated Probabilities
Using estimated probabilities in segmented regression can be beneficial in several ways (a short sketch of this two-stage approach appears at the end of this section):
- Improved model fit: Estimated probabilities can provide a better fit to the data, especially when the relationship between the independent variables and the binary outcome is non-linear.
- Increased accuracy: By using estimated probabilities, we can obtain more accurate predictions of the binary outcome, which can be particularly important in applications where the outcome has significant consequences.
- Easier interpretation: Estimated probabilities can be easier to interpret than raw binary outcomes, as they provide a continuous measure of the probability of the binary outcome.
However, using estimated probabilities also has some limitations:
- Loss of information: By using estimated probabilities, we may lose some information about the raw binary outcomes, which can be important in certain applications.
- Overfitting: If the model is too complex, it may overfit the data, resulting in poor performance on new, unseen data.
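To illustrate this route, here is a rough two-stage sketch: the logistic model supplies estimated probabilities, and an ordinary piecewise regression is then fitted to those probabilities. The breakpoint at x = 5 and the simulated data are assumptions made purely for illustration; fitting on qlogis(p_hat) instead keeps the second stage on the unbounded log-odds scale.
# Stage 1: natural-spline logistic regression, giving estimated probabilities
library(splines)
set.seed(3)
x <- runif(300, min = 0, max = 10)
y <- rbinom(300, size = 1, prob = plogis(x - 5))
p_hat <- predict(glm(y ~ ns(x, df = 4), family = binomial), type = "response")

# Stage 2: piecewise linear fit to the estimated probabilities (assumed breakpoint at x = 5)
seg_fit <- lm(p_hat ~ x + pmax(x - 5, 0))
summary(seg_fit)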
Using Raw Binary Outcomes
Using raw binary outcomes in segmented regression can be beneficial in several ways (a one-stage sketch appears at the end of this section):
- Preservation of information: By using raw binary outcomes, we preserve the information about the actual binary values observed in the data.
- Simpler model: Using raw binary outcomes can result in a simpler model, which can be easier to interpret and less prone to overfitting.
- Better generalizability: Raw binary outcomes can provide better generalizability to new, unseen data, as they are less dependent on the specific model used.
However, using raw binary outcomes also has some limitations:
- Limited model fit: Raw binary outcomes may not provide a good fit to the data, especially when the relationship between the independent variables and the binary outcome is non-linear.
- Difficulty in interpretation: Raw binary outcomes can be difficult to interpret, as they represent a binary value (0 or 1) rather than a continuous probability.
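For comparison with the two-stage sketch above, here is the one-stage alternative on the raw outcomes: the piecewise terms go directly into the logistic model, so the 0/1 data are modelled by the binomial likelihood itself. Again, the breakpoint at x = 5 and the simulated data are illustrative assumptions.
# One-stage alternative: piecewise logistic regression fitted directly to the 0/1 outcome
set.seed(4)
x <- runif(300, min = 0, max = 10)
y <- rbinom(300, size = 1, prob = plogis(x - 5))

# the hinge term lets the slope of the log-odds change at the assumed breakpoint
fit_raw <- glm(y ~ x + pmax(x - 5, 0), family = binomial)
summary(fit_raw)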
Conclusion
In conclusion, both estimated probabilities and raw binary outcomes have their advantages and disadvantages in segmented regression. The choice between the two ultimately depends on the specific research question, the nature of the data, and the goals of the analysis. By considering these factors and the implications of using estimated probabilities versus raw binary outcomes, researchers can make informed decisions and perform more accurate and reliable segmented regression analyses.
Example Code in R
Here's example R code that demonstrates how to perform segmented regression using estimated probabilities and raw binary outcomes:
# Load necessary libraries
library(ggplot2)
library(splines)

# Simulate binary data whose success probability follows a logistic curve centred at x = 5,
# so the probabilities span the full 0-1 range over the observed x values
set.seed(123)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- rbinom(n, size = 1, prob = plogis(x - 5))

# Logistic regression with a natural spline for x (interior knots at 2, 5 and 8)
model <- glm(y ~ ns(x, knots = c(2, 5, 8)), family = binomial)

# Estimated probabilities: fitted values on the probability scale
predicted_probabilities <- predict(model, type = "response")

# Plot the estimated probabilities against x
ggplot(data.frame(x, y, predicted_probabilities), aes(x = x, y = predicted_probabilities)) +
  geom_point() +
  geom_line() +
  labs(title = "Estimated Probabilities", x = "x", y = "Predicted Probability")
# Raw binary outcomes: plot the observed 0/1 values themselves against x
ggplot(data.frame(x, y), aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Raw Binary Outcomes", x = "x", y = "Observed Outcome (0/1)")
Frequently Asked Questions
In the first part of this article, we explored segmented regression and discussed the implications of using estimated probabilities versus raw binary outcomes. In this part, we answer some frequently asked questions (FAQs) about segmented regression and provide additional insights to help you better understand the topic.
Q: What is the difference between estimated probabilities and raw binary outcomes?
A: Estimated probabilities are obtained by applying a logistic regression model to the data, resulting in a predicted probability of the binary outcome for each observation. Raw binary outcomes, on the other hand, are the actual binary values (0 or 1) observed in the data.
Q: Why would I want to use estimated probabilities instead of raw binary outcomes?
A: Using estimated probabilities can provide a better fit to the data, especially when the relationship between the independent variables and the binary outcome is non-linear. Estimated probabilities can also be easier to interpret than raw binary outcomes, as they provide a continuous measure of the probability of the binary outcome.
Q: What are the limitations of using estimated probabilities?
A: Using estimated probabilities can result in a loss of information about the raw binary outcomes, which can be important in certain applications. Additionally, if the model is too complex, it may overfit the data, resulting in poor performance on new, unseen data.
Q: Why would I want to use raw binary outcomes instead of estimated probabilities?
A: Using raw binary outcomes can preserve the information about the actual binary values observed in the data. Raw binary outcomes can also result in a simpler model, which can be easier to interpret and less prone to overfitting.
Q: What are the limitations of using raw binary outcomes?
A: Using raw binary outcomes may not provide a good fit to the data, especially when the relationship between the independent variables and the binary outcome is non-linear. Raw binary outcomes can also be difficult to interpret, as they represent a binary value (0 or 1) rather than a continuous probability.
Q: How do I choose between estimated probabilities and raw binary outcomes?
A: The choice between estimated probabilities and raw binary outcomes depends on the specific research question, the nature of the data, and the goals of the analysis. Consider the following factors:
- Downstream use: If subsequent analysis needs the observed 0/1 values themselves, raw binary outcomes may be more suitable. If it needs a continuous measure of risk, estimated probabilities may be more suitable.
- Model complexity: If the model is too complex, estimated probabilities may be more prone to overfitting. If the model is too simple, raw binary outcomes may not provide a good fit to the data.
- Interpretability: If you need to interpret the results in a specific way, estimated probabilities may be more suitable. If you need to preserve the information about the raw binary outcomes, raw binary outcomes may be more suitable.
Q: Can I use both estimated probabilities and raw binary outcomes in the same analysis?
A: Yes. A common pattern is to fit the logistic regression first to obtain estimated probabilities and then apply a classification rule, such as a probability threshold, to turn those probabilities into predicted binary outcomes that can be compared against the raw observed values.
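Here is a minimal sketch of that combination, reusing the same simulated setup as the earlier example; the 0.5 cut-off is an arbitrary assumption and should be chosen to suit the application.
# Logistic model for the probabilities, then a simple classification rule on top of them
library(splines)
set.seed(123)
x <- runif(100, min = 0, max = 10)
y <- rbinom(100, size = 1, prob = plogis(x - 5))
model <- glm(y ~ ns(x, knots = c(2, 5, 8)), family = binomial)

p_hat <- predict(model, type = "response")   # estimated probabilities
y_hat <- ifelse(p_hat > 0.5, 1, 0)           # predicted classes at an assumed 0.5 threshold
table(predicted = y_hat, observed = y)       # compare predictions with the raw outcomes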
Q: How do I evaluate the performance of my segmented regression model?
A: To evaluate the performance of your segmented regression model, you can use various metrics such as the following (a short sketch computing them from a confusion matrix appears after this list):
- Accuracy: This measures the proportion of correctly classified observations.
- Precision: This measures the proportion of true positives among all positive predictions.
- Recall: This measures the proportion of true positives among all actual positive observations.
- F1-score: This measures the harmonic mean of precision and recall.
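As a minimal sketch, these four metrics can be computed directly from the counts in a confusion matrix; the vectors y_hat (predicted 0/1 labels) and y (observed 0/1 labels) are assumed to exist, with 1 treated as the positive class.
# Accuracy, precision, recall and F1 from 0/1 predictions (y_hat) and observations (y)
tp <- sum(y_hat == 1 & y == 1)   # true positives
fp <- sum(y_hat == 1 & y == 0)   # false positives
fn <- sum(y_hat == 0 & y == 1)   # false negatives
tn <- sum(y_hat == 0 & y == 0)   # true negatives

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)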
You can also use visualizations such as:
- ROC curve: This plots the true positive rate against the false positive rate at different thresholds.
- Precision-recall curve: This plots the precision against the recall at different thresholds.
Conclusion
In conclusion, segmented regression is a powerful tool for modeling complex relationships between variables. By understanding the implications of using estimated probabilities versus raw binary outcomes, you can make informed decisions and perform more accurate and reliable segmented regression analyses. Remember to consider the specific research question, the nature of the data, and the goals of the analysis when choosing between estimated probabilities and raw binary outcomes.
Example Code in R
Here's example R code that demonstrates how to evaluate the performance of a segmented regression model:
# Load necessary libraries
library(ggplot2)
library(splines)
library(pROC)    # roc() for the ROC curve
library(PRROC)   # pr.curve() for the precision-recall curve

# Simulate data and fit the same natural-spline logistic model as in the earlier example
set.seed(123)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- rbinom(n, size = 1, prob = plogis(x - 5))
model <- glm(y ~ ns(x, knots = c(2, 5, 8)), family = binomial)
predicted_probabilities <- predict(model, type = "response")

# Confusion matrix at an (assumed) 0.5 probability threshold
confusion_matrix <- table(predicted = predicted_probabilities > 0.5, observed = y)
print(confusion_matrix)

# ROC curve (pROC); legacy.axes = TRUE puts the false positive rate on the x axis
roc_curve <- roc(y, predicted_probabilities)
plot(roc_curve, legacy.axes = TRUE, main = "ROC Curve",
     xlab = "False Positive Rate", ylab = "True Positive Rate")

# Precision-recall curve (PRROC); scores are supplied separately for each observed class
precision_recall_curve <- pr.curve(
  scores.class0 = predicted_probabilities[y == 1],
  scores.class1 = predicted_probabilities[y == 0],
  curve = TRUE
)
plot(precision_recall_curve, main = "Precision-Recall Curve")
This code demonstrates how to evaluate the performance of a segmented regression model using several metrics and visualizations in R. The example includes a simulated dataset, a logistic regression model with natural splines, a confusion matrix at a 0.5 threshold, and plots of the ROC curve (via pROC) and the precision-recall curve (via PRROC).