KNN Diabetes Prediction Capstone

Author

Jennifer Becerra

1 Introduction

Diabetes is a chronic health condition that can lead to serious complications when it is not identified early. Because early detection improves long‑term outcomes, machine learning methods have become increasingly useful for supporting clinical decision‑making. These models can identify patterns in health data that may not be immediately visible through traditional analysis, making them valuable tools for screening and risk assessment (Xing & Bei, 2020; Nain & Tomar, 2023).

In this project, I applied the K‑Nearest Neighbors (KNN) algorithm to the Pima Indians Diabetes dataset to predict whether an individual is likely to have diabetes. KNN is a distance‑based classification method that assigns a label to a new observation by comparing it to the most similar cases in the training data (Zhang et al., 2018). Because the algorithm relies directly on the training dataset when making predictions, its performance depends heavily on data quality, preprocessing, and the selection of the tuning parameter k (Zhang, 2022).

KNN uses distance calculations, typically Euclidean distance, to determine similarity between observations. This makes preprocessing essential, since variables with larger numeric ranges can disproportionately influence the distance metric if the data is not standardized. The choice of k also plays a major role in model performance. Smaller values of k may overfit to noise in the training data, while larger values may smooth out meaningful distinctions. KNN has been applied in areas such as finance, forecasting, and medical diagnosis due to its simplicity and its ability to perform well when properly tuned (Alkhatib et al., 2013).
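The effect of unequal numeric ranges on Euclidean distance can be illustrated with a small Python sketch. The project's own code is not shown in this report, so everything here is an assumption for illustration: the three patients, their values, and the `spread` used for rescaling are made up and are not taken from the dataset.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical (glucose, pedigree score) pairs, made up for illustration only.
patient_a = (148.0, 0.20)
patient_b = (150.0, 2.40)  # similar glucose, very different pedigree score
patient_c = (160.0, 0.21)  # different glucose, nearly identical pedigree score

# On the raw scale the glucose column dominates, so b looks closer to a than c does.
raw_ab = euclidean(patient_a, patient_b)
raw_ac = euclidean(patient_a, patient_c)

# Assumed per-feature spreads (illustrative, not estimated from the data).
spread = (30.0, 0.33)

def rescale(point):
    """Divide each feature by its assumed spread so both are on a comparable scale."""
    return tuple(value / s for value, s in zip(point, spread))

scaled_ab = euclidean(rescale(patient_a), rescale(patient_b))
scaled_ac = euclidean(rescale(patient_a), rescale(patient_c))

print(raw_ab < raw_ac)        # True: before rescaling, b is the nearer neighbor
print(scaled_ab > scaled_ac)  # True: after rescaling, c is the nearer neighbor
```

The nearest neighbor of `patient_a` changes once the features are put on a comparable scale, which is why the ordering of neighbors, and therefore the KNN prediction, depends on standardization.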

The purpose of this project was to evaluate the predictive performance of KNN on the Pima Indians Diabetes dataset, compare it to baseline models, and examine how preprocessing and tuning influence classification accuracy. The broader goal was to determine whether KNN can serve as an effective and interpretable tool for diabetes risk prediction.

2 Background

The Pima Indians Diabetes dataset is widely used in machine‑learning research because it provides a structured set of clinical and demographic variables relevant to diabetes risk. Prior studies have shown that models trained on this dataset can achieve strong predictive performance when appropriate preprocessing and tuning are applied (Xing & Bei, 2020). KNN is particularly sensitive to these steps because it does not build a parametric model; instead, it relies on the structure of the data itself. This makes the algorithm straightforward to implement but also dependent on careful preparation of the input features.

3 Methods

For this project, I used the k‑Nearest Neighbors (KNN) algorithm as the main classification method. KNN is a simple, distance‑based approach that makes predictions by looking at the most similar cases in the dataset. Instead of building a complex model, it compares a new observation to the existing data and assigns the outcome that is most common among its closest neighbors (Zhang et al., 2018).

To measure how similar two observations are, KNN uses a distance calculation, and Euclidean distance is the most common choice. This gives the algorithm a way to determine which points are “closest” in the feature space. Once the distances are calculated, the algorithm looks at the k nearest points and uses a majority vote to decide the predicted class.
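The distance-then-vote procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the project; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k nearest neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up points on a standardized scale: three per class.
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.6), (1.0, 0.9), (1.2, 1.1), (0.8, 1.3)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.9, 1.0), k=3))    # 1
print(knn_predict(train_X, train_y, (-0.8, -0.9), k=3))  # 0
```

With two classes, an odd k avoids tied votes; in this sketch a tie would simply be resolved in favor of the label encountered first among the nearest neighbors.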

Choosing the right value of k is an important part of how KNN works. A very small k can make the model too sensitive to noise, while a large k can smooth out patterns that might actually matter (Zhang, 2022). Because of this, selecting an appropriate k is necessary for getting good performance.

4 Exploratory Data Analysis

  Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
1   0.6395305  0.8617221   -0.03272323    0.55804049 -0.2571922  0.1649875
2  -0.8443348 -1.2014407   -0.51729142   -0.01464349 -0.2571922 -0.8458446
3   1.2330766  2.0079237   -0.67881415   -0.01464349 -0.2571922 -1.3223797
4  -0.8443348 -1.0704463   -0.51729142   -0.58732747 -0.5181880 -0.6292377
5  -1.1411079  0.5014873   -2.61708691    0.55804049  0.1048342  1.5368309
6   0.3427574 -0.1862336    0.12879950   -0.01464349 -0.2571922 -0.9902491
  DiabetesPedigreeFunction         Age Outcome
1                0.4681869  1.42506672       1
2               -0.3648230 -0.19054773       0
3                0.6040037 -0.10551539       1
4               -0.9201630 -1.04087112       0
5                5.4813370 -0.02048305       1
6               -0.8175458 -0.27558007       0

  Pregnancies         Glucose           BloodPressure       SkinThickness      
 Min.   :-1.1411   Min.   :-2.5441340   Min.   :-3.909269   Min.   :-2.114485  
 1st Qu.:-0.8443   1st Qu.:-0.7183986   1st Qu.:-0.678814   1st Qu.:-0.396433  
 Median :-0.2508   Median :-0.1534850   Median :-0.032723   Median :-0.014643  
 Mean   : 0.0000   Mean   :-0.0009993   Mean   :-0.001491   Mean   :-0.004328  
 3rd Qu.: 0.6395   3rd Qu.: 0.6079203   3rd Qu.: 0.613368   3rd Qu.: 0.271699  
 Max.   : 3.9040   Max.   : 2.5319016   Max.   : 4.005345   Max.   : 6.666670  
    Insulin             BMI            DiabetesPedigreeFunction
 Min.   :-1.1917   Min.   :-2.058843   Min.   :-1.1888         
 1st Qu.:-0.2867   1st Qu.:-0.715880   1st Qu.:-0.6885         
 Median :-0.2572   Median :-0.022738   Median :-0.2999         
 Mean   :-0.1252   Mean   :-0.000326   Mean   : 0.0000         
 3rd Qu.:-0.2382   3rd Qu.: 0.598201   3rd Qu.: 0.4659         
 Max.   : 5.8131   Max.   : 5.002541   Max.   : 5.8797         
      Age             Outcome     
 Min.   :-1.0409   Min.   :0.000  
 1st Qu.:-0.7858   1st Qu.:0.000  
 Median :-0.3606   Median :0.000  
 Mean   : 0.0000   Mean   :0.349  
 3rd Qu.: 0.6598   3rd Qu.:1.000  
 Max.   : 4.0611   Max.   :1.000

5 Modeling Approach

The K‑Nearest Neighbors (KNN) algorithm was used for this project because it performs well on structured medical datasets and has been shown to be effective in similar health‑related prediction tasks (Xing & Bei, 2020). Before training the model, all numeric predictors were standardized so that each variable contributed equally to the distance calculations. This step is especially important in clinical datasets where measurements, such as glucose and BMI, exist on very different scales.
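A minimal Python sketch of z-score standardization is shown below. It assumes the means and standard deviations are estimated on the training rows and then reused for the test rows; the report does not state which rows the scaling statistics were computed from, so that choice, the helper names, and the glucose/BMI values are all assumptions for illustration.

```python
from statistics import mean, stdev

def fit_scaler(rows):
    """Return per-column (mean, sample standard deviation) computed from the given rows."""
    columns = list(zip(*rows))
    return [(mean(col), stdev(col)) for col in columns]

def apply_scaler(rows, params):
    """Standardize rows using previously fitted means and standard deviations."""
    return [
        tuple((value - m) / s for value, (m, s) in zip(row, params))
        for row in rows
    ]

# Made-up (glucose, BMI) rows, used only to make the sketch runnable.
train_rows = [(90.0, 22.0), (120.0, 30.0), (150.0, 38.0)]
test_rows = [(135.0, 26.0)]

params = fit_scaler(train_rows)                   # [(120.0, 30.0), (30.0, 8.0)]
train_scaled = apply_scaler(train_rows, params)
test_scaled = apply_scaler(test_rows, params)     # test rows reuse the training statistics

print(test_scaled)  # [(0.5, -0.5)]
```

After scaling, a one-unit difference means one standard deviation in either feature, so glucose and BMI contribute on the same footing to the distance calculation.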

To identify an appropriate value for k, I ran a tuning process using 10‑fold cross‑validation. Each candidate model was evaluated using the Area Under the ROC Curve (AUC), which provides a more reliable measure of discrimination than accuracy alone, especially in medical prediction settings (Nain & Tomar, 2023). I tested values of k from 1 through 25, and the results showed that k = 21 produced the highest average AUC across folds. This aligns with findings that slightly larger neighborhood sizes can help stabilize predictions in noisy or heterogeneous datasets (Zhang et al., 2018).
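The tuning loop can be sketched roughly as follows in Python. This is not the project's code: the data are synthetic, the folds are stratified by class (an assumption, made so that every fold contains both outcomes), and the helper names are hypothetical. AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
import math
import random

def knn_scores(train_X, train_y, queries, k):
    """For each query, return the fraction of its k nearest training points labeled 1."""
    scores = []
    for q in queries:
        nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], q))[:k]
        scores.append(sum(label for _, label in nearest) / k)
    return scores

def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, n_folds, seed):
    """Deal the shuffled indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % n_folds].append(i)
    return folds

def cv_auc(X, y, k, n_folds=10, seed=1):
    """Mean AUC of KNN with k neighbors across n_folds cross-validation folds."""
    fold_aucs = []
    for held_out in stratified_folds(y, n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        scores = knn_scores([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], k)
        fold_aucs.append(auc([y[i] for i in held_out], scores))
    return sum(fold_aucs) / n_folds

# Synthetic two-class data, generated only to make the sketch runnable.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)] + \
    [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(100)]
y = [0] * 100 + [1] * 100

# Evaluate k = 1..25 and keep the value with the highest mean cross-validated AUC.
results = {k: cv_auc(X, y, k) for k in range(1, 26)}
best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

Because the same folds are reused for every candidate k (fixed `seed`), differences in mean AUC reflect the choice of k rather than a different random split.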

After selecting k = 21, the final model was trained on the full training set and evaluated on the test set. Performance was assessed using AUC, accuracy, sensitivity, specificity, and the confusion matrix to understand how well the model identified both positive and negative cases. This approach ensured that the model was tuned systematically and grounded in established practices for predictive modeling in healthcare.

[1] 21

6 Results

The K‑Nearest Neighbors (KNN) model demonstrated strong predictive performance on the test dataset. Using the optimal value of k = 21, the model achieved an overall accuracy of approximately 0.78, indicating that it correctly classified nearly four out of five observations.

To further evaluate discriminatory ability, the model’s predicted probabilities were used to compute the Area Under the Receiver Operating Characteristic Curve (AUC). The resulting AUC of approximately 0.86 reflects excellent separation between individuals with and without diabetes.

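A confusion matrix was generated to summarize classification performance. The matrix displays the distribution of true positives, true negatives, false positives, and false negatives, providing insight into how the model performs across outcome categories. Note that the output below treats "No" (no diabetes) as the positive class, so the reported sensitivity of 0.8911 is the share of non‑diabetic individuals classified correctly (90 of 101), while the specificity of 0.5769 is the share of diabetic individuals identified correctly (30 of 52). A visual representation of the confusion matrix is shown in Figure X, with darker shading indicating higher cell frequencies.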

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  90  22
       Yes 11  30
                                          
               Accuracy : 0.7843          
                 95% CI : (0.7106, 0.8466)
    No Information Rate : 0.6601          
    P-Value [Acc > NIR] : 0.0005461       
                                          
                  Kappa : 0.4933          
                                          
 Mcnemar's Test P-Value : 0.0817228       
                                          
            Sensitivity : 0.8911          
            Specificity : 0.5769          
         Pos Pred Value : 0.8036          
         Neg Pred Value : 0.7317          
             Prevalence : 0.6601          
         Detection Rate : 0.5882          
   Detection Prevalence : 0.7320          
      Balanced Accuracy : 0.7340          
                                          
       'Positive' Class : No

Area under the curve: 0.8601
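The headline statistics in the output above can be reproduced directly from the four cell counts. The short Python sketch below does this arithmetic; the variable names are hypothetical, and the counts are copied from the confusion matrix (rows are predictions, columns are reference labels).

```python
# Cell counts copied from the confusion matrix (prediction x reference).
pred_no_ref_no, pred_no_ref_yes = 90, 22
pred_yes_ref_no, pred_yes_ref_yes = 11, 30

total = pred_no_ref_no + pred_no_ref_yes + pred_yes_ref_no + pred_yes_ref_yes
accuracy = (pred_no_ref_no + pred_yes_ref_yes) / total

# The output uses "No" as the positive class, so sensitivity refers to non-diabetic cases.
sensitivity_no = pred_no_ref_no / (pred_no_ref_no + pred_yes_ref_no)
specificity_no = pred_yes_ref_yes / (pred_no_ref_yes + pred_yes_ref_yes)
balanced_accuracy = (sensitivity_no + specificity_no) / 2

# The same counts viewed from the diabetic ("Yes") class.
recall_yes = pred_yes_ref_yes / (pred_no_ref_yes + pred_yes_ref_yes)
precision_yes = pred_yes_ref_yes / (pred_yes_ref_no + pred_yes_ref_yes)

print(total)                     # 153
print(round(accuracy, 4))        # 0.7843
print(round(sensitivity_no, 4))  # 0.8911
print(round(specificity_no, 4))  # 0.5769
print(round(precision_yes, 4))   # 0.7317
```

Viewed from the diabetic class, the reported specificity is the model's recall: 30 of the 52 diabetic test cases were identified, and 30 of the 41 "Yes" predictions were correct.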

7 Discussion

The results show that KNN can be an effective model for predicting diabetes when the data is properly preprocessed and the number of neighbors is carefully tuned. The AUC of about 0.86 suggests that the model ranks diabetic and non‑diabetic individuals well overall. In the confusion matrix, however, performance was uneven across classes: the model correctly classified 90 of 101 non‑diabetic cases but only 30 of 52 diabetic cases, which matters for a screening task where missed cases are costly. These findings are broadly consistent with previous research reporting that KNN performs well on medical health data (Xing & Bei, 2020) and in other prediction tasks when properly tuned (Alkhatib et al., 2013). Preprocessing was also important in this project. Because KNN relies on distances, performance would likely have been weaker without scaling and imputation, which aligns with recommendations in the literature (Zhang, 2022).

8 Limitations

Although the model performed well, there are several limitations to consider. The dataset is relatively small and represents a specific population, which limits generalizability. KNN is also computationally expensive for larger datasets because it must compute distances for every prediction. Additionally, KNN does not provide direct insight into feature importance, which can make interpretation more challenging compared to models like logistic regression or decision trees.

9 Conclusion

The findings from this project show that the K‑Nearest Neighbors algorithm can be a reliable approach for predicting diabetes risk when the data is carefully prepared and the tuning parameter is selected appropriately. With an accuracy of about 0.78 and an AUC of about 0.86, the model demonstrated strong overall performance, although it identified non‑diabetic cases more reliably than diabetic ones. KNN has limitations, such as sensitivity to scaling, higher computational cost, and no direct measure of feature importance, but it remains a practical and conceptually simple method for classification tasks in healthcare settings. These results suggest that KNN can serve as a useful tool for supporting early diabetes risk assessment, especially when paired with thoughtful preprocessing and validation.

10 References

Alkhatib, K., Najadat, H., Hmeidi, I., & Shatnawi, M. K. A. (2013). Stock price prediction using k‑nearest neighbor (kNN) algorithm. International Journal of Business, Humanities and Technology, 3(3), 32–44.

Nain, S., & Tomar, S. (2023). Medical image prediction for diagnosis of breast cancer using machine learning. AIP Conference Proceedings, 2853(1), 020140.

Xing, Y., & Bei, Y. (2020). Medical health big data classification based on KNN classification algorithm. IEEE Access.

Zhang, S. (2022). Challenges in KNN classification. IEEE Transactions on Knowledge and Data Engineering, 34(10), 4663–4675.

Zhang, X., Li, M., Zong, X., Zhu, X., & Wang, R. (2018). Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems.