Introduction
Diabetes is a chronic health condition that can lead to serious complications when it is not identified early. Because early detection improves long‑term outcomes, machine learning methods have become increasingly useful for supporting clinical decision‑making. These models can identify patterns in health data that may not be immediately visible through traditional analysis, making them valuable tools for screening and risk assessment (Xing & Bei, 2020; Nain & Tomar, 2023).
In this project, I applied the K‑Nearest Neighbors (KNN) algorithm to the Pima Indians Diabetes dataset to predict whether an individual is likely to have diabetes. KNN is a distance‑based classification method that assigns a label to a new observation by comparing it to the most similar cases in the training data (Zhang et al., 2018). Because the algorithm relies directly on the training dataset when making predictions, its performance depends heavily on data quality, preprocessing, and the selection of the tuning parameter k (Zhang, 2022).
KNN uses distance calculations—typically Euclidean distance—to determine similarity between observations. This makes preprocessing essential, since variables with larger numeric ranges can disproportionately influence the distance metric if the data is not standardized. The choice of k also plays a major role in model performance. Smaller values may lead to overfitting, while larger values may smooth out meaningful distinctions. KNN has been applied in areas such as finance, forecasting, and medical diagnosis due to its simplicity and its ability to perform well when properly tuned (Alkhatib et al., 2013).
The purpose of this project was to evaluate the predictive performance of KNN on the Pima Indians Diabetes dataset, compare it to baseline models, and examine how preprocessing and tuning influence classification accuracy. The broader goal was to determine whether KNN can serve as an effective and interpretable tool for diabetes risk prediction.
Background
The Pima Indians Diabetes dataset is widely used in machine‑learning research because it provides a structured set of clinical and demographic variables relevant to diabetes risk. Prior studies have shown that models trained on this dataset can achieve strong predictive performance when appropriate preprocessing and tuning are applied (Xing & Bei, 2020). KNN is particularly sensitive to these steps because it does not build a parametric model; instead, it relies on the structure of the data itself. This makes the algorithm straightforward to implement but also dependent on careful preparation of the input features.
Methods
For this project, I used the k‑Nearest Neighbors (KNN) algorithm as the main classification method. KNN is a simple, distance‑based approach that makes predictions by looking at the most similar cases in the dataset. Instead of building a complex model, it compares a new observation to the existing data and assigns the outcome that is most common among its closest neighbors (Zhang et al., 2018).
To measure how similar two observations are, KNN uses a distance calculation, and Euclidean distance is the most common choice. This gives the algorithm a way to determine which points are “closest” in the feature space. Once the distances are calculated, the algorithm looks at the k nearest points and uses a majority vote to decide the predicted class.
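The distance calculation and majority vote described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code used for this project; the function names `euclidean` and `knn_predict` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Return the most common label among the k training points nearest to query."""
    # Pair each distance with its label, then sort so the nearest points come first.
    neighbors = sorted(
        zip((euclidean(row, query) for row in train_X), train_y),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    # With two classes and an odd k, the vote cannot end in a tie.
    return votes.most_common(1)[0][0]

# Toy standardized values with two features; these are not rows from the dataset.
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.9), (1.0, 0.9), (1.2, 1.1), (0.8, 1.3)]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, query=(0.9, 1.0), k=3)
print(prediction)
```

Because no model is fitted in advance, all of the work happens at prediction time, when the query is compared against every training point.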
Choosing the right value of k is an important part of how KNN works. A very small k can make the model too sensitive to noise, while a large k can smooth out patterns that might actually matter (Zhang, 2022). Because of this, selecting an appropriate k is necessary for getting good performance.
Exploratory Data Analysis
First six rows of the standardized data:

  Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
1   0.6395305  0.8617221   -0.03272323    0.55804049 -0.2571922  0.1649875
2  -0.8443348 -1.2014407   -0.51729142   -0.01464349 -0.2571922 -0.8458446
3   1.2330766  2.0079237   -0.67881415   -0.01464349 -0.2571922 -1.3223797
4  -0.8443348 -1.0704463   -0.51729142   -0.58732747 -0.5181880 -0.6292377
5  -1.1411079  0.5014873   -2.61708691    0.55804049  0.1048342  1.5368309
6   0.3427574 -0.1862336    0.12879950   -0.01464349 -0.2571922 -0.9902491

  DiabetesPedigreeFunction         Age Outcome
1                0.4681869  1.42506672       1
2               -0.3648230 -0.19054773       0
3                0.6040037 -0.10551539       1
4               -0.9201630 -1.04087112       0
5                5.4813370 -0.02048305       1
6               -0.8175458 -0.27558007       0

Summary statistics of the standardized variables:

  Pregnancies          Glucose             BloodPressure        SkinThickness
 Min.   :-1.1411   Min.   :-2.5441340   Min.   :-3.909269   Min.   :-2.114485
 1st Qu.:-0.8443   1st Qu.:-0.7183986   1st Qu.:-0.678814   1st Qu.:-0.396433
 Median :-0.2508   Median :-0.1534850   Median :-0.032723   Median :-0.014643
 Mean   : 0.0000   Mean   :-0.0009993   Mean   :-0.001491   Mean   :-0.004328
 3rd Qu.: 0.6395   3rd Qu.: 0.6079203   3rd Qu.: 0.613368   3rd Qu.: 0.271699
 Max.   : 3.9040   Max.   : 2.5319016   Max.   : 4.005345   Max.   : 6.666670

    Insulin              BMI             DiabetesPedigreeFunction
 Min.   :-1.1917   Min.   :-2.058843   Min.   :-1.1888
 1st Qu.:-0.2867   1st Qu.:-0.715880   1st Qu.:-0.6885
 Median :-0.2572   Median :-0.022738   Median :-0.2999
 Mean   :-0.1252   Mean   :-0.000326   Mean   : 0.0000
 3rd Qu.:-0.2382   3rd Qu.: 0.598201   3rd Qu.: 0.4659
 Max.   : 5.8131   Max.   : 5.002541   Max.   : 5.8797

      Age              Outcome
 Min.   :-1.0409   Min.   :0.000
 1st Qu.:-0.7858   1st Qu.:0.000
 Median :-0.3606   Median :0.000
 Mean   : 0.0000   Mean   :0.349
 3rd Qu.: 0.6598   3rd Qu.:1.000
 Max.   : 4.0611   Max.   :1.000
Modeling Approach
The K‑Nearest Neighbors (KNN) algorithm was used for this project because it performs well on structured medical datasets and has been shown to be effective in similar health‑related prediction tasks (Xing & Bei, 2020). Before training the model, all numeric predictors were standardized so that each variable contributed equally to the distance calculations. This step is especially important in clinical datasets where measurements, such as glucose and BMI, exist on very different scales.
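To illustrate why standardization matters for a distance‑based method, the sketch below compares how much of the squared Euclidean distance between two observations comes from glucose before and after z‑scoring. It is written in Python with the standard library only and is not the preprocessing code used in this project; the helper names and the four (glucose, BMI) pairs are made up for the example.

```python
import math
import statistics

def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (n - 1 denominator)
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def glucose_share(a, b):
    """Fraction of the squared distance between a and b contributed by column 0."""
    return (a[0] - b[0]) ** 2 / euclidean(a, b) ** 2

# Hypothetical (glucose, BMI) pairs on their raw scales, for illustration only.
raw = [(90.0, 25.0), (180.0, 24.0), (100.0, 30.0), (140.0, 42.0)]
scaled = zscore_columns(raw)

raw_share = glucose_share(raw[0], raw[3])
scaled_share = glucose_share(scaled[0], scaled[3])
print(round(raw_share, 3), round(scaled_share, 3))
```

On the raw scale, glucose accounts for most of the distance simply because its numeric range is larger; after standardization, both variables contribute in proportion to how unusual each difference is.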
To identify an appropriate value for k, I ran a tuning process using 10‑fold cross‑validation. Each candidate model was evaluated using the Area Under the ROC Curve (AUC), which provides a more reliable measure of discrimination than accuracy alone, especially in medical prediction settings (Nain & Tomar, 2023). I tested values of k from 1 through 25, and the results showed that k = 21 produced the highest average AUC across folds. This aligns with findings that slightly larger neighborhood sizes can help stabilize predictions in noisy or heterogeneous datasets (Zhang et al., 2018).
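The tuning procedure described above can be sketched as follows. This is a simplified Python illustration on synthetic data, not the code or data used in this project: the function names, the two‑feature Gaussian data, and the deterministic fold assignment are all assumptions chosen to keep the example self‑contained.

```python
import math
import random

def knn_score(train, query, k):
    """Fraction of the k nearest training points whose label is 1."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(label for _, label in nearest) / k

def auc(scores, labels):
    """AUC as the chance that a positive case outranks a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(data, k, folds=10):
    """Mean AUC of k-NN across `folds` cross-validation folds."""
    fold_aucs = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th point, starting at i
        train = [item for j, item in enumerate(data) if j % folds != i]
        scores = [knn_score(train, x, k) for x, _ in test]
        fold_aucs.append(auc(scores, [y for _, y in test]))
    return sum(fold_aucs) / folds

# Synthetic two-class data. Labels switch every 10 points so that each fold
# produced by data[i::10] contains 10 cases of each class.
rng = random.Random(0)
data = []
for i in range(200):
    label = (i // 10) % 2
    point = (rng.gauss(1.5 * label, 1.0), rng.gauss(1.5 * label, 1.0))
    data.append((point, label))

# Evaluate k = 1 through 25 and keep the value with the highest mean AUC.
results = {k: cv_auc(data, k) for k in range(1, 26)}
best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

The score passed to the AUC calculation is the share of positive neighbors rather than a hard class label, which is what lets AUC distinguish between confident and borderline predictions.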
After selecting k = 21, the final model was trained on the full training set and evaluated on the test set. Performance was assessed using AUC, accuracy, sensitivity, specificity, and the confusion matrix to understand how well the model identified both positive and negative cases. This approach ensured that the model was tuned systematically and grounded in established practices for predictive modeling in healthcare.
Results
The K‑Nearest Neighbors (KNN) model performed well on the test dataset. Using the selected value of k = 21, the model achieved an overall accuracy of approximately 0.78, meaning that it correctly classified nearly four out of five observations (120 of 153).
To further evaluate discriminatory ability, the model’s predicted probabilities were used to compute the Area Under the Receiver Operating Characteristic Curve (AUC). The resulting AUC of approximately 0.86 indicates good separation between individuals with and without diabetes: a randomly chosen diabetic case receives a higher predicted probability than a randomly chosen non‑diabetic case about 86% of the time.
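A confusion matrix was generated to summarize classification performance. The matrix displays the distribution of true positives, true negatives, false positives, and false negatives, providing insight into how the model performs across outcome categories. Because the positive class in this output is “No”, the reported sensitivity (0.89) is the proportion of non‑diabetic individuals correctly identified, while the specificity (0.58) is the proportion of diabetic individuals correctly identified. A visual representation of the confusion matrix is shown in Figure X, with darker shading indicating higher cell frequencies.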
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No   90  22
       Yes  11  30

               Accuracy : 0.7843
                 95% CI : (0.7106, 0.8466)
    No Information Rate : 0.6601
    P-Value [Acc > NIR] : 0.0005461

                  Kappa : 0.4933

 Mcnemar's Test P-Value : 0.0817228

            Sensitivity : 0.8911
            Specificity : 0.5769
         Pos Pred Value : 0.8036
         Neg Pred Value : 0.7317
             Prevalence : 0.6601
         Detection Rate : 0.5882
   Detection Prevalence : 0.7320
      Balanced Accuracy : 0.7340

       'Positive' Class : No

Area under the curve: 0.8601
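The headline statistics can be reproduced by hand from the four cell counts in the confusion matrix above. The short Python check below is only an arithmetic illustration with variable names chosen for the example; it is not the code that generated the report.

```python
# Counts copied from the confusion matrix reported above.
# Rows are predictions, columns are reference (true) labels.
pred_no_true_no, pred_no_true_yes = 90, 22
pred_yes_true_no, pred_yes_true_yes = 11, 30
total = pred_no_true_no + pred_no_true_yes + pred_yes_true_no + pred_yes_true_yes

accuracy = (pred_no_true_no + pred_yes_true_yes) / total

# The report treats "No" as the positive class, so sensitivity is the share of
# true "No" cases found and specificity is the share of true "Yes" cases found.
sensitivity = pred_no_true_no / (pred_no_true_no + pred_yes_true_no)
specificity = pred_yes_true_yes / (pred_yes_true_yes + pred_no_true_yes)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
pred_no = pred_no_true_no + pred_no_true_yes
pred_yes = pred_yes_true_no + pred_yes_true_yes
true_no = pred_no_true_no + pred_yes_true_no
true_yes = pred_no_true_yes + pred_yes_true_yes
expected = (pred_no * true_no + pred_yes * true_yes) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} balanced={balanced_accuracy:.4f} kappa={kappa:.4f}")
# accuracy=0.7843 sensitivity=0.8911 specificity=0.5769 balanced=0.7340 kappa=0.4933
```

Read with diabetes as the outcome of interest, the model correctly flagged 30 of the 52 diabetic cases in the test set and correctly cleared 90 of the 101 non‑diabetic cases.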
Discussion
The results show that KNN can be an effective model for predicting diabetes when the data is properly preprocessed and the number of neighbors is carefully tuned. The AUC of about 0.86 suggests that the model distinguishes well between diabetic and non‑diabetic individuals. At the same time, the confusion matrix shows that the model identified non‑diabetic cases (90 of 101) more reliably than diabetic cases (30 of 52), which matters for a screening application. These findings are broadly consistent with previous research showing that KNN performs well on classification tasks in medicine and other domains when distance‑based methods are appropriate (Xing & Bei, 2020; Alkhatib et al., 2013). The importance of preprocessing was also clear in this project. Without scaling and imputation, the model’s performance would likely have been weaker, which aligns with recommendations in the literature (Zhang, 2022).
Limitations
Although the model performed well, there are several limitations to consider. The dataset is relatively small and represents a specific population, which limits generalizability. KNN is also computationally expensive for larger datasets because it must compute distances for every prediction. Additionally, KNN does not provide direct insight into feature importance, which can make interpretation more challenging compared to models like logistic regression or decision trees.
Conclusion
The findings from this project show that the K‑Nearest Neighbors algorithm can be a reliable approach for predicting diabetes risk when the data is carefully prepared and the tuning parameter is selected appropriately. With an accuracy of about 0.78 and an AUC of about 0.86, the model demonstrated strong overall performance and was able to separate diabetic and non‑diabetic cases effectively. Although KNN has limitations, such as sensitivity to feature scaling and higher computational cost at prediction time, it remains a practical method whose predictions can be explained in terms of similar cases. These results suggest that KNN can serve as a useful tool for supporting early diabetes risk assessment, especially when paired with thoughtful preprocessing and validation.
References
Alkhatib, K., Najadat, H., Hmeidi, I., & Shatnawi, M. K. A. (2013). Stock price prediction using k‑nearest neighbor (kNN) algorithm. International Journal of Business, Humanities and Technology, 3(3), 32–44.
Nain, S., & Tomar, S. (2023). Medical image prediction for diagnosis of breast cancer using machine learning. AIP Conference Proceedings, 2853(1), 020140.
Xing, Y., & Bei, Y. (2020). Medical health big data classification based on KNN classification algorithm. IEEE Access.
Zhang, S. (2022). Challenges in KNN classification. IEEE Transactions on Knowledge and Data Engineering, 34(10), 4663–4675.
Zhang, X., Li, M., Zong, X., Zhu, X., & Wang, R. (2018). Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems.