
Jubail Industrial College
Education and Training Affairs Deputyship

CS 466 Data mining and Data warehousing Project
Dr. Enas AlRaddadi
Semester: 461
Course Code: CS466
Course Title: Data mining and data warehousing
Given: Week 2
Submission: Week 13

Students are requested to comply strictly with all JIC examination rules and regulations.

Max Marks: Report 12, Presentation 8, SIS 20

Form# F376-1_rev.1

1st Student
Name:
ID:
Section#:

2nd Student
Name:
ID:
Section#:

3rd Student
Name:
ID:
Section#:

Page 1 of 2


Project description
This is a group project that involves extracting knowledge and discovering patterns from data by applying data
mining techniques in Weka. You are required to select a dataset, use Weka tools for data pre-processing,
apply relevant data mining techniques, and evaluate the performance of the models.
Initially, you are required to set up a plan by following the instructions below:
Initial instructions:
 Choose your group members (about five members per group).
 Choose the dataset.
 Download the Weka software.

Note: late submissions are NOT accepted. In case of plagiarism from your peers or online sources, zero marks will
be given.

Project submissions
1- Report (12 marks)

In the report, include a description of the project covering all the stages. Your report should be a maximum of 5 pages.
You can add screenshots in an appendix, referring to each figure within the text. The rationale for your
decisions should be clearly explained.
The content of the report should be as follows:
 Introduction: describe the objective and the dataset (attributes, size, source) of the project. Also, cite the
source of your dataset.
 Data preprocessing: detail the steps taken to clean and prepare the data for analysis (missing values,
normalization, transformation, etc.).
 Data mining technique: describe the data mining techniques and algorithms used (decision trees, k-means, Apriori, etc.).
 Model training and testing: outline the steps conducted in Weka to create your model.
Specify the settings or parameter changes. Describe how the selected models were trained on the
dataset. This includes splitting the data into training and testing sets.
 Results: discuss the results of your model and compare it to another model.
 Conclusion: state the main findings, limitations, improvements or any further research.
Note: if you used any sources, don’t forget to cite them where relevant, or it will be considered plagiarism.

2- Presentation and submissions (8 marks)

Prepare a presentation of your project with a demonstration of your system. Make sure that all your files are
correctly submitted (data files, model, report, slides, etc.) along with a README file that clarifies the file
structure of your project.


Page 2 of 2

Received December 5, 2019, accepted December 16, 2019, date of publication December 30, 2019, date of current version February 4, 2020.
Digital Object Identifier 10.1109/ACCESS.2019.2963053

A Machine Learning Methodology for
Diagnosing Chronic Kidney Disease
JIONGMING QIN¹, LIN CHEN², YUHUA LIU¹, CHUANJUN LIU², CHANGHAO FENG¹, AND BIN CHEN¹

¹Chongqing Key Laboratory of Non-linear Circuit and Intelligent Information Processing, College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China
²Department of Electronics, Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan

Corresponding author: Bin Chen ([email protected])
This work was supported in part by the National Nature Science Foundation of China under Grant 61801400 and Grant 61703348, in
part by the Central Universities under Grant XDJK2018C021, and in part by JSPS KAKENHI under Grant JP18F18392.

ABSTRACT Chronic kidney disease (CKD) is a global health problem with high morbidity and mortality
rate, and it induces other diseases. Since there are no obvious symptoms during the early stages of CKD,
patients often fail to notice the disease. Early detection of CKD enables patients to receive timely treatment
to ameliorate the progression of this disease. Machine learning models can effectively aid clinicians in achieving
this goal owing to their fast and accurate recognition performance. In this study, we propose a machine
learning methodology for diagnosing CKD. The CKD data set, which contains a large number of missing values,
was obtained from the University of California Irvine (UCI) machine learning repository. KNN imputation, which
selects several complete samples with the measurements most similar to those of each incomplete sample, was
used to fill in the missing values. Missing values are common in real-life medical
situations because patients may miss some measurements for various reasons. After filling in
the incomplete data set, six machine learning algorithms (logistic regression, random forest, support vector
machine, k-nearest neighbor, naive Bayes classifier and feed forward neural network) were used to establish
models. Among these machine learning models, random forest achieved the best performance with 99.75%
diagnosis accuracy. By analyzing the misjudgments generated by the established models, we proposed an
integrated model that combines logistic regression and random forest by using a perceptron, which could
achieve an average accuracy of 99.83% over ten simulation runs. Hence, we speculate that this
methodology could be applicable to more complicated clinical data for disease diagnosis.
INDEX TERMS Chronic kidney disease, machine learning, KNN imputation, integrated model.
I. INTRODUCTION

Chronic kidney disease (CKD) is a global public health problem affecting approximately 10% of the world’s population
[1], [2]. The prevalence of CKD is 10.8% in China [3], and
it ranges from 10% to 15% in the
United States [4]. According to another study, the prevalence
has reached 14.7% in the Mexican adult general population
[5]. This disease is characterised by a slow deterioration in
renal function, which eventually causes a complete loss of
renal function. CKD does not show obvious symptoms in its
early stages. Therefore, the disease may not be detected until
the kidney loses about 25% of its function [6]. In addition,
CKD has high morbidity and mortality, with a global impact
on the human body [7]. It can induce the occurrence of
cardiovascular disease [8], [9]. CKD is a progressive and
irreversible pathologic syndrome [10]. Hence, the prediction
and diagnosis of CKD in its early stages is essential;
it may enable patients to receive timely treatment
to ameliorate the progression of the disease.

(The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.)
Machine learning refers to computer programs that calculate
and deduce the information related to a task and obtain
the characteristics of the corresponding pattern
[11]. This technology can achieve accurate and economical
diagnoses of diseases; hence, it might be a promising method
for diagnosing CKD. It has become a new kind of medical
tool with the development of information technology [12]
and has a broad application prospect because of the rapid
development of electronic health records [13]. In the medical
field, machine learning has already been used to detect human

(This work is licensed under a Creative Commons Attribution 4.0 License. VOLUME 8, 2020.)

body status [14], analyze the relevant factors of the disease
[15] and diagnose various diseases. For example, the models
built by machine learning algorithms were used to diagnose
heart disease [16], [17], diabetes and retinopathy [18], [19],
acute kidney injury [20], [21], cancer [22] and other diseases
[23], [24]. In these models, algorithms based on regression,
tree, probability, decision surface and neural network were
often effective. In the field of CKD diagnosis, Hodneland
et al. utilized image registration to detect renal morphologic
changes [25]. Vasquez-Morales et al. established a classifier
based on neural network using large-scale CKD data, and
the accuracy of the model on their test data was 95% [26].
In addition, most of the previous studies utilized the CKD
data set that was obtained from the UCI machine learning
repository. Chen et al. used k-nearest neighbor (KNN), support vector machine (SVM) and soft independent modelling
of class analogy to diagnose CKD; KNN and SVM achieved
the highest accuracy of 99.7% [27]. In addition, they used
fuzzy rule-building expert system, fuzzy optimal associative
memory and partial least squares discriminant analysis to
diagnose CKD, and the range of accuracy in those models was
95.5%-99.6% [1]. Their studies have achieved good results
in the diagnosis of CKD. In the above models, mean
imputation is used to fill in the missing values, and it depends
on the diagnostic categories of the samples. As a result,
their method could not be used when the diagnostic results
of the samples are unknown. In reality, patients might miss
some measurements for various reasons before diagnosing.
In addition, for missing values in categorical variables, data
obtained using mean imputation might have a large deviation
from the actual values. For example, for variables with only
two categories, we set the categories to 0 and 1, but the
mean of the variables might be between 0 and 1. Polat et al.
developed an SVM based on feature selection technology;
the proposed models reduced the computational cost through
feature selection, and the accuracy of those models ranged
from 97.75% to 98.5% [6]. Aljaaf et al. used a novel multiple-imputation method to fill in the missing values, and a multilayer perceptron (MLP)
neural network then achieved an accuracy of 98.1% [28].
Subasi et al. used MLP, SVM, KNN, C4.5 decision tree and
random forest (RF) to diagnose CKD, and the RF achieved an
accuracy of 100% [2]. In the models established by Boukenze
et al., MLP achieved the highest accuracy of 99.75% [29].
The studies of [2], [29] focus mainly on the establishment
of models and achieve an ideal result. However, a complete
process of filling in the missing values is not described in
detail, and no feature selection technology is used to select
predictors as well. Almansour et al. used SVM and neural
network to diagnose CKD, and the accuracy of the models
was 97.75% and 99.75%, respectively [30]. In the models
established by Gunarathne et al., zero was used to fill in the
missing values, and decision forest achieved the best performance with an accuracy of 99.1% [31].
To summarize the previous CKD diagnostic models,
we find that most of them suffer from either an imputation
method with a limited application

range or relatively low accuracy. Therefore, in this work,
we propose a methodology to extend the application range of the
CKD diagnostic models. At the same time, the accuracy of the
model is further improved. The contributions of the proposed
work are as follows.
1) We used KNN imputation to fill in the missing values in
the data set; this approach can be applied to data sets whose
diagnostic categories are unknown.
2) Logistic regression (LOG), RF, SVM, KNN, naive
Bayes classifier (NB) and feed forward neural network (FNN)
were used to establish CKD diagnostic models on the complete CKD data sets. The models with better performance
were extracted for misjudgment analysis.
3) An integrated model that combines LOG and RF by
using a perceptron was established, and it improved the performance of the component models in CKD diagnosis after the
missing values were filled by KNN imputation.
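The integrated model of contribution 3 is, in essence, a perceptron that takes the P(ckd) outputs of LOG and RF as its two inputs. The following is an illustrative sketch only (the paper's implementation is in R, and its exact training setup is not given in this excerpt; the learning rule, function names and toy data are our assumptions):

```python
# Illustrative sketch only: a perceptron that combines the probability
# outputs of LOG and RF. Training rule and toy data are assumptions.
def perceptron_train(samples, labels, lr=0.1, epochs=50):
    """Train a single perceptron on 2-D inputs (LOG prob, RF prob)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred                      # -1, 0 or +1
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def perceptron_predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Toy component-model outputs: P(ckd) from LOG and RF for six samples.
probs = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.7), (0.2, 0.1), (0.3, 0.2), (0.1, 0.3)]
truth = [1, 1, 1, 0, 0, 0]
w, b = perceptron_train(probs, truth)
```

The perceptron learns a weighting of the two component models, so a sample is labelled ckd only when the combined, weighted evidence crosses the learned threshold.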
KNN imputation is used to fill in the missing values. To our
knowledge, this is the first time that KNN imputation has
been used for the diagnosis of CKD. In addition, building an
integrated model is also a good way to improve the performance of separate individual models. The proposed methodology might effectively handle the scenario in which patients
are missing certain measurements before being diagnosed.
In addition, the resulting integrated model shows a higher
accuracy. Therefore, it is speculated that this methodology
might be applicable to the clinical data in the actual medical
diagnosis.
The rest of the paper is organized as follows. In Section II,
we describe the preliminaries. The establishment of the
individual models and the integrated model is described in
Section III. In Section IV, we evaluate and discuss the performance of the integrated model. In Section V, we summarize
the work and its contributions, including future works.
II. PRELIMINARIES

In this section, we describe the preliminaries before establishing the models, including the description of the data set
and the operating environment, the imputation of the missing
values and the extraction of the feature vector.
A. DATA DESCRIPTION AND OPERATING ENVIRONMENT

The CKD data set used in this study was obtained from the
UCI machine learning repository [32], which was collected
from a hospital and donated by Soundarapandian et al. on
3 July 2015. The data set contains 400 samples. In this CKD
data set, each sample has 24 predictive variables or features
(11 numerical variables and 13 categorical (nominal) variables) and a categorical response variable (class). Each class
has two values, namely, ckd (sample with CKD) and notckd
(sample without CKD). In the 400 samples, 250 samples
belong to the category of ckd, whereas 150 samples belong to
the category of notckd. It is worth mentioning that there is a
large number of missing values in the data. The details of each
variable are listed in Table 1. All of the algorithms were conducted in R (version 3.5.2), and the packages used included

TABLE 1. Details of each variable in the original CKD data set.

Hmisc(4.2-0), DMwR(0.4.1), leaps(3.0), randomForest
(4.6-14), caret(6.0-81), e1071(1.7-0.1), class(7.3-14) and
neuralnet(1.44.2).
B. DATA PROCESSING

Each categorical (nominal) variable was coded to facilitate
the processing in a computer. For the values of rbc and pc,
normal and abnormal were coded as 1 and 0, respectively. For
the values of pcc and ba, present and notpresent were coded
as 1 and 0, respectively. For the values of htn, dm, cad, pe and
ane, yes and no were coded as 1 and 0, respectively. For the
value of appet, good and poor were coded as 1 and 0, respectively. Although the original data description defines three
variables, sg, al and su, as categorical types, their values are
numeric, so these variables
were treated as numeric variables. All the categorical variables were transformed into factors. Each sample was given
an independent number that ranged from 1 to 400. There is
a large number of missing values in the data set, and the
number of complete instances is 158. In general, the patients
might miss some measurements for various reasons before
making a diagnosis. Thus, missing values will appear in the
data when the diagnostic categories of samples are unknown,
and a corresponding imputation method is needed.
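The coding scheme described above is a simple value-to-integer mapping. A minimal sketch in Python (the authors implemented the pipeline in R; the `CODES` table restates the text and the function name `encode` is ours):

```python
# Sketch of the 0/1 coding scheme above (illustrative Python; the CODES
# table restates the mapping given in the text).
CODES = {
    "normal": 1, "abnormal": 0,      # rbc, pc
    "present": 1, "notpresent": 0,   # pcc, ba
    "yes": 1, "no": 0,               # htn, dm, cad, pe, ane
    "good": 1, "poor": 0,            # appet
}

def encode(value):
    """Return the 0/1 code for a categorical value; None marks a missing entry."""
    return None if value is None else CODES[value]
```

Missing entries are deliberately left as `None` so that the imputation step, not the encoder, decides how to fill them.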
After encoding the categorical variables, the missing values in the original CKD data set were filled in
first. KNN imputation was used in this study; it selects the
K complete samples with the shortest Euclidean distance for
each sample with missing values. For the numerical variables,
the missing values are filled using the median of the corresponding variable in K complete samples, and for the category variables, the missing values are filled using the category
that has the highest frequency in the corresponding variable
in K complete samples. For physiological measurements,

people with similar physical conditions should have similar
physiological measurements, which is the reason for using
the method based on a KNN to fill in the missing values. For
example, the physiological measurements should be stable
within a certain range for healthy individuals. For diseased
individuals, the physiological measurements of the person
with a similar degree of the same disease should be similar.
In particular, the differences in physiological measurements
data should not be large for people with similar situations.
This method should also be adapted to the diagnostic data
of other diseases, as it has been applied in the area of
hyperuricemia [33].
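The imputation procedure described above (Euclidean nearest neighbours among the complete samples; median for numeric gaps, most frequent category for categorical gaps) can be sketched as follows. This is an illustrative Python version, not the authors' R code (the listed DMwR package presumably supplied the actual implementation); the dictionary-per-sample representation and function name are our assumptions:

```python
# Illustrative Python sketch of the KNN imputation described in the text.
from statistics import median, mode

def knn_impute(sample, complete_samples, numeric_keys, k=9):
    """Fill missing (None) entries of `sample` from the k complete samples
    at the smallest Euclidean distance over the available numeric variables:
    median for numeric gaps, most frequent category for categorical gaps."""
    def dist(other):
        return sum((sample[key] - other[key]) ** 2
                   for key in numeric_keys if sample[key] is not None) ** 0.5

    neighbours = sorted(complete_samples, key=dist)[:k]
    filled = dict(sample)
    for key, value in sample.items():
        if value is None:
            values = [n[key] for n in neighbours]
            filled[key] = median(values) if key in numeric_keys else mode(values)
    return filled
```

With k odd, the middle value of the sorted neighbour measurements is itself the median, which matches the paper's reason for preferring odd K.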
When the median of the corresponding variables in the K complete samples is selected, K is preferably an odd
number because the middle value is then naturally
the median when the values of the numeric variables in the K
complete samples are sorted. The selection of K should be neither too large nor too small. An excessively large K value may ignore an inconspicuous mode,
which might be important. Conversely, an excessively small
K value lets noise and abnormal data affect the filling
of the missing values excessively. Therefore, the values of K
in this work were chosen as 3, 5, 7, 9 and 11. As a result, five
complete CKD data sets were generated. In addition, we also
proved the effectiveness of KNN imputation by comparing it
with two other methods in section III. One is to use random
values to fill in the missing values, the other is to use mean
and mode of the corresponding variables to fill in missing
values of continuous and categorical variables, respectively.
C. EXTRACTING FEATURE VECTORS OR PREDICTORS

Extracting feature vectors or predictors could remove variables that are neither useful for prediction nor related to
response variables and thus prevent these unrelated variables

FIGURE 1. The results of important variables extraction by using optimal subset regression at K = 9.

FIGURE 2. The results of important variables extraction by RF at K = 9.

from interfering with the model construction, which enables
the models to make accurate predictions [34]. Herein,
we used optimal subset regression and RF to extract the
variables that are most meaningful to the prediction. Optimal subset regression detects the model performance of all
possible combinations of predictors and selects the best combination of variables. RF detects the contribution of each
variable to the reduction in the Gini index. The larger the Gini
index, the higher the uncertainty in classifying the samples.
Therefore, the variables with contribution of 0 are treated as
redundant variables. The step of feature extraction was run on
each complete data set. Images obtained on one complete data
set are shown in Figs. 1 and 2; this data set was obtained
by KNN imputation with K = 9.
Fig. 1 represents the optimal combination of variables in
the case of selecting one to all variables when the optimal

subset regression was used. The vertical axis represents variables. The horizontal axis is the adjusted r-squared which
represents the degree to which the combination of variables
explains the response variable. To make it easy to distinguish each combination of variables, we used four colors
(red, green, blue and black) to mark the selected variables.
The combinations are ranked from left to right by the degree
to which they explain the response variable, and the right-most
combination explains the response variable most strongly. Owing to limited space, the values on
the horizontal axis in Fig. 1 are rounded to two decimal places.
The right-most combinations of variables in the images which
were obtained by the optimal subset regression on each complete data set are shown in Table 2. For the complete data sets
obtained by the KNN imputation, we selected the intersection
of the optimal combinations on all complete data sets as
the extracted combination of variables to obtain a uniform
combination. In Table 2, for the complete data sets obtained
by the KNN imputation, we used the intersection (bp, sg, al,
bu, hemo, htn, dm, appet) to establish the models. For the
complete data set obtained by the mean and mode imputation,
the combination of the last row in Table 2 was used. For the
complete data sets obtained by random imputation, we used
the corresponding optimal combination obtained from each
complete data set.
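The adjusted r-squared on the horizontal axis of Fig. 1 penalises the ordinary r-squared for the number of predictors; the conventional definition (assumed here, since the excerpt does not restate it) is a one-liner:

```python
# Conventional adjusted R^2; this standard formula is assumed here, as the
# excerpt does not restate it.
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n samples and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

For example, with the data set's 400 samples and an eight-predictor combination such as the intersection used for the KNN-imputed data, an R² of 0.90 would adjust down only slightly, to about 0.898.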
The result of feature extraction of RF is represented
in Fig. 2, the vertical axis represents the variables, and the
horizontal axis represents the reduced Gini index. The larger
the reduced Gini index, the stronger the predictability of the
variable to the response variable. When the RF was used to
remove the variables with the contribution of zero, no matter
which method was used to fill in the missing values, the variables with contribution of zero were the same, including pcc,
ba and cad. Therefore, when the RF was used to extract the
variables, all variables were selected except pcc, ba and cad.
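The Gini-based importance used above rests on the impurity decrease each split achieves. A minimal illustration in Python (the authors relied on R's randomForest to compute these quantities; the helper names are ours):

```python
# Illustrative helpers for Gini impurity and the decrease produced by a split.
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Weighted decrease in Gini impurity produced by one split."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
```

A split that separates two balanced classes perfectly turns an impurity of 0.5 into 0, a decrease of 0.5; a split whose children are as mixed as the parent yields a decrease of 0, which is how variables such as pcc, ba and cad were flagged as redundant.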

TABLE 2. The top combinations of variables on the complete data sets.

TABLE 3. The template of confusion matrix in this study.

D. PERFORMANCE INDICATORS

In this study, ckd was set to be positive and notckd was set
to be negative. The confusion matrix was used to show the
specific results and evaluate the performance of the machine
learning models. The template of the confusion matrix is
shown in Table 3.
True positive (TP) indicates the ckd samples were correctly
diagnosed. False negative (FN) indicates the ckd samples
were incorrectly diagnosed. False positive (FP) indicates
the notckd samples were incorrectly diagnosed. True negative (TN) indicates the notckd samples were correctly diagnosed. Accuracy, sensitivity, specificity, precision, recall and
F1 score were used to evaluate the performance of the model.
They are calculated using the following equations:
accuracy = (TP + TN) / (TP + FN + FP + TN)                         (1)
recall = sensitivity = TP / (TP + FN)                              (2)
specificity = TN / (FP + TN)                                       (3)
precision = TP / (TP + FP)                                         (4)
F1 score = 2 × (precision × recall) / (precision + recall)         (5)
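These metrics follow directly from the confusion-matrix counts of Table 3. An illustrative Python helper (the paper's analysis was carried out in R, so this translation and the sample counts below are ours):

```python
# Illustrative Python helper for equations (1)-(5).
def metrics(tp, fn, fp, tn):
    """Accuracy, recall/sensitivity, specificity, precision and F1 score
    from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)          # equals sensitivity
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, specificity, precision, f1

# Hypothetical counts: 248 of 250 ckd and all 150 notckd samples correct.
acc, rec, spec, prec, f1 = metrics(tp=248, fn=2, fp=0, tn=150)
```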
III. PROPOSED MODEL

In this section, the classifiers were first established by
different machine learning algorithms to diagnose the data
samples. Among these models, those with better performance were selected as potential components. By analyzing
their misjudgments, the component models were determined.
An integrated model was then established to achieve higher
performance.
A. ESTABLISHING AND EVALUATING INDIVIDUAL MODELS

The following machine learning models have been obtained
by using the corresponding subset of features or predictors on
the complete CKD data sets for diagnosing CKD.
1) Regression-based model: LOG

2) Tree-based model: RF
3) Decision plane-based model: SVM
4) Distance-based model: KNN
5) Probability-based model: NB
6) Neural network: FNN
Generally, in disease diagnosis, diagnostic samples are
distributed in a multidimensional space. This space comprises
predictors that are used for data classification (ckd or notckd).
Samples of data in the space are clustered in different regions
due to their different categories. Therefore, there is a boundary between the two categories, and the distances between
samples in the same category are smaller. According to the
effectiveness of classification, we choose the aforementioned
methods for disease diagnosis. LOG is based on linear regression, and it obtains the weight of each predictor and a bias.
If the sum of the effects of all predictors exceeds a threshold,
the category of the sample will be classified as ckd or notckd.
RF generates a large number of decision trees by randomly
sampling training samples and predictors. Each decision tree
is trained to find a boundary that maximises the difference
between ckd and notckd. The final decision is determined
by the predictions of all trees in the disease diagnosis. SVM
divides different kinds of samples by establishing a decision surface in a multidimensional space that comprises the
predictors of the samples. KNN finds the nearest training
samples by calculating the distances between the test sample
and the training samples and then determines the diagnostic
category by voting. The naive Bayes classifier estimates the
conditional probabilities of a sample from the numbers
of ckd and notckd samples in each
measurement interval. FNN can analyse non-linear relationships in the data sets due to its complex structure, and the
sigmoid activation function was used in the hidden layer and
the output layer.
To evaluate model performance comprehensively, while
retaining the sample distribution of the original data,
each complete data set was divided evenly into four subsets.
For all of the above models, each subset was utilized once
for testing, and the other subsets were utilized for training;
the overall result was taken as the final performance. With
the exception of RF, the rest of the models were established
using the selected variables by feature extraction. RF does
not require prior feature extraction, because predictors are
selected randomly when each decision tree is established.
In addition, when using KNN and FNN, all the categorical

TABLE 4. The accuracy of two types of RF after the KNN imputation was
run.

variables were converted into numeric types: categories 0 and
1 were converted to values 0 and 1, respectively, and the
complete data sets were then normalised to a mean of 0 and a standard deviation of 1. Details
of all the models are as follows:
1) The output of LOG was the probability that the sample
belongs to notckd, and the threshold was set to 0.5.
2) RF was established using all variables. Two strategies
were used to determine the number of decision trees generated. One is to use the default 500 trees and the other is
to use the number of trees corresponding to the minimum
error in the training stage. The RF was established using both
strategies and evaluated on the data sets obtained by KNN
imputation. The same random number seed (1234) was used to
divide the data and establish the models, and the accuracy is shown
in Table 4. It can be seen that the default number of trees is
a better choice; therefore, we selected the default 500 trees to
establish RF.
3) The models of SVM were generated by using the RBF
kernel function, which is described as follows:

K(x1, x2) = exp(−γ ‖x1 − x2‖²)                                     (6)

where γ was set to [0.1, 0.5, 1, 2, 3, 4]. Parameter C
represents the weight of misjudgment loss, and it was set
to [0.5, 1, 2, 3]. In each calculation of the model training,
the algorithm selects the best combination of parameters to
establish the model by grid search.
4) For the NB, the value of Laplace was equal to 1.
5) For the KNN, neighbours are chosen by the smallest
Euclidean distance to the detected sample. When the number of
neighbours selected from the training data set is even, and the
numbers of selected samples belonging to ckd and notckd are
equal, the algorithm randomly selects a category as the output
for the detected sample. To avoid this in this work, the nearest-neighbor parameter was
set to [1, 3, 5, . . . , 19]. In each calculation of model training,
the algorithm selected the best parameter to establish the
model by grid search.
6) For the FNN, the network had one hidden layer. Presently,
there is no clear theory in determining the best number of
hidden layer nodes in a neural network. A method proposed in
the previous study that was used to evaluate the performance
of neural networks by increasing the number of hidden layer
nodes one by one [35] was used in this study. The number of
hidden layer nodes was increased one by one from 1 to 30.
Then, the best result was selected.
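The RBF kernel of equation (6) and the grid search over γ and C described in item 3 can be sketched as follows. This is an illustrative Python version (the authors worked in R with the listed e1071 package; the `score` callback below is a hypothetical stand-in for the validation accuracy of a model trained with a given parameter pair):

```python
# Illustrative sketch of the RBF kernel and the parameter grid search.
import math
from itertools import product

def rbf_kernel(x1, x2, gamma):
    """Equation (6): K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

# Parameter grids stated in the text.
GAMMAS = [0.1, 0.5, 1, 2, 3, 4]
COSTS = [0.5, 1, 2, 3]

def grid_search(score):
    """Return the (gamma, C) pair that maximises a caller-supplied score
    (in the paper, this would be the model's validation accuracy)."""
    return max(product(GAMMAS, COSTS), key=lambda gc: score(*gc))
```

The kernel value is 1 for identical samples and decays towards 0 as the squared distance grows, with γ controlling how quickly influence falls off.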
To ensure the repeatability and comparability of the results,
in the division of data, the establishment of RF and FNN,
and the selection of the best parameters of SVM and KNN,
the same seed of 1234 was used. For the random imputation,

TABLE 5. The accuracy (%) of the basic models after the optimal subset
regression.

TABLE 6. The accuracy (%) of the basic models after the features
extraction of RF was run.

the step of feature extraction was run on the complete data set
obtained. Then, the models were established and evaluated
by using the extracted features. Because of the randomness
of the random imputation, the whole process was repeated
five times to get the average result. For the KNN imputation
and the mean and mode imputation, due to the certainty
of data, the evaluation of models was executed once. After
the feature extraction methods of optimal subset regression
and RF were run, the accuracy of the basic models on
the complete data sets are shown in Table 5 and Table 6,
respectively.
It can be seen from Tables 5 and 6 that the optimal subset
regression is more suitable for LOG and SVM when the
KNN imputation is used, and the feature extraction method
of RF is more suitable for FNN and KNN. When the KNN
imputation is used, the accuracy of LOG and SVM is significantly improved (Table 5). In Table 6, the accuracy of
LOG and SVM is relatively low, which might be due to the
fact that there are too many redundant variables compared
to the optimal subset regression. The accuracy of FNN is
slightly improved and RF shows better performance when
the KNN imputation is used both in Tables 5 and 6. For the
NB and the KNN, the performance of the models when using
KNN imputation is not very ideal compared to using random
imputation or mean and mode imputation in Tables 5 and 6.
The above result also proves the validity of the KNN imputation, since KNN imputation does improve the accuracy of
some models, such as LOG, RF and SVM (Table 5). From
Tables 5 and 6, LOG and SVM with the use of optimal
subset regression, KNN and FNN with the use of the feature
extraction of random forest and RF have better performance.
Therefore, they are selected as the potential component
models.
VOLUME 8, 2020

J. Qin et al.: Machine Learning Methodology for Diagnosing CKD

TABLE 7. The numbers of misjudgments of the extracted models.

TABLE 8. The time spent by RF, LOG, SVM and KNN on the complete data.

From Tables 5 and 6, it can be seen that there is no significant
difference among LOG, SVM and KNN. When
the performance of the models is similar, the models are
evaluated by the complexity of the algorithm, the running
time, and the computational resources consumed. LOG, RF,
SVM and KNN were run five times on each complete data set,
and the average times taken are summarized in Table 8. It can
be seen that SVM and KNN take more time than
LOG and RF. In addition, SVM and KNN are also affected
by their respective model parameters, so the parameters need
to be adjusted before the models are established, which means
more manual intervention is needed. For LOG, there was
no additional parameter that needed to be adjusted. For
RF, the default parameters of the model were used. Hence,
a combination of LOG and RF was selected to generate
the final integrated model.
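As an illustration of this timing comparison, each model can be fitted five times and the wall-clock times averaged; the scikit-learn models and synthetic data below are stand-ins, since the paper reports timings from its own pipeline:

```python
# Fit each candidate model five times and record the average training time,
# mirroring the procedure behind Table 8. Data and models are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=24, random_state=1234)
models = {"LOG": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(random_state=1234)}

avg_times = {}
for name, model in models.items():
    runs = []
    for _ in range(5):  # five runs, as in the text
        start = time.perf_counter()
        model.fit(X, y)
        runs.append(time.perf_counter() - start)
    avg_times[name] = sum(runs) / len(runs)
print(avg_times)
```

Absolute numbers depend entirely on hardware and data size; only the relative ordering of the models is meaningful, as in the paper's comparison.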
B. MISJUDGMENT ANALYSIS AND SELECTING
COMPONENT MODELS

After evaluating the above models, the potential component
models were extracted for misjudgment analysis to determine
which would be used as the components. The misjudgment
analysis here refers to finding and comparing the samples
misjudged by different models, and then determining which
models are suitable for establishing the final integrated model.
The misjudgment analysis was performed on the extracted
models. The prerequisite for generating an integrated model
is that the misjudged samples from each component model
are different: if the component models misjudge the same
samples, the generated integrated model would not make a
correct judgment for those samples either. When the data were
read, each sample was given a unique number ranging from
1 to 400. The numbers of misjudgments for the extracted
models on each complete data set are shown in Table 7, and the
black part indicates the samples that were also misjudged by
models other than FNN.
In Table 7, it can be seen that most of the misjudgments
of the FNN are simultaneously misjudged by other
models. In addition, the performance of FNN is affected
by the number of nodes in the hidden layer, so it is not easy
to establish a unified model for different data. Therefore,
the FNN was excluded first. For the best model (RF), when
K equals 7, only one misjudgment is shared with
LOG. In all other cases, every sample
misjudged by RF is correctly judged by the rest of
the models. Hence, combinations of RF with the rest
of the models could be used to establish an integrated model.
Next, we investigate which specific model combination
generates the best integrated model for diagnosing CKD.
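As a sketch, the prerequisite above can be checked by intersecting the sets of misjudged sample IDs of two candidate models; the IDs below are illustrative, not taken from Table 7:

```python
# Each sample carries a unique ID (1..400). Compare which IDs each
# candidate model misjudges; a small overlap means the models fail on
# different samples, so combining them can help.
misjudged = {
    "RF":  {42},            # illustrative IDs, not the paper's results
    "LOG": {42, 118, 371},
}

overlap = misjudged["RF"] & misjudged["LOG"]
print(sorted(overlap))
```

If the overlap were the whole RF set, an RF+LOG combination could not repair those errors, which is exactly the criterion used to exclude FNN above.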

C. ESTABLISHING THE INTEGRATED MODEL

LOG and RF were selected as the underlying components to
generate the integrated model and improve the judgment
performance. The probabilities that each sample was judged as
notckd by LOG and RF were used as the outputs of the underlying
components. These two probabilities were obtained for each sample
and can be expressed as a point in a two-dimensional plane.
Across the complete CKD data sets, the probability distributions
of the samples in this plane are similar; therefore, the probability
distribution of the samples when K equals 11 is shown in Fig. 3.
It can be seen from Fig. 3 that the samples form different
aggregation regions in the two-dimensional plane according to
their categories (ckd or notckd). In general, ckd samples
are concentrated in the lower left part, while the notckd
samples are distributed in the top right part. Because the
two models sometimes give different results, some
samples are located at the top left or lower right; for these,
one of the two models makes a misjudgment. A perceptron can
be used to separate the samples of the two categories by plotting a
decision line in the two-dimensional plane of the probability
distribution. Ciaburro and Venkateswaran define the perceptron
as the basic building block of a neural network, which can
be understood as anything that takes multiple inputs and
produces one output [36]. The perceptron used in this study is
shown in Fig. 4.
In Fig. 4, prob1 and prob2 are the probabilities that a sample was judged as notckd by LOG and RF, respectively. w0, w1 and w2 are the weights of the input signals: w0 corresponds to the constant input 1, w1 corresponds to prob1, and w2 corresponds to prob2. y is calculated according to (7):

y = w0 + w1 × prob1 + w2 × prob2.    (7)

For a sample (X, Y) that is correctly judged by the model, the following equation is valid:

Y × g(X) = Y(WX + b) > 0.    (9)

For all misjudgments, the value of (9) is less than zero, and
the larger the absolute value, the more seriously the model misjudges the sample. Hence, for a misjudged sample (Xi, Yi),
the loss of the perceptron can be expressed as (10):

L = −Yi(WXi + b).    (10)

The perceptron is trained by the gradient descent method to
adjust the weights and bias. The partial derivatives of the loss
function with respect to the weights and the bias are expressed as follows:
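From the loss in (10), these derivatives take the standard perceptron form:

```latex
\frac{\partial L}{\partial W} = -Y_i X_i, \qquad
\frac{\partial L}{\partial b} = -Y_i .
```

Gradient descent therefore updates the parameters as W ← W + ηYiXi and b ← b + ηYi for each misjudged sample, where η is the learning rate.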

FIGURE 3. The probability distribution of the samples in the complete
CKD data set (at K = 11), the horizontal axis and the vertical axis
represent the probabilities that the samples were judged as notckd by
the LOG and the RF, respectively.

FIGURE 4. The structure of the perceptron used in this study.

The input signal corresponding to the weight w0 is 1, which
is a bias. The signum function is used to compute the output
from the value of y as follows: if y > 0, the
output = 1; if y ≤ 0, the output = −1.
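The pieces above — the linear combination in (7), the loss in (10), and the signum output — can be sketched as a small standalone perceptron; the probabilities and labels below are illustrative, not from the CKD data:

```python
# Minimal perceptron over the two component-model probabilities, trained by
# gradient descent on the loss -t*(w.x + b) for misjudged samples.
# Labels use +1 (notckd) and -1 (ckd) so the signum rule applies directly.
import numpy as np

def signum(y):
    return 1 if y > 0 else -1

# Columns: prob1 (from LOG), prob2 (from RF); illustrative values.
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]])
Y = np.array([1, 1, -1, -1])

w = np.zeros(3)          # w0 (bias), w1, w2
eta = 0.1                # learning rate
for _ in range(100):     # gradient-descent passes over the data
    for x, t in zip(X, Y):
        y = w[0] + w[1] * x[0] + w[2] * x[1]   # Eq. (7)
        if t * y <= 0:   # misjudged sample: update along -gradient
            w += eta * t * np.array([1.0, x[0], x[1]])

outputs = [signum(w[0] + w[1] * x[0] + w[2] * x[1]) for x in X]
print(outputs)  # -> [1, 1, -1, -1] on this separable toy data
```

Since the toy data are linearly separable, the perceptron update converges to a decision line that splits the two aggregation regions, mirroring the role of the line in Fig. 3.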