Debt Predictions Using Data Mining Techniques

Data Mining Project

Introduction

This report discusses the challenges banks face in managing debt and recovering loans. The banks are currently struggling with borrowers who fail to fulfill their loan repayment commitments despite their promises. Determined to prevent this issue from recurring, the banks have shared a dataset of 2000 customers who have previously taken out loans. Various data mining techniques will be applied to this dataset to differentiate between types of loan customers and predict the likelihood of loan repayment. This analysis will allow us to forecast the probability of future loan repayment for individual customers based on their characteristics and the dataset outcomes.

The dataset has 2000 instances (customers) with 15 attributes. Each customer is identified by a distinct customer ID.

Terminology Used

Standard deviation (StdDev) is a measure that quantifies the spread, or difference from the mean, of a set of numbers.

A high standard deviation in the data points suggests that their values have a broader range and are not close to the mean.
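As a quick illustration with made-up numbers (not the bank's data), both measures can be computed as follows:

```python
# A minimal sketch of the mean and standard deviation defined above,
# using illustrative values rather than the bank's actual data.
import numpy as np

values = np.array([17, 25, 40, 53, 67, 89])

mean = values.mean()
std_dev = values.std(ddof=1)  # sample standard deviation, as most tools report

print(f"Mean: {mean:.3f}")     # 48.500
print(f"StdDev: {std_dev:.3f}")
```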

Outliers are observations that deviate markedly from the rest of the data; they may be noise, yet they can still hold significance.

Noise refers to random errors, corruptions, or irrelevant information in the original data; data cleaning aims to remove it.

Attribute refers to a property, characteristic, or variable of an object.

Each row in the dataset holds one instance's values for these variables.

Mean, also known as average, is the most commonly used measure of location for a set of points or numbers.

Instances are the individual records (here, customers) in the dataset. Each instance is linked to a set of values for the 15 attributes.

Attributes

Customer ID, Forename, Surname, Age, Gender, Years at address, Employment status, Country, Current debt, Postcode, Income, Own home, CCJs, Loan amount, Outcome.

Data Summary


| Attribute name | Datatype | Min | Max | Mean | Standard deviation | Values |
|---|---|---|---|---|---|---|
| Customer ID | Numeric | 555574 | 1110985 | 837077.517 | 160763.319 | Discrete |
| Forename | Nominal | - | - | - | - | Discrete |
| Surname | Nominal | - | - | - | - | Discrete |
| Age | Numeric | 17 | 89 | 52.912 | 20.991 | Continuous |
| Gender | Nominal | - | - | - | - | Discrete |
| Years at address | Numeric | 1 | 560 | 18.526 | 23.202 | Continuous |
| Employment status | Nominal | - | - | - | - | Discrete |
| Country | Nominal | - | - | - | - | Discrete |
| Current debt | Numeric | 0 | 9980 | 3309.325 | 2980.629 | Continuous |
| Postcode | Nominal | - | - | - | - | Discrete |
| Income | Numeric | 3000 | 220000 | 38319 | 12786.506 | Continuous |
| Own home | Nominal | - | - | - | - | Discrete |
| CCJs | Numeric | 0 | 100 | 1.052 | 2.469 | Continuous |
| Loan amount | Numeric | 13 | 54455 | 18929.628 | 12853.189 | Continuous |
| Outcome | Nominal | - | - | - | - | Discrete |
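A summary of this kind could be reproduced roughly as follows; the filename 'customers.csv' is an assumption, and the original summary was presumably produced in Weka:

```python
# A sketch of reproducing the data summary table with pandas,
# assuming the bank's data is available as 'customers.csv'
# (filename and column types are assumptions, not given in the brief).
import pandas as pd

df = pd.read_csv("customers.csv")

# Numeric attributes: min, max, mean and standard deviation
print(df.describe().loc[["min", "max", "mean", "std"]])

# Nominal attributes: number of distinct values per column
print(df.select_dtypes(include="object").nunique())
```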


Customer ID – is utilized to identify customers uniquely. Upon analyzing the data, it becomes apparent that there are duplicate values: the dataset shows a uniqueness rate of 99%, with 14 values duplicated or erroneously entered. By subtracting the unique count from the distinct count, we can ascertain that 7 customer IDs are replicated in the system. A closer examination reveals that these records share identical IDs but carry different names, suggesting a data-entry error.

It is important for customer IDs to remain unique. Failure to do so impacts other attributes that are crucial in resolving loan repayment issues, as it prevents accurate referencing of customer data.
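A pandas sketch of locating those duplicated IDs, assuming the same 'customers.csv' file and the column names from the attribute list:

```python
# Locate Customer IDs that appear more than once; both rows of
# each duplicated pair are kept so the conflicting names can be reviewed.
import pandas as pd

df = pd.read_csv("customers.csv")

dupes = df[df["Customer ID"].duplicated(keep=False)]
print(dupes.sort_values("Customer ID")[["Customer ID", "Forename", "Surname"]])
```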

Forename, Surname – these attributes provide no statistical information and are not relevant for predicting the likelihood of future loan repayment.

Filtering the dataset reveals inconsistently formatted values that cause confusion and clutter, such as 'E Spencer' and 'Dea,th'. There are also mixed-up gender values, with mismatches between forename and gender, such as 'Fewson' being assigned female (F).

Age - The age attribute in this dataset is accurate and error-free. It spans from 17 to 89, with an average age of 53. With a standard deviation (StdDev) of 20.991, the values are widely dispersed around the mean.

The 17 – 21 age group can be treated as 'young' when deriving a probability distribution. Alternatively, the attribute can be banded into intervals of 10 years to gather reliable data.

The Age attribute can provide statistical information about the accuracy of age group usage without any external factors affecting their circumstances. Combining external factors like the 'Debt Amount' attribute would result in highly valuable probability statistics regarding the repayment of loans by these customers in the future.

Gender – The dataset includes 9 distinct values for gender. "M" and "F" are the primary representations, while values such as "H", "D", and "N" are considered noise or unknown. Nevertheless, by analyzing the forenames it is possible to determine the correct gender and fix these erroneous values.

The full labels 'Male' and 'Female' can likewise be normalized to 'M' and 'F'. Examining the datasheet also reveals entries whose assigned gender does not match the given forename; these are assumed to be input mistakes and were corrected to the appropriate gender. Specifically, four records are marked 'Female' despite belonging to males - Stuart, Dan, David, and Simon - so these were changed to 'M'. The discrepancy could be attributed to a misconfiguration issue. A sketch of this clean-up follows.
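A minimal pandas sketch of the clean-up just described, assuming the 'customers.csv' file used earlier; the forename-based fix is hard-coded purely for illustration:

```python
# Normalise gender labels and correct the four mismatched records.
import pandas as pd

df = pd.read_csv("customers.csv")

# Normalise the full labels to single letters
df["Gender"] = df["Gender"].replace({"Male": "M", "Female": "F"})

# Correct the records whose forenames indicate male customers
# (names taken from the text; matching on forename alone is a simplification)
male_forenames = ["Stuart", "Dan", "David", "Simon"]
df.loc[df["Forename"].isin(male_forenames) & (df["Gender"] == "F"), "Gender"] = "M"

# Remaining codes such as 'H', 'D' and 'N' are treated as noise
print(df["Gender"].value_counts())
```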

After analyzing the data, it is evident that the percentages for females and males are quite similar: females make up 48.45% and males 51.55%, a difference of 3.1%. In principle this could feed into predictions of future repayment rates. However, making decisions based on gender may constitute discrimination, so the gender attribute may not be suitable for decision-making purposes. The gender data will still be retained, even though not all values can be determined with complete certainty. The usefulness of the attribute is considered very low, as it is roughly equally represented in both directions.

Gender attributes are once again deemed to be of little use. As mentioned earlier, data can be inputted inaccurately, such as when a given name that is presumed to belong to a female is actually associated with a male.

Years at address - There are 74 unique values. Four anomalies are present in this attribute: Simon Wallace, Brian Humpreys, Steve Hughes, and Chris Greenbank all have an implausibly high number of years at their address. It is unclear whether this data was entered incorrectly or whether the figure represents months or days.

On the visualization screen these records appear as noise: dots for age against years at address that fall outside the normal range of years.

These four sets of data do not show a valid correlation between age and number of years at address.

It is recommended to remove these customers from the dataset due to their unreliable data. Despite only a few customers being impacted, eliminating them would not jeopardize the overall dependability of the remaining customers' data.

The length of time a customer has resided at their current address is thought to impact the likelihood of loan repayment success. The rationale behind using these figures remains unclear, therefore it is not advised to modify them.

If the data entered for these four is assumed to be incorrect, then the maximum years at address would be 71. To solve this problem, a ten-year interval subset can be used for the Years at address attribute, as sketched below. This subset would provide a reliable probability of paying back a loan based on the number of years at the current address.
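A pandas sketch of that ten-year banding, assuming the four anomalous records are dropped first and using the Paid/Defaulted labels from the Outcome attribute:

```python
# Band Years at address into ten-year intervals and compute the
# repayment rate per band (Outcome holds 'Paid' / 'Defaulted').
import pandas as pd

df = pd.read_csv("customers.csv")
df = df[df["Years at address"] <= 71]   # drop the four anomalous records

bins = range(0, 81, 10)                 # 0-10, 10-20, ..., 70-80
df["Years band"] = pd.cut(df["Years at address"], bins=bins)

print(df.groupby("Years band", observed=True)["Outcome"]
        .apply(lambda s: (s == "Paid").mean()))
```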

Employment status - The employment status attribute is primarily comprised of self-employed individuals, accounting for 1013 out of 2000 people, or 50.65%. Among the remaining individuals, 642, or 32.1%, are unemployed, 340, or 17%, are employed, and 5 are retired, representing 0.25%. However, it cannot be verified if these figures accurately reflect the employment status.

Despite the retirement age being 65, none of the retired individuals in this dataset is above that age. These customers are nevertheless classified in the retired segment, which heavily influences the ratio of retired individuals with unpaid debts versus those who have successfully repaid. However, because this portion of the dataset is small, the reliability of this information remains uncertain.

Country – the country attribute includes 4 distinct values: UK, Spain, Germany, and France. The UK accounts for the vast majority of records (1994), while the remaining countries have a combined total of six.

The data for Spain, Germany, and France is insufficient to give precise or dependable outcomes, since the data overwhelmingly concentrates on the UK. Although patterns can be detected among UK customers, the usefulness of this attribute is limited and unsuitable for the problem at hand.

Current debt - There are 788 distinct values for this attribute, with 294 unique values (15%). Since a person's debt is variable, it is categorized as continuous. The more debt an individual accumulates, the more difficult it becomes for them to obtain new loans or settle existing ones.

There is a theory that suggests individuals with substantial debt and lower income compared to their debt are unlikely to repay it; however, there is no evidence supporting this belief.

The current debt attribute provides statistical information about a person's previous debt. It can be utilized to accurately predict the likelihood of a customer repaying future loans. When an individual needs to borrow money, it implies that they are facing financial difficulties, which in turn suggests the existence of debt.

Post code - the post code attribute has a total of 19171 values, with no missing values (0%), indicating that multiple customers live in the same areas. These areas may exhibit a pattern; however, as with gender, there are not enough subjects per area to establish an accurate location-based pattern.

This attribute has no validity and provides no contribution to predicting loan repayment.

Income – this attribute has 100 distinct values, seven of them unique, with 0% missing. The values are continuous, as income has no fixed amount and can vary, for example, with bonuses or overtime.

When evaluated alongside Current debt, the income attribute plays a significant role: individuals whose debt is high relative to their income can be predicted to struggle with repaying future loans. Analyzing multiple attributes together reveals trends, such as individuals with lower debt tending to have higher income. The scatter graph of loan amount (y) against income (x) shows two outliers: one borrower repaid the loan while the other defaulted. Since there are only two outliers, they are unlikely to significantly affect the evaluation and prediction process.
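A minimal plotting sketch of that scatter graph, assuming the CSV file and column names used earlier:

```python
# Scatter of Loan amount against Income, coloured by Outcome.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("customers.csv")
colours = df["Outcome"].map({"Paid": "tab:blue", "Defaulted": "tab:red"})

plt.scatter(df["Income"], df["Loan amount"], c=colours, s=10)
plt.xlabel("Income")
plt.ylabel("Loan amount")
plt.show()
```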

Own Home - has three distinct values, with zero noise and 0% missing values, as there are only three possible states for home ownership.

The presence of homeownership and a high income in individuals is likely to lead to loan repayment, which can provide valuable insights for predicting future loan repayments. Analyzing the scatter graph with respect to income indicates that customers with a stable income can easily repay loans, particularly if they also have a mortgage. Conversely, individuals who pay rent incur extra expenses and may struggle with loan repayment.

CCJs (County Court Judgments) range from 0 to 100 in this dataset and indicate a person's record of repaying debts: a higher count means a higher chance of defaulting on loans, making this an important attribute. This information is valuable for predicting the probability of loan repayment for potential customers.

There are six distinct values (0, 1, 2, 3, 10, and 100). A 27-year-old customer with 100 CCJs is highly implausible, so this value is treated as error data (noise). Since the customer's true value is unknown, the record should be removed from the set rather than guessed at. Given the large gap between 3 and 10, and that the values 0, 1, 2, and 3 each occur many times while 10 and 100 each occur only once among 2000 customers, the value 10 can be treated the same way.
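A short pandas sketch of dropping those two noisy CCJ records, under the same CSV assumption:

```python
# Remove the two CCJ values treated as noise above: 10 and 100
# each occur once and sit far outside the 0-3 cluster.
import pandas as pd

df = pd.read_csv("customers.csv")
df = df[~df["CCJs"].isin([10, 100])]

print(df["CCJs"].value_counts().sort_index())
```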

Loan Amount - The loan amount is not fixed and can vary over time; no individual has a predetermined loan amount.

The dataset contains a minimum loan amount of 13, which is considered an input error or noise: banks typically do not lend amounts as low as £13. The actual figure remains uncertain, but it could be £1,300.

Outcome - contains two distinct values, Paid and Defaulted. This is the attribute to be predicted when forecasting the repayment of future loans for these customers.

Data Mining Algorithm Selection

Naive Bayes

The selected data mining algorithm is Naive Bayes, which applies the probability result known as Bayes' theorem, named after Reverend Thomas Bayes (1702-1761). Naive Bayes predicts the class of a new instance under the assumption that the value of each attribute is independent of the values of all other attributes.

The classifier assumes that the presence of a specific feature in a class does not have any correlation with the presence of any other feature. [1]

Naïve Bayes has been used in real-life scenarios such as disease diagnosis. One study applied it to predict the occurrence of late-onset Alzheimer's disease in a group of 1411 individuals, each with 312318 SNP measurements as genome-wide predictive features. It dealt successfully with nonlinearly dependent data and extracted useful patterns from this extensive dataset [2].

The purpose of this data mining technique is to establish whether a specific class can be assigned to a given set of attribute values. This is accomplished by calculating the likelihood that the attribute values correspond to each class; the class with the highest resulting probability is chosen as the prediction.

This notation expresses conditional probability: the probability of event B occurring given that event A has already occurred is written P(B|A).
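For reference, these are the standard textbook statements of Bayes' theorem and of the resulting Naive Bayes decision rule (implied by, not quoted from, the text above):

```latex
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The second formula is exactly where the zero-frequency problem described below arises: a single zero factor P(x_i | c) forces the whole product to zero.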

Zero frequency problem

Naive-Bayes has a drawback: if there are no instances of a class label and a particular attribute value together, the probability estimate based on frequency will be zero. This can result in a zero value when all probabilities are multiplied. To address this issue, you need to adjust for zero frequency occurrences.

Advantages & Disadvantages

Naive Bayes classifier assumes a specific data distribution, which can have negative consequences.

However, violating this assumption does not necessarily make Naive Bayes inefficient. When a class label and a certain attribute value have no occurrences together, the probability estimate based on frequency will be zero. Due to the conditional independence assumption of Naive-Bayes, multiplying all probabilities will result in a zero, impacting the posterior probability estimate.

Often, when calculating probabilities for numeric attributes, people overlook the need either to discretize the feature or to fit a distribution such as a normal curve to it.

Naive Bayes is robust to noise and unrelated attributes. However, reliable estimation of each class's probabilities requires a large dataset; applying the Naive Bayes classification algorithm to a small dataset may result in low precision and recall.

To address this, Laplacian (add-one) smoothing can be used to slightly increase the count of every occurrence in the data, so that no estimated probability is exactly zero.
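A minimal sketch of add-one smoothing on toy counts; the employment-status values are used only as illustrative categories:

```python
# Laplace (add-one) smoothing for a conditional probability estimate,
# illustrating the zero-frequency fix described above.
from collections import Counter

# Toy counts of an attribute's values within one class (illustrative only)
counts = Counter({"employed": 12, "self-employed": 30, "unemployed": 0, "retired": 3})

k = len(counts)            # number of possible attribute values
n = sum(counts.values())   # total examples in the class

for value, c in counts.items():
    p_raw = c / n                  # zero for 'unemployed'
    p_smooth = (c + 1) / (n + k)   # never zero after smoothing
    print(f"{value}: raw={p_raw:.3f}, smoothed={p_smooth:.3f}")
```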


Decision tree

Decision trees are a classification representation that utilizes a hierarchical structure of conditions. The final decision is determined by following the fulfilled conditions from the root of the tree to one of its leaves. These trees are built around the principle of "divide and conquer".

There are two possible types of split: nominal partitioning and numeric partitioning. Nominal partitioning occurs when splitting on an attribute produces as many branches as the attribute has values.

The second type, numeric partitioning, splits on threshold conditions of the form x > v and x <= v. Tree construction selects the "best" attribute as the root node and divides the data based on its values; the same process is then applied to each partition. A node is considered highly pure if one class dominates among its examples.

Decision tree methods are often referred to as recursive partitioning methods because they repeatedly divide the data into smaller and purer subsets. Splits are chosen to produce pure nodes or, if that is not possible, the purest nodes available. Pure nodes require no further division, since all samples within them belong to the same class.
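As a small illustration of node purity, here is a sketch of the entropy measure used later in this section, under the usual convention that 0·log 0 = 0:

```python
# Entropy of a node, given the class counts of the examples in it.
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

print(entropy([10, 10]))  # 1.0 -> maximally impure node
print(entropy([20, 0]))   # 0.0 -> pure node, no further split needed
```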

In the pruning phase, small, deep nodes caused by noise in the training data are eliminated from the bottom up, reducing the risk of overfitting and improving classification accuracy on unseen data. This leads to a more generalized tree structure.

During construction of the decision tree, each node aims to identify the split attribute and split point that best separate the training records assigned to it. The quality of these split values determines how well the classes are distinguished.

The first decision tree classification algorithm was ID3, developed by Ross Quinlan in 1986. It assumed that all attributes were nominal and that there were no unknown values. The C4.5 algorithm, an enhanced version of ID3 introduced in 1993, is a more versatile extension for decision tree generation.

Overfitting occurs when a model becomes overly tuned to specific details of the training data and performs poorly on new examples; underfitting happens when a model is too simplistic to distinguish between new examples, also resulting in poor performance.

Decision trees offer advantages such as easy understanding and visualization through diagrams; however, pruning can be challenging with large datasets, which can lead to complex, non-generalized trees. Generating decision trees is fast and requires minimal effort when classifying unseen records, and pruning helps handle noise in the dataset without significantly impacting tree accuracy. Because C4.5 is a white-box model that exposes its underlying logic, one gains better comprehension of the results and can incorporate new scenarios effectively. C4.5 builds decision trees from a training set, handles continuous, discrete, and missing attribute values, and can convert trees into rules, applying pruning techniques to minimize the number of levels and nodes in the tree. The algorithm can therefore generate accurate results on the bank's dataset, which includes both continuous and discrete attributes, without first removing noisy values. Enhancements involve pruning and testing with methods such as cross-validation, training sets, and a 50:50 split.

Node quality in the decision tree is assessed using impurity measures such as entropy. The gain of a test condition compares the impurity of the parent node with the weighted impurity of its child nodes, Δ = I(parent) − Σⱼ (N(vⱼ)/N)·I(vⱼ), so maximizing the gain means minimizing the weighted average impurity of the children. When entropy is used as the impurity measure I(), the gain is referred to as the information gain (Δinfo).

Data Preparation

Certain attributes require preprocessing. Discrimination based on gender and location will not influence loan-granting decisions for future cases. Because there are too few records from the other European countries to establish an accurate pattern, the country attribute is disregarded. The evaluation will therefore consider the following attributes discussed earlier: Outcome, Income, Loan amount, CCJs, Years at address, Current debt, Own home, Age, and Employment status. Postcode, Surname, Forename, Customer ID, Gender, and Country

(considered noise) will not be utilized as attributes.

A total of 1994 records remain in the dataset after eliminating 6. The 'Gender' attribute has been normalized to 'M' and 'F', which is reflected in the histogram: the pre-edit histogram is on the left, the post-edit one on the right. The deleted records related to 'Years at address' and 'CCJs' values considered noise. This conclusion is supported by an individual aged 27 with 100 CCJs and a person aged 85 with 560 years at their current address, both indicating errors. Removing these records does not affect the results, as sufficient data remains. Histograms for CCJs and Years at address show the updated data, with the top chart representing the data before removal and the bottom chart the data after removal.

Analyzing the 'Income' attribute may provide further insight, as it could reveal outliers within the dataset. The attribute contains two outliers, valued at 18000 and 220000, significantly different from the other values, making it important in determining loan repayment success or failure.
The images illustrate that young individuals with high incomes and existing CCJs are more likely to default on a loan, whereas middle-aged individuals with high salaries and no previous CCJs are more likely to repay. The histogram of the 'Income' attribute does not require the removal of its outliers. After eliminating the noisy records from the CCJs and Years at address attributes, there are six fewer records (1988) for UK residents in the country attribute. For France, Germany, and Spain, only six values exist in total, too few to indicate any significant pattern.

The Modelling Results and Discussion

Experiment 1 aims to compare classification accuracy using a 50:50 split versus a cross-validation strategy with the NB and J48 techniques. Both the C4.5 (J48 in Weka) algorithm and the Naive Bayes algorithm will be evaluated using a 50:50 percentage split. Having distinct, independent samples is crucial for obtaining reliable evaluation results; the percentage split divides the data into separate parts for learning and testing. A larger split such as 80:20 may not give accurate estimates here, as the bank has supplied a limited amount of data.

The cross-validation option (10 folds) will be used with both training algorithms. This method divides the data into 10 folds and repeats the process ten times, using one fold for testing in each repetition and the remaining nine folds for learning. Weka runs the learning algorithm 11 times: once for each fold of the cross-validation and once more on the entire dataset. Out of 2000 records, 1800 are used for learning in each repetition while the remaining 200 are used for testing. This systematic approach improves on repeated holdout because it reduces the variance of the estimate, making it a reliable method. Both J48 and Naive Bayes will be tested using these techniques, and additional experiments will be conducted to determine the most accurate combination of results compared with Weka's default values. A sketch of both evaluation strategies follows.
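The sketch below uses scikit-learn purely as a stand-in for Weka's J48 (a C4.5-style tree) and NaiveBayes; the feature matrix and labels are random placeholders, not the bank's data:

```python
# 50:50 percentage split vs 10-fold cross-validation, sketched in
# scikit-learn as an analogue of the Weka setup described above.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))        # placeholder for the nine kept attributes
y = rng.integers(0, 2, size=2000)     # placeholder for Paid / Defaulted

for model in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    # 50:50 percentage split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    split_acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    # 10-fold cross-validation
    cv_acc = cross_val_score(model, X, y, cv=10).mean()
    print(type(model).__name__, f"split={split_acc:.3f}", f"cv={cv_acc:.3f}")
```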

For subsequent experiments, parameters such as binarySplits and unpruned will be varied for the J48 algorithm, while for Naive Bayes the useKernelEstimator and useSupervisedDiscretization options are the key factors.

It is essential to modify only one parameter at a time while recording the confusion matrix and the number of correctly classified instances. Afterwards, all parameters should be reverted to their original state. This approach ensures reliable results and helps identify the influential parameters.

The confusion matrix below displays the data mining techniques used along with their corresponding strategies, giving percentage values for both the Paid and Defaulted categories, and in particular the accuracy percentages for correctly classified instances by the J48 and Naive Bayes classifiers.

Notably, both J48 and Naive Bayes achieve their highest classification accuracy with the 50:50 training-testing method, so this method will be used in subsequent experiments. Across all strategy combinations, the model with a 50:50 split using J48 has an average of 77.35%, which is closest to its overall classification accuracy.

In Experiment 2: Studying the Effects of Pruning, our objective is to determine whether pruning a decision tree yields optimal classification accuracy with the J48 classifier. The methodology involves evaluating how setting the unpruned parameter to 'true' affects the number of correctly classified instances. The parameter is false by default, indicating a pruned tree; changing it prevents pruning and allows us to assess the effect on classification accuracy.

Our preliminary investigation showed that a 50:50 percentage split with the J48 algorithm produced a highly accurate pruned tree. Setting unpruned to true, however, reduced accuracy: correctly classified instances dropped to 77.03%, a decrease of 1.1%, and the resulting decision tree was much larger.
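A sketch of the pruned-versus-unpruned comparison, using scikit-learn's cost-complexity pruning as a rough analogue of J48's unpruned switch; Weka's pruning differs in detail, and the data below is a random placeholder:

```python
# Compare an unpruned (fully grown) tree against a pruned one.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))        # placeholder features
y = rng.integers(0, 2, size=2000)     # placeholder Paid / Defaulted labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [
    ("unpruned", DecisionTreeClassifier(random_state=0)),               # full tree
    ("pruned", DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)), # weak branches cut
]:
    model.fit(X_tr, y_tr)
    print(name, "nodes:", model.tree_.node_count,
          "accuracy:", round(model.score(X_te, y_te), 3))
```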

In terms of modelling, the training method used was J48 with a 50:50 split. The percentages for %Paid, %Defaulted, and %Classification Accuracy were 85.84%, 66.44%, and 77.03% respectively. The accuracy on the Paid class increased by 0.18%, while the accuracy on the Defaulted class decreased by 11.69%. Both the confusion matrix and the table percentages show lower figures for Defaulted and higher figures for Paid. Enabling the unpruned parameter leads to a notably larger decision tree, which can offer valuable marketing data, for example by targeting self-employed customers eligible for higher loan amounts based on their current income.

Experiment 3 aimed to assess whether binary splits (the binarySplits parameter) result in improved classification accuracy with J48. The methodology employed is outlined below.
