Literature Review About Data Warehouse Essay Example
  • Pages: 19 (5142 words)
  • Published: August 8, 2018
  • Type: Research Paper

DATA WAREHOUSE

According to William Inmon, a data warehouse is a “subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management’s decision-making process” (Inmon, 1999). A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.

A data warehouse provides an effective integration of operational databases into an environment that enables strategic use of data (Zhou, Hull, King and Franchitti, 1995). The enabling technologies include relational and multidimensional database (MDDB) management systems, client/server architecture, meta-data modelling and repositories, graphical user interfaces and much more (Hammer, Garcia-Molina, Labio, Widom, and Zhuge, 1995; Harinarayan, Rajaraman, and Ullman, 1996).

The emergence of cross-discipline domains such as knowledge management in finance, health and e-commerce has shown that vast amounts of data need to be analysed. The evolution of data in a data warehouse can provide multiple dataset dimensions to solve various problems. Thus, critical decision making over such datasets needs a suitable data warehouse model (Barquin and Edelstein, 1996).

The main proponents of the data warehouse are William Inmon (Inmon, 1999) and Ralph Kimball (Kimball, 1996), but they have different perspectives on its design and architecture. Inmon (1999) defined the data warehouse as a dependent data mart structure, while Kimball (1996) defined it as a bus-based data mart structure. Table 2.1 discusses the differences in data warehouse structure between William Inmon and Ralph Kimball.

A data warehouse is a read-only data source where end-users are not allowed to change the values or data elements. Inmon’s (Inmon, 1999) data warehouse architecture strategy is different from Kimball’s (Kimball, 1996). In Inmon’s model, data marts are split off as copies of warehouse data and distributed as an interface between the data warehouse and end users. Kimball views the data warehouse as a union of data marts: the data warehouse is the collection of data marts combined into one central repository. Figure 2.1 illustrates the differences between Inmon’s and Kimball’s data warehouse architectures, adopted from (Mailvaganam, 2007).

Although Inmon and Kimball have different design views of the data warehouse, they agree that a successful implementation depends on an effective collection of operational data and validation of the data marts. Database staging and ETL processes are essential components in both researchers’ data warehouse designs. Both believed that a dependent data warehouse architecture is necessary to fulfil the requirements of enterprise end users in terms of precision, timing and data relevancy.

DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture has a wide research scope and can be viewed from many perspectives. (Thilini and Hugh, 2005) and (Eckerson, 2003) provide some meaningful ways to view and analyse data warehouse architecture. Eckerson states that a successful data warehouse system depends on a database staging process which derives data from different integrated Online Transactional Processing (OLTP) systems. In this case, the ETL process plays a crucial role in making the database staging process workable. A survey of the factors that influence the selection of data warehouse architecture by (Thilini, 2005) identifies five data warehouse architectures in common use, as shown.

Independent Data Marts

Independent data marts are also known as localized or small-scale data warehouses. They are mainly used by departments or divisions of a company to serve individual operational databases. This type of data mart is simple, yet it takes different forms derived from multiple design structures and various inconsistent database designs, which complicates cross-data-mart analysis. Every organizational unit tends to build its own database, which operates as an independent data mart. (Thilini and Hugh, 2005), citing the work of (Winsberg, 1996) and (Hoss, 2002), note that this architecture is best used as an ad-hoc data warehouse or as a prototype before building a real data warehouse.

Data Mart Bus Architecture

(Kimball, 1996) pioneered the design and architecture of a data warehouse built as a union of data marts, known as the bus architecture or virtual data warehouse. The bus architecture allows data marts to be located not only on one server but also on different servers. This allows the data warehouse to function in a more virtual mode, combining all data marts and processing them as one data warehouse.

Hub-and-spoke architecture

(Inmon, 1999) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke focuses mainly on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

The centralized data warehouse architecture is based on the hub-and-spoke architecture but without the dependent data mart component. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources. According to (Inmon, 1999) and (Kimball, 1996), a central data warehouse should include database staging, also known as an operational data store, as an intermediate stage for the operational processing of data integration before the data is transformed into the data warehouse.

Federated Architecture

According to (Hackney, 2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems. The federated concept focuses on an integrated framework to make the data warehouse more reliable. (Jindal, 2004) concludes that the federated data warehouse is a practical approach, as it focuses on higher reliability and provides excellent value.

(Thilini and Hugh, 2005) conclude that the hub-and-spoke and centralized data warehouse architectures are similar. The centralized architecture is faster and easier to implement because no dependent data marts are required, and it scored higher than hub-and-spoke where there was an urgent need for a relatively fast implementation approach.

In this work, it is very important to identify which data warehouse architecture is robust and scalable for building and deploying enterprise-wide systems. (Laney, 2000) states that the selection of an appropriate data warehouse architecture must incorporate the successful characteristics of various data warehouse models. It is evident that two data warehouse architectures have proved popular, as shown by (Thilini and Hugh, 2005), (Eckerson, 2003) and (Mailvaganam, 2007). The first is hub-and-spoke, proposed by (Inmon, 1999), which is a data warehouse with dependent data marts; the second is the data mart bus architecture with dimensional data marts, proposed by (Kimball, 1996). The new proposed model will use the hub-and-spoke data warehouse architecture, which can be used for MDDB modelling.

DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

The data warehouse architecture process begins with the ETL process, which ensures the data passes the quality threshold. According to Evin (2001), it is essential to have the right dataset. ETL is an important component of the data warehouse environment, ensuring that the datasets loaded into the data warehouse from various OLTP systems are cleansed. ETL is also responsible for running scheduled tasks that extract data from OLTP systems. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, Colby, Cole, McKenna, Mulagund, and Wilhite, 2001). The complete ETL process is described below.

A data warehouse database can be populated from a wide variety of data sources in different locations, so collecting all the different datasets and storing them in one central location is an extremely challenging task (Calvanese, Giacomo, Lenzerini, Nardi, and Rosati, 2001). However, ETL processes reduce the complexity of data population via a simplified process, as depicted in figure 2.2. The ETL process begins with data extraction from operational databases, where data cleansing and scrubbing are done to ensure all data are validated. The data is then transformed to meet the data warehouse standards before it is loaded into the data warehouse.
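
The extract, cleanse, transform and load sequence described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any cited author: the sample records, the `sales` table and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw records as they might arrive from operational (OLTP) sources.
RAW_ROWS = [
    {"id": "1", "name": " Alice ", "amount": "120.50", "date": "2007-01-15"},
    {"id": "2", "name": "BOB", "amount": "n/a", "date": "2007-01-16"},   # invalid amount
    {"id": "3", "name": "carol", "amount": "75", "date": "2007-01-17"},
    {"id": "3", "name": "carol", "amount": "75", "date": "2007-01-17"},  # duplicate
]


def extract(rows):
    """Extract step: read the raw records (here, copy an in-memory list)."""
    return [dict(r) for r in rows]


def cleanse(rows):
    """Cleansing step: drop records with a non-numeric amount and duplicate ids."""
    seen, clean = set(), []
    for r in rows:
        try:
            float(r["amount"])
        except ValueError:
            continue
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        clean.append(r)
    return clean


def transform(rows):
    """Transform step: cast types and normalise names to one warehouse standard."""
    return [
        (int(r["id"]), r["name"].strip().title(), float(r["amount"]), r["date"])
        for r in rows
    ]


def load(conn, rows):
    """Load step: write the conformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, raw_rows):
    """Run the steps in order and return the rows that were loaded."""
    rows = transform(cleanse(extract(raw_rows)))
    load(conn, rows)
    return rows
```

Calling `run_etl(sqlite3.connect(":memory:"), RAW_ROWS)` loads only the two valid, distinct records; in a real deployment each step would be far more involved and would typically be run from a scheduler.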

(Zhou et al, 1995) state that during the data integration process in a data warehouse, ETL can assist in the import and export of operational data between heterogeneous data sources using an Object Linking and Embedding Database (OLE-DB) based architecture, in which the data are transformed so that all validated data populate the data warehouse.

The data warehouse architecture of (Kimball, 1996), depicted in figure 2.3, focuses on three important modules: “the back room”, the “presentation server” and “the front room”. ETL processes are implemented in the back room, where the data staging services are in charge of gathering the operational databases of all source systems and extracting data from them, across different file formats, systems and platforms.

The second step is to run the transformation process, which removes all inconsistencies to ensure data integrity. Finally, the data is loaded into data marts. The ETL processes are commonly executed from a job control via scheduled tasks. The presentation server is the data warehouse, where data marts are stored and processed. Data is stored in a star schema consisting of dimension and fact tables. The data is then processed in the front room, where it is accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.
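
To make the star schema mentioned above more concrete, the following sketch builds one fact table and two dimension tables in SQLite and runs a typical front-room query. The table names, columns and figures are hypothetical and chosen only for illustration; they are not taken from Kimball's work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Antivirus', 'Software');
INSERT INTO dim_date VALUES (10, 2007, 1), (11, 2007, 2);
INSERT INTO fact_sales VALUES
    (1, 10, 2, 2000.0), (2, 10, 5, 100.0), (3, 10, 1, 50.0),
    (1, 11, 1, 1000.0), (3, 11, 4, 200.0);
""")


def revenue_by_category_and_quarter(conn):
    """Join the fact table to its dimensions and aggregate the revenue measure."""
    return conn.execute("""
        SELECT p.category, d.quarter, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.quarter
        ORDER BY p.category, d.quarter
    """).fetchall()
```

The query touches the central fact table once and reaches every descriptive attribute through a single join, which is the property that gives the star schema its name.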

Although ETL processes prove to be an essential component for ensuring data integrity in a data warehouse, complexity and scalability play an important role in deciding the type of data warehouse architecture. One way to achieve a scalable, non-complex solution is to adopt a “hub-and-spoke” architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency. A centralized data warehouse design can also help maintain full access control over ETL processes.

ETL processing in a hub-and-spoke data warehouse architecture is recommended in (Inmon, 1999) and (Kimball, 1996). The hub is the data warehouse, populated after data has been processed from the operational databases through the staging database, and the spokes are the data marts that distribute the data. Sherman, R (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions. The many-to-many approach, by comparison, is more complicated and costly.
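
One way to see why the many-to-many approach is costlier is to count the interfaces that must be built and maintained. The sketch below assumes, purely for illustration, that every source system feeds every data mart directly in the many-to-many case, and that in the hub-and-spoke case each source feeds the hub once and the hub feeds each data mart once. The function name and the counting model are assumptions, not figures from Sherman (2005).

```python
def interface_counts(num_sources, num_marts):
    """Return (many_to_many, hub_and_spoke) interface counts under the stated model."""
    many_to_many = num_sources * num_marts    # every source feeds every mart directly
    hub_and_spoke = num_sources + num_marts   # sources -> hub, then hub -> marts
    return many_to_many, hub_and_spoke
```

With 5 source systems and 4 data marts this gives 20 direct interfaces against 9 through the hub, and the gap widens as either number grows.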

DATA WAREHOUSE FAILURE FACTORS

The study by (Hayen, Rutashobya, and Vetter, 2007) shows that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year. It is estimated that two-thirds of attempts to set up a data warehouse will eventually fail. (Hayen et al, 2007), citing the work of (Briggs, 2002) and (Vassiliadis, 2004), note three groups of factors behind the failure of data warehouse projects: environment, project and technical factors, as shown.

Environment factors lead to organizational changes in terms of business, politics, mergers, takeovers and lack of top management support. They include human error, corporate culture, the decision making process and poor change management (Watson, 2004) (Hayen et al, 2007).

Poor technical knowledge of the requirements for data definitions and data quality across different organizational units may cause data warehouse failure. Insufficient knowledge of data integration, and poor selection of the data warehouse model and of data warehouse analysis applications, may cause major failures.

In spite of heavy investment in hardware, software and people, poor project management may lead to data warehouse project failure. For example, assigning a project manager who lacks knowledge and project experience in data warehousing may impede quantifying the return on investment (ROI) and achieving the project triple constraint (cost, scope, time).

Data ownership and accessibility are potential factors that may cause data warehouse project failure. This is considered a sensitive issue within the organization: one must not share or acquire someone else’s data, as this is seen as losing authority over the data (Vassiliadis, 2004). It therefore places restrictions on any department declaring total ownership of clean, error-free data, which may cause problems over the ownership of data rights.

DATA WAREHOUSE SUCCESS FACTORS

Hwang M.I. stresses that data warehouse implementation is an important area of research and industrial practice, but that only a few studies have assessed the critical success factors for data warehouse implementations. He surveyed six data warehouse studies (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. He concluded his survey with a list of success factors that influence data warehouse implementation, as depicted in figure 2.8. He shows eight implementation factors which directly affect the six selected success variables.

The above-mentioned data warehouse success factors provide an important guideline for implementing a successful data warehouse project. The study by (Hwang M.I., 2007) shows that an integrated selection of factors such as end user participation, top management support, and acquisition of quality source data with profound and well-defined business needs plays a crucial role in data warehouse implementation. Besides that, other factors highlighted by Hayen R.L. (2007), citing the work of Briggs (2002), Vassiliadis (2004) and Watson (2004), such as project, environment and technical knowledge, also influence data warehouse implementation.

In this work on the new proposed model, the hub-and-spoke architecture is used as a “central repository service”, as many scholars including Inmon, Kimball, Evin, Sherman and Nicola adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across a local or wide area network, depending on business requirements.

In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL, a staging database or operational data store, data marts, MDDB, OLAP, and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouses overwrite old data with new data, while others maintain a history and audit trail of all changes to the data.
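
The two loading policies mentioned above, overwriting versus keeping a history, can be contrasted with a small sketch. The in-memory dictionaries, the field names `valid_from` and `valid_to`, and both function names are hypothetical; real warehouses implement these policies inside the ETL tool or the database.

```python
from datetime import date


def load_overwrite(table, key, new_row):
    """Overwrite policy: the newest values simply replace the stored ones."""
    table[key] = dict(new_row)


def load_with_history(table, key, new_row, load_date):
    """History policy: close the current version and append a new one."""
    versions = table.setdefault(key, [])
    if versions and versions[-1]["valid_to"] is None:
        if versions[-1]["data"] == new_row:
            return  # nothing changed, so no new version is recorded
        versions[-1]["valid_to"] = load_date  # close the previous version
    versions.append({"data": dict(new_row), "valid_from": load_date, "valid_to": None})
```

After loading a customer's city twice, the overwrite table holds only the latest city, whereas the history table holds both versions with their validity dates, which is what makes an audit trail possible.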

ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a group of decision support systems that facilitate fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational datasets, mainly from a data warehouse, into an MDDB, which allows optimal data retrieval and trend analysis.

According to Chaudhuri (1997), Burdick, D. et al. (2006) and Vassiladis, P. (1999), OLAP is an important concept for strategic database analysis. OLAP has the ability to analyze large amounts of data for the extraction of valuable information. Analytical development can serve the business, education or medical sectors. The technologies of data warehousing, OLAP and analysis tools support that ability.

OLAP enables the discovery of patterns and relationships contained in business activity by querying large volumes of data from multiple database source systems at one time (Nigel. P., 2008). Processing database information using OLAP requires an OLAP server to organize and transform the data and build the MDDB. The MDDB is then separated into cubes so that client OLAP tools can perform data analysis, which aims to discover new patterns and relationships between the cubes. Popular OLAP server software vendors include Oracle, IBM and Microsoft.

Madeira (2003) supports the view that OLAP and the data warehouse are complementary technologies that blend together. The data warehouse stores and manages data, while OLAP transforms data warehouse datasets into strategic information. OLAP functions range from basic navigation and browsing (often known as “slice and dice”) to calculations and more serious analysis such as time series and complex modelling. As decision-makers adopt more advanced OLAP capabilities, they move from basic data access to the creation of information and the discovery of new knowledge.

OLAP FUNCTIONALITY

OLAP functionality offers dynamic multidimensional analysis, supporting end users with analytical activities that include calculations and modelling applied across dimensions, trend analysis over time periods, slicing subsets for on-screen viewing, and drilling to deeper levels of records (OLAP Council, 1997). OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps the end user integrate enterprise information through comparative, customized viewing and analysis of historical and present data in various “what-if” data model scenarios. This is achieved through use of an OLAP server, as depicted.

OLAP functionality is provided by an OLAP server. The design and data structures of an OLAP server are optimized for fast information retrieval in any orientation and for flexible calculation and transformation of unprocessed data. The OLAP server may either physically store the processed multidimensional information to deliver consistent and fast response times to end users, or fill its data structures in real time from relational databases, or offer a choice of both.

Essentially, OLAP creates information in cube form, which allows more complex analysis than a relational database. OLAP analysis techniques employ “slice and dice” and “drilling” methods to divide data into portions of information depending on given parameters. A slice fixes a single value for one or more variables, yielding a subset of the multidimensional array, whereas the dice function applies the slice function on more than two dimensions of a multidimensional cube. The drilling function allows the end user to traverse from condensed data down to the most precise data unit, as depicted in Diagram 2.10.
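
The slice, dice and drill operations described above can be illustrated on a tiny in-memory cube. This is a sketch under assumptions: the cube contents, the dimension names and the function names are hypothetical, dicing is modelled as restricting several dimensions to sets of allowed values, and drilling is modelled as re-aggregating at a finer level. A real OLAP server stores and indexes cubes very differently.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, quarter, region, product) -> units sold.
DIMS = ("year", "quarter", "region", "product")
CUBE = {
    (2007, "Q1", "North", "Laptop"): 10,
    (2007, "Q1", "South", "Laptop"): 4,
    (2007, "Q2", "North", "Mouse"): 7,
    (2007, "Q2", "South", "Laptop"): 3,
    (2008, "Q1", "North", "Mouse"): 5,
}


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension has a single fixed value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep the sub-cube where several dimensions fall in allowed value sets."""
    idx = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {k: v for k, v in cube.items() if all(k[i] in vals for i, vals in idx.items())}


def roll_up(cube, *dims):
    """Aggregate the measure over every dimension not listed in dims."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)
```

Drilling down then amounts to calling `roll_up(CUBE, "year")` for the condensed view and `roll_up(CUBE, "year", "quarter")` for the next level of detail.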

OLAP Evaluation

Codd’s twelve rules of OLAP provide an essential tool to verify that the OLAP functions and OLAP models used are able to produce the desired result. Berson, A. (2001) stressed that a good OLAP system should also support complete database management tools, as an integrated centralized utility that lets database management distribute databases within the enterprise. The OLAP drilling mechanism within the MDDB allows drilling down right to the source, or root, of the detail record level. This implies that the OLAP tool permits a smooth changeover from the MDDB to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature for preventing operational stability issues and usability problems as the size of the database increases.

OLTP and OLAP

The design of OLAP for multidimensional cubes is entirely different from that of OLTP for databases. OLTP is implemented on a relational database to support daily processing in an organization. The main function of an OLTP system is to capture data into computers. OLTP allows effective manipulation and storage of data for daily operations, resulting in a huge quantity of transactional data. Organisations build multiple OLTP systems to handle huge quantities of daily operational transaction data in a short period of time.

OLAP is designed for data access and analysis to support the strategic decision making process of managerial users. OLAP technology focuses on aggregating datasets into a multidimensional view without hindering system performance. Han, J. (2001) describes OLTP systems as “customer oriented” and OLAP systems as “market oriented”. He summarized the major differences between OLTP and OLAP systems based on 17 key criteria, as shown.

It is complicated to merge OLAP and OLTP into one centralized database system. The dimensional data design model used in OLAP is much more effective for querying than the relational database design used in OLTP systems. OLAP may use one central database as its data source, whereas OLTP uses different data sources from different database sites. The dimensional design of OLAP is not suitable for an OLTP system, mainly due to redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems, one OLTP and one OLAP (Poe, V., 1997). We can conclude that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers.

DATA MINING

Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J. et. al., 1996; Frawley, 1992) have defined data mining as discovering hidden patterns in historical datasets by using pattern recognition, as it involves searching for specific, unknown information in a database. Chung, H. (1999) and Fayyad et al (1996) referred to data mining as a step of knowledge discovery in databases: the process of analyzing data and extracting knowledge from a large database, also known as a data warehouse (Han, J., 2000), and turning it into useful information.

Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a data warehouse. The results of the extraction uncover hidden patterns and inconsistencies that are not visible in the existing datasets. Such hidden patterns and data inconsistencies cannot be discovered using conventional data analysis and query tools. Data mining techniques differ from the conventional data analysis approach in that data mining extracts hidden patterns from a dataset, while conventional data analysis tools only test assumptions about the result from a dataset.

Several data mining techniques are used in different application areas. Data mining techniques cover association, classification, clustering and prediction (Citation). Freitas (2002) stressed that a data mining problem should be considered in terms of the potential for solving it with different data mining techniques. Thus, to carry out a successful data mining application with the chosen techniques, a process model is required, as it includes a series of steps that will guide the work to agreeable results.

Data Mining Techniques

In general, data mining is capable of predicting or forecasting future events based on historical datasets, and its purpose is to find hidden patterns in the database. Several data mining techniques are used and applied in different areas of data. Knowledge of how each data mining technique is used is essential for selecting the suitable technique for a specific area.

According to Mailvaganam (2007), data mining techniques consist of two kinds of model, descriptive and predictive, as described in Diagram 2.16. Descriptive models can be generated by employing association rule discovery and clustering algorithms. Predictive models are generated by using classification and regression algorithms. Descriptive models can reveal hidden relationships in a given dataset; for example, in a student database, students who pass mathematics tend to pass science. Predictive models can forecast future results from a given dataset; for example, in marketing, a customer’s gender, age and purchase history might predict the likelihood of a future sale.

A data mining algorithm is the mechanism that generates a data mining model. In order to generate a data mining model, a data mining algorithm needs to be defined. The algorithm then analyses a given set of data to look for identifiable hidden patterns and trends. The results are used by the algorithm to define the parameters of the mining model. These parameters are then applied across the whole dataset to extract actionable patterns and detailed statistics. The data mining algorithms are discussed in more detail as follows.

The association algorithm is a powerful correlation counting engine. It can perform scalable and efficient analysis to identify items in a collection that occur together.
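
The idea of counting items that occur together can be sketched with a simple pairwise rule miner. This is not the algorithm of any particular product: the function name, the thresholds and the student data are hypothetical, and only rules between two single items are considered. Support is the fraction of records containing both items, and confidence is the fraction of records containing the left-hand item that also contain the right-hand item.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) rules between pairs of items."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n
        if support < min_support:
            continue
        # Test the rule in both directions, since confidence is not symmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical records: the subjects each of five students passed.
PASSED = [
    {"math", "science"},
    {"math", "science"},
    {"math", "science", "art"},
    {"math"},
    {"art"},
]
```

On this data `pair_rules(PASSED)` reports that students who pass mathematics pass science with confidence 0.75, echoing the descriptive-model example given earlier in the text.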

Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class labels are unknown. The decision tree algorithm is of this kind, covering both classification and regression. It can also build multiple trees in a single model to perform association analysis.

The clustering algorithm includes two different clustering techniques: EM (expectation-maximization) and K-means. It automatically detects the number of natural clusters in the dataset and discovers groups or categories.

Prediction can be viewed as constructing and using a model to assess the class of an unlabeled sample, or the value ranges of an attribute that a given sample is likely to have.
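
The K-means technique named above can be sketched in a few lines. Note the assumptions: this plain version requires the number of clusters `k` to be supplied rather than detected automatically, it uses the first `k` points as initial centroids to keep the run deterministic, and the function name and sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means: alternate assignment and centroid update until stable."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
```

Calling `kmeans(POINTS, 2)` separates the three points near the origin from the three points near (9, 9). Different starting centroids can lead to different clusterings, which is why practical implementations usually restart several times.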

In data mining, choosing the best algorithm depends on the specific business use case. It is possible to use different data mining algorithms on the same business use case dataset; each algorithm will produce a different set of results, and some data mining algorithms can produce more than one type of result. Data mining algorithms are flexible and do not need to be used separately. Within a single data mining solution, the first action can be to use one algorithm to explore the dataset and then use another algorithm to predict a specific result based on the explored data (Citation). For example, an algorithm such as clustering can be used to explore the data, recognize patterns and break the dataset into groups, and then another algorithm, such as a decision tree model based on the classification algorithm, can be used to predict a specific outcome based on that data.

Data mining models are used to predict values, find hidden trends and generate summary data. It is important to know which data mining algorithm to use for a given business use case. As a data mining model is built, deployed and trained, the details of the resulting model are stored as data mining model nodes. Each node collects the attributes, description, probabilities and distribution information for the model element it represents, together with its relation to other nodes. Every data mining model node has an associated node type that helps to characterize the data mining model. The data mining model node is the uppermost node, regardless of the actual structure of the model. All data mining models start with a model node. (Citation)

Decision Tree Model

Decision trees are a standard data mining model for classification and prediction (Citation). Decision trees are often preferred to neural networks because they represent rules, and these rules can easily be expressed and understood. A decision tree model categorizes an instance by starting at the root of the tree and moving down to a leaf node, which provides the classification of the instance.

A decision tree model is a tree-like structure built using classification techniques, in which each node represents a question used to further classify the data. A decision tree is efficient and can be built faster than other models, while in some cases achieving similar accuracy. It is therefore appropriate for large training datasets. A decision tree model is easy to understand and interpret, depending on the complexity of the tree, and it handles non-numerical data. The various methods used to create decision trees have been used widely for decades, and there is a large body of work describing these statistical techniques. According to Witten et al (2000), the decision tree approach is known for fast data mining modelling, as it uses a divide-and-conquer approach.

Witten et al (2000) describe the decision tree as being constructed recursively. An attribute is placed at the root node of the tree, and one branch is made for each of its possible values. The training set is then split into subsets, one for each branch, forming Decision Tree 1, Decision Tree 2 or more tree nodes. This process is repeated recursively for each branch until all instances at a node have the same classification, at which point the tree construction stops. This means that a leaf node with a single class of “true” or “false” cannot be split further, which causes the recursive process to stop. The objective of the decision tree model is to build as simple a tree as possible while producing good classification or predictive performance.
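
The recursive construction described above can be sketched as follows. The choice of splitting attribute is an assumption of this sketch: it picks the attribute whose split leaves the lowest weighted entropy, in the style of ID3, which the text does not specify. The attribute names, sample data and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels (zero when all labels agree)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Recursively split until every node holds a single class."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node: becomes a leaf, recursion stops
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # fall back to the majority class

    def split_entropy(attr):
        result = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            result += len(subset) / len(rows) * entropy(subset)
        return result

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}
    for value in sorted(set(r[best] for r in rows)):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return node


def classify(tree, row):
    """Walk from the root to a leaf; assumes every value was seen in training."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


ROWS = [
    {"attendance": "high", "coursework": "done"},
    {"attendance": "high", "coursework": "missing"},
    {"attendance": "low", "coursework": "done"},
    {"attendance": "low", "coursework": "missing"},
    {"attendance": "low", "coursework": "missing"},
]
LABELS = ["pass", "pass", "pass", "fail", "fail"]
TREE = build_tree(ROWS, LABELS, ["attendance", "coursework"])
```

Each internal node of `TREE` records the attribute it asks about and one branch per value, so the resulting rules can be read directly off the structure.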

Clustering Model

Clustering is a data mining technique used to separate a dataset into groups, or clusters, based on the similarity between data entities (Citation). The entities of a cluster share common features that differentiate them from other clusters. Similarity is measured in terms of the distance between elements or entities. Unlike classification, which has predefined labels (supervised learning), clustering is considered unsupervised learning because it comes up with the labels automatically (Citation).

According to Kogan, J. et.al. (2006), clustering techniques are divided into partitioning and hierarchical methods. Partitioning methods construct various partitions of similar and dissimilar items into groups or clusters, evaluated by given conditions. Hierarchical methods build a hierarchical breakdown of a dataset progressively, using either a top-down or a bottom-up approach (Citation). The top-down approach begins with one cluster containing all the data and breaks it down into smaller clusters known as sub-clusters. The bottom-up approach begins with small clusters and combines them recursively into larger clusters in a nested fashion.
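
The bottom-up approach can be sketched for one-dimensional data as follows. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members) and stops at a requested number of clusters; both choices, along with the function name and sample values, are illustrative assumptions rather than details from the cited work.

```python
def agglomerative(points, target_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []                       # record of (left, right, distance) per merge
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges
```

For the values `[1.0, 2.0, 9.0, 10.0, 25.0]` and two target clusters, the nearby pairs merge first, then the two pairs merge with each other, leaving 25.0 on its own. The recorded `merges` list is the nested hierarchy, so a different granularity can be read from it without rerunning the procedure.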

The advantage of hierarchical clustering over partitioning is that it is flexible with regard to the level of granularity. Clustering techniques are assessed in terms of certain features related to the size of the cluster, the distance between parts of the cluster, or the shape of the cluster. Clustering techniques support applications that require segmenting the data into common groups (Citation).

A cluster node gathers the attributes and data for the abstraction of a specific cluster. Basically, it gathers the set of distributions that make up a cluster of cases for the data mining model. A clustering-based model always has one model node and at least one cluster node. A user does not need to specify the number of clusters in advance: the clustering process creates the number of clusters automatically, based on how similar the records within the individual clusters should be. The clustering approach works best with categorical and non-repetitive variables.

Data Mining Process Model

A process model is required to implement a data mining project. This process model involves a sequence of steps that will produce correct results. Some examples of these process models are CRISP (Chapman et al, 2000) and TWOCROWS (Two Crows, 1999). In this study, the experimental application tools are based on the CRISP data mining process model. The different phases of the CRISP data mining process model are presented in Diagram 2.19 (Citation). The focus of this chapter is on the first three CRISP phases, which are relevant to the research objectives of this study.

According to Chapman et al (2000), the CRISP data mining model is a life cycle for a data mining project, which also includes the tasks and the relationships between the tasks. The CRISP life cycle consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment. The arrows indicate the most important and frequent dependencies between phases (Citation INCLUDE PAGE NUMBER).

The CRISP data mining process model begins with a business understanding of the project’s objectives and requirements, which is important for converting them into a data mining problem definition. The next step is data understanding, in which the datasets are examined to identify data quality problems and to discover interesting subsets from which to form hypotheses about hidden information. After the datasets have been identified, the data preparation phase loads all data from the initial datasets into the modelling tools.

This phase is executed multiple times to complete the transformation and cleaning of data for modelling. In the modelling phase, various techniques are applied to the data mining problem to obtain high quality models for data analysis. In the evaluation phase, the models are thoroughly reviewed to ensure that they achieve the specified business objectives. Finally, the deployment phase is executed to produce simple reports or complex data mining results; this phase is mainly triggered by the end-users.

One of the major advantages of the CRISP data mining process model is that it is highly replicable, which supports this study. The process is flexible and can be applied to different types of data and used in any business area. It also provides a uniform framework for guidelines and documentation.
