What’s the best process for data mining? In Data Science you hear this question often, and the answer varies a lot depending on who you ask. Some say it depends on the data; others say that no single process can be applied to the whole field. While these are valid concerns, there is one methodology that is tried and tested, and robust enough to accommodate most, if not all, data mining projects. This is the CRISP-DM methodology, or the Cross-Industry Standard Process for Data Mining.
The cycle of CRISP-DM
As the diagram below shows, CRISP-DM is a cycle that enforces constant supervision and evaluation of the model the Data Scientist created. Let’s break down each of the building blocks of the cycle to better understand how the process works.
The first step in every project is understanding the business needs, and a data mining project is no different. In this stage, we need to define the goals, the business value, and the intended usage of the data. Among others, we need to answer the following questions:
- What are the objectives of the project?
- What is the current situation?
- What are the risks the business is facing? What contingencies are there to counteract these risks?
- What is the expected effect of the project? What are the success criteria for the project?
- What assumptions are we making? How do we confirm or contradict them?
These questions are crucial for understanding how the project will influence the business and whether there is any point in starting a data mining project at all. Once consensus is reached on them, we can move to the next step.
It is commonly said that cleaning and exploring the data takes over 50% of the time in a Data Science project, and the CRISP-DM Cycle reflects this by dedicating two steps solely to understanding and preparing the data. The next step in the CRISP-DM Cycle, after understanding the business, is to understand the data. In this step we are expected to answer the following questions:
- Where does the data come from? What are the data sources?
- How can we best describe the data at a high level? What is the metadata of our datasets?
- What are the relations between attributes and tables?
- How is our data quality?
Answering these questions gives us a clear picture of whether the Business Understanding step took everything into account, whether our targets are realistic and whether we need more (or possibly less) data for the task to succeed. If there is no clear consensus on these questions, or if we see a risk we have not discovered in the previous step, we need to step back and discuss the Business Understanding once more.
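The questions about metadata and data quality can be approached with a quick profile of each attribute. The following is a minimal sketch using only the Python standard library; the `profile` function and the `customers` records are hypothetical and stand in for whatever sources the project actually has.

```python
def profile(records):
    """Summarise each attribute: missing values, distinct values, observed types."""
    attributes = sorted({key for row in records for key in row})
    summary = {}
    for attr in attributes:
        values = [row.get(attr) for row in records]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v is not None and v != ""]
        summary[attr] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return summary

# Hypothetical sample records; real data would come from the project's sources.
customers = [
    {"id": 1, "age": 34, "country": "NL"},
    {"id": 2, "age": None, "country": "NL"},
    {"id": 3, "age": "41", "country": "DE"},
    {"id": 4, "age": 29, "country": ""},
]

report = profile(customers)
for attr, stats in report.items():
    print(attr, stats)
```

In this sample the profile shows that `age` mixes integers and strings and has a missing value, which is exactly the kind of finding that should feed into the preparation step or trigger another conversation with the business.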
Once both the data and the business are clear and understood, it’s time to take the gathered data and prepare it for modelling. In this step, we will select the data that we will use for analysis and then clean it. The main questions to answer:
- Did we select the right data?
- Do we need to generate new data, new records?
- Do we need to derive new attributes?
This step is more methodical than the previous ones, as in this step we prepare the already gathered and understood data. However, to move on to the next step, we need to be confident that we have everything and what we have is indeed what we need. We might need to generate completely new records from the available records or derive brand new attributes for the existing ones. All this depends on what’s needed for the modelling.
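To make selecting, cleaning and deriving more concrete, here is a small sketch in plain Python. The attribute names (`signup`, `spend`, `country`) and the derived attributes (`tenure_days`, `spend_per_day`) are invented for illustration; what actually needs to be derived depends on the modelling.

```python
from datetime import date

def prepare(records, today):
    """Select usable records, clean them, and derive new attributes."""
    prepared = []
    for row in records:
        # Selection: skip records that lack the attributes we rely on.
        if row.get("signup") is None or row.get("spend") is None:
            continue
        # Cleaning: normalise text and coerce numeric strings to floats.
        country = (row.get("country") or "unknown").strip().upper()
        spend = float(row["spend"])
        # Derivation: attributes that did not exist in the source data.
        tenure_days = (today - row["signup"]).days
        prepared.append({
            "country": country,
            "spend": spend,
            "tenure_days": tenure_days,
            "spend_per_day": spend / max(tenure_days, 1),
        })
    return prepared

# Hypothetical raw records, for illustration only.
raw = [
    {"signup": date(2023, 1, 1), "spend": "120.0", "country": " nl "},
    {"signup": None, "spend": 50, "country": "DE"},
    {"signup": date(2023, 3, 2), "spend": 30, "country": None},
]
clean = prepare(raw, today=date(2023, 4, 1))
```

Dropping incomplete records is only one possible choice; imputing the missing values or going back to the data sources may be more appropriate, depending on what the Data Understanding step revealed.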
We have understood the business, we have gathered the data and cleaned it. Now it’s time to jump into the data mining task! It’s time to do some modelling. First, we need to select a technique: are we going to do clustering or rule finding? Will we be using a neural network or a decision tree? The main questions we need to take a look at are:
- What technique best fits our data and will lead to the insights we defined?
- What assumptions are we making?
- How will we test the accuracy/feasibility of the models?
Once these questions are clear and answered, we can go ahead and build the models themselves. The resulting models need to be run, tested and assessed. If we didn’t select the right data or we didn’t clean it the right way, we need to return to the previous step with our new experiences. Otherwise, we are free to proceed with the evaluation of the model.
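One common way to test the accuracy of a model is to hold back part of the data and compare the model against a trivial baseline. The sketch below assumes a single numeric feature and a binary label, and uses a one-level decision tree (a "stump") as the candidate model; the data is synthetic and every function name is illustrative only.

```python
from collections import Counter

def accuracy(predict, rows):
    """Share of rows whose label the model predicts correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def fit_stump(train):
    """One-level decision tree: keep the threshold with the best training accuracy."""
    best = None
    for threshold in sorted({x for x, _ in train}):
        for above, below in ((1, 0), (0, 1)):
            model = lambda x, t=threshold, a=above, b=below: a if x >= t else b
            score = accuracy(model, train)
            if best is None or score > best[0]:
                best = (score, model)
    return best[1]

# Synthetic, noise-free data: the label is 1 when the feature is at least 50.
data = [(x, int(x >= 50)) for x in range(100)]
# Deterministic interleaved split so the example is reproducible;
# a real project would typically shuffle or stratify instead.
train, test = data[::2], data[1::2]

baseline_acc = accuracy(fit_majority(train), test)
stump_acc = accuracy(fit_stump(train), test)
print(f"baseline={baseline_acc:.2f} stump={stump_acc:.2f}")
```

If the candidate model does not clearly beat the baseline on the held-out data, that is a hint to revisit the technique or to return to Data Preparation.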
We are done: our models are working and we are ready to present the results to the business. However, we still need to decide whether what we achieved actually fits the plan. Questions to watch out for:
- How do our results relate to the business criteria?
- Did we achieve the success criteria?
- Did we confirm our assumptions? Did we contradict them?
- What is the next step? Can the models be deployed?
If all these questions are answered, everyone is satisfied and the business decides to deploy the models, we need to keep two things in mind. First, we move on to the deployment phase and make sure that deployment goes well. Second, we need to take the new “as-is” situation back to the business and begin the cycle anew, albeit with less effort if there is no change in the business understanding. If the answers reached at this stage are not satisfactory, we need to step back to Business Understanding and restart the process without deployment.
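Checking results against the success criteria becomes much easier when those criteria were written down as measurable targets during Business Understanding. A minimal sketch, assuming each criterion is either a minimum or a maximum for a named metric; the metric names and numbers here are made up.

```python
def evaluate(results, criteria):
    """Compare measured results with the success criteria agreed with the business."""
    report = {}
    for name, (direction, target) in criteria.items():
        value = results[name]
        met = value >= target if direction == "min" else value <= target
        report[name] = {"value": value, "target": target, "met": met}
    return report

# Hypothetical criteria and results, for illustration only.
criteria = {
    "recall": ("min", 0.80),            # must be at least 0.80
    "false_alarm_rate": ("max", 0.05),  # must be at most 0.05
}
results = {"recall": 0.84, "false_alarm_rate": 0.07}

report = evaluate(results, criteria)
ready_to_deploy = all(item["met"] for item in report.values())
print(report, ready_to_deploy)
```

Here one criterion is missed, so the sketch reports that the model is not ready to deploy. The numbers alone do not settle the matter, though: whether to deploy remains a decision for the business.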
The model is ready and the business is satisfied, so let’s make sure the company can benefit from the results. It’s time to deploy our data mining model. A couple of things to keep in mind:
- How are we going to deploy the model?
- How are we going to maintain the deployed model?
- How do we know if everything works as intended?
If we can answer all these questions with confidence, the model can be deployed and the business can start to enjoy its benefits. Make sure to document the deployment thoroughly and summarize everything you have learned.
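One possible way to tell whether a deployed model works as intended is to compare its live accuracy with the level measured during evaluation. The sketch below assumes that the true outcomes eventually become known; the function name, the tolerance and the minimum sample size are all assumptions to be tuned per project.

```python
def needs_attention(recent_outcomes, expected_accuracy, tolerance=0.05, min_samples=20):
    """Flag the deployed model when live accuracy falls below the evaluated level.

    recent_outcomes is a list of booleans: True where the prediction was correct.
    """
    if len(recent_outcomes) < min_samples:
        return False  # too little evidence to judge either way
    live_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return live_accuracy < expected_accuracy - tolerance

# Hypothetical monitoring windows, for illustration only.
healthy = [True] * 18 + [False] * 2    # 90% correct
degraded = [True] * 14 + [False] * 6   # 70% correct
too_few = [False] * 5                  # not enough samples yet

print(needs_attention(healthy, expected_accuracy=0.9))
print(needs_attention(degraded, expected_accuracy=0.9))
print(needs_attention(too_few, expected_accuracy=0.9))
```

A flag like this is a natural trigger for starting the next lap of the cycle earlier than planned.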
Congratulations! You have now completed one lap of the CRISP-DM cycle. You are ready to start from the Business Understanding phase once more and ensure that the business is continuously evolving.
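The phase order and the step-back rules described in this article can be summarised as a small lookup table. This is only a sketch of the cycle as presented here, not an official definition; the names `PHASES`, `STEP_BACK` and `next_phase` are illustrative.

```python
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modelling",
    "Evaluation",
    "Deployment",
]

# Where to step back to when a phase's questions cannot be answered satisfactorily.
STEP_BACK = {
    "Data Understanding": "Business Understanding",
    "Modelling": "Data Preparation",
    "Evaluation": "Business Understanding",
}

def next_phase(current, satisfied):
    """Return the phase to work on next, given whether the current one succeeded."""
    if not satisfied:
        # Phases without a described step-back are repeated until resolved.
        return STEP_BACK.get(current, current)
    index = PHASES.index(current)
    return PHASES[(index + 1) % len(PHASES)]  # Deployment wraps around to a new lap

print(next_phase("Modelling", satisfied=False))
print(next_phase("Deployment", satisfied=True))
```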