Data Analytics with Classification and Regression Trees

Classification and Regression Trees

Greetings, everyone. We discuss the purpose of classification and the advantages and disadvantages of different methods. We then demonstrate the process using actual data, building a classification tree with the “tree” package in R and then “pruning” it to reduce its complexity. The purpose of classification is to divide, or classify, data points into several classes.

For example, in the photo, we see several food morsels, and we might wish to classify them into meat and vegetable classes. In classification problems, the output variable is typically categorical rather than numerical. Input variables typically consist of characteristics of the data under study, which we use to classify each data point into different classes. Classification techniques boast several advantages.

Classification Criteria

First, they provide clear decision rules, guiding how to classify data. Second, the classification trees produced by these techniques show the most important classification criteria first, at the top of the tree, giving us important insight into which input variables are the most meaningful for classification. And third, classification algorithms tend to be fairly easy to work with: they do not require linear data and are not sensitive to outliers or missing data. We also face several disadvantages, primarily the large amount of data required for reliable classification, and the large amount of work that goes into collecting and processing such data sets.

Classification uses the recursive partitioning method to divide an n-dimensional space into several categories, or classes, as suggested by the data. Here, n equals the number of input, or predictor, variables. Classes should be internally homogeneous, meaning that all members within a class are similar, and externally heterogeneous, with each class different from the other classes. Popular packages in R for classification include rpart and tree; the tree package is the focus of this video. For our example data set, we selected the Iris data set from the UC Irvine Machine Learning Repository.

Physical Characteristics

The purpose of the data is to classify Iris plants into 3 different classes, based on certain physical characteristics. The data set includes observations of many different Iris plants, including their physical characteristics and classes. Physical observations include the length and width of the sepal, the small leaves under a flower, and the length and width of the flower petals. The data set is in the comma-separated-values, or CSV, format, as shown in the image. Note that the data does not include a header row,

which is the first row at the top of the data set with the descriptive names for the variables. The data includes 150 samples with 4 continuous numerical variables and one categorical variable, namely the class of the plant. For syntax and usage of all R functions in this video, we refer to the R manual. We start by reading in the CSV data using the read.csv() function in R, setting header = FALSE because the data does not include a header row.
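A minimal sketch of this step follows; the file name “iris.data” is an assumption based on the UCI repository’s naming:

# Read the Iris data; header = FALSE because the file has no header row
iris.df <- read.csv("iris.data", header = FALSE)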

Function in R

To make the data easier to work with, we add column names to each of the variables, using a column-naming function in R such as names(). The variable names include the four physical characteristics of Iris plants, namely the length and width of the sepals and petals, as well as the variable containing the class of the Iris plant, which we call “class”. We ask R to list the head, or first few lines of data, using the head() function. In this case, we ask R to list the first 3 rows.
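A sketch of these two steps follows; the column names other than “class” are our own illustrative choices:

# Assign descriptive column names (all but "class" are illustrative)
names(iris.df) <- c("sepal.length", "sepal.width",
                    "petal.length", "petal.width", "class")

# List the first 3 rows of the data
head(iris.df, 3)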

Classification techniques benefit from large data sets for robust classification. We can check the number of records for each class by invoking the cross-tabulation function in R, called xtabs(). The xtabs() function asks for the cross-classifying variables, as well as the data set. In our case, we state our variable “class” as the cross-classifying variable and iris.df as the data set. Executing the xtabs() function, we see that the data has 50 records of each class. To generate our classification tree, we install the “tree” package.
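In code, the cross-tabulation and one-time package installation might look like this sketch:

# Cross-tabulate on the class variable; expect 50 records per class
xtabs(~ class, data = iris.df)

# Install the "tree" package (one-time step)
install.packages("tree")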

Input Variables

The tree() function, part of the “tree” package, takes as inputs the cross-classifying variable, in this case “class”, as well as the data set, in this case iris.df. We insert input variables to the right of the tilde (~) delimiter, which separates the output variable from the input variables. In our case, we wish to include all of the input variables, so we enter a period (.) to signify that R should use all of the input variables in the data. We assign the results of the tree classification function to a variable we arbitrarily call iris.tree1.
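A sketch of the call just described:

library(tree)   # load the tree package

# Fit a classification tree: "class" is the output variable, and the
# period includes all remaining columns as input variables
iris.tree1 <- tree(class ~ ., data = iris.df)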

From this name, you can already guess that we will have an iris.tree2, and you would be right. After we install the tree package, load it using the library() function, and execute the tree() function, we ask R for summary statistics using the summary() function. The summary output shows us the result of the classification tree function. First, it tells us that R used only three of the four variables to build the tree; the fourth variable, sepal width, was not important to the tree. It also tells us that the misclassification error rate was 4/150, or about 2.7%. We invoke the plot() function to generate a plot of the tree.
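A sketch of the summary call, with the results reported above noted as comments:

# Summarize the fitted tree; per the results above, R used 3 of the
# 4 input variables (sepal width was dropped) and misclassified
# 4 of 150 observations, about 2.7%
summary(iris.tree1)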

Classification Trees

We execute the text() function to add the explanatory text to the tree. This slide shows an enlargement of the tree, with the explanatory text added. The tree shows the “petal length < 2.45” classification criterion at the very top of the tree, which suggests that this criterion is the most important. Criteria near the bottom of the tree are applied last, which suggests they are not as important as the others. We can “prune” the tree to reduce its complexity. First, we need to identify which branches we need to prune. When pruning trees in nature, we prune the small branches, the ones with the fewest leaves. Similarly, when pruning classification trees, we remove the branches associated with the fewest data points.
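A sketch of the two plotting calls described above:

# Draw the tree structure, then add the explanatory split labels
plot(iris.tree1)
text(iris.tree1)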

We use the cv.tree() function in R to run cross-validation, or cv, experiments to find the number of misclassifications as a function of the cost-complexity parameter k. In other words, what is the best number of terminal nodes to include? From the cv.tree() output, we can see that the k value appears to increase dramatically beyond a tree size of 3. This finding suggests a tree size of 3. We acknowledge that this process is somewhat arbitrary. We invoke the prune.tree() function, part of the “tree” classification package, to prune the tree. In our case, we use the misclassification-based option of prune.tree(), available directly as prune.misclass().
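A sketch of the cross-validation step; the seed is our own addition to make the random cross-validation folds repeatable:

set.seed(1)   # assumption: any fixed seed, for repeatable folds

# Cross-validate the tree, counting misclassifications at each size
iris.cv <- cv.tree(iris.tree1, FUN = prune.misclass)

# Inspect candidate tree sizes, misclassification counts, and k values
iris.cv$size
iris.cv$dev
iris.cv$k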

Misclassification Error

Previously, we had executed the tree classification function and placed the results in a variable we called iris.tree1, and we enter this variable into the prune.misclass() function. We also enter the so-called “best” value of 3 to prune the tree, from our earlier work in cross-validation. We place the output in a variable we call iris.tree2 and ask R for summary statistics. The summary shows us that pruning the tree only increased the misclassification error rate from 4 in 150 to 6 in 150, or from about 2.7% to 4.0%.
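A sketch of the pruning step and its summary:

# Prune to the "best" size of 3 terminal nodes found by cross-validation
iris.tree2 <- prune.misclass(iris.tree1, best = 3)

# Expect the misclassification rate to rise only from 4/150 to 6/150
summary(iris.tree2)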

Here, we see Before and After versions of the tree, demonstrating the difference pruning can make. The left side shows the original classification tree, iris.tree1. The original tree includes so many branches that it is difficult to see them all. The right side shows the pruned classification tree, iris.tree2. We notice that the pruned tree is significantly simpler, with fewer branches.
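The side-by-side comparison can be reproduced with a sketch like the following; par(mfrow = ...) is one common way to place two plots together:

par(mfrow = c(1, 2))   # two plots side by side

plot(iris.tree1); text(iris.tree1)   # original tree
title("Before pruning")

plot(iris.tree2); text(iris.tree2)   # pruned tree
title("After pruning")

par(mfrow = c(1, 1))   # restore the default layout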

Demonstrated The Technique

And we pay less than 1.4 percentage points in additional classification error for this decrease in complexity. That is a bargain, if you ask me, which is why pruning is so popular in classification tree methods. In this video, we discussed the purpose, advantages, and disadvantages of classification methods. We demonstrated the technique using sample data in a typical classification scenario. We developed the classification tree using the tree package, reduced its complexity by pruning the tree, and compared the results to the original tree, noting that we achieved a significantly less complex tree for little additional error.
