You are viewing the RapidMiner Studio documentation for version 10.0.

Local Polynomial Regression (RapidMiner Studio Core)

Synopsis

This operator generates a local polynomial regression model from the given ExampleSet. Regression is a technique used for numerical prediction.

Description

The Local Polynomial Regression operator performs a local regression. When the label value for a point in the data space is requested, the operator searches the local neighborhood of that point, using the distance measure specified by the numerical measure parameter. The data points in this neighborhood are then used to fit a polynomial of the specified degree by weighted least squares. The value of this polynomial at the requested point is returned as the result.

During the fit, the neighborhood's data points are weighted by their distance to the requested point, again using the distance measure specified by the numerical measure parameter. The weight is calculated from the distance using the kernel smoother specified by the smoothing kernel parameter, and is then included in the least squares optimization. If the training ExampleSet contains a weight attribute, the distance-based weight is multiplied by the example's weight.

If the use robust estimation parameter is set to true, a Generate Weight (LPR) operator is applied first, with the same parameters as the subsequent Local Polynomial Regression. As a result, outliers are down-weighted so that they no longer distort the least squares fit. If the weighting should use different settings, the Generate Weight (LPR) operator can be used as a preprocessing step instead of this parameter.
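The prediction steps described above (neighborhood search, kernel weighting, weighted least squares fit, evaluation at the requested point) can be illustrated with a minimal Python sketch for a single numeric attribute. This is not the operator's implementation. The function names, the tricube kernel, the scaling of distances by the farthest neighbor, and the way the ridge term is added to every coefficient are all assumptions made for illustration.

```python
def tricube(u):
    """Assumed example kernel: weight 1 at distance 0, falling to 0 at u >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def local_poly_predict(xs, ys, x0, k=5, degree=1, ridge=1e-9, example_weights=None):
    """Hypothetical helper: predict the label at x0 from the k nearest examples."""
    # 1. Neighborhood search: the k examples closest to the requested point.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]

    # 2. Kernel weights from distances, scaled so the farthest neighbor
    #    lies just inside the kernel's support (an assumption of this sketch).
    farthest = max(abs(xs[i] - x0) for i in order)
    h = farthest * 1.0001 if farthest > 0 else 1.0
    w = [tricube((xs[i] - x0) / h) for i in order]
    if example_weights is not None:
        # Distance-based weight multiplied by the example's own weight.
        w = [wi * example_weights[i] for wi, i in zip(w, order)]

    # 3. Weighted least squares on powers of (x - x0):
    #    (X^T W X + ridge * I) beta = X^T W y
    p = degree + 1
    rows = [[(xs[i] - x0) ** j for j in range(p)] for i in order]
    a = [[sum(wi * r[j] * r[l] for wi, r in zip(w, rows)) + (ridge if j == l else 0.0)
          for l in range(p)] for j in range(p)]
    b = [sum(wi * r[j] * ys[i] for wi, r, i in zip(w, rows, order)) for j in range(p)]
    beta = solve(a, b)

    # 4. The polynomial is centered at x0, so its value there is the constant term.
    return beta[0]
```

On data that follows a polynomial of the fitted degree exactly, this sketch reproduces the underlying function at the requested point up to the small bias introduced by the ridge term.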

Since this is a local method, the computational cost of training is minimal: the examples are only stored in a structure that allows a fast neighborhood search at application time. Because all calculations are performed at application time, applying the model is slower than applying, for example, an SVM, Linear Regression or Naive Bayes model. The actual cost depends on the number of training examples and the number of attributes. If a degree higher than 1 is used, the calculations take much longer because the polynomial expansion must be computed implicitly.
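To give a feeling for why higher degrees are more expensive, the following Python sketch counts the coefficients of a full polynomial expansion. It assumes that the expansion contains every monomial up to the given degree, including cross terms between attributes. The source does not state which expansion the operator uses, so the numbers are an upper-bound illustration and `expansion_size` is a hypothetical helper.

```python
from math import comb


def expansion_size(num_attributes, degree):
    """Number of monomials of total degree <= degree in num_attributes variables.

    Assumes a full expansion with all cross terms; this is the standard
    combinatorial count C(n + d, d), not a documented property of the operator.
    """
    return comb(num_attributes + degree, degree)


# Growth of the number of coefficients for 10 attributes.
for d in (1, 2, 3):
    print(d, expansion_size(10, d))
```

Under this assumption, the size of each local least squares problem grows quickly with both the degree and the number of attributes.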

Regression is a technique used for numerical prediction. It is a statistical method that attempts to determine the strength of the relationship between one dependent variable (i.e. the label attribute) and a series of other changing variables known as independent variables (the regular attributes). Just as Classification is used for predicting categorical labels, Regression is used for predicting a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience, or the potential sales of a new product given its price. Regression is often used to determine how much specific factors, such as the price of a commodity, interest rates, or particular industries or sectors, influence the price movement of an asset.

Input

  • training set (Data Table)

    This input port expects an ExampleSet. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. You may therefore have to apply the Nominal to Numerical operator before using this operator.

Output

  • model (Model)

    The regression model is delivered from this output port. This model can now be applied on unseen data sets.

  • example set (Data Table)

    The ExampleSet that was given as input is passed without any modifications to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • degree: This parameter specifies the degree of the locally fitted polynomial. Keep in mind that a degree higher than 2 greatly increases calculation time and is likely to cause overfitting. Range: integer
  • ridge_factor: This parameter specifies the ridge factor, which is used to penalize high coefficients. Increasing the ridge factor can help to avoid overfitting. Range: real
  • use_robust_estimation: If this parameter is set to true, the examples are re-weighted in order to down-weight outliers. Range: boolean
  • use_weights: This parameter indicates whether example weights should be used if they are present in the given ExampleSet. Range: boolean
  • iterations: This parameter is only available when the use robust estimation parameter is set to true. It specifies the number of iterations performed for the weight calculation. Range: integer
  • numerical_measure: This parameter specifies the numerical measure used for distance calculation. Range: selection
  • neighborhood_type: This parameter determines which type of neighborhood should be used. Range: selection
  • k: This parameter is only available when the neighborhood type parameter is set to 'Fixed Number'. It specifies the number of neighbors in the neighborhood. Exactly k examples are always returned, regardless of the local density. Range: integer
  • fixed_distance: This parameter is only available when the neighborhood type parameter is set to 'Fixed Distance'. It specifies the size of the neighborhood. All points within this distance are added. Range: real
  • relative_size: This parameter is only available when the neighborhood type parameter is set to 'Relative Number'. It specifies the size of the neighborhood relative to the total number of examples. For example, a value of 0.04 includes 4% of the data points in the neighborhood. Range: real
  • distance: This parameter is only available when the neighborhood type parameter is set to 'Distance but at least'. It specifies the size of the neighborhood. All points within this distance are added. Range: real
  • at_least: This parameter is only available when the neighborhood type parameter is set to 'Distance but at least'. If the neighborhood contains fewer examples than this number, the distance is increased until this number is reached. Range: integer
  • smoothing_kernel: This parameter determines which kernel type should be used to calculate the weights of distant examples. Range: selection
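
The four neighborhood types described in the parameters above can be sketched in Python as follows. The function `select_neighborhood` is hypothetical and only mirrors the parameter descriptions. Treating the distance limits as inclusive and rounding the relative size up are assumptions, since the source does not specify these details.

```python
import math


def select_neighborhood(distances, neighborhood_type, *, k=5, fixed_distance=1.0,
                        relative_size=0.1, distance=1.0, at_least=5):
    """Return the indices of the examples in the neighborhood, nearest first.

    `distances` holds the distance of every training example to the
    requested point, computed with the chosen numerical measure.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])

    if neighborhood_type == "Fixed Number":
        # Always the k nearest examples, regardless of local density.
        return order[:k]
    if neighborhood_type == "Fixed Distance":
        # Every example within the given distance (inclusive: an assumption).
        return [i for i in order if distances[i] <= fixed_distance]
    if neighborhood_type == "Relative Number":
        # A fraction of all examples (rounded up: an assumption).
        count = math.ceil(relative_size * len(distances))
        return order[:count]
    if neighborhood_type == "Distance but at least":
        # Every example within the distance, widened until at_least are included.
        inside = [i for i in order if distances[i] <= distance]
        return inside if len(inside) >= at_least else order[:at_least]
    raise ValueError(f"unknown neighborhood type: {neighborhood_type!r}")
```

Note that 'Fixed Distance' can return an empty neighborhood in sparse regions of the data space, whereas 'Distance but at least' falls back to the nearest examples in that case.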

Tutorial Processes

Applying the Local Polynomial Regression operator on the Polynomial data set

The 'Polynomial' data set is loaded using the Retrieve operator. The Split Data operator is applied to it to split the ExampleSet into training and testing data sets. The Local Polynomial Regression operator is applied to the training data set. The degree parameter is set to 3, the neighborhood type parameter is set to 'Relative Number' and the relative size parameter is set to 0.5. The regression model generated by the Local Polynomial Regression operator is applied to the testing data set using the Apply Model operator.

The labeled data set generated by the Apply Model operator is provided to the Performance (Regression) operator, with the absolute error and prediction average parameters set to true. The resulting Performance Vector therefore contains the absolute error and the prediction average for the labeled data set. The absolute error is calculated by adding up the absolute differences between the predicted values and the actual values of the label attribute, and dividing this sum by the total number of predictions. The prediction average is calculated by adding all actual label values and dividing this sum by the total number of examples. You can verify this from the results in the Results Workspace.
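
The two performance criteria used in this process can be reproduced by hand with a short Python sketch. The function names and the sample values are made up for illustration and do not come from the 'Polynomial' data set. The calculations follow the description above.

```python
def absolute_error(labels, predictions):
    """Mean of the absolute differences between predicted and actual label values."""
    return sum(abs(p - y) for y, p in zip(labels, predictions)) / len(labels)


def prediction_average(labels):
    """Average of the actual label values, as described for this criterion."""
    return sum(labels) / len(labels)


# Made-up example values, not taken from the 'Polynomial' data set.
labels = [1.0, 2.0, 4.0]
predictions = [1.5, 2.0, 3.0]

print(absolute_error(labels, predictions))  # (0.5 + 0.0 + 1.0) / 3
print(prediction_average(labels))           # (1.0 + 2.0 + 4.0) / 3
```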