# What Is Regression In Data Mining?

• What is Regression?
• Regression in Data Mining
• Types of Regression techniques
• Application of Regression
• Difference between Regression, Classification and Clustering in Data Mining
• Overfitting
• Evaluating a Regression model
• Regression Algorithms in Oracle Data Mining

# Types of Regression techniques

• Linear regression models the relationship between a dependent (target) variable and one or more independent variables with a straight line.
• For a linear relationship, the best-fit line is found with the least squares method, which minimizes the sum of the squared residuals.
• Polynomial regression expresses the relationship between the independent variable x and the dependent variable y as an nth-degree polynomial in x.
• Logistic regression is a statistical technique that uses past observations in a data set to predict a binary outcome, such as yes or no.
• A logistic regression model analyses the relationship between one or more independent variables and a dependent variable in order to predict the latter.
• Ridge regression is a well-known regularised linear regression technique. It is used to analyze regression data that suffer from multicollinearity.
• Multicollinearity is the presence of a strong linear relationship between two or more independent variables.
• Lasso is a type of linear regression that uses shrinkage.
• Shrinkage pulls values towards a central point, such as the mean. In lasso, it is the coefficient estimates that are shrunk, towards zero, and some of them become exactly zero. The lasso procedure therefore encourages simple, sparse models.
• Decision tree / Regression tree — A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into increasingly smaller subsets while incrementally developing the associated decision tree. The end result is a tree with decision nodes and leaf nodes. Eg. Civil planning
• Support vector regression — Support Vector Regression (SVR) is a supervised learning technique for predicting continuous values. It is based on the same principle as Support Vector Machines (SVMs). Eg. Identifying gene classifications, people with genetic disorders, and other biological issues.
• Random forest regression — Random forest regression is a supervised learning technique that uses the ensemble learning approach. Ensemble learning combines the predictions of several models (here, many decision trees) to produce a more accurate forecast than a single model. Eg. Stock market prediction, Product recommendation
• ElasticNet regression — Elastic net is a type of regularised linear regression that combines two well-known penalties, the L1 and L2 penalty functions. It extends linear regression by adding these regularization penalties to the loss function that is minimized during training. Eg. Analysis of genomic data
• Standard multiple regression — This is the most common type of multiple regression analysis. All of the independent variables are entered into the equation at the same time, and the predictive ability of each independent variable is assessed. Eg. Predicting blood pressure from independent factors such as height, weight, age, and weekly activity hours.
• Stepwise multiple regression — Stepwise regression examines which predictors are most effective at predicting the outcome: the stepwise model ranks the predictor variables in order of relevance and then selects a meaningful subset. The regression equation is built in “steps”, and not every variable necessarily ends up in the final regression model. Eg. Multiobjective optimization design of an electric machine
• Hierarchical regression — Hierarchical regression can be used to test whether the variables of interest explain a statistically significant amount of variance in the dependent variable after controlling for all other factors. It is a framework for model comparison rather than a statistical procedure in its own right. Eg. Health service research
• Set-wise regression — The regression model is built iteratively, with the explanatory variables for the final model chosen step by step. Candidate explanatory variables are added or removed incrementally, and each iteration is followed by a test of statistical significance. Eg. Using case/control or parent data to assess the comparative impact of genetic variants within a gene
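
To make the link between least squares, ridge, lasso and elastic net concrete, here is a minimal sketch for a single predictor, written in plain Python. The function name `fit_line`, the penalty parameters `l1` and `l2`, and the sample data are assumptions made for illustration. The objective shown in the docstring is one common parametrization, and libraries may scale the penalties differently.

```python
def fit_line(xs, ys, l1=0.0, l2=0.0):
    """Fit y ~ intercept + slope * x for one predictor.

    Minimizes 0.5 * sum(residual**2) + 0.5 * l2 * slope**2 + l1 * abs(slope).
    l1 = l2 = 0 is ordinary least squares, l2 > 0 is ridge, l1 > 0 is lasso,
    and both > 0 is elastic net. The intercept is not penalized.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: sxy measures co-variation, sxx the spread of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx + l2 == 0:
        raise ValueError("x values are all identical")
    # Lasso part: soft-threshold sxy, which can set the slope to exactly zero.
    shrunk = max(abs(sxy) - l1, 0.0)
    # Ridge part: a larger denominator pulls the slope towards zero.
    slope = (shrunk if sxy >= 0 else -shrunk) / (sxx + l2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical sample data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = fit_line(xs, ys)                       # plain least squares
ridge = fit_line(xs, ys, l2=10.0)            # slope shrunk, never exactly zero
lasso = fit_line(xs, ys, l1=25.0)            # penalty large enough to zero the slope
elastic = fit_line(xs, ys, l1=5.0, l2=10.0)  # both penalties combined

for name, (b0, b1) in [("ols", ols), ("ridge", ridge),
                       ("lasso", lasso), ("elastic net", elastic)]:
    print(f"{name}: intercept={b0:.3f}, slope={b1:.3f}")
```

With more than one predictor the same ideas apply, but the closed form above no longer holds for lasso and elastic net, which are usually fitted with iterative solvers.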

# Application of Regression

1. Drug Response modeling
2. Planning for business and marketing
3. Forecasting or financial predictions
4. Observing and analyzing patterns or trends
5. Modeling of the environment
6. The pharmacological response over time
7. Calibration of statistical data
8. Physiochemical relationship
9. Satellite image analysis
10. Crop yield estimation

# Overfitting

• Cross-validation — Cross-validation is an effective tool for avoiding overfitting. The idea is to create several small train-test splits from the initial training data and to use these splits to fine-tune the model. In standard k-fold cross-validation, the data is divided into k subsets, or folds; the model is trained on k-1 folds and evaluated on the remaining one, and this is repeated once for each fold.
• Data augmentation — Data augmentation, which is less expensive than collecting extra training data, is an alternative to the former. If you are unable to acquire new data on a regular basis, you can make the available data sets look varied. Data augmentation slightly changes the appearance of the sample data each time the model processes it. This makes each sample look unique to the algorithm and prevents it from memorizing the specific characteristics of the training data.
• Regularization — Regularization is a strategy for reducing the model’s complexity. This is accomplished by adding a penalty term to the loss function, which helps to resolve overfitting issues.
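
The k-fold procedure described above can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `fit_least_squares`, `cross_val_mse`) and the sample data are hypothetical. The sketch uses contiguous folds and a one-predictor least squares line as the model; in practice the data is usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def cross_val_mse(xs, ys, k):
    """Train on k-1 folds, score on the held-out fold, once per fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        intercept, slope = fit_least_squares([xs[i] for i in train_idx],
                                             [ys[i] for i in train_idx])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return scores


# Hypothetical data: a noisy line.
xs = [float(i) for i in range(10)]
ys = [1.2, 2.9, 5.1, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9]
fold_scores = cross_val_mse(xs, ys, k=5)
print("MSE per fold:", fold_scores)
print("mean MSE:", sum(fold_scores) / len(fold_scores))
```

A large gap between the error on the training folds and the error on the held-out folds is a typical sign of overfitting.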

# Evaluating a Regression model

1. R Square or adjusted R square — R-squared is the proportion of the variance in the dependent variable that the model explains. Adjusted R-squared is a variant of R-squared that takes into account the number of predictor variables. It rises when an additional term improves the model more than would be expected by chance, and it falls when a predictor improves the model by less than would be expected by chance.
2. Root Mean Square Error (RMSE) or Mean Square Error (MSE) — MSE is the average of the squared residuals (prediction errors), and RMSE is its square root, which can be read as the standard deviation of the residuals. Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it indicates how tightly the data is clustered around the line of best fit.
3. Mean Absolute Error (MAE) — MAE is another evaluation metric for regression models. It is the average of the absolute values of the individual prediction errors over all instances in the test set.
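
As an illustration, the metrics above can be computed directly from their standard definitions. The function name `regression_metrics`, its `n_predictors` argument and the sample values are assumptions made for this sketch.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return MSE, RMSE, MAE, R-squared and adjusted R-squared as a dict."""
    n = len(y_true)
    if n != len(y_pred) or n - n_predictors - 1 <= 0:
        raise ValueError("need matching lengths and n > n_predictors + 1")
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total sum of squares

    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes every additional predictor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": r2,
        "adj_r2": adj_r2,
    }


# Hypothetical observed values and predictions from a one-predictor model.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
for name, value in regression_metrics(y_true, y_pred, n_predictors=1).items():
    print(f"{name}: {value:.4f}")
```

Because RMSE squares the errors before averaging, it penalizes large errors more heavily than MAE does, so comparing the two gives a rough idea of whether a few large misses dominate the error.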

# Regression algorithms in Oracle Data Mining

• Generalized Linear Models (GLMs): GLMs are a widely used statistical approach to linear modeling. Oracle Data Mining uses GLM for regression and binary classification. GLM provides a wide range of coefficient and model statistics, as well as row diagnostics, and it also supports confidence bounds. Eg. The logistic regression model of GLM is used for predicting customer affinity and for agricultural weather modeling.
• Support Vector Machines (SVMs): SVM is a powerful, state-of-the-art technique for linear and nonlinear regression. Oracle Data Mining uses SVM for regression as well as for other mining functions. SVM regression supports the Gaussian kernel for nonlinear regression and the linear kernel for linear regression. SVM also supports active learning. Eg. Facial recognition, Speech recognition, Text classification
