Commit

upd
majidrouhanintnu committed Oct 20, 2024
1 parent 8ea9056 commit e98e91a
Showing 37 changed files with 655 additions and 324 deletions.
Binary file added data/diabetes_X.xlsx
Binary file not shown.
Binary file added data/diabetes_y.xlsx
Binary file not shown.
Binary file modified docs/crisp.pptx
Binary file not shown.
42 changes: 17 additions & 25 deletions session_1/README.md
@@ -1,6 +1,6 @@
# Predictive analytics

## Prerequisites
## Prerequisites

- You should already have basic knowledge about:
- [NumPy](https://numpy.org/)
@@ -11,7 +11,7 @@

- Before working through the code examples, make sure the above packages are installed in your Python environment.

## Useful resources
## Useful resources 📖

- [What is CRISP DM?](https://www.datascience-pm.com/crisp-dm-2/)

@@ -25,59 +25,51 @@

- Textbook: Python for realfag, ch. 10, 17.

## Objectives
## Objectives 🎯

- Get to know your data (data understanding)
- Using pandas to handle missing data in a dataset (data preparation)
- Applying simple regression models to a dataset (modelling)

## Introduction
## Introduction 📖

Predictive analytics is the practice of forecasting future results and performance using statistics and modelling approaches.

With [predictive programming](https://365datascience.com/tutorials/python-tutorials/predictive-model-python/), you collect and analyze historical data to recognize patterns. A model is trained on this data, allowing it to forecast future results when presented with new information.
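
As a minimal sketch of that idea, the snippet below fits a scikit-learn `LinearRegression` to a handful of made-up historical observations and then forecasts a value for a new input. The data and the variable names (`spend`, `sales`) are invented for illustration and are not taken from the course datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend (feature) and sales (target).
# The values follow sales = 2 * spend + 1 exactly, so the fit is easy to verify.
spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
sales = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Train the model on the historical observations.
model = LinearRegression()
model.fit(spend, sales)

# Forecast the target for a new, unseen input.
new_spend = np.array([[6.0]])
forecast = model.predict(new_spend)
print("Forecast for spend=6:", forecast[0])
```

Real data is rarely this clean, which is why the later phases of CRISP-DM spend so much effort on preparation and evaluation.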

Cross-industry standard process for data mining ([CRISP DM](https://www.datascience-pm.com/crisp-dm-2/)):

<img src="./images/crisp_1.png" alt="CRISP-DM Process"/>
<img src="./images/crisp_1.png" alt="CRISP-DM Process" width="400"/> \
<a href="https://www.datascience-pm.com/crisp-dm-2/">Figure 1: CRISP-DM Process</a>


- Business Understanding - What does the business need?
- The Business Understanding phase centers on comprehending the project's objectives and needs.
- **Data Understanding** - What data do we have / need? Is it clean?
- The data understanding phase emphasizes identifying, gathering, and analyzing datasets that will aid in achieving the project's objectives.
- In this lecture we look into various **libraries** and **sources** for finding datasets, along with specific **examples** and tasks to help users analyze and visualize data effectively.


- **Data Preparation** - How do we organize the data for modeling?
- In this stage, it's essential to ensure the prediction models can process the data. You might prepare data by handling missing values and applying normalization or standardization techniques. Additionally, as most models only accept numerical data, categorical data must be converted into numerical data.

- In this lecture, we cover essential steps such as **data cleaning**, managing **categorical data**, **splitting data into training and test sets**, and **feature scaling**.


- **Modelling** - What modeling techniques should we apply?
- We divide the data into training and test sets, use the training data to train the model, and use the test data to evaluate how well the model performs. There are many predictive models and modeling techniques in machine learning; here we look at some simple linear regression examples.

- In this lecture, we provide an introduction to various **linear regression models** used in predictive analysis.

- Evaluation - Which model best meets the business objectives?
- The Evaluation phase takes a broader perspective on identifying which model best aligns with the business and determining the next steps.
- Deployment - How do stakeholders access the results?
- “Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.”

[Learn more](https://www.datascience-pm.com/crisp-dm-2/)

## Predictive analytics

<img src="./images/crisp_2.png" alt="CRISP-DM Process"/>
<a href="https://www.datascience-pm.com/crisp-dm-2/">Figure 2: CRISP-DM Process - focus areas in this lecture</a>

### Overview of the lecture

- Understanding data
- Understanding and working with datasets. This section includes references to various libraries and sources for finding datasets, along with specific examples and tasks to help users analyze and visualize data effectively.
- “Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.”

- Data preparation
- Preparing data for modeling. This section covers essential steps such as data cleaning, managing categorical data, splitting data into training and test sets, and feature scaling. Each section includes references to further reading and lectures for a deeper understanding of the topics
- Complete example of applying CRISP DM
- A detailed breakdown of a **machine learning workflow** using the CRISP-DM methodology. It covers all phases from **business understanding to deployment**, using a dataset on CO2 emissions of cars to illustrate each step.

- Modelling
- This section provides an introduction to various linear regression models used in predictive analysis. It includes descriptions and links to Jupyter Notebooks for simple linear regression, polynomial regression, multiple regression, and robust linear regression models. The content serves as a guide for understanding and implementing these models in predictive analytics tasks.

- CRISP DM: Complete example
- This section provides a detailed breakdown of a machine learning workflow (putting the pieces together) using the CRISP-DM methodology. It covers all phases from business understanding to deployment, using a dataset on CO2 emissions of cars to illustrate each step. The example includes data loading, exploratory data analysis, data preparation, model training, evaluation, and deployment.
[📖 Learn more about CRISP DM](https://www.datascience-pm.com/crisp-dm-2/)
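
The data preparation and modelling phases described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy DataFrame, its column names (`weight`, `fuel`, `co2`) and its values are hypothetical and are not the course dataset, and median imputation with one-hot encoding is just one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value and one categorical column.
df = pd.DataFrame({
    "weight": [900, 1100, 1300, np.nan, 1500, 1700, 1000, 1200, 1400, 1600],
    "fuel": ["petrol", "diesel"] * 5,
    "co2": [95, 100, 105, 104, 112, 118, 97, 103, 108, 115],
})

# Data preparation: fill the missing weight with the column median.
df["weight"] = df["weight"].fillna(df["weight"].median())

# Data preparation: convert the categorical column to numeric (one-hot encoding).
X = pd.get_dummies(df[["weight", "fuel"]], columns=["fuel"], drop_first=True, dtype=float)
y = df["co2"]

# Split into training and test sets before scaling, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling: fit the scaler on the training data only, then reuse it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Modelling: train on the training set, evaluate on the test set.
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
```

Fitting the scaler on the training split alone keeps information from the test set out of the model, which makes the evaluation scores more trustworthy.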

[Top](../README.md) | [Next -> Understanding Data](./understanding_data/README.md)
[Top](../README.md) | [➡️ Understanding Data](./understanding_data/README.md)
10 changes: 5 additions & 5 deletions session_1/crisp_dm/README.md
@@ -1,6 +1,6 @@
## CRISP DM - Example
## CRISP DM - Example

[CO2 emission of cars dataset](https://www.kaggle.com/code/midhundasl/co2-prediction)
[📖 CO2 emission of cars dataset](https://www.kaggle.com/code/midhundasl/co2-prediction)
By [Midhun Das L](https://www.kaggle.com/midhundasl)

We use this dataset to demonstrate a complete machine learning workflow that follows the CRISP-DM methodology.
@@ -18,7 +18,7 @@ Here's a detailed breakdown of what each section:
- Loading Data: The data is loaded from a CSV file.
- Exploratory Data Analysis (EDA):
- info(): Provides a concise summary of the DataFrame, including the data types and non-null values.
- describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the datasets distribution.
- describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution.
- pairplot(): Creates a pairwise plot of the dataset to visualize relationships between variables.

3. Data Preparation
@@ -39,7 +39,7 @@ Here's a detailed breakdown of what each section:
- Saving the Model: The trained model and scaler are saved to disk using joblib.
- Loading the Model: Demonstrates how to load the saved model and scaler for future predictions.
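
The save and load steps can be sketched as follows. This is an illustrative round trip, not the notebook's exact code: the training values, the file names `model.joblib` and `scaler.joblib`, and the temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical training data standing in for the prepared features and target.
X_train = np.array([[790.0, 1.0], [1160.0, 1.2], [929.0, 1.0], [1365.0, 1.6], [1705.0, 2.0]])
y_train = np.array([99.0, 95.0, 95.0, 99.0, 114.0])

# Fit the scaler and the model on the scaled training data.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")    # hypothetical file name
    scaler_path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical file name

    # Save both objects: the model is only usable together with its scaler.
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    # Load them back, as a separate prediction script would do.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New samples must be scaled with the loaded scaler before predicting.
X_new = np.array([[1000.0, 1.3]])
original = model.predict(scaler.transform(X_new))
restored = loaded_model.predict(loaded_scaler.transform(X_new))
print("Predictions match:", np.allclose(original, restored))
```

Passing unscaled inputs to a model trained on scaled features produces predictions far outside the plausible range, so the scaler should always be saved and loaded alongside the model.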

[Lecture: Predict CO2 emissions](data_analytics_example.ipynb)
[🎥 Predict CO2 emissions](data_analytics_example.ipynb)

[Predictive Analytics <- Previous](../modelling/README.md) |
[Modelling ⬅️](../modelling/README.md) |
[TOP](../../README.md)
109 changes: 67 additions & 42 deletions session_1/crisp_dm/data_analytics_example.ipynb
@@ -6,12 +6,16 @@
"source": [
"## Business Understanding\n",
"\n",
"Objective: Predict CO2 emissions based on car weight and volume to help reduce environmental impact."
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_business_understanding.png\" width=\"400\">\n",
"</div>\n",
"\n",
"Predict CO2 emissions based on car weight and volume to help reduce environmental impact."
]
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -22,14 +26,14 @@
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"from sklearn.metrics import mean_absolute_error, r2_score\n",
"import joblib\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 77,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -42,7 +46,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding the data"
"## Understanding the data\n",
"\n",
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_data_understanding.png\" width=\"400\">\n",
"</div>"
]
},
{
@@ -141,12 +149,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preparation"
"## Data Preparation\n",
"\n",
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_data_preparation.png\" width=\"400\">\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": 83,
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
@@ -156,29 +168,33 @@
},
{
"cell_type": "code",
"execution_count": 84,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Features and target variable\n",
"X = data[['Weight', 'Volume']].copy()\n",
"\n",
"y = data[['CO2']].copy()"
"y = data[['CO2']].copy()\n",
"\n",
"X.head(),y.head()"
]
},
{
"cell_type": "code",
"execution_count": 85,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)"
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
"\n",
"X_train,X_test"
]
},
{
"cell_type": "code",
"execution_count": 86,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -187,19 +203,25 @@
"\n",
"# Fits the StandardScaler to the data and then transforms the data.\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)"
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"X_train_scaled,X_test_scaled"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling"
"## Modeling\n",
"\n",
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_modeling.png\" width=\"400\">\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": 87,
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
@@ -243,7 +265,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
"## Evaluation\n",
"\n",
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_evaluation.png\" width=\"400\">\n",
"</div>"
]
},
{
@@ -252,7 +278,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(\"Mean Squared Error:\", mean_squared_error(y_test, y_pred))\n",
"print(\"Mean Abspolute Error:\", mean_absolute_error(y_test, y_pred))\n",
"print(\"R-squared:\", r2_score(y_test, y_pred))"
]
},
@@ -261,29 +287,25 @@
"metadata": {},
"source": [
"The provided output includes two key metrics for evaluating the performance of a regression model: \n",
"- Mean Squared Error (MSE)\n",
"- Mean Absolute Error (MAE)\n",
"- R-squared (R²). \n",
"\n",
"1. Mean Squared Error (MSE)\n",
"Value: 41.48536307266049\n",
"\n",
"Explanation: MSE measures the average squared difference between the actual and predicted values. It is calculated as: $ \\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 $ where ( $y_i$ ) is the actual value, ($ \\hat{y}_i $) is the predicted value, and ( n ) is the number of observations.\n",
"1. Mean Absolute Error\n",
"Value: 5.6\n",
"\n",
"Interpretation: A lower MSE indicates a better fit of the model to the data. In this case, an MSE of 41.48536307266049 suggests that, on average, the squared difference between the actual and predicted CO2 emissions is about 41.49. This value is context-dependent, so without a baseline or comparison, it's hard to judge if this is good or bad.\n",
"In this case, an MAE of 5.6 suggests that, on average, the absolute difference between the actual and predicted CO2 emissions is about 5.6 units.\n",
"This value provides a straightforward measure of how close the predictions are to the actual values, with all errors contributing equally to the average.\n",
"\n",
"2. R-squared (R²)\n",
"Value: 0.40615897189660666\n",
"Value: 0.4\n",
"\n",
"Explanation: R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It is calculated as: $ R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}i)^2}{\\sum{i=1}^{n} (y_i - \\bar{y})^2} $ where $ \\bar{y} $ is the mean of the actual values.\n",
"\n",
"Interpretation: R² ranges from 0 to 1, where:\n",
"R² ranges from 0 to 1, where:\n",
"\n",
"0: The model explains none of the variance in the dependent variable.\\\n",
"1: The model explains all the variance in the dependent variable.\\\n",
"Negative values: The model performs worse than a horizontal line (mean of the actual values).\\\n",
"An R² value of 0.40615897189660666 means that approximately 40.62% of the variance in CO2 emissions is explained by the model. This indicates a moderate fit, suggesting that there is room for improvement.\n",
"\n",
"Improving a linear regression model to achieve a lower Mean Squared Error (MSE) and a higher R-squared value involves several steps."
"Negative values: The model performs worse than a horizontal line (mean of the actual values).\\\n",
"An R² value of 0.4 means that approximately 40.62% of the variance in CO2 emissions is explained by the model. This indicates a moderate fit, suggesting that there is room for improvement."
]
},
{
@@ -311,6 +333,17 @@
"print(\"Coefficients:\", regr.coef_)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deployment\n",
"\n",
"<div style=\"background-color: white; display: inline-block;\">\n",
" <img src=\"../images/crisp_deployment.png\" width=\"400\">\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -324,7 +357,7 @@
},
{
"cell_type": "code",
"execution_count": 92,
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
@@ -335,17 +368,9 @@
},
{
"cell_type": "code",
"execution_count": 94,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted CO2 using loaded model: [[7071.54172594]]\n"
]
}
],
"outputs": [],
"source": [
"# Predict using the loaded model and scaler\n",
"X_new_loaded = pd.DataFrame({\n",
@@ -363,7 +388,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[CRISP DM <- Previous](./README.md)"
"[CRISP DM ⬅️](./README.md)"
]
}
],