diff --git a/data/diabetes_X.xlsx b/data/diabetes_X.xlsx new file mode 100644 index 0000000..761ccbe Binary files /dev/null and b/data/diabetes_X.xlsx differ diff --git a/data/diabetes_y.xlsx b/data/diabetes_y.xlsx new file mode 100644 index 0000000..ec55f48 Binary files /dev/null and b/data/diabetes_y.xlsx differ diff --git a/docs/crisp.pptx b/docs/crisp.pptx index bc67993..69ede40 100644 Binary files a/docs/crisp.pptx and b/docs/crisp.pptx differ diff --git a/session_1/README.md b/session_1/README.md index c49554f..2662fc4 100644 --- a/session_1/README.md +++ b/session_1/README.md @@ -1,6 +1,6 @@ # Predictive analytics -## Prerequisits +## Prerequisites ✅ - You should already have basic knowledge about: - [NumPy](https://numpy.org/) @@ -11,7 +11,7 @@ - Before starting to work on the example codes, make sure the above packages are installed on your Python. -## Usefull resources +## Useful resources 📖 - [What is CRISP DM?](https://www.datascience-pm.com/crisp-dm-2/) @@ -25,13 +25,13 @@ - Textbook: Python for realfag, ch. 10, 17. -## Objectives +## Objectives 🎯 - Get to know your data (data understanding) - Using pandas to handle missing data in a dataset (data preparation) - Applying simple regression models to a dataset (modelling) -## Introduction +## Introduction 📖 Predictive analytics is the practice of forecasting future results and performance using statistics and modelling approaches. @@ -39,7 +39,7 @@ With [predictive programming](https://365datascience.com/tutorials/python-tutori Cross-industry standard process for data mining ([CRISP DM](https://www.datascience-pm.com/crisp-dm-2/)): -CRISP-DM Process +CRISP-DM Process \ Figure 1: CRISP-DM Process @@ -47,37 +47,29 @@ Cross-industry standard process for data mining ([CRISP DM](https://www.datascie - The Business Understanding phase centers on comprehending the project's objectives and needs. - **Data Understanding** - What data do we have / need? Is it clean?
- The data understanding phase emphasizes identifying, gathering, and analyzing datasets that will aid in achieving the project's objectives. + - In this lecture we look into various **libraries** and **sources** for finding datasets, along with specific **examples** and tasks to help users analyze and visualize data effectively. + - **Data Preparation** - How do we organize the data for modeling? - In this stage, it's essential to ensure the prediction models can process the data. You might prepare data by handling missing values and applying normalization or standardization techniques. Additionally, as most models only accept numerical data, categorical data must be converted into numerical data. + - In this lecture, we cover essential steps such as **data cleaning**, managing **categorical data**, **splitting data into training and test sets**, and **feature scaling**. + + - **Modelling** - What modeling techniques should we apply? - We make predictions by dividing the data into test and training. Use training data to train the model and test data to evaluate the model's success score. There are several predictive models and modeling techniques in machine learning. We look at some simple linear regression model examples. + - In this lecture, we provide an introduction to various **linear regression models** used in predictive analysis. + - Evaluation - Which model best meets the business objectives? - The Evaluation phase takes a broader perspective on identifying which model best aligns with the business and determining the next steps. - Deployment - How do stakeholders access the results? 
- - “Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.” - -[Lean more](https://www.datascience-pm.com/crisp-dm-2/) - -## Predictive analytics - -CRISP-DM Process -Figure 2: CRISP-DM Process - focus areas in this lecture - -### Overview of the lecture - -- Understanding data - - Understanding and working with datasets. This section includes references to various libraries and sources for finding datasets, along with specific examples and tasks to help users analyze and visualize data effectively. + - “Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.” -- Data preparation - - Preparing data for modeling. This section covers essential steps such as data cleaning, managing categorical data, splitting data into training and test sets, and feature scaling. Each section includes references to further reading and lectures for a deeper understanding of the topics +- Complete example of applying CRISP DM + - A detailed breakdown of a **machine learning workflow** using the CRISP-DM methodology. It covers all phases from **business understanding to deployment**, using a dataset on CO2 emissions of cars to illustrate each step. -- Modelling - - This section provides an introduction to various linear regression models used in predictive analysis. It includes descriptions and links to Jupyter Notebooks for simple linear regression, polynomial regression, multiple regression, and robust linear regression models. The content serves as a guide for understanding and implementing these models in predictive analytics tasks. -- CRISP DM: Complete example - - This section provides a detailed breakdown of a machine learning workflow (putting pices together) using the CRISP-DM methodology.
It covers all phases from business understanding to deployment, using a dataset on CO2 emissions of cars to illustrate each step. The example includes data loading, exploratory data analysis, data preparation, model training, evaluation, and deployment. +[📖 Learn more about CRISP DM](https://www.datascience-pm.com/crisp-dm-2/) -[Top](../README.md) | [Next -> Understanding Data](./understanding_data/README.md) +[Top](../README.md) | [➡️ Understanding Data](./understanding_data/README.md) diff --git a/session_1/crisp_dm/README.md b/session_1/crisp_dm/README.md index 170036e..e7232d1 100644 --- a/session_1/crisp_dm/README.md +++ b/session_1/crisp_dm/README.md @@ -1,6 +1,6 @@ -## CRISP DM - Example +## CRISP DM - Example -[CO2 emission of cars dataset](https://www.kaggle.com/code/midhundasl/co2-prediction) +[📖 CO2 emission of cars dataset](https://www.kaggle.com/code/midhundasl/co2-prediction) By [Midhun Das L](https://www.kaggle.com/midhundasl) We use this dataset as the basis for demonstration of the comprehensive example of a machine learning workflow that follows the CRISP-DM methodology. @@ -18,7 +18,7 @@ Here's a detailed breakdown of what each section: - Loading Data: The data is loaded from a CSV file. - Exploratory Data Analysis (EDA): - info(): Provides a concise summary of the DataFrame, including the data types and non-null values. - - describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution. + - describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution. - pairplot(): Creates a pairwise plot of the dataset to visualize relationships between variables. 3. Data Preparation @@ -39,7 +39,7 @@ Here's a detailed breakdown of what each section: - Saving the Model: The trained model and scaler are saved to disk using joblib.
- Loading the Model: Demonstrates how to load the saved model and scaler for future predictions. -[Lecture: Predict CO2 emissions](data_analytics_example.ipynb) +[🎥 Predict CO2 emissions](data_analytics_example.ipynb) -[Predictive Analytics <- Previous](../modelling/README.md) | +[Modelling ⬅️](../modelling/README.md) | [TOP](../../README.md) diff --git a/session_1/crisp_dm/data_analytics_example.ipynb b/session_1/crisp_dm/data_analytics_example.ipynb index 1424722..498f5f7 100644 --- a/session_1/crisp_dm/data_analytics_example.ipynb +++ b/session_1/crisp_dm/data_analytics_example.ipynb @@ -6,12 +6,16 @@ "source": [ "## Business Understanding\n", "\n", - "Objective: Predict CO2 emissions based on car weight and volume to help reduce environmental impact." + "
\n", + " \n", + "
\n", + "\n", + "Predict CO2 emissions based on car weight and volume to help reduce environmental impact." ] }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -22,14 +26,14 @@ "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import mean_squared_error, r2_score\n", + "from sklearn.metrics import mean_absolute_error, r2_score\n", "import joblib\n", "\n" ] }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -42,7 +46,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Understanding the data" + "## Understanding the data\n", + "\n", + "
\n", + " \n", + "
" ] }, { @@ -141,12 +149,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Data Preparation" + "## Data Preparation\n", + "\n", + "
\n", + " \n", + "
" ] }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -156,29 +168,33 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Features and target variable\n", "X = data[['Weight', 'Volume']].copy()\n", "\n", - "y = data[['CO2']].copy()" + "y = data[['CO2']].copy()\n", + "\n", + "X.head(),y.head()" ] }, { "cell_type": "code", - "execution_count": 85, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split data into training and test sets\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)" + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n", + "\n", + "X_train,X_test" ] }, { "cell_type": "code", - "execution_count": 86, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -187,19 +203,25 @@ "\n", "# Fits the StandardScaler to the data and then transforms the data.\n", "X_train_scaled = scaler.fit_transform(X_train)\n", - "X_test_scaled = scaler.transform(X_test)" + "X_test_scaled = scaler.transform(X_test)\n", + "\n", + "X_train_scaled,X_test_scaled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Modeling" + "## Modeling\n", + "\n", + "
\n", + " \n", + "
" ] }, { "cell_type": "code", - "execution_count": 87, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -243,7 +265,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Evaluation" + "## Evaluation\n", + "\n", + "
\n", + " \n", + "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "print(\"Mean Squared Error:\", mean_squared_error(y_test, y_pred))\n", + "print(\"Mean Absolute Error:\", mean_absolute_error(y_test, y_pred))\n", "print(\"R-squared:\", r2_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The provided output includes two key metrics for evaluating the performance of a regression model: \n", - "- Mean Squared Error (MSE)\n", + "- Mean Absolute Error (MAE)\n", "- R-squared (R²). \n", "\n", - "1. Mean Squared Error (MSE)\n", - "Value: 41.48536307266049\n", - "\n", - "Explanation: MSE measures the average squared difference between the actual and predicted values. It is calculated as: $ \\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 $ where ( $y_i$ ) is the actual value, ($ \\hat{y}_i $) is the predicted value, and ( n ) is the number of observations.\n", + "1. Mean Absolute Error\n", + "Value: 5.6\n", "\n", - "Interpretation: A lower MSE indicates a better fit of the model to the data. In this case, an MSE of 41.48536307266049 suggests that, on average, the squared difference between the actual and predicted CO2 emissions is about 41.49. This value is context-dependent, so without a baseline or comparison, it's hard to judge if this is good or bad.\n", + "In this case, an MAE of 5.6 suggests that, on average, the absolute difference between the actual and predicted CO2 emissions is about 5.6 units.\n", + "This value provides a straightforward measure of how close the predictions are to the actual values, with all errors contributing equally to the average.\n", "\n", "2. R-squared (R²)\n", - "Value: 0.40615897189660666\n", + "Value: 0.4\n", "\n", - "Explanation: R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
It is calculated as: $ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}i)^2}{\sum{i=1}^{n} (y_i - \bar{y})^2} $ where $ \bar{y} $ is the mean of the actual values.\n", "\n", - "Interpretation: R² ranges from 0 to 1, where:\n", + "R² ranges from 0 to 1, where:\n", "\n", "0: The model explains none of the variance in the dependent variable.\\\n", "1: The model explains all the variance in the dependent variable.\\\n", - "Negative values: The model performs worse than a horizontal line (mean of the actual values).\\\n", - "An R² value of 0.40615897189660666 means that approximately 40.62% of the variance in CO2 emissions is explained by the model. This indicates a moderate fit, suggesting that there is room for improvement.\n", "\n", - "Improving a linear regression model to achieve a lower Mean Squared Error (MSE) and a higher R-squared value involves several steps." + "Negative values: The model performs worse than a horizontal line (mean of the actual values).\\\n", + "An R² value of 0.4 means that approximately 40% of the variance in CO2 emissions is explained by the model. This indicates a moderate fit, suggesting that there is room for improvement." ] }, { @@ -311,6 +333,17 @@ "print(\"Coefficients:\", regr.coef_)\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deployment\n", + "\n", + "
\n", + " \n", + "
" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -324,7 +357,7 @@ }, { "cell_type": "code", - "execution_count": 92, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -335,17 +368,9 @@ }, { "cell_type": "code", - "execution_count": 94, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Predicted CO2 using loaded model: [[7071.54172594]]\n" - ] - } - ], + "outputs": [], "source": [ "# Predict using the loaded model and scaler\n", "X_new_loaded = pd.DataFrame({\n", @@ -363,7 +388,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[CRISP DM <- Previous](./README.md)" + "[CRISP DM ⬅️](./README.md)" ] } ], diff --git a/session_1/data_preparation/README.md b/session_1/data_preparation/README.md index 5bb9586..01b2771 100644 --- a/session_1/data_preparation/README.md +++ b/session_1/data_preparation/README.md @@ -1,20 +1,22 @@ ## 2. Data preparation - How do we organize the data for modeling? + - What data is missing, and how do we handle missing values? + - Are there any noisy data, and how do we clean it? + - Are there any outliers, and how should we treat them? + - How do we handle categorical data? + - Do we need to scale or normalize the data? + - Are there any feature engineering steps we should consider? -We rarely have complete datasets to work with. -In Python, missing values are represented as NaN, which is "not a number". +These questions are about the essential steps in data preparation for modeling. They guide the process of identifying and handling missing values, cleaning noisy data, addressing outliers, converting categorical data into numerical formats, scaling or normalizing features, and performing feature engineering. By systematically addressing these aspects, we ensure that the data is clean, well-structured, and suitable for building effective machine learning models.
### Data cleaning -Most prediction methods cannot work with missing data, so we need to fix the problem of missing values. +This step involves identifying and correcting inaccuracies or inconsistencies in the data. It addresses issues such as missing values, outliers, duplicate records, and errors. The goal is to enhance the quality of the data, ensuring it is reliable for analysis. -There are several ways to handle this. -We are looking at three options: 1) Delete the entire column where the NaN values are found. 2) Delete the rows with NaN values. 3) fill in the NaN values. +[📖 Read more about data cleaning](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/) -[Read more](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/) - -[Lecture: Data cleaning](./preprocessing_calibration/README.md) +[🎥 Data cleaning](./preprocessing_calibration/README.md) @@ -26,14 +28,14 @@ Because prediction models only accept numerical data, we must translate categori There are two ways we may approach this. Label encoding is one method, and hot encoding is another. -- [Label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html): works well if we have two distinct values. We can use e.g. sklearn.LabelEncoder or pandas.factorize() +- [📖 Label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html): works well if we have two distinct values. We can use e.g. sklearn.LabelEncoder or pandas.factorize() -- [One hot encoding](https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.OneHotEncoder.html): works well if we have three or more distinct values. We can use e.g. sklearn OneHotEncoder or [pandas get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) +- [📖 One hot encoding](https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.OneHotEncoder.html): works well if we have three or more distinct values. 
We can use e.g. sklearn OneHotEncoder or [pandas get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) -[Read more](https://www.datacamp.com/tutorial/categorical-data) +[📖 Read more about managing categorical data](https://www.datacamp.com/tutorial/categorical-data) -[Lecture: categorical data](./data_transformation/categorical_data.ipynb) +[🎥 Categorical data](./data_transformation/categorical_data.ipynb) ### Deviding data into training and test We divide data into what's known as train and test. One portion is used to asses the model's prediction - The training set contains a known output and the model learns on this data (2/3). - The test data is for testing our model's prediction (1/3). -[Read more](https://www.techtarget.com/searchenterpriseai/definition/data-splitting) +[📖 Read more about dividing data into training and test](https://www.techtarget.com/searchenterpriseai/definition/data-splitting) -[Lecture: deviding data into trining and test](./data_transformation/deviding_data.ipynb) +[🎥 Dividing data into training and test](./data_transformation/deviding_data.ipynb) ### Feature scaling Here, we'll demonstrate two techniques called normalization and standardization. - Normalization: Divide the result by the standard deviation after deducting the minimum value from the number. The values are confined to a certain range. - Standardization: Divide the result by the standard deviation after deducting the mean value from the number. In contrast to normalization, we are not confined to a certain range in this situation.
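The splitting and scaling steps described above can be sketched end to end as follows. This is a minimal example on synthetic data (the array sizes, seed, and variable names are illustrative, not from the lecture notebooks); note that both scalers are fitted on the training split only and then reused on the test split, so no information from the test set leaks into the preprocessing:

```python
# Minimal sketch: train/test split followed by the two scaling techniques above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(12, 2))  # illustrative feature matrix
y = rng.normal(size=12)                             # illustrative target

# Hold out one third of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Normalization: rescale each feature to the [0, 1] range
norm = MinMaxScaler().fit(X_train)      # fit on the training data only
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)    # reuse the training statistics

# Standardization: subtract the mean and divide by the standard deviation
std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)
```

After standardization the training features have mean 0 and unit variance; after normalization they lie in [0, 1], but the transformed test features may fall slightly outside that range, which is expected.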
-[Read more](https://www.geeksforgeeks.org/normalization-vs-standardization/) +[📖 Read more about feature scaling](https://www.geeksforgeeks.org/normalization-vs-standardization/) -[Lecture: Feature scalling](./data_transformation/feature_scaling.ipynb) +[🎥 Feature scaling](./data_transformation/feature_scaling.ipynb) -### Task 2: Data preparation -[Analyzing and Visualizing Existing Datasets](task_2.md) +### Task 2: Data preparation 📋 +[📋 Analyzing and Visualizing Existing Datasets](task_2.md) -[Data understanding <- Previous](../understanding_data/README.md) | -[Next -> Modelling](../modelling/README.md) \ No newline at end of file +[Data understanding ⬅️](../understanding_data/README.md) | +[➡️ Modelling](../modelling/README.md) \ No newline at end of file diff --git a/session_1/data_preparation/data_transformation/categorical_data.ipynb b/session_1/data_preparation/data_transformation/categorical_data.ipynb index 52f50af..fad85e1 100644 --- a/session_1/data_preparation/data_transformation/categorical_data.ipynb +++ b/session_1/data_preparation/data_transformation/categorical_data.ipynb @@ -12,7 +12,7 @@ "The pandas library provides a Categorical data type that is specifically designed to handle categorical data.
\n", "This data type is useful for representing data that can take on a limited, fixed number of possible values (categories).\n", "\n", - "[Learn more](https://www.datacamp.com/tutorial/categorical-data)" + "[📖 Learn more about categorical data](https://www.datacamp.com/tutorial/categorical-data)" ] }, { @@ -281,7 +281,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data preparation <- Previous](../README.md)" + "[Data preparation ⬅ï¸](../README.md)" ] } ], diff --git a/session_1/data_preparation/data_transformation/deviding_data.ipynb b/session_1/data_preparation/data_transformation/deviding_data.ipynb index 43b8362..dbc478a 100644 --- a/session_1/data_preparation/data_transformation/deviding_data.ipynb +++ b/session_1/data_preparation/data_transformation/deviding_data.ipynb @@ -27,7 +27,7 @@ "outputs": [], "source": [ "# Load data\n", - "file_name = r\"C:\\git\\gitlab\\it6209\\lectures\\seminar_4\\session_1\\datasets\\insurance.csv\"\n", + "file_name = \"../../../data/insurance.csv\"\n", "data = pd.read_csv(file_name)\n", "data.head()" ] @@ -108,7 +108,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ @@ -140,7 +140,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data preparation <- Previous](../README.md)" + "[Data preparation ⬅ï¸](../README.md)" ] } ], diff --git a/session_1/data_preparation/data_transformation/feature_scaling.ipynb b/session_1/data_preparation/data_transformation/feature_scaling.ipynb index d762bcb..6c44038 100644 --- a/session_1/data_preparation/data_transformation/feature_scaling.ipynb +++ b/session_1/data_preparation/data_transformation/feature_scaling.ipynb @@ -22,25 +22,53 @@ "# Load data\n", "file_name = \"../../../data/insurance.csv\"\n", "data = pd.read_csv(file_name)\n", - "print(data.head(10))\n", - "\n", + "print(data.head(10))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": [ "# 2. Data cleaning\n", "# Fill in missing data using the interpolate()\n", "data.interpolate(method='linear', inplace=True)\n", - "print(data.head(10))\n", - "\n", + "print(data.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 3. Data transformation\n", "# encode none-numerical data\n", "data[\"smoker\"] = pd.factorize(data[\"smoker\"])[0]\n", "data[\"sex\"] = pd.factorize(data[\"sex\"])[0]\n", "data = pd.get_dummies(data, columns=[\"region\"])\n", - "print(data.head(10))\n", - "\n", + "print(data.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define x as being all columns exept charges column\n", "X_final = data[['age', 'bmi', 'children', 'region_northeast', 'region_northwest',\n", " 'region_southeast', 'region_southwest', 'sex', 'smoker']].copy()\n", - "print(X_final.head(10))\n", - "\n", + "print(X_final.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define y as being the \"charges column\" from the original dataset\n", "y_final = data[['charges']].copy()\n", "print(y_final.head(10))" @@ -48,11 +76,10 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ - "\n", "# Test train split\n", "X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=0)" ] @@ -94,7 +121,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -109,7 +136,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data preparation <- Previous](../README.md)" + "[Data preparation ⬅️](../README.md)" ] } ], diff --git a/session_1/data_preparation/preprocessing_calibration/README.md b/session_1/data_preparation/preprocessing_calibration/README.md index d8d6db0..e154aac
100644 --- a/session_1/data_preparation/preprocessing_calibration/README.md +++ b/session_1/data_preparation/preprocessing_calibration/README.md @@ -1,26 +1,24 @@ ## Data Cleaning -This step involves identifying and correcting inaccuracies or inconsistencies in the data. It addresses issues such as missing values, outliers, duplicate records, and errors. The goal is to enhance the quality of the data, ensuring it is reliable for analysis. - -### [Removing Duplicates](./removing_duplicates.ipynb) +- [🎥 Removing Duplicates](./removing_duplicates.ipynb) Identifying and removing duplicate records in a customer database. For instance, when merging customer data from different sources, you might find that the same customer is listed multiple times with slightly different variations. -### [Handling Missing Values](./filling_missing_data.ipynb) +- [🎥 Handling Missing Values](./filling_missing_data.ipynb) Dealing with missing data in a dataset. For instance, in a survey dataset, some respondents might have skipped certain questions, leading to missing values that need to be imputed or handled Two primary ways: (1) Ignore if the dataset is large, (2) fill in the missing values -### [Outlier removal](./outlier_removal.ipynb) +- [🎥 Outlier removal](./outlier_removal.ipynb) Outlier removal is a crucial process in data cleaning that involves identifying and eliminating data points that significantly deviate from the rest of the dataset. Outliers can arise due to various reasons, such as measurement errors, data entry mistakes, or genuine variability in the data. -### [Time series preprocessing](time_series_data_preprocessing.ipynb) +- [🎥 Time series preprocessing](time_series_data_preprocessing.ipynb) Time series data refers to data that associates a timestamp with each recorded entry. These records may be incomplete, or the events may occur irregularly. 
-### [Signal preprocessing](signal_preprocessing.ipynb) +- [🎥 Signal preprocessing](signal_preprocessing.ipynb) -[Data preparation <- Previous](../README.md) \ No newline at end of file +[Data preparation ⬅️](../README.md) \ No newline at end of file diff --git a/session_1/data_preparation/preprocessing_calibration/filling_missing_data.ipynb b/session_1/data_preparation/preprocessing_calibration/filling_missing_data.ipynb index 55bc373..0da63a5 100644 --- a/session_1/data_preparation/preprocessing_calibration/filling_missing_data.ipynb +++ b/session_1/data_preparation/preprocessing_calibration/filling_missing_data.ipynb @@ -24,9 +24,10 @@ "outputs": [], "source": [ "# Import libraries\n", - "import pandas as pd\n", - "import numpy as np\n", - "import seaborn as sns" + "import pandas as pd # for dataframes\n", + "import numpy as np # for mathematical operations\n", + "import matplotlib.pyplot as plt # for plotting\n", + "import missingno as msno # for visualizing missing values" ] }, { @@ -42,10 +43,6 @@ "metadata": {}, "outputs": [], "source": [ - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import missingno as msno\n", - "\n", "# Load the data\n", "file_name = \"../../../data/insurance.csv\"\n", "data = pd.read_csv(file_name)\n", @@ -360,7 +357,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data cleaning <- Previous](./README.md)" + "[Data cleaning ⬅️](./README.md)" ] } ], diff --git a/session_1/data_preparation/preprocessing_calibration/outlier_removal.ipynb b/session_1/data_preparation/preprocessing_calibration/outlier_removal.ipynb index 5ec25ad..0b56900 100644 --- a/session_1/data_preparation/preprocessing_calibration/outlier_removal.ipynb +++ b/session_1/data_preparation/preprocessing_calibration/outlier_removal.ipynb @@ -9,25 +9,53 @@ "\n", "## [Standard deviation outlier](https://study.com/skill/learn/determining-outliers-using-standard-deviation-explanation.html)\n", "\n", - "Standard deviation is a measure of the
amount of variation or dispersion in a set of values. When it comes to identifying outliers, the standard deviation can be used to determine how far a data point is from the mean of the dataset. Outliers are typically defined as data points that are significantly different from the rest of the data.\n", + "Standard deviation is a measure of the amount of variation or dispersion in a set of values. \n", + "When it comes to identifying **outliers**, the standard deviation can be used to **determine how far a data point is from the mean** of the dataset. \n", + "Outliers are typically defined as data points that are significantly different from the rest of the data.\n", "\n" ] }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 114, "metadata": {}, "outputs": [], "source": [ - "# Example 1: numpy array with outliers\n", - "import numpy as np\n", + "# Import libraries\n", + "import numpy as np # Used for outlier detection\n", + "import matplotlib.pyplot as plt # Used for plotting\n", + "import pandas as pd # Used for data manipulation and analysis, particularly for handling and filtering outliers in a dataset\n", + "import seaborn as sns # Used for visualizations\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 1: numpy array with outliers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Parameters\n", + "mean = 10\n", + "std_dev = 20\n", + "num_samples = 100\n", "\n", - "# Create sample data\n", - "sample_data = np.array([5.7, 6.8, 9.4, 8.6, 7.1, 5.9, 8.3])\n", + "# Generate random normal numbers\n", + "np.random.seed(10) # Set seed for reproducibility\n", + "sample_data = np.random.normal(loc=mean, scale=std_dev, size=num_samples)\n", "\n", "# Add outlier points\n", - "outliers = np.array([200.5, 220.3, 0.05, 10.2])\n", - "sample_data_with_outliers = np.concatenate((sample_data, outliers))\n" + "outliers = np.array([200.5, 220.3, 
-100,-101])\n", + "sample_data_with_outliers = np.concatenate((sample_data, outliers))\n", + "\n", + "np.set_printoptions(formatter={'float': '{:0.1f}'.format})\n", + "sample_data_with_outliers" ] }, { @@ -50,14 +78,25 @@ "outputs": [], "source": [ "# Calculate upper and lower bounds\n", - "threshold = 2\n", + "threshold = 3\n", "\n", "lower_limit = mean - threshold * std_dev\n", "upper_limit = mean + threshold * std_dev\n", "\n", "# Calculate outliers\n", "outliers = [x for x in sample_data_with_outliers if x > upper_limit or x < lower_limit]\n", - "outliers" + "print(outliers)\n", + "\n", + "# Alternative: Calculate outliers using NumPy\n", + "outliers = sample_data_with_outliers[(sample_data_with_outliers < lower_limit) | (sample_data_with_outliers > upper_limit)]\n", + "print(outliers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A threshold of 3 can result in outliers not being detected, because the outliers themselves inflate the mean and std_dev, resulting in wide bounds."
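To illustrate the note above with a deterministic example: several extreme values can inflate the mean and standard deviation enough to hide themselves from the 3-sigma rule, while a robust variant based on the median and MAD (median absolute deviation — an alternative not used in this notebook, shown here only for comparison, with made-up numbers) still flags them:

```python
# Sketch: the 3-sigma rule vs. a robust median/MAD rule on contaminated data.
import numpy as np

inliers = np.array([9.2, 9.8, 10.1, 10.4, 9.6, 10.0, 9.9, 10.3, 9.7, 10.2])
data = np.concatenate([inliers, [90.0, 95.0, -80.0, -85.0]])  # four injected outliers

# Classic rule: bounds computed from the contaminated mean and std
mean, std = data.mean(), data.std()
classic = data[(data < mean - 3 * std) | (data > mean + 3 * std)]
print(classic)  # empty: the outliers widened the bounds past themselves

# Robust rule: median and MAD are barely moved by the outliers
median = np.median(data)
mad = np.median(np.abs(data - median))
robust_z = 0.6745 * (data - median) / mad  # 0.6745 rescales MAD to ~std for normal data
robust = data[np.abs(robust_z) > 3]
print(robust)  # the four injected outliers
```

Here the contaminated standard deviation is roughly 47, so the 3-sigma bounds stretch to about ±140 around the mean and none of the injected points are flagged, while the robust z-scores of the injected points are in the hundreds.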
] }, { @@ -67,28 +106,24 @@ "outputs": [], "source": [ "filtered_data = [item for item in sample_data_with_outliers if item not in outliers]\n", + "\n", "filtered_data" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 2: pandas dataframe with outliers" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Example 2: pandas dataframe with outliers\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "\n", - "def generate_scores(mean=60,std_dev=12,num_samples=200):\n", - " np.random.seed(27)\n", - " scores = np.random.normal(loc=mean,scale=std_dev,size=num_samples)\n", - " scores = np.round(scores, decimals=0)\n", - " return scores\n", - "\n", - "scores_data = generate_scores()\n", - "\n", - "df_scores = pd.DataFrame(scores_data,columns=['score'])\n", + "df_scores = pd.DataFrame(sample_data_with_outliers,columns=['score'])\n", "\n", "\n", "# Calculate lower and upper limits\n", @@ -97,8 +132,8 @@ "upper_limit = df_scores['score'].mean() + threshold * df_scores['score'].std()\n", "\n", "# Outliers\n", - "df_scores_filtered = df_scores[df_scores['score'].between(lower_limit, upper_limit) == False]\n", - "print(df_scores_filtered)\n" + "outliers = df_scores[df_scores['score'].between(lower_limit, upper_limit) == False]\n", + "print(outliers)\n" ] }, { @@ -107,10 +142,9 @@ "metadata": {}, "outputs": [], "source": [ - "\n", "# Plot distribution of scores\n", "sns.set_theme()\n", - "plot = sns.displot(data=scores_data).set(title=\"Distribution of Scores\", xlabel=\"Scores\")\n", + "plot = sns.displot(data=df_scores).set(title=\"Distribution of Scores\", xlabel=\"Scores\")\n", "\n", "# Add vertical lines for lower and upper limits\n", "for ax in plot.axes.flat:\n", @@ -129,27 +163,24 @@ "\n", "Steps to Identify Outliers Using IQR\n", "\n", - "- Calculate the First Quartile (Q1):\n", - "\n", - "Q1 is the 25th percentile of the data. 
It represents the value below which 25% of the data falls.\n", - "\n", - "- Calculate the Third Quartile (Q3):\n", + "- Calculate the First Quartile (Q1): \\\n", + " Q1 is the 25th percentile of the data. It represents the value below which 25% of the data falls.\n", "\n", - "Q3 is the 75th percentile of the data. It represents the value below which 75% of the data falls.\n", + "- Calculate the Third Quartile (Q3): \\\n", + " Q3 is the 75th percentile of the data. It represents the value below which 75% of the data falls.\n", "\n", - "- Calculate the Interquartile Range (IQR):\n", + "- Calculate the Interquartile Range (IQR): \\\n", + " IQR is the difference between Q3 and Q1.\n", + " ( $\text{IQR} = Q3 - Q1 $)\n", "\n", - "IQR is the difference between Q3 and Q1.\n", - "( $\text{IQR} = Q3 - Q1 $)\n", + "- Define the Bounds for Outliers: \\\n", + " Lower Bound: ($\text{Lower Bound} = Q1 - 1.5 \times \text{IQR}$) \\\n", + " Upper Bound: ($\text{Upper Bound} = Q3 + 1.5 \times \text{IQR}$)\n", "\n", - "- Define the Bounds for Outliers:\n", + " For a normal distribution, the 1.5 × IQR bounds sit at roughly ±2.7 standard deviations from the mean and capture approximately 99.3% of the data, close to the ±3σ (99.7%) rule, since the IQR is proportional to the standard deviation for normally distributed data. See [usage of 1.5 as a Multiplier in the Interquartile Range (IQR) Method and the Normal Distribution.](https://procogia.com/interquartile-range-method-for-reliable-data-analysis/)\n", "\n", - "Lower Bound: ($\text{Lower Bound} = Q1 - 1.5 \times \text{IQR}$) \\\n", - "Upper Bound: ($\text{Upper Bound} = Q3 + 1.5 \times \text{IQR}$)\n", "\n", - "- Identify Outliers:\n", "\n", - "Any data point below the lower bound or above the upper bound is considered an outlier." + "- Identify Outliers: \\\n", + " Any data point below the lower bound or above the upper bound is considered an outlier."
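The four IQR steps can be condensed into a short NumPy sketch; the function name `iqr_outliers` and the sample values below are made up for illustration, not taken from the notebook:

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])  # steps 1 and 2
    iqr = q3 - q1                           # step 3
    lower = q1 - k * iqr                    # step 4: bounds
    upper = q3 + k * iqr
    return data[(data < lower) | (data > upper)]  # step 5: identify

values = [10.0, 12.0, 11.0, 13.0, 9.0, 11.5, 10.5, 120.0]
print(iqr_outliers(values))  # the single extreme value, 120
```

A larger `k` widens the fences; `k=3` would flag only very extreme points in the same data.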
] }, { @@ -159,7 +190,16 @@ "outputs": [], "source": [ "# IQR\n", - "def find_outliers_iqr(data, threshold=3):\n", + "def find_outliers_iqr(data, threshold=1.5):\n", + " \"\"\"\n", + " Identify outliers in a dataset using the Interquartile Range (IQR) method.\n", + " Parameters:\n", + " data (array-like): The dataset to analyze for outliers.\n", + " threshold (float, optional): The multiplier for the IQR to define the bounds for outliers. Default is 1.5.\n", + " Returns:\n", + " pandas.Series: A series containing the outliers in the dataset.\n", + " \"\"\"\n", + " \n", " # Calculate Q1 (25th percentile) and Q3 (75th percentile)\n", " data_series = pd.Series(data)\n", " q1 = data_series.quantile(0.25)\n", @@ -173,11 +213,15 @@ " upper_bound = q3 + threshold * IQR\n", "\n", " # Identify outliers\n", - " outliers = data_series[(data_series < lower_bound) | (data_series > upper_bound)]\n", + " outliers = data_series[~data_series.between(lower_bound, upper_bound)]\n", + " \n", + " # Alternative: Calculate outliers\n", + " # outliers = data_series[data_series.between(lower_bound, upper_bound) == False]\n", + " \n", " return outliers\n", "\n", "\n", - "print(find_outliers_iqr(generate_scores(),2))" + "find_outliers_iqr(sample_data_with_outliers)" ] }, { @@ -188,10 +232,19 @@ "\n", "\n", "Binning is a data preprocessing technique used to group a range of values into a smaller number of \"bins.\" \n", - "This can be useful for various purposes, such as smoothing noisy data, reducing the impact of minor observation errors, or transforming continuous data into categorical data.\n", + "This can be useful for various purposes, such as \n", + "- smoothing noisy data, \n", + "- reducing the impact of minor observation errors, or \n", + "- transforming continuous data into categorical data.\n", "\n", "width = (max - min) / number_of_bins\n", - "bins_values = [min, min + width, min + 2 * width, ..., min + number_of_bins * width]" + "bins_values = [min, min + width, min + 2 * width, 
..., min + number_of_bins * width]\n", + "\n", + "### Example 1:\n", + " - Generate random data using np.random.rand.\n", + " - Define bin edges using np.linspace.\n", + " - Bin the data using np.digitize.\n", + " - Calculate histogram counts using np.bincount\n" ] }, { @@ -200,19 +253,20 @@ "metadata": {}, "outputs": [], "source": [ - "\"\"\" Example 1: This example demonstrates how to:\n", - " - Generate random data using np.random.rand.\n", - " - Define bin edges using np.linspace.\n", - " - Bin the data using np.digitize.\n", - " - Calculate histogram counts using np.bincount\n", - "\"\"\"\n", " \n", "# Generate some example data\n", "data = np.random.rand(100)\n", - "print(\"Data:\", data, len(data))\n", - "\n", + "print(\"Data:\", data, len(data))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# Define bin edges using linspace\n", - "bin_edges = np.linspace(0, 1, 6) # Create 5 bins from 0 to 1\n", + "bin_edges = np.linspace(0, 1, 6) # Creates 6 points, which define 5 bins.\n", "print(\"Bin Edges:\", bin_edges,len(bin_edges))\n", " \n", "# Bin the data using digitize\n", @@ -224,34 +278,144 @@ "print(\"Histogram Counts:\", hist)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Bin Edges\n", + "- Bin 1: [0.0, 0.2)\n", + "- Bin 2: [0.2, 0.4)\n", + "- Bin 3: [0.4, 0.6)\n", + "- Bin 4: [0.6, 0.8)\n", + "- Bin 5: [0.8, 1.0]" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "\"\"\" Example 2: This example demonstrates how to:\n", - "- Create bins using np.linspace.\n", - "- Assign data points to bins using np.digitize.\n", - "- Count the occurrences of data points in each bin using np.bincount\n", - "\"\"\"\n", + "# Manually identifying bins (example): numbers in bin 2\n", + "len(data[np.logical_and(data > 0.2, data < 0.4)])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
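The rand/linspace/digitize/bincount pipeline listed for Example 1 can be sketched end to end in one snippet (this uses the modern `default_rng` generator in place of `np.random.rand`; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.random(100)                      # 1. random data in [0, 1)

bin_edges = np.linspace(0, 1, 6)            # 2. six edges define five bins
bin_indices = np.digitize(data, bin_edges)  # 3. 1-based bin index per value

# 4. counts per bin index; minlength guarantees a slot for every bin
hist = np.bincount(bin_indices, minlength=6)
print(bin_edges)
print(hist)  # hist[0] stays 0: no value falls below the first edge
```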
"source": [ + "# Plot the histogram\n", + "fig, ax = plt.subplots(2,1)\n", "\n", - "def find_outliers_binning(data, number_of_bins):\n", - " \n", - " # Creates an array of number_of_bins equally spaced values between the minimum and maximum values in data. \n", - " # These values represent the edges of the bins.\n", - " bins = np.linspace(min(data), max(data), number_of_bins) \n", + "ax[0].plot(data)\n", + "ax[1].hist(data, bins=bin_edges, edgecolor='black', alpha=0.7)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 2\n", + "Bin the data into a specified number of bins and counts the occurrences of data points in each bin. \n", + "\n", + "We use np.linspace to create the bin edges, np.digitize to assign data points to bins, and np.bincount to count the occurrences in each bin. \n", " \n", - " # Assigns each value in data to a bin. \n", - " # It returns an array where each element is the index of the bin to which the corresponding element in data belongs.\n", - " digitized = np.digitize(data, bins)\n", + "This function is useful for identifying the distribution of data points across different bins, which can help in detecting outliers or understanding the data distribution." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Creates an array of number_of_bins equally spaced values between the minimum and maximum values in data. \n", + "# These values represent the edges of the bins.\n", + "bin_edges = np.linspace(min(sample_data_with_outliers), max(sample_data_with_outliers), 10) \n", + "bin_edges\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "\n", - " # Counts the number of occurrences of each bin index in digitized. 
\n", - " # It returns an array where the value at each index is the count of data points in the corresponding bin.\n", - " return np.bincount(digitized)\n", + "# Assigns each value in data to a bin. \n", + "# It returns an array where each element is the index of the bin to which the corresponding element in data belongs.\n", + "digitized_data = np.digitize(sample_data_with_outliers, bin_edges)\n", + "digitized_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bin_counts = np.bincount(digitized_data)\n", + "bin_counts\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Manually identifying bins (example): numbers in bin 1\n", + "sample_data_with_outliers[sample_data_with_outliers <= 6.1]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Identify bins with counts below the threshold\n", + "threshold = 2\n", + "outlier_bins = np.where(bin_counts <= threshold)[0]\n", + "outlier_bins" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# How np.where works\n", + "arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n", + "np.where(arr < 9, arr,0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Extract outliers\n", + "sample_data_with_outliers[np.isin(digitized_data, outlier_bins)]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "\n", - "number_of_bins = 4\n", - "find_outliers_binning(sample_data, number_of_bins)" + "outliers = sample_data_with_outliers[np.isin(digitized_data, outlier_bins)]\n", + " \n", + "print(\"Outliers:\", outliers)\n", + "print(\"Outlier Bins:\", outlier_bins)\n", + "print(\"Bin Counts:\", bin_counts)" ] }, { @@ -259,7 +423,7 @@ "metadata": {}, 
"source": [ "\n", - "## [Dealing with Outliers Using Three Robust Linear Regression Models](https://developer.nvidia.com/blog/dealing-with-outliers-using-three-robust-linear-regression-models/)\n", + "### [📖 Dealing with Outliers Using Three Robust Linear Regression Models](https://developer.nvidia.com/blog/dealing-with-outliers-using-three-robust-linear-regression-models/)\n", "\n", "\n" ] @@ -268,7 +432,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data cleaning <- Previous](./README.md)" + "[Data cleaning ⬅ï¸](./README.md)" ] } ], diff --git a/session_1/data_preparation/preprocessing_calibration/removing_duplicates.ipynb b/session_1/data_preparation/preprocessing_calibration/removing_duplicates.ipynb index fd0be32..2d7aa73 100644 --- a/session_1/data_preparation/preprocessing_calibration/removing_duplicates.ipynb +++ b/session_1/data_preparation/preprocessing_calibration/removing_duplicates.ipynb @@ -109,7 +109,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data cleaning <- Previous](./README.md)" + "[Data cleaning ⬅ï¸](./README.md)" ] } ], diff --git a/session_1/data_preparation/preprocessing_calibration/signal_preprocessing.ipynb b/session_1/data_preparation/preprocessing_calibration/signal_preprocessing.ipynb index 22c1eab..3b4f298 100644 --- a/session_1/data_preparation/preprocessing_calibration/signal_preprocessing.ipynb +++ b/session_1/data_preparation/preprocessing_calibration/signal_preprocessing.ipynb @@ -357,7 +357,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data preparation <- Previous](./README.md)" + "[Data preparation ⬅ï¸](./README.md)" ] } ], diff --git a/session_1/data_preparation/preprocessing_calibration/time_series_data_preprocessing.ipynb b/session_1/data_preparation/preprocessing_calibration/time_series_data_preprocessing.ipynb index 6044081..0aa110d 100644 --- a/session_1/data_preparation/preprocessing_calibration/time_series_data_preprocessing.ipynb +++ 
b/session_1/data_preparation/preprocessing_calibration/time_series_data_preprocessing.ipynb @@ -13,9 +13,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Sample data sets from pydataset\n", - "\n", - "https://pydataset.readthedocs.io/en/latest/" + "## [Sample data sets from pydataset](https://pydataset.readthedocs.io/en/latest/)\n" ] }, { @@ -25,6 +23,7 @@ "outputs": [], "source": [ "from pydataset import data\n", + "import numpy as np\n", "\n", "# Get metadata about all datasets\n", "datasets_info = data()\n", @@ -33,7 +32,9 @@ "print(datasets_info.head())\n", "\n", "# Time series datasets\n", - "time_series_datasets = datasets_info[datasets_info['title'].str.contains('time series', case=False)]\n", + "time_series_datasets = datasets_info[datasets_info['title'].str.contains(\n", + " 'time series', case=False)]\n", + "\n", "print(time_series_datasets.head())" ] }, @@ -82,30 +83,14 @@ "metadata": {}, "outputs": [], "source": [ - "# The bitwise '&' operator is used in series instead of boolean 'and' operator\n", - "# define a variable of type binary\n", - "a = 0b1010 # 1010 in binary is 10 in decimal\n", - "b = 0b1011 # 1011 in binary is 11 in decimal\n", - "\n", - "# bitwise AND operation\n", - "print(a & b) # 1010 & 1011 = 1010" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", "# Introduce a large discrepancy in the unemploy column at a specific data range\n", "# Define the start and end dates\n", "start_date = '1992-01-01'\n", "end_date = '1997-01-01'\n", "\n", "# Set unemploy values to None between the start and end dates\n", - "economics_data.loc[(economics_data['date'] >= start_date) & (economics_data['date'] <= end_date), 'unemploy'] = None\n", + "economics_data.loc[np.logical_and(\n", + " economics_data['date'] >= start_date, economics_data['date'] <= end_date), 'unemploy'] = None\n", "economics_data.plot(x='date', y='unemploy')" ] }, @@ -124,10 
+109,12 @@ "source": [ "# Fill missing data using forward fill method\n", "economics_data_ffill = economics_data.copy()\n", - "economics_data_ffill['unemploy'] = economics_data_ffill['unemploy'].fillna(method='ffill')\n", + "economics_data_ffill['unemploy'] = economics_data_ffill['unemploy'].fillna(\n", + " method='ffill')\n", "\n", "# Print the modified dataset to verify the changes\n", - "print(economics_data_ffill[(economics_data_ffill['date'] >= start_date) & (economics_data_ffill['date'] <= end_date)])\n", + "print(economics_data_ffill[np.logical_and(\n", + " economics_data_ffill['date'] >= start_date, economics_data_ffill['date'] <= end_date)])\n", "\n", "economics_data_ffill.plot(x='date', y='unemploy')" ] @@ -148,10 +135,12 @@ "# Fill missing data using forward fill method\n", "economics_data_mean = economics_data.copy()\n", "mean_unemploy = economics_data['unemploy'].mean()\n", - "economics_data_mean['unemploy'] = economics_data_mean['unemploy'].fillna(mean_unemploy)\n", + "economics_data_mean['unemploy'] = economics_data_mean['unemploy'].fillna(\n", + " mean_unemploy)\n", "\n", "# Print the modified dataset to verify the changes\n", - "print(economics_data_mean[(economics_data_mean['date'] >= start_date) & (economics_data_mean['date'] <= end_date)])\n", + "print(economics_data_mean[np.logical_and(\n", + " economics_data_mean['date'] >= start_date, economics_data_mean['date'] <= end_date)])\n", "\n", "economics_data_mean.plot(x='date', y='unemploy')" ] @@ -171,10 +160,11 @@ "source": [ "# Fill missing data using linear interpolation and save to a new DataFrame\n", "economics_data_linear = economics_data.copy()\n", - "economics_data_linear['unemploy'] = economics_data_linear['unemploy'].interpolate(method='linear')\n", + "economics_data_linear['unemploy'] = economics_data_linear['unemploy'].interpolate(\n", + " method='linear')\n", "\n", "# Print the modified dataset to verify the changes\n", - "print(economics_data_linear[(economics_data_linear['date'] >= 
start_date) & (economics_data_linear['date'] <= end_date)])\n", + "print(economics_data_linear[np.logical_and(economics_data_linear['date'] >= start_date, economics_data_linear['date'] <= end_date)])\n", "\n", "# Plot the modified dataset\n", "economics_data_linear.plot(x='date', y='unemploy')" @@ -195,10 +185,11 @@ "source": [ "# Fill missing data using polynomial interpolation (order=3) and save to a new DataFrame\n", "economics_data_poly3 = economics_data.copy()\n", - "economics_data_poly3['unemploy'] = economics_data_poly3['unemploy'].interpolate(method='polynomial', order=2)\n", + "economics_data_poly3['unemploy'] = economics_data_poly3['unemploy'].interpolate(\n", + " method='polynomial', order=2)\n", "\n", "# Print the modified dataset to verify the changes\n", - "print(economics_data_poly3[(economics_data_poly3['date'] >= start_date) & (economics_data_poly3['date'] <= end_date)])\n", + "print(economics_data_poly3[np.logical_and(economics_data_poly3['date'] >= start_date, economics_data_poly3['date'] <= end_date)])\n", "\n", "# Plot the modified dataset\n", "economics_data_poly3.plot(x='date', y='unemploy')" @@ -208,7 +199,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Data preparation <- Previous](./README.md)" + "[Data preparation ⬅️](./README.md)" ] } ], diff --git a/session_1/data_preparation/task_2.md b/session_1/data_preparation/task_2.md index 1a72f8e..632d03e 100644 --- a/session_1/data_preparation/task_2.md +++ b/session_1/data_preparation/task_2.md @@ -15,4 +15,4 @@ Continue with [task 1](../understanding_data/task_1.md). - Create new features if necessary to enhance the dataset. - Normalize or standardize the data if required. 
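The last checklist item, normalizing or standardizing, can be sketched with plain NumPy; `standardize` and `normalize` are made-up helper names for this illustration (scikit-learn's `StandardScaler` and `MinMaxScaler` cover the same ground):

```python
import numpy as np

def standardize(x):
    # Z-score scaling: zero mean, unit standard deviation
    return (x - x.mean()) / x.std()

def normalize(x):
    # Min-max scaling: squeeze values into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

feature = np.array([2.0, 4.0, 6.0, 8.0])
print(standardize(feature))  # mean 0, std 1
print(normalize(feature))    # 0, 1/3, 2/3, 1
```

Standardization suits models that assume roughly Gaussian inputs; min-max scaling keeps a bounded range, which some distance-based models prefer.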
-[Understanding Data <- Previous](./README.md) \ No newline at end of file +[Understanding Data ⬅️](./README.md) \ No newline at end of file diff --git a/session_1/images/crisp_business_understanding.png b/session_1/images/crisp_business_understanding.png new file mode 100644 index 0000000..e29b8ab Binary files /dev/null and b/session_1/images/crisp_business_understanding.png differ diff --git a/session_1/images/crisp_data_preparation.png b/session_1/images/crisp_data_preparation.png new file mode 100644 index 0000000..47dd17e Binary files /dev/null and b/session_1/images/crisp_data_preparation.png differ diff --git a/session_1/images/crisp_data_understanding.png b/session_1/images/crisp_data_understanding.png new file mode 100644 index 0000000..408c6d9 Binary files /dev/null and b/session_1/images/crisp_data_understanding.png differ diff --git a/session_1/images/crisp_deployment.png b/session_1/images/crisp_deployment.png new file mode 100644 index 0000000..70f0124 Binary files /dev/null and b/session_1/images/crisp_deployment.png differ diff --git a/session_1/images/crisp_evaluation.png b/session_1/images/crisp_evaluation.png new file mode 100644 index 0000000..4200b84 Binary files /dev/null and b/session_1/images/crisp_evaluation.png differ diff --git a/session_1/images/crisp_modeling.png b/session_1/images/crisp_modeling.png new file mode 100644 index 0000000..b3a5e79 Binary files /dev/null and b/session_1/images/crisp_modeling.png differ diff --git a/session_1/modelling/README.md b/session_1/modelling/README.md index 4b8565d..f65aa9e 100644 --- a/session_1/modelling/README.md +++ b/session_1/modelling/README.md @@ -1,19 +1,19 @@ # 3. 
Linear models -A fundamental and often used form of predictive analysis is [linear regression](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-linear-regression/) +📖 A fundamental and often used form of predictive analysis is [linear regression](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-linear-regression/) We use models to make predictions. In this lesson, we demonstrate two commonly used regression models -- [Linear regression](regression/linear/linear_regression.ipynb): Simple linear regression is about explaining the dependent variable Y with independent variable X. +- [🎥 Linear regression](regression/linear/linear_regression.ipynb): Simple linear regression is about explaining the dependent variable Y with independent variable X. -- [Polynomial regression](regression/polynomial/linear_regression_poly.ipynb): Our data may not have a linear relationship, but still, we may use a linear model to fit nonlinear data. One method is to add capabilities to each variable as if they were new variables, or new features. Then, a model will be trained using these variables. This linear model is referred to as polynomial regression. +- [🎥 Polynomial regression](regression/polynomial/linear_regression_poly.ipynb): Our data may not have a linear relationship, but still, we may use a linear model to fit nonlinear data. One method is to add powers of each variable as if they were new variables, or new features. Then, a model will be trained using these features. This linear model is referred to as polynomial regression. -- [Multiple regression](multiple/multiple_feature_regression.ipynb): Is similar to linear regression, but with more than one independent value, in which we attempt to predict a value using two or more independent variables. 
+- [🎥 Multiple regression](multiple/multiple_feature_regression.ipynb): Is similar to linear regression, but with more than one independent value, in which we attempt to predict a value using two or more independent variables. -- [Robust Linear Regression Models](linear_regression_models.ipynb): Dealing with Outliers Using Three Robust Linear Regression Models. +- [🎥 Robust Linear Regression Models](linear_regression_models.ipynb): Dealing with Outliers Using Three Robust Linear Regression Models. -[Predictive Analytics <- Previous](../README.md) | -[Next -> CRISP DM: Complete example](../crisp_dm/README.md) +[Data preparation ⬅️](../data_preparation/README.md) | +[➡️ CRISP DM: Complete example](../crisp_dm/README.md) diff --git a/session_1/modelling/linear_regression_models.ipynb b/session_1/modelling/linear_regression_models.ipynb index 883eaa0..45e57b6 100644 --- a/session_1/modelling/linear_regression_models.ipynb +++ b/session_1/modelling/linear_regression_models.ipynb @@ -22,7 +22,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Sample Data" + "## [📖 Sample Data](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression)" ] }, { @@ -214,7 +214,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Linear Models <- Previous](./README.md)" + "[Linear Models ⬅️](./README.md)" ] } ], diff --git a/session_1/modelling/multiple/multiple_feature_regression.ipynb b/session_1/modelling/multiple/multiple_feature_regression.ipynb index 7de4c14..93b50dd 100644 --- a/session_1/modelling/multiple/multiple_feature_regression.ipynb +++ b/session_1/modelling/multiple/multiple_feature_regression.ipynb @@ -188,7 +188,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Linear Models <- Previous](../README.md)" + "[Linear Models ⬅️](../README.md)" ] } ], diff --git a/session_1/modelling/regression/linear/linear_regression.ipynb 
b/session_1/modelling/regression/linear/linear_regression.ipynb index 7e21a3b..308103d 100644 --- a/session_1/modelling/regression/linear/linear_regression.ipynb +++ b/session_1/modelling/regression/linear/linear_regression.ipynb @@ -8,7 +8,7 @@ "\n", "Linear regression is a fundamental statistical and machine learning technique used to model the relationship between a dependent variable (target) and one or more independent variables (features). The goal is to find the best-fitting linear equation that describes the relationship between the variables.\n", "\n", - "**Example 1: Fuel consumption in cars**\n", + "📖 **Example 1: Fuel consumption in cars**\n", "\n", "\n", "The purpose of the regression analysis is to find the best possible\n", @@ -33,7 +33,9 @@ "${\\hat{y} = \\hat{\\alpha} + \\hat{\\beta}\\cdot x}$\n", "\n", "\n", - "Least square method:\n", + "[Least square](https://en.wikipedia.org/wiki/Least_squares) method:\n", + "\n", + "The least squares method is a fundamental technique in regression analysis used to find the best-fitting line through a set of data points by minimizing the sum of the squared residuals. This method provides estimates for the slope and intercept of the regression line, which can then be used to make predictions.\n", "\n", "\"Least\n", "\n", @@ -58,7 +60,7 @@ }, { "cell_type": "code", - "execution_count": 74, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -275,7 +277,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Example 2: [Diabetes](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py)\n", + "📖 **Example 2:** [Diabetes](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py)\n", "\n", "[sklearn.datasets.load_diabetes()](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes) returns a Bunch object. 
The Bunch object has 'data' (the independent variables, diabetes_X), 'target' (the dependent variable, diabetes_y), feature_names, etc." ] @@ -312,7 +314,7 @@ }, { "cell_type": "code", - "execution_count": 88, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -367,7 +369,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Example 3: [US Health Insurance Dataset](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset)\n", + "📖 **Example 3:** [US Health Insurance Dataset](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset)\n", "\n", "[Regression Modellng using Insurance Dataset](https://www.kaggle.com/code/maverickss26/regression-modellng-using-insurance-dataset)\n" ] @@ -492,7 +494,7 @@ }, { "cell_type": "code", - "execution_count": 97, + "execution_count": 24, "metadata": {}, "outputs": [], "source": [ @@ -512,7 +514,7 @@ }, { "cell_type": "code", - "execution_count": 98, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -618,7 +620,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Linear Models <- Previous](../../README.md)" + "[Linear Models ⬅️](../../README.md)" ] } ], diff --git a/session_1/modelling/regression/polynomial/linear_regression_poly.ipynb b/session_1/modelling/regression/polynomial/linear_regression_poly.ipynb index b76d604..92b0c72 100644 --- a/session_1/modelling/regression/polynomial/linear_regression_poly.ipynb +++ b/session_1/modelling/regression/polynomial/linear_regression_poly.ipynb @@ -7,7 +7,7 @@ "# [Polynomial regression: extending linear models with basis functions](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions)\n", "\n", "\n", - "Example 1: [3rd degree polynomial](https://www.w3schools.com/python/python_ml_polynomial_regression.asp)\n", + "📖 **Example 1:** [3rd degree polynomial](https://www.w3schools.com/python/python_ml_polynomial_regression.asp)\n", "\n", "In this example 
we use [numpy.polyfit](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html)\n", "\n", @@ -72,7 +72,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Example 2: [Polynomial Regression in Python using scikit-learn](https://data36.com/polynomial-regression-python-scikit-learn/)\n", + "📖 **Example 2:** [Polynomial Regression in Python using scikit-learn](https://data36.com/polynomial-regression-python-scikit-learn/)\n", "\n", "This example demonstrates how to perform polynomial regression using Python's scikit-learn library. \n", "It begins by importing necessary libraries such as numpy, pandas, matplotlib.pyplot, PolynomialFeatures, and LinearRegression. \n", @@ -201,7 +201,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Linear Models <- Previous](../../README.md)" + "[Linear Models ⬅️](../../README.md)" ] } ], diff --git a/session_1/understanding_data/README.md b/session_1/understanding_data/README.md index cb89f35..a37f676 100644 --- a/session_1/understanding_data/README.md +++ b/session_1/understanding_data/README.md @@ -6,12 +6,12 @@ ### Sample Datasets -- Learn about few datasets provided by libraries such as +- 📖 Learn about a few datasets provided by libraries such as - [pydataset](https://pydataset.readthedocs.io/en/latest/) - [seaborn](https://seaborn.pydata.org/generated/seaborn.load_dataset.html) - [sklearn (Scikit-Learn)](https://scikit-learn.org/stable/api/sklearn.datasets.html#module-sklearn.datasets) -- Learn about other sources +- 📖 Learn about other sources - [kaggle](https://www.kaggle.com/datasets): Kaggle offers a vast collection of datasets across various domains, along with tools for data analysis and machine learning. Users can easily download datasets, participate in competitions, and collaborate with others in the data science community - [Google dataset search](https://datasetsearch.research.google.com/): A search engine for datasets across the web. 
- [AWS Public datasets](https://registry.opendata.aws/): A collection of public datasets hosted on Amazon Web Services. @@ -23,36 +23,36 @@ - Health and social conditions - Environment and energy -[Examples: How to work with datasets?](./pydataset/README.md) +[🎥 How to work with datasets?](./pydataset/README.md) ### Dataset: Healthcare Insurance Source: [kaggle](https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance) Author: [willian oliveira gibin and 1 collaborator](https://www.kaggle.com/willianoliveiragibin) -This dataset provides insights into the connection between personal attributes (such as age, gender, BMI, family size, and smoking habits), geographic factors, and their effects on medical insurance charges. It can be utilized to examine how these characteristics affect insurance costs and to create predictive models for estimating healthcare expenses. +This dataset provides insights into the connection between **personal attributes** (such as age, gender, BMI, family size, and smoking habits), **geographic factors**, and their effects on ***medical insurance charges***. It can be utilized to examine how these characteristics affect insurance costs and to create **predictive models** for estimating healthcare expenses. -[Example: Healthcare Insurance](./understanding_data.ipynb) +[🎥 Healthcare Insurance](./understanding_data.ipynb) ### Dataset: penguins > Exploring data with pandas sql Source: [seaborn](https://seaborn.pydata.org/tutorial/introduction.html) -Perform queries to identify trends using sql notation against pandas dataframe. +Perform **queries** to identify trends using **sql notation** against pandas dataframe. -[Example: Exploring data with pandas sql (penguins)](./pandas_sql.ipynb) +[🎥 Exploring data with pandas sql (penguins)](./pandas_sql.ipynb) ### Dataset: penguins > Working with database Source: create a database with dummy data. -Perform queries to identify trends using sql notation against a database. 
+Perform **queries** to identify trends using **sql** notation against a **database**. -[Example: Working with database (penguins)](./pandas_sql_db.ipynb) +[🎥 Working with database (penguins)](./pandas_sql_db.ipynb) -### Task 1: Understanding Data +### Task 1: Understanding Data 📋 -[Analyzing and Visualizing Existing Datasets](task_1.md) +[📋 Analyzing and Visualizing Existing Datasets](task_1.md) -[Predictive Analytics <- Previous](../README.md) | -[Next -> Data preparation](../data_preparation/README.md) \ No newline at end of file +[Predictive Analytics ⬅️](../README.md) | +[➡️ Data preparation](../data_preparation/README.md) \ No newline at end of file diff --git a/session_1/understanding_data/pandas_sql.ipynb b/session_1/understanding_data/pandas_sql.ipynb index f4026d5..e68c93a 100644 --- a/session_1/understanding_data/pandas_sql.ipynb +++ b/session_1/understanding_data/pandas_sql.ipynb @@ -81,7 +81,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Understanding Data <- Previous](./README.md)" + "[Understanding Data ⬅️](./README.md)" ] } ], diff --git a/session_1/understanding_data/pandas_sql_db.ipynb b/session_1/understanding_data/pandas_sql_db.ipynb index 9062888..c4723c8 100644 --- a/session_1/understanding_data/pandas_sql_db.ipynb +++ b/session_1/understanding_data/pandas_sql_db.ipynb @@ -108,7 +108,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Understanding Data <- Previous](./README.md)" + "[Understanding Data ⬅️](./README.md)" ] } ], diff --git a/session_1/understanding_data/pydataset/README.md b/session_1/understanding_data/pydataset/README.md index ff68ca8..e03f7f7 100644 --- a/session_1/understanding_data/pydataset/README.md +++ b/session_1/understanding_data/pydataset/README.md @@ -1,9 +1,9 @@ # Sample datasets -- [pydataset](./pydataset.ipynb): demonstrate how to use the pydataset library to access and analyze datasets. 
It includes code to load datasets and display their initial rows, helping to understand the structure and content of the data. +- [🎥 pydataset](./pydataset.ipynb): demonstrates how to use the pydataset library to access and analyze datasets. It includes code to load datasets and display their initial rows, helping to understand the structure and content of the data. -- [seaborn](./seaborn.ipynb): demonstrate how to use the seaborn library to access and analyze datasets. +- [🎥 seaborn](./seaborn.ipynb): demonstrates how to use the seaborn library to access and analyze datasets. -- [sklearn](./sklearn.ipynb): demonstrate how to use the sklearn library to access and analyze datasets. +- [🎥 sklearn](./sklearn.ipynb): demonstrates how to use the sklearn library to access and analyze datasets. -[Understanding data <- Previous](../README.md) \ No newline at end of file +[Understanding data ⬅️](../README.md) \ No newline at end of file diff --git a/session_1/understanding_data/pydataset/pydataset.ipynb b/session_1/understanding_data/pydataset/pydataset.ipynb index 16b364a..954ae3f 100644 --- a/session_1/understanding_data/pydataset/pydataset.ipynb +++ b/session_1/understanding_data/pydataset/pydataset.ipynb @@ -11,7 +11,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -21,11 +21,11 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ - "# import pydataset\n", + "# Import the function \"data\" from pydataset\n", "from pydataset import data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Print head of data\n", + "# Invoke the function\n", + "data()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display documentation or metadata about the dataset.\n", + "data(show_doc=True)" ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {}, + "outputs": [], + "source": [ + "# Invoke the function with the name of the dataset as an argument\n", + "data('cake')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Print head of data, by default the first 5 rows\n", "data().head()" ] }, @@ -45,13 +75,13 @@ "metadata": {}, "outputs": [], "source": [ - "# Describe data\n", + "# Describe data (non-numerical): count, unique, top, freq\n", "data().describe()\n" ] }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ @@ -93,7 +123,8 @@ "source": [ "# Search for a specific string in the 'description' column\n", "search_string = 'cars'\n", - "filtered_data = data_fr[data_fr['title'].str.contains(search_string, case=False, na=False)]\n", + "\n", + "filtered_data = data_df[data_df['title'].str.contains(search_string, case=False, na=False)]\n", "\n", "print(filtered_data)" ] @@ -138,7 +169,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Sample datasets <- Previous](./README.md)" + "[Sample datasets ⬅️](./README.md)" ] } ], diff --git a/session_1/understanding_data/pydataset/seaborn.ipynb b/session_1/understanding_data/pydataset/seaborn.ipynb index 2394668..07c3fb6 100644 --- a/session_1/understanding_data/pydataset/seaborn.ipynb +++ b/session_1/understanding_data/pydataset/seaborn.ipynb @@ -9,6 +9,15 @@ "This library offers easy access to a limited selection of example datasets."
] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install seaborn" + ] + }, { "cell_type": "code", "execution_count": 1, @@ -58,7 +67,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Sample datasets <- Previous](./README.md)" + "[Sample datasets ⬅️](./README.md)" ] } ], diff --git a/session_1/understanding_data/pydataset/sklearn.ipynb b/session_1/understanding_data/pydataset/sklearn.ipynb index fe5d479..1b27e23 100644 --- a/session_1/understanding_data/pydataset/sklearn.ipynb +++ b/session_1/understanding_data/pydataset/sklearn.ipynb @@ -14,7 +14,20 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Install packages if not already installed\n", + "# !pip install scikit-learn\n", + "# !pip install pandas\n", + "# !pip install openpyxl\n", + "# !pip install matplotlib" + ] + }, + { + "cell_type": "code", + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -25,7 +38,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -57,13 +70,13 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Save dataframes to excel files\n", - "diabetes_X.to_excel('./data/diabetes_X.xlsx')\n", - "diabetes_y.to_excel('./data/diabetes_y.xlsx')" + "diabetes_X.to_excel('../../../data/diabetes_X.xlsx')\n", + "diabetes_y.to_excel('../../../data/diabetes_y.xlsx')" ] }, { @@ -106,7 +119,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Sample datasets <- Previous](./README.md)" + "[Sample datasets ⬅️](./README.md)" ] } ], diff --git a/session_1/understanding_data/task_1.md b/session_1/understanding_data/task_1.md index 4461221..be49920 100644 --- a/session_1/understanding_data/task_1.md +++ b/session_1/understanding_data/task_1.md @@ -18,4 +18,4 @@ The objective of this
task is to familiarize yourself with the process of loadin - Load the selected dataset into a pandas DataFrame. - Display the first few rows of the dataset to understand its structure. -[Understanding Data <- Previous](./README.md) \ No newline at end of file +[Understanding Data ⬅️](./README.md) \ No newline at end of file diff --git a/session_1/understanding_data/understanding_data.ipynb b/session_1/understanding_data/understanding_data.ipynb index 45cf0df..4c11fdf 100644 --- a/session_1/understanding_data/understanding_data.ipynb +++ b/session_1/understanding_data/understanding_data.ipynb @@ -12,13 +12,14 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Load libraries\n", "import pandas as pd # used for data analysis\n", - "import matplotlib.pyplot as plt # used for data visualization" + "import matplotlib.pyplot as plt # used for data visualization\n", + "import numpy as np # used for scientific computing" ] }, { @@ -27,16 +28,13 @@ "source": [ "### Loading the Dataset\n", - "**Row-strings and regular expressions**\n", - "\n", - "In Python, the r prefix before a string literal indicates a [raw string](https://realpython.com/python-raw-strings/). \n", - "Raw strings treat backslashes (\\) as literal characters and do not interpret them as escape characters. 
\n", - "This is particularly useful for regular expressions, file paths, and other scenarios where backslashes are common.\n" + "Source: [kaggle](https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance) \\\n", + "Author: [willian oliveira gibin and 1 collaborator](https://www.kaggle.com/willianoliveiragibin)\n" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -108,39 +106,115 @@ "data.tail()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [Pandas query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)\n", + "\n", + "The pandas.DataFrame.query method allows you to query a DataFrame using a boolean expression. \n", + "The expression is evaluated using the DataFrame's columns as variables. \n", + "\n", + "Here are the key rules and features to keep in mind when using DataFrame.query:\n", + "\n", + "```python\n", + "DataFrame.query(expr, inplace=False, **kwargs)\n", + "```\n", + "\n", + "Parameters\n", + "\n", + "- expr: str\n", + "The query string to evaluate. This string should be a valid Python expression where DataFrame columns are treated as variables.\n", + "\n", + "- inplace: bool, default False\n", + "If True, modifies the DataFrame in place. 
Otherwise, returns a new DataFrame.\n", + "\n", + "- kwargs: Additional keyword arguments passed to the underlying eval function.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use query to get data for age between 18 and 21:\n", + "youth = data.query('age >= 18 and age <= 21')\n", + "youth" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Get data for age between 18 and 21.\n", + "# Alternatively\n", + "# Use query to get data for age between 18 and 21:\n", "youth = data[data['age'].between(18,21)]\n", "youth" ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Use query to get data for adults and females:\n", + "adult = data.query('age > 21')\n", + "female = data.query('sex == \"female\"')\n", + "\n", + "adult_smoker = data.query('age > 21 and smoker == \"yes\"')\n", + "adult_smoker" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Alternatively\n", "# Get adults\n", "adult = data[data['age'] > 21]\n", "\n", "# Get all female\n", - "female = data[data['sex'] == 'female']" + "female = data[data['sex'] == 'female']\n", + "\n", + "# To get adults who smoke, we need logical **and** operator: age > 21 and smoker == \"yes\"\n", + "# We can use np.logical_and function to do this:\n", + "adult_smoker = data[np.logical_and(data['age'] > 21, data['smoker'] == 'yes')]\n", + "adult_smoker" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Variables\n", + "min_age = 21\n", + "region_name = 'southwest'\n", + "\n", + "# Query: Select rows using variables\n", + "query = f'age > {min_age} and region == \"{region_name}\"'\n", + "data.query(query)" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ - "# Define features and target\n", - "insurance_X = pd.DataFrame(data, columns=[\"age\",\"sex\",\"bmi\",\"children\",\"smoker\",\"region\"])\n", - "insurance_y = pd.DataFrame(data, columns=['charges'])" + "# Define features and target (Usually features are called X and target is called y)\n", + "insurance_features = pd.DataFrame(data, columns=[\"age\",\"sex\",\"bmi\",\"children\",\"smoker\",\"region\"])\n", + "insurance_target = pd.DataFrame(data, columns=['charges'])\n", + "insurance_features, insurance_target" ] }, { @@ -149,16 +223,27 @@ "metadata": {}, "outputs": [], "source": [ - "# Plot the data\n", - "fig, axs = plt.subplots(2, 3, figsize=(15, 6))\n", - "features = insurance_X.columns\n", + "# Create subplots for age, bmi, and children (3 plots in total)\n", + "\n", + "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))\n", + "\n", + "# Scatter plot for age vs charges\n", + "ax[0].scatter(insurance_features['age'], insurance_target['charges'], color='skyblue', edgecolor='black')\n", + "ax[0].set_title('Age vs Charges')\n", + "ax[0].set_xlabel('Age')\n", + "ax[0].set_ylabel('Charges')\n", "\n", - "for i, ax in enumerate(axs.flatten()):\n", - " ax.scatter(data[features[i]], insurance_y)\n", - " ax.set_xlabel(features[i])\n", - " ax.set_ylabel(insurance_y.columns[0])\n", + "# Plot bmi vs charges\n", + "ax[1].scatter(insurance_features['bmi'], insurance_target['charges'], color='lightgreen', edgecolor='black')\n", + "ax[1].set_title('bmi vs Charges')\n", + "ax[1].set_xlabel('bmi')\n", + "\n", + "\n", + "# Plot children vs charges\n", + "ax[2].scatter(insurance_features['children'], insurance_target['charges'], color='lightcoral', edgecolor='black')\n", + "ax[2].set_title('children vs Charges')\n", + "ax[2].set_xlabel('children')\n", "\n", - "plt.tight_layout()\n", "plt.show()" ] }, @@ -166,13 +251,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Understanding Data <- Previous](./README.md)" + 
"[Understanding Data ⬅️](./README.md)" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3.10.2 64-bit", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -188,12 +273,7 @@ "pygments_lexer": "ipython3", "version": "3.12.2" }, - "orig_nbformat": 4, - "vscode": { - "interpreter": { - "hash": "3bd012613331a160b6ab7096b9a4a052284afc8931bfa34ed3cbb01db99d1af1" - } - } + "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2
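The `DataFrame.query` cells added to `understanding_data.ipynb` above pair each query with its boolean-indexing equivalent. A minimal standalone sketch of that equivalence — using a small hypothetical frame with the same column names (`age`, `smoker`, `charges`), not the Kaggle insurance data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring the columns used in the notebook
data = pd.DataFrame({
    "age": [19, 25, 40, 20],
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [1000.0, 2000.0, 3000.0, 1500.0],
})

# query(): columns are referenced by name inside a boolean expression
adult_smoker = data.query('age > 21 and smoker == "yes"')

# Equivalent boolean indexing with np.logical_and, as in the notebook
adult_smoker_alt = data[np.logical_and(data["age"] > 21, data["smoker"] == "yes")]
assert adult_smoker.equals(adult_smoker_alt)

# Variables can be interpolated with an f-string, as the notebook does
min_age = 21
adults = data.query(f"age > {min_age}")
print(adults)
```

An alternative to f-string interpolation is `query`'s `@` syntax (`data.query("age > @min_age")`), which avoids quoting issues with string-valued variables.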