{ "cells": [ { "cell_type": "markdown", "id": "8275cfcb", "metadata": {}, "source": [ "# Persisting trained models and scalers" ] }, { "cell_type": "markdown", "id": "7e4422a4", "metadata": {}, "source": [ "## 1. Abstract" ] }, { "cell_type": "markdown", "id": "97a6e0e1", "metadata": {}, "source": [ "The usual work of a data analyst consists of analyzing data with statistical and machine learning techniques and then presenting the results in a report.\n", "The situation is different when an application must use the model at runtime. In that case, retraining the model before every prediction is very inefficient. It is more convenient to train the model once, store it, and make it available to the program, or to the part of the program that needs it.
\n", "Python pickles can be used for this: the trained model (and the scalers fitted during training) can be stored for later use, so the same model does not have to be retrained each time a prediction is needed." ] }, { "cell_type": "code", "execution_count": 1, "id": "0cb47789", "metadata": {}, "outputs": [], "source": [ "import os\n", "import random\n", "import pickle\n", "import pandas as pd\n", "import numpy as np\n", "import sklearn\n", "from sklearn import datasets\n", "from sklearn import model_selection\n", "from sklearn import preprocessing\n", "from sklearn.metrics import classification_report\n", "from sklearn.preprocessing import StandardScaler\n", "#\n", "separador=os.sep" ] }, { "cell_type": "markdown", "id": "6b9e9d26", "metadata": {}, "source": [ "## 2. Basic use of Pickle" ] }, { "attachments": { "imagen.png": { "image/png": "" } }, "cell_type": "markdown", "id": "3ff0335c", "metadata": {}, "source": [ "![imagen.png](attachment:imagen.png)" ] }, { "cell_type": "markdown", "id": "a37bbe41", "metadata": {}, "source": [ "Image obtained from: https://www.programaenlinea.net/los-pickles-python/" ] }, { "cell_type": "markdown", "id": "73e38338", "metadata": {}, "source": [ "Pickle is a library for serializing and deserializing objects. An object can be stored in a binary file and later recovered from that file. For more information, see https://docs.python.org/3/library/pickle.html." ] }, { "cell_type": "markdown", "id": "41deebff", "metadata": {}, "source": [ "### 2.1. Saving a simple object" ] }, { "cell_type": "markdown", "id": "706925bb", "metadata": {}, "source": [ "#### Object creation:" ] }, { "cell_type": "code", "execution_count": 2, "id": "863dd78a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Object: <__main__.Car object at 0x7b76cc62b010>\n", "Attribute: brand= Jeep\n" ] } ], "source": [ "# Definition of a class\n", "class Car():\n", "    def __init__(self, brand):\n", "        self.brand = brand\n", "# Creation of an instance of this class:\n", "carOne=Car('Jeep')\n", "print('Object: ',carOne)\n", "print('Attribute: brand=',carOne.brand)" ] }, { "cell_type": "markdown", "id": "7f004710", "metadata": {}, "source": [ "#### Saving the object:" ] }, { "cell_type": "code", "execution_count": 3, "id": "2dda300c", "metadata": {}, "outputs": [], "source": [ "fileName='object.pkl'\n", "# The with block closes the file as soon as the object is written\n", "with open(fileName, 'wb') as f:\n", "    pickle.dump(carOne, f)" ] }, { "cell_type": "markdown", "id": "2c340193", "metadata": {}, "source": [ "#### Retrieving the object from file:" ] }, { "cell_type": "code", "execution_count": 4, "id": "7dc2e86c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Attribute loaded: brand= Jeep\n" ] } ], "source": [ "with open(fileName, 'rb') as f:\n", "    otherCar=pickle.load(f)\n", "print('Attribute loaded: brand=',otherCar.brand)\n", "# Delete the file after retrieving the object\n", "os.remove(fileName)" ] }, { "cell_type": "markdown", "id": "1cfdbbb4", "metadata": {}, "source": [ "## 3. Pickling a model and a scaler" ] }, { "cell_type": "markdown", "id": "5d08433a", "metadata": {}, "source": [ "### 3.1. Training a model: Diabetes Prediction" ] }, { "cell_type": "markdown", "id": "3290c76b", "metadata": {}, "source": [ "#### Context" ] }, { "cell_type": "markdown", "id": "38d6c501", "metadata": {}, "source": [ "This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
The objective is to predict, based on diagnostic measurements, whether a patient has diabetes. The dataset can be downloaded from Kaggle." ] }, { "cell_type": "markdown", "id": "0425965c", "metadata": {}, "source": [ "#### The dataset" ] }, { "cell_type": "code", "execution_count": 5, "id": "7e8eb121", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Diabetes']\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Pregnancies</th><th>Glucose</th><th>BloodPressure</th><th>SkinThickness</th><th>Insulin</th><th>BMI</th><th>DiabetesPedigreeFunction</th><th>Age</th><th>Diabetes</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>627.00</td><td>50</td><td>1</td></tr>\n",
" <tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>351.00</td><td>31</td><td>0</td></tr>\n",
" <tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>672.00</td><td>32</td><td>1</td></tr>\n",
" <tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>167.00</td><td>21</td><td>0</td></tr>\n",
" <tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2288.00</td><td>33</td><td>1</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>763</th><td>10</td><td>101</td><td>76</td><td>48</td><td>180</td><td>32.9</td><td>171.00</td><td>63</td><td>0</td></tr>\n",
" <tr><th>764</th><td>2</td><td>122</td><td>70</td><td>27</td><td>0</td><td>36.8</td><td>0.34</td><td>27</td><td>0</td></tr>\n",
" <tr><th>765</th><td>5</td><td>121</td><td>72</td><td>23</td><td>112</td><td>26.2</td><td>245.00</td><td>30</td><td>0</td></tr>\n",
" <tr><th>766</th><td>1</td><td>126</td><td>60</td><td>0</td><td>0</td><td>30.1</td><td>349.00</td><td>47</td><td>1</td></tr>\n",
" <tr><th>767</th><td>1</td><td>93</td><td>70</td><td>31</td><td>0</td><td>30.4</td><td>315.00</td><td>23</td><td>0</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>768 rows × 9 columns</p>\n", "</div>" ], "text/plain": [ " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", "0 6 148 72 35 0 33.6 \n", "1 1 85 66 29 0 26.6 \n", "2 8 183 64 0 0 23.3 \n", "3 1 89 66 23 94 28.1 \n", "4 0 137 40 35 168 43.1 \n", ".. ... ... ... ... ... ... \n", "763 10 101 76 48 180 32.9 \n", "764 2 122 70 27 0 36.8 \n", "765 5 121 72 23 112 26.2 \n", "766 1 126 60 0 0 30.1 \n", "767 1 93 70 31 0 30.4 \n", "\n", " DiabetesPedigreeFunction Age Diabetes \n", "0 627.00 50 1 \n", "1 351.00 31 0 \n", "2 672.00 32 1 \n", "3 167.00 21 0 \n", "4 2288.00 33 1 \n", ".. ... ... ... \n", "763 171.00 63 0 \n", "764 0.34 27 0 \n", "765 245.00 30 0 \n", "766 349.00 47 1 \n", "767 315.00 23 0 \n", "\n", "[768 rows x 9 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The Dataset from csv file\n", "df = pd.read_csv('datasets'+str(separador)+'diabetes.csv')\n", "fields=df.columns.tolist()\n", "print(fields)\n", "df" ] }, { "cell_type": "code", "execution_count": 6, "id": "c57d37c7", "metadata": {}, "outputs": [], "source": [ "# Preparing the data\n", "X_ = df.iloc[:,:len(fields)-1].values\n", "y_=df.iloc[:,len(fields)-1]\n", "y=np.array(y_)\n", "# Scaling data\n", "sc = StandardScaler()\n", "X = sc.fit_transform(X_)\n", "# Split in Train and Test\n", "X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=7)" ] }, { "cell_type": "markdown", "id": "f1c03467", "metadata": {}, "source": [ "#### Training a Model" ] }, { "cell_type": "code", "execution_count": 7, "id": "407552e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Classification report\n", " precision recall f1-score support\n", "\n", " 0 0.79 0.92 0.85 97\n", " 1 0.81 0.60 0.69 57\n", "\n", " accuracy 0.80 154\n", " macro avg 0.80 0.76 0.77 154\n", "weighted avg 0.80 0.80 0.79 154\n", "\n" ] } ], "source": [ "# Training the Logistic Regression model\n", "from sklearn import linear_model\n", "model = linear_model.LogisticRegression()\n", "model.fit(X_train,Y_train)\n", "predictions = model.predict(X_test)\n", "print('\\nClassification report\\n',classification_report(Y_test, predictions))" ] }, { "cell_type": "markdown", "id": "17b6b024", "metadata": {}, "source": [ "#### Persistence of Model and Scaler" ] }, { "cell_type": "markdown", "id": "9a8e4be2", "metadata": {}, "source": [ "Trained models and scalers can be deployed as pickle files in large applications or complex systems where retraining at prediction time is ruled out for performance reasons. A pickle file stores the serialized state of the fitted object (for example, the coefficients learned by the model and the means and variances computed by the scaler), so the training data is not needed to restore it." ] }, { "cell_type": "code", "execution_count": 8, "id": "27b50662", "metadata": {}, "outputs": [], "source": [ "# Saving model and scaler\n", "fileModel='model.pkl'\n", "with open(fileModel, 'wb') as f:\n", "    pickle.dump(model, f)\n", "fileScaler='scaler.pkl'\n", "with open(fileScaler, 'wb') as f:\n", "    pickle.dump(sc, f)" ] }, { "cell_type": "code", "execution_count": 9, "id": "0573c2ee", "metadata": {}, "outputs": [], "source": [ "# Loading model and scaler from binary files:\n", "with open(fileModel, 'rb') as f:\n", "    modelLoaded=pickle.load(f)\n", "with open(fileScaler, 'rb') as f:\n", "    scalerLoaded=pickle.load(f)\n", "# Delete the files after retrieving the objects\n", "os.remove(fileModel)\n", "os.remove(fileScaler)" ] }, { "cell_type": "markdown", "id": "6612b9f5", "metadata": {}, "source": [ "#### Predictions" ] }, { "cell_type": "markdown", "id": "351a1d56", "metadata": {}, "source": [ "In this step we use the model and the scaler loaded from the pickle files to predict the diabetes diagnosis for a few randomly chosen rows." ] }, { "cell_type": "code", "execution_count": null, "id": "94ef7393", "metadata": {}, "outputs": [], "source": [ "# We select some random rows from the dataset\n", "nb_of_rows=5\n", "X_n=[]\n", "random_rows=[]\n", "for i in range(nb_of_rows):\n", "    # randrange excludes the upper bound, so the index is always valid\n", "    random_num=random.randrange(df.shape[0])\n", "    X_n.append(list(X_[random_num]))\n", "    random_rows.append(random_num)\n", "X_New=np.array(X_n)\n", "# Reuse the statistics learned during training: transform, do not refit\n", "X_New_Norm=scalerLoaded.transform(X_New)\n", "print(X_New_Norm)" ] }, { "cell_type": "code", "execution_count": null, "id": "97cc9f06", "metadata": {}, "outputs": [], "source": [ "# Prediction:\n", "y_pred=modelLoaded.predict(X_New_Norm)\n", "y_real=[]\n", "for i in range(nb_of_rows):\n", "    y_real.append(y_[random_rows[i]])\n", "print('Predicted: ',y_pred,' - Real:',np.array(y_real))" ] }, { "cell_type": "markdown", "id": "44a8b0c0", "metadata": {}, "source": [ "## 4. Conclusions" ] }, { "cell_type": "markdown", "id": "c6ccf291", "metadata": {}, "source": [ "The pickle library is a convenient tool for persisting in-memory objects so that they can be reused in a later session or by another program.\n", "It is especially useful when many predictions must be made after a single training run. Loading the stored model and scaler avoids retraining for every prediction, which makes the system faster."
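, "\n", "\n", "
The load-once, predict-many pattern summarized above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the parameter values, the `bundle` dictionary and the helper names `save_bundle`, `load_bundle` and `predict_proba` are hypothetical and are not taken from the notebook or from the diabetes dataset. The point is that the statistics and weights fixed at training time are pickled once and then reused without refitting.

```python
import math
import os
import pickle
import tempfile

# Hypothetical fitted parameters for two features (illustrative numbers only,
# not values learned from the diabetes dataset)
bundle = {
    'mean': [120.0, 32.0],   # per-feature means seen during training
    'scale': [30.0, 8.0],    # per-feature standard deviations
    'coef': [1.1, 0.7],      # logistic regression weights
    'intercept': -0.8,
}

def save_bundle(obj, path):
    # The with block closes the file even if dump() raises
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)

def load_bundle(path):
    with open(path, 'rb') as fh:
        return pickle.load(fh)

def predict_proba(params, row):
    # Standardize with the stored training statistics (no refitting),
    # then apply the logistic function to the linear score
    z = [(x - m) / s for x, m, s in zip(row, params['mean'], params['scale'])]
    score = params['intercept'] + sum(w * v for w, v in zip(params['coef'], z))
    return 1.0 / (1.0 + math.exp(-score))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'bundle.pkl')
    save_bundle(bundle, path)    # training side: persist once
    loaded = load_bundle(path)   # prediction side: load once

# Reuse the loaded parameters for every new row
new_rows = [[150.0, 40.0], [90.0, 24.0]]
probabilities = [predict_proba(loaded, row) for row in new_rows]
print(probabilities)
```

One possible variation for the notebook itself is to store the fitted `StandardScaler` and `LogisticRegression` objects in a single dictionary and pickle that dictionary to one file, so that the scaler and the model always travel together.
"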
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }