{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fe647680",
   "metadata": {},
   "source": [
    "# Loading a Dataset from different file types"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cbece8d",
   "metadata": {},
   "source": [
    "## Abstract"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a7b8856",
   "metadata": {},
   "source": [
    "This notebook was created to show the difference of loading a dataset from different data file formats.<br>\n",
    "This dataset contains data about stroke to people with various vital checks. It was downloaded from Kaggle: https://www.kaggle.com/datasets/lirilkumaramal/heart-stroke/download. The dataset has 43,400 rows and 12 variables that, stored in a csv file, weigh 2.6Mb.<br>\n",
    "During this notebook you could experiment loading and writing times into different data saving formats: csv (text), xlsx (excel) and pkl (binary) and also, we load data from a SQLite Database."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1904160f",
   "metadata": {},
   "source": [
    "### Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bbd2b9da",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import datetime as datetime\n",
    "import pickle\n",
    "import sqlite3\n",
    "import os\n",
    "#\n",
    "# El separador presentado abajo es multi OS.\n",
    "separador = os.path.sep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "175cddf2",
   "metadata": {},
   "source": [
    "## 1. Loading from a CSV (text) file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "bf0bf0c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading time of a Pandas Dataframe from a text CSV file:  0:00:00.163215\n"
     ]
    }
   ],
   "source": [
    "# Loading time from a CSV file\n",
    "init0=datetime.datetime.now()\n",
    "archivoTrainStrokesCsv='datasets'+str(separador)+'train_strokes.csv'\n",
    "df=pd.read_csv(archivoTrainStrokesCsv)\n",
    "end0=datetime.datetime.now()\n",
    "delta0=end0-init0\n",
    "print('Loading time of a Pandas Dataframe from a text CSV file: ',delta0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38def24c",
   "metadata": {},
   "source": [
    "### Storing dataset into Excel and binary format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "878ab9d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "archivoTrainStrokesPkl='datasets'+str(separador)+'train_strokes.pkl'\n",
    "archivoTrainStrokesExcel='datasets'+str(separador)+'train_strokes.xlsx'\n",
    "df.to_pickle(archivoTrainStrokesPkl)\n",
    "df.to_excel(archivoTrainStrokesExcel)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c326e05",
   "metadata": {},
   "source": [
    "## 2. Loading dataset from a Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ba45ac69",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading time of a Pandas Dataframe from a DataBase:  0:00:00.473085\n"
     ]
    }
   ],
   "source": [
    "# Loading time from a DataBase\n",
    "init=datetime.datetime.now()\n",
    "# Crear el cursor y la conexión\n",
    "nombreDB='datasets'+str(separador)+'train_strokes.db'\n",
    "conexion = sqlite3.connect(nombreDB)\n",
    "cursor = conexion.cursor()\n",
    "# SELECT\n",
    "sentence=\"SELECT * FROM strokes\"\n",
    "df2 = pd.read_sql_query(sentence, conexion)\n",
    "end=datetime.datetime.now()\n",
    "delta=end-init\n",
    "print('Loading time of a Pandas Dataframe from a DataBase: ',delta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "665f9112",
   "metadata": {},
   "source": [
    "## 3. Loading dataset from binary file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "27911d82",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading time of a Pandas Dataframe from a binary PKL file:  0:00:00.014409\n"
     ]
    }
   ],
   "source": [
    "# Loading time from a PKL file\n",
    "init1=datetime.datetime.now()\n",
    "df=pd.read_pickle(archivoTrainStrokesPkl)\n",
    "end1=datetime.datetime.now()\n",
    "delta1=end1-init1\n",
    "print('Loading time of a Pandas Dataframe from a binary PKL file: ',delta1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab71a71d",
   "metadata": {},
   "source": [
    "## 4. Loading dataset from an Excel file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9e06ade8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading time of a Pandas Dataframe from an Excel XLSX file:  0:00:22.850002\n"
     ]
    }
   ],
   "source": [
    "# Loading time from a XLS file\n",
    "init2=datetime.datetime.now()\n",
    "df=pd.read_excel(archivoTrainStrokesExcel)\n",
    "end2=datetime.datetime.now()\n",
    "delta2=end2-init2\n",
    "print('Loading time of a Pandas Dataframe from an Excel XLSX file: ',delta2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "096cef14",
   "metadata": {},
   "source": [
    "## 5. Results Obtained"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0c40050d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading from a Binary PKL File:  0:00:00.014409\n",
      "Loading from a Database:         0:00:00.473085\n",
      "Loading from a Text CSV File:    0:00:00.163215\n",
      "Loading from an Excel File:      0:00:22.850002\n"
     ]
    }
   ],
   "source": [
    "print('Loading from a Binary PKL File: ',delta1)\n",
    "print('Loading from a Database:        ',delta)\n",
    "print('Loading from a Text CSV File:   ',delta0)\n",
    "print('Loading from an Excel File:     ',delta2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b15ab856",
   "metadata": {},
   "source": [
    "## 6. Conclusions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4857f933",
   "metadata": {},
   "source": [
    "The fastest way to load a dataset is from a binary format file (PKL). The disadvantage is that such file can't be browsed or edited. To load a dataset from a text file as CSV takes near than 6 times more time than from a binary file, but these files are very flexible. The slowest way to load a dataset is finally to load it from an Excel file (XLS). It takes more than 200 times compared with binary files and near than 30 times compared with text files. <br>\n",
    "It is interesting too that loading data from pkl and from csv files are fastest than loading the same data from a Database. But It is remarckable the performance of working with pickles. <br>\n",
    "In https://sebastianraschka.com/Articles/2013_sqlite_database.html#results-and-conclusions another interesting comparison with larger databases was made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e395a050",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}