{ "cells": [ { "cell_type": "markdown", "id": "fe647680", "metadata": {}, "source": [ "# Loading a Dataset from different file types" ] }, { "cell_type": "markdown", "id": "9cbece8d", "metadata": {}, "source": [ "## Abstract" ] }, { "cell_type": "markdown", "id": "3a7b8856", "metadata": {}, "source": [ "This notebook was created to show the difference of loading a dataset from different data file formats.
\n", "This dataset contains data about stroke to people with various vital checks. It was downloaded from Kaggle: https://www.kaggle.com/datasets/lirilkumaramal/heart-stroke/download. The dataset has 43,400 rows and 12 variables that, stored in a csv file, weigh 2.6Mb.
\n", "During this notebook you could experiment loading and writing times into different data saving formats: csv (text), xlsx (excel) and pkl (binary) and also, we load data from a SQLite Database." ] }, { "cell_type": "markdown", "id": "1904160f", "metadata": {}, "source": [ "### Libraries" ] }, { "cell_type": "code", "execution_count": 4, "id": "bbd2b9da", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import datetime as datetime\n", "import pickle\n", "import sqlite3\n", "import os\n", "#\n", "# El separador presentado abajo es multi OS.\n", "separador = os.path.sep" ] }, { "cell_type": "markdown", "id": "175cddf2", "metadata": {}, "source": [ "## 1. Loading from a CSV (text) file" ] }, { "cell_type": "code", "execution_count": 7, "id": "bf0bf0c6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time of a Pandas Dataframe from a text CSV file: 0:00:00.163215\n" ] } ], "source": [ "# Loading time from a CSV file\n", "init0=datetime.datetime.now()\n", "archivoTrainStrokesCsv='datasets'+str(separador)+'train_strokes.csv'\n", "df=pd.read_csv(archivoTrainStrokesCsv)\n", "end0=datetime.datetime.now()\n", "delta0=end0-init0\n", "print('Loading time of a Pandas Dataframe from a text CSV file: ',delta0)\n" ] }, { "cell_type": "markdown", "id": "38def24c", "metadata": {}, "source": [ "### Storing dataset into Excel and binary format" ] }, { "cell_type": "code", "execution_count": 8, "id": "878ab9d4", "metadata": {}, "outputs": [], "source": [ "archivoTrainStrokesPkl='datasets'+str(separador)+'train_strokes.pkl'\n", "archivoTrainStrokesExcel='datasets'+str(separador)+'train_strokes.xlsx'\n", "df.to_pickle(archivoTrainStrokesPkl)\n", "df.to_excel(archivoTrainStrokesExcel)" ] }, { "cell_type": "markdown", "id": "7c326e05", "metadata": {}, "source": [ "## 2. Loading dataset from a Database" ] }, { "cell_type": "code", "execution_count": 9, "id": "ba45ac69", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time of a Pandas Dataframe from a DataBase: 0:00:00.473085\n" ] } ], "source": [ "# Loading time from a DataBase\n", "init=datetime.datetime.now()\n", "# Crear el cursor y la conexión\n", "nombreDB='datasets'+str(separador)+'train_strokes.db'\n", "conexion = sqlite3.connect(nombreDB)\n", "cursor = conexion.cursor()\n", "# SELECT\n", "sentence=\"SELECT * FROM strokes\"\n", "df2 = pd.read_sql_query(sentence, conexion)\n", "end=datetime.datetime.now()\n", "delta=end-init\n", "print('Loading time of a Pandas Dataframe from a DataBase: ',delta)" ] }, { "cell_type": "markdown", "id": "665f9112", "metadata": {}, "source": [ "## 3. Loading dataset from binary file" ] }, { "cell_type": "code", "execution_count": 10, "id": "27911d82", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time of a Pandas Dataframe from a binary PKL file: 0:00:00.014409\n" ] } ], "source": [ "# Loading time from a PKL file\n", "init1=datetime.datetime.now()\n", "df=pd.read_pickle(archivoTrainStrokesPkl)\n", "end1=datetime.datetime.now()\n", "delta1=end1-init1\n", "print('Loading time of a Pandas Dataframe from a binary PKL file: ',delta1)" ] }, { "cell_type": "markdown", "id": "ab71a71d", "metadata": {}, "source": [ "## 4. Loading dataset from an Excel file" ] }, { "cell_type": "code", "execution_count": 12, "id": "9e06ade8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading time of a Pandas Dataframe from an Excel XLSX file: 0:00:22.850002\n" ] } ], "source": [ "# Loading time from a XLS file\n", "init2=datetime.datetime.now()\n", "df=pd.read_excel(archivoTrainStrokesExcel)\n", "end2=datetime.datetime.now()\n", "delta2=end2-init2\n", "print('Loading time of a Pandas Dataframe from an Excel XLSX file: ',delta2)" ] }, { "cell_type": "markdown", "id": "096cef14", "metadata": {}, "source": [ "## 5. Results Obtained" ] }, { "cell_type": "code", "execution_count": 13, "id": "0c40050d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading from a Binary PKL File: 0:00:00.014409\n", "Loading from a Database: 0:00:00.473085\n", "Loading from a Text CSV File: 0:00:00.163215\n", "Loading from an Excel File: 0:00:22.850002\n" ] } ], "source": [ "print('Loading from a Binary PKL File: ',delta1)\n", "print('Loading from a Database: ',delta)\n", "print('Loading from a Text CSV File: ',delta0)\n", "print('Loading from an Excel File: ',delta2)" ] }, { "cell_type": "markdown", "id": "b15ab856", "metadata": {}, "source": [ "## 6. Conclusions" ] }, { "cell_type": "markdown", "id": "4857f933", "metadata": {}, "source": [ "The fastest way to load a dataset is from a binary format file (PKL). The disadvantage is that such file can't be browsed or edited. To load a dataset from a text file as CSV takes near than 6 times more time than from a binary file, but these files are very flexible. The slowest way to load a dataset is finally to load it from an Excel file (XLS). It takes more than 200 times compared with binary files and near than 30 times compared with text files.
\n", "It is interesting too that loading data from pkl and from csv files are fastest than loading the same data from a Database. But It is remarckable the performance of working with pickles.
\n", "In https://sebastianraschka.com/Articles/2013_sqlite_database.html#results-and-conclusions another interesting comparison with larger databases was made." ] }, { "cell_type": "code", "execution_count": null, "id": "e395a050", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }