{ "cells": [ { "cell_type": "markdown", "id": "e3da6b37", "metadata": {}, "source": [ "# Datasets, Pandas DataFrames and Databases" ] }, { "cell_type": "markdown", "id": "01876e7e", "metadata": {}, "source": [ "## Abstract" ] }, { "cell_type": "markdown", "id": "e8a1728b", "metadata": {}, "source": [ "Para la adquisición de datos, en general se dispone de archivos que contienen los datos almacenados. Sin embargo, el modo más común para persistir información es en bases de datos.\n", "En esta notebook se realizarán operaciones básicas con bases de datos. En particular se utiliza SQLite interactuando con DataFrames Pandas." ] }, { "cell_type": "markdown", "id": "0e081dc0", "metadata": {}, "source": [ "## 1. Introduccion" ] }, { "cell_type": "markdown", "id": "08b1d8cc", "metadata": {}, "source": [ "### Acerca de SQLite" ] }, { "cell_type": "markdown", "id": "3d7979a5", "metadata": {}, "source": [ "\"SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world with more than a trillon of databeses (1e12) in use.\" (del Sitio oficial: https://www.sqlite.org/index.html).
\n", "SQLite permite crear una base de datos SQL en tiempo de ejecución con solo importar una librería y crear el archivo. Mediante sentencias SQL puede, insertar datos, consultarlos, modificar tablas, etc. Puede navegar las bases de datos con https://sqlitebrowser.org/. Y puede eliminar la DB símplemente eliminando el archivo.\n", "

\n", "Para descargar e instalar Sqlite Browser: https://sqlitebrowser.org/dl/
\n", "
\n", "Para instalar el módulo SQLlite:
\n", "- conda install sqlite3 (instalar desde Anaconda)
\n", "- pip install sqlite3 (instalar desde Terminal - Linux / MacOS - o desde DOS / Powershell - Windows)
\n", "\n" ] }, { "cell_type": "markdown", "id": "a9a990fa", "metadata": {}, "source": [ "### Librerias" ] }, { "cell_type": "code", "execution_count": 6, "id": "fb7fe131", "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "import pandas as pd\n", "import datetime as datetime\n", "import os\n", "#\n", "# El separador presentado abajo es multi OS.\n", "separador = os.path.sep" ] }, { "cell_type": "markdown", "id": "fec08141", "metadata": {}, "source": [ "En este ejemplo, cargaremos un dataset desde un archivo, crearemos una base de datos y almacenaremos el daframe en la base de datos." ] }, { "cell_type": "markdown", "id": "4b17b6aa", "metadata": {}, "source": [ "## 2. DDL" ] }, { "cell_type": "markdown", "id": "21cb5921", "metadata": {}, "source": [ "Dentro de SQL, un sublenguaje es DDL (Data Definition Languaje). Consiste en un conjunto de sentencias para la definición y modificación de la base de datos y sus tablas." ] }, { "cell_type": "code", "execution_count": 7, "id": "b49f1b6a", "metadata": {}, "outputs": [], "source": [ "# Cargamos los datos desde un CSV\n", "archivoStrokesCsv='datasets'+str(separador)+'train_strokes.csv'\n", "df=pd.read_csv(archivoStrokesCsv)" ] }, { "cell_type": "code", "execution_count": 18, "id": "ed0ed116", "metadata": {}, "outputs": [], "source": [ "# Crearemos una base de datos, el cursor y la conexión\n", "nombreDB='datasets'+str(separador)+'train_strokes.db'\n", "conexion = sqlite3.connect(nombreDB, uri=True)\n", "cursor = conexion.cursor()" ] }, { "cell_type": "markdown", "id": "8cfcb2e9", "metadata": {}, "source": [ "### Crear Tablas" ] }, { "cell_type": "code", "execution_count": 19, "id": "84974d4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "se creo la tabla employees\n", "se creo la tabla strokes\n" ] } ], "source": [ "# Esta tabla se crea para ser eliminada (y poder mostrarlo)\n", "sentencia=\"\"\"CREATE TABLE employees(\n", " id integer PRIMARY KEY, \n", " name text, \n", " salary real, \n", " department text, \n", " position text, \n", " startDate text\n", " )\"\"\"\n", "try:\n", " cursor.execute(sentencia)\n", " print(\"se creo la tabla employees\")\n", "except sqlite3.OperationalError:\n", " print(\"Tabla existente\")\n", "\n", "# Esta tabla se crea para persistir.\n", "sentence=\"\"\"CREATE TABLE strokes(\n", " id integer PRIMARY KEY AUTOINCREMENT,\n", " gender text,\n", " age real,\n", " hypertension real,\n", " heart_disease real,\n", " ever_married text,\n", " work_type text,\n", " Residence_type text,\n", " avg_glucose_level real,\n", " bmi real,\n", " smoking_status text,\n", " stroke real\n", " )\"\"\"\n", "try:\n", " cursor.execute(sentence)\n", " print(\"se creo la tabla strokes\")\n", "except sqlite3.OperationalError:\n", " print(\"Tabla existente\")" ] }, { "cell_type": "markdown", "id": "789b0b34", "metadata": {}, "source": [ "### Eliminar Tablas" ] }, { "cell_type": "code", "execution_count": 20, "id": "7076f2c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "se elimino la tabla employees\n" ] } ], "source": [ "# Eliminación de tabla creada\n", "sentencia=\"\"\"drop table employees\"\"\"\n", "try:\n", " cursor.execute(sentencia)\n", " print(\"se elimino la tabla employees\")\n", "except:\n", " print(\"Error operacional\")" ] }, { "cell_type": "markdown", "id": "26e0a067", "metadata": {}, "source": [ "#### Confirmación: Navegamos con DB Browser" ] }, { "cell_type": "markdown", "id": "f997342b", "metadata": {}, "source": [ "Abrimos la DB \"train_strokes.db\" y abrimos la tabla \"strokes\"\n", "\"SQLite3" ] }, { "cell_type": "markdown", "id": "14ed27ff", "metadata": {}, "source": [ "## 3. DML" ] }, { "cell_type": "markdown", "id": "9d831903", "metadata": {}, "source": [ "DML (Data Manipulation Languaje) es un sub-lenguaje de SQL. Es un conjunto de sentencias para la manipulación de los datos almacenados." ] }, { "cell_type": "code", "execution_count": 21, "id": "d4bb5240", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgenderagehypertensionheart_diseaseever_marriedwork_typeResidence_typeavg_glucose_levelbmismoking_statusstroke
030669Male3.000NochildrenRural95.1218.000
130468Male58.010YesPrivateUrban87.9639.2never smoked0
216523Female8.000NoPrivateUrban110.8917.600
356543Female70.000YesPrivateRural69.0435.9formerly smoked0
446136Male14.000NoNever_workedRural161.2819.100
.......................................
4339556196Female10.000NochildrenUrban58.6420.4never smoked0
433965450Female56.000YesGovt_jobUrban213.6155.4formerly smoked0
4339728375Female82.010YesPrivateUrban91.9428.9formerly smoked0
4339827973Male40.000YesPrivateUrban99.1633.2never smoked0
4339936271Female82.000YesPrivateUrban79.4820.6never smoked0
\n", "

43400 rows × 12 columns

\n", "
" ], "text/plain": [ " id gender age hypertension heart_disease ever_married \\\n", "0 30669 Male 3.0 0 0 No \n", "1 30468 Male 58.0 1 0 Yes \n", "2 16523 Female 8.0 0 0 No \n", "3 56543 Female 70.0 0 0 Yes \n", "4 46136 Male 14.0 0 0 No \n", "... ... ... ... ... ... ... \n", "43395 56196 Female 10.0 0 0 No \n", "43396 5450 Female 56.0 0 0 Yes \n", "43397 28375 Female 82.0 1 0 Yes \n", "43398 27973 Male 40.0 0 0 Yes \n", "43399 36271 Female 82.0 0 0 Yes \n", "\n", " work_type Residence_type avg_glucose_level bmi smoking_status \\\n", "0 children Rural 95.12 18.0 0 \n", "1 Private Urban 87.96 39.2 never smoked \n", "2 Private Urban 110.89 17.6 0 \n", "3 Private Rural 69.04 35.9 formerly smoked \n", "4 Never_worked Rural 161.28 19.1 0 \n", "... ... ... ... ... ... \n", "43395 children Urban 58.64 20.4 never smoked \n", "43396 Govt_job Urban 213.61 55.4 formerly smoked \n", "43397 Private Urban 91.94 28.9 formerly smoked \n", "43398 Private Urban 99.16 33.2 never smoked \n", "43399 Private Urban 79.48 20.6 never smoked \n", "\n", " stroke \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "... ... \n", "43395 0 \n", "43396 0 \n", "43397 0 \n", "43398 0 \n", "43399 0 \n", "\n", "[43400 rows x 12 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reemplazar NAN por ceros en todo el Df\n", "df.fillna(0)" ] }, { "cell_type": "markdown", "id": "8928212a", "metadata": {}, "source": [ "### Insert: Insertar los registros del Df en la Tabla de la DB" ] }, { "cell_type": "code", "execution_count": 22, "id": "900a6a6b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "La carga del Df en la DB puede demorar unos 30s aprox. dependiendo de su PC\n", "Insertados 0 registros de 43400\n", "Insertados 10000 registros de 43400\n", "Insertados 20000 registros de 43400\n", "Insertados 30000 registros de 43400\n", "Insertados 40000 registros de 43400\n", "Demora: 0:00:28.017479\n" ] } ], "source": [ "inicio=datetime.datetime.now()\n", "print(\"La carga del Df en la DB puede demorar unos 30s aprox. dependiendo de su PC\")\n", "# Vaciamos la tabla para que no aparezcan errores de ids repetidos\n", "cursor.execute(\"DELETE FROM strokes\")\n", "conexion.commit()\n", "for i in range (len(df)):\n", " if(i/10000==int(i/10000)):\n", " print(\"Insertados \",i,\" registros de \",len(df))\n", " sentence = \"\"\"INSERT OR IGNORE INTO strokes VALUES (?,?,?,?,?,?,?,?,?,?,?,?)\"\"\"\n", " var00=i+1\n", " var01=df.iloc[i,1]\n", " var02=int(df.iloc[i,2])\n", " var03=int(df.iloc[i,3])\n", " var04=int(df.iloc[i,4])\n", " var05=df.iloc[i,5]\n", " var06=df.iloc[i,6]\n", " var07=df.iloc[i,7]\n", " var08=int(df.iloc[i,8])\n", " var09=float(df.iloc[i,9])\n", " var10=df.iloc[i,10]\n", " var11=int(df.iloc[i,11])\n", " cursor.execute(sentence,(var00,var01,var02,var03,var04,var05,var06,var07,var08,var09,var10,var11))\n", "conexion.commit()\n", "fin=datetime.datetime.now()\n", "print(\"Demora: \",fin-inicio)" ] }, { "cell_type": "markdown", "id": "52a23110", "metadata": {}, "source": [ "### Select" ] }, { "cell_type": "code", "execution_count": 23, "id": "8c60b6ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# SELECT\n", "sentence=\"SELECT * FROM strokes\"\n", "cursor.execute(sentence)" ] }, { "cell_type": "code", "execution_count": 24, "id": "92a30119", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cant. de registros elegidos: 43400\n" ] } ], "source": [ "# Presentar en pantalla la cantidad de registros\n", "registros = cursor.fetchall()\n", "df2=pd.DataFrame(registros)\n", "print(\"Cant. de registros elegidos: \",len(df2))" ] }, { "cell_type": "code", "execution_count": 25, "id": "b56202bc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 1 Male 3.0 0.0 0.0 No children Rural 95.0 18.0 \n", "1 2 Male 58.0 1.0 0.0 Yes Private Urban 87.0 39.2 \n", "2 3 Female 8.0 0.0 0.0 No Private Urban 110.0 17.6 \n", "3 4 Female 70.0 0.0 0.0 Yes Private Rural 69.0 35.9 \n", "4 5 Male 14.0 0.0 0.0 No Never_worked Rural 161.0 19.1 \n", "\n", " 10 11 \n", "0 None 0.0 \n", "1 never smoked 0.0 \n", "2 None 0.0 \n", "3 formerly smoked 0.0 \n", "4 None 0.0 \n" ] } ], "source": [ "# Presentar en pantalla a los registros iniciales\n", "print(df2.head())" ] }, { "cell_type": "markdown", "id": "3da7561e", "metadata": {}, "source": [ "### Cierre de Conexion" ] }, { "cell_type": "code", "execution_count": 26, "id": "ec449903", "metadata": {}, "outputs": [], "source": [ "# CIERRE DE CONEXION\n", "cursor.close()\n", "conexion.close()" ] }, { "cell_type": "code", "execution_count": null, "id": "0cb557a7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }