{ "cells": [ { "cell_type": "markdown", "id": "268ff182", "metadata": {}, "source": [ "## Introducción a Pandas: Manejo de conjuntos de datos" ] }, { "cell_type": "markdown", "id": "a395e7b5", "metadata": {}, "source": [ "### Resumen" ] }, { "cell_type": "markdown", "id": "138dbf54", "metadata": {}, "source": [ "En esta Notebook encontrarás un primer abordaje al uso de archivos para adquisición de datos. Se presenta a la librería Pandas de Python, la cual ha sido de gran importancia para el desarrollo de IA en Python. Finalizaremos con una prueba de performance entre diferentes orígenes de datos." ] }, { "cell_type": "markdown", "id": "b07ca88e", "metadata": {}, "source": [ "### 1. DataFrames Pandas" ] }, { "cell_type": "markdown", "id": "a9c3d314", "metadata": {}, "source": [ "### 1.1. Importacion de Librerias" ] }, { "cell_type": "code", "execution_count": 16, "id": "1d995b18", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import datetime as datetime\n", "import pickle\n", "import os" ] }, { "cell_type": "markdown", "id": "80d88987", "metadata": {}, "source": [ "### 1.2. Creación de Dataframes" ] }, { "cell_type": "code", "execution_count": 2, "id": "5725a691", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country capital area population\n", "0 Brazil Brasilia 8.516 200.40\n", "1 Russia Moscow 17.100 143.50\n", "2 India New Dehli 3.286 1252.00\n", "3 China Beijing 9.597 1357.00\n", "4 South Africa Pretoria 1.221 52.98\n" ] } ], "source": [ "# 1. Dataframe BRICS\n", "# Desde un diccionario\n", "dict = {\"country\": [\"Brazil\", \"Russia\", \"India\", \"China\", \"South Africa\"],\n", " \"capital\": [\"Brasilia\", \"Moscow\", \"New Dehli\", \"Beijing\", \"Pretoria\"],\n", " \"area\": [8.516, 17.10, 3.286, 9.597, 1.221],\n", " \"population\": [200.4, 143.5, 1252, 1357, 52.98] }\n", "brics = pd.DataFrame(dict)\n", "print(brics)" ] }, { "cell_type": "code", "execution_count": 4, "id": "dadfe0ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Pais Capital Area Poblacion\n", "0 Argentina Buenos Aires 2.784 47.0\n", "1 Brasil Brasilia 8.516 200.4\n", "2 Uruguay Montevideo 0.176 3.5\n", "3 Paraguay Asuncion 0.406 7.3\n" ] } ], "source": [ "# 2. DataFrame MERCOSUR\n", "# Concatenando listas (columnas)\n", "mercosur=pd.DataFrame()\n", "paises=['Argentina','Brasil','Uruguay','Paraguay']\n", "capitales=['Buenos Aires','Brasilia','Montevideo','Asuncion']\n", "areas=[2.784,8.516,0.176,0.406]\n", "poblaciones=[47,200.4,3.5,7.3]\n", "mercosur['Pais']=paises\n", "mercosur['Capital']=capitales\n", "mercosur['Area']=areas\n", "mercosur['Poblacion']=poblaciones\n", "print(mercosur)" ] }, { "cell_type": "markdown", "id": "3bd21878", "metadata": {}, "source": [ "### 1.3. Funciones básicas para tratar DataFrames Pandas" ] }, { "cell_type": "markdown", "id": "966816bf", "metadata": {}, "source": [ "#### Shape: Tamaño del dataFrame" ] }, { "cell_type": "code", "execution_count": 5, "id": "f002c210", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tamaño de Brics: (5, 4)\n" ] } ], "source": [ "print('tamaño de Brics: ',brics.shape)" ] }, { "cell_type": "code", "execution_count": 6, "id": "f5c62148", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tamaño de Mercosur: (4, 4)\n" ] } ], "source": [ "print('tamaño de Mercosur: ',mercosur.shape)" ] }, { "cell_type": "markdown", "id": "9cf143a1", "metadata": {}, "source": [ "#### dtypes: Tipo de datos por variable" ] }, { "cell_type": "code", "execution_count": 7, "id": "a83f43cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "country object\n", "capital object\n", "area float64\n", "population float64\n", "dtype: object\n" ] } ], "source": [ "print(brics.dtypes)" ] }, { "cell_type": "code", "execution_count": 8, "id": "31200817", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pais object\n", "Capital object\n", "Area float64\n", "Poblacion float64\n", "dtype: object\n" ] } ], "source": [ "print(mercosur.dtypes)" ] }, { "cell_type": "markdown", "id": "c134b783", "metadata": {}, "source": [ "#### Describe: Información estadistica de las variables numéricas" ] }, { "cell_type": "code", "execution_count": 9, "id": "2ab1d129", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " area population\n", "count 5.000000 5.000000\n", "mean 7.944000 601.176000\n", "std 6.200557 645.261454\n", "min 1.221000 52.980000\n", "25% 3.286000 143.500000\n", "50% 8.516000 200.400000\n", "75% 9.597000 1252.000000\n", "max 17.100000 1357.000000\n" ] } ], "source": [ "print(brics.describe())" ] }, { "cell_type": "code", "execution_count": 10, "id": "61160371", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Area Poblacion\n", "count 4.000000 4.000000\n", "mean 2.970500 64.550000\n", "std 3.880431 92.678458\n", "min 0.176000 3.500000\n", "25% 0.348500 6.350000\n", "50% 1.595000 27.150000\n", "75% 4.217000 85.350000\n", "max 8.516000 200.400000\n" ] } ], "source": [ "print(mercosur.describe())" ] }, { "cell_type": "markdown", "id": "f97416c5", "metadata": {}, "source": [ "#### Obtener variables en una lista" ] }, { "cell_type": "code", "execution_count": 17, "id": "ca3a0054", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "varsBrics ['country', 'capital', 'area', 'population']\n" ] } ], "source": [ "varsBrics=brics.columns.tolist()\n", "print('varsBrics',varsBrics)" ] }, { "cell_type": "code", "execution_count": 18, "id": "7c3e1108", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "varsMercosur ['Pais', 'Capital', 'Area', 'Poblacion']\n" ] } ], "source": [ "varsMercosur=mercosur.columns.tolist()\n", "print('varsMercosur',varsMercosur)" ] }, { "cell_type": "markdown", "id": "fcc3c75b", "metadata": {}, "source": [ "#### iLOC: obteniendo una porción del dataFrame" ] }, { "cell_type": "code", "execution_count": 19, "id": "926f017e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Paises del BRICS: ['Brazil', 'Russia', 'India', 'China', 'South Africa']\n", "Paises del MERCOSUR: ['Argentina', 'Brasil', 'Uruguay', 'Paraguay']\n" ] } ], "source": [ "# Obtener una lista de los paises\n", "print('Paises del BRICS: ',list(brics.iloc[:,0]))\n", "print('Paises del MERCOSUR: ',list(mercosur.iloc[:,0]))" ] }, { "cell_type": "code", "execution_count": 20, "id": "e3b23dd4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Area y poblacion de los paises 1 y 2 de BRICS:\n", " area population\n", "1 17.100 143.5\n", "2 3.286 1252.0\n", "Area y poblacion de los paises 1 y 2 de MERCOSUR:\n", " Area Poblacion\n", "1 8.516 200.4\n", "2 0.176 3.5\n" ] } ], "source": [ "# Area y poblacion de los paises 1 y 2\n", "print('Area y poblacion de los paises 1 y 2 de BRICS:')\n", "print(brics.iloc[1:3,2:])\n", "print('Area y poblacion de los paises 1 y 2 de MERCOSUR:')\n", "print(mercosur.iloc[1:3,2:])" ] }, { "cell_type": "markdown", "id": "9d16a8d8", "metadata": {}, "source": [ "### 2. Archivos de Datos" ] }, { "cell_type": "markdown", "id": "806253eb", "metadata": {}, "source": [ "### 2.1. Guardar DataFrame" ] }, { "cell_type": "code", "execution_count": 23, "id": "67259b58", "metadata": {}, "outputs": [], "source": [ "# Notas:\n", "# El separador presentado abajo es multi OS.\n", "separador = os.path.sep" ] }, { "cell_type": "markdown", "id": "9b54019a", "metadata": {}, "source": [ "#### A formato CSV (archivo de texto)" ] }, { "cell_type": "code", "execution_count": 30, "id": "9a1b0149", "metadata": {}, "outputs": [], "source": [ "archivoBricsCsv=\"datasets\"+str(separador)+\"datos_brics.csv\"\n", "archivoMercosurCsv=\"datasets\"+str(separador)+\"datos_mercosur.csv\"\n", "brics.to_csv(archivoBricsCsv,index=None)\n", "mercosur.to_csv(archivoMercosurCsv)" ] }, { "cell_type": "markdown", "id": "4ca8cd97", "metadata": {}, "source": [ "#### A formato de Planilla Excel" ] }, { "cell_type": "code", "execution_count": 31, "id": "2973f4df", "metadata": {}, "outputs": [], "source": [ "archivoBricsExcel=\"datasets\"+str(separador)+\"datos_brics.xlsx\"\n", "archivoMercosurExcel=\"datasets\"+str(separador)+\"datos_mercosur.xlsx\"\n", "brics.to_excel(archivoBricsExcel,index=None)\n", "mercosur.to_excel(archivoMercosurExcel)" ] }, { "cell_type": "markdown", "id": "9cf0e858", "metadata": {}, "source": [ "#### A formato Pickle (archivo binario)" ] }, { "cell_type": "code", "execution_count": 32, "id": "bb138d26", "metadata": {}, "outputs": [], "source": [ "archivoBricsPkl=\"datasets\"+str(separador)+\"datos_brics.pkl\"\n", "archivoMercosurPkl=\"datasets\"+str(separador)+\"datos_mercosur.pkl\"\n", "brics.to_pickle(archivoBricsPkl)\n", "mercosur.to_pickle(archivoMercosurPkl)" ] }, { "cell_type": "markdown", "id": "f2b4725a", "metadata": {}, "source": [ "### 2.2. Recuperar DataFrame desde archivo" ] }, { "cell_type": "markdown", "id": "bb7afd76", "metadata": {}, "source": [ "#### Desde archivo CSV (archivo de texto)" ] }, { "cell_type": "code", "execution_count": 33, "id": "ddd22342", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BRICS recuperado csv:\n", " country capital area population\n", "0 Brazil Brasilia 8.516 200.40\n", "1 Russia Moscow 17.100 143.50\n", "2 India New Dehli 3.286 1252.00\n", "3 China Beijing 9.597 1357.00\n", "4 South Africa Pretoria 1.221 52.98\n", "MERCOSUR recuperado csv:\n", " Unnamed: 0 Pais Capital Area Poblacion\n", "0 0 Argentina Buenos Aires 2.784 47.0\n", "1 1 Brasil Brasilia 8.516 200.4\n", "2 2 Uruguay Montevideo 0.176 3.5\n", "3 3 Paraguay Asuncion 0.406 7.3\n" ] } ], "source": [ "brics2csv=pd.read_csv(archivoBricsCsv)\n", "print('BRICS recuperado csv:')\n", "print(brics2csv)\n", "mercosur2csv=pd.read_csv(archivoMercosurCsv)\n", "print('MERCOSUR recuperado csv:')\n", "print(mercosur2csv)" ] }, { "cell_type": "markdown", "id": "2d6756c8", "metadata": {}, "source": [ "##### Columna \"unnamed\" que aparece al cargar Datos (ver Mercosur).\n", "

Esto está causada por almacenar índice. Para evitarlo, en el argumento del método \"to_csv\" colocar \"index=None\". Por ejemplo: df.to_csv(nombre,index=None).

\n", "

Ahora bien, si el csv ya fue creado, deberemos cargar el csv sin el índice. Esto se hace agregando al argumento del método \"read_csv\", \"index_col=0\". Por ejemplo: pd.read_csv(archivo,index_col=0)

" ] }, { "cell_type": "markdown", "id": "2319e403", "metadata": {}, "source": [ "##### Eliminamos la columna \"unnamed\" que aparece en Mercosur" ] }, { "cell_type": "code", "execution_count": 35, "id": "3c9906b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MERCOSUR recuperado csv:\n", " Pais Capital Area Poblacion\n", "0 Argentina Buenos Aires 2.784 47.0\n", "1 Brasil Brasilia 8.516 200.4\n", "2 Uruguay Montevideo 0.176 3.5\n", "3 Paraguay Asuncion 0.406 7.3\n" ] } ], "source": [ "mercosur2csv=pd.read_csv(archivoMercosurCsv, index_col=0)\n", "print('MERCOSUR recuperado csv:')\n", "print(mercosur2csv)" ] }, { "cell_type": "markdown", "id": "b133e09d", "metadata": {}, "source": [ "#### Desde Planilla Excel" ] }, { "cell_type": "code", "execution_count": 37, "id": "347ae62c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BRICS recuperado xlsx:\n", " country capital area population\n", "0 Brazil Brasilia 8.516 200.40\n", "1 Russia Moscow 17.100 143.50\n", "2 India New Dehli 3.286 1252.00\n", "3 China Beijing 9.597 1357.00\n", "4 South Africa Pretoria 1.221 52.98\n", "MERCOSUR recuperado xlsx:\n", " Pais Capital Area Poblacion\n", "0 Argentina Buenos Aires 2.784 47.0\n", "1 Brasil Brasilia 8.516 200.4\n", "2 Uruguay Montevideo 0.176 3.5\n", "3 Paraguay Asuncion 0.406 7.3\n" ] } ], "source": [ "brics2xlsx=pd.read_excel(archivoBricsExcel)\n", "print('BRICS recuperado xlsx:')\n", "print(brics2xlsx)\n", "mercosur2xlsx=pd.read_excel(archivoMercosurExcel,index_col=0)\n", "print('MERCOSUR recuperado xlsx:')\n", "print(mercosur2xlsx)" ] }, { "cell_type": "markdown", "id": "e39a2419", "metadata": {}, "source": [ "#### Desde archivo binario Pickle" ] }, { "cell_type": "code", "execution_count": 38, "id": "0592aaa4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BRICS recuperado pkl:\n", " country capital area population\n", "0 Brazil Brasilia 8.516 200.40\n", "1 Russia Moscow 17.100 143.50\n", "2 India New Dehli 3.286 1252.00\n", "3 China Beijing 9.597 1357.00\n", "4 South Africa Pretoria 1.221 52.98\n", "MERCOSUR recuperado pkl:\n", " Pais Capital Area Poblacion\n", "0 Argentina Buenos Aires 2.784 47.0\n", "1 Brasil Brasilia 8.516 200.4\n", "2 Uruguay Montevideo 0.176 3.5\n", "3 Paraguay Asuncion 0.406 7.3\n" ] } ], "source": [ "brics2pkl=pd.read_pickle(archivoBricsPkl)\n", "print('BRICS recuperado pkl:')\n", "print(brics2pkl)\n", "mercosur2pkl=pd.read_pickle(archivoMercosurPkl)\n", "print('MERCOSUR recuperado pkl:')\n", "print(mercosur2pkl)" ] }, { "cell_type": "code", "execution_count": null, "id": "dc902394", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }