{ "cells": [ { "cell_type": "markdown", "id": "d0ec1984", "metadata": {}, "source": [ "# Data Wrangling o procesos ETL" ] }, { "cell_type": "markdown", "id": "00dd2ec4", "metadata": {}, "source": [ "## Resumen" ] }, { "cell_type": "markdown", "id": "5421d959", "metadata": {}, "source": [ "En esta notebook se realiza una inspección inicial del dataset, con la idea de identificar la estructura, tipos de variables, y demás. Se identifica también, si en el conjunto de datos hay datos faltantes o registros duplicados, y se actúa en consecuencia." ] }, { "cell_type": "markdown", "id": "bcb9422f", "metadata": {}, "source": [ "## 1. Inicialización" ] }, { "cell_type": "markdown", "id": "998ec12f", "metadata": {}, "source": [ "### 1.1. Importación de librerías" ] }, { "cell_type": "code", "execution_count": 2, "id": "92dc34a9", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os\n", "#\n", "# separador multi-OS\n", "separador=os.sep" ] }, { "cell_type": "code", "execution_count": 2, "id": "37d22656", "metadata": {}, "outputs": [], "source": [ "### 1.2. El Dataset" ] }, { "cell_type": "code", "execution_count": 3, "id": "f6db5a71", "metadata": {}, "outputs": [], "source": [ "df=pd.read_csv('datasets'+str(separador)+'iris.csv')" ] }, { "cell_type": "markdown", "id": "2beacd29", "metadata": {}, "source": [ "## 2. Descripción del Dataset" ] }, { "cell_type": "markdown", "id": "828455ff", "metadata": {}, "source": [ "### 2.1. Tamaño" ] }, { "cell_type": "code", "execution_count": 4, "id": "d8807da5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "id": "23a7cfbf", "metadata": {}, "source": [ "### 2.2. Primeros registros" ] }, { "cell_type": "code", "execution_count": 5, "id": "91846863", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal.lengthsepal.widthpetal.lengthpetal.widthvariety
05.13.51.40.2Setosa
14.93.01.40.2Setosa
24.73.21.30.2Setosa
34.63.11.50.2Setosa
45.03.61.40.2Setosa
\n", "
" ], "text/plain": [ " sepal.length sepal.width petal.length petal.width variety\n", "0 5.1 3.5 1.4 0.2 Setosa\n", "1 4.9 3.0 1.4 0.2 Setosa\n", "2 4.7 3.2 1.3 0.2 Setosa\n", "3 4.6 3.1 1.5 0.2 Setosa\n", "4 5.0 3.6 1.4 0.2 Setosa" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "276a5519", "metadata": {}, "source": [ "### 2.3. Lista de variables" ] }, { "cell_type": "markdown", "id": "f36c68f8", "metadata": {}, "source": [ "### 2.3. Tipos de variables" ] }, { "cell_type": "code", "execution_count": 6, "id": "e6868a28", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{float64: ['sepal.length', 'sepal.width', 'petal.length', 'petal.width'], object: ['variety']}\n" ] } ], "source": [ "tipos = df.columns.to_series().groupby(df.dtypes).groups\n", "print(tipos)" ] }, { "cell_type": "code", "execution_count": 7, "id": "9d159028", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "variables= ['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety']\n", "Variables Cualitativas: Index(['variety'], dtype='object')\n", "Variables Numéricas: ['sepal.width', 'petal.length', 'sepal.length', 'petal.width']\n" ] } ], "source": [ "variables=df.columns.tolist()\n", "print('variables=',variables)\n", "ctext = tipos[np.dtype('object')]\n", "print('Variables Cualitativas: ',ctext)\n", "cnum = list(set(variables) - set(ctext))\n", "print('Variables Numéricas: ',cnum)" ] }, { "cell_type": "markdown", "id": "f6b06909", "metadata": {}, "source": [ "### 2.4. Ver si hay datos faltantes" ] }, { "cell_type": "code", "execution_count": 8, "id": "15ac4cd9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sepal.length 0\n", "sepal.width 0\n", "petal.length 0\n", "petal.width 0\n", "variety 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "id": "f08da5db", "metadata": {}, "source": [ "### 2.5. Ver si hay registros duplicados" ] }, { "cell_type": "code", "execution_count": 9, "id": "14699d93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False 149\n", "True 1\n", "dtype: int64\n" ] } ], "source": [ "print(df.duplicated().value_counts())" ] }, { "cell_type": "markdown", "id": "ed722164", "metadata": {}, "source": [ "## 3. Limpieza del Dataset" ] }, { "cell_type": "markdown", "id": "b8afe4a5", "metadata": {}, "source": [ "### 3.1. Completar datos faltantes" ] }, { "cell_type": "markdown", "id": "8a0f814c", "metadata": {}, "source": [ "En caso que en algún registro, aparezca un dato faltante (NAN), la mayoría de los algoritmos no podrá operar con ellos. Es por ello que se debe tomar una decisión: Eliminar el registro o bien reemplazar ese valor por algún valor pre-establecido como ser el promedio." ] }, { "cell_type": "code", "execution_count": 11, "id": "2f2b083d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "En caso de NAN: ¿(R) Reemplazar / (E) Eliminar registro?E\n" ] } ], "source": [ "respuesta=input('En caso de NAN: ¿(R) Reemplazar / (E) Eliminar registro?')\n", "if(respuesta=='R'):\n", " hayFaltantes=df.isnull().any().any()\n", " print('Hay datos vacios? ',hayFaltantes)\n", " # Si hay faltantes: competo a los numéricos con el promedio:\n", " for c in cnum:\n", " mean = df[c].mean()\n", " df[c] = df[c].fillna(mean)\n", " # Si hay faltantes cuantitativas completo:\n", " def completarVariableCualitativa(df,variable):\n", " var=[]\n", " for i in range(len(df)):\n", " if(pd.isna(df.iloc[i,variable])):\n", " dato='no_data'\n", " else:\n", " dato=df.iloc[i,variable]\n", " var.append(dato)\n", " return var\n", " ctext = tipos[np.dtype('object')]\n", " ctextnum=[]\n", " for i in range(len(ctext)):\n", " for j in range(len(variables)):\n", " if ctext[i]==variables[j]:\n", " ctextnum.append(j)\n", " for i in range(len(ctext)):\n", " completarVariableCualitativa(df,int(ctextnum[i]))\n", "elif(respuesta=='E'):\n", " df = df.dropna(axis = 0, how ='any')\n", "else:\n", " print('ERROR DESCONOCIDO')" ] }, { "cell_type": "markdown", "id": "28f26cee", "metadata": {}, "source": [ "### 3.2. Eliminar registros duplicados" ] }, { "cell_type": "markdown", "id": "bba20cda", "metadata": {}, "source": [ "Si hubiera registros duplicados, puede que sea un error o bien que haya dos elementos de la muestra exactamente iguales. Pueden mantenerse o eliminarse. En este caso, ante duplicados, se decide eliminarlos." ] }, { "cell_type": "code", "execution_count": 12, "id": "00d32fd7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(149, 5)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sin_duplicados = df.drop_duplicates(keep='first')\n", "df_sin_duplicados.shape" ] }, { "cell_type": "markdown", "id": "7d56dbb1", "metadata": {}, "source": [ "Observar la cantidad de registros antes de este proceso y después: antes había 150 registros, de los cuales se había identificado que uno era duplicado. Tras su eliminación, observar que shape muestra 149 registros.
\n", "Importante: drop_duplicates() por defecto elimina todos los registros que aparecen duplicados. Normalmente es deseable retener uno de ellos. Por eso: \"keep='first'\"." ] }, { "cell_type": "markdown", "id": "29f916d4", "metadata": {}, "source": [ "### 3.3. Normalizar Dataset" ] }, { "cell_type": "markdown", "id": "57500951", "metadata": {}, "source": [ "Muchos algoritmos requieren que los números se encuentren en el mismo orden de magnitud, de modo de no asignar pesos fuertes a variables que símplemente tienen valores numéricos altos en comparación con otras. Para ello, normalmente es conveniente la normalización de todo el conjunto de datos (salvo la variable objetivo). Normalmente, tras la normalización, todos los dato pasan a estar entre -1 y 1 o parecido." ] }, { "cell_type": "code", "execution_count": 13, "id": "a1f4b7b7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_5757/222317414.py:6: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " X = np.array(df.drop([target],1))\n" ] } ], "source": [ "def normalizar(df,var):\n", " import sklearn\n", " from sklearn import preprocessing\n", " campos=df.columns.tolist()\n", " target=campos[var]\n", " X = np.array(df.drop([target],1))\n", " y = np.array(df[target])\n", " X_Norm=sklearn.preprocessing.scale(X)\n", " df2=pd.DataFrame(X_Norm)\n", " df2['target']=y\n", " df2.columns=campos\n", " return df2\n", "df=normalizar(df,4)" ] }, { "cell_type": "markdown", "id": "656daca5", "metadata": {}, "source": [ "### 3.4. Discretizar Dataset" ] }, { "cell_type": "markdown", "id": "79b7c66b", "metadata": {}, "source": [ "La variable \"Variety\" presenta los valores: \"Setosa\", \"Virginica\" y \"Versicolor\". Como son del tipo texto, no pueden operarse en la mayoría de los algoritmos. El proceso de discretización crea una variable \"Variety_Setosa\" y en aquellos registros donde la variedad es \"Setosa\", en la columna \"Variety_Setosa\" coloca un 1, y resto en otro caso. Así sucesivamente." ] }, { "cell_type": "code", "execution_count": 14, "id": "1cc0e842", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sepal.length sepal.width petal.length petal.width variety_Setosa variety_Versicolor variety_Virginica\n", "-0.052506 -0.822570 0.762758 0.922303 0 0 1 2\n", " 0.432165 -1.973554 0.421734 0.395774 0 1 0 1\n", " -0.362176 0.308059 0.132510 0 1 0 1\n", " 0.788808 0.933271 1.448832 0 0 1 1\n", " 0.553333 -1.743357 0.364896 0.132510 0 1 0 1\n", " ..\n", "-0.537178 1.939791 -1.397064 -1.052180 1 0 0 1\n", " -1.169714 -1.052180 1 0 0 1\n", "-0.416010 -1.743357 0.137547 0.132510 0 1 0 1\n", " -1.513160 -0.032966 -0.262387 0 1 0 1\n", " 2.492019 1.709595 1.501645 1.053935 0 0 1 1\n", "Length: 149, dtype: int64\n" ] } ], "source": [ "df_esp = pd.get_dummies(df, columns=['variety'])\n", "# Conteo de resultados\n", "print(df_esp.value_counts())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }