{ "cells": [ { "attachments": { "imagen.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": { "id": "4zSSlTgxLAwK" }, "source": [ "![imagen.png](attachment:imagen.png)\n", "## spaCy: Named Entity Recognition Model\n", "Based on the tutorial: https://spacy.io/universe/project/video-spacy-course-es\n", "

spaCy is an open-source software library for natural language processing (NLP), developed by Matthew Honnibal and written in Python and Cython. It covers tasks such as tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and text annotation. License: MIT.

" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: spacy[transformers] in /home/jorgek/anaconda3/lib/python3.11/site-packages (3.8.2)\n", "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (3.0.12)\n", "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (1.0.5)\n", "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (1.0.10)\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.0.8)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (3.0.9)\n", "Collecting thinc<8.4.0,>=8.3.0 (from spacy[transformers])\n", " Using cached thinc-8.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.9 MB)\n", "Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (1.1.3)\n", "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.4.8)\n", "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.0.10)\n", "Requirement already satisfied: weasel<0.5.0,>=0.1.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (0.4.1)\n", "Requirement already satisfied: typer<1.0.0,>=0.3.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (0.12.5)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in 
/home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (4.65.0)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.29.0)\n", "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.9.2)\n", "Requirement already satisfied: jinja2 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (3.1.2)\n", "Requirement already satisfied: setuptools in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (67.8.0)\n", "Requirement already satisfied: packaging>=20.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (23.0)\n", "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (3.4.1)\n", "Requirement already satisfied: numpy>=1.19.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy[transformers]) (2.0.2)\n", "Collecting spacy-transformers<1.4.0,>=1.1.2 (from spacy[transformers])\n", " Downloading spacy_transformers-1.3.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (197 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m197.7/197.7 kB\u001b[0m \u001b[31m2.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: language-data>=1.2 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from langcodes<4.0.0,>=3.2.0->spacy[transformers]) (1.2.0)\n", "Requirement already satisfied: annotated-types>=0.6.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy[transformers]) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.23.4 in /home/jorgek/anaconda3/lib/python3.11/site-packages 
(from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy[transformers]) (2.23.4)\n", "Requirement already satisfied: typing-extensions>=4.6.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy[transformers]) (4.6.3)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy[transformers]) (2.0.4)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy[transformers]) (3.4)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy[transformers]) (1.26.16)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from requests<3.0.0,>=2.13.0->spacy[transformers]) (2023.5.7)\n", "Requirement already satisfied: transformers<4.37.0,>=3.4.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (4.29.2)\n", "Requirement already satisfied: torch>=1.8.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (2.0.1)\n", "Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers])\n", " Downloading spacy_alignments-0.9.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m314.0/314.0 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: blis<1.1.0,>=1.0.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from thinc<8.4.0,>=8.3.0->spacy[transformers]) (1.0.1)\n", "Requirement already satisfied: 
confection<1.0.0,>=0.0.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from thinc<8.4.0,>=8.3.0->spacy[transformers]) (0.1.5)\n", "Requirement already satisfied: click>=8.0.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from typer<1.0.0,>=0.3.0->spacy[transformers]) (8.0.4)\n", "Requirement already satisfied: shellingham>=1.3.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from typer<1.0.0,>=0.3.0->spacy[transformers]) (1.5.4)\n", "Requirement already satisfied: rich>=10.11.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from typer<1.0.0,>=0.3.0->spacy[transformers]) (13.9.3)\n", "Requirement already satisfied: cloudpathlib<1.0.0,>=0.7.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from weasel<0.5.0,>=0.1.0->spacy[transformers]) (0.20.0)\n", "Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from weasel<0.5.0,>=0.1.0->spacy[transformers]) (5.2.1)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from jinja2->spacy[transformers]) (2.1.1)\n", "Requirement already satisfied: marisa-trie>=0.7.7 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy[transformers]) (1.2.1)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy[transformers]) (2.2.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy[transformers]) (2.15.1)\n", "Requirement already satisfied: filelock in /home/jorgek/anaconda3/lib/python3.11/site-packages (from torch>=1.8.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (3.9.0)\n", "Requirement already satisfied: sympy in /home/jorgek/anaconda3/lib/python3.11/site-packages (from 
torch>=1.8.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (1.11.1)\n", "Requirement already satisfied: networkx in /home/jorgek/anaconda3/lib/python3.11/site-packages (from torch>=1.8.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (2.8.4)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from transformers<4.37.0,>=3.4.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (0.15.1)\n", "Requirement already satisfied: pyyaml>=5.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from transformers<4.37.0,>=3.4.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (6.0)\n", "Requirement already satisfied: regex!=2019.12.17 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from transformers<4.37.0,>=3.4.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (2022.7.9)\n", "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from transformers<4.37.0,>=3.4.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (0.13.2)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: fsspec in /home/jorgek/anaconda3/lib/python3.11/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers<4.37.0,>=3.4.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (2023.4.0)\n", "Requirement already satisfied: mdurl~=0.1 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy[transformers]) (0.1.0)\n", "Requirement already satisfied: mpmath>=0.19 in /home/jorgek/anaconda3/lib/python3.11/site-packages (from sympy->torch>=1.8.0->spacy-transformers<1.4.0,>=1.1.2->spacy[transformers]) (1.2.1)\n", "Installing collected packages: spacy-alignments, thinc, spacy-transformers\n", " Attempting uninstall: thinc\n", " Found existing installation: thinc 9.1.1\n", " Uninstalling thinc-9.1.1:\n", " 
Successfully uninstalled thinc-9.1.1\n", "Successfully installed spacy-alignments-0.9.1 spacy-transformers-1.3.5 thinc-8.3.2\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# Required installations\n", "#%pip install -U numpy # upgrade NumPy\n", "#%pip install -U spacy # install and/or upgrade spaCy\n", "#%pip install -U h5py thinc # install and/or upgrade h5py and thinc\n", "#%pip install -U thinc\n", "#%pip install -U spacy[transformers]" ] }, { "cell_type": "markdown", "metadata": { "id": "_3K7oE8SKaQl" }, "source": [ "## Getting started\n", "### Requirements: Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "fcodkn0Dot0D" }, "outputs": [], "source": [ "# Libraries\n", "import spacy\n", "from spacy.lang.es import Spanish" ] }, { "cell_type": "markdown", "metadata": { "id": "Al21TRcGKt6c" }, "source": [ "### Basic elements for analyzing a text" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "RTHRlEKIo7NV" }, "outputs": [], "source": [ "# A text in Spanish\n", "texto01=\"Lo que hay que saber del dólar hoy lunes 20 de marzo en Argentina, con información completa y actualizada sobre la cotización del dólar en el Banco Nación, en el mercado mayorista y los datos del Banco Central.\"\n", "texto02=\"Ella comió pizza\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Spanish pipeline (tokenizer only, no trained components)\n", "nlp01=Spanish()" ] }, { "cell_type": "markdown", "metadata": { "id": "l6P-keVPKyFd" }, "source": [ "## 1. The Doc object\n", "The Doc object is the result of processing a text with a spaCy NLP pipeline. It is a sequence of tokens (words, numbers, and punctuation marks) and behaves like a Python list. Each element of that sequence is called a \"token\".
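The list-like behavior described above can be sketched as follows. This is a minimal illustration that assumes only that spaCy is installed: it uses spacy.blank('es'), a tokenizer-only Spanish pipeline comparable to Spanish(), and the sample sentence and variable names are illustrative rather than taken from the rest of this notebook.

```python
import spacy

# Tokenizer-only Spanish pipeline: no trained model is required
nlp = spacy.blank('es')
doc = nlp('Ella comió pizza.')

# A Doc supports len(), indexing and iteration, like a list
token_texts = [token.text for token in doc]
print(len(doc), token_texts)

# Indexing returns a Token, slicing returns a Span
first_token = doc[0]
first_two = doc[0:2]
print(first_token.text, '|', first_two.text)
```

Unlike a plain list, a Doc also keeps the original whitespace, so doc.text reproduces the input string.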
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cantidad de tokens (palabras, numeros + signos de puntuacion): 41\n" ] } ], "source": [ "# El objeto DOC: aplicacion de NLP al TEXTO\n", "doc01 = nlp01(texto01)\n", "print(\"Cantidad de tokens (palabras, numeros + signos de puntuacion):\",len(doc01))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Un token del texto\n", "Al comportarse como una lista de python, al objeto DOC simple y natural obtenerles un token desde su indice." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Obteniendo un token dentro del texto: token01= saber\n", "Obteniendo otro token dentro del texto: token02= 20\n", "Obteniendo otro token dentro del texto: token03= ,\n" ] } ], "source": [ "# Elementos desde su indice:\n", "token01=doc01[4]\n", "token02=doc01[9]\n", "token03=doc01[14]\n", "print(\"Obteniendo un token dentro del texto: token01=\",token01)\n", "print(\"Obteniendo otro token dentro del texto: token02=\",token02)\n", "print(\"Obteniendo otro token dentro del texto: token03=\",token03)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Tipo de elemento\n", "Se puede saber qué tipo de elemento es cada token: si es una palabra (is_alpha), un signo de puntuación (is_punct) o un número (is_digit)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "token\tis_alpha\tis_digit\tis_punct\t\n", "saber\tTrue\t\tFalse\t\tFalse\t\n", "20\tFalse\t\tTrue\t\tFalse\t\n", ",\tFalse\t\tFalse\t\tTrue\t\n" ] } ], "source": [ "# Print the lexical attributes of the three tokens as a table\n", "print(\"{}\\t{}\\t{}\\t{}\\t\".format(\"token\", \"is_alpha\", \"is_digit\", \"is_punct\"))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(token01, token01.is_alpha, token01.is_digit, token01.is_punct))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(token02, token02.is_alpha, token02.is_digit, token02.is_punct))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(token03, token03.is_alpha, token03.is_digit, token03.is_punct))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. A portion of the text: Span\n", "Because a Doc behaves like a Python list, a portion of it (a Span) is obtained by slicing the Doc with a start index and an end index (the end index is excluded):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Obteniendo una porcion del texto: Lo que hay que saber\n" ] } ], "source": [ "# Span: a slice of the Doc\n", "unSpan=doc01[0:5]\n", "print(\"Obteniendo una porcion del texto: \",unSpan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Linguistic attributes in context\n", "Pre-trained models allow spaCy to predict linguistic attributes in context: part-of-speech tags, syntactic dependencies, and named entities.\n", "For example, it can identify whether a span is the first and last name of a person.
\n", "More information: https://spacy.io/usage/linguistic-features\n", "### 2.1. The es_core_news_sm package\n", "It contains pipeline components trained on Spanish news text. Before it can be used, it must be downloaded.
\n", "$ python -m spacy download es_core_news_sm
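One way to confirm that the download worked before relying on the package is sketched below. It assumes spaCy is installed and uses spacy.util.is_package, which reports whether an installed package with that name exists. The helper name load_spanish_pipeline and the fallback to a blank Spanish pipeline are hypothetical choices for illustration, not part of the original notebook.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = 'es_core_news_sm'

def load_spanish_pipeline(model_name=MODEL_NAME):
    # Hypothetical helper: load the trained package if it is installed,
    # otherwise fall back to a tokenizer-only Spanish pipeline
    if is_package(model_name):
        return spacy.load(model_name)
    return spacy.blank('es')

nlp = load_spanish_pipeline()
print(nlp.lang, nlp.pipe_names)
```

With the fallback, attributes such as pos_, dep_ and doc.ents stay empty, because they are set by trained components.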
\n", "Once downloaded, it can be loaded with:
\n", "spacy.load('es_core_news_sm')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "El texto: Lo que hay que saber del dólar hoy lunes 20 de marzo en Argentina, con información completa y actualizada sobre la cotización del dólar en el Banco Nación, en el mercado mayorista y los datos del Banco Central.\n" ] } ], "source": [ "#!python -m spacy download es_core_news_sm\n", "print(\"El texto: \",texto01)\n", "# Initialization: load the pipeline\n", "nlp02 = spacy.load(\"es_core_news_sm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Part-of-speech tags (.pos_)\n", "Parts of speech are grammatical categories such as verbs, nouns, pronouns, and adjectives.\n", "spaCy predicts the part of speech of each token (token.pos_) and also provides its lemma (token.lemma_), shown in the third column of the output below." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "texto\t\tcategoria\n", "Lo\t\tPRON\t\tél\n", "que\t\tPRON\t\tque\n", "hay\t\tAUX\t\thaber\n", "que\t\tSCONJ\t\tque\n", "saber\t\tVERB\t\tsaber\n", "del\t\tADP\t\tdel\n", "dólar\t\tNOUN\t\tdólar\n", "hoy\t\tADV\t\thoy\n", "lunes\t\tNOUN\t\tlunes\n", "20\t\tNUM\t\t20\n", "de\t\tADP\t\tde\n", "marzo\t\tNOUN\t\tmarzo\n", "en\t\tADP\t\ten\n", "Argentina\t\tPROPN\t\tArgentina\n", ",\t\tPUNCT\t\t,\n", "con\t\tADP\t\tcon\n", "información\t\tNOUN\t\tinformación\n", "completa\t\tADJ\t\tcompleto\n", "y\t\tCCONJ\t\ty\n", "actualizada\t\tADJ\t\tactualizado\n", "sobre\t\tADP\t\tsobre\n", "la\t\tDET\t\tel\n", "cotización\t\tNOUN\t\tcotización\n", "del\t\tADP\t\tdel\n", "dólar\t\tNOUN\t\tdólar\n", "en\t\tADP\t\ten\n", "el\t\tDET\t\tel\n", "Banco\t\tPROPN\t\tBanco\n", "Nación\t\tPROPN\t\tNación\n", ",\t\tPUNCT\t\t,\n", "en\t\tADP\t\ten\n", "el\t\tDET\t\tel\n", "mercado\t\tNOUN\t\tmercado\n", "mayorista\t\tADJ\t\tmayorista\n", "y\t\tCCONJ\t\ty\n", "los\t\tDET\t\tel\n", "datos\t\tNOUN\t\tdato\n", "del\t\tADP\t\tdel\n", "Banco\t\tPROPN\t\tBanco\n", "Central\t\tPROPN\t\tCentral\n", ".\t\tPUNCT\t\t.\n" ] } ], "source": [ "# The Doc object\n", "doc01=nlp02(texto01)\n", "print(\"{}\\t\\t{}\".format(\"texto\", \"categoria\"))\n", "for token in doc01:\n", "    print(\"{}\\t\\t{}\\t\\t{}\".format(token.text, token.pos_,token.lemma_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: nlp02 is a pipeline loaded from \"es_core_news_sm\". If nlp01 (an instance of spacy.lang.es.Spanish) is used instead, these results are not obtained, because a blank pipeline has no trained components." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Syntactic dependencies (.dep_)\n", "Syntactic analysis identifies the subject and the predicate of a sentence, along with its root (the main verb). spaCy predicts these dependencies: token.dep_ is the relation label and token.head is the token it attaches to.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "texto\t\tAnalisis Sintactico\tDependencia\n", "Ella\t\tnsubj\t\t\tcomió\n", "comió\t\tROOT\t\t\tcomió\n", "pizza\t\tobj\t\t\tcomió\n" ] } ], "source": [ "# Example of the syntactic dependencies of a text:\n", "doc02=nlp02(texto02)\n", "# -> Subject: --> \"Ella\"\n", "# -> Predicate: --> \"comió pizza\"\n", "# -> Root: --> \"comió\"\n", "# -> Direct object: --> \"pizza\"\n", "print(\"{}\\t\\t{}\\t{}\".format(\"texto\", \"Analisis Sintactico\", \"Dependencia\"))\n", "for token in doc02:\n", "    print(\"{}\\t\\t{}\\t\\t\\t{}\".format(token.text, token.dep_,token.head.text))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "token\tis_alpha\tis_digit\tis_punct\t\n", "Ella\tTrue\t\tFalse\t\tFalse\t\n", "comió\tTrue\t\tFalse\t\tFalse\t\n", "pizza\tTrue\t\tFalse\t\tFalse\t\n" ] } ], "source": [ "# Repeat section 1.2 (checking the type of each token) with nlp02 applied to the short text\n", "print(\"{}\\t{}\\t{}\\t{}\\t\".format(\"token\", \"is_alpha\", \"is_digit\", \"is_punct\"))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(doc02[0], doc02[0].is_alpha, doc02[0].is_digit, doc02[0].is_punct))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(doc02[1], doc02[1].is_alpha, doc02[1].is_digit, doc02[1].is_punct))\n", "print(\"{}\\t{}\\t\\t{}\\t\\t{}\\t\".format(doc02[2], doc02[2].is_alpha, doc02[2].is_digit, doc02[2].is_punct))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4. Proper nouns or named entities\n", "spaCy can identify proper names and say what kind of name each one is: location, organization, person, among others. They are called \"named entities\" and are available in doc.ents.\n", "The name: ent.text\n", "The type: ent.label_" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Nombre\tTipo\n", "Argentina\tLOC\n", "Banco Nación\tORG\n", "Banco Central\tORG\n" ] } ], "source": [ "# This example uses texto01, which contains several proper names.\n", "doc03=nlp02(texto01)\n", "print(\"{}\\t{}\".format(\"Nombre\", \"Tipo\"))\n", "for ent in doc03.ents:\n", "    print(\"{}\\t{}\".format(ent.text, ent.label_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observation: in an earlier run, Banco Nación was labeled LOC rather than ORG, apparently because of the word Nación; this did not happen with Banco Provincia, Banco Municipal, Banco Provincial, or Banco Nacional. In the output shown above it is labeled ORG, so the label may vary with the model version." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1. Explicacion de algunas Categorias de .pos_:\n", "Tipo\tExplicacion\n", "PRON\tpronoun\n", "AUX\tauxiliary\n", "SCONJ\tsubordinating conjunction\n", "VERB\tverb\n", "ADP\tadposition\n", "ADV\tadverb\n", "2. 
Explicacion de algunas Categorias de .dep_:\n", "nsubj\tnominal subject\n", "ROOT\troot\n", "obj\tobject\n", "3. Explicacion de algunas Categorias de .ent.label_:\n", "LOC: Non-GPE locations, mountain ranges, bodies of water\n", "ORG: Companies, agencies, institutions, etc.\n" ] } ], "source": [ "# Explanations of the labels used above:\n", "print(\"1. Explicacion de algunas Categorias de .pos_:\")\n", "print(\"{}\\t{}\".format(\"Tipo\", \"Explicacion\"))\n", "print(\"{}\\t{}\".format(\"PRON\", spacy.explain(\"PRON\")))\n", "print(\"{}\\t{}\".format(\"AUX\", spacy.explain(\"AUX\")))\n", "print(\"{}\\t{}\".format(\"SCONJ\", spacy.explain(\"SCONJ\")))\n", "print(\"{}\\t{}\".format(\"VERB\", spacy.explain(\"VERB\")))\n", "print(\"{}\\t{}\".format(\"ADP\", spacy.explain(\"ADP\")))\n", "print(\"{}\\t{}\".format(\"ADV\", spacy.explain(\"ADV\")))\n", "print(\"2. Explicacion de algunas Categorias de .dep_:\")\n", "print(\"{}\\t{}\".format(\"nsubj\", spacy.explain(\"nsubj\")))\n", "print(\"{}\\t{}\".format(\"ROOT\", spacy.explain(\"ROOT\")))\n", "print(\"{}\\t{}\".format(\"obj\", spacy.explain(\"obj\")))\n", "print(\"3. Explicacion de algunas Categorias de .ent.label_:\")\n", "print(\"LOC: \",spacy.explain(\"LOC\"))\n", "print(\"ORG: \",spacy.explain(\"ORG\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Rule-based patterns\n", "Rules for finding words and phrases in a text. Unlike regular expressions, which operate on strings, spaCy rules operate on Doc objects and tokens.\n", "This is more flexible, because a rule can match not only the text but also other token attributes. For example: match \"araña\" only when it is a verb and not when it is a noun." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1. Match patterns (Matcher)\n", "A pattern added to the Matcher is a list of dictionaries. Each dictionary describes one token and its attributes."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "res= [(3060392928464876464, 7, 9)]\n", "Resultado: Adidas ZX\n" ] } ], "source": [ "from spacy.matcher import Matcher\n", "# nlp=spacy.load(\"es_core_news_sm\") # --> already loaded as nlp02\n", "texto03=\"Nuevos diseños de zapatillas en la colección Adidas ZX\"\n", "# Initialize the matcher with the shared vocab\n", "matcher01=Matcher(nlp02.vocab)\n", "# Define a pattern: one dictionary per token\n", "patron01=[{\"TEXT\":\"Adidas\"},{\"TEXT\":\"ZX\"}]\n", "# matcher.add(pattern ID, list of patterns, on_match=optional callback)\n", "matcher01.add(\"PATRON_ADIDAS\",[patron01])\n", "# Process the text\n", "doc04=nlp02(texto03)\n", "# Apply matcher01 to the Doc\n", "res=matcher01(doc04)\n", "# res is a list of (match_id, start, end) tuples:\n", "print(\"res=\",res)\n", "# Pattern ID: res[i][0]\n", "# START of the match: res[i][1]\n", "# END of the match (exclusive): res[i][2]\n", "spanRes=doc04[res[0][1]:res[0][2]]\n", "print(\"Resultado: \",spanRes)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "¿ Quien ganó la copa Mundial FIFA 2022 ? ... Argentina\n", "¿ Quien salió segundo ? ... Francia\n" ] } ], "source": [ "# Another example:\n", "texto04=\"Argentina ganó la copa Mundial FIFA 2022. 
Francia salió segundo.\"\n", "doc04=nlp02(texto04)\n", "patron01=[{\"POS\":\"PROPN\",\"OP\":\"?\"},{\"LEMMA\":\"ganar\",\"POS\":\"VERB\"},{\"LOWER\":\"la\"},{\"LOWER\":\"copa\"},{\"LOWER\":\"mundial\"},{\"LOWER\":\"fifa\"},{\"IS_DIGIT\":True},{\"IS_PUNCT\":True}]\n", "patron02=[{\"POS\":\"PROPN\",\"OP\":\"?\"},{\"LEMMA\":\"salir\",\"POS\":\"VERB\"},{\"LOWER\":\"segundo\"},{\"IS_PUNCT\":True}]\n", "matcher04a=Matcher(nlp02.vocab)\n", "matcher04a.add(\"FIFA_PATRON\",[patron01])\n", "matcher04b=Matcher(nlp02.vocab)\n", "matcher04b.add(\"FIFA_PATRON\",[patron02])\n", "res1=matcher04a(doc04)\n", "res2=matcher04b(doc04)\n", "# Questions:\n", "print(\"¿ Quien \",doc04[res1[0][1]+1:res1[0][2]-1],\"? ... \",doc04[res1[0][1]])\n", "print(\"¿ Quien \",doc04[res2[0][1]+1:res2[0][2]-1],\"? ... \",doc04[res2[0][1]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Data structures in spaCy\n", "### 4.1. Vocab\n", "#### Hashes in Vocab\n", "To reduce memory usage, spaCy encodes strings as hash values and stores each string only once in the StringStore (nlp.vocab.strings). The StringStore works as a lookup table in both directions: from string to hash and from hash to string." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hash del string: 4509832299175612861\n", "el string hasheado es: Cerveza\n" ] } ], "source": [ "# String to hash\n", "unString=\"Cerveza\"\n", "hashString=nlp02.vocab.strings[unString]\n", "print(\"hash del string:\",hashString)\n", "# Hash back to string\n", "elString=nlp02.vocab.strings[hashString]\n", "print(\"el string hasheado es:\",elString)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lexemes\n", "A Lexeme object is an entry in the vocabulary. It contains context-independent information about a word (unlike pos_ and dep_, which depend on context)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cerveza 4509832299175612861 True\n" ] } ], "source": [ "# The lexeme for the string Cerveza\n", "lexema=nlp02.vocab[unString]\n", "# Print the lexeme attributes\n", "print(lexema.text,lexema.orth,lexema.is_alpha)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. Doc\n", "A Doc can be built manually from a list of words and a list of booleans that say whether each word is followed by a space." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "¿Qué hora es?\n" ] } ], "source": [ "# Import the Doc class\n", "from spacy.tokens import Doc\n", "# The words of the sentence and the spaces\n", "wordsDeUnaFrase=[\"¿\",\"Qué\",\"hora\",\"es\",\"?\"]\n", "espacios=[False,True,True,False,False]\n", "# Create the Doc\n", "elDoc=Doc(nlp02.vocab,words=wordsDeUnaFrase,spaces=espacios)\n", "print(elDoc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3. Span\n", "A Span is a slice (a portion) of a Doc consisting of one or more tokens. A Span can also be created manually. Arguments: doc, start, end." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Qué hora\n" ] } ], "source": [ "# Import the Span class\n", "from spacy.tokens import Span\n", "# Build the Span\n", "elSpan=Span(elDoc,1,3)\n", "print(elSpan)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Semantic similarity\n", "spaCy provides tools to compute the semantic similarity between Docs, Spans, and tokens.
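As a rough illustration of what a similarity score measures: by default, spaCy compares the vectors of the two objects with cosine similarity, and the vector of a Doc or Span is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python with made-up 3-dimensional vectors; the numbers and names are hypothetical and do not come from any spaCy model.

```python
import math

def average(vectors):
    # Element-wise mean of equally sized vectors
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy word vectors (real models use hundreds of dimensions)
toy_vectors = {
    'cerveza': [0.9, 0.1, 0.0],
    'vino': [0.8, 0.2, 0.1],
    'celular': [0.0, 0.2, 0.9],
}

drinks = cosine_similarity(toy_vectors['cerveza'], toy_vectors['vino'])
unrelated = cosine_similarity(toy_vectors['cerveza'], toy_vectors['celular'])
phrase = average([toy_vectors['cerveza'], toy_vectors['vino']])
print(round(drinks, 3), round(unrelated, 3), phrase)
```

Vectors that point in similar directions score close to 1, while nearly orthogonal vectors score close to 0.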
\n", "Important: the small \"_sm\" models do not include word vectors, so similarity scores require at least a medium \"_md\" model or the large \"_lg\" model." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Download the medium model\n", "#!python -m spacy download es_core_news_md\n", "# Initialization: load the pipeline\n", "nlp03 = spacy.load(\"es_core_news_md\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Texto 5 ¡Qué rica cerveza!\n", "Texto 6 ¡Qué buen vino!\n", "Similaridad(Texto 5, Texto 6)= 0.9236639738082886\n", "\n", "Texto 7 Mi celular tiene poca batería\n", "Similaridad(Texto 5, Texto 7)= 0.21833710372447968\n" ] } ], "source": [ "# A sentence built around the earlier string (cerveza):\n", "texto05=\"¡Qué rica cerveza!\"\n", "doc05=nlp03(texto05)\n", "print(\"Texto 5\",doc05)\n", "# Declare another string\n", "\n", "texto06=\"¡Qué buen vino!\"\n", "doc06=nlp03(texto06)\n", "print(\"Texto 6\",doc06)\n", "# Print the similarity between the sentences\n", "print(\"Similaridad(Texto 5, Texto 6)=\",doc05.similarity(doc06))\n", "print(\"\")\n", "texto07=\"Mi celular tiene poca batería\"\n", "doc07=nlp03(texto07)\n", "print(\"Texto 7\",doc07)\n", "# Print the similarity between the sentences\n", "print(\"Similaridad(Texto 5, Texto 7)=\",doc05.similarity(doc07))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Objects of different types can also be compared, for example a Doc and a Span." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unaParte= rica cerveza\n", "Similaridad(Texto 5, unSpan)= 0.4644829034805298\n" ] } ], "source": [ "# Get a Span from doc05\n", "unSpan=Span(doc05,2,4)\n", "print(\"unaParte=\",unSpan)\n", "# Compare Texto 5 with unSpan\n", "print(\"Similaridad(Texto 5, unSpan)=\",doc05.similarity(unSpan))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Combining statistical models and rules\n", "Statistical models are useful when the application needs to generalize from a few examples (for instance, deciding whether a span is an ORG). Rule-based approaches such as the Matcher and the PhraseMatcher work well on finite sets of known items (for instance, a list of countries)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1. Rule-based patterns\n", "The PhraseMatcher finds phrases in a text while still giving access to the matched tokens in context. Its patterns are Doc objects rather than lists of token dictionaries.\n", "It can find long lists of terms in large volumes of text faster and more efficiently than the Matcher." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example: finding a sub-phrase inside a sentence with PhraseMatcher\n", "#\n", "# Import the class\n", "from spacy.matcher import PhraseMatcher\n", "# A sentence\n", "texto07=\"Yo tengo un perro labrador dorado de 25 hg.\"\n", "doc07=nlp03(texto07)\n", "subFrase=\"labrador dorado\"\n", "docSubFrase=nlp03(subFrase)\n", "# Create the matcher\n", "matcher=PhraseMatcher(nlp03.vocab)\n", "# The pattern is a Doc: no token dictionaries are needed\n", "patron=docSubFrase\n", "matcher.add(\"PERRO\",[patron])\n", "# Apply the matcher to the sentence\n", "res=matcher(doc07)\n", "print(\"Res=\",res)\n", "# Show each matched Span\n", "for match_id,start,end in res:\n", "    spanRes=doc07[start:end]\n", "    print(\"span resultado:\",spanRes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Customizing the pipeline\n", "The pipeline uses the trained components of spaCy models. It can also be extended with additional components, for example to add entities.\n", "### 7.1. Pipeline components\n", "* Tagger: sets part-of-speech tags, among others\n", "* Parser: sets dependency attributes such as head, dep and sents\n", "* NER: sets the named entities in doc.ents\n", "* Textcat: classifies the text" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Componentes del pipeline nlp03: ['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']\n", "Detalle del pipeline nlp03: [('tok2vec', ), ('morphologizer', ), ('parser', ), ('attribute_ruler', ), ('lemmatizer', ), ('ner', )]\n" ] } ], "source": [ "# Names of the pipeline components:\n", "print(\"Componentes del pipeline nlp03:\",nlp03.pipe_names)\n", "print(\"Detalle del pipeline nlp03:\",nlp03.pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.2. Custom pipeline components\n", "Custom functionality can be added to the pipeline. For example, a component can modify a Doc, add custom metadata to documents, or update attributes such as doc.ents." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pipeline: ['contarDoc', 'tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']\n" ] } ], "source": [ "# Example: create a component that prints the number of tokens in the Doc\n", "#\n", "# A pipeline:\n", "nlp04=spacy.load(\"es_core_news_sm\")\n", "# The function:\n", "@nlp04.component(\"contarDoc\")\n", "def contarTokens (doc):\n", "    # Print the length of the Doc:\n", "    print(\"Longitud del documento: \",len(doc))\n", "    return doc\n", "# Add the component at the start of the pipeline\n", "nlp04.add_pipe(\"contarDoc\",first=True)\n", "# Print the component names\n", "print(\"Pipeline:\",nlp04.pipe_names)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Longitud del documento: 12\n", "Texto: Argentina ganó la copa Mundial FIFA 2022. Francia salió segundo.\n" ] } ], "source": [ "# Process a text:\n", "doc05=nlp04(texto04)\n", "print(\"Texto:\",doc05)" ] } ], "metadata": { "colab": { "name": "Named entity recognition using spacy model.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 1 }