{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "cFloNx163DCr" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "iSdwTGPc3Hpj" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "BE2AKncl3QJZ" }, "source": [ "# Visualizing Data using the Embedding Projector in TensorBoard\n", "\n", "\n", " \n", " \n", " \n", " \n", "
\n", " View on TensorFlow.org\n", " \n", " Run in Google Colab\n", " \n", " View source on GitHub\n", " \n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "v4s3Sf2I3mJr" }, "source": [ "## Overview\n", "\n", "Using the **TensorBoard Embedding Projector**, you can graphically represent high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers.\n", "\n", "In this tutorial, you will learn how visualize this type of trained layer." ] }, { "cell_type": "markdown", "metadata": { "id": "6-0rhuaW9f2-" }, "source": [ "## Setup\n", "\n", "For this tutorial, we will be using TensorBoard to visualize an embedding layer generated for classifying movie review data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TjRkD3r3etuL" }, "outputs": [], "source": [ "try:\n", " # %tensorflow_version only exists in Colab.\n", " %tensorflow_version 2.x\n", "except Exception:\n", " pass\n", "\n", "%load_ext tensorboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mh22cCoM8t7e" }, "outputs": [], "source": [ "import os\n", "import tensorflow as tf\n", "import tensorflow_datasets as tfds\n", "from tensorboard.plugins import projector\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xlp6ZASQB5go" }, "source": [ "## IMDB Data \n", "\n", "We will be using a dataset of 25,000 IMDB movie reviews, each of which has a sentiment label (positive/negative). Each review is preprocessed and encoded as a sequence of word indices (integers). For simplicity, words are indexed by overall frequency in the dataset, for instance the integer \"3\" encodes the 3rd most frequent word appearing in all reviews. This allows for quick filtering operations such as: \"only consider the top 10,000 most common words, but eliminate the top 20 most common words\".\n", "\n", "As a convention, \"0\" does not stand for any specific word, but instead is used to encode any unknown word. Later in the tutorial, we will remove the row for \"0\" in the visualization.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s0Yiw05gIgqS" }, "outputs": [], "source": [ "(train_data, test_data), info = tfds.load(\n", " \"imdb_reviews/subwords8k\",\n", " split=(tfds.Split.TRAIN, tfds.Split.TEST),\n", " with_info=True,\n", " as_supervised=True,\n", ")\n", "encoder = info.features[\"text\"].encoder\n", "\n", "# Shuffle and pad the data.\n", "train_batches = train_data.shuffle(1000).padded_batch(\n", " 10, padded_shapes=((None,), ())\n", ")\n", "test_batches = test_data.shuffle(1000).padded_batch(\n", " 10, padded_shapes=((None,), ())\n", ")\n", "train_batch, train_labels = next(iter(train_batches))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "RpvPVCwO7bDj" }, "source": [ "# Keras Embedding Layer\n", "\n", "A [Keras Embedding Layer](https://keras.io/layers/embeddings/) can be used to train an embedding for each word in your vocabulary. Each word (or sub-word in this case) will be associated with a 16-dimensional vector (or embedding) that will be trained by the model.\n", "\n", "See [this tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings?hl=en) to learn more about word embeddings." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "Fgoq5haqw8Z5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2500/2500 [==============================] - 13s 5ms/step - loss: 0.5330 - accuracy: 0.6769 - val_loss: 0.4043 - val_accuracy: 0.7800\n" ] } ], "source": [ "# Create an embedding layer.\n", "embedding_dim = 16\n", "embedding = tf.keras.layers.Embedding(encoder.vocab_size, embedding_dim)\n", "# Configure the embedding layer as part of a keras model.\n", "model = tf.keras.Sequential(\n", " [\n", " embedding, # The embedding layer should be the first layer in a model.\n", " tf.keras.layers.GlobalAveragePooling1D(),\n", " tf.keras.layers.Dense(16, activation=\"relu\"),\n", " tf.keras.layers.Dense(1),\n", " ]\n", ")\n", "\n", "# Compile model.\n", "model.compile(\n", " optimizer=\"adam\",\n", " loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", " metrics=[\"accuracy\"],\n", ")\n", "\n", "# Train model for one epoch.\n", "history = model.fit(\n", " train_batches, epochs=1, validation_data=test_batches, validation_steps=20\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "s9HmC29hdMnH" }, "source": [ "## Saving Data for TensorBoard\n", "\n", "TensorBoard reads tensors and metadata from the logs of your tensorflow projects. The path to the log directory is specified with `log_dir` below. For this tutorial, we will be using `/logs/imdb-example/`.\n", "\n", "In order to load the data into Tensorboard, we need to save a training checkpoint to that directory, along with metadata that allows for visualization of a specific layer of interest in the model. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pi8_SCYRdn9x" }, "outputs": [], "source": [ "# Set up a logs directory, so Tensorboard knows where to look for files.\n", "log_dir='/logs/imdb-example/'\n", "if not os.path.exists(log_dir):\n", " os.makedirs(log_dir)\n", "\n", "# Save Labels separately on a line-by-line manner.\n", "with open(os.path.join(log_dir, 'metadata.tsv'), \"w\") as f:\n", " for subwords in encoder.subwords:\n", " f.write(\"{}\\n\".format(subwords))\n", " # Fill in the rest of the labels with \"unknown\".\n", " for unknown in range(1, encoder.vocab_size - len(encoder.subwords)):\n", " f.write(\"unknown #{}\\n\".format(unknown))\n", "\n", "\n", "# Save the weights we want to analyze as a variable. 
"# Save the weights we want to analyze as a variable. Note that the first\n", "# row represents any unknown word, which has no entry in the metadata, so\n", "# we remove that row here.\n", "weights = tf.Variable(model.layers[0].get_weights()[0][1:])\n", "# Create a checkpoint from the embedding; the filename and key are\n", "# the name of the tensor.\n", "checkpoint = tf.train.Checkpoint(embedding=weights)\n", "checkpoint.save(os.path.join(log_dir, \"embedding.ckpt\"))\n", "\n", "# Set up config.\n", "config = projector.ProjectorConfig()\n", "embedding = config.embeddings.add()\n", "# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.\n", "embedding.tensor_name = \"embedding/.ATTRIBUTES/VARIABLE_VALUE\"\n", "embedding.metadata_path = 'metadata.tsv'\n", "projector.visualize_embeddings(log_dir, config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PtL_KzYMBIzP" }, "outputs": [], "source": [ "# Now run TensorBoard against the log data we just saved.\n", "%tensorboard --logdir /logs/imdb-example/" ] }, { "cell_type": "markdown", "metadata": { "id": "MG4hcUzQQoWA" }, "source": [ "## Analysis\n", "\n", "The TensorBoard Projector is a great tool for interpreting and visualizing embeddings. The dashboard allows users to search for specific terms and highlights words that are adjacent to each other in the embedding (low-dimensional) space. From this example, we can see that Wes **Anderson** and Alfred **Hitchcock** are both rather neutral terms, but that they are referenced in different contexts.\n", "\n", "In this space, Hitchcock is closer to words like `nightmare`, which is likely due to the fact that he is known as the \"Master of Suspense\", whereas Anderson is closer to the word `heart`, which is consistent with his relentlessly detailed and heartwarming style." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "tensorboard_projector_plugin.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }