{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g_nWetWWd_ns"
      },
      "source": ["##### Copyright 2023 The TensorFlow Datasets Authors."]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "2pHVBk_seED1"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M7vSdG6sAIQn"
      },
      "source": ["# TFDS for Jax and PyTorch"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fwc5GKHBASdc"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/datasets/tfless_tfds\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/data_source.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/datasets/blob/master/docs/data_source.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/datasets/docs/data_source.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9ee074e4"
      },
      "source": [
        "TFDS has always been framework-agnostic. For instance, you can easily load\n",
        "datasets in\n",
        "[NumPy format](https://www.tensorflow.org/datasets/api_docs/python/tfds/as_numpy)\n",
        "for usage in Jax and PyTorch.\n",
        "\n",
        "TensorFlow and its data loading solution\n",
        "([`tf.data`](https://www.tensorflow.org/guide/data)) are first-class citizens in\n",
        "our API by design.\n",
        "\n",
        "We extended TFDS to support TensorFlow-less NumPy-only data loading. This can\n",
        "be convenient for usage in ML frameworks such as Jax and PyTorch. Indeed,\n",
        "for the latter users, TensorFlow can:\n",
        "\n",
        "- reserve GPU/TPU memory;\n",
        "- increase build time in CI/CD;\n",
        "- take time to import at runtime.\n",
        "\n",
        "TensorFlow is no longer a dependency to read datasets.\n",
        "\n",
        "ML pipelines need a data loader to load examples, decode them, and present\n",
        "them to the model. Data loaders use the\n",
        "\"source/sampler/loader\" paradigm:\n",
        "\n",
        "```\n",
        " TFDS dataset       ┌────────────────┐\n",
        "   on disk          │                │\n",
        "        ┌──────────►│      Data      │\n",
        "|..|... │     |     │     source     ├─┐\n",
        "├──┼────┴─────┤     │                │ │\n",
        "│12│image12   │     └────────────────┘ │    ┌────────────────┐\n",
        "├──┼──────────┤                        │    │                │\n",
        "│13│image13   │                        ├───►│      Data      ├───► ML pipeline\n",
        "├──┼──────────┤                        │    │     loader     │\n",
        "│14│image14   │     ┌────────────────┐ │    │                │\n",
        "├──┼──────────┤     │                │ │    └────────────────┘\n",
        "|..|...       |     │     Index      ├─┘\n",
        "                    │    sampler     │\n",
        "                    │                │\n",
        "                    └────────────────┘\n",
        "```\n",
        "\n",
        "- The data source is responsible for accessing and decoding examples from a TFDS\n",
        "dataset on the fly.\n",
        "- The index sampler is responsible for determining the order in which records\n",
        "are processed. This is important to implement global transformations (e.g.,\n",
        "global shuffling, sharding, repeating for multiple epochs) before reading any\n",
        "records.\n",
        "- The data loader orchestrates the loading by leveraging the data source and the\n",
        "index sampler. It allows performance optimization (e.g., pre-fetching,\n",
        "multiprocessing or multithreading).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UaWdLA3fQDK2"
      },
      "source": [
        "## TL;DR\n",
        "\n",
        "`tfds.data_source` is an API to create data sources:\n",
        "\n",
        "1. for fast prototyping in pure-Python pipelines;\n",
        "2. to manage data-intensive ML pipelines at scale."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aLho3l_Vd0a5"
      },
      "source": [
        "## Setup\n",
        "\n",
        "Let's install and import the needed dependencies:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c4COEsqIdvYH"
      },
      "outputs": [],
      "source": [
        "!pip install array_record\n",
        "!pip install grain-nightly\n",
        "!pip install jax jaxlib\n",
        "!pip install tfds-nightly\n",
        "\n",
        "import os\n",
        "os.environ.pop('TFDS_DATA_DIR', None)\n",
        "\n",
        "import tensorflow_datasets as tfds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CjEJeF1Id_JM"
      },
      "source": [
        "## Data sources\n",
        "\n",
        "Data sources are basically Python sequences. So they need to implement the\n",
        "following protocol:\n",
        "\n",
        "```python\n",
        "from typing import SupportsIndex\n",
        "\n",
        "class RandomAccessDataSource(Protocol):\n",
        "  \"\"\"Interface for datasources where storage supports efficient random access.\"\"\"\n",
        "\n",
        "  def __len__(self) -> int:\n",
        "    \"\"\"Number of records in the dataset.\"\"\"\n",
        "\n",
        "  def __getitem__(self, key: SupportsIndex) -> Any:\n",
        "    \"\"\"Retrieves the record for the given key.\"\"\"\n",
        "```\n",
        "\n",
        "The underlying file format needs to support efficient random access. At the\n",
        "moment, TFDS relies on [`array_record`](https://github.com/google/array_record).\n",
        "\n",
        "[`array_record`](https://github.com/google/array_record) is a new file format\n",
        "derived from [Riegeli](https://github.com/google/riegeli), achieving a new\n",
        "frontier of IO efficiency. In particular, ArrayRecord supports parallel read,\n",
        "write, and random access by record index. ArrayRecord builds on top of Riegeli\n",
        "and supports the same compression algorithms.\n",
        "\n",
        "[`fashion_mnist`](https://www.tensorflow.org/datasets/catalog/fashion_mnist) is\n",
        "a common dataset for computer vision. To retrieve an ArrayRecord-based data\n",
        "source with TFDS, simply use:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9Tslzx0_eEWx"
      },
      "outputs": [],
      "source": ["ds = tfds.data_source('fashion_mnist')"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AlaRrD_SeHLY"
      },
      "source": [
        "`tfds.data_source` is a convenient wrapper. It is equivalent to:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "duHDKzXReIKB"
      },
      "outputs": [],
      "source": [
        "builder = tfds.builder('fashion_mnist', file_format='array_record')\n",
        "builder.download_and_prepare()\n",
        "ds = builder.as_data_source()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rlyIsd0ueKjQ"
      },
      "source": [
        "This outputs a dictionary of data sources:\n",
        "\n",
        "```\n",
        "{\n",
        "  'train': DataSource(name=fashion_mnist, split='train', decoders=None),\n",
        "  'test': DataSource(name=fashion_mnist, split='test', decoders=None),\n",
        "}\n",
        "```\n",
        "\n",
        "Once `download_and_prepare` has run, and you generated the record files, we\n",
        "don't need TensorFlow anymore. Everything will happen in Python/NumPy!\n",
        "\n",
        "Let's check this by uninstalling TensorFlow and re-loading the data source\n",
        "in another subprocess:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mTfSzvaQkSd9"
      },
      "outputs": [],
      "source": ["!pip uninstall -y tensorflow"]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3sT5AN7neNT9"
      },
      "outputs": [],
      "source": [
        "%%writefile no_tensorflow.py\n",
        "import os\n",
        "os.environ.pop('TFDS_DATA_DIR', None)\n",
        "\n",
        "import tensorflow_datasets as tfds\n",
        "\n",
        "try:\n",
        "  import tensorflow as tf\n",
        "except ImportError:\n",
        "  print('No TensorFlow found...')\n",
        "\n",
        "ds = tfds.data_source('fashion_mnist')\n",
        "print('...but the data source could still be loaded...')\n",
        "ds['train'][0]\n",
        "print('...and the records can be decoded.')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FxohFdb3kSxh"
      },
      "outputs": [],
      "source": ["!python no_tensorflow.py"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1o8n-BhhePYY"
      },
      "source": [
        "In future versions, we are also going to make the dataset preparation\n",
        "TensorFlow-free.\n",
        "\n",
        "A data source has a length:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qtfl17SQeQ7F"
      },
      "outputs": [],
      "source": ["len(ds['train'])"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a-UFBu8leSMp"
      },
      "source": ["Accessing the first element of the dataset:"]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tFvT2Sx2eToh"
      },
      "outputs": [],
      "source": ["%%timeit\n", "ds['train'][0]"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VTgZskyZeU_D"
      },
      "source": [
        "...is just as cheap as accessing any other element. This is the definition of\n",
        "[random access](https://en.wikipedia.org/wiki/Random_access):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cPJFa6aIeWcY"
      },
      "outputs": [],
      "source": ["%%timeit\n", "ds['train'][1000]"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fs3kafYheX6N"
      },
      "source": [
        "Features now use NumPy DTypes (rather than TensorFlow DTypes). You can inspect\n",
        "the features with:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "q7x5AEEaeZja"
      },
      "outputs": [],
      "source": ["features = tfds.builder('fashion_mnist').info.features"]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VOnLqAZOeiBi"
      },
      "source": [
        "You'll find more information about\n",
        "[the features in our documentation](https://www.tensorflow.org/datasets/api_docs/python/tfds/features).\n",
        "Here we can notably retrieve the shape of the images, and the number of classes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Xk8Vc-y0edlb"
      },
      "outputs": [],
      "source": [
        "shape = features['image'].shape\n",
        "num_classes = features['label'].num_classes"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eFh8pytVemsu"
      },
      "source": [
        "## Use in pure Python\n",
        "\n",
        "You can consume data sources in Python by iterating over them:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ULjO-JDVefNf"
      },
      "outputs": [],
      "source": [
        "for example in ds['train']:\n",
        "  print(example)\n",
        "  break"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gZRHZNOkenPb"
      },
      "source": [
        "If you inspect elements, you will also notice that all features are already\n",
        "decoded using NumPy. Behind the scenes, we use [OpenCV](https://opencv.org)\n",
        "by default because it is fast. If you don't have OpenCV installed, we default\n",
        "to [Pillow](python-pillow.org) to provide lightweight and fast image\n",
        "decoding.\n",
        "\n",
        "```\n",
        "{\n",
        "  'image': array([[[0], [0], ..., [0]],\n",
        "                  [[0], [0], ..., [0]]], dtype=uint8),\n",
        "  'label': 2,\n",
        "}\n",
        "```\n",
        "\n",
        "**Note**: Currently, the feature is only available for `Tensor`, `Image` and\n",
        "`Scalar` features. The `Audio` and `Video` features will come soon. Stay tuned!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8kLyK5j1enhc"
      },
      "source": [
        "## Use with PyTorch\n",
        "\n",
        "PyTorch uses the source/sampler/loader paradigm. In Torch, \"data sources\" are\n",
        "called \"datasets\".\n",
        "[`torch.utils.data`](https://pytorch.org/docs/stable/data.html) contains all the\n",
        "details you need to know to build efficient input pipelines in Torch.\n",
        "\n",
        "TFDS data sources can be used as regular\n",
        "[map-style datasets](https://pytorch.org/docs/stable/data.html#map-style-datasets).\n",
        "\n",
        "First we install and import Torch:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3aKol1fDeyoK"
      },
      "outputs": [],
      "source": [
        "!pip install torch\n",
        "\n",
        "from tqdm import tqdm\n",
        "import torch"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HKdJvYywe0YC"
      },
      "source": [
        "We already defined data sources for training and testing (respectively,\n",
        "`ds['train']` and `ds['test']`). We can now define the sampler and the loaders:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_4P2JIrie23f"
      },
      "outputs": [],
      "source": [
        "batch_size = 128\n",
        "train_sampler = torch.utils.data.RandomSampler(ds['train'], num_samples=5_000)\n",
        "train_loader = torch.utils.data.DataLoader(\n",
        "    ds['train'],\n",
        "    sampler=train_sampler,\n",
        "    batch_size=batch_size,\n",
        ")\n",
        "test_loader = torch.utils.data.DataLoader(\n",
        "    ds['test'],\n",
        "    sampler=None,\n",
        "    batch_size=batch_size,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EVhofOm4e53O"
      },
      "source": [
        "Using PyTorch, we train and evaluate a simple logistic regression on the first\n",
        "examples:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HcAmvMa-e42p"
      },
      "outputs": [],
      "source": [
        "class LinearClassifier(torch.nn.Module):\n",
        "  def __init__(self, shape, num_classes):\n",
        "    super(LinearClassifier, self).__init__()\n",
        "    height, width, channels = shape\n",
        "    self.classifier = torch.nn.Linear(height * width * channels, num_classes)\n",
        "\n",
        "  def forward(self, image):\n",
        "    image = image.view(image.size()[0], -1).to(torch.float32)\n",
        "    return self.classifier(image)\n",
        "\n",
        "\n",
        "model = LinearClassifier(shape, num_classes)\n",
        "optimizer = torch.optim.Adam(model.parameters())\n",
        "loss_function = torch.nn.CrossEntropyLoss()\n",
        "\n",
        "print('Training...')\n",
        "model.train()\n",
        "for example in tqdm(train_loader):\n",
        "  image, label = example['image'], example['label']\n",
        "  prediction = model(image)\n",
        "  loss = loss_function(prediction, label)\n",
        "  optimizer.zero_grad()\n",
        "  loss.backward()\n",
        "  optimizer.step()\n",
        "\n",
        "print('Testing...')\n",
        "model.eval()\n",
        "num_examples = 0\n",
        "true_positives = 0\n",
        "for example in tqdm(test_loader):\n",
        "  image, label = example['image'], example['label']\n",
        "  prediction = model(image)\n",
        "  num_examples += image.shape[0]\n",
        "  predicted_label = prediction.argmax(dim=1)\n",
        "  true_positives += (predicted_label == label).sum().item()\n",
        "print(f'\\nAccuracy: {true_positives/num_examples * 100:.2f}%')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ewKJQpwZe6Ik"
      },
      "source": [
        "## Use with JAX\n",
        "\n",
        "[Grain](https://github.com/google/grain) is a library for reading data for\n",
        "training and evaluating JAX models. It's open source, fast and deterministic.\n",
        "Grain uses the source/sampler/loader paradigm, so we can re-use\n",
        "`tfds.data_source`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wG31LBso4IMv"
      },
      "outputs": [],
      "source": [
        "import grain.python as pygrain\n",
        "import numpy as np\n",
        "\n",
        "data_source = tfds.data_source(\"fashion_mnist\", split=\"train\")\n",
        "\n",
        "# To shuffle the data, use a sampler:\n",
        "sampler = pygrain.IndexSampler(\n",
        "    num_records=5,\n",
        "    num_epochs=1,\n",
        "    shard_options=pygrain.NoSharding(),\n",
        "    shuffle=True,\n",
        "    seed=0,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RGrYQmdF4IMv"
      },
      "source": [
        "Transformations are defined as classes and can be `BatchTransform`,\n",
        "`FilterTransform` or `MapTransform`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NG6oyzc34IMv"
      },
      "outputs": [],
      "source": [
        "class ImageToText(pygrain.MapTransform):\n",
        "  \"\"\"Maps an image to text.\"\"\"\n",
        "\n",
        "  LABEL_TO_TEXT = {\n",
        "      0: \"zero\",\n",
        "      1: \"one\",\n",
        "      2: \"two\",\n",
        "      3: \"three\",\n",
        "      4: \"four\",\n",
        "      5: \"five\",\n",
        "      6: \"six\",\n",
        "      7: \"seven\",\n",
        "      8: \"height\",\n",
        "      9: \"nine\",\n",
        "  }\n",
        "\n",
        "  def map(self, element: dict[str, np.ndarray]) -> dict[str, np.ndarray]:\n",
        "    label = element[\"label\"]\n",
        "    text = self.LABEL_TO_TEXT[label]\n",
        "    element[\"text\"] = text\n",
        "    return element\n",
        "\n",
        "# You can chain transformations in a list:\n",
        "operations = [ImageToText()]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "53U1d8Yj4IM9"
      },
      "source": [
        "Finally, the data loader takes care of orchestrating the loading. You can scale\n",
        "up with multiprocessing to enjoy both the flexibility of Python and the\n",
        "performance of a data loader:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SQEP8Bhp4IM9"
      },
      "outputs": [],
      "source": [
        "loader = pygrain.DataLoader(\n",
        "    data_source=data_source,\n",
        "    operations=operations,\n",
        "    sampler=sampler,\n",
        "    worker_count=0,  # Scale to multiple workers in multiprocessing\n",
        ")\n",
        "\n",
        "for element in loader:\n",
        "  print(element[\"text\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JvLEtCWRvvy8"
      },
      "source": [
        "## Read more\n",
        "\n",
        "For more information, please refer to [`tfds.data_source`](https://www.tensorflow.org/datasets/api_docs/python/tfds/data_source) API doc."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}