# Use with Pandas

This document is a quick introduction to using `datasets` with Pandas, with a particular focus on how to process
datasets using Pandas functions, and how to convert a dataset to Pandas or from Pandas.

This is useful because `datasets` stores its data with PyArrow, which is well integrated with Pandas, so converting to Pandas and running Pandas operations is fast.

## Dataset format

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get Pandas DataFrames or Series instead, you can set the format of the dataset to `pandas` using [`Dataset.with_format`]:

```py
>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("pandas")
>>> ds[0]       # pd.DataFrame
  col_0  col_1
0     a    0.0
>>> ds[:2]      # pd.DataFrame
  col_0  col_1
0     a    0.0
1     b    0.0
>>> ds["col_0"]  # pd.Series
0    a
1    b
2    c
3    d
Name: col_0, dtype: object
```

This also works for `IterableDataset` objects obtained e.g. using `load_dataset(..., streaming=True)`:

```py
>>> ds = ds.with_format("pandas")
>>> for df in ds.iter(batch_size=2):
...     print(df)
...     break
  col_0  col_1
0     a    0.0
1     b    0.0
```

## Process data

Pandas functions are generally faster than hand-written Python functions that process one row at a time, so they are a good option for optimizing data processing. You can use Pandas functions to process a dataset in [`Dataset.map`] or [`Dataset.filter`]:

```python
>>> from datasets import Dataset
>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]}
>>> ds = Dataset.from_dict(data)
>>> ds = ds.with_format("pandas")
>>> ds = ds.map(lambda df: df.assign(col_2=df.col_1 + 1), batched=True)
>>> ds[:2]
  col_0  col_1  col_2
0     a    0.0    1.0
1     b    0.0    1.0
>>> ds = ds.filter(lambda df: df.col_0 == "b", batched=True)
>>> ds[0]
  col_0  col_1  col_2
0     b    0.0    1.0
```

We use `batched=True` because it is faster to process batches of data in Pandas rather than row by row. It's also possible to use `batch_size=` in `map()` to set the size of each `df`.

This also works for [`IterableDataset.map`] and [`IterableDataset.filter`].

## Import from or export to Pandas

To import data from Pandas, you can use [`Dataset.from_pandas`]:

```python
ds = Dataset.from_pandas(df)
```

To export a `Dataset` to a Pandas DataFrame, call [`Dataset.to_pandas`] on the dataset:

```python
df = ds.to_pandas()
```