I'm curious whether anyone knows what the underlying mechanism is causing this...
My (small) environment has replication across multiple nodes in the cluster, so snapshotting hasn't seemed important for our use case. However, we're looking at migrating to ES 9.0.1, and a snapshot seemed like a good step before rolling out the upgrade.
So, this is the simplest of environments: one cluster, four nodes. I set up a repository in Google Cloud Storage, registered it, and snapshotted a couple of indices to make sure it was basically operational. Then I removed the index specification, so the snapshot included all indices, all features, and the global state.
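For reference, the setup was roughly along these lines (repository, bucket, and snapshot names here are placeholders, not my actual values):

```
PUT _snapshot/my_gcs_repo
{
  "type": "gcs",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "compress": true
  }
}

PUT _snapshot/my_gcs_repo/pre-upgrade-snapshot?wait_for_completion=true
{
  "include_global_state": true
}
```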
Three of the nodes are hot nodes, and hot indices are configured with one replica; the fourth node is the warm node, with no replicas. Approximate storage used on disk is 56GB for the hot data (so, duplicated data) and 108GB for the warm data.
So, here's the question: one snapshot of what is 164GB in local storage turns out to be over 600GB in the repository. This is backed up by the API call GET _snapshot/REPOSITORY/SNAPSHOTNAME/_status?pretty (641,720,111,926 bytes), by querying the bucket metadata, and by the approximate transfer time/bandwidth to Cloud Storage.
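The relevant part of the _status response looks something like this (heavily trimmed; the size figure above comes from the stats.total section):

```
GET _snapshot/REPOSITORY/SNAPSHOTNAME/_status?pretty

{
  "snapshots": [
    {
      "stats": {
        "total": {
          "size_in_bytes": 641720111926
        }
      }
    }
  ]
}
```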
That's with the "compression" option enabled for the repository.
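I double-checked the setting by retrieving the repository definition (repository name is a placeholder), which does show compression on:

```
GET _snapshot/my_gcs_repo

{
  "my_gcs_repo": {
    "type": "gcs",
    "settings": {
      "bucket": "my-snapshot-bucket",
      "compress": "true"
    }
  }
}
```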
Any ideas what's happening here? This is really just a one-off, but I'd like to understand what's going on...