A slice of Buckets: Diffing + Syncing + Archiving

buckets • Jun 18, 2020

If you haven't heard of Buckets, I strongly recommend you take a quick look into our docs overview to get a sense of what they are. In short, you can store your personal or project files in a decentralized ecosystem of technologies built on top of IPFS and libp2p.

Storing your files on a decentralized network is an exciting topic for many reasons, which we won't cover here. But excitement alone isn't enough when it comes to developing new ways of solving problems. Along the way, developers sometimes forget about the user experience (UX). At Textile, we care about creating a pleasant UX while making disruptive new tools, and we try to avoid forcing users to make tradeoffs when adopting new technologies.

Even though Buckets stores files on top of many novel protocols, it faces the same UX challenges as any other storage tool if it is to be easy to adopt and a competitive alternative to existing workflows. In particular, as a file storage solution, it should ease the burden of syncing local changes with decentralized storage.

This article explains how we implemented the first version of a solution for syncing your local bucket changes to your remote backend without demanding significant extra storage or transfer.

Merkle DAGs to the rescue

A Merkle DAG is a fascinating data structure at the core of IPFS, on top of which Buckets sits. This data structure appears again and again in many other systems, including Git, many databases, and blockchain implementations such as Bitcoin and Ethereum.

Let's take a quick look at a simplified version of a UnixFS Merkle DAG which represents a folder to get a sense of what we're talking about:

The DAG above represents a repository with the following tree output:

.
├── LICENSE
├── README.md
└── cmd
    ├── buckets.go
    └── main.go

1 directory, 4 files

Every node has a value Qm..., which is its identifier. This identifier is not an arbitrary value but a fingerprint of the complete sub-tree: if any node in the sub-tree changes, the identifier of its root changes too. This practice is known as content-addressing. From here on, we'll refer to node identifiers as Cids.

The central point of this type of DAG is that edges are named links that hold the Cids of other nodes. As we'll see soon, that makes it very convenient to quickly understand what changed in a directory.
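To make the fingerprint idea concrete, here is a minimal sketch in Python. It is not how IPFS computes real Cids (those involve UnixFS encoding, chunking, and multihashes); it only assumes a plain SHA-256 over a nested dictionary, and the fingerprint function and file contents are made up for illustration.

```python
import hashlib
import json

def fingerprint(node):
    """Return a hex digest that depends on the node's whole sub-tree."""
    if isinstance(node, bytes):
        # File: hash the raw content.
        return hashlib.sha256(b"file:" + node).hexdigest()
    # Directory: hash the sorted (name, child fingerprint) pairs.
    links = sorted((name, fingerprint(child)) for name, child in node.items())
    return hashlib.sha256(b"dir:" + json.dumps(links).encode()).hexdigest()

# Hypothetical contents for the repository shown in the tree output.
repo = {
    "LICENSE": b"MIT",
    "README.md": b"# Buckets",
    "cmd": {"buckets.go": b"package main // v1", "main.go": b"package main"},
}

before = fingerprint(repo)
repo["cmd"]["buckets.go"] = b"package main // v2"
after = fingerprint(repo)

print(before != after)  # True: editing one file changes the root fingerprint
```

Because a directory's digest is built from its children's digests, there is no way to change a file without also changing every ancestor up to the root.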

Also, each rounded box describes a file or folder, while at the leaves, the square boxes contain chunked data of files. This is an important fact that will be leveraged by Buckets diffing, so hold it in your mind for a bit!

Diffing in DAGs

Now that we know that Merkle DAGs are content-addressed, let's see how changing a file changes node identifiers. Suppose we change the buckets.go file in our repository. The Merkle DAG of our current repository state is now the following:

Here we can appreciate how the change has affected the Cids of buckets.go, but also cmd and the root node. The path of Cid change is marked in red. By comparing the above diagram with the original one, we can quickly see how easy it is to identify what changed in our repository.

By comparing the root nodes of this new and original DAG, we can say that something in the repository has changed. Moving to the children, we can keep narrowing down the path of change in the structure, quickly pruning paths that we know didn't change.

The fact that a node's Cid is a fingerprint of its complete subtree lets us check for changes with a single quick comparison. Put differently, we don't need to traverse whole subtrees when we can assert that their Cids are unchanged. That's a fascinating insight. Let that sink in!

The TL;DR of this story: if you have a Merkle DAG of the previous version of your repository, you can leverage it to understand what changed in your current repository state. The actual algorithm walks both DAGs and compares each node's links and Cid values.
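The double walk can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the code Buckets runs: the store dictionary, the changed_paths function, and the short Cid labels are all hypothetical, and directories are modeled as a map from link name to child Cid.

```python
def changed_paths(store, old_cid, new_cid, path=""):
    """Walk two DAGs in lockstep, skipping sub-trees whose Cids match."""
    if old_cid == new_cid:
        return []  # Same fingerprint: the whole sub-tree is unchanged.
    old_links = store.get(old_cid) or {}
    new_links = store.get(new_cid) or {}
    if not old_links and not new_links:
        return [path]  # Two different files at the same path.
    changes = []
    for name in sorted(set(old_links) | set(new_links)):
        child = f"{path}/{name}" if path else name
        if name in old_links and name in new_links:
            changes += changed_paths(store, old_links[name], new_links[name], child)
        else:
            changes.append(child)  # Link exists on one side only.
    return changes

# Directory nodes only; Cids missing from the store are treated as files.
store = {
    "root1": {"LICENSE": "lic", "README.md": "readme", "cmd": "cmd1"},
    "cmd1": {"buckets.go": "buckets1", "main.go": "main"},
    "root2": {"LICENSE": "lic", "README.md": "readme", "cmd": "cmd2"},
    "cmd2": {"buckets.go": "buckets2", "main.go": "main"},
}

print(changed_paths(store, "root1", "root2"))  # ['cmd/buckets.go']
```

The early return on equal Cids is the pruning step: an unchanged directory is dismissed with one comparison, no matter how many files live under it.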

Buckets diffing

When a bucket is pushed to the remote, its Merkle DAG representation is saved locally as a reference of the latest pushed version.

When you execute buck status, it compares the persisted Merkle DAG with a freshly generated Merkle DAG of the bucket's local state. As we mentioned in the last section, walking both DAGs and comparing Cids quickly yields the paths that changed relative to the last known version.

While comparing both DAGs, we can distinguish the following situations:

If a node in the current DAG doesn't have a link with a name that existed in the previous DAG, we know that file was deleted.
If a node in the current DAG has a named link that wasn't present in the prior version of the DAG, we know that file is new.
If a node in the current DAG still has the same-named link as the previous DAG, but its Cid value has changed, we know that file was modified.
We can go further and agree that if a removed link and a new link have different names but the same Cid, the file was renamed.

Having this list of changes allows us to show what changed and which actions should be performed on the remote to sync it to the new folder state.
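The four rules above can be written down for a single directory node as follows. This is a hedged Python sketch rather than the dagutils logic: the classify function and the Qm... placeholder Cids are invented for the example, and links are modeled as a plain name-to-Cid dictionary.

```python
def classify(old_links, new_links):
    """Compare two {name: cid} link maps and label each difference."""
    removed = {n: c for n, c in old_links.items() if n not in new_links}
    added = {n: c for n, c in new_links.items() if n not in old_links}
    changes = []
    for name in sorted(set(old_links) & set(new_links)):
        if old_links[name] != new_links[name]:
            changes.append(("modified", name))
    # A removed and an added link sharing a Cid are treated as a rename.
    removed_by_cid = {cid: name for name, cid in removed.items()}
    for name, cid in sorted(added.items()):
        old_name = removed_by_cid.pop(cid, None)
        if old_name is not None:
            changes.append(("renamed", f"{old_name} -> {name}"))
            del removed[old_name]
        else:
            changes.append(("added", name))
    changes += [("deleted", name) for name in sorted(removed)]
    return changes

# Placeholder Cids, not real ones.
old = {"LICENSE": "QmA", "README.md": "QmB", "main.go": "QmC", "old.go": "QmD"}
new = {"LICENCE": "QmA", "README.md": "QmB2", "main.go": "QmC", "new.go": "QmE"}

for kind, what in classify(old, new):
    print(kind, what)
```

Rename detection falls out of content-addressing for free: the file kept its Cid, so only the link name moved.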

For the implementation, we leveraged dagutils, which already implements the generation of the set of changes from one DAG to another. Just what we needed!

But wait, you may have already noticed that in the above diagrams, the tree leaves contain file data. So saving the Merkle DAG of the last pushed version is almost the same as keeping a copy of all the data!

There's an important optimization that we cared about in our implementation: stripping data leaves from the DAGs. Since we only care about the bucket tree structure at file-level granularity for our diffing concerns, we can avoid storing and comparing nodes of the DAG that are children of file type nodes.

This means that the overhead of saving the information needed to diff against the last pushed Bucket version is small. The total size and number of nodes of the DAG are decoupled from the total size of the Bucket data and depend only on the directory structure.

In a nutshell, when a bucket is pushed, the persisted Merkle DAG contains the minimum amount of information about the directory structure and data fingerprints.
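Here is a small Python sketch of what stripping data leaves buys us. The node layout (type, cid, links, chunks), the helper names, and the placeholder Cids are assumptions made for the example and don't mirror the real on-disk format.

```python
import json

def file_node(cid, data, chunk_size=4):
    """Build a file node whose children are chunks of its data."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {"type": "file", "cid": cid, "chunks": chunks}

def strip_data(node):
    """Copy a DAG node, dropping the chunk children of file nodes."""
    if node["type"] == "file":
        return {"type": "file", "cid": node["cid"]}
    links = {name: strip_data(child) for name, child in node["links"].items()}
    return {"type": "dir", "cid": node["cid"], "links": links}

def snapshot_size(dag):
    """Size in bytes of the stripped DAG when serialized as JSON."""
    return len(json.dumps(strip_data(dag)))

# Same directory structure, very different amounts of data.
small = {"type": "dir", "cid": "QmRoot1",
         "links": {"a.txt": file_node("QmA1", "x" * 10)}}
large = {"type": "dir", "cid": "QmRoot2",
         "links": {"a.txt": file_node("QmA2", "x" * 10_000)}}

print(snapshot_size(small) == snapshot_size(large))  # True
```

Both snapshots serialize to the same number of bytes even though one file is a thousand times larger, which is the property described above: the persisted DAG grows with the number of files and folders, not with their content.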

Implementation tradeoffs

First and foremost, you're invited to see the actual implementation of the above ideas in the textile repo. In particular, the diffing logic can be found in the buckets/local folder.

The implementation has to weigh several tradeoffs that affect resource consumption. Currently, on each buck status, a new Merkle DAG must be created to allow the double-walk comparison with the last pushed version.

This isn't a negligible amount of work, since generating the Cids involves applying cryptographic hashes to all the data in the Bucket and holding child Cid values to calculate parents all the way up to the root. The nice part is that the CLI is completely stateless and doesn't have any background daemon or filesystem event hooks. If you're not running buck status, you're not paying any cost.

Regarding memory consumption, there are some design knobs that can be tuned, such as using in-memory or persisted data stores to balance the CPU and RAM tradeoffs while applying the DAG-walking algorithm.

Similarly, exploring different Merkle DAG serialization codecs can affect the total persisted size of the last pushed snapshot on disk. It might also influence how fast the snapshot can be loaded if some sort of compression is applied, or the speed of lookups. Tradeoffs!
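To get a feeling for where that cost comes from, the following Python sketch fingerprints a directory on disk. It is a simplification with made-up helper names (hash_file, hash_dir) that uses plain SHA-256 instead of real Cids, but it has the same shape: every byte of every file is read and hashed, and parents can only be computed once their children are done.

```python
import hashlib
import os
import tempfile

def hash_file(path, chunk_size=256 * 1024):
    """Hash a file's content in fixed-size chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_dir(path):
    """Hash a directory bottom-up: children first, then the parent."""
    digest = hashlib.sha256()
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        child_hash = hash_dir(child) if os.path.isdir(child) else hash_file(child)
        digest.update(f"{name}:{child_hash}\n".encode())
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "cmd"))
    target = os.path.join(root, "cmd", "main.go")
    with open(target, "w") as f:
        f.write("package main\n")
    first = hash_dir(root)
    with open(target, "a") as f:
        f.write("// edited\n")
    second = hash_dir(root)

print(first != second)  # True: the edit propagates up to the root
```

Nothing is cached between the two hash_dir calls, which matches the stateless design: the work happens only when you ask for it, but it happens in full each time.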

Bucket syncing

Providing an easy and performant way of calculating local changes is just one part of the deal. Buckets also has solutions for other synchronization problems. Let's take a look at those!

The buck command already provides a good big picture of what it offers:

The Bucket Client.

Manages files and folders in an object storage bucket.

Usage:
  buck [command]

Available Commands:
  add         Add adds a UnixFS DAG locally.
  archive     Create a Filecoin archive
  cat         Cat bucket objects at path
  destroy     Destroy bucket and all objects
  help        Help about any command
  init        Initialize a new or existing bucket
  links       Show links to where this bucket can be accessed
  ls          List top-level or nested bucket objects
  pull        Pull bucket object changes
  push        Push bucket object changes
  root        Show local bucket root CID
  status      Show bucket object changes

Flags:
      --api string   API target (default "127.0.0.1:3006")
  -h, --help         help for buck

Use "buck [command] --help" for more information about a command.

In the last sections, we focused a bit on how buck status works in detail, but it's also interesting how this flow fits with buck push and buck pull.

buck push

The buck push command runs the diffing algorithm described previously and executes the actions needed to bring the remote in sync with your local bucket.

Let's take a look at initializing a bucket and pushing some content:

Super easy!

Now, say that after you check your local changes, you decide to push them to the remote just as we did before. If this operation is done without proper caution, it can lead to data loss. The solution should check that the remote and local states are moving to the new state from the same data snapshot.

Not doing so might lose some data, since your local bucket might be missing other changes pushed to the remote (e.g. when collaborating on buckets). For this reason, pushing new content to the remote includes the root Cid of the local bucket so the remote host can check that the wanted change can be fast-forwarded. If the remote notices that the change starts from a bucket state different from the latest known one, it rejects the push to avoid unsafe changes.

I'll push some new changes from another terminal referencing the same bucket, and then try to add a new file as we did above, which should fail since my local bucket is out of sync with the remote:
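The fast-forward check boils down to a compare-and-swap on the root Cid. The Python sketch below shows the idea only; the Remote class, the StaleRootError exception, and the placeholder Cids are hypothetical and say nothing about the actual wire protocol.

```python
class StaleRootError(Exception):
    """Raised when a push starts from a root the remote no longer has."""

class Remote:
    def __init__(self, root):
        self.root = root

    def push(self, base_root, new_root):
        # Accept the push only if the client built on the current root.
        if base_root != self.root:
            raise StaleRootError(
                f"remote is at {self.root}, push is based on {base_root}"
            )
        self.root = new_root

remote = Remote("QmRoot1")

# Terminal A pushes first: its base matches, so the remote fast-forwards.
remote.push("QmRoot1", "QmRoot2")

# Terminal B still thinks the remote is at QmRoot1, so its push is refused.
try:
    remote.push("QmRoot1", "QmRoot3")
    rejected = False
except StaleRootError:
    rejected = True

print(rejected, remote.root)  # True QmRoot2
```

Since the root Cid fingerprints the whole bucket, comparing that one value is enough to know whether the client has seen everything the remote has.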

Nice! The remote is warning us that our push operation isn't safe since our local bucket is behind on changes made to the remote.

Merging changes from different starting points opens the door to multiple conflict resolution mechanisms, each with different tradeoffs regarding how safe the merged result is. Currently, Buckets has a fast-forward policy to make pushes as predictable as possible without adding extra complexity to merging workflows.

buck pull

The buck pull command fetches the latest state from the remote into your local Bucket.

Considering our last failed push, let's retry our last attempt with some pulling afterward:

Wow, what happened here? As we can see, when doing buck pull without extra flags, we'll get remote changes and still keep local ones. Furthermore, if pulling remote changes involves a file that we have changed locally, the CLI will automatically preserve our local version so we don't lose any data.

We can also provide the --hard flag which, as we saw in the output of buck pull -h, will discard local changes and leave our local bucket exactly as the latest version of the remote. If you are annoyed by prompts asking for confirmation, you can also provide the --yes flag. Use it with caution!
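The difference between the two pull modes can be modeled on path-to-Cid maps. This Python sketch is only a mental model under stated assumptions: the pull function, its hard parameter, and the placeholder Cids are invented here, and treating a local deletion as a change worth keeping is our assumption rather than documented CLI behavior.

```python
def pull(base, local, remote, hard=False):
    """Return the local state after a pull, as a {path: cid} map."""
    if hard:
        return dict(remote)  # Mirror the remote exactly.
    merged = dict(remote)
    for path in set(base) | set(local):
        if local.get(path) != base.get(path):  # Changed locally since last sync.
            if path in local:
                merged[path] = local[path]     # Keep the local version.
            else:
                merged.pop(path, None)         # Assumption: keep the local deletion.
    return merged

base = {"a.txt": "QmA1", "b.txt": "QmB1"}                     # last synced state
local = {"a.txt": "QmA2", "b.txt": "QmB1"}                    # a.txt edited locally
remote = {"a.txt": "QmA3", "b.txt": "QmB2", "c.txt": "QmC1"}  # others pushed changes

print(pull(base, local, remote))
print(pull(base, local, remote, hard=True))
```

In the default mode, a.txt keeps the local edit while b.txt and c.txt come from the remote. With hard=True, the result is simply the remote state.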

Want your Bucket in Filecoin?

Buckets has native support for archiving your data in the Filecoin network using Powergate under the hood. The Bucket CLI has simple abstractions to make that super easy.

buck archive

The buck archive command lets you archive the current state of your bucket to the Filecoin network with a single command. A game changer!

This command has some extra sub-commands which are pretty useful, like buck archive status and buck archive info. Looking at their outputs, you get information about your last archive.

The most important fact about a decentralized data storage mechanism, such as Buckets, is that your data is not locked-in. When you do buck archive you store data in the Filecoin network, which means you can recover that information without any Textile tool or endpoint!

Please take a look at the following video, which starts a bucket from scratch, pushes some files, archives it, shows how that process unfolds, and retrieves the data using only the Lotus CLI. All this runs in a docker-compose setup which spins up the complete stack of Buckets (buckd, MongoDB, threadsd, IPFS), connected with a Powergate stack (powd, IPFS, Lotus Devnet). The Lotus node has a volume shared with the host machine for easier file inspection.

Enough talking, let's see it in action (remember to select high quality):

From a simple folder with two files, we pushed them to decentralized storage backed by IPFS. Later on, the complete Bucket snapshot was saved in the Filecoin network, and the data can be retrieved, if wanted, with no tool other than the Lotus CLI. Let that sink in!

This feature is not yet available in our production Hub. It's an experimental feature! You can run the above demo locally. The docker-compose setup I used in the video lives in this repository folder. Recall that you might want to add some extra volume binding to the Lotus devnet for easier access to the Lotus container where data gets retrieved.

Easy onboarding!

The easiest and fastest way to experiment with Buckets is to use the Hub. In the blink of an eye you'll have your Hub account and be ready to create buckets, while also accessing a ton of other really cool integrated services.

The Hub also has a neat CLI, which is a superset of the Bucket CLI:

$ hub -h
The Hub Client.

Usage:
  hub [command]

Available Commands:
  buck        Manage an object storage bucket
  destroy     Destroy your account
  help        Help about any command
  init        Initialize account
  keys        API key management
  login       Login
  logout      Logout
  orgs        Org management
  threads     Thread management
  whoami      Show current user

Flags:
      --api string       API target (default "api.textile.io:443")
  -h, --help             help for hub
  -s, --session string   User session token

Use "hub [command] --help" for more information about a command.

If you want to understand more about the Hub, you can read our blog post and start diving into it.

The buck command and its sub-commands are exactly the same as everything we covered in this article. In fact, just see its help output:

$ hub buck -h
Manages files and folders in an object storage bucket.

Usage:
  hub buck [command]

Aliases:
  buck, bucket

Available Commands:
  add         Add adds a UnixFS DAG locally.
  archive     Create a Filecoin archive
  cat         Cat bucket objects at path
  destroy     Destroy bucket and all objects
  init        Initialize a new or existing bucket
  links       Show links to where this bucket can be accessed
  ls          List top-level or nested bucket objects
  pull        Pull bucket object changes
  push        Push bucket object changes
  root        Show local bucket root CID
  status      Show bucket object changes

Flags:
  -h, --help   help for buck

Global Flags:
      --api string       API target (default "api.textile.io:443")
  -s, --session string   User session token

Use "hub buck [command] --help" for more information about a command.

Looks familiar, right? ;)

Conclusion

In this article, we covered a bit of technical background to understand how immutable data structures help in understanding change, and how they are applied in Buckets diffing as a backbone for bucket synchronization.

Having that covered, we moved on to buck push, which leverages the diffing algorithm to push only the changes that matter instead of a complete snapshot of the local bucket, and does so in a way that avoids data loss or unwanted results.

Additionally, we used the buck pull command, which allows us to sync our bucket with the latest remote version without losing current local changes. If local changes should be discarded, we can easily provide some flags to deal with that automatically.

Finally, we archived the bucket in the Filecoin network and showed that it can be retrieved with a Textile-agnostic tool such as the Lotus CLI.

We hope you've enjoyed how Buckets solves syncing with the remote, and hope you can also leverage the same theory and libs in your sync challenges!
