Enabling the distributed web

developers • Jun 11, 2018

How content addressing and the protocols that it supports are shaping the future of the Internet

Lots of folks have been echoing the idea that the way the Internet operates now is a serious problem for the future of the web, not least because, despite its decentralized and open roots (seriously, read that one), the Internet has become a place of centralized giants and information monopolies. But it’s not all bad news. First, the Internet does work pretty well, and second, we can make it better. And that’s exactly what the fine folks behind the InterPlanetary File System (IPFS) aim to do. Their vision? A peer-to-peer hypermedia protocol to make the web faster, safer, and more open.

But what exactly does this mean? At its core, IPFS is a versioned file system that manages files and objects, stores them, and tracks their versions over time. IPFS also accounts for how those files move across the network (via protocols), so it is also a distributed file system. A key piece of the distributed web puzzle, and the topic of today’s post, is content addressing: how you actually refer to files and objects within the system. But first, what’s wrong with the Internet?

How the Internet works today

When you visit a website, or access an image, spreadsheet, dataset, or tweet over the Internet today, chances are you’re accessing that content via a link or URL that starts with http:// or https://. This link identifies the content you are after by its location: it points to a particular place on the web, served from a particular server, using a particular port, at a particular location on the planet. A common scenario works like this:

You fire up your favorite browser, type in a URL (it should probably be https://www.textile.photos/), and hit enter, and then…
Your browser uses a DNS (Domain Name System) resolver to ask for the IP address that corresponds to the domain name in the URL you entered. IP addresses are how most Internet-connected devices are located.
Your browser then initiates a connection to that IP address using a specific port. Web servers listen on port 80 by default, while port 443 is used for secure (HTTPS) connections.
The web server then processes the URL you entered and hands control to its back-end. The back-end is probably running code to generate the HTML page, which it hands back to the web server. The web server then sends the HTML page to the browser over the HTTP connection.
Finally, your web browser receives the HTML page, closes the connection to the web server, and renders the page on your screen. It is also responsible for executing any code (JavaScript) present in the response.
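
The first few steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any browser is actually implemented: the names `locate` and `resolve` are made up for this example, and the DNS lookup is defined but not called so the sketch can run without network access.

```python
import socket
from urllib.parse import urlsplit

# Default ports for the two schemes discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def locate(url):
    """Split a URL into the (host, port, path) a browser would connect to."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port, parts.path or "/"


def resolve(host, port):
    """Ask the system's DNS resolver for the IP addresses behind a host name.

    This needs network access, so it is defined here but not called.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


print(locate("https://www.textile.photos/"))
# ('www.textile.photos', 443, '/')
```

Notice that everything here is about where to go (a host and a port). Nothing in the address says anything about what the content is supposed to be.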

While modern web technologies have altered parts of this model, this general scenario remains the most common way to exchange information over the web. HTTP dates back to 1991, but didn’t really gain traction with web browsers until about 1996. So we’re talking about technology that is over 20 years old. And this has worked really well for a long time, but if you think seriously about what is happening here, you start to see how fragile it is. The biggest issue? Whoever controls the content’s location controls the content, because it’s location-addressed.

Location, location, location

There are plenty of efficiency/practical reasons why a location-addressed Internet is a crazy idea. For instance, even if a thousand people have downloaded a thousand copies of a file (or a tweet, or photo, or whatever) to a thousand different physical locations (their phone, computer, etc.), all references to that file still point to that original, single location! So again, whoever controls that location decides what people get to see when they access that link.

Now, let’s assume that most content providers are going to continue to provide you with the content they say they’ll provide. That they’re honest. Aside from the obvious implications of this strong assumption, there’s another issue at play: link rot. Trends change, interests wane, and the Internet is a fickle place. Couple this with the increasing costs of maintaining servers, updating infrastructure, and keeping up with the latest trends, and you’ve got a recipe for dead links. In fact, a 2014 study found that just about 50% of the URLs in U.S. Supreme Court opinions no longer link to the original information. As far back as 2003, some folks discovered that about one link out of every 200 disappeared each week from the Internet, leaving the average lifespan of a web page at just 100 days.

But there are more important social issues at stake here too. Consider the fact that, in 2018, relatively few large physical servers, operated by relatively few large corporations (think FAMGA), are responsible for hosting essential elements of what we consider the modern Internet. How did this centralization of an otherwise decentralized system happen? Convenience. We’ve traded control for convenience, and it has some potentially serious consequences, such as censorship, surveillance, the loss of net neutrality, and a whole lot more.

[What we have now is] a slow, expensive Internet, made even more costly by predatory last-mile carriers (in the U.S. at least), and the accelerating growth of connection requests from mobile devices. It’s not just slow and expensive, it’s unreliable. — techcrunch

How the Internet should work

If the current centralized web is all about location, then the future distributed web is all about content. But content addressing and location addressing are just different ways to refer to a particular file or object. Neither is necessarily slower or faster. The main difference is that location addressing tells the computer where to find a file, whereas content addressing says what to find. Content addressing doesn’t tell you how to get said file, and location addressing only tells you how to find it (not what it is). Get it?
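
To make the contrast concrete, here is a toy sketch using two Python dictionaries. It assumes SHA-256 as the hash function, and the URL and the name `address_of` are made up for illustration. It is not how IPFS stores data internally.

```python
import hashlib

# Location addressing: the key says WHERE the bytes live.
# The URL below is a made-up example.
location_store = {"https://example.com/cat.png": b"original cat picture"}

# Whoever controls the location can swap the bytes; the link stays the same.
location_store["https://example.com/cat.png"] = b"something else entirely"


def address_of(content):
    """Content addressing: the key is derived from the bytes themselves."""
    return hashlib.sha256(content).hexdigest()


content_store = {}
original = b"original cat picture"
content_store[address_of(original)] = original

# Different bytes get a different address, so they cannot silently
# take the place of the original.
replaced = b"something else entirely"
print(address_of(original) == address_of(replaced))  # False
```

In the first dictionary the link survives a content swap unchanged. In the second, swapped content would need a different key, so anyone holding the original key would notice.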

Content addressing works by identifying files or objects by their fingerprint, not their location. This is actually a really intuitive idea. When you ask someone for their favorite cat video, they probably aren’t going to say something like “oh haha, the one on this server, at this sub-domain, under this file path, slash hilarious dash cat dot png”. Instead, they’re going to describe the content of the video: “oh haha, the one where the cat knocks the glass off the counter, thug style… classic”.

So how do IPFS and other such systems create a fingerprint for a file? First, these types of systems are a synthesis of emerging and existing innovations, and it’s worth a quick read to become familiar with terms like distributed hash tables (DHT), block exchanges, Merkle DAGs, cryptographic hashes, and even blockchains. With those concepts in mind, the basic idea is that we identify content by its cryptographic hash, or even better, a self-describing content-addressed identifier. A cryptographic hash is a (relatively) short alphanumeric string that’s calculated by running your content through a cryptographic hash function (like SHA-256).
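
As a rough sketch of both ideas, the Python below computes a plain SHA-256 fingerprint and then a self-describing variant in the multihash style: the digest is prefixed with a code for the hash function (0x12 for sha2-256) and the digest length (0x20, i.e. 32 bytes), then base58-encoded. The function names are made up for this example. Note one important assumption: real IPFS hashes a structured, possibly chunked representation of a file rather than its raw bytes, so this sketch will not reproduce the identifier IPFS would assign to the same file.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data):
    """Encode bytes using the Bitcoin-style base58 alphabet."""
    number = int.from_bytes(data, "big")
    digits = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        digits = BASE58_ALPHABET[remainder] + digits
    # Each leading zero byte is written as a leading '1'.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + digits


def fingerprint(content):
    """Return the plain SHA-256 hex digest of some bytes."""
    return hashlib.sha256(content).hexdigest()


def self_describing_id(content):
    """Prefix the digest with its hash-function code and length, then encode."""
    digest = hashlib.sha256(content).digest()
    # 0x12 = sha2-256, 0x20 = 32-byte digest.
    return base58_encode(bytes([0x12, 0x20]) + digest)


print(fingerprint(b"hello"))
print(self_describing_id(b"hello"))
```

The second identifier carries enough information for a reader to know which hash function produced it, which is what makes it self-describing.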

To steal an example from an early post on IPFS from neocities, when the Textile logo is added to IPFS, it gets a new name: QmY3xr7CChe86Z4J8eqMsoZfKUTdHSbe5SYLD2TRLPhPKE. That name is actually the CID (Content IDentifier) for that file, computed from the raw data within that PNG. For all practical purposes it is unique to the contents of that file, and that file only: if I change the file by even one bit, the hash becomes something completely different. Now, when I want to access that file, I can simply ask the IPFS network for the file with that exact CID. The network will find the peers that have the data (using a DHT), retrieve it, and verify (using the CID) that it’s the correct file. This means I can get the file from multiple places, because as long as the file matches the hash, I know I’m getting the right data. Which brings us to some of the benefits of content addressing.
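
Here is a minimal sketch of that retrieve-and-verify idea. It makes several simplifying assumptions: a plain SHA-256 hex digest stands in for a real CID, peers are modelled as Python dictionaries, and peer discovery (the DHT) is not modelled at all. The names `cid_for` and `fetch` are hypothetical.

```python
import hashlib


def cid_for(content):
    """Stand-in for a CID: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(content).hexdigest()


def fetch(cid, peers):
    """Return the first copy whose hash matches the requested identifier."""
    for peer in peers:
        candidate = peer.get(cid)
        if candidate is not None and cid_for(candidate) == cid:
            return candidate
    raise LookupError("no peer returned content matching " + cid)


logo = b"pretend these are the bytes of a PNG"
logo_id = cid_for(logo)

# Flipping a single bit produces a completely different identifier.
tampered = bytes([logo[0] ^ 0x01]) + logo[1:]

honest_peer = {logo_id: logo}
dishonest_peer = {logo_id: tampered}  # claims to have the file, serves other bytes
empty_peer = {}

result = fetch(logo_id, [empty_peer, dishonest_peer, honest_peer])
print(result == logo)  # True
```

It does not matter which peer answers, or in what order they are asked. Only a copy that hashes to the requested identifier is accepted.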

It’s all about the content

The excellent online Decentralized Web Primer ebook provides thorough coverage of the power and benefits of content addressing (and IPFS and decentralization in general), so I won’t reiterate it all here. But a few key points are:

Content-addressed links are permanent. The link permanently points to exactly that content.

This is a potential solution to the massive problem of link rot, because it creates persistent data structures.

It lets us store data together. We share the responsibility of data preservation.

If you hold a copy of a dataset on any of your devices, or if you pay someone to host it for you (we’ll discuss Filecoin more in a future post), you become part of the network.

It increases the integrity of data. If the hashes match, I know I’m getting what I asked for.

Since we are requesting specific content and can check it against its hash, tampering by a man-in-the-middle becomes practically impossible to hide, and because the content can come from many peers, DDoS attacks are less likely to succeed.

It enables distribution. You don’t even need a connection to the wider web.

This means content can be stored and served very close to the user, possibly from a computer in the same room!

Content addressing also opens up the doors to sharing content more easily between apps and services. If you take a photo on your phone, and share it via a peer-to-peer (p2p) network such as IPFS, anyone (or any app) who has access to that photo can request it from anyone (or any app) that already has it. Fully distributed, deduplicated, and shareable across (potentially) all platforms. This is a big idea, and at Textile, we’re excited to push these kinds of capabilities into the photo sharing world.
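
Deduplication falls out of content addressing almost for free, as this toy sketch suggests. The `ContentStore` class and the two "apps" are hypothetical, and SHA-256 is assumed as the hash function. This is not Textile’s or IPFS’s actual storage layer.

```python
import hashlib


class ContentStore:
    """A toy content-addressed store shared by several hypothetical apps."""

    def __init__(self):
        self.blocks = {}

    def add(self, content):
        key = hashlib.sha256(content).hexdigest()
        # Identical bytes map to the same key, so a second add stores nothing new.
        self.blocks.setdefault(key, content)
        return key

    def get(self, key):
        return self.blocks[key]


store = ContentStore()
photo = b"bytes of a holiday photo"

key_from_camera_app = store.add(photo)
key_from_chat_app = store.add(photo)

print(key_from_camera_app == key_from_chat_app)  # True
print(len(store.blocks))  # 1
```

Both apps end up with the same key for the same photo, and only one copy is stored, without the apps having to coordinate on names or paths.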

[…] what ultimately disrupts many of the major web services created in the last decade could be peer-to-peer protocols, not companies. — techcrunch

There are lots of other benefits that come along with content-addressed and distributed systems more generally, which I encourage you to read about. And despite the staying power of the ‘conventional’ HTTP-driven web, and the challenges to decentralization, there are lots of ways that decentralization can win. From promises to improve health-care data management, to circumventing government censorship, to bringing connectivity to otherwise disconnected communities, the decentralized web, with content addressing at its core, is here for the long haul.

Intrigued? Curious? Confused? Check out our other articles for some of the crazy ideas Textile is exploring/implementing to better protect your photos. And while you’re at it, why not join the Textile Photos wait-list to get an early glimpse of the innovative ideas we’re exploring to push photos and personal data into the decentralized future?