WeatherXM + Textile: DePIN Data Storage, Retrieval, & Rewards Computation

TL;DR

WeatherXM used Textile to do the following:

Bring open data availability of station/device data and rewards calculations to help progress toward fully decentralized infrastructure and bring transparency to the dataset.
Enable verifiable data such that the data is signed at the source, ensuring all data is cryptographically provable with access controls on data mutations.
Provide a decentralized framework for computation, such as Quality of Data (QoD), giving the WeatherXM ecosystem a way to build on top of the data.

You can view available “events” (data pushes) here and extract the device data at the https://basin.tableland.xyz/events/<event_cid> endpoint for further analysis, as described below.

What is it?

Textile and WeatherXM are working together to design a solution for DePIN data availability (“DA”), storage, and retrieval. If you’re unfamiliar with WeatherXM, it is a network that's powered by community-owned devices, purpose-built by WeatherXM for weather data collection. As part of the network, stations/devices throughout the globe are incentivized to share data.

WeatherXM’s goal is to make network data more accessible so developers can build on top of it. The station/device data are replicated to Basin for decentralized storage, allowing for computation over that data. The data flows through Basin, which supports a cache with a time-to-live (TTL), and is replicated to Filecoin for long-term storage guarantees.

Basin is designed to be flexible and work with raw file uploads (e.g., parquet) or even database streaming (e.g., Postgres). Data is replicated to the network in a “vault” and then can be retrieved by the vault’s CIDs for further processing.

There are two possible data availability options for a vault—hot or cold storage:

Hot layer: Cache with time-to-live (TTL), where the data is available for a defined period and is immediately retrievable (i.e., no need to wait for long propagation or finality times).
Cold layer: Persistence on Filecoin where all data is flushed for long-term retrieval.

This makes it possible to run computation over the open data where both the inputs—and, potentially, outputs—are highly available in the short term and also retrievable after cold storage flushing (e.g., dispute resolution).

How does WeatherXM use the Basin data?

WeatherXM device data is made available for data consumers through hot + cold storage replication, and this data is used today in a dashboard available at https://index.weatherxm.network. You’ll notice two tabs on that page:

Weather Data: Device data located in the Textile vault wxm.weather_data_dev
Rewards: Calculations over the device data, located in the Textile vault wxm.rewards_merkle_trees_dev

WeatherXM's dashboard displays the CIDs created when data is pushed to Textile Basin and provides a link to the Basin HTTP API.

We also built a demonstration of how developers can use the DePIN network’s data in the downstream compute and visualization pipeline. One of WeatherXM’s goals is to progressively decentralize their stack, and Basin is a key component to making that possible. We wanted to provide a way for developers to see exactly how they can build on top of the public dataset and add value to the WeatherXM ecosystem.

The demonstration shows how you can run queries over the WeatherXM data, which we’ll walk through below. You can review the Data.md file for the final visualizations that plot WeatherXM network data over historical trends and across the globe. Check it out here: https://github.com/textileio/demo-basin-wxm-query

How does Basin work?

Background

The setup process for WeatherXM involved:

Creating a vault with the Basin CLI, which is an onchain interaction signed by a private key.
Aggregating device data that is processed by internal systems for Quality of Data purposes.
Replicating internal data to the Basin DA layer on a consistent schedule—i.e., parquet files containing device data are written to the vault and signed by a private key.

Once that data is available, anyone with knowledge of the unique vault identifier or vault owner’s address can extract and process the data. It’s openly accessible on the Basin network and through the cold layer. In our demo described below, we built a program that consists of:

A Python script that fetches remote parquet files written to the vault, filtered over a time range and each identified by an IPFS CID.
A DuckDB in-memory SQL database for loading parquet files and executing queries.
A step that writes the query results to CSV and markdown files and produces data visualizations with geopandas (e.g., precipitation rate per geography).
A GitHub Actions workflow that runs the steps above weekly.
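The first step above, selecting events within a time range, could be sketched roughly as follows. This assumes each event is a dict with a `cid` and a Unix `timestamp` in seconds, matching the event listing shown later in this post. The name `filter_events` is hypothetical and is not taken from the demo repo.

```python
from typing import Optional


def filter_events(events: list[dict], start: Optional[int] = None,
                  end: Optional[int] = None) -> list[str]:
    """Return the CIDs of events whose timestamp falls within [start, end].

    Either bound may be None, meaning "unbounded" on that side.
    """
    cids = []
    for event in events:
        ts = event["timestamp"]
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        cids.append(event["cid"])
    return cids


# Sample events copied from the vault listing in this post
events = [
    {"cid": "bafybeicizws2qymmcn657b3qa7rm2w2lz4bdumxc6b66jw4ac4fsrhmsxm",
     "timestamp": 1709078400},
    {"cid": "bafybeigppam7v3uwarke5iulytgptdj6hzjta5kye7nljbvemzje6r3xma",
     "timestamp": 1709596800},
]
recent = filter_events(events, start=1709500000)
print(recent)
```

Leaving both bounds as `None` returns every CID, which mirrors running the pipeline without a time filter.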

You could imagine more complex flows with Docker images or compute networks like Bacalhau, but the idea was to keep things lightweight to showcase how to interact with open, verifiable data.

Viewing & retrieving WeatherXM data

If you want to understand the basics behind Basin, you can start by installing the CLI on your machine. Make sure the path to Go binaries is in your shell’s PATH variable (e.g., include $(go env GOPATH)/bin).

go install github.com/tablelandnetwork/basin-cli/cmd/vaults@latest

Then, you can take a peek at what the structure and data within a vault look like. For example, list all vaults created by a specific account/address:

vaults list --account 0x75c9eCb5309d068f72f545F3fbc7cCD246016fE0

Or, list out the events associated with a vault:

vaults events --vault wxm.weather_data_dev

This returns the “events” (i.e., vault mutations & CIDs) for the vault. Note that the cache_expiry field shows when the highly available data will expire from the cache.

    [
      {
        "cid": "bafybeicizws2qymmcn657b3qa7rm2w2lz4bdumxc6b66jw4ac4fsrhmsxm",
        "timestamp": 1709078400,
        "is_archived": false,
        "cache_expiry": "2024-05-16T20:56:47.379695"
      },
      {
        "cid": "bafybeigppam7v3uwarke5iulytgptdj6hzjta5kye7nljbvemzje6r3xma",
        "timestamp": 1709596800,
        "is_archived": false,
        "cache_expiry": "2024-05-16T19:49:50.277204"
      }
    ]

This period is defined upon vault creation such that all files written to the vault follow the same expiry logic. Note the data is flushed to cold storage and still available after the cache expiration. The CIDs can then be retrieved and processed locally:

vaults retrieve --output <filename> <event_cid>
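To decide whether an event is still served from the hot cache, the cache_expiry field from the listing above can be compared against the current time. The sketch below is illustrative only: `is_cached` is a hypothetical helper, and it assumes the expiry timestamps are in UTC, since the sample values carry no offset.

```python
from datetime import datetime, timezone


def is_cached(event: dict, now: datetime) -> bool:
    """Return True if the event's hot-cache copy has not yet expired."""
    # Assumption: cache_expiry values are UTC (the samples have no offset).
    expiry = datetime.fromisoformat(event["cache_expiry"]).replace(
        tzinfo=timezone.utc
    )
    return now < expiry


event = {
    "cid": "bafybeicizws2qymmcn657b3qa7rm2w2lz4bdumxc6b66jw4ac4fsrhmsxm",
    "timestamp": 1709078400,
    "is_archived": False,
    "cache_expiry": "2024-05-16T20:56:47.379695",
}

before = datetime(2024, 5, 1, tzinfo=timezone.utc)
after = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_cached(event, before))  # True
print(is_cached(event, after))   # False
```

In practice you would pass `datetime.now(timezone.utc)` as `now`; taking it as a parameter keeps the check easy to test.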

Demo: How to use WeatherXM data

Now that we understand how Basin works, we can review some of the demo’s features since it builds on top of the WeatherXM vault data. The first step is to get all of the vaults for a specific address and then retrieve all of the events and CIDs for the vaults. This can be done programmatically with the Basin HTTP API (where {variable} would be replaced by actual string values—e.g., account=0x75c9eCb5309d068f72f545F3fbc7cCD246016fE0):

Get vaults for address: curl "https://basin.tableland.xyz/vaults?account={address}"
Get events for vault: curl https://basin.tableland.xyz/vaults/{vault}/events
Download the raw data at an event: curl https://basin.tableland.xyz/events/{event_cid} -o data.parquet
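The same three requests can be made from Python with only the standard library. This is a minimal sketch rather than the demo's actual client: the helper names are hypothetical, and the JSON shape of the responses is assumed to match the CLI output shown earlier.

```python
import json
import shutil
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://basin.tableland.xyz"


def vaults_url(account: str) -> str:
    """URL that lists the vaults owned by an account/address."""
    return f"{BASE_URL}/vaults?{urlencode({'account': account})}"


def events_url(vault: str) -> str:
    """URL that lists the events (mutations and CIDs) for a vault."""
    return f"{BASE_URL}/vaults/{quote(vault, safe='')}/events"


def event_data_url(cid: str) -> str:
    """URL that serves the raw file written at an event."""
    return f"{BASE_URL}/events/{quote(cid, safe='')}"


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the JSON body (response shape is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def download_event(cid: str, dest: str, timeout: float = 60.0) -> None:
    """Stream the raw event data (e.g., a parquet file) to a local path."""
    with urlopen(event_data_url(cid), timeout=timeout) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)


# Building the URLs makes no network calls
print(vaults_url("0x75c9eCb5309d068f72f545F3fbc7cCD246016fE0"))
print(events_url("wxm.weather_data_dev"))
```

A typical flow would call `fetch_json(events_url("wxm.weather_data_dev"))` and then `download_event(event["cid"], ...)` for each returned event.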

Let’s presume the files have been downloaded at each event for WeatherXM’s vault into some data_dir directory. We then load these parquet files into a DuckDB in-memory instance.

    from pathlib import Path

    from duckdb import connect


    def create_database(data_dir):
        db = connect()
        files = Path(data_dir) / "*.parquet"
        # Read all parquet files in data directory
        db.execute(f"CREATE VIEW xm_data AS SELECT * FROM read_parquet('{files}');")
        return db

We can now run queries over the WeatherXM device data! This will calculate average or aggregate metrics for temperature, precipitation, wind, etc., and create a DataFrame for subsequent processing.

    def execute_queries(db, start, end):
        # Set up columns for average calculations
        columns = [
            "temperature",
            "humidity",
            "precipitation_accumulated",
            "wind_speed",
            "wind_gust",
            "wind_direction",
            "illuminance",
            "solar_irradiance",
            "fo_uv",
            "uv_index",
            "precipitation_rate",
            "pressure",
        ]
        # Set up all query parts
        avg_selection = [f"avg({col}) AS {col}" for col in columns]
        avg_calculations = ",".join(avg_selection)
        query_parts = [
            "SELECT min(timestamp) AS range_start, max(timestamp) AS range_end,",
            "COUNT(DISTINCT device_id) AS number_of_devices,",
            "mode(cell_id) AS cell_id_mode,",
            "sum(precipitation_accumulated) AS total_precipitation,",
            avg_calculations,
        ]
        # Add WHERE clause for time filtering, if applicable
        where_clause = ""
        if start is not None and end is not None:
            where_clause = f"WHERE timestamp >= {start} AND timestamp <= {end}"
        elif start is not None:
            where_clause = f"WHERE timestamp >= {start}"
        elif end is not None:
            where_clause = f"WHERE timestamp <= {end}"
        # Combine all parts into one query and execute
        query = " ".join(query_parts) + " FROM xm_data" + f" {where_clause}"
        try:
            result = db.execute(query).pl()  # Create a polars DataFrame
            return result
        except Exception as e:
            print(f"Error executing DuckDB aggregate queries: {e}")
            raise
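The time filter above interpolates start and end directly into the SQL string, which is fine when both are trusted values. If they come from user input, one alternative is to build the clause with `?` placeholders and pass the values separately to DuckDB's `execute(query, parameters)`. The sketch below is an assumption about how that could look, not the demo's code; `build_time_filter` is a hypothetical helper.

```python
def build_time_filter(start=None, end=None, prefix="WHERE"):
    """Build a time-range clause with ? placeholders plus its parameter list.

    Use prefix="AND" to append to a query that already has a WHERE clause.
    """
    conditions, params = [], []
    if start is not None:
        conditions.append("timestamp >= ?")
        params.append(start)
    if end is not None:
        conditions.append("timestamp <= ?")
        params.append(end)
    if not conditions:
        return "", params
    return f"{prefix} " + " AND ".join(conditions), params


clause, params = build_time_filter(1709078400, 1709596800)
query = f"SELECT count(*) FROM xm_data {clause}"
print(query)
print(params)
# With a DuckDB connection `db`, this would run as: db.execute(query, params)
```

The values never become part of the SQL text, so the same query string works for any time range.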

Lastly, we can use geopandas to create visualizations from the returned DataFrame. In the example, there are a few rough bounding boxes (“bbox”) for each major continent with lat/long bounds:

    bboxes = {
        "north_america": (14, 72, -172, -52),
        "south_america": (-55, 12, -85, -34),
        "europe": (35, 72, -13, 60),
        "africa": (-35, 38, -18, 55),
        "asia": (-11, 81, 25, 179),
        "australia": (-48, -6, 108, 178),
    }

For example, this calculates the total precipitation accumulated by cell_id within a bounding box:

    def query_bbox(db, bbox, start, end):
        # DuckDB query to select and aggregate data within the bounding box
        query = f"""
        SELECT
            cell_id,
            SUM(precipitation_accumulated) as total_precipitation,
            AVG(lat) as lat,
            AVG(lon) as lon
        FROM xm_data
        WHERE
            lat BETWEEN {bbox[0]} AND {bbox[1]}
            AND lon BETWEEN {bbox[2]} AND {bbox[3]}
        """
        # Add WHERE clause for time filtering, if applicable
        where_clause = ""
        if start is not None and end is not None:
            where_clause = f"AND timestamp >= {start} AND timestamp <= {end}"
        elif start is not None:
            where_clause = f"AND timestamp >= {start}"
        elif end is not None:
            where_clause = f"AND timestamp <= {end}"
        # Combine all parts into one query and execute
        query = query + f" {where_clause}" + " GROUP BY cell_id"
        try:
            result = db.execute(query).df()  # Create a pandas Dataframe
            return result
        except Exception as e:
            print(f"Error executing DuckDB bbox query: {e}")
            raise
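As the query shows, each bounding box tuple is read as (min_lat, max_lat, min_lon, max_lon). The same inclusive test can be written in plain Python, which is handy for sanity-checking a coordinate before running the SQL. `find_bbox` below is a hypothetical helper for illustration. Because the rough boxes overlap (for example, Europe and Asia), it simply returns the first match in dictionary order.

```python
bboxes = {
    "north_america": (14, 72, -172, -52),
    "south_america": (-55, 12, -85, -34),
    "europe": (35, 72, -13, 60),
    "africa": (-35, 38, -18, 55),
    "asia": (-11, 81, 25, 179),
    "australia": (-48, -6, 108, 178),
}


def find_bbox(lat, lon, boxes=bboxes):
    """Return the name of the first bounding box containing (lat, lon).

    Bounds are inclusive, mirroring SQL's BETWEEN. Returns None if no
    box contains the point.
    """
    for name, (min_lat, max_lat, min_lon, max_lon) in boxes.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


print(find_bbox(39.74, -104.99))  # north_america
print(find_bbox(37.98, 23.73))    # europe
print(find_bbox(0.0, -150.0))     # None
```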

By taking the pandas DataFrame and passing it to geopandas’ GeoDataFrame constructor, you can create maps that plot data at lat/long coordinates, such as the one described above for the North American bbox. The history CSV file also tracks information week-over-week. Below is what the device trends look like (hint: the number of devices is increasing!):

A simple demo of computing the number of devices and precipitation plot over the WeatherXM vault.

How can someone use it?

If you’d like to develop with or alter the demo code, start by cloning the repo and going through the setup steps described in the README.

git clone https://github.com/textileio/demo-basin-wxm-query

Then, you can use the make run command to fetch events and compute queries over the data for the WeatherXM vault. You can also choose to pass start and end parameters to the make run command, which specifies timestamp ranges for data extraction.

If you want to use the WeatherXM data, the Basin HTTP APIs described in the sections above should provide all that you need. Namely, you can fetch the events and download the parquet files locally, or even load remote data into DuckDB from an IPFS gateway!
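For the remote-loading option, DuckDB's `read_parquet` accepts a list of files, and with the httpfs extension loaded those entries can be HTTP(S) URLs. The sketch below only builds the SQL statement. `remote_view_sql` is a hypothetical helper, and whether a given gateway or endpoint serves files in a way DuckDB can read efficiently is an assumption you would need to verify.

```python
def remote_view_sql(urls, view="xm_data"):
    """Build a CREATE VIEW statement that reads parquet files from URLs."""
    if not urls:
        raise ValueError("at least one URL is required")
    # Escape single quotes so each URL is a valid SQL string literal
    quoted = ", ".join("'" + url.replace("'", "''") + "'" for url in urls)
    return f"CREATE VIEW {view} AS SELECT * FROM read_parquet([{quoted}]);"


cids = [
    "bafybeicizws2qymmcn657b3qa7rm2w2lz4bdumxc6b66jw4ac4fsrhmsxm",
    "bafybeigppam7v3uwarke5iulytgptdj6hzjta5kye7nljbvemzje6r3xma",
]
# The Basin events endpoint is used here; an IPFS gateway URL could be
# substituted in the same way.
urls = [f"https://basin.tableland.xyz/events/{cid}" for cid in cids]
sql = remote_view_sql(urls)
print(sql)
# With a DuckDB connection `db`, this might be run as:
#   db.execute("INSTALL httpfs; LOAD httpfs;")
#   db.execute(sql)
```

This avoids a local download step, at the cost of fetching data over the network each time the view is queried.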

Next steps

This is just the first post in a series of Basin data demonstrations. We’ll continue to put out additional walkthroughs in long-form posts, but if you’d like to stay up to date with what we’re building on a week-by-week basis, check out the Weeknotes newsletter on Substack!

Follow us on Substack: https://tableland.substack.com/