Weeknotes: Synthetic data generation, reviewing Data Availability proof approaches, & DePIN corner with Streamr

Learn how to generate synthetic data for Basin vaults, discover Data Availability layer approaches to proofs and storage, and a DePIN overview about Streamr.

Begin transmission…

Generating synthetic data from Basin

by Dan Buchholz

At the ETHDenver Proof of Data summit, Andrew gave a talk and wrote an article about the future of data and how synthesized data will take over the internet. One of the ways to show how web3-native, open data can be used in this realm is with any of the data stored on Basin!

Demonstrating a synthetic data approach is a three-step process:

  • Extract data from a Basin vault.

  • Use an AI/ML library to generate synthetic data.

  • Benchmark the results.

There are tons of ways you can create synthetic data; there's even an Awesome Synthetic Data repo on GitHub that outlines quite a few resources.

Check out the demo code here: textileio/demo-basin-synthetic-data

The end result will give you synthetic data that closely mirrors the real data—in this demo, the synthetic data has ~97% similarity to the source/real data.

Setup with SDV

The Synthetic Data Vault ("SDV") library makes it easy to generate exactly what we need—and note that it has no relation to Basin's "vault" term…it's purely a coincidence! SDV offers several ways to use its models and generate tabular data with classic statistical methods or deep learning. Let's review how to implement this with their "classical" Gaussian copula. You can follow along with their Jupyter notebook for a step-by-step example.
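For intuition about what a Gaussian copula synthesizer is doing under the hood, here's a numpy-only sketch of the core idea. This is a deliberate simplification—it estimates plain linear correlation rather than the rank correlations SDV works with—and is not SDV's implementation:

```python
# Toy Gaussian copula sampler: NOT SDV's implementation, just the core idea.
import math

import numpy as np


def gaussian_copula_sample(real: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Sample n synthetic rows whose marginals and correlation mimic `real`."""
    rng = np.random.default_rng(seed)
    # 1. Estimate the column correlation structure (simplified: linear corr).
    corr = np.corrcoef(real, rowvar=False)
    # 2. Draw correlated standard normals with that correlation.
    z = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n)
    # 3. Map normals to uniforms in (0, 1) via the standard normal CDF.
    u = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    # 4. Map uniforms back through each column's empirical quantiles so the
    #    synthetic marginals match the real ones.
    return np.column_stack(
        [np.quantile(real[:, j], u[:, j]) for j in range(real.shape[1])]
    )
```

Because step 4 draws from the empirical quantiles, every synthetic value stays inside the observed range of its column, while step 2 preserves how the columns move together.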

The first thing we want to do is get our source data ready. We'll use WeatherXM's wxm.weather_data_dev vault, extract data from an event, and then create a smaller sample of it for the sake of the demo. The DuckDB query below will take the downloaded parquet file, filter it so that only 10 records per device ID exist in the dataset, and also limit the sample to 10000 rows. Without this, it's possible all of the rows have the same device ID, which would lead to erroneous results.

```sh
curl 'https://basin.tableland.xyz/events/bafybeid64vmetvduzsvmvisbn7ldgwruva6x7wm2pg2ty4jzqq4c4vmrm4' -o data.parquet
duckdb
```

```sql
COPY (
  SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY timestamp) AS rn
    FROM read_parquet('data.parquet')
  ) subquery
  WHERE rn <= 10
  LIMIT 10000
) TO 'data.csv' (FORMAT 'csv');
```

Let's start by importing a few different libraries that include pandas, sdmetrics, and sdv. Note that the kaleido and tabulate packages are also needed under the hood, along with graphviz, in order to generate plots.

```python
from pathlib import Path

from pandas import DataFrame, read_csv
from sdmetrics.reports.single_table import DiagnosticReport, QualityReport
from sdv.evaluation.single_table import (
    evaluate_quality,
    get_column_plot,
    run_diagnostic,
)
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
```

Then, create two simple methods to set up the SDV data/metadata and generate the synthetic data:

```python
def get_source_data(path: Path) -> tuple[DataFrame, SingleTableMetadata]:
    real_data = read_csv(path)
    # remove any rows with null column values
    real_data = real_data.dropna()
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    return real_data, metadata


def generate_synthetic_data(
    real_data: DataFrame, metadata: SingleTableMetadata
) -> DataFrame:
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)
    synthetic_data = synthesizer.sample(len(real_data))
    return synthetic_data
```

You can take the result from generate_synthetic_data, write it to files, and use it downstream. One other important step is to iteratively adjust how you're producing the synthetic data and make sure the outputs have a strong correlation to the inputs.

SDV comes with a diagnostic and quality metric method that streamlines this process:

```python
def analyze_data(
    real_data: DataFrame, synthetic_data: DataFrame, metadata: SingleTableMetadata
) -> tuple[DiagnosticReport, QualityReport]:
    diagnostic = run_diagnostic(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
        verbose=False,
    )
    quality_report = evaluate_quality(
        real_data, synthetic_data, metadata, verbose=False
    )
    return diagnostic, quality_report
```
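For a rough sense of what the quality report is scoring, here's a hand-rolled per-column comparison. This is a toy stand-in for SDV's actual metrics (which use KS statistics and pairwise correlations), and the helper name `column_similarity` is my own:

```python
from pandas import DataFrame


def column_similarity(real: DataFrame, synthetic: DataFrame) -> dict[str, float]:
    """Rough 0..1 score per numeric column: how closely the synthetic
    column's mean and standard deviation track the real column's."""
    scores: dict[str, float] = {}
    for col in real.select_dtypes("number").columns:
        r, s = real[col], synthetic[col]
        spread = float(r.std()) or 1.0  # avoid dividing by zero
        mean_score = max(0.0, 1 - abs(r.mean() - s.mean()) / spread)
        std_score = max(0.0, 1 - abs(r.std() - s.std()) / spread)
        scores[col] = (mean_score + std_score) / 2
    return scores
```

Scores near 1.0 mean the synthetic marginals track the real ones; SDV's reports do this far more rigorously, including cross-column relationships.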

And that's a wrap! One of the key points here is that AI/ML data generation on top of web3-native, open data can be a major component in the future of data. This walkthrough is a rudimentary example of how you can build on open data—or even your own datasets on Basin!

The rest of the demo code shows how to generate plots and write the results to output files, so check it out if you’re interested: textileio/demo-basin-synthetic-data

DePIN Corner: Streamr

by Marla Natoli

Streamr enables decentralized data broadcasting via secure P2P distribution: streaming of any live media or real-time data at scale (think pub/sub, but decentralized). It recently showcased some exciting POCs with Dimo and IoTex that use zero-knowledge-powered computation over streaming data, demonstrating the ability to power on-chain driven experiences while preserving data privacy. These multi-party POCs are super exciting, as they help us move closer to modular infrastructure that works seamlessly together, which is imperative to driving the adoption of decentralized infrastructure. Our decentralized object storage solution (codenamed Basin) complements Streamr's data streaming solution well, and we're looking forward to experimenting with some potential use cases in the weeks and months to come.

Learn more about Streamr, IoTex, and Dimo’s joint POC here, and check out the available streams from Streamr here. And of course, if you’re interested in learning more about our decentralized storage solution and how it can help DePINs or other organizations looking to decentralize their database infrastructure, feel free to set up some time with us here or join our Discord and get in touch.

Reviewing Data Availability layers & different approaches

by Dan Buchholz

Data Availability ("DA") is a hot topic now that EIP-4844 is live! There are a few major players in the space, and all of them take similar approaches but with a few differences. We'll take a look at EigenDA, Celestia, Avail, and Arbitrum AnyTrust. But first, let's start off with a bit of background information.

DA layers ensure that block data is provably published so that applications and rollups can know what the state of the chain is—but once the data is published, DA layers do not guarantee that historical data will be permanently stored and remain retrievable. DAs either deploy a validity proof or a fraud/fault proof (validity proofs are more common). Data availability sampling ("DAS") by light clients is also a common feature to ensure data can be recovered. However, the term DA typically refers to simple blocks/transaction data, so it differs from large, arbitrary, long-term data availability and storage.

The following section outlines common terms across DA protocols. Skip if familiar.

To clarify, here's a quick recap of what DAs implement:

  • Validity proofs: ensure that all data and transactions are valid before they are included onchain via zk-SNARKs/STARKs.

    • Computationally intensive but provides strong security guarantees.

  • Fraud/fault proofs: allow data to be posted onchain before it is guaranteed valid—and use a challenge period for tx dispute resolution.

    • Less computationally intensive but lower security guarantees (i.e., requires the network to actively generate fraud proofs).

  • KZG commitment scheme: data redundancy via erasure encoding—and correctness thereof without needing a fraud proof.

    • E.g., full nodes can prove transaction inclusion to light nodes using a succinct proof.

  • Erasure encoding: reduce per-node storage requirements by splitting data up across many nodes while ensuring the original data can be recovered if lost.

    • This involves decreasing an individual node’s storage requirement by increasing the total size of a piece of data (splitting into blocks & adding additional redundancy/erasure encoded blocks).

    • Then, distribute the blocks across many nodes. If you need the original data, it should be recoverable by piecing blocks back together from the network—assuming some defined tolerance threshold is held.

  • Data availability sampling: ensure data availability without requiring nodes to hold the entire dataset; complements erasure encoding to help guarantee data is available.

    • I.e., randomly sampled pieces of erasure-coded block data to assure the entire block is available in the network for reconstruction—else, slash nodes.

  • Data availability committee: a trusted set of nodes—or validators in a DA PoS network—that store full copies of data and publish onchain attestations to prove ownership.
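To make the erasure-encoding and sampling ideas above concrete, here's a toy scheme in Python. It uses a single XOR parity chunk, so it tolerates exactly one lost chunk—unlike the Reed-Solomon codes real DA layers use, which tolerate many—plus the standard sampling probability math:

```python
def encode(data: bytes, k: int) -> list[bytes]:
    """Split `data` into k equal chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, c))
    return chunks + [parity]


def recover(chunks: list) -> list:
    """Rebuild a single missing chunk (marked None) by XOR-ing all the others."""
    missing = chunks.index(None)
    present = [c for c in chunks if c is not None]
    rebuilt = present[0]
    for c in present[1:]:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, c))
    chunks[missing] = rebuilt
    return chunks


def detection_probability(withheld_fraction: float, samples: int) -> float:
    """Chance that random sampling catches a node withholding that fraction."""
    return 1 - (1 - withheld_fraction) ** samples
```

With data split four ways, losing any one chunk is recoverable, and a light client taking 20 random samples catches a node withholding half the chunks with probability 1 - 0.5**20, i.e., near certainty.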

Layer 2s & DA approaches

The following section outlines common L2s and how they approach DA.

According to Avail, there are a few different approaches that L2s take to DA. Note this is in the sense of block/transaction DA and differs from the “arbitrary” / large DA approach Textile focuses on with Basin and object storage:

ZK & optimistic rollups

  • Post proofs (validity or fraud) onchain along with state commitments.

Plasma: Rollup + offchain DA

  • All data and computation, except for deposits, withdrawals, and Merkle roots, are kept offchain.

Optimiums: Optimistic rollups + offchain DA (subclass of Plasma)

  • Adaptation of Optimistic rollups that also take data availability offchain while using fraud proofs for verification.

    • I.e., differs from traditional rollups in that transaction data is entirely in offchain storage.

  • E.g., Optimism offers a “plasma mode” where data is uploaded to the DA storage layer via plain HTTP calls.

Validiums: ZK rollups + offchain DA (subclass of Plasma)

  • Adaptation of ZK rollups that shift data availability offchain while continuing to use validity proofs.

  • E.g., Starknet posts a STARK validity proof and also sends a state diff, which represents the changes in the L2 state since the last validity proof was sent (updates/modifications made to the network's state).
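The state-diff mechanism is easy to picture with plain dictionaries; the keys below are made up for illustration and have nothing to do with Starknet's actual storage layout:

```python
def state_diff(old: dict, new: dict) -> dict:
    """Keys whose values changed or were added since the last snapshot.
    (Simplified: deletions are ignored.)"""
    return {k: v for k, v in new.items() if old.get(k) != v}


def apply_diff(state: dict, diff: dict) -> dict:
    """Replay a posted diff on top of a known prior state."""
    return {**state, **diff}
```

Only the diff gets posted alongside the validity proof; anyone holding the prior state can replay diffs to reconstruct the current L2 state.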

Volitions: ZK rollups + <X> DA (i.e., the user picks DA location)

  • Dual-mode operation for either onchain or offchain DA:

    • Opt for zk rollup mode with zk proofs to ensure the integrity and validity of transactions where transaction data is stored onchain.

    • Opt for Validium mode, which stores transaction data offchain, enhancing scalability and throughput while maintaining robust validity proofs.

Sovereign Rollups: Independent rollups

  • Maintain autonomy over security and data availability models; protocol decides if data availability is either onchain or offchain. There’s no standard approach here.

DA solutions overview

The following section outlines common DA layers and how they implement their solution.


EigenDA

While other systems such as Celestia and Danksharding (planned) also make use of Reed Solomon encoding, they do so only for the purpose of supporting certain observability properties of Data Availability Sampling (DAS) by light nodes. On the other hand, all incentivized/full nodes of the system download, store, and serve the full system bandwidth. Source

  • EigenDA produces a DA attestation that asserts that a given blob or collection of blobs is available.

    • Has both a liveness & a safety threshold.

    • Enables consensus about whether a given blob of data is fully within the custody of a set of honest nodes.

    • Attestations are anchored to one or more "Quorums," each of which defines a set of EigenLayer stakers that underwrite the security of the attestation.

  • Architecture (see here):

    • Operator: EigenDA full nodes—the service providers of EigenDA—store chunks of blob data for a predefined period and serve these chunks upon request.

    • Disperser: (currently centralized) is responsible for encoding blobs, distributing them to the DA nodes, and aggregating their digital signatures into a DA attestation.

    • Retriever: a service that queries EigenDA operators for blob chunks, verifies that blob chunks are accurate, and reconstructs the original blob for the user.

  • Prunes data after some predefined period.

  • For reference—EigenLayer vs. EigenDA:

    • EigenLayer serves as a platform connecting stakers and infrastructure developers where stakers have the option to restake their stake and contribute to the security of other infrastructures while earning native ETH rewards (i.e., pool security vs. fragmenting it).

    • AVS stands for an "actively validated service"—software that uses its own set of nodes that are doing something that requires verification/validation (e.g., consensus, DA, TEE, etc.). The core EigenLayer protocol lets people take their ETH, restake it (liquid staking tokens), and then take these tokens and stake to EigenLayer nodes. An EigenLayer operator node can run arbitrary software—i.e., developers can create an AVS of their choice, such as a database, load balancer, log service, etc.

    • An example of an AVS implementation is EigenDA—i.e., EigenLayer is dogfooding its own base protocol.


Celestia

While performing DAS for a block header, every light node queries Celestia nodes for a number of random data shares from the extended matrix and the corresponding Merkle proofs. If all the queries are successful, then the light node accepts the block header as valid (from a DA perspective).

  • Assumptions:

    • There is a minimum number of light nodes (dependent on the block size) conducting data availability sampling for a given block.

    • Light nodes assume they are connected to at least one honest full node and can receive fraud proofs for incorrectly erasure-coded blocks.

  • Prunes data after a 30-day window.

    • They recommend that L2 rollups using Celestia implement their own long-term storage plan: here
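The Merkle-proof check that light nodes run during DAS can be sketched in a few lines. This is a plain binary SHA-256 tree, not Celestia's namespaced Merkle tree, but the verification shape is the same: a sampled share plus a short sibling path must hash up to the root in the block header.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the bool says the sibling is on the right."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    acc = h(leaf)
    for sibling, is_right in proof:
        acc = h(acc + sibling) if is_right else h(sibling + acc)
    return acc == root
```

A light node repeats this check for each randomly sampled share; any failed or unanswered query is evidence the block data is not fully available.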


Avail

  • Validity proofs + DAS

  • Has three node types:

    • Full Nodes: download and verify the correctness of blocks but do not partake in the consensus process.

    • Validator Nodes: responsible for generating blocks, deciding on transaction inclusion, and maintaining the order; incentivized through consensus participation

    • Light Clients: (DAS) query full nodes to check KZG polynomial openings against the commitments in the block header for each sampled cell

  • Full nodes prune data after some cutoff period.

Arbitrum AnyTrust (variant of Arbitrum Nitro)

  • Data Availability Committees that post Data Availability Certificates (DACert)

    • DACert is a proof that the block's data will be available from at least one honest Committee member until an expiration time.

    • AnyTrust sounds like a volition—the sequencer either posts data blocks on the L1 chain as calldata/blobs, or it posts DACerts.

  • DAC members run DA servers with two endpoints:

    • Sequencer API to submit data blocks for storage.

    • REST API that allows data blocks to be fetched by hash.
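The second endpoint is plain content addressing: the identifier for a block of data is its hash. A toy in-memory version (nothing Arbitrum-specific; the class name is made up) looks like:

```python
import hashlib


class DataStore:
    """Toy content-addressed store: submit returns the hash; fetch looks it up."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def submit(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blocks[digest] = data
        return digest

    def fetch(self, digest: str) -> bytes:
        return self._blocks[digest]
```

A client holding a DACert can ask any committee member's REST API for the data by hash and verify whatever comes back simply by re-hashing it.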

Resources

  1. DA protocol deep dives: https://chronicle.castlecapital.vc/p/deepdive-data-availability

  2. Avail DA comparison: https://blog.availproject.org/a-guide-to-selecting-the-right-data-availability-layer/

  3. Avail DA overview: https://docs.availproject.org/docs/the-avail-trinity/avail-da

  4. Avail RFP for storage: https://github.com/availproject/avail-uncharted/blob/main/grants/RFPs/RFP-003.md

  5. EigenLayer design: https://www.blog.eigenlayer.xyz/ycie/

  6. Ethereum DA basics: https://ethereum.org/en/developers/docs/data-availability

  7. Starknet DA overview: https://book.starknet.io/ch03-01-03-data-availability.html#recreating-starknets-state

  8. Optimism DA overview: https://specs.optimism.io/experimental/plasma.html#da-storage

  9. Vitalik’s DA & Plasma overview: https://vitalik.eth.limo/general/2023/11/14/neoplasma.html

  10. Arbitrum AnyTrust DA overview: https://docs.arbitrum.io/inside-anytrust#data-availability-certificates

  11. Celestia DA & storage: https://docs.celestia.org/learn/retrievability

  12. Fraud & DA proofs via Celestia: https://arxiv.org/abs/1809.09044

End transmission…

Want to dive deeper, ask questions, or just nerd out with us? Jump into our Telegram or Discord—including weekly research office hours or developer office hours. And if you’d like to discuss any of these topics in more detail, comment on the issue over in GitHub!

Are you enjoying Weeknotes? We’d love your feedback—if you fill out a quick survey, we’ll be sure to reach out directly with community initiatives in the future! Fill out the form here
