feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
mod encode_reordered;
|
2023-07-31 03:49:55 +00:00
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
use crate::op::OpWithId;
|
|
|
|
use crate::LoroDoc;
|
2023-07-31 03:49:55 +00:00
|
|
|
use crate::{oplog::OpLog, LoroError, VersionVector};
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
use loro_common::{HasCounter, IdSpan, LoroResult};
|
|
|
|
use num_derive::{FromPrimitive, ToPrimitive};
|
|
|
|
use num_traits::{FromPrimitive, ToPrimitive};
|
|
|
|
use rle::{HasLength, Sliceable};
|
|
|
|
const MAGIC_BYTES: [u8; 4] = *b"loro";
|
2023-07-31 03:49:55 +00:00
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
#[derive(Clone, Copy, Debug, PartialEq, Eq, FromPrimitive, ToPrimitive)]
|
2023-08-30 04:27:23 +00:00
|
|
|
pub(crate) enum EncodeMode {
|
|
|
|
// This is a config option, it won't be used in encoding.
|
|
|
|
Auto = 255,
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
Rle = 1,
|
|
|
|
Snapshot = 2,
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
impl EncodeMode {
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
pub fn to_bytes(self) -> [u8; 2] {
|
|
|
|
let value = self.to_u16().unwrap();
|
|
|
|
value.to_be_bytes()
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn is_snapshot(self) -> bool {
|
|
|
|
matches!(self, EncodeMode::Snapshot)
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
impl TryFrom<[u8; 2]> for EncodeMode {
|
2023-08-30 08:41:04 +00:00
|
|
|
type Error = LoroError;
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
fn try_from(value: [u8; 2]) -> Result<Self, Self::Error> {
|
|
|
|
let value = u16::from_be_bytes(value);
|
|
|
|
Self::from_u16(value).ok_or(LoroError::IncompatibleFutureEncodingError(value as usize))
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
/// The encoder used to encode the container states.
|
|
|
|
///
|
|
|
|
/// Each container state can be represented by a sequence of operations.
|
|
|
|
/// For example, a list state can be represented by a sequence of insert
|
|
|
|
/// operations that form its current state.
|
|
|
|
/// We ignore the delete operations.
|
|
|
|
///
|
|
|
|
/// We will use a new encoder for each container state.
|
|
|
|
/// Each container state should call encode_op multiple times until all the
|
|
|
|
/// operations constituting its current state are encoded.
|
|
|
|
pub(crate) struct StateSnapshotEncoder<'a> {
|
|
|
|
/// The `check_idspan` function is used to check if the id span is valid.
|
|
|
|
/// If the id span is invalid, the function should return an error that
|
|
|
|
/// contains the missing id span.
|
|
|
|
check_idspan: &'a dyn Fn(IdSpan) -> Result<(), IdSpan>,
|
|
|
|
/// The `encoder_by_op` function is used to encode an operation.
|
|
|
|
encoder_by_op: &'a mut dyn FnMut(OpWithId),
|
|
|
|
/// The `record_idspan` function is used to record the id span to track the
|
|
|
|
/// encoded order.
|
|
|
|
record_idspan: &'a mut dyn FnMut(IdSpan),
|
|
|
|
#[allow(unused)]
|
|
|
|
mode: EncodeMode,
|
|
|
|
}
|
|
|
|
|
|
|
|
impl StateSnapshotEncoder<'_> {
|
|
|
|
pub fn encode_op(&mut self, id_span: IdSpan, get_op: impl FnOnce() -> OpWithId) {
|
|
|
|
if let Err(span) = (self.check_idspan)(id_span) {
|
|
|
|
let mut op = get_op();
|
|
|
|
if span == id_span {
|
|
|
|
(self.encoder_by_op)(op);
|
2023-07-31 03:49:55 +00:00
|
|
|
} else {
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
debug_assert_eq!(span.ctr_start(), id_span.ctr_start());
|
|
|
|
op.op = op.op.slice(span.atom_len(), op.op.atom_len());
|
|
|
|
(self.encoder_by_op)(op);
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
}
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
|
|
|
|
(self.record_idspan)(id_span);
|
|
|
|
}
|
|
|
|
|
|
|
|
#[allow(unused)]
|
|
|
|
pub fn mode(&self) -> EncodeMode {
|
|
|
|
self.mode
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) struct StateSnapshotDecodeContext<'a> {
|
|
|
|
pub oplog: &'a OpLog,
|
|
|
|
pub ops: &'a mut dyn Iterator<Item = OpWithId>,
|
|
|
|
pub blob: &'a [u8],
|
|
|
|
pub mode: EncodeMode,
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) fn encode_oplog(oplog: &OpLog, vv: &VersionVector, mode: EncodeMode) -> Vec<u8> {
|
|
|
|
let mode = match mode {
|
|
|
|
EncodeMode::Auto => EncodeMode::Rle,
|
2023-07-31 03:49:55 +00:00
|
|
|
mode => mode,
|
|
|
|
};
|
2023-10-30 03:13:52 +00:00
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
let body = match &mode {
|
|
|
|
EncodeMode::Rle => encode_reordered::encode_updates(oplog, vv),
|
2023-07-31 03:49:55 +00:00
|
|
|
_ => unreachable!(),
|
|
|
|
};
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
|
|
|
|
encode_header_and_body(mode, body)
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) fn decode_oplog(
|
|
|
|
oplog: &mut OpLog,
|
|
|
|
parsed: ParsedHeaderAndBody,
|
|
|
|
) -> Result<(), LoroError> {
|
|
|
|
let ParsedHeaderAndBody { mode, body, .. } = parsed;
|
|
|
|
match mode {
|
|
|
|
EncodeMode::Rle | EncodeMode::Snapshot => encode_reordered::decode_updates(oplog, body),
|
|
|
|
EncodeMode::Auto => unreachable!(),
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) struct ParsedHeaderAndBody<'a> {
|
|
|
|
pub checksum: [u8; 16],
|
|
|
|
pub checksum_body: &'a [u8],
|
|
|
|
pub mode: EncodeMode,
|
|
|
|
pub body: &'a [u8],
|
|
|
|
}
|
|
|
|
|
|
|
|
impl ParsedHeaderAndBody<'_> {
|
|
|
|
/// Return if the checksum is correct.
|
|
|
|
fn check_checksum(&self) -> LoroResult<()> {
|
|
|
|
if md5::compute(self.checksum_body).0 != self.checksum {
|
|
|
|
return Err(LoroError::DecodeDataCorruptionError);
|
|
|
|
}
|
|
|
|
|
|
|
|
Ok(())
|
|
|
|
}
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
const MIN_HEADER_SIZE: usize = 22;
|
|
|
|
pub(crate) fn parse_header_and_body(bytes: &[u8]) -> Result<ParsedHeaderAndBody, LoroError> {
|
|
|
|
let reader = &bytes;
|
|
|
|
if bytes.len() < MIN_HEADER_SIZE {
|
|
|
|
return Err(LoroError::DecodeError("Invalid import data".into()));
|
2023-08-30 08:41:04 +00:00
|
|
|
}
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
let (magic_bytes, reader) = reader.split_at(4);
|
2023-07-31 03:49:55 +00:00
|
|
|
let magic_bytes: [u8; 4] = magic_bytes.try_into().unwrap();
|
|
|
|
if magic_bytes != MAGIC_BYTES {
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
return Err(LoroError::DecodeError("Invalid magic bytes".into()));
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
let (checksum, reader) = reader.split_at(16);
|
|
|
|
let checksum_body = reader;
|
|
|
|
let (mode_bytes, reader) = reader.split_at(2);
|
|
|
|
let mode: EncodeMode = [mode_bytes[0], mode_bytes[1]].try_into()?;
|
|
|
|
|
|
|
|
let ans = ParsedHeaderAndBody {
|
|
|
|
mode,
|
|
|
|
checksum_body,
|
|
|
|
checksum: checksum.try_into().unwrap(),
|
|
|
|
body: reader,
|
|
|
|
};
|
|
|
|
|
|
|
|
ans.check_checksum()?;
|
|
|
|
Ok(ans)
|
|
|
|
}
|
|
|
|
|
|
|
|
fn encode_header_and_body(mode: EncodeMode, body: Vec<u8>) -> Vec<u8> {
|
|
|
|
let mut ans = Vec::new();
|
|
|
|
ans.extend(MAGIC_BYTES);
|
|
|
|
let checksum = [0; 16];
|
|
|
|
ans.extend(checksum);
|
|
|
|
ans.extend(mode.to_bytes());
|
|
|
|
ans.extend(body);
|
|
|
|
let checksum_body = &ans[20..];
|
|
|
|
let checksum = md5::compute(checksum_body).0;
|
|
|
|
ans[4..20].copy_from_slice(&checksum);
|
|
|
|
ans
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) fn export_snapshot(doc: &LoroDoc) -> Vec<u8> {
|
|
|
|
let body = encode_reordered::encode_snapshot(
|
|
|
|
&doc.oplog().try_lock().unwrap(),
|
|
|
|
&doc.app_state().try_lock().unwrap(),
|
|
|
|
&Default::default(),
|
|
|
|
);
|
|
|
|
encode_header_and_body(EncodeMode::Snapshot, body)
|
|
|
|
}
|
|
|
|
|
|
|
|
pub(crate) fn decode_snapshot(
|
|
|
|
doc: &LoroDoc,
|
|
|
|
mode: EncodeMode,
|
|
|
|
body: &[u8],
|
|
|
|
) -> Result<(), LoroError> {
|
2023-07-31 03:49:55 +00:00
|
|
|
match mode {
|
feat: stabilizing encoding (#219)
This PR implements a new encode schema that is more extendible and more compact. It’s also simpler and takes less binary size and maintaining effort. It is inspired by the [Automerge Encoding Format](https://automerge.org/automerge-binary-format-spec/).
The main motivation is the extensibility. When we integrate a new CRDT algorithm, we don’t want to make a breaking change to the encoding or keep multiple versions of the encoding schema in the code, as it will make our WASM size much larger. We need a stable and extendible encoding schema for our v1.0 version.
This PR also exposes the ops that compose the current container state. For example, now you can make a query about which operation a certain character quickly. This behavior is required in the new snapshot encoding, so it’s included in this PR.
# Encoding Schema
## Header
The header has 22 bytes.
- (0-4 bytes) Magic Bytes: The encoding starts with `loro` as magic bytes.
- (4-20 bytes) Checksum: MD5 checksum of the encoded data, including the header starting from 20th bytes. The checksum is encoded as a 16-byte array. The `checksum` and `magic bytes` fields are trimmed when calculating the checksum.
- (20-21 bytes) Encoding Method (2 bytes, big endian): Multiple encoding methods are available for a specific encoding version.
## Encode Mode: Updates
In this approach, only ops, specifically their historical record, are encoded, while document states are excluded.
Like Automerge's format, we employ columnar encoding for operations and changes.
Previously, operations were ordered by their Operation ID (OpId) before columnar encoding. However, sorting operations based on their respective containers initially enhance compression potential.
## Encode Mode: Snapshot
This mode simultaneously captures document state and historical data. Upon importing a snapshot into a new document, initialization occurs directly from the snapshot, bypassing the need for CRDT-based recalculations.
Unlike previous snapshot encoding methods, the current binary output in snapshot mode is compatible with the updates mode. This enhances the efficiency of importing snapshots into non-empty documents, where initialization via snapshot is infeasible.
Additionally, when feasible, we leverage the sequence of operations to construct state snapshots. In CRDTs, deducing the specific ops constituting the current container state is feasible. These ops are tagged in relation to the container, facilitating direct state reconstruction from them. This approach, pioneered by Automerge, significantly improves compression efficiency.
2024-01-02 09:03:24 +00:00
|
|
|
EncodeMode::Snapshot => encode_reordered::decode_snapshot(doc, body),
|
|
|
|
_ => unreachable!(),
|
2023-07-31 03:49:55 +00:00
|
|
|
}
|
|
|
|
}
|