While it got faster to build a large BTreeMap<RepoPath, _>, there's still
a measurable cost. Let's eliminate it if watchman is enabled and the working
copy is clean. Perhaps we should introduce a new serialization format that
supports instant loading and lookup, but this hack works for the moment.
I'm not sure if the new tree_state format should be a flat (RepoPath, _) list
or a tree like the backend storage, btw.
In my "linux" repo (watchman enabled):
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1 \
"target/release-with-debug/{bin} -R ~/mirrors/linux status"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux status
Time (mean ± σ): 768.9 ms ± 14.2 ms [User: 630.7 ms, System: 131.2 ms]
Range (min … max): 742.3 ms … 783.1 ms 10 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux status
Time (mean ± σ): 713.0 ms ± 16.8 ms [User: 587.9 ms, System: 116.2 ms]
Range (min … max): 681.5 ms … 731.1 ms 10 runs
Relative speed comparison
1.08 ± 0.03 target/release-with-debug/jj-0 -R ~/mirrors/linux status
1.00 target/release-with-debug/jj-1 -R ~/mirrors/linux status
This ensures that the root fsmonitor_matcher matches nothing if there are no
working-copy changes. The query result can be observed by "jj debug watchman
query-changed-files".
I don't have expertise in the watchman query language, but using the watchman
API is probably better than .filter()-ing the result manually.
Assuming the input list is presorted, sorting an already-sorted vec should be
cheaper than .insert()-ing the sorted items one by one.
In my "linux" repo (watchman enabled):
- jj-0: baseline
- jj-1: previous (don't randomize by HashMap)
- jj-2: this
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
"target/release-with-debug/{bin} -R ~/mirrors/linux status"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux status
Time (mean ± σ): 1.034 s ± 0.020 s [User: 0.881 s, System: 0.212 s]
Range (min … max): 1.011 s … 1.068 s 10 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux status
Time (mean ± σ): 849.3 ms ± 13.8 ms [User: 710.7 ms, System: 199.3 ms]
Range (min … max): 821.7 ms … 870.2 ms 10 runs
Benchmark 3: target/release-with-debug/jj-2 -R ~/mirrors/linux status
Time (mean ± σ): 786.2 ms ± 16.7 ms [User: 650.7 ms, System: 204.1 ms]
Range (min … max): 760.8 ms … 805.2 ms 10 runs
Relative speed comparison
1.32 ± 0.04 target/release-with-debug/jj-0 -R ~/mirrors/linux status
1.08 ± 0.03 target/release-with-debug/jj-1 -R ~/mirrors/linux status
1.00 target/release-with-debug/jj-2 -R ~/mirrors/linux status
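The presorted-input point can be illustrated with a small sketch. As far as I understand the current standard library, collecting into a BTreeMap sorts the items first (cheap when they are already in order) and then builds the tree in bulk, whereas repeated .insert() descends the tree for every item. The String key and u32 value are placeholders, not the real RepoPath and file-state types, and both function names are made up for illustration.

```rust
use std::collections::BTreeMap;

/// Builds the map by inserting the items one by one.
/// Every insert walks the tree from the root to a leaf.
pub fn build_by_insert(items: Vec<(String, u32)>) -> BTreeMap<String, u32> {
    let mut map = BTreeMap::new();
    for (path, state) in items {
        map.insert(path, state);
    }
    map
}

/// Builds the map through FromIterator, which lets the standard library
/// sort the whole batch and construct the tree in one pass.
pub fn build_by_collect(items: Vec<(String, u32)>) -> BTreeMap<String, u32> {
    items.into_iter().collect()
}
```

Both functions return the same map; only the construction cost differs, and the gap should be largest for big, already-sorted inputs.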
According to the doc, this is compatible with the map syntax.
https://protobuf.dev/programming-guides/proto3/#maps
This change means that the serialized file states are sorted by RepoPath,
so BTreeMap<RepoPath, _> can be reconstructed with fewer cache misses.
In my "linux" repo (watchman enabled):
- jj-0: baseline
- jj-1: this
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
"target/release-with-debug/{bin} -R ~/mirrors/linux status"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux status
Time (mean ± σ): 1.034 s ± 0.020 s [User: 0.881 s, System: 0.212 s]
Range (min … max): 1.011 s … 1.068 s 10 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux status
Time (mean ± σ): 849.3 ms ± 13.8 ms [User: 710.7 ms, System: 199.3 ms]
Range (min … max): 821.7 ms … 870.2 ms 10 runs
Relative speed comparison
1.32 ± 0.04 target/release-with-debug/jj-0 -R ~/mirrors/linux status
1.08 ± 0.03 target/release-with-debug/jj-1 -R ~/mirrors/linux status
Cache-misses got reduced:
% perf stat -e task-clock,cycles,instructions,cache-references,cache-misses \
-- ./target/release-with-debug/jj-0 -R ~/mirrors/linux --no-pager status
1,091.68 msec task-clock # 1.032 CPUs utilized
4,179,596,978 cycles # 3.829 GHz
6,166,231,489 instructions # 1.48 insn per cycle
134,032,047 cache-references # 122.776 M/sec
29,322,707 cache-misses # 21.88% of all cache refs
1.057474164 seconds time elapsed
0.897042000 seconds user
0.194819000 seconds sys
% perf stat -e task-clock,cycles,instructions,cache-references,cache-misses \
-- ./target/release-with-debug/jj-1 -R ~/mirrors/linux --no-pager status
927.05 msec task-clock # 1.083 CPUs utilized
3,451,299,198 cycles # 3.723 GHz
6,222,418,272 instructions # 1.80 insn per cycle
98,499,363 cache-references # 106.251 M/sec
11,998,523 cache-misses # 12.18% of all cache refs
0.855938336 seconds time elapsed
0.720568000 seconds user
0.207924000 seconds sys
The idea is the same as the heads_pos() change in 9832ee205d. While
IndexEntry::position() should be cheap, saving 20 bytes per entry appears to
improve the performance in mid-size repos.
In my "linux" repo:
revsets/all()
-------------
baseline 1.24 156.0±1.06ms
this 1.00 126.0±0.51ms
I don't see a significant difference in small repos like "jj" or "git".
IndexEntryByPosition isn't removed since it's still used by the revset engine.
This removes the last use of `ouroboros`. Since `TreeEntriesDirItem`
is only used in "legacy trees" (before tree-level conflicts), I didn't
bother to check the performance impact. I also didn't bother to check
the matcher before adding the entries to the list, instead leaving
that where it is in `Iterator::next()`.
This removes the last use of `ouroboros` in `merged_tree.rs`. The set
of conflicts to iterate is usually so small that I didn't bother
checking the performance impact.
While importing the `ouroboros` crate and the `aliasable` crate it
depends on, the "unsafe Rust reviewer" expressed some concern that
they contain a lot of unsafe code that's hard to review. We can avoid
the unsafe code altogether by making `TreeEntriesIterator` not
self-referential. Instead, we can collect the matching entries in an
individual tree up front. It does have some performance cost:
```
❯ hyperfine --warmup 3 --runs 30 \
'/tmp/jj-before --ignore-working-copy files -r v6.0' \
'/tmp/jj-after --ignore-working-copy files -r v6.0'
Benchmark 1: /tmp/jj-before --ignore-working-copy files -r v6.0
Time (mean ± σ): 461.4 ms ± 14.3 ms [User: 232.1 ms, System: 229.4 ms]
Range (min … max): 443.4 ms … 496.3 ms 30 runs
Benchmark 2: /tmp/jj-after --ignore-working-copy files -r v6.0
Time (mean ± σ): 482.0 ms ± 14.3 ms [User: 257.2 ms, System: 224.9 ms]
Range (min … max): 461.8 ms … 513.3 ms 30 runs
Summary
'/tmp/jj-before --ignore-working-copy files -r v6.0' ran
1.04 ± 0.04 times faster than '/tmp/jj-after --ignore-working-copy files -r v6.0'
```
I think that's acceptable.
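A minimal sketch of the non-self-referential approach, assuming a toy store instead of the real tree types: each stack frame owns the entries of one tree, collected (and filtered) up front, so the iterator never borrows from a tree it also owns. `ToyStore`, `ToyEntry`, `FilesIter` and `join_path` are hypothetical names, not the actual API.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
pub enum ToyEntry {
    File,
    Dir,
}

/// Hypothetical store: maps a directory path to its (name, entry) listing.
pub struct ToyStore {
    pub trees: BTreeMap<String, Vec<(String, ToyEntry)>>,
}

impl ToyStore {
    /// Returns an owned copy of one tree, like loading it from a backend.
    fn read_tree(&self, dir: &str) -> Vec<(String, ToyEntry)> {
        self.trees.get(dir).cloned().unwrap_or_default()
    }
}

fn join_path(dir: &str, name: &str) -> String {
    if dir.is_empty() {
        name.to_owned()
    } else {
        format!("{dir}/{name}")
    }
}

pub struct FilesIter<'a, M> {
    store: &'a ToyStore,
    matcher: M,
    /// Each frame owns the remaining entries of one tree, in reverse order
    /// so that `pop()` yields them in the original order.
    stack: Vec<Vec<(String, ToyEntry)>>,
}

impl<'a, M: Fn(&str) -> bool> FilesIter<'a, M> {
    pub fn new(store: &'a ToyStore, matcher: M) -> Self {
        let mut iter = FilesIter { store, matcher, stack: Vec::new() };
        iter.push_dir("");
        iter
    }

    /// Collects the matching entries of a single tree up front.
    fn push_dir(&mut self, dir: &str) {
        let mut frame: Vec<(String, ToyEntry)> = self
            .store
            .read_tree(dir)
            .into_iter()
            .map(|(name, entry)| (join_path(dir, &name), entry))
            // Directories are always kept; files are filtered here.
            .filter(|(path, entry)| {
                matches!(entry, ToyEntry::Dir) || (self.matcher)(path.as_str())
            })
            .collect();
        frame.reverse();
        self.stack.push(frame);
    }
}

impl<M: Fn(&str) -> bool> Iterator for FilesIter<'_, M> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            let top = self.stack.last_mut()?;
            match top.pop() {
                Some((path, ToyEntry::File)) => return Some(path),
                Some((path, ToyEntry::Dir)) => self.push_dir(&path),
                None => {
                    self.stack.pop();
                }
            }
        }
    }
}
```

The extra cost is the per-tree Vec allocation, which is in line with the small slowdown measured above.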
This is much faster (maybe because of better cache locality?). Another option
is to use BTreeSet, but the BinaryHeap version is slightly faster.
"bench revset" result in my linux repo:
revsets/heads(tags())
---------------------
baseline 3.28 560.6±4.01ms
1 2.92 500.0±2.99ms
2 1.98 339.6±1.64ms
3 (this) 1.00 171.2±0.30ms
Apparently, IndexEntry::generation_number() isn't cheap, probably because it
involves random access to a larger memory region, and the u32 value might not
be aligned. Let's instead store the generation numbers in the BinaryHeap.
Also, heads_pos() becomes slightly faster by keeping the BinaryHeap entries
small, so I've removed the IndexEntry from them entirely.
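The idea of keeping heap entries small can be sketched as follows. This is not the heads_pos() implementation; it is a plain ancestor walk over a toy index, and `HeapItem`, `ToyIndex` and `walk_by_generation` are hypothetical names. The point is that the heap holds only two u32 values per entry, so ordering never has to look the generation number up in the index again.

```rust
use std::collections::BinaryHeap;

/// Compact heap entry: ordered by generation first, then by position.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct HeapItem {
    generation: u32,
    position: u32,
}

/// Hypothetical stand-in for the commit index.
pub struct ToyIndex {
    pub generations: Vec<u32>,
    pub parents: Vec<Vec<u32>>,
}

impl ToyIndex {
    fn item(&self, position: u32) -> HeapItem {
        HeapItem { generation: self.generations[position as usize], position }
    }
}

/// Visits `starts` and their ancestors, highest generation first.
pub fn walk_by_generation(index: &ToyIndex, starts: &[u32]) -> Vec<u32> {
    let mut heap: BinaryHeap<HeapItem> = starts.iter().map(|&p| index.item(p)).collect();
    let mut visited = Vec::new();
    let mut last: Option<HeapItem> = None;
    while let Some(item) = heap.pop() {
        // Equal items pop consecutively, so comparing with the previous
        // one is enough to drop duplicates.
        if last == Some(item) {
            continue;
        }
        last = Some(item);
        visited.push(item.position);
        for &parent in &index.parents[item.position as usize] {
            heap.push(index.item(parent));
        }
    }
    visited
}
```

The duplicate check relies on parents always having a smaller generation number than their children, which is assumed here rather than verified.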
This makes the default log and disambiguation revsets fast, which evaluate
'heads(immutable_heads())'.
"bench revset" result in my linux repo:
revsets/heads(tags())
---------------------
baseline 3.28 560.6±4.01ms
1 2.92 500.0±2.99ms
2 (this) 1.98 339.6±1.64ms
All callers just iterate over the parent entries.
"bench revset" result in my linux repo:
revsets/heads(tags())
---------------------
baseline 3.28 560.6±4.01ms
1 (this) 2.92 500.0±2.99ms
For loose refs, uninteresting directories can simply be skipped. For packed
refs, gix will have to do a binary search for each prefix to find the starting
point. Still, it's better overall if the repository contains tons of
refs/jj/keep refs.
With my linux repo containing ~5k loose jj refs, this saves ~40ms:
% hyperfine --warmup 3 --runs 10 \
"/tmp/jj-gix --ignore-working-copy git import -R ~/mirrors/linux" \
"/tmp/jj-gix-iter --ignore-working-copy git import -R ~/mirrors/linux"
Benchmark 1: /tmp/jj-gix --ignore-working-copy git import -R ~/mirrors/linux
Time (mean ± σ): 151.6 ms ± 11.4 ms [User: 38.8 ms, System: 111.6 ms]
Range (min … max): 129.8 ms … 159.5 ms 10 runs
Benchmark 2: /tmp/jj-gix-iter --ignore-working-copy git import -R ~/mirrors/linux
Time (mean ± σ): 109.9 ms ± 11.6 ms [User: 27.5 ms, System: 82.4 ms]
Range (min … max): 89.4 ms … 117.8 ms 10 runs
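For the packed-refs case, the per-prefix binary search can be sketched like this. It is only an illustration over an in-memory sorted list and makes no claim about how gix implements it; `refs_with_prefix` is a hypothetical helper.

```rust
/// Returns the contiguous run of names starting with `prefix`.
/// `sorted_refs` must be sorted in byte order, as packed refs are assumed
/// to be here.
pub fn refs_with_prefix<'a>(sorted_refs: &'a [String], prefix: &str) -> &'a [String] {
    // Binary search for the first name that is not smaller than the prefix.
    let start = sorted_refs.partition_point(|name| name.as_str() < prefix);
    // All names sharing the prefix follow directly after that point.
    let len = sorted_refs[start..]
        .iter()
        .take_while(|name| name.starts_with(prefix))
        .count();
    &sorted_refs[start..start + len]
}
```

With this shape, a large refs/jj/keep/ block is never scanned unless that prefix is requested.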
Gitoxide errors are boxed since there are various error types and they tend
to exceed the clippy size limit.
Apparently, gitoxide is faster than git2:
% hyperfine --warmup 3 --runs 10 \
"/tmp/jj-baseline --ignore-working-copy git import -R ~/mirrors/linux" \
"/tmp/jj-gix --ignore-working-copy git import -R ~/mirrors/linux"
Benchmark 1: /tmp/jj-baseline --ignore-working-copy git import -R ~/mirrors/linux
Time (mean ± σ): 205.4 ms ± 15.7 ms [User: 59.6 ms, System: 144.6 ms]
Range (min … max): 189.7 ms … 223.9 ms 10 runs
Benchmark 2: /tmp/jj-gix --ignore-working-copy git import -R ~/mirrors/linux
Time (mean ± σ): 176.2 ms ± 13.7 ms [User: 41.2 ms, System: 134.0 ms]
Range (min … max): 155.4 ms … 186.5 ms 10 runs
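As for boxing the gitoxide errors, a minimal sketch of the pattern is below. The enum and its variants are hypothetical, and I'm assuming the "clippy size limit" refers to a lint such as result_large_err. Boxing the source keeps the enum a few words wide no matter how large the underlying error types are.

```rust
use std::error::Error;
use std::fmt;

type BoxedSource = Box<dyn Error + Send + Sync>;

/// Hypothetical error type; not the real import error enum.
#[derive(Debug)]
pub enum ToyImportError {
    MissingRef(String),
    /// Any backend error, stored behind a pointer to keep the enum small.
    Backend(BoxedSource),
}

impl ToyImportError {
    pub fn from_backend(err: impl Into<BoxedSource>) -> Self {
        ToyImportError::Backend(err.into())
    }
}

impl fmt::Display for ToyImportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyImportError::MissingRef(name) => write!(f, "missing ref: {name}"),
            ToyImportError::Backend(err) => write!(f, "backend error: {err}"),
        }
    }
}

impl Error for ToyImportError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ToyImportError::MissingRef(_) => None,
            ToyImportError::Backend(err) => {
                let source: &(dyn Error + 'static) = &**err;
                Some(source)
            }
        }
    }
}
```

Callers would typically write `.map_err(ToyImportError::from_backend)` on each fallible backend call.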
If a commit pointed to by HEAD or ref is missing, the ref is considered
invalid and excluded by import_refs(). The current test behavior appears to
depend on some in-memory cache of git2::Repository.
We need to .collect_vec() the parents iterator into a temporary buffer since
the borrowed iterator can't be returned to the dag_walk functions. Another
option is to clone op_store and the parent ids to remove the &self lifetime
from the iterator, but that also means a temporary Vec is created.
Unlike dfs_ok(), this function short-circuits at an Err as we use non-lazy
topo_order_forward() internally. I think that's good enough. If we implement
GC on the operation log, deleted parents will be excluded (or mapped to a
tombstone) by the caller. An Err shouldn't mean it's GC-ed.
This unblocks the use of Result<T, E> in op.parents().
There are two ways to encode errors:
a. impl IntoIterator<Item = Result<T, E>>
b. Result<V, E> where V: FromIterator<Item = T>
I think (a) is more natural to algorithms like dfs(), which can process error
nodes transparently.
Still, the caller might have to collect the source iterator into a temporary
Vec to conform to the neighbors_fn signature. It's not easy for neighbors_fn to
return an iterator borrowing the input node. We already have GATs, but we
don't have return-position impl Trait in traits yet.
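Option (a) can be sketched as a depth-first walk whose work stack holds Result items, so an error node is simply yielded and the walk continues. This is a simplified illustration, not the actual dag_walk code; `dfs_ok_sketch` is a hypothetical name, and neighbors_fn returns an owned collection, matching the temporary-Vec constraint described above.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Depth-first walk over fallible nodes. Errors are passed through
/// in place instead of aborting the traversal.
pub fn dfs_ok_sketch<T, E, ID, NI>(
    start: impl IntoIterator<Item = Result<T, E>>,
    id_fn: impl Fn(&T) -> ID,
    mut neighbors_fn: impl FnMut(&T) -> NI,
) -> impl Iterator<Item = Result<T, E>>
where
    ID: Hash + Eq,
    NI: IntoIterator<Item = Result<T, E>>,
{
    let mut work: Vec<Result<T, E>> = start.into_iter().collect();
    let mut visited: HashSet<ID> = HashSet::new();
    std::iter::from_fn(move || loop {
        let node = match work.pop()? {
            Ok(node) => node,
            // An error has no neighbors to expand; just report it.
            Err(err) => return Some(Err(err)),
        };
        if !visited.insert(id_fn(&node)) {
            continue; // already emitted
        }
        work.extend(neighbors_fn(&node));
        return Some(Ok(node));
    })
}
```

A caller that wants the short-circuiting behavior of option (b) can still get it by collecting the output into a `Result<Vec<_>, _>`.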
Recognize signature metadata from git commit objects, and implement
a basic version of that for the native backend.
Extract the signed data (a commit binary repr without the signature) to
be verified later.
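For the git side, the extraction can be sketched on a raw commit object: the signature lives in a "gpgsig" header whose continuation lines start with a single space, and the signed data is the commit with that header removed. The function below is a simplified illustration under those assumptions (UTF-8 input, only the "gpgsig" header), and `split_gpgsig` is a hypothetical name, not the backend's API.

```rust
/// Splits a raw commit into (signature, signed_data).
/// Returns None if the commit has no header/body separator or no signature.
pub fn split_gpgsig(raw: &str) -> Option<(String, String)> {
    // Headers and message are separated by the first empty line.
    let (headers, body) = raw.split_once("\n\n")?;
    let mut signature: Option<String> = None;
    let mut signed_data = String::new();
    let mut in_signature = false;
    for line in headers.split('\n') {
        if let Some(first) = line.strip_prefix("gpgsig ") {
            signature = Some(first.to_owned());
            in_signature = true;
        } else if in_signature && line.starts_with(' ') {
            // Continuation line: drop the leading space.
            let sig = signature.as_mut()?;
            sig.push('\n');
            sig.push_str(&line[1..]);
        } else {
            in_signature = false;
            signed_data.push_str(line);
            signed_data.push('\n');
        }
    }
    // Re-attach the blank separator line and the message.
    signed_data.push('\n');
    signed_data.push_str(body);
    Some((signature?, signed_data))
}
```

The returned signed data is what a verifier would check the signature against.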
Otherwise, ref updates would fail if we port git::export_refs() to gitoxide.
This change isn't strictly needed for the backend itself, but we'll reuse the
gix::Repository instance created by the backend when importing and exporting
Git refs.
GitBackend will use it to configure gix::Repository. I think UserSettings
is generally useful to pass store-specific parameters, so I've updated all
factory functions.
While the safe implementation is a bit more complex (and probably more branchy),
I don't think the runtime overhead would matter here. Let's remove one more
unsafe for better code maintainability.