I think the conclusion from #2600 is that at least auto-rebasing
should not simplify merge commits that merge a commit with its
ancestor. Let's start by adding an option for that in the library.
The shortest change id prefix will become a few digits longer, but I think
that's acceptable. Entries included in the "revsets.short-prefixes" set are
unaffected.
The reachable set is calculated eagerly, but this is still faster as we no
longer need to sort the reachable entries by change id. The lazy version will
save another ~100ms in mid-size repos.
"jj log" without working copy snapshot:
```
% hyperfine --sort command --warmup 3 --runs 20 -L bin jj-0,jj-1,jj-2 \
-s "target/release-with-debug/{bin} -R ~/mirrors/linux debug reindex" \
"target/release-with-debug/{bin} -R ~/mirrors/linux \
--ignore-working-copy log -r.. -l100 --config-toml='revsets.short-prefixes=\"\"'"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux --ignore-working-copy log -r.. -l100 --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 353.6 ms ± 11.9 ms [User: 266.7 ms, System: 87.0 ms]
Range (min … max): 329.0 ms … 365.6 ms 20 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux --ignore-working-copy log -r.. -l100 --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 271.3 ms ± 9.9 ms [User: 183.8 ms, System: 87.7 ms]
Range (min … max): 250.5 ms … 282.7 ms 20 runs
Relative speed comparison
1.99 ± 0.16 target/release-with-debug/jj-0 -R ~/mirrors/linux --ignore-working-copy log -r.. -l100 --config-toml='revsets.short-prefixes=""'
1.53 ± 0.12 target/release-with-debug/jj-1 -R ~/mirrors/linux --ignore-working-copy log -r.. -l100 --config-toml='revsets.short-prefixes=""'
```
"jj status" with working copy snapshot (watchman enabled):
```
% hyperfine --sort command --warmup 3 --runs 20 -L bin jj-0,jj-1,jj-2 \
-s "target/release-with-debug/{bin} -R ~/mirrors/linux debug reindex" \
"target/release-with-debug/{bin} -R ~/mirrors/linux \
status --config-toml='revsets.short-prefixes=\"\"'"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux status --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 396.6 ms ± 10.1 ms [User: 300.7 ms, System: 94.0 ms]
Range (min … max): 373.6 ms … 408.0 ms 20 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux status --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 318.6 ms ± 12.6 ms [User: 219.1 ms, System: 94.1 ms]
Range (min … max): 294.2 ms … 333.0 ms 20 runs
Relative speed comparison
1.85 ± 0.14 target/release-with-debug/jj-0 -R ~/mirrors/linux status --config-toml='revsets.short-prefixes=""'
1.48 ± 0.12 target/release-with-debug/jj-1 -R ~/mirrors/linux status --config-toml='revsets.short-prefixes=""'
```
These methods are basically the same as the commit_id versions, but
resolve_change_id_prefix() is a bit more involved as we need to gather matches
from multiple segments.
In resolve_change_id_prefix(), I've implemented two different ways of
collecting the overflow items. I don't think they impact the performance,
but we can switch to the alternative method as needed.
This basically means that the change ids are interned. We'll implement binary
search over the sorted change ids table. The table could be sorted differently
for better cache locality, but it is in lexicographical order for simplicity.
With my testing, the cost of the id lookup isn't dominant.
Unlike the parent entries, the size of the per-id overflow items isn't saved.
That's s because the number of the same-change-id commits is either 1 or many.
It doesn't make sense to allocate 8 bytes for each change id. Instead, we'll
pay extra indirection cost to determine the size.
I'm going to add change id overflow table whose elements are of LocalPosition
type. Let's make sure that the serialization code would break if we changed
the underlying data type.
Apparently, gix has 100ms timeout. Since this test tries to create contended
situation, it's possible that the ref lock can't be acquired. I've added
upper bound to the retry loop at b37293fa68 "tests: add upper bound to
test_concurrent_read_write_commit() loop", so ignoring arbitrary errors
should be okay.
The problem can be reproduced on my Linux machine by inserting 10ms sleep() to
gix and increasing the concurrency.
Fixes#3069
This is for completeness and to avoid accidents such as someone calling
`ContentHash::hash(1234u32.to_le_bytes())` and expecting it to hash properly as
a u32 instead of a 4 byte slice, which produces a different hash due to hashing
the length of the slice before its contents.
The `ContentHash` documentation specifies that implementations for enums should
hash the ordinal number of the variant contained in the enum as a 32-bit
little-endian number and then hash the contents of the variant, if any.
The current implementations for `std::Option`, `MergedTreeId`, and
`RemoteRefState` are non-conformant since they hash the ordinal number as a u8
with platform specific endianness.
Fixes#3051
Similar to the previous commit, these functions will be reused by the change id
lookup methods. The return value isn't cloned because resolve_id_prefix() will
return (key, value) pair, and the current caller doesn't need a cloned value.
This removes redundant case from resolve_neighbor_commit_ids(). The returned
position should never be lower than the prefix id.
The implementation is basically a copy of slice::binary_search_by(). We still
use (low + high) / 2 as the size wouldn't exceed 2^31.
https://github.com/rust-lang/rust/blob/1.76.0/library/core/src/slice/mod.rs#L2825
Since IdIndex sorts the entries by using .sort_unstable_by_key(), the order of
the same-key elements is undefined. Perhaps, it's stable for short arrays, and
the test passes because of that.
This saves 4 more bytes per entry, and more importantly, most commit parents
can be resolved with no indirection to the overflow table.
IIRC, Git always inlines the first parent, but that wouldn't be useful in jj
since jj diffs merge commit against the auto-merge parent. The first merge
parent is nothing special.
I'll use a similar encoding in change id sstable, where only one position
will be inlined (to optimize for imported commits.)
Benchmark number measuring the cost of change id index building:
```
% hyperfine --sort command --warmup 3 --runs 20 -L bin jj-0,jj-1 \
-s "target/release-with-debug/{bin} -R ~/mirrors/linux \
--ignore-working-copy debug reindex" \
"target/release-with-debug/{bin} -R ~/mirrors/linux \
--ignore-working-copy log -r@ --config-toml='revsets.short-prefixes=\"\"'"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux --ignore-working-copy log -r@ --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 342.9 ms ± 14.5 ms [User: 202.4 ms, System: 140.6 ms]
Range (min … max): 326.6 ms … 360.6 ms 20 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux --ignore-working-copy log -r@ --config-toml='revsets.short-prefixes=""'
Time (mean ± σ): 325.0 ms ± 13.6 ms [User: 196.2 ms, System: 128.8 ms]
Range (min … max): 311.6 ms … 343.2 ms 20 runs
Relative speed comparison
1.06 ± 0.06 target/release-with-debug/jj-0 -R ~/mirrors/linux --ignore-working-copy log -r@ --config-toml='revsets.short-prefixes=""'
1.00 target/release-with-debug/jj-1 -R ~/mirrors/linux --ignore-working-copy log -r@ --config-toml='revsets.short-prefixes=""'
```
I'm going to change the index format to store local positions in the lookup
table. That's not super important, but I think it makes sense because the
lookup table should never contain inter-segment links.
The mutable segment now stores local positions in its lookup map. The readonly
segment will be updated later.
This unblocks removal of 'is_legacy: bool' fields.
Note that all legacy dag range expressions can't be accepted by the new grammar.
For example, 'x:y()' is parsed as ('x:y', error) because 'x:y' is a valid string
pattern expression, and '(' isn't an infix operator. The old compat_dag_range_op
is NOT removed as it can still translate 'x():y' or 'x:(y)' to a better error,
and we might make the string pattern syntax stricter #2101.
The legacy parsing rules are turned into compatibility errors. The x:y rule
is temporarily enabled when parsing string patterns. It's weird, but we can't
isolate the parsing function because a string pattern may be defined in an
alias.
It's not ideal to print the error there, but using stderr should be slightly
better. It could be a tracing message, but tracing won't be displayed by
default.
I'm going to introduce breaking changes in index format. Some of them will
affect the file size, so version number or signature won't be needed. However,
I think it's safer to detect the format change as early as possible.
I have no idea if embedded version number is the best way. Because segment
files are looked up through the operation links, the version number could be
stored there and/or the "segments" directory could be versioned. If we want to
support multiple format versions and clients, it might be better to split the
tables into data chunks (e.g. graph entries, commit id table, change id table),
and add per-chunk version/type tag. I choose the per-file version just because
it's simple and would be non-controversial.
As I'm going to introduce format change pretty soon, this patch doesn't
implement data migration. The existing index files will be deleted and new
files will be created from scratch.
Planned index format changes include:
1. remove unused "flags" field
2. inline commit parents up to two
3. add sorted change ids table
Since the operation log has a root operation, we don't need to create
the repo-initialization operation in order to create a valid
`ReadonlyRepo` instance. I think it's conceptually simpler to create
the instance at the root operation id and then add the initial
operation using the usual `Transaction` API. That's what this patch
does.
Doing that also brought two issues to light:
1. The empty view object doesn't have the root commit as head.
2. The initialized `OpHeadsStore` doesn't have the root operation as
head.
Both of those seem somewhat reasonable, but maybe we should change
them. For now, I just made the initial repo (before the initial
operation) have a single op head (to compensate for (2)). It might be
worth addressing both issues so the repo is in a better state before
we create the initial operation. Until we do, we probably shouldn't
drop the initial operation.
This is #3002 with tests rerun to account for changes
to `strsim`, as @thoughtpolice noticed in
https://github.com/martinvonz/jj/pull/3002#issuecomment-1936763101
The string similarity changes include an example that
seems better and one that seems worse. Decreasing
the threshold definitely makes things worse.
I was a bit surprised to learn (or be reminded?) that checking out
symlinks on Windows leads to a panic. This patch fixes the crash by
materializing symlinks from the repo as regular files. It also updates
the snapshotting code so we preserve the symlink-ness of a path. The
user can update the symlink in the repo by updating the regular file
in the working copy. This seems to match Git's behavior on Windows
when symlinks are disabled.
The `write_path_to_store()` has almost no overlapping code between the
handling of symlinks and regular files, which suggests that we should
move out the handling of symlinks to the caller (there's only one).
If the operation corresponding to a workspace is missing for some reason
(the specific situation in the test in this commit is that an operation
was abandoned and garbage-collected from another workspace), currently,
jj fails with a 255 error code. Teach jj a way to recover from this
situation.
When jj detects such a situation, it prints a message and stops
operation, similar to when a workspace is stale. The message tells the
user what command to run.
When that command is run, jj loads the repo at the @ operation (instead
of the operation of the workspace), creates a new commit on the @
commit with an empty tree, and then proceeds as usual - in particular,
including the auto-snapshotting of the working tree, which creates
another commit that obsoletes the newly created commit.
There are several design points I considered.
1) Whether the recovery should be automatic, or (as in this commit)
manual in that the user should be prompted to run a command. The user
might prefer to recover in another way (e.g. by simply deleting the
workspace) and this situation is (hopefully) rare enough that I think
it's better to prompt the user.
2) Which command the user should be prompted to run (and thus, which
command should be taught to perform the recovery). I chose "workspace
update-stale" because the circumstances are very similar to it: it's
symptom is that the regular jj operation is blocked somewhere at the
beginning, and "workspace update-stale" already does some special work
before the blockage (this commit adds more of such special work). But it
might be better for something more explicitly named, or even a sequence
of commands (e.g. "create a new operation that becomes @ that no
workspace points to", "low-level command that makes a workspace point to
the operation @") but I can see how this can be unnecessarily confusing
for the user.
3) How we recover. I can think of several ways:
a) Always create a commit, and allow the automatic snapshotting to
create another commit that obsoletes this commit.
b) Create a commit but somehow teach the automatic snapshotting to
replace the created commit in-place (so it has no predecessor, as viewed
in "obslog").
c) Do either a) or b), with the added improvement that if there is no
diff between the newly created commit and the former @, to behave as if
no new commit was created (@ remains as the former @).
I chose a) since it was the simplest and most easily reasoned about,
which I think is the best way to go when recovering from a rare
situation.