This mostly reverts https://github.com/martinvonz/jj/pull/2901 as well as its
fixup https://github.com/martinvonz/jj/pull/2903. The related bug is reopened,
see https://github.com/martinvonz/jj/issues/2869#issuecomment-1920367932.
The problem is that while the fix did fix#2869 in most cases, it did
reintroduce the more severe bug https://github.com/martinvonz/jj/issues/2760
in one case, if the working copy is the commit being rebased.
For example, suppose you have the tree
```
root -> A -> B -> @ (empty) -> C
```
### Before this commit
#### Case 1
`jj rebase -s B -d root --skip-empty` would work perfectly before this
commit, resulting in
```
root -> A
\-------B -> C
\- @ (new, empty)
```
#### Case 2
Unfortunately, if you run `jj rebase -s @ -d A --skip-empty`, you'd have the
following result (before this commit), which shows the reintroduction of #2760:
```
root -> A @ -> C
\-- B
```
with the working copy at `A`. The reason for this is explained in
https://github.com/martinvonz/jj/pull/2901#issuecomment-1920043560.
### After this commit
After this commit, both case 1 and case 2 will be wrong in the sense of #2869,
but it will no longer exhibit the worse bug #2760 in the second case.
Case 1 would result in:
```
root -> A
\-------B -> @ (empty) -> C
```
Case 2 would result in:
```
root -> A -> @ -> C
\-- B
```
with the working copy remaining a descendant of A
With my jj repo, the number of jj/keep refs went down from 87887 to 27733.
The .git directory size is halved, but we'll need to clean up extra and index
files to save disk space. "git gc --prune=now && jj debug reindex" passed, so
the repo wouldn't be corrupted.
#12
I'm going to make WorkspaceCommandHelper::maybe_snapshot() snapshot the working
copy before importing refs. git::import_some_refs() can rebase the working copy
branch and therefore @ can be moved. git::import_head() doesn't, and it should
be invoked before snapshotting.
git::import_head() is inserted to some of the git:import_refs() callers where
HEAD seems to matter. I feel it's a bit odd that the HEAD ref is imported to
non-colocated repo, but "jj init --git-repo" relies on that, and I think the
existence of HEAD@git is harmless. It's merely a ref to the revision checked
out somewhere else.
We didn't have any tests with negative snapshots (after a `-------`
line). I initially thought we couldn't produce such conflict markers
anymore. I'm not sure we want to render conflicts like the one in the
test like this. I don't think I intended for `add_index` in the code
to be able to be two steps ahead of the remove. Maybe we should
rewrite the algorithm to not do that and thus never produce negative
snapshots.
It seems obvious in hindsight to have a virtual root operation just
like we have a virtual root commit. It removes the same kind of
problems by making sure there's always a common ancestor (or multiple)
between any two commits.
I think the reason I didn't add a root operation from the beginning
was that there used to be a mandatory working-copy commit in the view
(this was before support for multiple workspaces).
Perhaps we should remove the "initialize repo" operation now. The only
difference between their view objects is that the "initialize repo"
operation adds the root commit as a head. We could add that to the
root operation, but then the root operation's value depends on the
commit backend.
We've had the public_heads for as long as we've had the View object,
IIRC (I didn't check), but we still don't use it for anything. I don't
have any concrete plans for using it either. Maybe our config for
immutable commits is good enough, or maybe we'll want something more
generic (like Mercurial's phases). For now, I think we should simplify
by removing it the storage for public heads.
Since hidden commits can be looked up by remote_branches() revset for example,
reindexing should traverse ancestors from all named refs in addition to the
visible heads.
change_id_index() is only used by Readonly/MutableRepo, so we don't need an
abstraction at Index. evaluate_revset() is somewhat similar, but the callers
rely on &dyn Repo.
Since new operations and views may be added concurrently by another process,
there's a risk of data corruption. The keep_newer parameter is a mitigation
for this problem. It's set to preserve files modified within the last 2 weeks,
which is the default of "git gc". Still, a concurrent process may replace an
existing view which is about to be deleted by the gc process, and the view
file would be lost.
#12
We current have `Revset::change_id_index()` for creating a
`ChangeIdIndex` for a given revset. I think it will be hard to make it
performant for general revsets, especially in very large repos and
with custom index implementations, like the one we have at Google. If
we instead restrict it to including all ancestors of a set of heads, I
think it will be much easier to implement. We only use
`Revset::change_id_index()` with revsets including all visible commits
today, so we won't lose any current functionality by making it more
restricted.
I plan to replace `Revset::change_id_index()` by
`Index::change_id_index(heads)`, but one of the tests currently uses a
set of commits that does not include ancestors. This patch updates it
to include ancestors (and changes the set of heads to keep the set
small enough for the test).
This isn't technically needed, but it prevents API misuse. Another option
is to do some compile-time substitution, but most callers are tests and the
runtime performance wouldn't matter.
I'm going to add try_from_hex(), which requires Self: Sized. Such trait bound
could be added, but I don't think we'll need abstracted ObjectId constructors
at all.
I'm going to add a prefix resolution method to OpStore, but OpStore is
unrelated to the index. I think ObjectId, HexPrefix, and PrefixResolution can
be extracted to this module.
This will be used in "jj op abandon ..op_id" command. The "op_id..@" range will
be reparented onto the root operation.
The current implementation is good enough for local repos, but it won't scale.
We might want to extract it as a trait method or introduce OpIndex for
efficient DAG operation.
Fixes#2760
Given the tree:
```
A-B-C
\
B2
```
And the command `jj rebase -s B -d B2`
We were previously marking B as abandoned, despite the comment stating that we were marking it as being succeeded by B2. This resulted in a call to `rewrite(rewrites={}, abandoned={B})` instead of `rewrite(rewrites={B=>B2}, abandoned={})`, which then made the new parent of `C` into `A` instead of `B2`
This completes the process of removing DescendantRebaser-related APIs from
tests. It requires creating some new test utils and a new
`rebase_descendants_with_option_return_map`.
This removes uses of `DescendantRebaser::new` or
`MutRepo::create_descendant_rebaser` from most tests. The exceptions are the
tests having to do with abandoning empty commits on rebase, since adjusting
those is a bit more elaborate (see follow-up commits).
A possible use case is when doing some archaeology around a certain operation.
The current implementation is quadratic if + is repeated. Suppose op_id is
usually close to the current op heads, I think it'll practically work better
than building a reverse lookup table.
My jj repo contains such head commits, and "jj debug reindex" fails. To address
this problem, we'll probably need to implement GC, and the user will discard
operations before the first bad op id.
This requires creating a new public API as a substitute. I took the opportunity
to also add some comments to the
`MutRepo::record_rewritten_commit`/`record_abandoned_commit` functions.
I imade the simplest possible addition to the API; it is not a very elegant
one. Eventually, the entire `record_rewritten_commit` API should probably be
refactored again.
I also added some comments explaining what these functions do.
Spotted while adding error propagation there. This wouldn't likely be a real
problem because "jj debug reindex" removes all of the operation links.
The "} else {" condition is removed because it doesn't make sense to exclude
only the exact parent_op_id operation. This can be optimized to not walk
ancestors of the parent_op_id operation, but I don't see a motivation to add
tests covering such scenarios. It's pretty rare that an intermediate operation
link is missing.
We'll probably need a better abstraction, but a separate method is good
enough to remove unsafe code from ReadonlyRepo.
I'm not sure if this is feasible for the other backends, but I guess there
would be less lifetimed variables than DefaultReadonlyIndex.
It seems better to have the caller pass the transaction description
when we finish the transaction than when we start it. That way we have
all the information we want to include more readily available.
Previously, the function relied on both the `self.parent_mapping` and
`self.rebased`. If `(A,B)` was in `parent_mapping` and `(B,C)` was in `rebased`,
`new_parents` would map `A` to `C`.
Now, `self.rebased` is ignored by `new_parents`. In the same situation,
DescendantRebaser is changed so that both `(A,B)` and `(B,C)` are in
`parent_mapping` before. `new_parents` now applies `parent_mapping` repeatedly,
and will map `A` to `C` in this situation.
## Cons
- The semantics are changed; `new_parents` now panics if `self.parent_mapping`
contain cycles. AFAICT, such cycles never happen in `jj` anyway, except for
one test that I had to fix. I think it's a sensible restriction to live with;
if you do want to swap children of two commits, you can call
`rebase_descendants` twice.
## Pros
- I find the new logic much easier to reason about. I plan to extract it into a
function, to be used in refactors for `jj rebase -r` and `jj new --after`. It
will make it much easier to have a correct implementation of `jj rebase -r
--after`, even when rebasing onto a descendant.
- The de-duplication is no longer O(n^2). I tried to keep the common case fast.
## Alternatives
- We could make `jj rebase` and `jj new` use a separate function with the
algorithm shown here, without changing DescendantRebaser. I believe that the new
algorithm makes DescendatRebaser easier to understand, though, and it feels more
elegant to reduce code duplication.
- The de-duplication optimization here is independent of other changes, and
could be used on its own.
default_index_store.rs is relatively big, and it contains types and impls in
arbitrary order. Let's split them into sub modules. After everything moved,
mod.rs will only contain tests.
The wrapper type isn't needed for the mutable layer, but this mirrors the
readonly type structure. Test cases are also migrated to be using the index
wrapper so long as we don't have to care for the nesting of the segment files.
Gitoxide errors don't implement PartialEq. We could instead stringify the
errors, but there aren't many callers who expect FailedRefExportReason to
be comparable.
I think this is a variant of the problem fixed by 7fda80fc22 "tree: simplify
conflict before resolving at hunk level." We need to simplify() the conflict
before and after extracting file ids because the source conflict values may
contain trees to be cancelled out, and the file values may differ only in exec
bits. Since the legacy tree passes a simplified conflict in to this function,
I made the merged tree do the same.
Fixes#2654
Otherwise an empty subtree would be added to the parent tree.
If the stored tree contained an empty subtree, simplify() wouldn't work
against new "absent" subtree representation. I don't know if there's a
such code path, but I believe it's very rare to encounter the problem.
#2654
In snapshot(), changed_file_states are received in arbitrary order. For the
other callers, entries are in diff_stream order, so we don't have to sort
them.
With watchman enabled, we can see the cost of sorting the sorted proto entries.
I don't think this is significant, but we can mitigate it by adding
is_file_states_sorted flag to the proto message if needed:
```
% hyperfine --sort command --warmup 3 --runs 20 -L bin jj-0,jj-1 \
"target/release-with-debug/{bin} -R ~/mirrors/linux files ~/mirrors/linux/no-match"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux files ~/mirrors/linux/no-match
Time (mean ± σ): 164.8 ms ± 16.6 ms [User: 50.2 ms, System: 111.7 ms]
Range (min … max): 148.1 ms … 195.0 ms 20 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux files ~/mirrors/linux/no-match
Time (mean ± σ): 171.8 ms ± 13.6 ms [User: 61.7 ms, System: 109.0 ms]
Range (min … max): 159.5 ms … 192.1 ms 20 runs
```
Without watchman:
```
% hyperfine --sort command --warmup 3 --runs 20 -L bin jj-0,jj-1 \
"target/release-with-debug/{bin} -R ~/mirrors/linux files ~/mirrors/linux/no-match"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux files ~/mirrors/linux/no-match
Time (mean ± σ): 367.3 ms ± 30.3 ms [User: 1415.2 ms, System: 633.8 ms]
Range (min … max): 325.4 ms … 421.7 ms 20 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux files ~/mirrors/linux/no-match
Time (mean ± σ): 327.7 ms ± 24.9 ms [User: 1059.1 ms, System: 654.3 ms]
Range (min … max): 296.0 ms … 385.4 ms 20 runs
```
I haven't measured snapshotting against dirty working copy, but I don't think
it would be slower than the original implementation.
This enables cheap str-to-RepoPath cast, which is useful when sorting and
filtering a large Vec<(String, _)> list by using matcher for example. It
will also eliminate temporary allocation by repo_path.parent().
I'm going to add borrowed RepoPath type, and most from_internal_string()
callers will be migrated to it. For the remaining callers, it makes more
sense to move the ownership of String to RepoPathBuf.
RepoPath::from_components() is removed since it is no longer a primitive
function.
The components iterator could be implemented on top of str::split(), but
it's not as we'll probably want to add components.as_path() -> &RepoPath.
Tree walking and tree_states map construction get slightly faster thanks to
fewer allocations and/or better cache locality. If we add a borrowed RepoPath
type, we can also implement a cheap &str to &RepoPath conversion on top. Then,
we can get rid of BTreeMap<RepoPath, FileState> construction at all.
Snapshot without watchman:
```
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1 \
"target/release-with-debug/{bin} -R ~/mirrors/linux status"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux status
Time (mean ± σ): 950.1 ms ± 24.9 ms [User: 1642.4 ms, System: 681.1 ms]
Range (min … max): 913.8 ms … 990.9 ms 10 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux status
Time (mean ± σ): 872.1 ms ± 14.5 ms [User: 1922.3 ms, System: 625.8 ms]
Range (min … max): 853.2 ms … 895.9 ms 10 runs
Relative speed comparison
1.09 ± 0.03 target/release-with-debug/jj-0 -R ~/mirrors/linux status
1.00 target/release-with-debug/jj-1 -R ~/mirrors/linux status
```
Tree walk:
```
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1 \
"target/release-with-debug/{bin} -R ~/mirrors/linux files --ignore-working-copy"
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux files --ignore-working-copy
Time (mean ± σ): 375.3 ms ± 15.4 ms [User: 223.3 ms, System: 151.8 ms]
Range (min … max): 359.4 ms … 394.1 ms 10 runs
Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux files --ignore-working-copy
Time (mean ± σ): 357.1 ms ± 16.2 ms [User: 214.7 ms, System: 142.6 ms]
Range (min … max): 341.6 ms … 378.9 ms 10 runs
Relative speed comparison
1.05 ± 0.06 target/release-with-debug/jj-0 -R ~/mirrors/linux files --ignore-working-copy
1.00 target/release-with-debug/jj-1 -R ~/mirrors/linux files --ignore-working-copy
```
This is a step towards introducing a borrowed RepoPath type. The current
RepoPath type is inefficient as each component String is usually short. We
could apply short-string optimization, but still each inlined component would
consume 24 bytes just for e.g. "src", and increase the chance of random memory
access. If the owned RepoPath type is backed by String, we can implement cheap
cast from &str to borrowed &RepoPath type.
`RevsetExpression::resolve()` is meant for programmatically created
expressions. In particular, it may not contain symbols. Let's try to
clarify that by renaming the function and documenting it.
Hopefully this will fix the unfinished Windows CI issue. A possible scenario
is that recent migration to gitoxide made this test flaky on Windows. For
example, gitoxide might have in-memory object cache that relies on file mtime,
and occasionally fails to detect new object on Windows.
If a commit pointed to by HEAD or ref is missing, the ref is considered
invalid and excluded by import_refs(). The current test behavior appears to
depend on some in-memory cache of git2::Repository.
GitBackend will use it to configure gix::Repository. I think UserSettings
is generally useful to pass store-specific parameters, so I've updated all
factory functions.
As discussed in Discord, it's less useful if remote_branches() included
Git-tracking branches. Users wouldn't consider the backing Git repo as
a remote.
We could allow explicit 'remote_branches(remote=exact:"git")' query by changing
the default remote pattern to something like 'remote=~exact:"git"'. I don't
know which will be better overall, but we don't have support for negative
patterns anyway.
Since the concurrent diff algorithm is significantly slower when using
the Git backend, I think we'll have to use switch between the two
algorithms depending on backend. Even if the concurrent version always
performed as well as the sequential version, exactly how concurrent it
should be probably still depends on the backend. This commit therefore
adds a function to the `Backend` trait, so each backend can say how
much concurrency they deal well with. I then use that number for
choosing between the sequential and concurrent versions in
`MergedTree::diff_stream()`, and also to decide the number of
concurrent reads to do in the concurrent version.
When diffing two trees, we currently start at the root and diff those
trees. Then we diff each subtree, one at a time, recursively. When
using a commit backend that uses remote storage, like our backend at
Google does, diffing the subtrees one at a time gets very slow. We
should be able to diff subtrees concurrently. That way, the number of
roundtrips to a server becomes determined by the depth of the deepest
difference instead of by the number of differing trees (times 2,
even). This patch implements such an algorithm behind a `Stream`
interface. It's not hooked in to `MergedTree::diff_stream()` yet; that
will happen in the next commit.
I timed the new implementation by updating `jj diff -s` to use the new
diff stream and then ran it on the Linux repo with `jj diff
--ignore-working-copy -s --from v5.0 --to v6.0`. That slowed down by
~20%, from ~750 ms to ~900 ms. Maybe we can get some of that
performance back but I think it'll be hard to match
`MergedTree::diff()`. We can decide later if we're okay with the
difference (after hopefully reducing the gap a bit) or if we want to
keep both implementations.
I also timed the new implementation on our cloud-based repo at
Google. As expected, it made some diffs much faster (I'm not sure if
I'm allowed to share figures).
I'm about to add a few more checks for diffing with a matcher. I think
it will help make it readable and reduce the risk of mixing up
variables between each part of the test if we use some nested blocks.
I also removed some unnecessary `.clone()` calls while at it.
I'm going to add a Merge method that removes negative/positive terms pair, and
swap_remove() is the easiest option. The order of the conflicted ref targets
doesn't matter.
One less git2 API use in CLI.
The function name GitBackend::init_colocated() is a bit odd, but we need to
specify the work-tree path, not the ".git" repo path. So we can't eliminate
the notion of the working copy path anyway.
During the transition to using more async code, I keep running into
https://github.com/rust-lang/futures-rs/issues/2090. Right now, I want
to convert `MergedTree::diff()` into a `Stream`. I don't want to
update all call sites at once, so instead I'm adding a
`MergedTree::diff_stream()` method, which just wraps
`MergedTree::diff()` in a `Stream. However, since the iterator is
synchronous, it needs to block on the async `Backend::read_tree()`
calls. If we then also block on the `Stream` in the CLI, we run into
the panic.
We had similar code in two places for restoring paths from one tree to
another. Let's reuse it instead.
I put the new function in the `rewrite` module. I'm not sure if that's
right place. Maybe it belongs in `tree`?
I want to fix error propagation before I start using async in this
code. This makes the diff iterator propagate errors from reading tree
objects.
Errors include the path and don't stop the iteration. The idea is that
we should be able to show the user an error inline in diff output if
we failed to read a tree. That's going to be especially useful for
backends that can return `BackendError::AccessDenied`. That error
variant doesn't yet exist, but I plan to add it, and use it in
Google's internal backend.
I'm going to add `MergedTreeValue` as an alias for
`Merge<Option<TreeValue>>`, but we already have a type by that name in
`merged_tree`. This patch renames it away, to make room for the new
alias. I used `MergedTreeVal` for this borrowing version to be a bit
like how `str` is a borrowed version of `String`.