It's faster to add only files matched by the Watchman matcher to the
set of deleted files than to add all files and then removed files not
matched. This speeds up `jj diff` with Watchman in the Linux repo from
~530 ms to ~460 ms.
The intention in this and some future commits is to have `update_file_state` accept `&self` instead of `&mut self` to make clear what data is updated by `update_file_state` and to ensure transactional safety of the `TreeState` contents.
It's currently a bit complicated to snapshot the working copy and
there's a lot of duplication in tests. This commit introduces a
function to simplify it. I made the function snapshot the working copy
and save the updated state. Some of the tests I changed previously
discarded the changes instead of saving them, but I think they all did
so because it was simpler. I left a few call sites unchanged because
they make concurrent changes.
This is breaking change. Old jj binary will panic if it sees a view saved by
new jj. Alternatively, we can store both new and legacy data for backward
compatibility.
This also lets us compare the resulting tree because the working copy
now exactly matches the tree (it used to be that the `.gitignore` file
wasn't initially snapshotted).
I guess we don't depend on `read_view()` ever returning `NotFound` the
way `read_operation()` does, but it seems like it still should return
`NotFound` when the view doesn't exist.
There's a comment saying that mutating the file's current state
simplifies later logic, but I don't think that's true. It might have
been true in the past, when we had `FileType::Conflict`.
We don't seem to have any tests that our protection from undetected
changes caused by writes happening right after checkout, so let's add
one. The test case loops 100 times and each iteration fails slightly
more than 80% of the time on my machine (if I remove the protection in
`TreeState::update_file_state()`), so it seems quite good at
triggering the race.
I don't see any reason for these tests to differ depending on backend,
so let's avoid running them twice (there are probably lots of other
tests that we're also running with different backends for little
reason).
When snapshotting a conflict, we read the contents and parse conflict
markers. If we determine that it's no longer a conflict, then we write
a normal file entry to the tree instead. When we do that, we re-read
the file from the working copy. Let's instead write the file contents
we already read.
When a file's mtime has changed on disk, we update our record of that
mtime, but we did so in a separate place for conflicts compared to
non-conflicts. Let's reuse it.
The working copy's current tree tracks whether a file is a
conflict. We also track that in the `TreeState` object. That allows us
to not read the trees object to decide if we should try to parse a
file as a conflict.
One disadvantage is that it's redundant information that needs to be
kept in sync with the tree object. Also, for Watchman, we would like
to completely ignore the persisted `FileState`.
This commit removes the `FileType::Conflict` variant and instead
checks in the tree object whether a given path was a conflict. This is
the change I mentioned in dc8a207737. We still skip the check
completely if the file's mtime etc. is unchanged, so it shouldn't have
much effect in the common case of a mostly unchanged working copy. I
measured a slowdown on `jj diff` by ~3% in the Linux repo with a clean
working copy with all mtimes bumped. I think the simpler code and
reduced risk of subtle bugs is worth the performance hit.
The idea is simple. New heads are ignored until the node dependency resolution
stuck. Then, only the first head that will unblock the visit will be queued.
Closes#242
The original idea was similar to Mercurial's "topo" sorting, but it was bad
at handling merge-heavy history. In order to render merges of topic branches
nicely, we need to prioritize branches at merge point, not at fork point.
OTOH, we do also want to place unmerged branches as close to the fork point
as possible. This commit implements the former requirement, and the latter
will be addressed by the next commit.
I think this is similar to Git's sorting logic described in the following blog
post. In our case, the in-degree walk can be dumb since topological order is
guaranteed by the index. We keep HashSet<CommitId> instead of an in-degree
integer value, which will be used in the next commit to resolve new heads as
late as possible.
https://github.blog/2022-08-30-gits-database-internals-ii-commit-history-queries/#topological-sorting
Compared to Sapling's beautify_graph(), this is lazy, and can roughly preserve
the index (or chronological) order. I tried beautify_graph() with prioritizing
the @ commit, but the result seemed too aggressively reordered. Perhaps, for
more complex history, beautify_graph() would produce a better result. For my
wip branches (~30 branches, a couple of commits per branch), this works pretty
well.
#242
This partially reverts 4c8f484278 "graphlog: key by commit id (not index
position)." As Martin pointed out, it made "log -r 'tags()' -T.." in git
repo super slow. Apparently, both clone() and hash map insertion/lookup costs
increased by that change. Since we don't need CommitId inside the graph
iterator, we can simply replace it with IndexPosition, and resolve it to
CommitId later.
Copied from MergedTree::has_conflict(). I feel it's slightly better since
RefTarget "is" always a Conflict-based type. It could be inverted to
RefTarget::is_resolved(), but refs are usually resolved, and all callers
have special case for !is_resolved() state.
`MergedTree` is now ready to be used when checking if a commit has
conflicts, and when listing conflicts. We don't yet a way for the user
to say they want to use tree-level conflicts even for these
cases. However, since the backend can decide, we should be able to
have our backend return tree-level conflicts. All writes will still
use path-level conflicts, so the experimentation we can do at Google
is limited.
Beacause `MergedTree` doesn't yet have a way of walking conflicts
while restricting it by a matcher, this will make `jj resolve` a
little slower. I suspect no one will notice.
Tree-level conflicts (#1624) will be stored as multiple trees
associated with a single commit. This patch adds support for that in
`backend::Commit` and in the backends.
When the Git backend writes a tree conflict, it creates a special root
tree for the commit. That tree has only the individual trees from the
conflict as subtrees. That way we prevent the trees from getting
GC'd. We also write the tree ids to the extra metadata table
(i.e. outside of the Git repo) so we don't need to load the tree
object to determine if there are conflicts.
I also added new flag to `backend::Commit` indicating whether the
commit is a new-style commit (with support for tree-level
conflicts). That will help with the migration. We will remove it once
we no longer care about old repos. When the flag is set, we know that
a commit with a single tree cannot have conflicts. When the flag is
not set, it's an old-style commit where we have to walk the whole tree
to find conflicts.
With `MergedTree`, we can iterate over conflicts by descending into
only the subdirectories that cannot be trivially resolved. We assume
that the trees have previously been resolved as much as possible, so
we don't attempt to resolve conflicts again.
This adds a function for resolving conflicts that can be automatically
resolved, i.e. like our current `merge_trees()` function. However, the
new function is written to merge an arbitrary number of trees and, in
case of unresolvable conflicts, to produce a `Conflict<TreeId>` as
result instead of writing path-level conflicts to the backend. Like
`merge_trees()`, it still leaves conflicts unresolved at the file
level if any hunks conflict, and it resolves paths that can be
trivially resolved even if there are other paths that do conflict.
In order to store conflicts in the commit, as conflicts between a set
of trees, we want to be able merge those trees on the fly. This
introduces a type for that. It has a `Merge(Conflict(Tree))` variant,
where the individual trees cannot have path-level conflicts. It also
has a `Legacy(Tree)` variant, which does allow path-level conflicts. I
think that should help us with the migration.
Alternatively, we can wrap BTreeMap<String, Option<RefTarget>> to flatten
Option<&Option<..>> internally, but doing that would be tedious. It would
also be unclear if map.remove(name) should construct an absent RefTarget if
the ref doesn't exist.
The next commit will change these maps to store Option<RefTarget> entries, but
None entries will still be omitted from the serialized data. Since ContentHash
should describe the serialized data, relying on the generic ContentHash would
cause future hash conflict where absent RefTarget entries will be preserved.
For example, ([remove], [None, add]) will be serialized as ([remove], [add]),
and deserialized to ([remove], [add, None]). If we add support for lossless
serialization, hash(([remove], [None, add])) should differ from the lossy one.
Summary: Unneeded with the MSRV bump. The other one
will have to wait until Rust 1.72.0
Signed-off-by: Austin Seipp <aseipp@pobox.com>
Change-Id: Ifb3d862aedd3f2aeb05a86ce76978d4f
Summary: Now that we have Rust 1.71.0 at our fingertips, the `map_first_last`
feature has been stabilized. That means we can get rid of the `jj-lib` build
script and also the `nightly_shims` module.
Signed-off-by: Austin Seipp <aseipp@pobox.com>
Change-Id: Ibb5ce3258818a2de670763fbbaf3c2e7
Summary: Let's be more aggressive about tracking the latest stable Rust release.
There's little benefit to being conservative so early on, especially when no
users seem to have faced any issue with upgrading, or strictly required an old
Rust version.
Right now, just lagging Rust by 1 major release probably seems fine. We're
targeting 1.71.0 to get ahead of the curve, since 1.72.0 will likely release
sometime before the next `jj` release.
Signed-off-by: Austin Seipp <aseipp@pobox.com>
Change-Id: I4e691b6ba63b5b9023a75ae0a6917672
Since RefTarget will be reimplemented on top of Conflict<Option<CommitId>>,
we won't be able to simply return a slice of type &[CommitId]. These functions
are also renamed in order to disambiguate from Conflict::adds()/removes().
We saw a failure in `test_merge_views_git_heads()` in the GitHub CI,
but I wasn't able to reproduce it locally. Using the
`commit_transactions()` helper in the test should fix it.
.unwrap() calls will be removed if we migrate RefTarget to new Conflict-based
type. Some of them were previously .unwrap_or_default(), but they would panic
anyway inside ref_target_from_proto().
is_none()/is_some() are wrapped as is_absent()/is_present() respectively.
If we see new RefTarget as a Conflict, an "absent" target isn't an emptyish
value like None, but an add of single [None]. It seemed a bit weird to call
such target is_none().
RefTarget type will be a wrapper around Conflict<Option<CommitId>>, and the
match arms would be more verbose. Fortunately, new code looks simpler since
we can merge non-conflicting cases.
It's named after Conflict::from_legacy_form(). If RefTarget is migrated to
new Conflict type, from_legacy_form([], [add]) will create a normal target,
and from_legacy_form([], []) will be equivalent to the current None target.
That's why this function isn't named as RefTarget::conflict().
If RefTarget is migrated to new Conflict type, this function will create
RefTarget(Conflict::resolved(Some(id))).
We still need some .unwrap() to insert Option<RefTarget> into map, but maps
will be changed to store new RefTarget type, and their mutation API will
guarantee that Conflict::resolved(None) is eliminated.
Perhaps, all view.get_<kind>(name) functions can return reference, but I'm
not willing to change the interface at this point. I'll revisit this after
migrating Option<RefTarget> to new Conflict-based type.
If we migrate RefTarget to new Conflict-based type, it won't store
Conflict<CommitId>, but Conflict<Option<CommitId>>. As the Option will
be internalized, new RefTarget type will also represent an absent target.
The 'target: Option<RefTarget>' argument will be replaced with new RefTarget
type.
I've also renamed the function for consistency with the following changes.
It would be surprising if set_local_branch(name, target) could remove the
branch. I feel the name set_local_branch_target() is less confusing.
This might look more complicated than the original code, but it clarifies
that we always eliminate a (remove, add) pair. The behavior slightly changed
since an absent ref (i.e. None) in "removes" can now be paired with an "add"
even if "removes" contained other ids. Before, it was possible only when
removes.is_empty().
The order of conflicted ids slightly changed since Conflict::simplify()
tries to preserve diff pairs. It shouldn't matter so long as the result is
stable.
Git CLI rejects it (though the data model would probably work fine with
"HEAD" branch.) This ensures that HEAD@git in JJ world is not an exported
branch named "HEAD".
This is (almost) a result of running
cargo +nightly clippy --workspace --all-targets --fix \
-- -A 'clippy::needless_raw_string_hashes'
with yesterday's nightly clippy.
https://github.com/mitsuhiko/insta/issues/389 causes numerous additional
`needless_raw_string_hashes` warnings, but it will hopefully be fixed soon.
For now, I recommend appending the second line to your invocations.
This will help to rewrite refs::merge_ref_targets() to leverage Conflict
type. I'm not pretty sure if we'll want to do that, but this change seems
also good for the existing code.
Almost everyone calls the project "jj", and there seeems to be
consensus that we should rename the crates. I originally wanted the
crates to be called `jj` and `jj-lib`, but `jj` was already
taken. `jj-cli` is probably at least as good for it anyway.
Once we've published a 0.8.0 under the new names, we'll release 0.7.1
versions under the old names with pointers to the new crates names.
It seemed too verbose to always include @remote branches, so synced remotes
are omitted by default. If the given symbol contained '@', all remote symbols
are populated so that the distance of remote fragment is taken into account.
Errors that may occur while loading backend would vary per backends, and
it's unlikely that these errors could be mapped to BackendError variants
other than BackendError::Other. So let's extract Other(_) of that kind as
a separate type to clarify there would be no other error variants.
Perhaps, Backend/Error will be renamed to CommitBackend/Error or
CommitStore/Error?, whereas I think BackendInit/LoadError can be shared
among store factories.
This is much simpler and I was slightly surprised that it doesn't have
much impact on performance. I tried `jj --ignore-working-copy diff -s
--from root --to v5.15` in the Linux kernel repo, and there was
perhaps a 1.5% slowdown (508 ms -> 515 ms). In more normal cases (like
diffing a single commit against its parent), I couldn't measure any
difference at all.
This helps to map initialization error to BackendError without too general
From impl. I don't think io::Error (or our PathError) should be automatically
translated to BackendError::Other because BackendError has more specific
variants depending on context. If the error is specific to initialization,
it makes sense to translate it to Other variant.
I don't think we'll want to record a label for each term, because such
labels would get stale, and it seems hard to make them make sense
after transferring a remote to another repo. I think we'll probably
want to infer labels on demand instead (#1176).
Perhaps, this would handle patterns like ["a(b", "c)"] better. It might not
be correct to error out on "(", but should be better than building wrong
regexp pattern "a(b|c)".
Now that we've replaced `MergeHunk` by a `Conflict`, it makes sense to
convert the input `Conflict<FileId>` by mapping each term. Unlike
`Option::map()` I made `Conflict::map()` take a reference `self`,
because it's not uncommon to want to map the same conflict multiple
times. I'm going to use that for producing a
`Conflict<Option<TreeValue>>` from a `Conflict<Tree>` and a set of
paths.
Since `Conflict`s can represent the resolved state, so
`Conflict<ContentHunk>` can represent the states that we use
`MergeHunk` for. `MergeHunk` does force the user to handle the
resolved case, which may be useful. I suppose one could use the same
argument for making `Conflict` an enum, i.e. if we think that
`MergeHunk`'s two variants are beneficial, then we should consider
making `Conflict` an enum with those two variants.
It's useful to have a more readable `Debug` format for `Vec<u8>`
(`"foo"` is better than `[102, 111, 111]`). It might also make types
in function signatures and elsewhere more readable.
This only parses the fields relevant to us, i.e.:
- name: the stable identifier of the submodule
- path: the path to the submodule in the current commit
- url: the remote we can clone the submodule from
The full list of .gitmodules fields can be found at
https://git-scm.com/docs/gitmodules.
This eliminates indirect access through Vec<u8> and improves cache locality
while sorting the index entries. We can achieve a similar result by using
SmallVec<[u8; 24]> in place of Commit/ChangeId(Vec<u8>), but we would have
to determine a reasonable id length across backends. Indexing [u8; 4] performs
better, at the cost of the API and implementation complexity.
For temporary Commit/ChangeId allocation in general, I think a borrowed type
like Path/PathBuf will help.
Testing with my "linux" repo, this saves ~670ms needed to initialize both
change id index and disambiguation indexes.
I'll rewrite resolve_prefix_range() to branch depending on the prefix length,
and the easiest way to do that is passing iterator to continuation function
instead of returning iterator as an either (or boxed) type.
I'm going to rewrite IdIndex to store only first few bytes of the key. A
separate table helps there.
At this point, it wouldn't make sense to convert usize to u32, but the new
index will store ([u8; 4], u32) pairs.
It allows us to build multiple IdIndex instances within a single loop. As the
final sorting is heavy operation, I don't want to implement Default + Extend
for IdIndex to be compatible with Iterator::unzip().
With my colocated "linux" repo, this appears to save ~50ms startup overhead.
Since the repo has lots of indirect tags, we can't eliminate tag object
loading at all. But still, it's faster than falling back to peel_to_commit().
It was convenient that what the git backend stored in its "extras"
table is exactly a subset of the fields that local backend stores, but
it's bit ugly and limiting. For example, it makes it possible to
populate the `author` field in the git extras, but that would have no
effect. It's better that it's not possible to do that (we store the
author field in the git commit, of course).
What made me notice this now was that I'm working on tree-level
conflicts (#1624) and I'm thinking of adding a field to the git extras
saying "this commit has single tree, but it's still a new-style
commit", so we can know not to walking such trees to find path-level
conflicts. That's only needed for the git backend because we don't
care about compatibility for the local backend.
If we allow `Conflict::simplify()` to swap the removes and adds as
freely as we currently do, we may present the user with a conflict
marker with a diff that has never appeared anywhere before the
simplification. That seems very confusing. Let's instead preserve the
diffs when we simplify conflicts.
It's a bit weird to simplify a conflict like `A B->C D->E C->F` to `A
B->E D->F` because it changes which diffs are in the conflict, but
that's what we currently do. Let's have a test for that.
We actually already have tests showing how `A B->C D->A` gets
simplified to `C B->D`, but those are less obviously weird because
when rendered as `removes = [B], adds = [C, D]`, it doesn't look that
different from the reverse.