The idea is that .as_composite() is equivalent to .as_index(), but for the
implementation type. I'm going to add "impl Index for CompositeIndex" to
clean up index references passed to revset engine.
This handles the basic case of where the matcher says that a whole
subtree is not matched. In the Linux repo, That's already enough to
speed up `jj --ignore-working-copy files samples` from 298 ms to 129
ms.
Note that one test changed because the new `trivial_merge()` is more
strict than the old algorithm. I don't think that's a problem because
5-way conflicts are not very common, and I prefer to be strict now and
possibly relax it later if we decide that we would prefer that.
All call paths already check before calling the function that the
condition is true. One caller - `tree::try_resolve_file_conflict()` -
checks it itself. The other caller -
`conflicts::materialize_merge_result()` - doesn't, but its callers
have checked it via `extract_file_conflict_as_single_hunk()`.
The deleted comment about empty strings seems to be obsolete since
e48ace56d1. The caller pads the inputs with empty strings since that
commit.
I think we should ideally change this function's signature to make it
impossible to call it with bad inputs, and I hope to get back to that
soon.
We already resolve merge conflicts between hunks, trees, and refs, and
maybe more. They each have their own code for the handling trivial
merges (where the output is equal to one of the inputs). They look
surprisingly different. This commit adds a generic function for doing
that. Curiously, this new implementation uses implements it in yet
another way (basically using a multi-set).
I've added a helper function because the construction of the range expression
is a bit noisy. It could be a Repo method, but I don't want to make it a
default implementation of the trait method.
revset::walk_revs() let the caller handle RevsetEvaluationError since the
evaluation engine may error out even with such a trivial query. For now, most
callers just .unwrap() the error as before.
Otherwise, "jj init --git-repo ." would create extra table files per commit,
and merge them.
I considered adding an explicit GitBackend method to be called from
git::import_refs(), but the call order matters. The method should be invoked
before calling store.get_commit(..) or mut_repo.add_head(..). Since commits
are likely to be loaded from the head, we can instead make read_commit()
import ancestor metadata at all.
Alternatively, we could make a Git commit hidden until it's inserted into
the extra table. It's rather big change, and I wouldn't like to do that
without thinking more thoroughly.
I'm going to extract a helper function that converts git2::Commit to
backend::Commit struct, and the commit id can also be obtained from the
git2::Commit object.
My first attempt was to fix up corrupted index when merging, but it turned
out to be not easy because the self side may contain corrupted data. It's
also possible that two concurrent commit operations have exactly the same
view state (because change id isn't hashed into commit id), and only the
table heads diverge.
#924
GitBackend will reuse this lock to not assign multiple change ids to a
single commit. We could add a separate lock file that covers the section
from get_head() to save_table(), but I think reusing the table lock is good
enough.
This bug concerns the way `import_refs` that gets called by `fetch` computes
the heads that should be visible after the import.
Previously, the list of such heads was computed *before* local branches were
updated based on changes to the remote branches. So, commits that should have
been abandoned based on this update of the local branches weren't properly
abandoned.
Now, `import_refs` tracks the heads that need to be visible because of some ref
in a mapping keyed by the ref. If the ref moves or is deleted, the
corresponding heads are updated.
Fixes#864
Now that we return the written commit from `write_commit()`, let's
make the timestamps match what was actually written, accounting for
the whole-second precision and the adjustment we do to avoid
collisions.
The internal backend at Google doesn't let you write any value you
want for in the committer field. The `Store` type still caches the
value it attempted to write, which gets a little weird when the
written value is not what we tried to write. We should use the value
the backend actually wrote. However, we don't know if the backend
changed anything without reading the value back, which is often
wasteful. This commit changes the API to return the written value.
I only changed the signature of `write_commit()` for now. Maybe we
should make a similar change to `write_tree()`.
This has several advantages:
* Makes it possible to downcast to non-Git custom backends (might be
useful at Google, but we haven't needed it yet)
* Lets us access more specific functionality on the `GitBackend`,
making it possible to access the `git2::Repository` without
creating a copy of it.
* Removes the dependency on Git from the backend
In large repos, the unique prefixes can get somewhat long (~6 hex
digits seems typical in the Linux repo), which makes them less useful
for manually entering on the CLI. The user typically cares most about
a small set of commits, so it would be nice to give shorter unique ids
to those. That's what Mercurial enables with its
`experimental.revisions.disambiguatewithin` config. This commit
provides an implementation of that feature in `IdPrefixContext`.
In very large repos, it can also be slow to calculate the unique
prefixes, especially if it involves a request to a server. This
feature becomes much more important in such repos.
I would like to copy Mercurial's way of abbreviating ids within a
user-configurable revset. We would do it for both commit ids and
change ids. For that feature, we need a place to keep the set of
commits the revset evaluates to. This commit adds a new
`IdPrefixContext` type which will eventually be that place. The new
type has functions for going back and forth between full and
abbreviated ids. I've updated the templater to use it.
When creating `RevsetExpression` programmatically, I think we should
use commit ids instead of symbols in the expression. This commit adds
a check for that by using a `SymbolResolver` that always errors
out.
I would eventually want the `SymbolResolver` to be customizable (in
custom `jj` binaries), so we want to make sure we always use the
customized version of it.
I left `RevsetExpression::resolve()` unchanged. I consider that to be
for programmatically created expressions.
I'd like to make the symbol resolution more flexible, both so we can
support customizing it (in custom `jj` binaries) and so we can use it
for resolving short prefixes within a small revset.
I plan to add `revsets.short-prefixes` and `revsets.immutable` soon,
and I think `[revsets]` seems like reasonable place to put them. It
seems consistent with our `[templates]` section. However, it also
suffers from the same problem as that section, which is that the
difference between `[templates]` and `[template-aliases]` is not
clear. We can decide about about templates and revsets later.
The current behavior was introduced by 20eb9ecec1 "git: don't abandon
HEAD commit when it loses a branch." While the change made HEAD mutation
behavior more consistent with a plain ref operation, HEAD can also move on
checkout, and checkout shouldn't be considered a history rewriting operation.
I'm not saying the new behavior is always correct, but I think it's safer
than losing old HEAD branch. I also think this change will help if we want
to extract HEAD management function from git::import_refs().
Fixes#1042.
Allows automatic recovery when encountering stale lockfiles, and more
efficient blocking rather than polling for fresh ones. The previous
implementation is preserved for other platforms.
There were two issues on my end:
1. `known_hosts` doesn't seem to be recognized
2. SSH Agent is ignored despite running
A workaround for 1. is to set the HOME environment variable on Windows, so I added a hint to suggest this. Ideally we would add a `certificate_check` callback to the remote callbacks, but the git2 crate doesn't expose whether the certificate check already succeeded, which makes it useless for this purpose (as we'd be prompting users to accept a certificate even though that certificate is already known to be valid).
As for 2., I changed the behavior from "check SSH Agent if some env variables exist" to "check SSH Agent and only fail if some env variables exist". On Windows SSH Agent doesn't use these env variables (but trying to communicate with it will still work), so now Windows properly works with SSH Agent.
When using a sparse working copy (e.g. with no files at all) and
updating the working copy from the root commit to a commit with
millions of files, we shouldn't have to walk the parts of the diff
that doesn't match the sparse patterns. However, we still do the full
walk because our `Tree::diff()` currently doesn't care about what the
matcher tells us to visit, it only filters out unwanted files after
visiting them. This commit fixes that for the special (but common)
case of matching nothing in a directory.
I tried also adding special handling for when the matcher says that we
should only visit a few entries, but it wasn't clearly better in the
cases I tested it on. I'll keep that patch around and might send it if
I find some cases where it helps.
We could add `walk.descendants(root_positions)` method, and apply
`.filter_by_generation(range)`, but queue-based `.descendants()` would be
slower than the one using reachable set. So I didn't add such method.
I also considered reimplementing non-lazy version of this function without
using the current RevWalkGenerationRange, but it appears the current iterator
version performs well even if we have to do .collect_vec() and .reverse().
This helps to extract a trait that abstracts CompositeIndex and descendants
map. Since the entry type E is a newtype wrapper, there wouldn't be runtime
cost.
I'm going to add a RevWalk method to walk descendants with generation filter,
which will use this helper method. RevWalk::take_until_roots() uses .min()
instead of .last() since RevWalk shouldn't know the order of the input set.
Before, the number of the generations to track would increase at each merge
point. This was really bad for queries like ':@--' in merge-heavy history,
but I didn't notice the problem because ancestors query is lazy and
the default log template is slow. Since I'm going to reuse RevWalk for
'roots++:' queries, which can't be lazy, I need to fix this problem first.
As we don't have a revset expression to specify exact generation range,
gen.end is initialized to either 1 or close to u32::MAX. So, this change
means long-lived generation ranges will eventually be merged into one.
The substitution rule and tests are copied from ancestors/parents. The backend
logic will be reimplemented later. For now, it naively repeats children().
This is a minimal change to replace Children with Descendants. A generation
parameter could be added to RevsetExpression::DagRange, but it's not needed
as of now.
I'm going to add generation parameter to Children/DagRange nodes, and
'Children { .. }' will be substituted to 'DagRange { .., gen: 1 }'. This
commit helps future code move.
Lifetime bounds of the arguments are unnecessarily restricted. It appears
walk_ancestors_until_roots() captures arguments lifetime on rustc 1.64.0.
I think the problem will go away if walk_*() functions are extracted to
RevWalk methods where input arguments will become less generic.
It doesn't matter, but can simplify the function interface. I'll probably
extract this function to RevWalk so the descendants with/without generation
filter can be tested without using revset API.
It no longer needs to be on the `Index` trait, thereby removing the
last direct use of `IndexEntry` in the trait (it's still used
indirectly in `walk_revs()`).
Now we have 4 callers, I concluded this is common enough to add an
extension method. Still I think it's preferred to define config items in
src/config/*.toml if possible. It will catch typo of config keys.
A chain of 4 billion commits is a lot, but it's not out of the
question, so let's support it. The current default index will not be
able to handle that many commits, so I let that still use 32-bit
integers.
New ResolvedExpression enum ensures that the evaluation engine doesn't have
to know the symbol resolution details. In this commit, I've moved Filter
and NotIn resolution to resolve_visibility(). Implicit All/VisibleHeads
resolution will be migrated later.
It's tempting to combine resolve_symbols() and resolve_visibility() to get
rid of panic!()s, but the resolution might have to be two passes to first
resolve&collect explicit commit ids, and then substitute "all()" with
"(:visible_heads())|commit_id|..". It's also possible to apply some tree
transformation after symbol resolution.
This makes it clear what should be resolved at resolve_symbols(). Symbol
is a bit special while parsing function arguments, but it's no different
than the other unresolved references at expression level.
I'm thinking of transforming RevsetExpression to a enum dedicated for
the evaluation stage. To help the migration, I want to remove the use of
the RevsetExpression builder API from the evaluation engine.
Fewer virtual dispatch is also better.
The `heads()` revset function with one argument is the counterpart to
`roots()`. Without arguments, it returns the visible heads in the
repo, i.e. `heads(all())`. The two use cases are quite different, and
I think it would be good to clarify that the no-arg form returns the
visible heads, so let's split that out to a new `visible_heads()`
function.
This basically removes hidden 'all() &' from union/negation of filters. To
achieve that, I have two options: 1. add separate evaluation path (like the
one this commit introduced), or 2. wrap "all()" revset to override predicate
as Box::new(|_| true) function. I took the former since it's less ad-hoc.
We can add an explicit RevsetExpression node to branch between evaluate()
and evaluate_predicate(), but I don't think it would simplify the
implementation at this point. We might need such node if we want to resolve
"all()" at resolve_symbols(). It might be even better to extract a subset of
RevsetExpression enum, which only contains evaluatable nodes.
The cost of 'all() &' isn't significant for most filters. '~merges()' is
the exception. For jj repo,
revsets/:v0.3.0 & (author(martinvonz) | committer(martinvonz))
--------------------------------------------------------------
base 1.06 11.2±0.04m
new 1.00 10.5±0.05m
revsets/~merges()
-----------------
base 1.69 750.0±8.47µ
new 1.00 444.1±3.50µ
I think this is more readable, and apparently it produces slightly better code
maybe because the compiler can determine that there are no unwanted markers.
Since filter is slow in general, its input set should be minimized. This has
measurable impact on artificial query like '~(v0.4.0..) & author(_)'. If it
were evaluated as a difference of sets, all commits would have to be loaded.
Since resolve_symbols() now removes Present(_) node, it make sense to
handle symbol resolution error there. That's why I added a "pre" callback
to try_transform_expression().
Perhaps, "operation" scope (#1283) can be implemented in a similar way,
(but somehow need to resolve operation id and call repo.reload_at(op).)
This makes it clear that RevsetExpression::Present node is noop at the
evaluation stage.
RevsetEvaluationError::StoreError is unused right now, but I'm not sure if
it should be removed. It makes some sense that evaluate() can propagate
StoreError as it has access to the store.
The `Repo` is a higher-level type that the index shouldn't have to
know about. With this change, a custom revset implementation should be
able evaluate the revset on a server without knowing which repo it
refers to.
We already pass a `CompositeIndex` to
`default_revset_engine::evaluate()` so let's use that wherever we
currently use `repo.index()`. That will help us remove the `repo`
argument, and it will also let us internal types (like `IndexEntry`)
in the index methods we call.
I'm about to replace the `&dyn Repo` argument by several smaller
types, and it's easier to collect those in a single context type than
to pass them separately as arguments.
I also moved `revset_for_commit_ids()` and `take_latest_revset()` onto
the new type because it was easy. `build_predicate_fn()` and
`has_diff_from_parent()` ran into some lifetime issue when I tried.
The `public_heads()` revset only contains the root commit in
practice. I'm not sure what we want to do about phases, but since we
don't have any real support for them yet, let's just remove this
revset. I didn't update the changelog because we don't seem to have
documented the revset function (and it seems unlikely that users who
found out about it found it useful enough to use it when they could
just use `root`).
The `ProtoOpStore` was separated out to simplify the migration from
Thrift. Now that the `ThriftOpStore` is gone, we can inline
`ProtoOpStore` as the TODO says.
Inline diffs on multi-byte UTF-8 characters would match individual
bytes, causing garbled diffs in some cases. For example, replacing
`⊢` with `⊣`, which differ in the final byte only, caused the
diff to display a diff of the bytes instead the character.
This commit uses a workaround present in Mercurial by treating all
bytes 0x80 and above as word characters, causing any multi-byte
character to be treated as a word and not segmented.
https://www.mercurial-scm.org/repo/hg/file/6.3.3/mercurial/patch.py#l51
This serves the role of limit() in Mercurial. Since revsets in JJ is
(conceptually) an unordered set, a "limit" predicate should define its
ordering criteria. That's why the added predicate is named as "latest".
Closes#1110
There are no remaining places where we iterate over a revset and need
the `IndexEntry`s, so we can now make `Revset::iter()` yield
`CommitId`s instead.
I'm about to make `Revset::iter()` yield just `CommitId`s, but the
tests in `test_default_revset_graph_iterator.rs` need an `IndexEntry`
iterator so they can pass it into `RevsetGraphIterator::new()`. This
commits prepares for the change by adding a
`RevsetImpl::iter_graph_impl()` that returns `RevsetGraphIterator`,
keeping `InternalRevset` still hidden within the revset engine. We
could instead have made that (and `ToPredicateFn`) visible to tests. I
can't say which is better.
I don't know if we ever resolve revsets in a mutable repo, but now
that we can get a change id index from a revset, it's easier to
implement this functionality that way.
This replaces the direct use of `IdIndex` in `ReadonlyRepo` by use of
`Revset::change_id_index()`.
I made the `Index` trait require `Send` and `Sync` in order to be able
to store an instance of it in `ReadonlyRepo` (via `ChangeIdIndex`) and
still have that be `Send` and `Sync`. We could alternatively store the
`ChangeIdIndex` in a `Mutex`. Now that will be up to the
`ChangeIdIndex` instead.
One of the remaining places we depend on index positions is when
creating a `ChangeIdIndex`. This moves that into the revset engine
(which is coupled to the commit index implementation) by adding a
`Revset::change_id_index()` method. We will also use this function
later when add support for resolving change id prefixes within a small
revset.
The current implementation simply creates an in-memory index using the
existing `IdIndex` we have in `repo.rs`.
The custom implementation at Google might do the same for small
revsets that are available on the client, but for revsets involving
many commits on the server, it might use a suboptimmal implementation
that uses longer-than-necessary prefixes for performance reasons. That
can be done by querying a server-side index including changes not in
the revset, and then verifying that the resulting commits are actually
in the revset.
When deciding the order to visit commits to rebase, we currently look
up parents in the index. I'm trying to remove the current `IndexEntry`
type and will probably have revsets iterators yield simply
`CommitId`. Let's therefore look up commit objects here.
I timed this by rewriting all commits in the jj repo. I couldn't
measure any difference. That makes sense since we cache the commits in
`Store` and we would read the commit when rebasing it anyway.
The function is only used in tests, so it doesn't belong in
`default_revset_engine`. Also, it's not specific to that
implementation, so I rewrote as a revset evaluation.
I'd like to be able to pass a `self` of `type `&ReadonlyRepo` to
functions that take a `&dyn Repo`. For that, we need `ReadonlyRepo`
itself to implement `Repo` instead of having `Arc<ReadonlyRepo>`
implement it. I could have solved it in a different way, but the `Arc`
requirement seems like an unnecessary constraint.
The functions resolving a change id to commits currently return a
`Vec<IndexEntry>`. We want to avoid depending on `IndexEntry` and we
only need the commit ids here.
The index position is specific to the default index implementation and
we don't want to use it in outside of there. This commit removes the
use of it as a key for nodes in the graphlog.
I timed it on the git.git repo using `jj log -r 'all()' -T commit_id`
(the worst case I can think of) and it slowed down from ~2.02 s to
~2.20 s (~9%).
Since we hid the graph iterator implementation behind
`Revset::iter_graph()`, I don't think we have any callers of
`Revset::iter()` require the iteration to be in index position order,
so let's not promise that. We do want to promise that the iteration is
in topological order with children before parents, however.
We need 1.64 to bump `clap` to `4.1`. We don't really need to upgrade
to that, but being on an older version causes minor confusions like
#1393. Rust 1.64 is very close to 6 months old at this point.
For large repos, it's useful to be able to use shorter change id and
commit id prefixes by resolving the prefix in a limited subset of the
repo (typically the same subset that you'd want to see in your default
log output). For very large repos, like Google's internal one, the
shortest unique prefix evaluated within the whole repo is practically
useless because it's long enough that the user would want to copy and
paste it anyway.
Mercurial supports this with its `revisions.disambiguatewithin` config
(added in https://www.mercurial-scm.org/repo/hg/rev/503f936489dd). I'd
like to add the same feature to jj. Mercurial's implementation works
by attempting to resolve the prefix in the whole repo and then, if the
prefix was ambiguous, it resolves it in the configured subset
instead. The advantage of doing it that way is that there's no extra
cost of resolving the revset defining the subset if the prefix was not
ambiguous within the whole repo. However, there are two important
reasons to do it differently in jj:
* We support very large repos using custom backends, and it's probably
cheaper to resolve a prefix within the subset because it can all be
cached on the client. Resolving the prefix within the whole repo
requires a roundtrip to the server.
* We want to be able to resolve change id prefixes, which is always
done in *some* revset. That revset is currently `all()`, i.e. all
visible commits. Even on local disk, it's probably cheaper to
resolve a small revset first and then resolve the prefix within that
than it is to build up the index of all visible change ids.
We could achieve the goal by letting each revset engine respect the
configured subset, but since the solution proposed above makes sense
also for local-disk repos, I think it's better to do it outside of the
revset engine, so all revset engines can share the code.
This commit prepares for the new functionality by moving the symbol
resolution out of `Index::evaluate_revset()`.
We want to allow custom revset engines define their own graph
iterator. This commit helps with that by adding a
`Revset::iter_graph()` function that returns an abstract iterator.
The current `RevsetGraphIterator` can be configured to skip or include
transitive edges. It skips them by default and we don't expose option
in the CLI. I didn't bother including that functionality in the new
`iter_graph()` either. At least for now, it will be up to the
implementation whether it includes such edges (it would of course be
free to ignore the caller's request even if we added an option for it
in the API).
This commit adds an `evaluate_revset()` function to the `Index`
trait. It will require some further cleanup, but it already achieves
the goal of letting the index implementation decide which revset
engine to use.
We want to allow customization of the revset engine, so it can query
server indexes, for example. The current revset implementation will be
our default implementation for now. What's left in the `revset` module
after this commit is mostly parsing code.
Now that there's a single implementation of `Revset`, I think it makes
more sense for `is_empty()` to be defined there. Maybe different
revset engines have different ways of implementing it. Even if they
don't, this is trivial to re-implement in each revset engine.
As the comment above `ToPredicateFn` says, it could be a private
type. This commit makes that happen by making the private `Revset`
implementations (`DifferenceRevset` etc.) instead implement an
internal revset type called `InternalRevset`. That type is what
extends `ToPredicateFn`, so the public type doesn't have to. The new
type will not need to implement the new functions I'm about to add to
the `Revset` trait.
We don't want the public `Revset` interface to know about
`ToPredicateFn`. In order to hide it, I'm wrapping the internal type
in another type, so only the internal type can keep implementing
`ToPredicateFn`.
I'd like to be able to change the return type of `evaluate_revset()`
to be an internal type. Since all external callers currently call the
function via `RevsetExpression::evaluate()`, it turns out it's easy to
make it private. To benefit from an internal type, we also need to
make the recursive calls be directly to the internal function.
The tests adding and removing heads to the repo mostly want to verify
that the set of heads is expected. Some of them also check that
commits are available in the index. But they shouldn't care about the
exact index stats.
I don't think there's much to gain from making the index match exactly
what's reachable from the view. FWIW, our cloud-based implementation
at Google will probably make everyone's commits visible in the index
regardless of which operation they're at.
We don't want custom index implementations to have to conform to the
same kind of stats as the default implementation. This commit also
makes the command error out on non-default index types.
I broke the commands in a27da7d8d5 and thought I just fixed it in
c7cf914694a8. However, as I added a test, I realized that I made it
only reindex the commits since the previous operation. I meant for the
command to do a full reindexing of th repo. This fixes that.
I broke `jj debug reindex` in a27da7d8d5. From that commit, we no
longer delete the pointer to the old index, so nothing happens when we
reload the index. This commit fixes that, and also makes the command
error out if run on a repo with a non-default index type.
This is yet another step towards making the index pluggable. The
`IndexStore` trait seems reasonable after this commit. There's still a
lot of work to remove `IndexPosition` from the `Index` trait.
I didn't make `ReadonlyIndex` extend `Index` because it needed an
`as_index()` to convert to `&dyn Index` trait object
anyway. Separating the types also gives us flexibility to implement
the two traits on different types.
Not all index implementations may want to store the readonly index
implementation in an Arc. Exposing the Arc in the interface is also
problematic because `Arc<IndexImpl>` cannot be cast to `Arc<dyn
Index>`.
These two files are closely related, and `Index` and `IndexStore` are
expected to be customized together, so it seems better to keep them in
a single file.
This is another step towards allowing a custom `jj` binary to have its
own index type. We're going to have a server-backed index
implementation at Google, for example.
This is a step towards making the index storage pluggable. The
interface will probably change a bit soon, but let's start with
functions that match the current implementation.
I called the current implementation the `DefaultIndexStore`. Calling
it `SimpleIndexStore` (like `SimpleOpStore` and `SimpleOpHeadsStore`)
didn't seem accurate.
In `git_fetch()`, any glob present in `globs` is an "allow" mark. Using
`&[]` to represent an "allow-all" may be misleading, as it could
indicate that no branch (only the git HEAD) should be fetched.
By using an `Option<&[&str]>`, it is clearer that `None` means that
all branches are fetched.
Using &[String] forces the caller to materalize owned strings if they
have only references, which is costly. Using &[&str] makes it cheap
if the caller owns strings as well.
To be able to make e.g. `jj log some/path` perform well on cloud-based
repos, a custom revset engine needs to be able to see the paths to
filter by. That way it is able pass those to a server-side index. This
commit helps with that by effectively converting `jj log -r foo
some/path` into `jj log -r 'foo & file(some/path)'`.