The added comparison functions correspond to --ignore-all-space and
--ignore-space-change. --ignore-space-at-eol can be combined with the other
flags, so it will have to be implemented as a preprocessing function.
--ignore-blank-lines will also require some change in the tokenizer function.
This could be implemented as a newtype `Wrapper<'a>(&'a [u8])`, but a lifetime
of the wrap function couldn't be specified correctly:
fn diff(left: &[u8], right: &[u8], wrap_fn: F, ..)
where
F: for<'a> Fn(&'a [u8]) -> W<'a>, // F::Output<'a> can't be specified
W: Copy + Eq + Hash
If the wrapper were of `&Wrapper([u8])` type, `Fn(&[u8]) -> &W` works. However,
it means we can no longer set comparison parameter (such as Regex) dynamically.
Another idea is to add some filter function of `Fn(&[u8]) -> Cow<'_, [u8]>`
type, but I don't think we would want to pay the allocation cost in
hashing/comparison code. `Fn(&[u8]) -> impl Iterator<Item = &[u8]>` might work,
but it would be equally complex.
We'll use low-level HashTable to customize Eq/Hash without implementing newtype
wrappers.
Unneeded default features are disabled for now. Note that the new default
hasher, foldhash, is released under the Zlib license, which isn't currently
included in the allow list.
Most collection references implement `.into_iter()` or its mutable version,
so it is possible to iterate over the elements without using an explicit
method to do so.
We have two options to achieve "diff --ignore-*-space":
a. preprocess contents to be diffed, then translate hunk ranges back
b. add hooks to customize eq and hash functions
I originally thought (a) would be easier, but actually, there aren't many
changes needed to implement (b). And (b) should have a fewer logic errors.
This patch removes assumption that each unchanged region has the same content
length. It won't be true if whitespace characters are ignored.
Intersection of unchanged ranges becomes a simple merge-join loop, so I've
removed the existing tests. I also added a fast path for the common 2-way
diffs in which we don't have to build vec![(pos, vec![pos])].
One source of confusion introduced by this change is that WordPosition means
both global and local indices. This is covered by the added tests, but I might
add separate local/global types later.
It's silly that we build new Vec for each recursion stack and merge elements
back. I don't see a measurable performance difference in the diff bench, but
this change will help simplify the next patch. If a result vec were created for
each unchanged_ranges() invocation, it would probably make more sense to return
a list of "local" word positions. Then, callers would have to translate the
returned positions to the caller's local positions.
We can assign a unique integer to each (word, occurrence) pair instead. As a
bonus, HashMap can be replaced with Vec.
```
group new old
----- --- ---
bench_diff_git_git_read_tree_c 1.00 72.5±3.25µs 1.08 78.5±0.48µs
bench_diff_lines/modified/10k 1.00 45.1±1.18ms 1.10 49.8±1.85ms
bench_diff_lines/modified/1k 1.00 4.1±0.07ms 1.11 4.5±0.34ms
bench_diff_lines/reversed/10k 1.00 19.0±0.12ms 1.12 21.2±1.26ms
bench_diff_lines/reversed/1k 1.00 558.5±37.42µs 1.17 655.6±16.27µs
bench_diff_lines/unchanged/10k 1.00 5.3±0.78ms 1.33 7.0±0.89ms
bench_diff_lines/unchanged/1k 1.00 422.0±16.68µs 1.28 540.7±13.96µs
```
I'll add a few more helper methods there. It might also make sense to cache
precomputed hash values.
unchanged_ranges() is made private since there are no external callers, and
I'm going to add more private types.
See discussion thread in linked issue.
With this PR, all revset functions in [BUILTIN_FUNCTION_MAP](8d166c7642/lib/src/revset.rs (L570))
that return multiple values are either named in plural or the naming is hard to misunderstand (e.g. `reachable`)
Fixes: #4122
Stacking at AliasExpanded node looks wonky. If we migrate error handling to
Diagnostics API, it might make sense to remove AliasExpanded node and add
node.aliases: vec![(id, span), ..] field instead.
Some closure arguments are inlined in order to help type inference.
I recently got random test_commit_parallel*() failures, and this patch appears
to fix the problem.
SimpleOpHeadsStore::update_op_heads() adds new_id file and removes old_ids
files in that order. It ensures that there exists at least one id file, but it
doesn't mean readdir() can observe the added id file.
1. process A: get_op_heads() -> opendir()
2. process B: update_op_heads() -> add_op_head(new_id), remove_op_head(old_id)
3. process A: -> readdir() (can miss new_id)
update_op_heads() could do rename(old_ids[0], new_id), but I don't remember if
readdir() can always pick up a renamed entry.
I think it's slightly easier to follow if we calculate a diff of input bits
first. I don't know which one is faster, but I assume compiler can optimize to
similar instructions.
Comparing each byte before comparing the nibbles is more efficient. A
benchmark comparing the old and new implementations with various common
prefix lengths shows:
```
Common hex len/old/3 time: [7.5444 ns 7.5807 ns 7.6140 ns]
Common hex len/new/3 time: [1.2100 ns 1.2144 ns 1.2192 ns]
Common hex len/old/6 time: [11.849 ns 11.879 ns 11.910 ns]
Common hex len/new/6 time: [1.9950 ns 2.0046 ns 2.0156 ns]
Common hex len/old/32 time: [63.030 ns 63.345 ns 63.718 ns]
Common hex len/new/32 time: [6.4647 ns 6.4800 ns 6.4999 ns]
```
Jujutsu's branches do not behave like Git branches, which is a major
hurdle for people adopting it from Git. They rather behave like
Mercurial's (hg) bookmarks.
We've had multiple discussions about it in the last ~1.5 years about this rename in the Discord,
where multiple people agreed that this _false_ familiarity does not help anyone. Initially we were
reluctant to do it but overtime, more and more users agreed that `bookmark` was a better for name
the current mechanism. This may be hard break for current `jj branch` users, but it will immensly
help Jujutsu's future, by defining it as our first own term. The `[experimental-moving-branches]`
config option is currently left alone, to force not another large config update for
users, since the last time this happened was when `jj log -T show` was removed, which immediately
resulted in breaking users and introduced soft deprecations.
This name change will also make it easier to introduce Topics (#3402) as _topological branches_
with a easier model.
This was mostly done via LSP, ripgrep and sed and a whole bunch of manual changes either from
me being lazy or thankfully pointed out by reviewers.
`jj new <commit>` automatically adds the checked out commits into the view head ids. However,
`jj edit` does not.
To reproduce:
```
jj git init test
cd test
jj commit -m "my commit"
jj log -r @- -T commit_id # Save the id
jj abandon -r @-
jj edit <saved_id>
jj log -r :: # Does not show the currently editing commit
```
It's a pretty frequent request to have support for turning off
auto-tracking of new files and to have a command to manually track
them instead. This patch adds a `snapshot.auto-track` config to decide
which paths to auto-track (defaults to `all()`). It also adds a `jj
track` command to manually track the untracked paths.
This patch does not include displaying the untracked paths in `jj
status`, so for now this is probably only useful in colocated repos
where you can run `git status` to find the untracked files.
#323
When loading repos on a server, there is no repo path. We currently
use a placeholder at Google. This patch will let us not do that.
I think we can remove the paths from `Workspace` too, but I'll leave
that for later, if at all.
In `TestBackend`, we simply use the path as key to look up the storage
in memory. That means we can't rely on the file system to look find
the same path given two different representaitons of the same path. We
therefore need to canonicalize the paths if we want to allow the
caller to pass in different paths for the same real path.
`RepoLoader` can be used on a server. You would then call
`RepoLoader::new()`. Let's clarify what `init()` does by renaming it
and documenting it.
I think `WorkspaceLoader`, OTOH, is only useful for loading a
workspace from disk. I also documented that.
We were passing the max file size to snapshot to
`WorkingCopy::snapshot()` via `UserSettings`. It's simpler and more
flexible to set it on `SnapshotOptions` instead.
We had both `repo()` and `mut_repo()` on `Transaction` and I think it
was easy to get confused and think that the former returned a
`&ReadonlyRepo` but both of them actually return a reference to
`MutableRepo` (the latter obviously returns a mutable reference). I
hope that renaming to the more idiomatic `repo_mut()` will help
clarify.
We could instead have renamed them to `mut_repo()` and
`mut_repo_mut()` but that seemed unnecessarily long. It would better
match the `mut_repo` variables we typically use, though.
I'm thinking of adding an option to embed operation diffs in "op log", and
"op log" shouldn't fail at the root operation. Let's make "op diff"/"show"
also work for consistency.
This doesn't provide any benefit yet bit I think we've known for a
while that we want to make the backend write methods async. It's just
not been important to Google because we have the local daemon process
that makes our writes pretty fast. Regardless, this first commit just
changes the API and all callers immediately block for now, so it won't
help even on slow backends.
This helps resolve diverged refs by abandoning both sides:
D ref = [D]
|\
| C C ref = [B - D + C]
| | |
B | B | B ref = [B - D + A]
|/ |/ |
A A A A ref = [A - D + A]
This is closer to the original behavior before 5e8d7f8c "rewrite: update
references after rewriting all commits." References can move to divergent
commits, so they should propagate further if there are more rewrites. See
the inline comment for subtle behavior difference.
We could instead replay parent_mapping in topological order, but we would
still need to flatten abandon records.
"Concurrent" operations are not necessarily actually concurrent, so
"divergent" seems like a better name. And "reconcile" seems like a
better term for merging them, though we also sometimes use "merge".
Like https://github.com/martinvonz/jj/pull/4189, this allows extensions the ability to load the repo in an environment where the local filesystem is not accessible. This change allows such extensions to exist at the CLI layer where jj is invoked as a subprocess, rather than a library (common in testing).
I just choose "clru" because it already exists in our dependency tree thorough
gix. I don't think LRU is the best cache eviction policy for our use case (a
simpler FIFO-based one might be good enough?), but it wouldn't matter for CLI
or GUI use case. I don't see significant performance degradation with "jj log
--stat -n1000".
RwLock is replaced with Mutex since get() is inherently a mutable operation.
If merge-heavy history was abandoned, intermediate parent chains can have tons
of duplicates, and the process explodes soon. Instead, we can skip any parent
ids that have been remapped.
We can no longer detect cycles reliably, but I think that's okay so long as
the function terminates.
Fixes#4352
FileConflict will be changed to not materialize Merge<BString>. I also updated
the revset engine to ignore non-file conflict. It doesn't make sense to grep
conflict description.
The native backend currently errors out if you ask it about copies. So
does the test backend. I think it's better to return an empty stream
of copies so it doesn't prevent other functionality.
Not all callers need this information, but I assumed it's relatively cheap to
look up the source path in the target tree compared to diffing.
This could be represented as Regular(_)|Copied(_, _)|Renamed(_, _), but it's
a bit weird if Copied and Renamed were separate variants. Instead, I decided
to wrap copy metadata in Option.
This patch adds accessor methods as I'm going to change the underlying data
types. Since entry values are consumed separately, these methods are implemented
on CopiesTreeDiffEntryPath, not on *TreeDiffEntry.
I plan to provide a richer version of `TreeDiffEntry` with copy info
(and to make `TreeDiffEntry` itself "poorer"). Most callers want to
know about copies/renames, but at least working copy implementations
probably don't. This patch adds separate `diff_stream()` and
`diff_stream_with_copies()` so we can provide the simpler interface
for callers that don't need copy info.
The support for copy tracing is already simply added to the stream
just before yielding the item, so we can easily implement it as a
stream adapter. That ensures that we use the same logic for the
iterator- and stream-based versions. More importantly, it enables
further cleanups and a simpler interface.
So that more tests can leverage diff::diff() helper.
I also removed the fast path for identical inputs. This function is only used by
tests and benches, and production code usually compares content hashes first.
The tree-level conflicts have worked well in practice and we don't
want to allow users to use legacy trees for new commits. We don't
really support legacy trees very well since 0590f8bece anyway.
I'm going to split color-words diffs to by_line() and by_word() stages.
Perhaps, Diff::default_refinement() can be removed once all non-test callers
are migrated.
I'm thinking of adding some heuristics to render hunks containing lots of
small word changes differently, in a similar manner to the unified diffs. This
patch might help add some pre/post-processing at consumer.
files::diff() is inlined to caller to get around 'self borrowing.
When rebasing a new child commit on top of the moved commit(s), the
order of the new child commit's parent commits is now correctly
preserved if the original parent commit is now a parent of the moved
commit(s).
Closes#3969.
- add support for copy tracking to `diff --stat`
- switch `--summary` to match git's output more closely
- rework `show_diff_summary` signature to be more consistent
Add home directory expansion for SSH key filepaths. This allows the
`signing.key` configuration value to work more universally across both
Linux and macOS without requiring an absolute path.
This moved and renamed the previous `expand_git_path` function to a more
generic location, and the prior use was updated accordingly.
Perhaps, we should also cache merged trees, but this patch saves time until
we implement the bookkeeping. Even if we had a cache, it wouldn't be ideal to
calculate uncached merged trees during revset evaluation.
```
% hyperfine --sort command --warmup 3 --runs 10 -L bin jj-1,jj-2 \
'target/release-with-debug/{bin} -R ~/mirrors/git --ignore-working-copy \
log -r "::@ & file(root:builtin)" --no-graph -n50'
Benchmark 1: target/release-with-debug/jj-1 ..
Time (mean ± σ): 3.512 s ± 0.014 s [User: 3.391 s, System: 0.119 s]
Range (min … max): 3.489 s … 3.528 s 10 runs
Benchmark 2: target/release-with-debug/jj-2 ..
Time (mean ± σ): 1.351 s ± 0.010 s [User: 1.275 s, System: 0.074 s]
Range (min … max): 1.332 s … 1.366 s 10 runs
Relative speed comparison
2.60 ± 0.02 target/release-with-debug/jj-1 ..
1.00 target/release-with-debug/jj-2 ..
```
I'll add conflict resolution there.
This change adds more synchronization points, which is probably bad for
concurrency. However, this module is a revset engine for the default index,
so the store backends are supposed to be fast local disks.
I'll add a public function that resolves file conflicts. This function will
take owned MergedTreeValue, and that's why the extracted function returns
None instead of cloning the passed value.
MergedTreeVal was roughly equivalent to Merge<Option<Cow<_>>. As we've dropped
support for the legacy trees, it can be simplified to Merge<Option<&_>>.
try_resolve_file_conflict() is also updated. It could be a generic function,
but there are only two callers, and the legacy tree one is used only in tests.
For the same reason as 2cb7e91d "merged_tree: do not re-look up non-conflicting
tree values by name." This appears to bring a similar performance improvement.
I assume this change is/will be covered by test_merged_tree.rs. I considered
adding a few unit tests, but constructing Tree object isn't trivial, and the
iterator implementation is relatively straightforward.
- force each diff command to explicitly enable copy tracking
- enable copy tracking in diff_summary
- post-process for diff iterator
- post-process for diff stream
- update changelog
- use a single commit instead of an array of them. This simplifies the
implementation. A higher level api can wrap this when an array of
commits is desired and those semantics are figured out.
- since this API is directly 1-1 on parents, there are no conflicts
- if we introduce a higher level API that handles lists of commits, we
may need to restore the conflict/resolved distinction, but for now
simplify
This allows us to diff trees without fully resolving conflicts:
let from_tree = merge_no_resolve(..);
for (path, (from, to)) in from_tree.diff(to_tree, matcher) {
let from = resolve_conflicts(from);
if from == to {
continue; // resolved file may be identical
...
I originally considered adding a matcher argument to merge() functions, but the
resulting API looked misleading. If merge() took a matcher, callers might expect
unmatched trees and files were omitted, not left unresolved. It's also slower
than diffing unresolved trees because merge(.., matcher) would have to write
partially resolved trees to the store.
Since "ancestor_tree" isn't resolved by itself, this patch has subtle behavior
change. For example, "jj diff -r9eaef582" in the "git" repository is no longer
empty. I think the new behavior is also technically correct, but I'm not pretty
sure.
While measuring file(path) query, I noticed BTreeMap lookup appears in perf.
It actually has a measurable cost if the history is linear and parent trees
don't have to be merged dynamically. For merge-heavy history, the cost of
tree merges is more significant. I'll address that separately.
```
% hyperfine --sort command --warmup 3 --runs 50 -L bin jj-1,jj-2 \
'target/release-with-debug/{bin} -R ~/mirrors/git --ignore-working-copy \
log -r "::trunk() & ~merges() & file(root:builtin)" --no-graph -n100'
Benchmark 1: target/release-with-debug/jj-1 ..
Time (mean ± σ): 239.7 ms ± 7.1 ms [User: 192.1 ms, System: 46.5 ms]
Range (min … max): 222.2 ms … 249.7 ms 50 runs
Benchmark 2: target/release-with-debug/jj-2 ..
Time (mean ± σ): 201.7 ms ± 6.9 ms [User: 153.7 ms, System: 46.6 ms]
Range (min … max): 184.2 ms … 211.1 ms 50 runs
Relative speed comparison
1.19 ± 0.05 target/release-with-debug/jj-1 ..
1.00 target/release-with-debug/jj-2 ..
```
Suppose we add copy information to MergedTree, a MergedTree can be considered
a root tree representation plus global metadata. I think Merge<Tree> is a better
type for sub trees.
I considered making `MergedTree` just a newtype (1-tuple) but I went
with a struct instead because we may want to add copy information in a
separate field in the future.
In order to remove the `MergedTree::Legacy` form, we need to stop
creating such instances. This patch removes the last place we create
them, which is in `Store::get_root_tree()`.
The main practical consequence of this change is that loading legacy
trees gets a lot slower on large repos. However, since the default log
template includes the `conflict` keyword, we ended up scanning all
paths in `jj log` anyway, so I'm not sure many people will notice.
Since "op abandon" just rewrites DAG, it works no matter if the heads are
merged or not. This change will help crash recovery. "op abandon
--at-op=<one-of-the-heads>" can't be used because ancestor operations would be
preserved by the other head.
Suppose a squash node in obslog is analogous to a merge in revisions log, it
makes sense to show diffs from auto-merge (or auto-squash) parents. This
basically means a non-partial squash node no longer shows diffs.
This also fixes missing diffs at the root predecessors if there were.
Author dates and committer dates can be filtered like so:
committer_date(before:"1 hour ago") # more than 1 hour ago
committer_date(after:"1 hour ago") # 1 hour ago or less
A date range can be created by combining revsets. For example, to see any
revisions committed yesterday:
committer_date(after:"yesterday") & committer_date(before:"today")
Creates a DatePattern type that can be created by parsing a string in any
format supported by the chrono-english crate, including:
- 2024-03-25
- 2024-03-25T00:00:00
- 2024-03-25T00:00:00-08:00
- 2 weeks ago
- 5 minutes ago
- yesterday
- yesterday 5pm
- yesterday 10:30
- yesterday 15:30
- tomorrow
A `kind` can be specified to indicate whether the pattern should match dates at
or after (`after`) or strictly before (`before`) the given instant.
chrono-english supports US and UK dialects to disambiguate mm/dd/yy from
dd/mm/yy, but for now we default to US. This should probably be a config
setting.
This enables the creation of Repo objects in environments without standard filesystem support, by allowing the caller to load the store objects however they see fit. This confines interaction with the filesystem to the WorkingCopy abstractions.
This is part of migrating away from legacy trees (with path-level
conflicts). I can't think of any practical impact (we already compare
the tree ids equal).
This basically reverts 20eb9ecec1 "git: don't abandon HEAD commit when it
loses a branch." I think the new behavior is more consistent because the Git
HEAD is equivalent to @- in jj, so it shouldn't be considered a named ref.
Note that we've made old HEAD branch not considered at 92cfffd843 "git: on
external HEAD move, do not abandon old branch."
#4108
If readonly_index() and index() returned Result, it would propagate to many
call sites. That seems bad for API ergonomics. Suppose most "repo" commands
depend on an index, I think it's okay to load index eagerly:
- "jj config" doesn't load repo (nor index)
- "jj workspace root" doesn't load repo (nor index)
- some other mutation commands load index when printing commit summary
- many other commands load index when resolving revset
In order to render description template, we'll need a Commit object that
represents the old state (with new tree and parents) before updating the
commit description. The added functions will help generate an intermediate
Commit object.
Alternatively, we can create an in-memory Commit object with some fake
CommitId. It should be lightweight, but might cause weird issue because the
fake id wouldn't be found in the store.
I think it's okay to write a temporary commit and rely on GC as we do for
merge trees. However, I should note that temporary commits are more likely to
be preserved as they are pinned by no-gc refs until "jj util gc".
This allows us to construct a builder, format description template with an
intermediate commit, then write() a final commit object to the repo.
I originally considered removing mut_repo from CommitBuilder at all, but
rewriter APIs rely on that CommitBuilder has &mut_repo, and splitting them
would make call sites uglier.
The inner builder methods are based on &mut Self instead of Self, because it's
easier to wrap, and users of the inner builder will bind it to a named variable
anyway.