rework the tutorial with the new paradigm

2025-01-27 15:07:03 +00:00 · 2022-08-04 01:31:13 -04:00 · 2022-08-04 01:31:13 -04:00 · e222cce854
commit e222cce854
parent 8f2f664e96
3 changed files with 125 additions and 138 deletions
--- a/book/src/SUMMARY.md
+++ b/book/src/SUMMARY.md
@@ -9,7 +9,7 @@
  - [Basic structure](./tutorial/structure.md)
  - [Jars and databases](./tutorial/jar.md)
  - [Defining the database struct](./tutorial/db.md)
-  - [Defining the IR: input, interned, and tracked structs](./tutorial/ir.md)
+  - [Defining the IR: the various "salsa structs"](./tutorial/ir.md)
  - [Defining the parser: memoized functions and inputs](./tutorial/parser.md)
  - [Defining the parser: reporting errors](./tutorial/accumulators.md)
  - [Defining the parser: debug impls and testing](./tutorial/debug.md)
--- a/book/src/tutorial/ir.md
+++ b/book/src/tutorial/ir.md
@@ -1,9 +1,27 @@
 # Defining the IR

 Before we can define the [parser](./parser.md), we need to define the intermediate representation (IR) that we will use for `calc` programs.
-In the [basic structure](./structure.md), we defined some "pseudo-Rust" structures like `Statement`, `Expression`, and so forth, and now we are going to define them for real.
+In the [basic structure](./structure.md), we defined some "pseudo-Rust" structures like `Statement` and `Expression`;
+now we are going to define them for real.

-## Input
+## "Salsa structs"
+
+In addition to regular Rust types, we will make use of various **salsa structs**.
+A salsa struct is a struct that has been annotated with one of the salsa annotations:
+
+* [`#[salsa::input]`](#input-structs), which designates the "base inputs" to your computation;
+* [`#[salsa::tracked]`](#tracked-structs), which designates intermediate values created during your computation;
+* [`#[salsa::interned]`](#interned-structs), which designates small values that are easy to compare for equality.
+
+All salsa structs store the actual values of their fields in the salsa database.
+This permits us to track when the values of those fields change to figure out what work will need to be re-executed.
+
+When you annotate a struct with one of the above salsa attributes, salsa actually generates a bunch of code to link that struct into the database.
+This code must be connected to some [jar](./jar.md).
+By default, this is `crate::Jar`, but you can specify a different jar with the `jar=` attribute (e.g., `#[salsa::input(jar = MyJar)]`).
+You must also list the struct in the jar definition itself, or you will get errors.
+
+## Input structs

 The first thing we will define is our **input**. 
 Every salsa program has some basic inputs that drive the rest of the computation.
@@ -17,9 +35,6 @@ Inputs are defined as Rust structs with a `#[salsa::input]` annotation:
 ```

 In our compiler, we have just one simple input, the `ProgramSource`, which has a `text` field (the string).
-(By the way, the `#[salsa::input]` annotation must be connected to a jar, but it defaults to `crate::Jar`.
-If you wanted to create the jar somewhere else, you would write `#[salsa::input(jar = path::to::Jar)]`.
-Wherever the jar is defined, you also have to list the input as one of its fields.)

 ### The data lives in the database

@@ -42,74 +57,32 @@ For an input, a `&mut db` reference is required, along with the values for each
 let source = ProgramSource::new(&mut db, "print 11 + 11".to_string());
 ```

-You can read the value of the field with `source.text(db)`, 
+You can read the value of the field with `source.text(&db)`, 
 and you can set the value of the field with `source.set_text(&mut db, "print 11 * 2".to_string())`.

-## Interning
+### Database revisions

-Interning is a builtin feature to salsa where you take a struct and replace it with a single integer.
-The integer you get back is arbitrary, but whenever you intern the same struct twice, you get back the same integer.
-In our compiler, we'll use interning to define `FunctionId` and `VariableId`, which are effectively interned strings.
+Whenever a function takes an `&mut` reference to the database, 
+that means that it can only be invoked from outside the incrementalized part of your program,
+as explained in [the overview](../overview.md#goal-of-salsa).
+When you change the value of an input field, that increments a 'revision counter' in the database,
+indicating that some inputs are different now.
+When we talk about a "revision" of the database, we are referring to the state of the database in between changes to the input values.
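
The relationship between `&mut` access and revisions can be modelled in plain Rust. The following is only a sketch under stated assumptions: `ToyDb` and its methods are hypothetical names invented for illustration, and this is not how salsa stores inputs internally. It shows that reads need only a shared reference, while every write goes through `&mut` and starts a new revision.

```rust
/// Toy stand-in for a database holding one input field (hypothetical, not salsa's API).
#[derive(Debug, Default)]
pub struct ToyDb {
    /// Incremented every time an input is written.
    revision: u64,
    /// The single input field of this toy database.
    source_text: String,
}

impl ToyDb {
    /// Reading needs only `&self`, so it is allowed inside the incremental computation.
    pub fn text(&self) -> &str {
        &self.source_text
    }

    /// Writing needs `&mut self`, so it can only happen from outside the computation.
    /// Each write begins a new revision.
    pub fn set_text(&mut self, text: String) {
        self.source_text = text;
        self.revision += 1;
    }

    /// The current revision: the state of the database between two input changes.
    pub fn revision(&self) -> u64 {
        self.revision
    }
}
```

Because `set_text` borrows the database mutably, the borrow checker guarantees that no computation holding a `&ToyDb` can be running while an input changes.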

-Interned structs in Salsa are defined with the `#[salsa::interned]` attribute macro:
+## Tracked structs

-```rust
-{{#include ../../../calc-example/calc/src/ir.rs:interned_ids}}
-```
-
-As with `#[salsa::input]`, the data for an interned struct is stored in the database, and the struct itself is just an integer. 
-The interned structs have a few methods.
-
-These interned structs also have a few methods:
-
- The `new` method creates an interned struct given a database `db` and a value for each field (e.g., `let v = VariableId::new(db, "foo".to_string())`).
- For each field of the interned struct, there is an accessor method with the same name (e.g., `v.name(db)`).
-  - If the field is marked with `#[id(ref)]`, as it is here, than this accessor returns a reference! In this case, it returns an `&String` tied to the database `db`. This is useful when the values of the fields are not `Copy`.
-  - The `id` here refers to the fact that the value of this field is part of the _identity_, i.e., it affects the `salsa::Id` integer. For interned structs, all fields are `id` fields, but later on we'll see entity structs, which also have `value` fields.
-
-### The data struct
-
-In addition to the main `VariableId` and `FunctionId` structs, the `salsa::interned` macro also creates a "data struct".
-This is normally named the same as the original struct, but with `Data` appended to the end (i.e., `VariableIdData` and `FunctionIdData`).
-You can override the name by using the `data` option (`#[salsa::interned(data = MyName)]`).
-The data struct also inherits all the `derive` and other attributes from the original source.
-In the case of our examples, the generated data struct would be something like:
-
-```rust
-// Generated by `salsa::interned`
-
-#[derive(Eq, PartialEq, Clone, Hash)]
-pub struct VariableIdData {
-    pub text: String,
-}
-
-#[derive(Eq, PartialEq, Clone, Hash)]
-pub struct FunctionIdData {
-    pub text: String,
-}
-```
-
-## Expressions and statements
-
-We'll also intern expressions and statements. This is convenient primarily because it allows us to have recursive structures very easily. Since we don't really need the "cheap equality comparison" aspect of interning, this isn't the most efficient choice, and many compilers would opt to represent expressions/statements in some other way.
-
-```rust
-{{#include ../../../calc-example/calc/src/ir.rs:statements_and_expressions}}
-```
-
-## Function entities
-
-The final piece of our IR is the representation of _functions_. Here, we use an _entity struct_:
+Next we will define a **tracked struct** to represent the functions in our input.
+Whereas inputs represent the *start* of a computation, tracked structs represent intermediate values created during your computation.
+In this case, we are going to parse the raw input program, and create a `Function` for each of the functions defined by the user.

 ```rust
 {{#include ../../../calc-example/calc/src/ir.rs:functions}}
 ```

-An **entity** is very similar to an interned struct, except that it has own identity. That is, each time you invoke `Function::new`, you will get back a new `Function, even if all the values of the fields are equal.
+Unlike with inputs, the fields of tracked structs are immutable once created. Otherwise, working with a tracked struct is quite similar to an input:

-### Interning vs entities
-
-Unless you want a cheap way to compare for equality across functions, you should prefer entities to interning. They correspond most closely to creating a "new" struct.
+* You can create a new value by using `new`, but with a tracked struct, you only need an `&dyn` database, not `&mut` (e.g., `Function::new(&db, some_name, some_args, some_body)`).
+* You use a getter to read the value of a field, just like with an input (e.g., `my_func.args(db)` to read the `args` field).

 ### id fields

@@ -118,3 +91,46 @@ Normally, you would do this on fields that represent the "name" of an entity.
 This indicates that, across two revisions R1 and R2, if two functions are created with the same name, they refer to the same entity, so we can compare their other fields for equality to determine what needs to be re-executed.
 Adding `#[id]` attributes is an optimization and never affects correctness.
 For more details, see the [algorithm](../reference/algorithm.md) page of the reference.
+
+## Interned structs
+
+The final kind of salsa struct is the *interned struct*.
+As with input and tracked structs, the data for an interned struct is stored in the database, and you just pass around a single integer.
+Unlike those structs, if you intern the same data twice, you get back the **same integer**.
+
+A classic use of interning is for small strings like function names and variables.
+It's annoying and inefficient to pass around those names with `String` values which must be cloned;
+it's also inefficient to have to compare them for equality via string comparison.
+Therefore, we define two interned structs, `FunctionId` and `VariableId`, each with a single field that stores the string:
+
+```rust
+{{#include ../../../calc-example/calc/src/ir.rs:interned_ids}}
+```
+
+When you invoke e.g. `FunctionId::new(&db, "my_string".to_string())`, you will get back a `FunctionId` that is just a newtype'd integer.
+But if you invoke the same call to `new` again, you get back the same integer:
+
+```rust
+let f1 = FunctionId::new(&db, "my_string".to_string());
+let f2 = FunctionId::new(&db, "my_string".to_string());
+assert_eq!(f1, f2);
+```
+
+### Expressions and statements
+
+We'll also intern expressions and statements. This is convenient primarily because it allows us to have recursive structures very easily. Since we don't really need the "cheap equality comparison" aspect of interning, this isn't the most efficient choice, and many compilers would opt to represent expressions/statements in some other way.
+
+```rust
+{{#include ../../../calc-example/calc/src/ir.rs:statements_and_expressions}}
+```
+
+### Interned ids are guaranteed to be consistent within a revision, but not across revisions (but you don't have to care)
+
+Interned ids are guaranteed not to change within a single revision, so you can intern things from all over your program and get back consistent results.
+When you change the inputs, however, salsa may opt to clear some of the interned values and choose different integers.
+However, if this happens, it will also be sure to re-execute every function that interned that value, so all of them still see a consistent value,
+just a different one than they saw in a previous revision.
+
+In other words, within a salsa computation, you can assume that interning produces a single consistent integer, and you don't have to think about it.
+If, however, you export interned identifiers outside the computation, and then change the inputs, they may no longer be valid or may refer to different values.
+
--- a/book/src/tutorial/parser.md
+++ b/book/src/tutorial/parser.md
@@ -1,102 +1,73 @@
 # Defining the parser: memoized functions and inputs

 The next step in the `calc` compiler is to define the parser.
-The role of the parser will be to take the raw bytes from the input and create the `Statement`, `Function`, and `Expression` structures that [we defined in the `ir` module](./ir.md).
+The role of the parser will be to take the `ProgramSource` input,
+read the string from the `text` field,
+and create the `Statement`, `Function`, and `Expression` structures that [we defined in the `ir` module](./ir.md).

 To minimize dependencies, we are going to write a [recursive descent parser][rd].
 Another option would be to use a [Rust parsing framework](https://rustrepo.com/catalog/rust-parsing_newest_1).
+We won't cover the parsing itself in this tutorial -- you can read the code if you want to see how it works.
+We're going to focus only on the salsa-related aspects.

 [rd]: https://en.wikipedia.org/wiki/Recursive_descent_parser

-## The `source_text` the function
-
-Let's start by looking at the `source_text` function:
-
-```rust
-{{#include ../../../calc-example/calc/src/parser.rs:source_text}}
-```
-
-This is a bit of an odd function!
-You can see it is annotated as memoized,
-which means that salsa will store the return value in the database,
-so that if you call it again, it does not re-execute unless its inputs have changed.
-However, the function body itself is just a `panic!`, so it can never successfully return.
-What is going on?
-
-This function is an example of a common convention called an **input**.
-Whenever you have a memoized function, it is possible to set its return value explicitly
-(the [chapter on testing](./debug.md) shows how it is done).
-When you set the return value explicitly, it never executes;
-instead, when it is called, that return value is just returned.
-This makes the function into an _input_ to the entire computation.
-
-In this case, the body is just `panic!`,
-which indicates that `source_text` is always meant to be set explicitly.
-It's possible to set a return value for functions that have a body,
-in which case they can act as either an input or a computation.
-
-### Arguments to a memoized function
-
-The first parameter to a memoized function is always the database,
-which should be a `dyn Trait` value for the database trait associated with the jar
-(the default jar is `crate::Jar`).
-
-Memoized functions may take other arguments as well, though our examples here do not.
-Those arguments must be something that can be interned.
-
-### Memoized functions with `return_ref`
-
-`source_text` is not only memoized, it is annotated with `return_ref`.
-Ordinarily, when you call a memoized function,
-the result you get back is cloned out of the database.
-The `return_ref` attribute means that a reference into the database is returned instead.
-So, when called, `source_text` will return an `&String` rather than cloning the `String`.
-This is useful as a performance optimization.
-
 ## The `parse_statements` function

-The next function is `parse_statements`, which has the job of actually doing the parsing.
-The comments in the function explain how it works.
+The starting point for the parser is the `parse_statements` function:

 ```rust
 {{#include ../../../calc-example/calc/src/parser.rs:parse_statements}}
 ```

-The most interesting part, from salsa's point of view,
-is that `parse_statements` calls `source_text` to get its input.
-Salsa will track this dependency.
-If `parse_statements` is called again, it will only re-execute if the return value of `source_text` may have changed.
+This function is annotated as `#[salsa::tracked]`.
+That means that, when it is called, salsa will track what inputs it reads as well as what value it returns.
+The return value is *memoized*,
+which means that if you call this function again without changing the inputs,
+salsa will just clone the result rather than re-execute it.

-We won't explain how the parser works in detail here.
-You can read the comments in the source file to get a better understanding.
-But we will cover a few interesting points that interact with Salsa specifically.
+### Tracked functions are the unit of reuse

-### Creating interned values with the `from` method
+Tracked functions are the core part of how salsa enables incremental reuse.
+The goal of the framework is to avoid re-executing tracked functions and instead to clone their result.
+Salsa uses the [red-green algorithm](../reference/algorithm.md) to decide when to re-execute a function.
+The short version is that a tracked function is re-executed if either (a) it directly reads an input, and that input has changed
+or (b) it directly invokes another tracked function, and that function's return value has changed.
+In the case of `parse_statements`, it directly reads `ProgramSource::text`, so if the text changes, then the parser will re-execute.
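
Case (a) above can be illustrated with a heavily simplified model. This sketch is not the red-green algorithm and does not use salsa; `MemoDb`, `parse_words`, and the `executions` counter are hypothetical names introduced for illustration. It memoizes one function over one input and re-executes only when the input changed after the cached result was computed.

```rust
/// Toy database with one input and one memoized function (hypothetical, not salsa's API).
#[derive(Debug)]
pub struct MemoDb {
    text: String,
    /// Current revision; bumped on every input write.
    revision: u64,
    /// Revision in which `text` last changed.
    text_changed_at: u64,
    /// Cached result together with the revision in which it was computed.
    memo: Option<(u64, Vec<String>)>,
    /// Counts how often the function body actually ran.
    pub executions: u32,
}

impl MemoDb {
    pub fn new(text: &str) -> Self {
        MemoDb {
            text: text.to_string(),
            revision: 0,
            text_changed_at: 0,
            memo: None,
            executions: 0,
        }
    }

    /// Changing the input starts a new revision and records when the input changed.
    pub fn set_text(&mut self, text: &str) {
        self.revision += 1;
        self.text = text.to_string();
        self.text_changed_at = self.revision;
    }

    /// Stand-in for a tracked function: splits the input into words.
    pub fn parse_words(&mut self) -> Vec<String> {
        if let Some((computed_at, cached)) = &self.memo {
            // The input has not changed since the result was computed: reuse it.
            if self.text_changed_at <= *computed_at {
                return cached.clone();
            }
        }
        // The input changed (or nothing is cached): re-execute and store the result.
        self.executions += 1;
        let words: Vec<String> = self.text.split_whitespace().map(str::to_string).collect();
        self.memo = Some((self.revision, words.clone()));
        words
    }
}
```

Calling `parse_words` twice without an intervening `set_text` runs the body once; the second call clones the cached vector.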

-The `parse_statement` method parses a single statement from the input:
+By choosing which functions to mark as `#[salsa::tracked]`, you control how much reuse you get.
+In our case, we're opting to mark the outermost parsing function as tracked, but not the inner ones.
+This means that if the input changes, we will always re-parse the entire input and re-create the resulting statements and so forth.
+We'll see later that this *doesn't* mean we will always re-run the type checker and other parts of the compiler.

-```rust
-{{#include ../../../calc-example/calc/src/parser.rs:parse_statement}}
-```
+This trade-off makes sense because (a) parsing is very cheap, so the overhead of tracking and enabling finer-grained reuse doesn't pay off
+and because (b) since strings are just a big blob-o-bytes without any structure, it's rather hard to identify which parts of the IR need to be reparsed.
+Some systems do choose to do more granular reparsing, often by doing a "first pass" over the string to give it a bit of structure, 
+e.g. to identify the functions,
+but deferring the parsing of the body of each function until later.
+Setting up a scheme like this is relatively easy in salsa, and uses the same principles that we will use later to avoid re-executing the type checker.

-The part we want to highlight is how an interned enum is created:
+### Parameters to a tracked function

-```rust
-Statement::from(self.db, StatementData::Function(func))
-```
+The **first** parameter to a tracked function is **always** the database, `db: &dyn crate::Db`.
+It must be a `dyn` value of whatever database is associated with the jar.

-On any interned value, the `from` method takes a database and an instance of the "data" type (here, `StatementData`).
-It then interns this value and returns the interned type (here, `Statement`).
+The **second** parameter to a tracked function is **always** some kind of salsa struct.
+The first parameter to a memoized function is always the database,
+which should be a `dyn Trait` value for the database trait associated with the jar
+(the default jar is `crate::Jar`).

-### Creating entity values, or interned structs, with the `new` method
+Tracked functions may take other arguments as well, though our examples here do not.
+Functions that take additional arguments are less efficient and flexible.
+It's generally better to structure tracked functions as functions of a single salsa struct if possible.

-The other way to create an interned/entity struct is with the `new` method.
-This only works when the struct has named fields (i.e., it doesn't work with enums like `Statement`).
-The `parse_function` method demonstrates:
+### The `return_ref` annotation

-```rust
-{{#include ../../../calc-example/calc/src/parser.rs:parse_function}}
-```
+You may have noticed that `parse_statements` is tagged with `#[salsa::tracked(return_ref)]`.
+Ordinarily, when you call a tracked function, the result you get back is cloned out of the database.
+The `return_ref` attribute means that a reference into the database is returned instead.
+So, when called, `parse_statements` will return an `&Vec<Statement>` rather than cloning the `Vec`.
+This is useful as a performance optimization.
+(You may recall the `return_ref` annotation from the [ir](./ir.md) section of the tutorial, 
+where it was placed on struct fields, with roughly the same meaning.)
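
The difference between cloning a result and returning a reference can be shown without salsa. In this sketch, `Cache`, `statements_cloned`, and `statements_ref` are hypothetical names, and `String` stands in for the tutorial's `Statement` type. The first accessor models the default behaviour and the second models what `return_ref` is described as doing.

```rust
/// Toy holder for a memoized result (hypothetical, standing in for the database).
pub struct Cache {
    statements: Vec<String>,
}

impl Cache {
    pub fn new(statements: Vec<String>) -> Self {
        Cache { statements }
    }

    /// Default behaviour: the caller receives its own copy of the stored value.
    pub fn statements_cloned(&self) -> Vec<String> {
        self.statements.clone()
    }

    /// `return_ref`-like behaviour: the caller borrows the stored value,
    /// and the borrow is tied to the lifetime of `&self`.
    pub fn statements_ref(&self) -> &Vec<String> {
        &self.statements
    }
}
```

The borrowed form avoids allocating and copying the whole vector on every call, at the cost of keeping the holder borrowed for as long as the reference is alive.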

-You can see that we invoke `FunctionnId::new` (an interned struct) and `Function::new` (an entity).
-In each case, the `new` method takes the database, and then the value of each field.