Based on actual events.
Let's say you've got a blog. The blog has a bunch of posts. Each post has a title and a set of tags. The metadata for these posts is all contained in TOML files in a single directory. (If you use Zola, you're pretty close to that.) And now you need to generate a CSV file showing a matrix of blog posts and their tags. Seems like a great job for Rust!
In this post, we're going to:
- Explore how we'd solve this (fairly simple) problem
- Investigate how Rust's types tell us a lot about memory usage
- Play with some nice and not-so-nice ways to optimize our program
Onwards!
Program behavior
We've got a bunch of TOML files sitting in the posts
directory. Here are some example files:
# devops-for-developers.toml
title = "DevOps for (Skeptical) Developers"
tags = ["dev", "devops"]
# rust-devops.toml
title = "Rust with DevOps"
tags = ["devops", "rust"]
We want to create a CSV file that looks like this:
Title,dev,devops,rust,streaming
DevOps for (Skeptical) Developers,true,true,false,false
Rust with DevOps,false,true,true,false
Serverless Rust using WASM and Cloudflare,false,true,true,false
Streaming UTF-8 in Haskell and Rust,false,false,true,true
To make this happen, we need to:
- Iterate through the files in the
posts
directory
- Load and parse each TOML file
- Collect a set of all tags present in all posts
- Collect the parsed post information
- Create a CSV file from that information
Not too bad, right?
Setup
You should make sure you've installed the Rust tools. Then you can create a new empty project with cargo new tagcsv
.
Later on, we're going to play with some unstable language features, so let's opt into a nightly version of the compiler. To do this, create a rust-toolchain
file containing:
nightly-2020-08-29
Then add the following dependencies to your Cargo.toml
file:
[dependencies]
csv = "1.1.3"
serde = "1.0.115"
serde_derive = "1.0.115"
toml = "0.5.6"
OK, now we can finally work on some code!
First version
We're going to use the toml
crate to parse our metadata files. toml
is built on top of serde
, and we can conveniently use serde_derive
to automatically derive a Deserialize
implementation for a struct
that represents that metadata. So we'll start off our program with:
use serde_derive::Deserialize;
use std::collections::HashSet;
#[derive(Deserialize)]
struct Post {
title: String,
tags: HashSet<String>,
}
Next, we'll define our main
function to load the data:
fn main() -> Result<(), std::io::Error> {
// Collect all tags across all of the posts
let mut all_tags: HashSet<String> = HashSet::new();
// And collect the individual posts
let mut posts: Vec<Post> = Vec::new();
// Read in the files in the posts directory
let dir = std::fs::read_dir("posts")?;
for entry in dir {
// Error handling
let entry = entry?;
// Read the file contents as a String
let contents = std::fs::read_to_string(entry.path())?;
// Parse the contents with the toml crate
let post: Post = toml::from_str(&contents)?;
// Add all of the tags to the all_tags set
for tag in &post.tags {
all_tags.insert(tag.clone());
}
// Update the Vec of posts
posts.push(post);
}
// Generate the CSV output
gen_csv(&all_tags, &posts)?;
Ok(())
}
And finally, let's define our gen_csv
function to take the set of tags and the Vec
of posts and generate the output file:
fn gen_csv(all_tags: &HashSet<String>, posts: &[Post]) -> Result<(), std::io::Error> {
// Open the file for output
let mut writer = csv::Writer::from_path("tag-matrix.csv")?;
// Generate the header, with the word "Title" and then all of the tags
let mut header = vec!["Title"];
for tag in all_tags.iter() {
header.push(tag);
}
writer.write_record(header)?;
// Print out a separate row for each post
for post in posts {
// Create a record with the post title...
let mut record = vec![post.title.as_str()];
for tag in all_tags {
// and then a true or false for each tag name
let field = if post.tags.contains(tag) {
"true"
} else {
"false"
};
record.push(field);
}
writer.write_record(record)?;
}
writer.flush()?;
Ok(())
}
Side note: it would be slightly nicer to alphabetize the set of tags, which you can do by collecting all of the tags into a Vec
and then sorting it. I had that previously, but removed it in the code above to reduce incidental noise to the example. If you feel like having fun, try adding that back.
Anyway, this program works exactly as we want, and produces a CSV file. Perfect, right?
Let the types guide you
I love type-driven programming. I love the idea that looking at the types tells you a lot about the behavior of your program. And in Rust, the types can often tell you about the memory usage of your program. I want to focus on two lines, and then prove a point with a third. Consider:
tags: HashSet<String>,
and
let mut all_tags: HashSet<String> = HashSet::new();
Firstly, I love the fact that the types tell us so much about expected behavior. The tags are a set: the order is unimportant, and there are no duplicates. That makes sense. We don't want to list "devops" twice in our set of all tags. And there's nothing inherently "first" or "second" about "dev" vs "rust". And we know that tags are arbitrary pieces of textual data. Awesome.
But what I really like here is that it tells us about memory usage. Each post has its own copy of each tag. So does the all_tags
set. How do I know this? Easy: because that's exactly what String
means. There's no possibility of data sharing, at all. If we have 200 posts tagged "dev", we will have 201 copies of the string "dev" in memory (200 for the posts, once for the all_tags
).
And now that we've seen it in the types, we can see evidence of it in the implementation too:
all_tags.insert(tag.clone());
That .clone()
bothered me when I first wrote it. And that's what got me to look at the types, which bothered me further.
In reality, this is nothing to worry about. Even with 1,000 posts, averaging 5 tags, with each tag averaging 20 bytes, this will only take up an extra 100,000 bytes of memory. So optimizing this away is not a good use of our time. We're much better off doing something else.
But I wanted to have fun. And if you're reading this post, I think you want to continue this journey too. Onwards!
Rc
This isn't the first solution I tried. But it's the first one that worked easily. So we'll start here.
The first thing we have to change is our types. As long as we have HashSet<String>
, we know for a fact that we'll have extra copies of the data. This seems like a nice use case for Rc
. Rc
uses reference counting to let multiple values share ownership of another value. Sounds like exactly what we want!
My approach here is to use compiler-error-driven development, and I encourage you to play along with your own copy of the code. First, let's use
Rc
:
use std::rc::Rc;
Next, let's change our definition of Post
to use an Rc<String>
instead of String
:
#[derive(Deserialize)]
struct Post {
title: String,
tags: HashSet<Rc<String>>,
}
The compiler doesn't like this very much. We can't derive Deserialize
for an Rc<String>
. So instead, let's make a RawPost
struct for the deserializing, and then dedicate Post
for holding the data with Rc<String>
. In other words:
#[derive(Deserialize)]
struct RawPost {
title: String,
tags: HashSet<String>,
}
struct Post {
title: String,
tags: HashSet<Rc<String>>,
}
And then, when parsing the toml
, we'll parse into a RawPost
type:
let post: RawPost = toml::from_str(&contents)?;
If you're following along, you'll only have one error message at this point about posts.push(post);
having a mismatch between Post
and RawPost
. But before we address that, let's make one more type change above. I want to make all_tags
contain Rc<String>
.
let mut all_tags: HashSet<Rc<String>> = HashSet::new();
OK, now we've got some nice error messages about mismatches between Rc<String>
and String
. This is where we have to be careful. The easiest thing to do would be to simply wrap our String
s in an Rc
and end up with lots of copies of String
. Let's implement the next bit incorrectly first to see what I'm talking about.
At this point in our code rewrite, we've got a RawPost
, and we need to:
- Add its tags to
all_tags
- Create a new
Post
value based on the RawPost
- Add the
Post
to the posts
Vec
Here's the simple and wasteful implementation:
let raw_post: RawPost = toml::from_str(&contents)?;
let mut post_tags: HashSet<Rc<String>> = HashSet::new();
for tag in raw_post.tags {
let tag = Rc::new(tag);
all_tags.insert(tag.clone());
post_tags.insert(tag);
}
let post = Post {
title: raw_post.title,
tags: post_tags,
};
posts.push(post);
The problem here is that we always keep the original String
from the RawPost
. If that tag is already present in the all_tags
set, we don't end up using the same copy.
There's an unstable method on HashSet
s that helps us out here. get_or_insert
will try to insert a value into a HashSet
. If the value is already present, it will drop the new value and return a reference to the original value. If the value isn't present, the value is added to the HashSet
and we get a reference back to it. Changing our code to use that is pretty easy:
for tag in raw_post.tags {
let tag = Rc::new(tag);
let tag = all_tags.get_or_insert(tag);
post_tags.insert(tag.clone());
}
We still end up with a .clone()
call, but now it's a clone of an Rc
, which is a cheap integer increment. No additional memory allocation required! Since this method is unstable, we also have to enable the feature by adding this at the top of your source file:
#![feature(hash_set_entry)]
And only one more change required. The signature for gen_csv
is expecting a &HashSet<String>
. If you change that to &HashSet<Rc<String>>
, the code will compile and run correctly. Yay!
In case you got lost with all of the edits above, here's the current version of main.rs
:
Quibbles
I already told you that the original HashSet<String>
version of the code is likely Good Enough™ for most cases. I'll tell you that, if you're really bothered by that overhead, the HashSet<Rc<String>>
version if almost certainly the right call. So we should probably just stop here and end the blog post on a nice, safe note.
But let's be bold and crazy. I don't actually like this version of the code that much, for two reasons:
- The
Rc
feels dirty here. Rc
is great for weird lifetime situations with values. But in our case, we know that the all_tags
set, which owns all of the tags, will always outlive the usage of the tags inside the Post
s. So reference counting feels like an unnecessary overhead and obscuring the situation.
- As demonstrated before, it's all too easy to mess up with the
Rc<String>
version. You can accidentally bypass all of the memory saving benefits by using a new String
instead of cloning a reference to an existing one.
What I'd really like to do is to have all_tags
be a HashSet<String>
and own the tags themselves. And then, inside Post
, I'd like to keep references to those tags. Unfortunately, this doesn't quite work. Can you foresee why? If not, don't worry, I didn't see it until the borrow checker told me how wrong I was a few times. Let's experience that joy together. And we'll do it with compiler-driven development again.
The first thing I'm going to do is remove the use std::rc::Rc;
statement. That leads to our first error: Rc
isn't in scope for Post
. We want to keep a &str
in this struct. But we have to be explicit about lifetimes when holding references in structs. So our code ends up as:
struct Post<'a> {
title: String,
tags: HashSet<&'a str>,
}
The next error is about the definition of all_tags
in main
. That's easy enough: just take out the Rc
:
let mut all_tags: HashSet<String> = HashSet::new();
This is easy! Similarly, post_tags
is defined as a HashSet<Rc<String>>
. In this case, we want to hold &str
s instead, so:
let mut post_tags: HashSet<&str> = HashSet::new();
We no longer need to use Rc::new
in the for
loop, or clone the Rc
. So our loop simplifies down to:
for tag in raw_post.tags {
let tag = all_tags.get_or_insert(tag);
post_tags.insert(tag);
}
And (misleadingly), we just have one error message left: the signature for gen_csv
still uses a Rc
. We'll get rid of that with the new signature:
fn gen_csv(all_tags: &HashSet<String>, posts: &[Post]) -> Result<(), std::io::Error> {
And we get an (IMO confusing) error message about &str
and &String
not quite lining up:
error[E0277]: the trait bound `&str: std::borrow::Borrow<std::string::String>` is not satisfied
--> src\main.rs:67:38
|
67 | let field = if post.tags.contains(tag) {
| ^^^^^^^^ the trait `std::borrow::Borrow<std::string::String>` is not implemented for `&str`
But this can be solved by explicitly asking for a &str
via the as_str
method:
let field = if post.tags.contains(tag.as_str()) {
And you might think we're done. But this is where the "misleading" idea comes into play.
The borrow checker wins
If you've been following along, you should now see an error message on your screen that looks something like:
error[E0499]: cannot borrow `all_tags` as mutable more than once at a time
--> src\main.rs:35:23
|
35 | let tag = all_tags.get_or_insert(tag);
| ^^^^^^^^ mutable borrow starts here in previous iteration of loop
error[E0502]: cannot borrow `all_tags` as immutable because it is also borrowed as mutable
--> src\main.rs:46:13
|
35 | let tag = all_tags.get_or_insert(tag);
| -------- mutable borrow occurs here
...
46 | gen_csv(&all_tags, &posts)?;
| ^^^^^^^^^ ------ mutable borrow later used here
| |
| immutable borrow occurs here
I was convinced that the borrow checker was being overly cautious here. Why would a mutable borrow of all_tags
to insert a tag into the set conflict with an immutable borrow of the tags inside the set? (If you already see my error, feel free to laugh at my naivete.) I could follow why I'd violated borrow check rules. Specifically: you can't have a mutable reference and any other reference live at the same time. But I didn't see how this was actually stopping my code from segfaulting.
After a bit more thinking, it clicked. I realized that I had an invariant in my head which did not appear anywhere in my types. And therefore, the borrow checker was fully justified in saying my code was unsafe. What I realized is that I had implicitly been assuming that my mutations of the all_tags
set would never delete any existing values in the set. I can look at my code and see that that's the case. However, the borrow checker doesn't play those kinds of games. It deals with types and facts. And in fact, my code was not provably correct.
So now is really time to quit, and accept the Rc
s, or even just the String
s and wasted memory. We're all done. Please don't keep reading.
Time to get unsafe
OK, I lied. We're going to take one last step here. I'm not going to tell you this is a good idea. I'm not going to tell you this code is generally safe. I am going to tell you that it works in my testing, and that I refuse to commit it to the master branch of the project I'm working on.
We've got two issues:
- We have an unstated invariant that we never delete tags from our
all_tags
HashSet
- We need a mutable reference to the
HashSet
to insert, and that prevents taking immutable references for our tags
Let's fix this. We're going to define a new struct
, called an AppendSet
, which only provides the ability to insert new tags, not delete old ones.
struct AppendSet<T> {
inner: HashSet<T>,
}
We're going to provide three methods:
- A static method
new
, boring
- A
get_or_insert
that behaves just like HashSet
's, but only needs an immutable reference, not a mutable one
- An
inner
method that returns a reference to the internal HashSet
so we can reuse its Iterator
interface
The first and last are really easy. get_or_insert
is a bit more involved, let's just stub it out for now.
impl<T> AppendSet<T> {
fn new() -> Self {
AppendSet {
inner: HashSet::new(),
}
}
fn get_or_insert(&self, t: T) -> &T
where
T: Eq + std::hash::Hash,
{
unimplemented!()
}
fn inner(&self) -> &HashSet<T> {
&self.inner
}
}
Next, we'll redefine all_tags
as:
let all_tags: AppendSet<String> = AppendSet::new();
Note that we no longer have the mut
keyword here. We never need to mutate this thing... sort of. We'll interact with it via get_or_insert
, which at least claims it doesn't mutate. The only other change we have to make is in the call to gen_csv
, where we want to use the inner()
method:
gen_csv(all_tags.inner(), &posts)?;
And perhaps surprisingly, our code now compiles. There's only one thing left to do: implement that get_or_insert
method. And this is where the dirtiness happens.
fn get_or_insert(&self, t: T) -> &T
where
T: Eq + std::hash::Hash,
{
let const_ptr = self as *const Self;
let mut_ptr = const_ptr as *mut Self;
let this = unsafe { &mut *mut_ptr };
this.inner.get_or_insert(t)
}
That's right, unsafe
baby!
This code absolutely works. I'm also fairly certain it won't generally work. We are very likely violating invariants of HashSet
's interface. As one simple example, we now have the ability to change the contents of a HashSet
while there is an active iterate looping through it. I haven't investigated the internals of HashSet
, but I wouldn't be surprised at all to find out this breaks some invariants.
NOTE To address one of these concerns: what if we modified the inner
method on AppendSet
to consume the self
and return a HashSet
? That would definitely help us avoid accidentally violating invariants. But it also won't compile. The AppendSet
itself is immutably borrowed by the Post
values, and therefore we cannot move it.
So does this code work? It seems to. Will AppendSet
generally work for similar problems? I have no idea. Will this code continue to work with future versions of the standard library with changes to HashSet
's implementation? I have no idea. In other words: don't use this code. But it sure was fun to write.
Conclusion
That was certainly a fun excursion. It's a bit disappointing to end up at the ideal solution requiring unsafe to work. But the Rc
version is a really nice middle ground. And even the "bad" version isn't so bad.
A theoretically better answer would be to use a data structure specifically designed for this use case. I didn't do any investigation to see if such things existed already. If you have any advice, please let me know!
Check out other FP Complete Rust information.
Update: accidentally safe
On December 13, 2020, I received a really great email from Pierre Avital, who detailed why the code above is "accidentally safe." With permission, I'm including Pierre's comments here:
I find the unsafe version in your article to have one of the most wonderful cases of "working by accident" I've ever seen. It's actually memory safe, but not for the reasons your might think (you might have thought it through, but it's kind of a funny one so I'll explain my reasoning anyway).
See with other types, even &String
, this would have been technically unsafe (but most likely would never segfault anyway) despite you preventing deletion, because HashSet can't guarantee pointer stability when adding : if it runs out of capacity, it will ask for more memory to the allocator, which might give a completely different chunk of memory (although it is extremely unlikely for small sets). So upon adding an element, you might invalidate the other references (and still, the memory where they pointed would still need to be overwritten for a bug to appear, depending on paging and how lenient the OS is).
This specific issue could be avoided by using HashSet::with_capacity(upper_bound)
, provided you never add more elements than said upper_bound
.
But see, here's the magic of what you wrote: you used &str
, and &str
isn't a pointer to a String
, it's actually a slice, aka a begin and an end pointers, wrapped into a coat to trick you into thinking they're one pointer. This means the references used in Post
don't point to the HashSet's String
s, but directly to their content, which won't move in memory unless they are modified.
Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.
Do you like this blog post and need help with Next Generation Software Engineering, Platform Engineering or Blockchain & Smart Contracts? Contact us.