Avoiding duplicating strings in Rust

Based on actual events.

Let's say you've got a blog. The blog has a bunch of posts. Each post has a title and a set of tags. The metadata for these posts is all contained in TOML files in a single directory. (If you use Zola, you're pretty close to that.) And now you need to generate a CSV file showing a matrix of blog posts and their tags. Seems like a great job for Rust!

In this post, we're going to:

Explore how we'd solve this (fairly simple) problem
Investigate how Rust's types tell us a lot about memory usage
Play with some nice and not-so-nice ways to optimize our program

Onwards!

Program behavior

We've got a bunch of TOML files sitting in the posts directory. Here are some example files:

# devops-for-developers.toml
title = "DevOps for (Skeptical) Developers"
tags = ["dev", "devops"]

# rust-devops.toml
title = "Rust with DevOps"
tags = ["devops", "rust"]

We want to create a CSV file that looks like this:

Title,dev,devops,rust,streaming
DevOps for (Skeptical) Developers,true,true,false,false
Rust with DevOps,false,true,true,false
Serverless Rust using WASM and Cloudflare,false,true,true,false
Streaming UTF-8 in Haskell and Rust,false,false,true,true

To make this happen, we need to:

Iterate through the files in the posts directory
Load and parse each TOML file
Collect a set of all tags present in all posts
Collect the parsed post information
Create a CSV file from that information

Not too bad, right?

Setup

You should make sure you've installed the Rust tools. Then you can create a new empty project with cargo new tagcsv.

Later on, we're going to play with some unstable language features, so let's opt into a nightly version of the compiler. To do this, create a rust-toolchain file containing:

nightly-2020-08-29

Then add the following dependencies to your Cargo.toml file:

[dependencies]
csv = "1.1.3"
serde = "1.0.115"
serde_derive = "1.0.115"
toml = "0.5.6"

OK, now we can finally work on some code!

First version

We're going to use the toml crate to parse our metadata files. toml is built on top of serde, and we can conveniently use serde_derive to automatically derive a Deserialize implementation for a struct that represents that metadata. So we'll start off our program with:

use serde_derive::Deserialize;
use std::collections::HashSet;

#[derive(Deserialize)]
struct Post {
    title: String,
    tags: HashSet<String>,
}

Next, we'll define our main function to load the data:

fn main() -> Result<(), std::io::Error> {
    // Collect all tags across all of the posts
    let mut all_tags: HashSet<String> = HashSet::new();
    // And collect the individual posts
    let mut posts: Vec<Post> = Vec::new();

    // Read in the files in the posts directory
    let dir = std::fs::read_dir("posts")?;
    for entry in dir {
        // Error handling
        let entry = entry?;
        // Read the file contents as a String
        let contents = std::fs::read_to_string(entry.path())?;
        // Parse the contents with the toml crate
        let post: Post = toml::from_str(&contents)?;
        // Add all of the tags to the all_tags set
        for tag in &post.tags {
            all_tags.insert(tag.clone());
        }
        // Update the Vec of posts
        posts.push(post);
    }
    // Generate the CSV output
    gen_csv(&all_tags, &posts)?;
    Ok(())
}

And finally, let's define our gen_csv function to take the set of tags and the Vec of posts and generate the output file:

fn gen_csv(all_tags: &HashSet<String>, posts: &[Post]) -> Result<(), std::io::Error> {
    // Open the file for output
    let mut writer = csv::Writer::from_path("tag-matrix.csv")?;

    // Generate the header, with the word "Title" and then all of the tags
    let mut header = vec!["Title"];
    for tag in all_tags.iter() {
        header.push(tag);
    }
    writer.write_record(header)?;

    // Print out a separate row for each post
    for post in posts {
        // Create a record with the post title...
        let mut record = vec![post.title.as_str()];
        for tag in all_tags {
            // and then a true or false for each tag name
            let field = if post.tags.contains(tag) {
                "true"
            } else {
                "false"
            };
            record.push(field);
        }
        writer.write_record(record)?;
    }
    writer.flush()?;
    Ok(())
}

Side note: it would be slightly nicer to alphabetize the set of tags, which you can do by collecting all of the tags into a Vec and then sorting it. I had that previously, but removed it in the code above to reduce incidental noise to the example. If you feel like having fun, try adding that back.

Anyway, this program works exactly as we want, and produces a CSV file. Perfect, right?

Let the types guide you

I love type-driven programming. I love the idea that looking at the types tells you a lot about the behavior of your program. And in Rust, the types can often tell you about the memory usage of your program. I want to focus on two lines, and then prove a point with a third. Consider:

tags: HashSet<String>,

and

let mut all_tags: HashSet<String> = HashSet::new();

Firstly, I love the fact that the types tell us so much about expected behavior. The tags are a set: the order is unimportant, and there are no duplicates. That makes sense. We don't want to list "devops" twice in our set of all tags. And there's nothing inherently "first" or "second" about "dev" vs "rust". And we know that tags are arbitrary pieces of textual data. Awesome.

But what I really like here is that it tells us about memory usage. Each post has its own copy of each tag. So does the all_tags set. How do I know this? Easy: because that's exactly what String means. There's no possibility of data sharing, at all. If we have 200 posts tagged "dev", we will have 201 copies of the string "dev" in memory (200 for the posts, once for the all_tags).

And now that we've seen it in the types, we can see evidence of it in the implementation too:

all_tags.insert(tag.clone());

That .clone() bothered me when I first wrote it. And that's what got me to look at the types, which bothered me further.

In reality, this is nothing to worry about. Even with 1,000 posts, averaging 5 tags, with each tag averaging 20 bytes, this will only take up an extra 100,000 bytes of memory. So optimizing this away is not a good use of our time. We're much better off doing something else.

But I wanted to have fun. And if you're reading this post, I think you want to continue this journey too. Onwards!

Rc

This isn't the first solution I tried. But it's the first one that worked easily. So we'll start here.

The first thing we have to change is our types. As long as we have HashSet<String>, we know for a fact that we'll have extra copies of the data. This seems like a nice use case for Rc. Rc uses reference counting to let multiple values share ownership of another value. Sounds like exactly what we want!

My approach here is to use compiler-error-driven development, and I encourage you to play along with your own copy of the code. First, let's use Rc:

use std::rc::Rc;

Next, let's change our definition of Post to use an Rc<String> instead of String:

#[derive(Deserialize)]
struct Post {
    title: String,
    tags: HashSet<Rc<String>>,
}

The compiler doesn't like this very much. We can't derive Deserialize for an Rc<String>. So instead, let's make a RawPost struct for the deserializing, and then dedicate Post for holding the data with Rc<String>. In other words:

#[derive(Deserialize)]
struct RawPost {
    title: String,
    tags: HashSet<String>,
}

struct Post {
    title: String,
    tags: HashSet<Rc<String>>,
}

And then, when parsing the toml, we'll parse into a RawPost type:

let post: RawPost = toml::from_str(&contents)?;

If you're following along, you'll only have one error message at this point about posts.push(post); having a mismatch between Post and RawPost. But before we address that, let's make one more type change above. I want to make all_tags contain Rc<String>.

let mut all_tags: HashSet<Rc<String>> = HashSet::new();

OK, now we've got some nice error messages about mismatches between Rc<String> and String. This is where we have to be careful. The easiest thing to do would be to simply wrap our Strings in an Rc and end up with lots of copies of String. Let's implement the next bit incorrectly first to see what I'm talking about.

At this point in our code rewrite, we've got a RawPost, and we need to:

Add its tags to all_tags
Create a new Post value based on the RawPost
Add the Post to the posts Vec

Here's the simple and wasteful implementation:

let raw_post: RawPost = toml::from_str(&contents)?;

let mut post_tags: HashSet<Rc<String>> = HashSet::new();

for tag in raw_post.tags {
    let tag = Rc::new(tag);
    all_tags.insert(tag.clone());
    post_tags.insert(tag);
}

let post = Post {
    title: raw_post.title,
    tags: post_tags,
};
posts.push(post);

The problem here is that we always keep the original String from the RawPost. If that tag is already present in the all_tags set, we don't end up using the same copy.

There's an unstable method on HashSets that helps us out here. get_or_insert will try to insert a value into a HashSet. If the value is already present, it will drop the new value and return a reference to the original value. If the value isn't present, the value is added to the HashSet and we get a reference back to it. Changing our code to use that is pretty easy:

for tag in raw_post.tags {
    let tag = Rc::new(tag);
    let tag = all_tags.get_or_insert(tag);
    post_tags.insert(tag.clone());
}

We still end up with a .clone() call, but now it's a clone of an Rc, which is a cheap integer increment. No additional memory allocation required! Since this method is unstable, we also have to enable the feature by adding this at the top of your source file:

#![feature(hash_set_entry)]

And only one more change required. The signature for gen_csv is expecting a &HashSet<String>. If you change that to &HashSet<Rc<String>>, the code will compile and run correctly. Yay!

In case you got lost with all of the edits above, here's the current version of main.rs:

Quibbles

I already told you that the original HashSet<String> version of the code is likely Good Enough™ for most cases. I'll tell you that, if you're really bothered by that overhead, the HashSet<Rc<String>> version if almost certainly the right call. So we should probably just stop here and end the blog post on a nice, safe note.

But let's be bold and crazy. I don't actually like this version of the code that much, for two reasons:

The Rc feels dirty here. Rc is great for weird lifetime situations with values. But in our case, we know that the all_tags set, which owns all of the tags, will always outlive the usage of the tags inside the Posts. So reference counting feels like an unnecessary overhead and obscuring the situation.
As demonstrated before, it's all too easy to mess up with the Rc<String> version. You can accidentally bypass all of the memory saving benefits by using a new String instead of cloning a reference to an existing one.

What I'd really like to do is to have all_tags be a HashSet<String> and own the tags themselves. And then, inside Post, I'd like to keep references to those tags. Unfortunately, this doesn't quite work. Can you foresee why? If not, don't worry, I didn't see it until the borrow checker told me how wrong I was a few times. Let's experience that joy together. And we'll do it with compiler-driven development again.

The first thing I'm going to do is remove the use std::rc::Rc; statement. That leads to our first error: Rc isn't in scope for Post. We want to keep a &str in this struct. But we have to be explicit about lifetimes when holding references in structs. So our code ends up as:

struct Post<'a> {
    title: String,
    tags: HashSet<&'a str>,
}

The next error is about the definition of all_tags in main. That's easy enough: just take out the Rc:

let mut all_tags: HashSet<String> = HashSet::new();

This is easy! Similarly, post_tags is defined as a HashSet<Rc<String>>. In this case, we want to hold &strs instead, so:

let mut post_tags: HashSet<&str> = HashSet::new();

We no longer need to use Rc::new in the for loop, or clone the Rc. So our loop simplifies down to:

for tag in raw_post.tags {
    let tag = all_tags.get_or_insert(tag);
    post_tags.insert(tag);
}

And (misleadingly), we just have one error message left: the signature for gen_csv still uses a Rc. We'll get rid of that with the new signature:

fn gen_csv(all_tags: &HashSet<String>, posts: &[Post]) -> Result<(), std::io::Error> {

And we get an (IMO confusing) error message about &str and &String not quite lining up:

error[E0277]: the trait bound `&str: std::borrow::Borrow<std::string::String>` is not satisfied
  --> src\main.rs:67:38
   |
67 |             let field = if post.tags.contains(tag) {
   |                                      ^^^^^^^^ the trait `std::borrow::Borrow<std::string::String>` is not implemented for `&str`

But this can be solved by explicitly asking for a &str via the as_str method:

let field = if post.tags.contains(tag.as_str()) {

And you might think we're done. But this is where the "misleading" idea comes into play.

The borrow checker wins

If you've been following along, you should now see an error message on your screen that looks something like:

error[E0499]: cannot borrow `all_tags` as mutable more than once at a time
  --> src\main.rs:35:23
   |
35 |             let tag = all_tags.get_or_insert(tag);
   |                       ^^^^^^^^ mutable borrow starts here in previous iteration of loop

error[E0502]: cannot borrow `all_tags` as immutable because it is also borrowed as mutable
  --> src\main.rs:46:13
   |
35 |             let tag = all_tags.get_or_insert(tag);
   |                       -------- mutable borrow occurs here
...
46 |     gen_csv(&all_tags, &posts)?;
   |             ^^^^^^^^^  ------ mutable borrow later used here
   |             |
   |             immutable borrow occurs here

I was convinced that the borrow checker was being overly cautious here. Why would a mutable borrow of all_tags to insert a tag into the set conflict with an immutable borrow of the tags inside the set? (If you already see my error, feel free to laugh at my naivete.) I could follow why I'd violated borrow check rules. Specifically: you can't have a mutable reference and any other reference live at the same time. But I didn't see how this was actually stopping my code from segfaulting.

After a bit more thinking, it clicked. I realized that I had an invariant in my head which did not appear anywhere in my types. And therefore, the borrow checker was fully justified in saying my code was unsafe. What I realized is that I had implicitly been assuming that my mutations of the all_tags set would never delete any existing values in the set. I can look at my code and see that that's the case. However, the borrow checker doesn't play those kinds of games. It deals with types and facts. And in fact, my code was not provably correct.

So now is really time to quit, and accept the Rcs, or even just the Strings and wasted memory. We're all done. Please don't keep reading.

Time to get unsafe

OK, I lied. We're going to take one last step here. I'm not going to tell you this is a good idea. I'm not going to tell you this code is generally safe. I am going to tell you that it works in my testing, and that I refuse to commit it to the master branch of the project I'm working on.

We've got two issues:

We have an unstated invariant that we never delete tags from our all_tags HashSet
We need a mutable reference to the HashSet to insert, and that prevents taking immutable references for our tags

Let's fix this. We're going to define a new struct, called an AppendSet, which only provides the ability to insert new tags, not delete old ones.

struct AppendSet<T> {
    inner: HashSet<T>,
}

We're going to provide three methods:

A static method new, boring
A get_or_insert that behaves just like HashSet's, but only needs an immutable reference, not a mutable one
An inner method that returns a reference to the internal HashSet so we can reuse its Iterator interface

The first and last are really easy. get_or_insert is a bit more involved, let's just stub it out for now.

impl<T> AppendSet<T> {
    fn new() -> Self {
        AppendSet {
            inner: HashSet::new(),
        }
    }

    fn get_or_insert(&self, t: T) -> &T
    where
        T: Eq + std::hash::Hash,
    {
        unimplemented!()
    }

    fn inner(&self) -> &HashSet<T> {
        &self.inner
    }
}

Next, we'll redefine all_tags as:

let all_tags: AppendSet<String> = AppendSet::new();

Note that we no longer have the mut keyword here. We never need to mutate this thing... sort of. We'll interact with it via get_or_insert, which at least claims it doesn't mutate. The only other change we have to make is in the call to gen_csv, where we want to use the inner() method:

gen_csv(all_tags.inner(), &posts)?;

And perhaps surprisingly, our code now compiles. There's only one thing left to do: implement that get_or_insert method. And this is where the dirtiness happens.

fn get_or_insert(&self, t: T) -> &T
where
    T: Eq + std::hash::Hash,
{
    let const_ptr = self as *const Self;
    let mut_ptr = const_ptr as *mut Self;
    let this = unsafe { &mut *mut_ptr };
    this.inner.get_or_insert(t)
}

That's right, unsafe baby!

This code absolutely works. I'm also fairly certain it won't generally work. We are very likely violating invariants of HashSet's interface. As one simple example, we now have the ability to change the contents of a HashSet while there is an active iterate looping through it. I haven't investigated the internals of HashSet, but I wouldn't be surprised at all to find out this breaks some invariants.

NOTE To address one of these concerns: what if we modified the inner method on AppendSet to consume the self and return a HashSet? That would definitely help us avoid accidentally violating invariants. But it also won't compile. The AppendSet itself is immutably borrowed by the Post values, and therefore we cannot move it.

So does this code work? It seems to. Will AppendSet generally work for similar problems? I have no idea. Will this code continue to work with future versions of the standard library with changes to HashSet's implementation? I have no idea. In other words: don't use this code. But it sure was fun to write.

Conclusion

That was certainly a fun excursion. It's a bit disappointing to end up at the ideal solution requiring unsafe to work. But the Rc version is a really nice middle ground. And even the "bad" version isn't so bad.

A theoretically better answer would be to use a data structure specifically designed for this use case. I didn't do any investigation to see if such things existed already. If you have any advice, please let me know!

Check out other FP Complete Rust information.

Update: accidentally safe

On December 13, 2020, I received a really great email from Pierre Avital, who detailed why the code above is "accidentally safe." With permission, I'm including Pierre's comments here:

I find the unsafe version in your article to have one of the most wonderful cases of "working by accident" I've ever seen. It's actually memory safe, but not for the reasons your might think (you might have thought it through, but it's kind of a funny one so I'll explain my reasoning anyway).

See with other types, even &String, this would have been technically unsafe (but most likely would never segfault anyway) despite you preventing deletion, because HashSet can't guarantee pointer stability when adding : if it runs out of capacity, it will ask for more memory to the allocator, which might give a completely different chunk of memory (although it is extremely unlikely for small sets). So upon adding an element, you might invalidate the other references (and still, the memory where they pointed would still need to be overwritten for a bug to appear, depending on paging and how lenient the OS is). This specific issue could be avoided by using HashSet::with_capacity(upper_bound), provided you never add more elements than said upper_bound.

But see, here's the magic of what you wrote: you used &str, and &str isn't a pointer to a String, it's actually a slice, aka a begin and an end pointers, wrapped into a coat to trick you into thinking they're one pointer. This means the references used in Post don't point to the HashSet's Strings, but directly to their content, which won't move in memory unless they are modified.

Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.

Do you like this blog post and need help with Next Generation Software Engineering, Platform Engineering or Blockchain & Smart Contracts? Contact us.