How Strings Work In Rust

Jayson Lennon

If you’ve ever used Rust, then there’s a good chance you’ve run into strings like String and &str.

You may also have seen the methods as_str() and to_string().

If all you’ve known is Rust then this probably doesn’t stand out to you. Big whoop right?

But if you have a background in other languages? Well, then you’ll probably be both surprised and confused by these multiple string types.

Why do you need so many? What do they do?

The good news is we’re going to break all this down and more. In fact, we’ll look at:

  • The differences between the Rust string types
  • When to use which string
  • And help you to understand why Rust has different kinds of strings in the first place

So let’s get into it.

The 2 main reasons for multiple string types in Rust

There are two main reasons why Rust has different types of strings.

  1. The first reason is character encoding (external to Rust)
  2. The second is the ownership system imposed by the language

Reason #1. Character encoding

Computers operate on binary digits 0 and 1, and these digits get grouped into bytes (8 binary digits).

However, humans work with characters like A or B, so we need a way to convert an A into a sequence of bytes before a computer can use it.

And this is where character encoding comes in.

Character encoding works by mapping sequences of bytes to human-readable characters.

Different mappings are available, which is one of the reasons for the different string types: each string type supports one encoding. The good news is that the type of string you're working with always tells you which encoding you're using.

We'll go into more detail on how this works later.
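As a rough illustration of the mapping idea, here's a small sketch that encodes the same text as UTF-8 bytes and as UTF-16 code units using standard library methods. The helper names utf8_bytes and utf16_units are made up for this example.

```rust
/// Hypothetical helper: the UTF-8 bytes that Rust's `str` uses to store the text.
fn utf8_bytes(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}

/// Hypothetical helper: the same text re-encoded as UTF-16 code units.
fn utf16_units(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

For "A", both encodings hold the value 0x41. For é written as the single scalar U+00E9, UTF-8 needs two bytes (c3 a9) while UTF-16 needs one 16-bit unit (00e9).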

Reason #2. Ownership

Rust uses ownership rules to manage memory in a program.

These rules dictate that all data has an owner somewhere in the program. The rules also allow owners to lend data out to other parts of a program using borrowing.

Important: Borrowing is efficient because the data does not get copied.

Ownership creates two string types with different uses. The first is an "owned" type representing a heap-allocated string. This string type allows for resizing. The second is a "borrowed" type which acts as a slice (or view) into string data owned elsewhere, usually a heap-allocated "owned" string.
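To make the owned/borrowed split a little more concrete, here's a minimal sketch. The function names are invented for illustration: one function only borrows the text, one takes ownership so it can grow the string, and one checks that a borrowed view points at the owner's buffer rather than at a copy.

```rust
/// Borrows the text: the caller keeps ownership and nothing is copied.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

/// Takes ownership: the function may grow the string and hand it back.
fn with_greeting(mut name: String) -> String {
    name.insert_str(0, "Hello, ");
    name
}

/// Returns true when both views start at the same address in memory.
fn shares_buffer(owner: &str, view: &str) -> bool {
    owner.as_ptr() == view.as_ptr()
}
```

Calling first_word on a borrowed String returns a slice of that same String, which is why borrowing is cheap.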

The 7 different Rust string types

Even though the string situation in Rust may seem daunting, the different types of strings simplify things:

  • If you have the "borrowed" kind of string, you can look at it and (if mutable) change characters
  • If you have the "owned" kind of string, you can modify it by adding, deleting, and changing characters. You can also delete the entire string
  • Each kind of string uses a specified encoding, so there's no way to use the wrong encoding when accessing data from an external data source. The inverse of this is true as well: it's not possible to produce strings with incorrect encoding

String, &str, and &'static str

These 3 string types are the normal Rust strings that use UTF-8 encoding for Unicode support.

If you don't need to interface with the operating system or with other languages, this is the kind of string you'll use in Rust.

The String type is an owned data type stored on the heap. Since it exists on the heap, resizing the String is possible by adding or removing characters.

The &str type is a slice (or view) into string data stored elsewhere, usually a heap-allocated String, with the exception of &'static str. The underlying str is a dynamically-sized type, so you'll almost always see it behind a reference.

A &'static str is a view into existing string data but is not always borrowing from a String.

For example

Whenever some hard-coded string data is in a program like this:

let msg = "hello";

Then we know that the data type is &'static str.

This example is actually a special case where the compiler will place "hello" into a known location in memory within the generated executable file.

This means that when the system loads the executable file to run it, "hello" gets included in memory as part of the loading process, and msg can then borrow the data from that loaded memory.

The &'static means the data will exist in memory for the remainder of the program, and since the program is running, "hello" will always be in memory.

Key points to remember: When working with &str, you’re borrowing data from somewhere else. Therefore, you can look at the string data, but you cannot change it.

If you have a &mut str, you can modify existing characters, but you can only add or remove characters if you get mutable access to the original String that lent you the &mut str.
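Here's a small sketch of those three situations, with made-up function names: a literal returned as &'static str, an in-place change through &mut str, and a change that needs the owning String because the length grows.

```rust
/// String literals live in the executable, so they can be returned as `&'static str`.
fn default_greeting() -> &'static str {
    "hello"
}

/// With `&mut str`, existing ASCII letters can change, but the length cannot.
fn shout_in_place(text: &mut str) {
    text.make_ascii_uppercase();
}

/// Growing the text needs mutable access to the owning `String`.
fn shout_and_extend(text: &mut String) {
    text.make_ascii_uppercase();
    text.push('!');
}
```

Note that make_ascii_uppercase only touches ASCII letters, which is what keeps the byte length (and the UTF-8 validity) of the borrowed data unchanged.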

OsString and &OsStr

These strings get used when interfacing with the operating system.

Each operating system uses a specified character encoding, and these strings use the appropriate encoding expected by the underlying operating system. Therefore, if you need to execute a system call or work with file paths, use one of these strings.

The naming follows the same pattern as String and &str:

  • OsString is the "owned" kind, while
  • &OsStr is the "borrowed" kind

Besides the names, the main difference between OsString and String is the character encoding.

For example

Here’s a description of the different encoding schemes used by OsString, straight from the docs:

  • On Unix systems, strings are often arbitrary sequences of non-zero bytes. In many cases, Unix systems interpret these sequences as UTF-8
  • On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it’s valid to do so
  • In Rust, strings are always valid UTF-8, which may contain zeros
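File paths are probably where you'll meet these types first. The sketch below (helper names are hypothetical) borrows an &OsStr from a Path and builds an owned OsString from it, using only the standard library.

```rust
use std::ffi::{OsStr, OsString};
use std::path::Path;

/// Hypothetical helper: paths hand out their components as `&OsStr`.
fn extension_of(path: &Path) -> Option<&OsStr> {
    path.extension()
}

/// Hypothetical helper: build an owned `OsString` from a borrowed `&OsStr`.
fn with_backup_suffix(name: &OsStr) -> OsString {
    let mut owned = name.to_os_string();
    owned.push(".bak");
    owned
}
```

Just like String, the owned OsString can grow with push, while &OsStr is a read-only view.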

CString and &CStr

These strings are compatible with the C programming language. You'll need to use these if you write code that interfaces with C via FFI.

CString is the "owned" kind of string, and &CStr is the "borrowed" kind.

C-compatible strings end with nul, cannot have nul in the middle of the string, and are a sequence of arbitrary bytes.
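A quick sketch of those rules, assuming nothing beyond std::ffi (the helper names are invented): creating a CString fails when the input has a nul in the middle, and going back to &str only works if the bytes happen to be valid UTF-8.

```rust
use std::ffi::{CStr, CString, NulError};

/// Hypothetical helper: fails if the text contains an interior nul byte.
fn to_c_string(text: &str) -> Result<CString, NulError> {
    CString::new(text)
}

/// Hypothetical helper: a `&CStr` holds arbitrary bytes, so reading it as
/// `&str` only succeeds when those bytes are valid UTF-8.
fn c_str_to_text(c_text: &CStr) -> Option<&str> {
    c_text.to_str().ok()
}
```

CString::new appends the trailing nul for you, so you never write it yourself.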

Making new strings

Now that we’ve covered the differences between String and &str, let's look at how we can work with these types through some example code.

Before we can use a String, we need to have one, so let's start with how to make an owned String:

// using From/Into
let foo = String::from("foo");
let foo: String = "foo".into();

// copy from &str
let foo = "foo".to_owned();
let foo = "foo".to_string();

// formatted String
let foobar = format!("{foo}bar");

// empty mutable String:
let mut foo = String::new();
// push another &str onto the end:
foo.push_str("fo");
// push a single character:
foo.push('o');

// String from bytes
let foo = vec![102, 111, 111];
let foo = String::from_utf8(foo).unwrap();

The String::from() and "foo".into() utilize the From and Into traits.

String::from() makes a String from the &str, while "foo".into() turns the &str into a String. Both are great ways to make a String when you don't need to perform any formatting.

The .to_owned() method makes a copy of "foo" (which is a &str) and gives you a heap-allocated String to work with. The .to_string() method does the same thing as .to_owned(), but may signal intent better depending on the surrounding code.

format! is great because you can use formatting arguments to create a formatted String at the time of creation.

If you need to edit a String right away, or want to build one up from component parts, String::new() creates an empty string to work with. Using methods on the String type then allows adding or removing parts of the string.

We can also make a String from raw bytes using String::from_utf8().

Remember that the string types in Rust adhere to a specified character encoding, so make sure the bytes are valid for the string type you’re using. If they’re not valid, you'll get an error.
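Here's a minimal sketch of what that error handling can look like. The helper names are made up; String::from_utf8 and String::from_utf8_lossy are the standard library functions doing the work.

```rust
use std::string::FromUtf8Error;

/// Strict: returns an error if the bytes are not valid UTF-8.
fn decode_strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Lenient: invalid sequences become U+FFFD, the replacement character.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

The byte 0xff never appears in valid UTF-8, so it's a handy value for trying out the error path.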

These techniques to create strings are helpful in different circumstances and provide plenty of options to write clean, readable code.

Borrowing strings

Now that we can make a String, we need to know how to use it with functions. Remember that copying a String is inefficient, so we want to borrow as much as possible.

There are two ways to get a &str from a String. The first is to use the as_str() method:

let foo = String::from("foo");
let borrowed = foo.as_str();

The second method involves the use of the Deref trait.

Deref will interpret one type as another, but only when borrowing. String has an implementation of Deref to borrow as &str, so we can do this:

let foo = String::from("foo");

// this is &str:
let borrowed: &str = &foo;

// this is &String:
let borrowed = &foo;

The type annotation on the binding tells the compiler which type we want, and it then uses Deref to interpret the &String as a &str (this is known as deref coercion). Without the annotation, we just get a &String.

But we don't have to always use annotations on the binding like we did here. Having &str as the type for a function parameter works as well:

fn print_string(s: &str) {
    println!("{s}")
}
let foo: String = "foo".into();
// ok: `print_string` uses &str annotation
print_string(&foo);
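As a small extension of the same idea, here's a sketch (word_count is a made-up function) showing that a &str parameter accepts a string literal, a borrowed String, and a slice of a String, all without copying.

```rust
/// Taking `&str` lets callers pass a literal, a borrowed `String`,
/// or a slice of either one.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```

This is why &str is usually the more flexible choice for read-only parameters compared to &String.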

String methods

The functionality implemented on the owned and borrowed string types differs because their intended usage differs.

For example

The owned string types (String, OsString, and CString) have functionality related to the manipulation of the string, such as adding and removing characters.

On the other hand, the borrowed versions (&str, &OsStr, and &CStr) have functionality related to reading the string, such as searching for characters.

If you have a String and want to search for characters, you don't need to go through any conversion processes. You also don’t need to borrow to get a &str.

Simply use this:

let foo = String::from("foo");
assert!(foo.contains("oo"));

The .contains() method has an implementation on &str, but not on String. Yet we’re still able to use it with the String.

What's happening?

The signature for .contains() is contains(&self, pattern) -> bool which borrows &self.

Since String implements Deref, and &self is a borrow, Deref can use the &str type instead of the String type for the .contains() function.

Because of this, everything implemented on &str is also available on String, in addition to String's own functionality: we can always borrow a String to get a &str.

Important: This only works one way, though! We cannot access String functions if we have just a &str available.
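If all you have is a &str and you need a String method such as push_str, the usual way out is to make an owned copy first. A minimal sketch, with a hypothetical with_suffix helper:

```rust
/// `push_str` belongs to `String`, so a `&str` must be copied into an
/// owned `String` before it can grow.
fn with_suffix(text: &str, suffix: &str) -> String {
    let mut owned = text.to_owned();
    owned.push_str(suffix);
    owned
}
```

The copy is the price of going from borrowed to owned, which is why it's worth borrowing whenever you only need to read.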

Converting between string types

Sometimes you need an OsString, but all you have available is a String or &str. In these situations, you can perform a lossless conversion of String into an OsString like this:

use std::ffi::OsString;
let foo = String::from("foo");
let os_foo = OsString::from(foo);

The inverse is not true, however: an OsString is not guaranteed to hold UTF-8 and may contain data which is not valid Unicode. String requires valid Unicode data, so converting an OsString to String will require error handling:

use std::ffi::OsString;
let os_foo = OsString::from("foo");
let foo: Option<&str> = os_foo.to_str();

If .to_str() returns None (or if you don't care whether the conversion is exact), .to_string_lossy() is an alternative. .to_string_lossy() will replace any invalid byte sequences with the Unicode replacement character (�, U+FFFD):

use std::ffi::OsString;
let os_foo = OsString::from("foo");
let foo = os_foo.to_string_lossy(); // no errors to worry about

This replacement works, but be aware that you'll lose information during the conversion.

If you need to send the original OsString back to the system, you'll need to keep a copy of it for later because you won't be able to recover it from the String.
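One way to avoid losing the original is OsString::into_string(), which hands the untouched OsString back when the conversion fails. Here's a rough sketch; the helper names are made up for this example.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: returns the text when the conversion is lossless,
/// or gives the original `OsString` back so nothing is lost.
fn text_or_original(os_text: OsString) -> Result<String, OsString> {
    os_text.into_string()
}

/// Hypothetical helper: a lossy copy for display only. The caller keeps
/// the original because this function only borrows it.
fn display_copy(os_text: &OsStr) -> String {
    os_text.to_string_lossy().into_owned()
}
```

With this split, the lossy String is only ever used for showing text to a person, and the original OsString is what goes back to the system.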

Dealing with individual characters

Working with individual characters is quite complicated so before we can do something like iterate over the characters in a string, we need to go into the details of UTF-8 character encoding.

Understanding UTF-8 character encoding

At its core, a String is a sequence of bytes. These bytes correspond to entries in the UTF-8 encoding map, which uses Unicode and supports all world languages.

There are a large number of characters when all human languages get combined into a single set. Many more than 256 characters. But a byte can only hold 256 distinct values (0 through 255), meaning the most we can fit in a single byte is 256 distinct characters.

Therefore, to accommodate every human language, most characters need to consume more than one byte of memory.

Let's see how this works using the word pâté (a kind of finely chopped meat), which has four characters.

We know this word will consume at least four bytes because there are four characters. However, since there isn't enough space in a byte to cover every character in every language, we must use multiple bytes for some characters.

Here’s a table showing the byte sequence for each character of the word pâté:

Char  Bytes
--------------
p     70
â     c3 a2
t     74
é     65 cc 81

Since there are characters in pâté which consume more than one byte, trying to interpret each byte as an individual character results in the following:

Char  Bytes
-----------
p     70
Ã     c3
¢     a2
t     74
e     65
Ì     cc
�     81

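This happens because some bytes (like c3) mark the beginning of a multi-byte sequence, and the bytes that follow (like a2) only make sense as part of that sequence. When each byte is decoded on its own, both the leading byte and its continuation bytes get treated as single-byte characters, and we end up with entirely different characters.

This means that iterating over the bytes of a String, or slicing a String at arbitrary byte positions, won't give you whole characters. Rust guards against part of this: slicing in the middle of a multi-byte character panics at runtime. So if you need to do either of these, be extra careful to stay on character boundaries.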
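Here's a short sketch of staying on character boundaries using only the standard library. The helper names are invented, and the test string is written with escapes ("p\u{e2}te\u{301}") so the byte layout matches the first table above exactly.

```rust
/// Hypothetical helper: a checked slice that returns `None` instead of
/// panicking when `end` falls inside a multi-byte character.
fn prefix(text: &str, end: usize) -> Option<&str> {
    text.get(..end)
}

/// Hypothetical helper: the raw UTF-8 bytes, formatted as hex.
fn hex_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{b:02x}")).collect()
}
```

For that string, prefix(pate, 2) is None because byte 2 is the second half of â, while prefix(pate, 3) gives back "pâ". The .is_char_boundary() method on &str answers the same question without slicing.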

Using Scalars

There’s a method on &str called .chars() which sounds like what we want.

.chars() iterates over Unicode scalar values, and scalar values are mostly characters, so this should work, but let’s check:

let pate = "pâté";
for ch in pate.chars() {
    dbg!(ch);
}
// There are only 4 characters in the string,
// but we have a count of 5...
assert_eq!(pate.chars().count(), 5);
// output:
// [..] ch = 'p'
// [..] ch = 'â'
// [..] ch = 't'
// [..] ch = 'e'
// [..] ch = '\u{301}'   // what?

Unicode supports every character, and it also has a scalar for every common character. However, this method won’t work, because there isn't a scalar for all variations of every character.

Even though some scalars can be entire characters, they can also be diacritic marks like an acute accent (´). To produce these characters, Unicode uses combining marks, which get combined with the previous scalar, to create complete characters such as the é in pâté.

But, since .chars() iterates over scalars, it fails to provide us with the actual character we expect. Instead, it provides us with some characters and diacritics, but not a combination.
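To see why counting scalars is unreliable, here's a small sketch comparing two ways of storing é: as the single scalar U+00E9, and as e followed by the combining acute accent U+0301. The helper names are made up, and the strings use escapes so there's no doubt about which form is in play.

```rust
/// Number of Unicode scalar values (`char`s) in the text.
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}

/// True when the scalar is the combining acute accent, U+0301.
fn is_combining_acute(scalar: char) -> bool {
    scalar == '\u{301}'
}
```

Both forms look the same on screen, but scalar_count returns 1 for "\u{e9}" and 2 for "e\u{301}", and the two strings don't compare as equal either.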

So, if we can't use bytes and scalars, what can we use?

The answer is grapheme clusters.

Grapheme Clusters

A grapheme cluster is a "user-perceived character" and is the closest thing we have to a single "character". Grapheme clusters consist of one or more scalars, including the combining marks that pull together to make a single character.

To access the grapheme clusters of a string, we need to pull in the unicode-segmentation crate.

This crate adheres to the Unicode standards for grapheme clusters and word boundaries, and it also provides iterators over grapheme clusters and words:

//! ```cargo
//! [dependencies]
//! unicode-segmentation = "1.10.0"
//! ```

use unicode_segmentation::UnicodeSegmentation;

let pate = "pâté";
for ch in pate.graphemes(true) {
    dbg!(ch);
}
// The assertion passes with a count of 4
// since there are 4 grapheme clusters
assert_eq!(pate.graphemes(true).count(), 4);
// output:
// [..] ch = "p"
// [..] ch = "â"
// [..] ch = "t"
// [..] ch = "e\u{301}"

So let’s break this down.

The first three lines of output from this program show individual characters, and each of these has a corresponding scalar value in Unicode because they are base characters (most characters are base characters).

However, the last line has an output of "e\u{301}", which shows us we’re using the scalars e and \u{301} to form the grapheme cluster.

Also, \u{301} is the acute accent combining mark that combines with e to create é, so it works!

And that's it for strings in Rust

Whew - that was a lot of information! Hopefully, this tutorial has helped you to understand strings in Rust a little better, and when to use each.

Don’t feel bad if you need to read through it a few times, but make sure you have a grasp before moving on. Unfortunately, due to the encoding requirements imposed by different systems and the number of human languages used worldwide, this is all pretty vital to learn.

However, if there's one thing to remember from this guide, it's this: reach for the unicode-segmentation crate whenever you need to work with the individual characters of a string. Otherwise, your string handling code has a good chance of being incorrect, regardless of which string type you use.

And if you have any more questions or want to learn more about Rust?

Check out my Rust Programming course, where you’ll learn everything you need to know to confidently use the world’s most loved programming language!

You can also ask questions in the dedicated Discord server and chat with other Rust users, as well as with me!
