If you’ve ever used Rust, then there’s a good chance you’ve run into string types like String and &str.
You may also have seen the methods as_str() and to_string().
If all you’ve known is Rust then this probably doesn’t stand out to you. Big whoop right?
But if you have a background in other languages? Well, then you’ll probably be both surprised and confused by these multiple string types.
Why do you need so many? What do they do?
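The good news is we’re going to break all this down and more. In fact, we’ll look at why Rust has several string types, what each type is for, and how to create, borrow, convert, and iterate over strings.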
So let’s get into it.
There are two main reasons why Rust has different types of strings.
Computers operate on binary digits 0 and 1, and these digits get grouped into bytes (8 binary digits).
However, because humans understand characters like A or B, we need to convert an A to a sequence of bytes to use it on a computer.
And this is where character encoding comes in.
Character encoding works by mapping sequences of bytes to human-readable characters.
Different mappings are available, which is one of the reasons for the different string types: each string type supports exactly one encoding. The good news is that the type of string you're working with always tells you which encoding you’re using.
We'll go into more detail on how this works later.
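As a small illustration of that mapping, the sketch below prints the bytes behind the characters A and B. The helper name utf8_bytes is made up for this example; the standard library call doing the work is str::as_bytes.

```rust
/// Returns the UTF-8 bytes that encode `text`.
/// (Hypothetical helper, named only for this example.)
fn utf8_bytes(text: &str) -> Vec<u8> {
    text.as_bytes().to_vec()
}

fn main() {
    // 'A' and 'B' each fit in a single byte in UTF-8.
    println!("{:?}", utf8_bytes("A")); // [65]
    println!("{:?}", utf8_bytes("B")); // [66]

    // The same byte written as 8 binary digits.
    println!("{:08b}", utf8_bytes("A")[0]); // 01000001
}
```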
Rust uses ownership rules to manage memory in a program.
These rules dictate that all data has an owner somewhere in the program. The rules also allow owners to lend data out to other parts of a program using borrowing.
Important: Borrowing is efficient because the data does not get copied.
Ownership creates two string types with different uses. The first is an "owned" type representing a heap-allocated string. This string type allows for resizing. The second is a "borrowed" type which acts as a slice (or view) into a heap-allocated "owned" string.
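Here is a minimal sketch of the owned/borrowed split. The function first_word is a hypothetical helper invented for this example; it hands back a view into the caller's data instead of a copy.

```rust
/// Borrows the first whitespace-separated word of `text` without copying it.
/// (Hypothetical helper for this sketch.)
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    // `owned` owns its heap allocation and can grow.
    let mut owned = String::from("hello");
    owned.push_str(" world");

    // `view` borrows from `owned`; no string data is copied.
    let view: &str = first_word(&owned);
    println!("{view}"); // hello
}
```

Because the returned &str points into the original String, the borrow checker keeps the String alive (and unmodified) for as long as the view is in use.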
Even though the string situation in Rust may seem daunting, the different types of strings simplify things: the type alone tells you which encoding the data uses and whether it is owned or borrowed.
String, &str, and &'static str
These 3 string types are the normal Rust strings that use UTF-8 encoding for Unicode support.
If you don't need to interface with the operating system or with C code, this is the kind of string you'll use in Rust.
The String type is an owned data type stored on the heap. Since it exists on the heap, resizing the String is possible by adding or removing characters.
The &str type is a dynamically-sized type that acts as a slice (or view) into a heap-allocated String, with the exception of &'static str.
A &'static str is a view into existing string data but is not always borrowing from a String.
For example
Whenever some hard-coded string data is in a program like this:
let msg = "hello";
Then we know that the data type is &'static str.
This example is actually a special case where the compiler will place "hello" into a known location in memory within the generated executable file.
This means that when the system loads the executable file to run it, "hello" gets included in memory as part of the loading process, and msg can then borrow the data from that loaded memory.
The &'static means the data will exist in memory for the remainder of the program, and since the program is running, "hello" will always be in memory.
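One practical consequence is that a function can return a string literal without borrowing from any of its arguments. The sketch below assumes a hypothetical function named status_label to show this.

```rust
/// Returns a string literal. Literals are stored in the executable, so the
/// returned reference stays valid for the whole program (`'static`).
/// (`status_label` is a hypothetical function for this sketch.)
fn status_label(ok: bool) -> &'static str {
    if ok {
        "ready"
    } else {
        "failed"
    }
}

fn main() {
    let label: &'static str = status_label(true);
    println!("{label}"); // ready
}
```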
Key points to remember: When working with &str, you’re borrowing data from somewhere else. Therefore, you can look at the string data, but you cannot change it.
If you have a &mut str, you can modify existing characters, but you can only add or remove characters if you get mutable access to the original String that lent you the &mut str.
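To make the &mut str case concrete, here is a small sketch. The function shout is hypothetical; it relies on the standard make_ascii_uppercase method, which changes characters in place without changing the length of the string.

```rust
/// Uppercases ASCII letters in place through a `&mut str`.
/// The length in bytes never changes, so the `String` itself is not needed.
/// (Hypothetical helper for this sketch.)
fn shout(text: &mut str) {
    text.make_ascii_uppercase();
}

fn main() {
    let mut owned = String::from("hello");
    shout(owned.as_mut_str());

    // Growing the string requires mutable access to the `String` itself.
    owned.push('!');
    println!("{owned}"); // HELLO!
}
```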
OsString and &OsStr
These strings get used when interfacing with the operating system.
Each operating system uses a specified character encoding, and these strings use the appropriate encoding expected by the underlying operating system. Therefore, if you need to execute a system call or work with file paths, use one of these strings.
The naming follows the same pattern as String and &str:
OsString is the "owned" kind, while &OsStr is the "borrowed" kind.
Besides the names, the main difference between OsString and String is the character encoding.
For example
Here’s a description of the different encoding schemes used by OsString, straight from the docs:
CString and &CStr
These strings are compatible with the C programming language. You'll need to use these if you write code that interfaces with C via FFI.
CString is the "owned" kind of string, and &CStr is the "borrowed" kind.
C-compatible strings end with nul, cannot have nul in the middle of the string, and are a sequence of arbitrary bytes.
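The nul rules can be seen directly with the standard library types. In this sketch, to_c_string is a hypothetical wrapper around CString::new, which rejects input containing an interior nul byte.

```rust
use std::ffi::{CStr, CString};

/// Builds a C-compatible string, or returns `None` if `text`
/// contains an interior nul byte. (Hypothetical helper for this sketch.)
fn to_c_string(text: &str) -> Option<CString> {
    CString::new(text).ok()
}

fn main() {
    let c_owned = to_c_string("hello").expect("no interior nul");
    // The trailing nul terminator is added automatically.
    println!("{:?}", c_owned.as_bytes_with_nul());

    // Borrowing works the same way as with String and &str.
    let c_borrowed: &CStr = c_owned.as_c_str();
    println!("{:?}", c_borrowed.to_str());

    // An interior nul is rejected.
    println!("{}", to_c_string("he\0llo").is_none()); // true
}
```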
Now that we’ve covered the differences between String and &str, let's look at how we can work with these types through some example code.
Before we can use a String we need to have one, so let's start with how to make an owned String:
// using From/Into
let foo = String::from("foo");
let foo: String = "foo".into();
// copy from &str
let foo = "foo".to_owned();
let foo = "foo".to_string();
// formatted String
let foobar = format!("{foo}bar");
// empty mutable String:
let mut foo = String::new();
// push another &str onto the end:
foo.push_str("fo");
// push a single character:
foo.push('o');
// String from bytes
let foo = vec![102, 111, 111];
let foo = String::from_utf8(foo).unwrap();
String::from() and "foo".into() use the From and Into traits.
String::from() makes a String from the &str, while "foo".into() turns the &str into a String. Both are great ways to make a String when you don't need to perform any formatting.
The .to_owned() method makes a copy of "foo" (which is a &str) and gives you a heap-allocated String to work with. The .to_string() method does the same thing as .to_owned(), but may signal intent better depending on the surrounding code.
format! is great because you can use formatting arguments to create a formatted String at the time of creation.
If you need to edit a String right away, or want to build one up from component parts, String::new() creates an empty string to work with. Methods on the String type then allow adding or removing parts of the string.
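A short sketch of building a String up and then trimming it back down. The function build_greeting is a hypothetical helper; push_str, push, pop, and truncate are standard String methods.

```rust
/// Builds a greeting piece by piece, starting from an empty `String`.
/// (Hypothetical helper for this sketch.)
fn build_greeting(name: &str) -> String {
    let mut out = String::new();
    out.push_str("Hello, ");
    out.push_str(name);
    out.push('!');
    out
}

fn main() {
    let mut greeting = build_greeting("Rust");
    println!("{greeting}"); // Hello, Rust!

    // Remove the last character, then keep only the first 5 bytes.
    greeting.pop();
    greeting.truncate(5);
    println!("{greeting}"); // Hello
}
```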
We can also make a String from raw bytes using String::from_utf8().
Remember that the string types in Rust adhere to a specified character encoding, so make sure the bytes are valid for the string type you’re using. If they’re not valid, you'll get an error.
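Rather than calling .unwrap(), the error can be handled. The sketch below is one possible approach, not the only one: a hypothetical helper bytes_to_string tries the strict conversion first and falls back to String::from_utf8_lossy, which substitutes the replacement character for invalid sequences.

```rust
/// Tries a strict conversion first, then falls back to a lossy one that
/// substitutes U+FFFD for invalid sequences.
/// (Hypothetical helper for this sketch.)
fn bytes_to_string(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(text) => text,
        // The error hands the original bytes back via `into_bytes`.
        Err(err) => String::from_utf8_lossy(&err.into_bytes()).into_owned(),
    }
}

fn main() {
    println!("{}", bytes_to_string(vec![102, 111, 111])); // foo

    // The byte 0xff can never appear in valid UTF-8.
    println!("{}", bytes_to_string(vec![102, 0xff, 111]));
}
```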
These techniques to create strings are helpful in different circumstances and provide plenty of options to write clean, readable code.
Now that we can make a String, we need to know how to use it with functions. Remember that copying a String is inefficient, so we want to borrow as much as possible.
There are two ways to get a &str from a String. The first is to use the as_str() method:
let foo = String::from("foo");
let borrowed = foo.as_str();
The second method involves the use of the Deref trait.
Deref lets a borrow of one type be interpreted as a borrow of another type. String has an implementation of Deref that borrows as &str, so we can do this:
let foo = String::from("foo");
// this is &str:
let borrowed: &str = &foo;
// this is &String:
let borrowed = &foo;
The type annotation on the binding tells the compiler which type we want, so it can use Deref to interpret the borrow as &str.
But we don't always have to use annotations on the binding like we did here. Having &str as the type for a function parameter works as well:
fn print_string(s: &str) {
    println!("{s}");
}
let foo: String = "foo".into();
// ok: `print_string` uses &str annotation
print_string(&foo);
The functionality implemented on the owned and borrowed string types differs because their intended usage differs.
For example
The owned string types (String, OsString, and CString) have functionality related to the manipulation of the string, such as adding and removing characters.
On the other hand, the borrowed versions (&str, &OsStr, and &CStr) have functionality related to reading the string, such as searching for characters.
If you have a String and want to search for characters, you don't need to go through any conversion processes. You also don’t need to borrow to get a &str.
Simply use this:
let foo = String::from("foo");
assert!(foo.contains("oo"));
The .contains() method has an implementation on &str, but not on String. Yet we’re still able to use it with the String.
What's happening?
The signature for .contains() is contains(&self, pattern) -> bool, which borrows &self.
Since String implements Deref, and &self is a borrow, Deref can use the &str type instead of the String type for the .contains() function.
This makes all the functionality from String and all from &str available on just the String type, because we can always borrow a String to get &str.
Important: This only works one way, though! We cannot access String functions if we have just a &str available.
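When all you have is a &str and you need String functionality, one option is to copy the data into an owned String first. The helper with_suffix below is hypothetical and only illustrates that step.

```rust
/// Appends a suffix to borrowed text. A `&str` has no `push_str`,
/// so the data is first copied into an owned `String`.
/// (Hypothetical helper for this sketch.)
fn with_suffix(text: &str, suffix: &str) -> String {
    let mut owned = text.to_owned();
    owned.push_str(suffix);
    owned
}

fn main() {
    let borrowed: &str = "foo";
    // borrowed.push_str("bar"); // does not compile: no such method on &str
    println!("{}", with_suffix(borrowed, "bar")); // foobar
}
```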
Sometimes you need an OsString, but all you have available is a String or &str. In these situations, you can perform a lossless conversion of String into an OsString like this:
use std::ffi::OsString;
let foo = String::from("foo");
let os_foo = OsString::from(foo);
The inverse is not true, however: an OsString is not guaranteed to be UTF-8 and may contain data which is not valid Unicode. String requires valid Unicode data, so converting an OsString to String will require error handling:
use std::ffi::OsString;
let os_foo = OsString::from("foo");
let foo: Option<&str> = os_foo.to_str();
If .to_str() fails (or if you don't care if it fails), .to_string_lossy() is an alternative. .to_string_lossy() will replace any invalid byte sequences with the Unicode replacement character �:
use std::ffi::OsString;
let os_foo = OsString::from("foo");
let foo = os_foo.to_string_lossy(); // no errors to worry about
This replacement works, but be aware that you'll lose information during the conversion.
If you need to send the original OsString back to the system, you'll need to keep a copy of it for later because you won't be able to recover it from the String.
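One way to avoid losing the original is OsString::into_string, a standard library method not covered above. It consumes the OsString and, on failure, hands the untouched OsString back in the error. The wrapper name try_into_string is made up for this sketch.

```rust
use std::ffi::OsString;

/// Converts to `String` when the data is valid Unicode; otherwise returns
/// the untouched `OsString` so nothing is lost.
/// (Hypothetical wrapper around `OsString::into_string`.)
fn try_into_string(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

fn main() {
    match try_into_string(OsString::from("foo")) {
        Ok(text) => println!("valid Unicode: {text}"),
        Err(original) => println!("kept original: {original:?}"),
    }
}
```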
Working with individual characters is quite complicated so before we can do something like iterate over the characters in a string, we need to go into the details of UTF-8 character encoding.
At its core, a String is a sequence of bytes. These bytes correspond to entries in the UTF-8 encoding map, which uses Unicode and supports all world languages.
There are a large number of characters when all human languages get combined into a single set: far more than 256. But a byte can only hold values from 0 to 255, meaning the most we can fit in a single byte is 256 distinct characters.
Therefore, to accommodate every human language, most characters need to consume more than one byte of memory.
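The standard char::len_utf8 method reports how many bytes a character needs in UTF-8. The sketch below wraps it in a hypothetical helper, utf8_len, and tries it on characters of increasing size.

```rust
/// Number of bytes `ch` occupies when encoded as UTF-8.
/// (Thin wrapper around the standard `char::len_utf8`, named for this sketch.)
fn utf8_len(ch: char) -> usize {
    ch.len_utf8()
}

fn main() {
    // p, â (U+00E2), € (U+20AC), and the crab emoji (U+1F980).
    for ch in ['p', '\u{e2}', '\u{20ac}', '\u{1f980}'] {
        println!("{ch} takes {} byte(s)", utf8_len(ch));
    }
}
```

This prints sizes of 1, 2, 3, and 4 bytes respectively, which is the full range UTF-8 uses for a single scalar value.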
Let's see how this works using the word pâté (a kind of finely chopped meat), which has four characters.
We know this word will consume at least four bytes because there are four characters. However, since there isn't enough space in a byte to cover every character in every language, we must use multiple bytes for some characters.
Here’s a table showing the byte sequence for each character of the word pâté:
Char Bytes
--------------
p 70
â c3 a2
t 74
é 65 cc 81
Since there are characters in pâté which consume more than one byte, trying to interpret each byte as an individual character results in the following:
Char Bytes
-----------
p 70
à c3
¢ a2
t 74
e 65
Ì cc
� 81
This happens because some bytes (like c3) mark the beginning of a multi-byte sequence. If such a byte is read on its own rather than together with the bytes that follow it, each byte gets interpreted as a single-byte character, which ends up being an entirely different character.
This means that iterating over the bytes of a String, or indexing into a String by byte position, can result in the wrong data, so if you need to do either of these, be extra careful to maintain the correct encoding.
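The standard library offers some guard rails for byte positions. In this sketch, safe_slice is a hypothetical wrapper around str::get, which returns None instead of panicking when an index falls inside a multi-byte sequence. The string is written with escapes so its bytes match the first table above.

```rust
/// Returns the slice `text[start..end]`, or `None` when either index falls
/// inside a multi-byte sequence. (Hypothetical wrapper around `str::get`.)
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    // p, precomposed â (U+00E2), t, e, combining acute accent (U+0301).
    let pate = "p\u{e2}te\u{301}";

    // Byte 1 starts the two-byte â; byte 2 is its continuation byte.
    println!("{}", pate.is_char_boundary(1)); // true
    println!("{}", pate.is_char_boundary(2)); // false

    println!("{:?}", safe_slice(pate, 0, 3)); // Some("pâ")
    println!("{:?}", safe_slice(pate, 0, 2)); // None
    // `&pate[0..2]` would panic at runtime instead of returning None.
}
```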
There’s a method on &str called .chars() which sounds like what we want.
.chars() iterates over Unicode scalar values, and scalar values are mostly characters, so this should work, but let’s check:
let pate = "pâté";
for ch in pate.chars() {
dbg!(ch);
}
// There's only 4 characters in the string,
// but we have a count of 5...
assert_eq!(pate.chars().count(), 5);
// output:
// [..] ch = 'p'
// [..] ch = 'â'
// [..] ch = 't'
// [..] ch = 'e'
// [..] ch = '\u{301}' // what?
Unicode supports every character, and it also has a scalar for every common character. However, this method won’t work, because there isn't a scalar for all variations of every character.
Even though some scalars can be entire characters, they can also be diacritic marks like an acute accent (´). To produce these characters, Unicode uses combining marks, which get combined with the previous scalar, to create complete characters such as the é in pâté.
But, since .chars() iterates over scalars, it fails to provide us with the actual character we expect. Instead, it provides us with some characters and diacritics, but not a combination.
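The same visible character can even be spelled with different scalars. The sketch below uses explicit escapes to compare a precomposed é with an e followed by a combining accent; scalar_count is a hypothetical helper that simply counts what .chars() yields.

```rust
/// Counts Unicode scalar values, which is what `.chars()` iterates over.
/// (Hypothetical helper for this sketch.)
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    let precomposed = "\u{e9}"; // é as a single scalar
    let decomposed = "e\u{301}"; // e + combining acute accent

    println!("{}", scalar_count(precomposed)); // 1
    println!("{}", scalar_count(decomposed)); // 2

    // Both render as é, yet they are different strings to Rust.
    println!("{}", precomposed == decomposed); // false
}
```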
So, if we can't use bytes and scalars, what can we use?
The answer is grapheme clusters.
A grapheme cluster is a "user-perceived character" and is the closest thing we have to a single "character". Grapheme clusters consist of one or more scalars, including the combining marks that pull together to make a single character.
To access the grapheme clusters of a string, we need to pull in the unicode-segmentation crate.
This crate adheres to the Unicode standards for grapheme clusters and word boundaries, and it also provides iterators over grapheme clusters and words:
//! ```cargo
//! [dependencies]
//! unicode-segmentation = "1.10.0"
//! ```
use unicode_segmentation::UnicodeSegmentation;
let pate = "pâté";
for ch in pate.graphemes(true) {
dbg!(ch);
}
// The assertion passes with a count of 4
// since there are 4 grapheme clusters
assert_eq!(pate.graphemes(true).count(), 4);
// output:
// [..] ch = "p"
// [..] ch = "â"
// [..] ch = "t"
// [..] ch = "e\u{301}"
So let’s break this down.
The first three lines of output from this program show individual characters, and each of these has a corresponding scalar value in Unicode because they are base characters (most characters are base characters).
However, the last line has an output of "e\u{301}", which shows us we’re using the scalars e and \u{301} to form the grapheme cluster.
Also, \u{301} is the acute accent combining mark that combines with e to create é, so it works!
Whew - that was a lot of information! Hopefully, this tutorial has helped you to understand strings in Rust a little better, and when to use each.
Don’t feel bad if you need to read through it a few times, but make sure you have a grasp before moving on. Unfortunately, due to the encoding requirements imposed by different systems and the number of human languages used worldwide, this is all pretty vital to learn.
However, if there's one thing to remember from this guide, it's to use the unicode_segmentation crate whenever you’re working with the individual characters of a string. Otherwise, your string handling code has a good chance of being incorrect, regardless of which string type you try to use.
And if you have any more questions or want to learn more about Rust?
Check out my Rust Programming course, where you’ll learn everything you need to know to confidently use the world’s most loved programming language!
You can also ask questions in the dedicated Discord server and chat with other Rust users, as well as myself!