String Manipulation in C
This page teaches you how to work with strings in the C programming language. If you are used to higher-level languages such as Python or JavaScript, you’ll quickly discover that C does not have some of the “nice” string manipulation features such as string concatenation with + (e.g. str1 + str2), an easy .length() property (although there is a function for that, strlen()), or things like a built-in regex engine. It doesn’t really even have the concept of a string type, just “pointer to char” (char*), although some would argue that for all intents and purposes that is a string type.
What Are Strings In C?
Strings in C are represented by either a pointer to char (char* myStr) or an array of char (char myStr[10]). What is laid out in memory is a series of byte-sized ASCII-encoded characters, terminated by the null character ('\0' or 0x00).
Say we wanted to store the string “Hello”. We could write char* myStr = "Hello";
. If the compiler decided to plonk this at memory address 0x01
(I avoided 0x00
since that is invalid, being the null pointer) then the memory would look like this:
But this is not all. The memory above is used just to hold the literal "Hello"
. What also is created is the variable myStr
, which points to the first character in this string. Let’s assume the compiler decides to plonk this at memory address 0x08
, then your memory will look like this.
Whenever you pass your string variable myStr
to functions, this 0x01
value is what is passed in. String functions, aware of the type (it’s a pointer to char), know to increment through memory until they hit the null char when they want to do operations on the string.
One thing C does differently to a lot of other languages is its choice of using a null terminator to mark the end of the string, rather than reserving some bytes at the start of the character array for holding its length. To find the length of the string you have to iterate through the bytes until you find 0x00
, which is what strlen()
does (more on this below).
Different Ways Of Creating Strings
Whenever you write anything in double quotes in C (e.g. "abc"
), you are creating what is called a string literal. You can assign a string literal to a pointer to char, like so:
You can also assign a string literal to an array of char, like so:
These two ways are mostly equivalent, BUT you must remember that myStr2
is an array with a known size (at compile time), whilst myStr1
is just a pointer to a char. Thus, sizeof()
is going to behave differently.
When compiled for a 64-bit architecture, sizeof(myStr1)
returns 8, the number of bytes required for the pointer myStr1
which points to the memory address of the a
in "abc"
. However, sizeof(myStr2)
returns 4, the number of bytes in the array myStr2
which consists of ['a', 'b', 'c', '\0']
. This difference is bound to trip you up at some point, so keep it in mind whenever you use sizeof() on a string.
You can also create an empty string of a certain size:
Char vs. String
In C, 'a'
and "A"
are completely different types (as opposed to, say, JavaScript, where you can swap "
for '
and it doesn’t change anything). 'a'
is a single character, whilst "A"
is a string literal (which is an array of characters). You can’t do this:
You have to do this:
It then makes sense that to set an individual character in a string, you can use the '
notation:
A char
literal does not get a null character appended to the end of it, it is just a single byte.
ASCII Characters
Null Character
The null character (represented by '\0' or 0x00) is used to terminate strings. It should not be confused with the NULL macro, which is a null pointer constant rather than a character. Most string functions (e.g. strcpy() and strcat()) will automatically write the null character at the end of the resulting string, while memory functions such as memcpy() won’t. Special care has to be taken to make sure the memory set aside for a string can accommodate the null character (in bytes, the number of ASCII characters + 1).
Carriage Return And New Line
These two characters are used to begin a new line of text. If you take the names literally, the carriage return moves the cursor back to the start of the line, and the new line shifts the cursor down one line. These characters are reminiscent of the typewriter days.
In Linux style systems, only the new line character is used to begin a new line of text. In Windows, both the carriage return and new line characters are used. For embedded systems, I’ve seen either preference used (probably depending on what development OS the writer was using!). Most terminal programs support both approaches. The carriage return is inserted into code using \r
, and a new line using \n
. In Windows, the standard order is to write the carriage return first, and then the new line \r\n
.
In many other programming languages, you do not need to add the line endings manually as the print
style function will automatically do that for you (by default — you can still usually disable it if you don’t want that behaviour). C++ is half-way between the two with the standard way of using std::endl
.
Case Switching
The case of ASCII letters can be easily switched in code by inverting bit 5 (the bit with value 0x20, counting from bit 0 as the least significant bit). This can be done by exclusive ORing with the space character (which is 0x20), as shown below:
Special Characters
Special characters can be added to strings using the escape character \
followed by a single identifier.
Syntax | Special Character | Inserted Number (in Hex) |
---|---|---|
\0 | Null character | 0x00 |
\ddd | ASCII char represented by the three octal digits ‘ddd’. Note that \0 is a special case of this more generic notation | n/a |
\r | Carriage Return | 0x0D |
\n | New line | 0x0A |
\\ | Backslash (since a single backslash is the escape character) | 0x5C |
Typically, both the carriage return and new line characters are used for making a new line (and in that order). This is normally appended at the end of strings to be printed with \r\n
.
Finding The Length
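Use the strlen() function provided by <string.h>. It returns the number of characters before the terminating null character, so the null character itself is not counted.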
Copying
strcpy()
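is a standard library function for copying the contents of one C string to another. It performs no bounds checking, so the destination buffer must be large enough to hold the source string including its null character.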
Concatenating
Unlike many higher level languages, you cannot just concatenate C “strings” together like so: my_string_1 + my_string_2
(remember, they are just arrays of characters!). Instead you have to use the strcat()
function:
printf (and variants)
printf()
and its variants such as sprintf()
are some of the most common “something to string” functions. They can be used to convert signed integers, unsigned integers, floats, doubles and other types to strings. They can also be used to print already formatted strings, and/or concatenate existing strings together. They also support formatting options, which allow you to control how the numbers are displayed (e.g. number of decimal places, padding, display number as hex, etc.).
printf()
printf()
is the most commonly used string output function. It is a variadic function (it takes a variable number of arguments, note that this is not the same as function overloading, which is something that C does not support).
On most mainstream operating systems such as Linux, MacOS and Windows, calling printf()
will send the formatted string to the terminal (if the application was invoked from a terminal). On an embedded system there is no “standard output”, but increasingly on more modern embedded systems printf()
is routed to a default debug serial port.
If you want to print an already-formulated string using printf()
(with no additional arguments to be inserted), do not use the syntax printf(msg)
. Instead, use the format printf("%s", msg)
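. If msg happens to contain a % character, printf(msg) will treat it as a conversion specifier and try to read arguments that were never passed, which is undefined behaviour and a classic security hole.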
The printf()
function takes format specifiers which tell the function how you want the numbers displayed.
Most C compiler installations include standard C libraries for working with strings. The printf() family is declared in stdio.h
, usually included into a C file using the syntax #include <stdio.h>
. String copy, concatenate, compare and many other functions are declared in string.h. Most of them rely on null-terminated strings to function properly. Some of the most widely used ones are shown below.
Conversion Specifiers
Conversion specifiers determine how printf()
interprets its arguments and displays their values as a string. Conversion specifiers are added in the format string after the %
character (optional format specifiers may be added between the %
symbol and the conversion specifier).
Although the basic behaviour is defined in the ANSI standard, the exact implementation of printf()
is likely to vary slightly between C libraries.
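Specifier | Description | Example |
---|---|---|
%c | Prints a single ASCII character. | printf("%c", 'a') prints a |
%d or %i | Prints a signed integer (whose exact width is implementation-specific, usually 16 or 32-bit). There is no difference between %d and %i when used with printf(). | printf("%d", -42) prints -42 |
%u | Prints an unsigned integer (whose exact width is implementation-specific, usually 16 or 32-bit). | printf("%u", 42u) prints 42 |
%s | Prints a null-terminated string of ASCII characters (of arbitrary length). | printf("%s", "abc") prints abc |
%x | Prints an unsigned integer as a hexadecimal number. | printf("%x", 255u) prints ff |
%f | Prints a float (or double). All floats are converted to doubles anyway via default argument promotions. | printf("%f", 1.5) prints 1.500000 |
%% | Use two % characters to print a literal %. | printf("100%%") prints 100% |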
Format Specifiers
There are plenty of format specifiers that you can use with printf()
which change the way the text is formatted. Format specifiers go between the %
symbol and the conversion specifier, mentioned above. They are optional, but if used, have to be added in the correct order.
I have come across embedded implementations of printf()
which do not support string padding (e.g. %5s
or %-6s
). This includes the version used with the PSoC 5.
Portable size_t Printing
For portability, you can use the z
length modifier (as in %zu) when you want to print a value of type size_t
(e.g. the number returned by sizeof()
).
This was introduced in ISO C99. Z
(upper-case z
) was a GNU extension predating this standard addition and should not be used in new code.
sprintf()/snprintf()
sprintf()
is a variant of printf()
which writes the formatted string to a buffer (which you provide a pointer to) rather than to the standard output. It is useful for building strings which you don’t want to go to standard output, or want to send to standard output at a later date.
snprintf()
is a safer version of sprintf()
(I recommend always using the safer “n” style printf()
functions) which takes an additional argument which specifies the size of the buffer. This prevents buffer overflows. If the buffer is not large enough to hold the formatted string, the string will be truncated to fit the buffer.
An example of snprintf()
(notice the use of sizeof()
to pass the size of the buffer into the function, this prevents overruns if the formatted string is larger than the buffer):
Keil C51 Compiler
In the Keil C51 compiler, the b
or B
length specifier is used to tell sprintf
that the number is 8-bit. If you don’t use this, 8-bit numbers stored in uint8_t
or char
data types will not print properly. You do not have to use the b/B
specifier when using GCC.
itoa()
itoa()
is a widely implemented but non-standard extension to the C programming language. Although widely implemented, it is not ubiquitous: glibc, the C library normally used with GCC on Linux (a combination with a huge share of the C toolchain space), does not provide it. Even though it is not specified in the C standard, where it is available it is confusingly declared in stdlib.h
as it complements the existing functions in that header [2]. It is not part of the C++ standard either [3].
itoa()
is typically declared with the prototype char* itoa(int value, char* str, int base), where base is the radix.
The radix is generally limited to octal (8), decimal (10) or hexadecimal (16) [4].
Usage:
itoa()
can cause undefined behaviour if the buffer is not large enough to hold the string-representation of the passed in integer. If you have a restricted range of integers that are provided to itoa()
you can quite easily determine how big the buffer should be. If it could be any integer, you need a buffer that can handle INT_MIN
(including its minus sign, and a trailing null character
). A safer alternative (that is also portable) to itoa()
is to use snprintf()
.
String to Number Functions
printf()
and friends convert things to strings. But what about the other way around? The C standard library provides a number of different functions for converting strings to numbers. The most common ones are shown below.
atof()
atof()
is a historic way of converting a string to a double-precision float (yes, even though the function has f
in its name, it actually returns a double
).
A significant disadvantage with atof()
is that you cannot distinguish between the text input "0.0"
and when there is no valid number to convert. This is because atof()
returns 0.0
if it can’t find a valid float number in the input string. For example, it can’t tell the difference between "0.0"
and "poos"
:
Another issue is that atof()
does not allow you to check whether the input string was just a valid number with no garbage after it. This is because it iterates through the characters in the string, and stops processing as soon as it finds an invalid character.
There is a better alternative strtod()
, which fixes both of the problems described above.
strtod()
This stands for “string to double”. It is a safer way of converting strings to doubles than atof()
. The code example below shows how to use strtod()
to convert a string to a double and also how to check that the input string contained a valid number. C99 and C++11 onwards also provide strtof()
which performs the same function but returns a float
rather than a double
.
strtol()
strtol()
behaves very similarly to strtod()
except that it parses the string into a long int
rather than a double
. It also takes a third argument specifying the number base (e.g. 10 for decimal).
Decoding/Encoding Strings
strtok()
is a standard function which is useful for decoding strings. It splits a string up into a series of substrings (tokens), where the string is split at specific delimiter characters which are passed into the function. Note that it modifies the input string in place. It is useful when decoding ASCII-based (aka human readable) communication protocols, such as a command-line interface, or the NMEA protocol. Read more about it on the C++ Reference site.
getopt()
is a POSIX function (it is not part of the C standard itself) for parsing command-line arguments passed into main()
as an array of strings. It is included in glibc (the GNU C library). The files are also downloadable locally here (taken from glibc v2.17).
Converting a Version String to Numbers
Sometimes you might want to convert a version string in the form v1.2.3
into the individual major, minor and patch numbers.
The following C code shows a function VersionStringToNumbers()
which can do this, along with an example main()
which runs some tests against the function to make sure it can handle valid and invalid input correctly.
strtol()
was used instead of atoi()
because atoi()
cannot distinguish between invalid input and "0"
(i.e. it can’t tell the difference between "0"
and "poop"
!). Also, the returned end
pointer is used to increment through the string, check that a .
follows immediately after the last number, and then to work out where to begin processing the next integer.
You can test this code live at https://replit.com/@gbmhunter/c-version-string-to-numbers.
References
1. Wikipedia (2023, Aug 23). printf. Retrieved 2023-12-25, from https://en.wikipedia.org/wiki/Printf.
2. Wikibooks (2020, Apr 16). C Programming/stdlib.h/itoa. Retrieved 2023-12-26, from https://en.wikibooks.org/wiki/C_Programming/stdlib.h/itoa.
3. cplusplus.com (2023). function - itoa. Retrieved 2023-12-26, from https://cplusplus.com/reference/cstdlib/itoa/.
4. IBM (2021). z/OS Docs - itoa() - Convert int into a string. Retrieved 2023-12-26, from https://www.ibm.com/docs/en/zos/2.1.0?topic=functions-itoa-convert-int-into-string.