
Writing endian-independent code in C
Source: IBM developerWorks Worldwide
Provided by yangyi on 2007-05-14 21:56:04
Level: Intermediate
Harsha S. Adiga (haradiga@in.ibm.com), Software Engineer, IBM
24 Apr 2007

Architectures, processors, network stacks, and communication protocols all have to define endianness at some point. This article explains how endianness affects code, how to determine endianness at run time, and how to write code that reverses byte order and frees you from being bound to a particular endianness.

To understand the concept of endianness (see Endianness), you need to be familiar, at a highly abstract level, with memory. All you need to know about memory is that it is one large array of bytes. In the computer world, people use addresses to refer to locations in that array.
Each address stores one element of the memory array, and each element is typically one byte. In some memory configurations an address stores something other than a byte, but those are extremely rare, so for now assume that every memory address stores a byte.

This article works with 32-bit quantities, which are four bytes long. Integers and single-precision floating-point numbers are typically 32 bits long. Because each memory address holds a single byte rather than four, the 32-bit quantity has to be split into four bytes. For example, suppose you have the 32-bit quantity 12345678, written in hexadecimal. Each hex digit is four bits, so you need eight hex digits to represent the 32-bit value. The four bytes are 12, 34, 56, and 78. There are two ways to store this in memory, as shown below, where a is the lowest address.

Address   Big-endian   Little-endian
a         12           78
a+1       34           56
a+2       56           34
a+3       78           12
Notice that they are in the reverse order. To remember which is which, think of the least significant byte being stored first (little-endian) or the most significant byte being stored first (big-endian).

Endianness only makes sense when you break up a multi-byte quantity and store the bytes at consecutive memory locations. If you have a 32-bit register holding a 32-bit value, it makes no sense to talk about endianness. The register is neither big-endian nor little-endian; it is just a register holding a 32-bit value. The rightmost bit is the least significant bit, and the leftmost bit is the most significant bit. Some people nevertheless describe a register as big-endian because its most significant byte is written first, on the left, but a register has no byte addresses, so this is only a notational convention.

Endianness is the attribute of a system that indicates whether multi-byte integers are stored with the most significant byte first or last. In today's world of virtual machines and gigahertz processors, why would a programmer care about such a minor topic? Unfortunately, endianness must be chosen every time a hardware or software architecture is designed. There isn't much in the way of natural law to help decide, so implementations vary.

Processors are generally designated as either big-endian or little-endian. For example, the 80x86 processors from Intel® and their clones are little-endian, while Sun's SPARC, Motorola's 68K, and the PowerPC® families are traditionally big-endian.

Why is endianness so important? Suppose you store integer values in a file and send the file to a machine of the opposite endianness. When that machine reads the values in, the bytes arrive reversed and the values make no sense. Endianness is also a big issue when sending numbers over the network. Again, if you send a value from a machine of one endianness to a machine of the opposite endianness, you'll have problems.
This is even worse over the network because you might not be able to determine the endianness of the machine that sent you the data. Listing 1 shows an example of the dangers of programming while unaware of endianness.

Listing 1. Example
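A minimal sketch of the kind of program this describes, not necessarily the article's own code: it writes a 32-bit integer to a file exactly as the value sits in memory. The function name write_native, the file name, and the sample value used below are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Writes `value` to `path` byte for byte as it is laid out in memory,
 * that is, in host byte order. Returns 0 on success, -1 on failure. */
int write_native(const char *path, uint32_t value)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    size_t written = fwrite(&value, sizeof value, 1, fp);
    if (fclose(fp) != 0 || written != 1)
        return -1;
    return 0;
}
```

Calling write_native("output.bin", 0x01234567) and inspecting the file with hexdump -C shows the bytes 01 23 45 67 on a big-endian host and 67 45 23 01 on a little-endian host.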
The above code compiles properly on all machines. However, its output differs between big-endian and little-endian machines when examined with hexdump -C.

Listing 2. hexdump -C output on big-endian machines
Endianness doesn't apply to everything. If you do bitwise or bit-shift operations on an int, you don't notice the endianness. The machine arranges the multiple bytes, so the least significant byte is still the least significant byte, and the most significant byte is still the most significant byte.

Similarly, it's natural to wonder whether strings might be saved in some strange order, depending on the machine. To understand this, go back to the basics of an array. A C-style string, after all, is still an array of characters. Each character requires one byte of memory, since characters are represented in ASCII. In an array, the addresses of consecutive elements increase, so &arr[i] is less than &arr[i+1]. Though it isn't obvious, if something is stored at increasing addresses in memory, it is stored at increasing "addresses" in a file. When you write to a file, you usually specify an address in memory and the number of bytes to write starting at that address.

For example, suppose you have a C-style string in memory. Now imagine writing this string out to a file using some sort of write operation. The bytes are written in order of increasing address, so the first character of the string is the first byte in the file on every machine.
Given this explanation, it's clear that endianness doesn't matter with C-style strings.

Endianness does matter when you use a type cast that depends on a particular byte order being in use. One example is shown in Listing 4, but keep in mind that many different type casts can cause problems.

Listing 4. Forcing a byte order
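As a sketch of this kind of reinterpretation, assume a two-byte array holding 1 followed by 0 that is read back as a single 16-bit integer. The array contents and the name bytes_as_u16 are assumptions made for illustration. The sketch uses memcpy in place of a pointer cast: the bytes are read identically, but the copy avoids the alignment and aliasing pitfalls of dereferencing a cast pointer.

```c
#include <stdint.h>
#include <string.h>

/* Reads the two bytes {1, 0} as one 16-bit integer in host byte order. */
uint16_t bytes_as_u16(void)
{
    const unsigned char bytes[2] = {1, 0};
    uint16_t x;

    /* Copies both bytes, lowest address first, into the integer. */
    memcpy(&x, bytes, sizeof x);
    return x;
}
```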
What would the resulting value be? On a little-endian system, the two bytes are interpreted in reverse order and seen as 0, 1. The high byte is 0, so it contributes nothing, and the low byte is 1, so the value is 1. On a big-endian system, the high byte is 1 and the value is 256.

Determine endianness at run time

One way to determine endianness is to test the memory layout of a predefined constant. For example, you know that the layout of a 32-bit integer variable with a value of 1 is 00 00 00 01 for big-endian and 01 00 00 00 for little-endian. By looking at the first byte of the constant, you can tell the endianness of the running platform and then take the appropriate action.
Listing 5 tests the first byte of a multi-byte integer to determine the endianness of the platform.

Listing 5. Determine endianness
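A minimal sketch of this test, assuming a 32-bit constant with the value 1. The enum and the function name host_byte_order are hypothetical names introduced here, not identifiers from the article.

```c
#include <stdint.h>
#include <string.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG };

/* Inspects the lowest-addressed byte of a 32-bit constant with value 1:
 * 01 means little-endian (01 00 00 00), 00 means big-endian (00 00 00 01). */
enum byte_order host_byte_order(void)
{
    const uint32_t one = 1;
    unsigned char first;

    memcpy(&first, &one, 1);  /* copy only the first byte */
    return first == 1 ? ORDER_LITTLE : ORDER_BIG;
}
```

Returning an enum rather than printing a message lets callers branch on the result, for example to decide whether a value needs byte swapping before it is written out.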
Another way to determine endianness is to use a character pointer to the bytes of an int and then check its first byte to see if it is 0 or 1. Listing 6 shows an example.

Listing 6. Character pointer
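The character-pointer approach might look like the following sketch. The function name is_big_endian is an assumption made for illustration.

```c
/* Returns nonzero when the host stores the most significant byte first. */
int is_big_endian(void)
{
    const int i = 1;
    const unsigned char *p = (const unsigned char *)&i;

    /* On a big-endian host the first byte of the value 1 is 0. */
    return p[0] == 0;
}
```

Accessing an object through a pointer to unsigned char is always permitted in C, which is why this cast is safe where a cast to a wider type might not be.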
Network stacks and communication protocols must also define their endianness. Otherwise, two nodes of different endianness would be unable to communicate. This is a more substantial example of endianness affecting the embedded programmer.

All of the protocol layers in the Transmission Control Protocol and Internet Protocol (TCP/IP) suite are defined to be big-endian. Any 16-bit or 32-bit value within the various layer headers (such as an IP address, a packet length, or a checksum) must be sent and received with its most significant byte first. The multi-byte integer representation used by the TCP/IP protocols is sometimes called network byte order. Even if the computers at each end are little-endian, multi-byte integers passed between them must be converted to network byte order before transmission and converted back to host byte order at the receiving end.

Assume you want to establish a TCP socket connection to a computer whose IP address is 192.0.1.2. Internet Protocol version 4 (IPv4) uses a unique 32-bit integer to identify each network host, so the dotted decimal IP address must be translated into such an integer. Suppose an 80x86-based PC is to talk to a SPARC-based server over the Internet. Without further manipulation, the 80x86 processor would hold 192.0.1.2 as the integer 0xC0000102, which it stores in memory as the bytes 02 01 00 C0, and it would transmit the bytes in that order. The SPARC would receive the bytes in the order 02 01 00 C0, reconstruct them into the big-endian integer 0x020100C0, and misinterpret the address as 2.1.0.192.

If the stack runs on a little-endian processor, it has to reorder, at run time, the bytes of every multi-byte data field within the various headers of the layers. If the stack runs on a big-endian processor, there's nothing to worry about. For the stack to be portable (so it runs on processors of both types), it has to decide whether or not to do this reordering, typically at compile time.
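To enable these conversions, sockets provide a set of macros to convert between host and network byte order, as shown below.

Macro     Description
htons()   Converts a 16-bit value from host to network byte order
htonl()   Converts a 32-bit value from host to network byte order
ntohs()   Converts a 16-bit value from network to host byte order
ntohl()   Converts a 32-bit value from network to host byte order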
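To make the 192.0.1.2 example concrete, here is a small sketch that uses htonl() from the POSIX header arpa/inet.h. The helper names ipv4_host_order and ipv4_wire_bytes are hypothetical and exist only for this illustration.

```c
#include <arpa/inet.h>  /* htonl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Packs four dotted-decimal octets into a 32-bit integer value. */
uint32_t ipv4_host_order(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  | (uint32_t)d;
}

/* Copies the address into `out` in the order it should go on the wire. */
void ipv4_wire_bytes(uint32_t host_addr, unsigned char out[4])
{
    uint32_t net = htonl(host_addr);  /* no-op on big-endian hosts */
    memcpy(out, &net, sizeof net);
}
```

With the conversion in place, ipv4_wire_bytes(ipv4_host_order(192, 0, 1, 2), buf) fills buf with C0 00 01 02 on either kind of host, so the receiver reads the address as 192.0.1.2.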
Consider the C program in Listing 7.

Listing 7. Sample C program
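A sketch of such a program follows, written as two functions so it can be called from any main(). It assumes a fixed-width uint32_t where the article's text speaks of a long, because long is eight bytes on many 64-bit platforms. The names dump_bytes and show_layout are hypothetical.

```c
#include <arpa/inet.h>  /* htonl (POSIX) */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Prints each byte of `v` with its offset from the lowest address. */
void dump_bytes(const char *label, uint32_t v)
{
    unsigned char b[sizeof v];

    memcpy(b, &v, sizeof v);
    printf("%s\n", label);
    for (size_t i = 0; i < sizeof v; i++)
        printf("  offset %zu: 0x%02x\n", i, (unsigned)b[i]);
}

/* Shows the layout of x before and after conversion to network order. */
void show_layout(void)
{
    uint32_t x = 0x0112A380u;

    dump_bytes("x in host byte order:", x);
    dump_bytes("x after htonl():", htonl(x));
}
```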
This program shows how the long variable x, with the value 0x112A380, is stored. When the program runs on a little-endian processor, it produces the output in Listing 8.

Listing 8. Little-endian output

When you look at the individual bytes of x, you find the least significant byte (0x80) at the lowest address. After the conversion to network byte order, the most significant byte comes first instead.

Listing 9 shows the output from executing the same program on a big-endian processor.

Listing 9. Big-endian output

Here you see the most significant byte (0x1) at the lowest address. Converting to network byte order changes nothing, because the host order is already big-endian.
Now let's get down to writing some code that is not bound to a particular endianness. There are many ways of doing this. The goal is to write code that doesn't fail, regardless of the endianness of the machine. You need to ensure that file data is in the correct byte order when it is read or written. It would also be nice to avoid conditional compilation flags and let the code determine the endianness of the machine automatically.

Let's write a set of functions that automatically reverse the byte order of a given parameter, depending on the endianness of the machine. First, you need to deal with short.

Listing 10. Method 1: Using bit shifting and bit ANDs
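A minimal sketch of the shift-and-mask approach for a 16-bit value, assuming unsigned fixed-width types so that the shifts are well defined. All function names here are hypothetical. The conversion helper swaps only when a run-time check reports a little-endian host.

```c
#include <stdint.h>

/* Unconditionally exchanges the two bytes of a 16-bit value. */
uint16_t swap16_shift(uint16_t s)
{
    return (uint16_t)(((s & 0x00FFu) << 8) | ((s >> 8) & 0x00FFu));
}

/* Returns nonzero on a big-endian host (first byte of 1 is 0). */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 0;
}

/* Returns `s` arranged so that its bytes in memory are big-endian. */
uint16_t to_big16_shift(uint16_t s)
{
    return host_is_big_endian() ? s : swap16_shift(s);
}
```

Because the swap is its own inverse, the same helper also converts big-endian data read from a file back into host order.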
The second method casts a pointer to the value to a character pointer and swaps the bytes through it.

Listing 11. Method 2: Using pointer to an array of characters
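One way the pointer-based variant might look, as a sketch with the hypothetical name swap16_bytes. It performs the swap in place and leaves the decision about whether to swap to the caller.

```c
#include <stdint.h>

/* Exchanges the two bytes of *s in place by addressing them as characters. */
void swap16_bytes(uint16_t *s)
{
    unsigned char *p = (unsigned char *)s;
    unsigned char tmp = p[0];

    p[0] = p[1];
    p[1] = tmp;
}
```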
Now let's handle int.

Listing 12. Method 1: Using bit shifting and bit ANDs with int
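The same idea extended to four bytes, again as a sketch with hypothetical names and an unsigned 32-bit type assumed in place of a plain int.

```c
#include <stdint.h>

/* Unconditionally reverses the four bytes of a 32-bit value. */
uint32_t swap32_shift(uint32_t i)
{
    return ((i & 0x000000FFu) << 24) |
           ((i & 0x0000FF00u) << 8)  |
           ((i & 0x00FF0000u) >> 8)  |
           ((i & 0xFF000000u) >> 24);
}

/* Returns `i` arranged so that its bytes in memory are big-endian. */
uint32_t to_big32_shift(uint32_t i)
{
    const uint16_t probe = 1;
    int host_is_big = (*(const unsigned char *)&probe == 0);

    return host_is_big ? i : swap32_shift(i);
}
```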
This is more or less the same thing you did to reverse a short.

Listing 13. Method 2: Using pointer to an array of characters with int
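A sketch of the pointer-based reversal for a 32-bit value, with the hypothetical name swap32_bytes. A loop is used here so the same pattern carries over to wider types.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses the four bytes of *i in place through a character pointer. */
void swap32_bytes(uint32_t *i)
{
    unsigned char *p = (unsigned char *)i;

    /* Walk inward from both ends, exchanging one pair per step. */
    for (size_t lo = 0, hi = sizeof *i - 1; lo < hi; lo++, hi--) {
        unsigned char tmp = p[lo];
        p[lo] = p[hi];
        p[hi] = tmp;
    }
}
```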
Again, this is exactly what you did to reverse a short, but here you swapped four bytes. Similarly, you can write code to reverse the bytes of float, long, double, and so on, but that is outside the scope of this article.

There seems to be no significant advantage in using one byte order over the other. Both are still common, and different architectures use them. Little-endian processors such as the 80x86 family and its clones are used in most personal computers and laptops, so the vast majority of desktop computers today are little-endian.

Endian issues do not affect sequences of single bytes, because a byte is considered an atomic unit from a storage point of view. Sequences of multi-byte values, on the other hand, are affected by endianness, and you need to take care when coding for them.
Original link: http://www-128.ibm.com/developerworks/aix/library/au-endianc/?S_TACT=105AGX54&S_CMP=NLLX