Endianness

This is an old revision of this page, as edited by Don Braffitt (talk | contribs) at 17:40, 24 August 2010 (→‎Endianness and operating systems on architectures: Add OpenVMS). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, endianness is the ordering of individually addressable sub-units (words, bytes, or even bits) within a longer data word stored in external memory. The most typical cases are the ordering of bytes within a 16-, 32-, or 64-bit word, where endianness is often simply referred to as byte order. [1] The usual contrast is between most versus least significant byte first, called big-endian and little-endian respectively. Mixed forms are also possible; the ordering of bytes within a 16-bit word may be different from the ordering of 16-bit words within a 32-bit word, for instance; although rare, such cases are sometimes collectively referred to as mixed-endian or middle-endian.

Endianness may be seen as a low-level attribute of a particular representation format, for example, the order in which the two bytes of a UCS-2 character are stored in memory. Byte order is an important consideration in network programming, since two computers with different byte orders may be communicating. Failure to account for varying endianness when writing code for mixed platforms can lead to bugs that are difficult to detect.

The terms Little-Endians and Big-Endians were introduced in 1980 by Danny Cohen in his paper "On Holy Wars and a Plea for Peace".[2] Cohen's paper uses Gulliver's Travels by Jonathan Swift (1726), wherein two religious sects of Lilliputians argued over whether to crack open their soft-boiled eggs from the little end or the big end,[3] as an allegory for the byte-order (endianness) issue, which became crucial when computers became interconnected by networks.

Endianness and hardware

Most modern computer processors agree on bit ordering "inside" individual bytes (this was not always the case). This means that any single-byte value will be read the same on almost any computer one may send it to.

Integers are usually stored as sequences of bytes, so that the encoded value can be obtained by simple concatenation. The two most common orderings are:

  • increasing numeric significance with increasing memory addresses or increasing time, known as little-endian, and
  • its opposite, most-significant byte first, called big-endian.[4]
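The two orderings can be illustrated with a minimal Python sketch. The value 0x12345678 is an arbitrary example chosen here, not taken from the article; `int.to_bytes` and `sys.byteorder` are standard-library features.

```python
import sys

value = 0x12345678  # arbitrary 32-bit example value (an assumption)

# Least significant byte at the lowest address (little-endian).
little = value.to_bytes(4, byteorder="little")
# Most significant byte at the lowest address (big-endian).
big = value.to_bytes(4, byteorder="big")

print(little.hex(" "))  # 78 56 34 12
print(big.hex(" "))     # 12 34 56 78

# Byte order of the machine running this script: "little" or "big".
print(sys.byteorder)
```

Both byte strings decode back to the same integer as long as the reader uses the same order as the writer.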

Well-known processor architectures that use the little-endian format include x86, 6502, Z80, VAX, and, largely, PDP-11. Processors using the big-endian format include Motorola processors such as the 6800 and 68000, PowerPC (used in the Macintosh line prior to the switch to x86), and System/370. The PDP-10 also used big-endian addressing for byte-oriented instructions. SPARC was historically big-endian until version 9, which is bi-endian, like the ARM architecture (see below).

Many serial protocols may be regarded as big-endian (at the bit- and/or byte-levels) in the sense that the most significant part of the data is sent first (see below). However, this is often transparent in the interface between the UART or communication controller and the host CPU, DMA controller, and/or system memory; this interface may be of any type and is often configurable.

Bi-endian hardware

Some architectures (including ARM, PowerPC, Alpha, SPARC V9, MIPS, PA-RISC and IA-64) feature switchable endianness. This feature can improve performance or simplify the logic of networking devices and software. The word bi-endian, said of hardware, denotes the capability to compute or pass data in either of two different endian formats.

Many of these architectures can be switched via software to default to a specific endian format (usually done when the computer starts up); however, on some systems the default endianness is selected by hardware on the motherboard and cannot be changed via software (e.g., the Alpha, which runs only in big-endian mode on the Cray T3E).

Note that "bi-endian" refers primarily to how a processor treats data accesses. Instruction accesses (fetches of instruction words) on a given processor may still assume a fixed endianness, even if data accesses are fully bi-endian.

Note, too, that some nominally bi-endian CPUs may actually employ internal "magic" (as opposed to really switching to a different endianness) in one of their operating modes. For instance, some PowerPC processors in little-endian mode act as little-endian from the point of view of the executing programs but they do not actually store data in memory in little-endian format (multi-byte values are swapped during memory load/store operations). This can cause problems when memory is transferred to an external device if some part of the software, e.g. a device driver, does not account for the situation.

Floating-point and endianness

On some machines, integers were represented in little-endian form while floating-point numbers were represented in big-endian form.[5] Because there are many floating-point formats and no standard "network" representation for them, no standard for transferring floating-point values has been established. This means that floating-point data written on one machine may not be readable on another, and this is the case even if both use IEEE 754 floating-point arithmetic, since the endianness of the memory representation is not part of the IEEE specification.[6]
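As a hedged illustration of why the byte order of floating-point data matters, the following Python sketch packs the same IEEE 754 double in both orders using the standard `struct` module. The value 1.0 is an arbitrary choice for this example.

```python
import struct

x = 1.0  # IEEE 754 binary64 bit pattern: 0x3FF0000000000000

big = struct.pack(">d", x)     # most significant byte first
little = struct.pack("<d", x)  # least significant byte first

print(big.hex())     # 3ff0000000000000
print(little.hex())  # 000000000000f03f

# Reading the bytes with the wrong order silently yields a different number.
wrong = struct.unpack("<d", big)[0]
print(wrong == x)    # False
```

No error is raised in the last step, which is what makes such mismatches hard to notice.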

Endianness and operating systems on architectures

Little-endian operating systems:

  • Linux on x86, x64, Alpha and Itanium
  • Mac OS on x86, x64
  • Solaris on x86, x64, PowerPC
  • Tru64 on Alpha
  • OpenVMS on VAX, Alpha and Itanium
  • Windows on x86, x64 and Itanium

Big-endian operating systems:

  • AIX on POWER
  • AmigaOS on PowerPC and 680x0
  • HP-UX on Itanium and PA-RISC
  • Linux on MIPS, SPARC, PA-RISC, POWER, PowerPC, and 680x0
  • Mac OS on PowerPC and 680x0
  • Solaris on SPARC

Etymology

 
An egg in an egg cup in the little-endian orientation.

The term big-endian originally comes from Jonathan Swift's satirical novel Gulliver’s Travels by way of Danny Cohen in 1980[2]. In 1726, Swift described tensions in Lilliput and Blefuscu: whereas royal edict in Lilliput requires cracking open one's soft-boiled egg at the small end, inhabitants of the rival kingdom of Blefuscu crack theirs at the big end (giving them the moniker Big-endians).[7] The terms little-endian and endianness have a similar intent.[8]

"On Holy Wars and a Plea for Peace"[2] by Danny Cohen ends with: "Swift's point is that the difference between breaking the egg at the little-end and breaking it at the big-end is trivial. Therefore, he suggests, that everyone does it in his own preferred way. We agree that the difference between sending eggs with the little- or the big-end first is trivial, but we insist that everyone must do it in the same way, to avoid anarchy. Since the difference is trivial we may choose either way, but a decision must be made."

History

The problem of dealing with data in different representations is sometimes termed the NUXI problem.[9] This terminology alludes to the issue that a value represented by the byte-string "UNIX" on a big-endian system may be stored as "NUXI" on a PDP-11 middle-endian system; UNIX was one of the first systems to allow the same code to run on, and transfer data between, platforms with different internal representations.
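The "UNIX" to "NUXI" transformation can be reproduced with a short Python sketch. It assumes the four characters are handled as two 16-bit words, which is the usual reading of the anecdote.

```python
import struct

text = b"UNIX"

# Interpret the four bytes as two 16-bit words on a big-endian machine.
words = struct.unpack(">2H", text)

# Store the same two words with the low byte of each word first,
# as a machine with little-endian 16-bit words would.
stored = struct.pack("<2H", *words)

print(stored)  # b'NUXI'
```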

An often-cited (although technically irrelevant) argument in favor of big-endian is that it is consistent with the ordering commonly used in natural languages;[10] but that is far from universal, either in speech or writing: spoken languages have a wide variety of organizations of numbers. The decimal number 92 is spoken in English as ninety-two, in German and Dutch as two and ninety, and in French as four-twenty-twelve, with a similar system in Danish (two-and-half-five-twenty). Nowadays, though, numbers are written almost universally in the Hindu-Arabic numeral system, with the most significant digits written first. (Even when embedded in right-to-left language text, numbers appear with the most significant digits on the first line, if a line break is required.)

Optimization

The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses (even when alignment restrictions are imposed). For example, a 32-bit memory location with content 4A 00 00 00 can be read at the same address as either 8-bit (value = 4A), 16-bit (004A), 24-bit (00004A), or 32-bit (0000004A), all of which retain the same numeric value. Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers.
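The property can be checked with a small Python sketch that reads the example content 4A 00 00 00 from the same starting offset at four different widths. The big-endian interpretation is included only for contrast.

```python
memory = bytes([0x4A, 0x00, 0x00, 0x00])  # the example content 4A 00 00 00

little_reads = []
big_reads = []
for width in (1, 2, 3, 4):
    # Always start at offset 0 and read `width` bytes.
    little_reads.append(int.from_bytes(memory[:width], "little"))
    big_reads.append(int.from_bytes(memory[:width], "big"))

print([hex(v) for v in little_reads])  # ['0x4a', '0x4a', '0x4a', '0x4a']
print([hex(v) for v in big_reads])     # ['0x4a', '0x4a00', '0x4a0000', '0x4a000000']
```

Under the little-endian reading the value is the same at every width; under the big-endian reading the same bytes at the same address give a different value for each width.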

On the other hand, in some situations it may be useful to obtain an approximation of a multi-byte or multi-word value by reading only its most-significant portion instead of the complete representation; a big-endian processor may read such an approximation using the same base-address that would be used for the full value.

Little-endian representation simplifies hardware in small-scale byte-addressable processors and microcontrollers: As carry propagation must start at the least significant byte, multi-byte addition can then be carried out with a monotonic incrementing address sequence, a simple operation already present in hardware. On a big-endian processor, its addressing unit has to be told how big the addition is going to be so that it can hop forward to the least significant byte, then count back down towards the most significant. However, high performance processors usually perform these operations atomically, regardless of byte ordering.
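The carry-propagation argument can be sketched in Python as a byte-at-a-time adder. The function name `add_little_endian` is hypothetical, and the code models only the address sequence, not any particular processor.

```python
def add_little_endian(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian integers one byte at a time."""
    assert len(a) == len(b)
    result = bytearray(len(a))
    carry = 0
    # Index 0 is the least significant byte, so the address simply counts up.
    for i in range(len(a)):
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF
        carry = total >> 8
    return bytes(result)  # any final carry is discarded (wraps around)

x = (0x00FF).to_bytes(2, "little")
y = (0x0001).to_bytes(2, "little")
print(add_little_endian(x, y).hex(" "))  # 00 01, i.e. 0x0100
```

With big-endian operands the loop would have to start at the last index and count down, which requires knowing the operand length before the first byte is touched.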

Diagram for mapping registers to memory locations

 
Mapping registers to memory locations

Using this chart, one can map an access (for a concrete example: "write 32 bits to address 0") from register to memory or from memory to register. To help in understanding that access, little and big endianness can be seen in the diagram as differing in the orientation of their coordinate systems. In big-endian, the atomic units and memory coordinates increase in the diagram from left to right, while in little-endian they increase from right to left.

Examples of storing the value 0A0B0C0Dh in memory

Note that hexadecimal notation is used.

To further illustrate the above notions, this section provides example layouts of a 32-bit number in the most common variants of endianness. There is no general guarantee that a platform will use one of these formats, but in practice there are only a few exceptions.

All the examples refer to the storage in memory of the value 0A0B0C0Dh.

Big-endian

 

With 8-bit atomic element size and 1-byte (octet) address increment

increasing addresses  →
... 0Ah 0Bh 0Ch 0Dh ...

The most significant byte (MSB) value, which is 0Ah in our example, is stored at the memory location with the lowest address, the next byte value in significance, 0Bh, is stored at the following memory location, and so on. This is akin to reading the hexadecimal digits from left to right.

With 16-bit atomic element size

increasing addresses  →
... 0A0Bh 0C0Dh ...

The most significant atomic element now stores the value 0A0Bh, followed by 0C0Dh.

Little-endian

 

With 8-bit atomic element size and 1-byte (octet) address increment

increasing addresses  →
... 0Dh 0Ch 0Bh 0Ah ...

The least significant byte (LSB) value, 0Dh, is at the lowest address. The other bytes follow in increasing order of significance.

With 16-bit atomic element size

increasing addresses  →
... 0C0Dh 0A0Bh ...

The least significant 16-bit unit stores the value 0C0Dh, immediately followed by 0A0Bh. Note that 0C0Dh and 0A0Bh represent integers, not bit layouts (see bit numbering).
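The big-endian and little-endian layouts listed so far can be reproduced with a short Python sketch. The helper `split_units` is hypothetical and written for this illustration only; it lists the atomic elements of a value in order of increasing address.

```python
def split_units(value: int, total_bits: int, unit_bits: int, order: str) -> list[str]:
    """List the atomic elements of `value` in order of increasing address."""
    count = total_bits // unit_bits
    mask = (1 << unit_bits) - 1
    # units[0] is the least significant element.
    units = [(value >> (i * unit_bits)) & mask for i in range(count)]
    if order == "big":
        units.reverse()  # most significant element at the lowest address
    digits = unit_bits // 4
    return [f"{u:0{digits}X}h" for u in units]

for unit_bits in (8, 16):
    for order in ("big", "little"):
        print(unit_bits, order, " ".join(split_units(0x0A0B0C0D, 32, unit_bits, order)))
# 8 big 0Ah 0Bh 0Ch 0Dh
# 8 little 0Dh 0Ch 0Bh 0Ah
# 16 big 0A0Bh 0C0Dh
# 16 little 0C0Dh 0A0Bh
```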

With byte addresses increasing from right to left

The 16-bit atomic element byte ordering may look backwards as written above, but this is because little-endian is best written with addressing increasing towards the left. If we write the bytes this way then the ordering makes slightly more sense:

←  increasing addresses
... 0Ah 0Bh 0Ch 0Dh ...

The least significant byte (LSB) value, 0Dh, is at the lowest address. The other bytes follow in increasing order of significance.

←  increasing addresses
... 0A0Bh 0C0Dh ...

The least significant 16-bit unit stores the value 0C0Dh, immediately followed by 0A0Bh.

However, if one displays memory with addresses increasing to the left like this, then the display of Unicode (or ASCII) text is reversed from the normal display (for left-to-right languages). For example, the word "XRAY" displayed in the "little-endian-friendly" (8-bit atomic element) manner just described is:

←  increasing addresses
... "Y" "A" "R" "X" ...

Using the 16-bit atomic element notation, it looks even stranger:

←  increasing addresses
... "A" "Y" "X" "R" ...

This conflict between the memory arrangements of binary data and text is intrinsic to the nature of the little-endian convention, but is a conflict only for languages written left-to-right (such as Indo-European languages like English (Roman), French (Roman), Russian (Cyrillic) and Hindi (Devanagari)). For right-to-left languages such as Arabic or Hebrew, there is no conflict of text with binary, and the preferred display in both cases would be with addresses increasing to the left. (On the other hand, right-to-left languages have a complementary intrinsic conflict in the big-endian system.)

Middle-endian

Still other architectures, generically called middle-endian or mixed-endian, may have a more complicated ordering; the PDP-11, for instance, stored some 32-bit words, counting from the most significant byte, as: 2nd byte first, then 1st, then 4th, and finally 3rd.

  • storage of a 32-bit word on a PDP-11
increasing addresses  →
... 0Bh 0Ah 0Dh 0Ch ...

Note that this can be interpreted as storing the most significant "half" (16-bits) followed by the less significant half (as if big-endian) but with each half stored in little-endian format. This ordering is known as PDP-endianness.

The ARM architecture can also produce this format when writing a 32-bit word to an address 2 bytes from a 32-bit word alignment.
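The PDP-endian layout described in this section can be sketched in Python as two little-endian 16-bit halves, most significant half first. The function names `to_pdp_endian` and `from_pdp_endian` are hypothetical.

```python
def to_pdp_endian(value: int) -> bytes:
    """Encode a 32-bit value as two 16-bit halves: high half first,
    each half stored little-endian."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return high.to_bytes(2, "little") + low.to_bytes(2, "little")

def from_pdp_endian(data: bytes) -> int:
    """Decode four bytes laid out by to_pdp_endian."""
    high = int.from_bytes(data[0:2], "little")
    low = int.from_bytes(data[2:4], "little")
    return (high << 16) | low

print(to_pdp_endian(0x0A0B0C0D).hex(" "))  # 0b 0a 0d 0c
```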

Endianness in networking

Many network protocols may be regarded as big-endian in the sense that the most significant part (at the bit and/or byte levels) is sent first. The telephone network, historically and presently, sends the most significant part first, the area code; doing so allows routing while a telephone number is being composed. The Internet Protocol defines big-endian as the standard network byte order used for all numeric values in the packet headers and by many higher-level protocols and file formats that are designed for use over IP. The Berkeley sockets API defines a set of functions to convert 16- and 32-bit integers to and from network byte order: the htonl (host-to-network-long) and htons (host-to-network-short) functions convert 32-bit and 16-bit values respectively from machine (host) to network order, whereas the ntohl and ntohs functions convert from network to host order.
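Python's standard `socket` module exposes the same four conversion functions, which allows a brief sketch of their use. The port and address values below are arbitrary examples, and the six-byte "header" is a made-up layout for illustration, not a real protocol header.

```python
import socket
import struct

port = 0x1F90        # 8080, a 16-bit example value
addr = 0xC0A80001    # 192.168.0.1 as a 32-bit example value

# Round trips are the identity on every host, whatever its byte order.
assert socket.ntohs(socket.htons(port)) == port
assert socket.ntohl(socket.htonl(addr)) == addr

# struct's "!" prefix means network (big-endian) order with no padding.
header = struct.pack("!HI", port, addr)
print(header.hex(" "))  # 1f 90 c0 a8 00 01
```

On a big-endian host the four functions return their argument unchanged; on a little-endian host they swap the bytes.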

While the lowest network protocols may deal with sub-byte formatting, all the layers above them usually consider the byte (mostly meant as octet) as their atomic unit.

Endianness in files and byte swap

Endianness is a problem when a binary file created on a computer is read on another computer with different endianness. Some compilers have built-in facilities to deal with data written in other formats. For example, the Intel Fortran compiler supports the non-standard CONVERT specifier, so a file can be opened as

OPEN(unit,CONVERT='BIG_ENDIAN',...)

or

OPEN(unit,CONVERT='LITTLE_ENDIAN',...)

If the compiler does not support such conversion, the programmer needs to swap the bytes with ad hoc code.
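Such ad hoc swapping code is typically a handful of shifts and masks. The following is a minimal sketch in Python rather than Fortran; the function name `swap32` is hypothetical, and only 32-bit unsigned values are handled.

```python
def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    return (
        ((value & 0x000000FF) << 24)
        | ((value & 0x0000FF00) << 8)
        | ((value & 0x00FF0000) >> 8)
        | ((value & 0xFF000000) >> 24)
    )

print(hex(swap32(0x0A0B0C0D)))  # 0xd0c0b0a
```

Applying the swap twice returns the original value, so the same routine serves for both reading and writing.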

Fortran sequential unformatted files created with one endianness usually cannot be read on a system using the other endianness because Fortran usually implements a record (defined as the data written by a single Fortran statement) as data preceded and succeeded by count fields, which are integers equal to the number of bytes in the data. An attempt to read such a file on a system of the other endianness then results in a run-time error, because the count fields are misinterpreted.

Application binary data formats, such as MATLAB .mat files or the .BIL data format used in topography, are usually endianness-independent. This is achieved either by always storing the data in one fixed endianness or by carrying with the data a switch that indicates which endianness the data was written with. When reading the file, the application converts the endianness, transparently to the user.

This is the case for TIFF image files, whose header indicates the endianness of their internal binary integers. If a file starts with the signature "MM", integers are represented as big-endian, while "II" means little-endian. Each signature occupies a single 16-bit word and is a palindrome (that is, it reads the same forwards and backwards), so it is endianness-independent. "I" stands for Intel and "M" stands for Motorola, the respective CPU providers of the IBM PC compatibles and Apple Macintosh platforms in the 1980s. Intel CPUs are little-endian, while Motorola 680x0 CPUs are big-endian. This explicit signature allows a TIFF reader program to swap bytes if necessary when a given file was generated by a TIFF writer program running on a computer with a different endianness.
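A reader can use the signature to choose how to decode the rest of the header, as in the following Python sketch. It relies on two details of the TIFF header that go beyond the text above: the signature is followed by the 16-bit number 42 and then by the 32-bit offset of the first image file directory (IFD). The function name `parse_tiff_header` is hypothetical, and the two sample headers are constructed by hand.

```python
import struct

def parse_tiff_header(header: bytes) -> tuple[str, int, int]:
    """Return (byte order, magic number, offset of first IFD)."""
    signature = header[:2]
    if signature == b"II":
        order, prefix = "little", "<"
    elif signature == b"MM":
        order, prefix = "big", ">"
    else:
        raise ValueError("not a TIFF byte-order signature")
    # Bytes 2-3 hold the number 42; bytes 4-7 hold the first IFD offset.
    magic, ifd_offset = struct.unpack(prefix + "HI", header[2:8])
    return order, magic, ifd_offset

print(parse_tiff_header(b"II\x2a\x00\x08\x00\x00\x00"))  # ('little', 42, 8)
print(parse_tiff_header(b"MM\x00\x2a\x00\x00\x00\x08"))  # ('big', 42, 8)
```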

Note that since the required byte swap depends on the length of the variables stored in the file (two 2-byte integers require a different swap than one 4-byte integer), a general utility to convert endianness in binary files cannot exist without knowledge of the file's layout.
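The dependence on field length is easy to demonstrate. In this Python sketch the same four bytes are converted once as two 16-bit integers and once as a single 32-bit integer; the byte values are arbitrary.

```python
import struct

raw = bytes([0x0A, 0x0B, 0x0C, 0x0D])

# Treat the data as two 16-bit integers and swap each one.
as_two_shorts = struct.pack("<2H", *struct.unpack(">2H", raw))
# Treat the same data as one 32-bit integer and swap it.
as_one_long = struct.pack("<I", *struct.unpack(">I", raw))

print(as_two_shorts.hex(" "))  # 0b 0a 0d 0c
print(as_one_long.hex(" "))    # 0d 0c 0b 0a
```

The two results differ, so a converter must know which fields the bytes belong to before it can swap them correctly.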

"Bit endianness"

The terms bit endianness or bit-level endianness are seldom used when talking about the representation of a stored value, as they are only meaningful for the rare computer architectures where each individual bit has a unique address. They are used, however, to refer to the transmission order of bits over a serial medium. Most often that order is transparently managed by the hardware and is the bit-level analogue of little-endian (low bit first), although protocols exist which require the opposite ordering (e.g. I²C). In networking, the decision about the order of transmission of bits is made at the very bottom of the data link layer of the OSI model.
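The two transmission orders can be modelled in Python as lists of bits in the order they would appear on the wire. The function names are hypothetical and the sketch does not model any specific serial protocol.

```python
def bits_lsb_first(byte: int) -> list[int]:
    """Bits in the order sent when the low bit goes first."""
    return [(byte >> i) & 1 for i in range(8)]

def bits_msb_first(byte: int) -> list[int]:
    """Bits in the order sent when the high bit goes first."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

print(bits_lsb_first(0x4A))  # [0, 1, 0, 1, 0, 0, 1, 0]
print(bits_msb_first(0x4A))  # [0, 1, 0, 0, 1, 0, 1, 0]
```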

Other meanings

Some authors extend the usage of the word "endianness", and of related terms, to entities such as street addresses, date formats and others. Such usages—basically reducing endianness to a mere synonym of ordering of the parts—are non-standard usage[citation needed] (e.g., ISO 8601:2004 talks about "descending order year-month-day", not about "big-endian format"), do not have widespread usage, and are generally (other than for date formats) employed in a metaphorical sense.

"Endianness" is sometimes used to describe the order of the components of a domain name, e.g. 'www.wikipedia.org' (the usual modern 'little-endian' form) versus the reverse-DNS 'org.wikipedia.www' ('big-endian', used for naming components, packages, or types in computer systems, for example Java packages, Macintosh ".plist" files, etc.). URLs could be considered 'middle-endian', as they start in the usual modern 'little-endian' form, but after the TLD, a 'big-endian' format is used to point to a specific file or folder.

References and notes

  1. ^ For hardware, the Jargon File also reports the less common expression byte sex [1]. It is unclear whether this terminology is also used when more than two orderings are possible. Similarly, the manual for the ORCA/M assembler refers to a field indicating the order of the bytes in a number field as NUMSEX, and the Mac OS X operating system refers to "byte sex" in its compiler tools [2].
  2. ^ a b c Danny Cohen (1980-04-01). On Holy Wars and a Plea for Peace. IEN 137. ...which bit should travel first, the bit from the little end of the word, or the bit from the big end of the word? The followers of the former approach are called the Little-Endians, and the followers of the latter are called the Big-Endians. Also published at IEEE Computer, October 1981 issue.
  3. ^ Gulliver's Travels: Complete, Authoritative Text with Biographical and Historical Contexts, Palgrave Macmillan 1995 (part I).
  4. ^ Note that, in these expressions, the term "end" is meant as "extremity", not as "last part"; and that big and little say which extremity is written first.
  5. ^ "Floating point formats".
  6. ^ "pack - convert a list into a binary representation".
  7. ^ Jonathan Swift (1726). Gulliver's Travels. Which two mighty powers have, as I was going to tell you, been engaged in a most obstinate war for six-and-thirty moons past. (...) the primitive way of breaking eggs, before we eat them, was upon the larger end; (...) the emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs. (...) Many hundred large volumes have been published upon this controversy: but the books of the Big-endians have been long forbidden (...)
  8. ^ David Cary. "Endian FAQ". Retrieved 2008-12-20.
  9. ^ "NUXI problem". The Jargon File. Retrieved 2008-12-20.
  10. ^ Cf. entries 539 and 704 of the Linguistic Universals Database

Further reading

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.