XOR swap algorithm

With three XOR operations the binary values 1010 and 0011 are exchanged between variables. — Using the XOR swap algorithm to exchange nibbles between variables without the use of temporary storage.

In computer programming, the XOR swap is an algorithm that uses the XOR bitwise operation to swap values of distinct variables having the same data type without using a temporary variable. "Distinct" means that the variables are stored at distinct memory addresses; the actual values of the variables do not have to be different.

The algorithm

Conventional swapping requires the use of a temporary storage variable. Using the XOR swap algorithm, however, no temporary storage is needed. The algorithm is as follows:

X := X XOR Y
Y := X XOR Y
X := X XOR Y

The algorithm typically corresponds to three machine code instructions. Since XOR is a commutative operation, X XOR Y can be replaced with Y XOR X in any of the lines. When coded in assembly language, this commutativity is often exercised in the second line. For example, in IBM System/370 assembly code:

XR    R1,R2
XR    R2,R1
XR    R1,R2

where R1 and R2 are distinct registers and each XR operation leaves its result in the register named in the first argument.

However, the algorithm fails if x and y use the same storage location, since the value stored in that location will be zeroed out by the first XOR instruction, and then remain zero; it will not be "swapped with itself". (Note that this is not the same as if x and y have the same values. The trouble only comes when x and y use the same storage location.)

Proof that the XOR swap works

The binary operation XOR over bit strings of length $N$ exhibits the following properties (where $\oplus$ denotes XOR):^[1]

L1. Commutativity: $A\oplus B=B\oplus A$
L2. Associativity: $(A\oplus B)\oplus C=A\oplus (B\oplus C)$
L3. Identity exists: there is a bit string, 0, (of length N) such that $A\oplus 0=A$ for any $A$
L4. Each element is its own inverse: for each $A$ , $A\oplus A=0$ .

Suppose that we have two distinct registers R1 and R2 as in the table below, with initial values A and B respectively. We perform the operations below in sequence, and reduce our results using the properties listed above.

Step	Operation	Register 1	Register 2	Reduction
0	Initial value	$\ A$	$\ B$	—
1	`R1 := R1 XOR R2`	$\ A\oplus B$	$\ B$	—
2	`R2 := R1 XOR R2`	$\ A\oplus B$	${\begin{aligned}(A\oplus B)\oplus B=&A\oplus (B\oplus B)\\=&A\oplus 0\\=&A\end{aligned}}$	L2 L4 L3
3	`R1 := R1 XOR R2`	${\begin{aligned}(A\oplus B)\oplus A=&A\oplus (A\oplus B)\\=&(A\oplus A)\oplus B\\=&0\oplus B\\=&B\end{aligned}}$	$\ A$	L1 L2 L4 L3

Code example

A C function that implements the XOR swap algorithm:

 void xorSwap (int *x, int *y) {
     if (x != y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

Note that the code does not swap the integers passed immediately, but first checks if their locations are distinct. This is because the algorithm works only when x and y refer to distinct locations (otherwise, it will erroneously set *x = *y = 0).

The body of this function is sometimes seen incorrectly shortened to if (x != y) *x^=*y^=*x^=*y;. This code has undefined behavior, since it modifies the lvalue *x twice without an intervening sequence point.

Reasons for use in practice

In most practical scenarios, the trivial swap algorithm using a temporary register is more efficient. Limited situations in which it may be practical include:

On a processor where the instruction set encoding permits the XOR swap to be encoded in a smaller number of bytes;
In a region with high register pressure, it may allow the register allocator to avoid spilling a register.
In Microcontrollers where available RAM is very limited.

Because these situations are rare, most optimizing compilers do not generate XOR swap code.

Reasons for avoidance in practice

Most modern compilers can optimize away the temporary variable in the naive swap, in which case the naive swap uses the same amount of memory and the same number of registers as the XOR swap and is at least as fast, and often faster.^[2] The XOR swap is also much less readable, and can be completely opaque to anyone who isn't already familiar with the technique.

On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute commands in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised^{[citation needed]} to test the speeds of both the XOR technique and temporary variable swapping on the target architecture.

The XCHG instruction

Modern optimizing compilers work by translating the code they are given into an internal flow-based representation which they transform in many ways before producing their machine-code output. These compilers are more likely to recognize and optimize a conventional (temporary-based) swap than to recognize the high-level language statements that correspond to an XOR swap. Many times, what is written as a swap in high-level code is translated by the compiler into a simple internal note that two variables have swapped memory addresses, rather than any amount of machine code. Other times, when the target architecture supports it, the compiler can use a single XCHG (exchange) instruction which performs the swap in a single operation.

An XCHG operation was available as long ago as 1964, on the PDP-6 (where it was called EXCH) and in 1970 on the Datacraft 6024 series (where it was called XCHG). The Intel 8086, released in 1978, also included an instruction named XCHG. All three of these instructions swapped registers with registers, or registers with memory, but were unable to swap the contents of two memory locations. The Motorola 68000's EXG operation can only swap registers with registers. The PDP-10 inherited the PDP-6's EXCH instruction, but the PDP-11 (the machine on which the C programming language was developed) did not.

On the common x86 architecture, the XCHG instruction with a memory operand has an implicit LOCK prefix, so that the operation is atomic, requiring hundreds of cpu cycles to synchronize with every other device that can access main memory. By comparison, on many processors a single MOV instruction can be issued at the same time as other common instructions in a single clock cycle. Between two registers, XCHG may still be slower than three MOVs, but is smaller, so may be used where code size matters. x86's XCHG is primarily useful for its locking form, for writing locking primitives used in threaded or multiprocessing applications.

Aliasing

The XOR swap is also complicated in practice by aliasing. As noted above, if an attempt is made to XOR-swap the contents of some location with itself, the result is that the location is zeroed out and its value lost. Therefore, XOR swapping must not be used blindly in a high-level language if aliasing is possible.

Variations

The underlying principle of the XOR swap algorithm can be applied to any reversible binary operation. Replacing XOR by addition and subtraction gives a slightly different, but largely equivalent, formulation:

 void addSwap (int *x, int *y)
 {
     if (x != y) {
         *x = *x + *y;
         *y = *x - *y;
         *x = *x - *y;
     }
 }

Unlike the XOR swap, this variation requires that the underlying processor or programming language uses a method such as modular arithmetic or bignums to guarantee that the computation of X + Y cannot cause an error due to integer overflow. Therefore, it is seen even more rarely in practice than the XOR swap.

Notes

^ The first three properties, along with the existence of an inverse for each element, are the definition of an Abelian group. The last property is a structural feature of XOR not necessarily shared by other Abelian groups.
^ How to swap two integers in C++: http://big-bad-al.livejournal.com/98093.html