Saturday, November 14, 2009

Notes 64bit

Contents
============
References
Definition
Article - The 64-Bit Advantage
Article - x86: registered offender
Itanium2
Feature Comparison
Registers
AMD K8 vs Conroe FPU
Porting to a 64-bit Intel® architecture
How to check if code is 32bit or 64bit
Limitations
Large Arrays



References
=============
http://www.intel.com/cd/ids/developer/asmo-na/eng/197664.htm?page=5


Definition
===========
Ref: http://en.wikipedia.org/wiki/64-bit
"64-bit" computer architecture generally has integer registers that are 64 bits wide, which allows it to support (both internally and externally) 64-bit "chunks" of integer data.



The sizes to consider are those of the registers, the address bus, and the data bus; they need not all be the same width.


Most modern CPUs such as the Pentium and PowerPC have 128-bit vector registers used to store several smaller numbers, such as 4 32-bit floating-point numbers. A single instruction can operate on all these values in parallel (SIMD). They are 128-bit processors in the sense that they have 128-bit registers and in some cases a 128-bit ALU, but they do not operate on individual numbers that are 128 binary digits in length.
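
As a rough illustration of the SIMD idea above, the sketch below packs four 32-bit floats into one 128-bit value and adds two such values lane by lane. It is plain Python, not real vector instructions, and the names `pack4` and `add_lanes` are made up for this note.

```python
import struct

def pack4(values):
    """Pack four numbers as 32-bit floats into 16 bytes (128 bits)."""
    return struct.pack("<4f", *values)

def add_lanes(a, b):
    """Add two packed 128-bit values lane by lane, as a SIMD add would."""
    xs = struct.unpack("<4f", a)
    ys = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(xs, ys)))

a = pack4([1.0, 2.0, 3.0, 4.0])
b = pack4([10.0, 20.0, 30.0, 40.0])
total = add_lanes(a, b)

print(len(a) * 8)                    # width of the packed value in bits
print(struct.unpack("<4f", total))   # the four lane results
```

The packed value is 128 bits wide, but no single number in it is wider than 32 bits, which is the distinction the paragraph above draws.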


Article - The 64-Bit Advantage
===============================
Ref: http://www.pcmag.com/print_article/0,3048,a=116259,00.asp

The 32-bit Pentium-class chips that dominate today's desktops fetch and execute instructions from system memory in 32-bit chunks; 64-bit chips handle 64-bit instructions. And that's just what the workstation-class Intel Itanium 2 and HP Alpha chips do inside the TeraGrid's clusters.

New desktop-class 64-bit chips, such as the AMD Athlon64 and the Apple/ IBM PowerPC G5, can handle 64-bit instructions as well, but most PC apps—even the few that optimize some operations to exploit 64-bit processing—still rely on 32-bit instructions. A new generation of games and apps will no doubt take fuller advantage of 64-bit chips. But their ability to harness the new architecture fully may be hampered by the need to interact with Windows, since none of the desktop versions of the OS is yet slated for 64-bit optimization.

A major advantage to 64-bit processors over their 32-bit cousins is support for greater amounts of memory. In theory, a 64-bit processor can address exabytes (billions of billions of bytes) of RAM; 32-bit chips can use a maximum of 4GB of RAM. This breakthrough is used to good advantage at the National Center for Supercomputing Applications' (NCSA) TeraGrid, which allocates 12GB of system memory each to half of its 256 Itanium 2 processor nodes. It will be a while before anyone knows how fast Quake would run with that much memory, since PC motherboards don't exceed 8GB of RAM.

Future 64-bit apps will be able to chew on a class of computations known as floating-point operations far faster than 32-bit apps can. Necessary for 3-D rendering and animation of everything from molecular models to Halo aliens, floating-point calculations are so essential to complex scientific analysis that FLOPS (floating-point operations per second) are used as the unit of supercomputing performance. The ability of 64-bit chips to process floating-point operations faster and far more precisely than their 32-bit counterparts makes them powerhouses for simulations and visualization.


Article - x86: registered offender
===================================
Ref: http://techreport.com/reviews/2005q1/64-bits/index.x?pg=2

"Another problem with the x86 ISA is the number of general-purpose registers (GPRs) available. Registers are fast, local slots inside a processor where programs can store values. Data stored in registers is quickly accessible for reuse, and registers are even faster than on-chip cache. The x86 ISA only provides eight general-purpose registers, and thus is generally considered register-poor. Most reasonably contemporary ISAs offer more. The PowerPC 604 RISC architecture, to give one example, has 32 general-purpose registers. Without a sufficient number of registers for the task at hand, x86 compilers must sometimes direct programs to spend time shuffling data around in order to make the right data available for an operation. This creates overhead that slows down computation.

To help alleviate this bottleneck, the x86-64 ISA brings more and better registers to the table. x86-64 packs 8 more general-purpose registers, for a total of 16, and they are no longer limited to 32-bit values—all 16 can store 64-bit datatypes. In addition to the new GPRs, x86-64 also includes 8 new 128-bit SSE/SSE2 registers, for a total of 16 of those. These additional registers bring x86 processors up to snuff with the competition, and they will quite likely bring the largest performance gains of any aspect of the move to the x86-64 ISA.

What is the magnitude of those performance gains? Well, it depends. Some tasks aren't constrained by the number of registers available now, while others will benefit greatly when recompiled for x86-64 because the compiler will have more slots for local data storage. The amount of "register pressure" presented by a program depends on its nature, as this paper on 64-bit technical computing with Fortran explains:

The performance gains from having 16 GPRs available will vary depending on the complexity of your code. Compute-intensive applications with deeply nested loops, as in most Fortran codes, will experience higher levels of register pressure than simpler algorithms that follow a mostly linear execution path."

Summary -
x86 - 8x 32-bit General Purpose Registers
x86-64 - 16x 64-bit General Purpose Registers
Fortran - deeply nested DO loops create more register pressure, so such code gains the most from more (and wider) GPRs


Itanium2
=========
Ref: http://www.itmanagersjournal.com/feature/8611

Intel's Itanium 2, or IA64, is unlike any of the other 64-bit processors in production. It uses a Very Long Instruction Word (VLIW) design that depends on the software's compiler for performance. When the compiler creates program binaries for the Itanium 2, it predicts the most efficient method of execution, so the processor does less work when the program is running -- the software schedules its own resources beforehand, rather than forcing the hardware to do it on the fly. IA64 is used in the same kinds of workstations that UltraSPARC processors are used in, and can also scale up to 128 processors in high-powered servers. Silicon Graphics and Hewlett-Packard both sell computers based on the Itanium 2. GNU/Linux is generally the operating system of choice for IA64-based systems, but HP-UX and Windows 2003 Server will work on HP Itanium 2 servers.



Feature Comparison
==================

Size of Fetch and Execute Instructions - 64bit vs 32bit chunks
Number of General Purpose Registers (Fetch Registers?)
Memory Access - 18.4x10^9 GB vs 4GB
Floating Point Operations - faster with 64bit than 32bit
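
The memory figures in the list above can be checked with a few lines of arithmetic. This is only a sketch: it assumes byte addressing and that every address bit is usable, and the helper name `addressable_bytes` is made up here.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

GB = 10 ** 9     # decimal gigabyte
GiB = 2 ** 30    # binary gigabyte

print(addressable_bytes(32) / GiB)   # 32-bit limit in binary GB
print(addressable_bytes(64) / GB)    # 64-bit limit in decimal GB
```

The first line gives 4 (the 4GB limit). The second gives roughly 1.84 x 10^10, which is the 18.4x10^9 GB figure quoted above.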

Vector Registers - e.g. Pentium has 128-bit vector registers, each of which holds up to four 32-bit values.
ALU
FPU


Itanium2
- good for FP processing
- 2 FPUs (= 1 FMAC or 2 multiplications and 1 add), plus 2 additional FMACs for 3D processing.
- 64 bit address space
- a derivative of VLIW, dubbed Explicitly Parallel Instruction Computing (EPIC). It is theoretically capable of performing roughly 8 times more work per clock cycle than a non-superscalar CISC or RISC architecture due to its Parallel Computing Microarchitecture.
- supports 128 integer, 128 floating point, 8 branch and 64 predicate registers (for comparison, IA-32 processors support 8 registers and other RISC processors support 32 registers)


UltraSparc T1
- 8x Integer Cores share 1 FPU
- good for integer processing compared to Itanium


Throughout its history, Itanium has had the best floating point performance relative to fixed-point performance of any general-purpose microprocessor. This capability is not needed for most enterprise server workloads. Sun's latest server-class microprocessor, the UltraSPARC T1, acknowledges this explicitly, with performance dramatically skewed toward the improvement of integer processing at the expense of floating point performance (eight integer cores share a single FPU). Thus Itanium and Sun appear to be addressing separate subsets of the market. By contrast, IBM's Cell microprocessor, with a single general-purpose POWER core controlling eight simpler cores optimized for floating point, may eventually compete against Itanium for floating-point workloads.


Registers
==========
Integer - can be used to store pointers
Floating Point - most CPUs also have FPUs
Other

examples:
x86 - has x87 FPU with 8 x 80bit registers
x86 with SSE - 8x 128bit FP registers
x86-64 - has SSE with 16x 128bit FP registers
Alpha - has 32x 64bit FP registers and 32x 64bit integer registers.
Itanium2 - 128x 64bit GPRs, 128x 82bit FPregisters, 64x 1bit predicates, 8x 64bit branch registers

AMD K8 vs Conroe FPU
======================
Possibly better floating point performance of K8 processors

http://www.xbitlabs.com/articles/cpu/display/amd-k8l_5.html


Porting to a 64-bit Intel® architecture
============================================
(ref: http://www.developers.net/intelisnshowcase/view/358)

Porting Application Source Code
The most significant issues that software developers will face in porting source code to the 64-bit world concern the changes in pointer size and fundamental integer types. As such, these differences appear most prominently in C and C++ programs. Code written in Fortran, COBOL, Visual Basic, and most other languages (except assembly language, which must be completely rewritten) will usually need no modification; a simple recompilation is often all that is needed. Java code should not even need recompilation; Java classes should execute the same on a 64-bit JVM as on any 32-bit virtual machine.

C code (from here on, C++ is included in all discussions of C), however, allows casting across types and direct access to machine-specific integral types, so it will need some attention.

The first aspect is the size of pointers; 64-bit operating systems use 64-bit pointers. This means that the following will equal eight (and no longer four):

sizeof (ptrdiff_t)
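
A quick way to see the same numbers without a C compiler is Python's ctypes module, which reports the sizes used by the C ABI the interpreter was built for. This is only a sketch: ctypes has no `ptrdiff_t`, so `c_ssize_t` is used as a stand-in on the assumption that it has the same width, which holds on common platforms.

```python
import ctypes
import struct

pointer_bytes = ctypes.sizeof(ctypes.c_void_p)   # size of a C pointer
ssize_bytes = ctypes.sizeof(ctypes.c_ssize_t)    # signed, pointer-sized integer (stand-in for ptrdiff_t)

print("pointer size:", pointer_bytes, "bytes")
print("ssize_t size:", ssize_bytes, "bytes")
print("this interpreter is", pointer_bytes * 8, "bit")
```

A 64-bit Python prints 8 for both sizes, a 32-bit Python prints 4.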

As a result, structures that contain pointers will have different sizes as well. As such, if data laid out for these structures is stored on disk, reading it in or writing it out will cause errors. Likewise, unions with pointer fields will have different sizes and can cause unpredictable results.
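
The effect on structure size can be seen with a made-up record holding one 32-bit integer and one pointer. `Node` is a hypothetical structure invented for this note, described through ctypes so that it follows the C layout rules of the platform it runs on.

```python
import ctypes

class Node(ctypes.Structure):
    """Hypothetical record: a 32-bit id followed by a pointer."""
    _fields_ = [
        ("id", ctypes.c_int32),
        ("next", ctypes.c_void_p),
    ]

pointer_bytes = ctypes.sizeof(ctypes.c_void_p)

print("pointer:", pointer_bytes, "bytes")
print("Node:", ctypes.sizeof(Node), "bytes")
print("offset of next:", Node.next.offset)
```

On a typical 32-bit build the structure is 8 bytes with the pointer at offset 4. On a typical 64-bit build it is 16 bytes with the pointer at offset 8, because the pointer is aligned to 8 bytes and padding is inserted after the id. A file written with one layout cannot be read back as raw bytes with the other.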

The greatest effect, though, is felt wherever pointers are cast to integral types. This practice, which has been condemned for years as inimical to portability, will come back to haunt programmers who did not abandon it. The problems caused by it are traceable to the different widths used by pointers, integers and longs on the various platforms. Let's examine these.
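
As a sketch of what goes wrong, the snippet below lists the int, long and pointer widths of three common data models (ILP32, LP64 as used by 64-bit Unix and Linux, LLP64 as used by 64-bit Windows) and simulates storing a pointer in an integer type by keeping only the low bits. The address is made up, signedness is ignored, and `cast_pointer` is a hypothetical helper.

```python
# Widths in bits of int, long and pointer under three common data models.
DATA_MODELS = {
    "ILP32": {"int": 32, "long": 32, "pointer": 32},  # 32-bit Windows and Unix
    "LP64":  {"int": 32, "long": 64, "pointer": 64},  # 64-bit Unix and Linux
    "LLP64": {"int": 32, "long": 32, "pointer": 64},  # 64-bit Windows
}

def cast_pointer(address, model, c_type):
    """Simulate storing a pointer in an integer type: keep only the low bits."""
    bits = DATA_MODELS[model][c_type]
    return address & ((1 << bits) - 1)

address = 0x00007FF6_12345678   # made-up 64-bit address

for model in ("LP64", "LLP64"):
    for c_type in ("int", "long"):
        stored = cast_pointer(address, model, c_type)
        status = "ok" if stored == address else "TRUNCATED"
        print(f"{model} ({c_type}): {stored:#x} {status}")
```

Only the LP64 long keeps the full address. Code that stores pointers in an int breaks on both 64-bit models, and code that stores them in a long still breaks on 64-bit Windows.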


How to check if code is 32bit or 64bit
========================================
Use the dumpbin utility (shipped with Visual Studio) and look at the machine type listed under FILE HEADER VALUES in the output.
e.g.
dumpbin /headers hello.exe

Results if 64 bit:
FILE HEADER VALUES
8664 machine (x64)

Results if 32 bit:
FILE HEADER VALUES
14C machine (x86)
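
When dumpbin is not at hand, the same field can be read directly from the file. The sketch below follows the PE layout: the 4-byte value at offset 0x3C points to the "PE\0\0" signature, and the 2-byte Machine field comes right after it. The function names and the small name table are made up for this note, and only three machine types are listed.

```python
import struct

MACHINE_NAMES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x64 (64-bit)",
    0x0200: "Itanium (64-bit)",
}

def pe_machine(data):
    """Return the Machine field of a PE image given its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # Offset 0x3C of the DOS header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return machine

def describe(path):
    """Read an executable from disk and name its target machine."""
    with open(path, "rb") as f:
        data = f.read()
    machine = pe_machine(data)
    return MACHINE_NAMES.get(machine, f"unknown ({machine:#06x})")
```

Calling `describe("hello.exe")` on the example above should give the x64 or x86 entry, matching the 8664 and 14C values that dumpbin prints.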

Limitations
=============
Virtual Address Limit - theoretical 16EB
Virtual Address Limit - practical
i) Windows uses 44 address bits -> 16TB, but apparently allows only 8TB of that to be used.
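
The limits above are plain powers of two. In the sketch below, the 8TB figure is assumed to be simply half of the 44-bit range; that reading is an assumption of this note, not something the sources above state.

```python
TIB = 2 ** 40   # bytes in one binary terabyte
EIB = 2 ** 60   # bytes in one binary exabyte

theoretical = 2 ** 64         # full 64-bit virtual address space
windows_total = 2 ** 44       # 44 address bits, as noted above
usable = windows_total // 2   # assumption: half of the range is usable

print(theoretical // EIB, "EB theoretical")
print(windows_total // TIB, "TB with 44 bits")
print(usable // TIB, "TB usable")
```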


Large Arrays
=============
http://episteme.arstechnica.com/eve/forums/a/tpc/f/6330927813/m/420003239831/r/308002539831

BigArray, getting around the 2GB array size limit
http://blogs.msdn.com/joshwil/archive/2005/08/10/450202.aspx
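
The general idea behind getting around a per-array size limit is to split one logical array into several smaller blocks and translate each index into a (block, offset) pair. The sketch below shows that idea in Python. It is not the code from the linked post, and the class name and default block size are made up.

```python
class ChunkedArray:
    """Fixed-length array stored as a list of smaller blocks."""

    def __init__(self, length, block_size=1 << 20):
        if length < 0 or block_size <= 0:
            raise ValueError("length must be >= 0 and block_size > 0")
        self.length = length
        self.block_size = block_size
        block_count = (length + block_size - 1) // block_size
        self.blocks = []
        for i in range(block_count):
            # The last block may be shorter than the others.
            size = min(block_size, length - i * block_size)
            self.blocks.append([0] * size)

    def _locate(self, index):
        """Translate a flat index into (block number, offset in block)."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.block_size)

    def __getitem__(self, index):
        block, offset = self._locate(index)
        return self.blocks[block][offset]

    def __setitem__(self, index, value):
        block, offset = self._locate(index)
        self.blocks[block][offset] = value

    def __len__(self):
        return self.length

big = ChunkedArray(10, block_size=4)   # tiny sizes so the example runs instantly
big[9] = 42
print(len(big.blocks), big[9])
```

No single block ever exceeds `block_size` elements, so the total length can go past whatever limit applies to one contiguous array.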
