Thursday, June 24, 2010

Fortran Debugging, Threading, Optimising articles

This post is a quick summary of some online articles which I find useful. Only extracts are given below; the full articles can be found at their original sources via the URLs given.


Threading Fortran applications for parallel performance on multi-core systems

Most processors now come with multiple cores, and future increases in performance are expected to come mostly from increases in core count. Performance sensitive applications that neglect the opportunities presented by additional cores will soon be left behind. This article discusses ways for an existing, serial Fortran application to take advantage of these opportunities on a single, shared memory system with multiple cores. Issues addressed include data layout, thread safety, performance and debugging. Intel provides software tools that can help in the development of robust, parallel applications that scale well.

Levels of Parallelism

1 SIMD instructions
2 Instruction level
3 Threading (usually shared memory)
4 Distributed memory clusters
5 “Embarrassingly parallel” multiprocessing

Ways to Introduce Threading

1 Threaded libraries, e.g. Intel® MKL
2 Auto-parallelization by the compiler
3 Asynchronous I/O (very specialized; see compiler documentation)
4 Native threads
5 OpenMP

Intel® Math Kernel Library
1 Many components of MKL have threaded versions
2 Link threaded or non-threaded interface
3 Set the number of threads
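For example, the number of threads used by MKL's threaded components can be set via the MKL_NUM_THREADS environment variable or at run time through MKL's service routine. A minimal sketch (the thread count of 4 is an arbitrary choice, and the program must be linked against the threaded MKL interface):

      program mkl_threads_demo
        implicit none
        ! mkl_set_num_threads is an MKL service routine; the
        ! equivalent environment variable is MKL_NUM_THREADS.
        call mkl_set_num_threads(4)   ! arbitrary choice of 4 threads
        ! ... calls to threaded MKL components (BLAS, LAPACK, ...) ...
      end program mkl_threads_demo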

Example: PARDISO (Parallel Direct Sparse Solver)
1 Solver for large, sparse symmetric and unsymmetric systems of linear equations on shared memory systems
2 For algorithms, see http://www.pardiso-project.org

Auto-parallelization

1 The compiler can thread simple loops automatically
2 Based on the same run-time library (RTL) threading calls as OpenMP

Conditions for Auto-parallelization

1 Loop count known at entry (no DO WHILE)
2 Loop iterations are independent
3 Enough work to amortize parallel overhead
4 Conditions for OpenMP loops are similar
5 Directives may be used to guide the compiler:
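As a sketch of point 5, the Intel compiler accepts hints such as !DIR$ PARALLEL, which tells the auto-parallelizer (enabled with -parallel) to ignore dependences it cannot disprove for the loop that follows; check the compiler documentation for the exact directives your version supports:

      subroutine add_vectors(a, b, c, n)
        integer, intent(in)  :: n
        real,    intent(in)  :: b(n), c(n)
        real,    intent(out) :: a(n)
        integer :: i
        ! Assert that the loop is safe to thread; the iterations
        ! are independent, so auto-parallelization is legal.
!DIR$ PARALLEL
        do i = 1, n
           a(i) = b(i) + c(i)
        end do
      end subroutine add_vectors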

Example: matrix multiply
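The article's code is not reproduced here; a minimal sketch of the kind of loop nest the auto-parallelizer handles well (with -parallel the compiler would typically thread the outer loop, whose iterations are independent):

      subroutine matmul_demo(a, b, c, n)
        ! Classic triple loop; iterations over j are independent,
        ! so the compiler can thread the outer loop.
        integer, intent(in)  :: n
        real,    intent(in)  :: a(n,n), b(n,n)
        real,    intent(out) :: c(n,n)
        integer :: i, j, k
        c = 0.0
        do j = 1, n
           do k = 1, n
              do i = 1, n
                 c(i,j) = c(i,j) + a(i,k) * b(k,j)
              end do
           end do
        end do
      end subroutine matmul_demo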


OpenMP – advantages

1 Standardized API based on compiler directives

OpenMP Programming Model

Fork-Join Parallelism:

1 Master thread spawns a team of threads as needed

Note that Intel’s implementation of OpenMP creates a separate monitor thread in addition to any user threads.
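A minimal fork-join sketch (compile with -openmp): the master thread forks a team at the start of the parallel region and joins it at the end.

      program fork_join_demo
        use omp_lib
        implicit none
!$OMP PARALLEL
        ! Each thread in the team executes this block.
        print *, 'hello from thread', omp_get_thread_num(), &
                 'of', omp_get_num_threads()
!$OMP END PARALLEL
      end program fork_join_demo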


OpenMP – where to thread

1 Start by mapping out high level structure
2 Where does your program spend the most time?
3 Prefer data parallelism
4 Favor coarse grain (high level) parallelism

Example: Square_Charge

1 Calculates the electrostatic potential at a series of points in a plane due to a uniform square distribution of charge
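The article's actual source is not included here; a purely hypothetical sketch of how the loop over field points might be threaded (point_potential, the arrays and their names are stand-ins, not the real example):

      subroutine square_charge_sketch(x, y, potential, npts)
        ! Hypothetical sketch: point_potential stands in for the
        ! integration over the square charge distribution.
        integer, intent(in)  :: npts
        real,    intent(in)  :: x(npts), y(npts)
        real,    intent(out) :: potential(npts)
        real,    external    :: point_potential
        integer :: i
        ! Each field point is independent, so the loop threads cleanly.
!$OMP PARALLEL DO
        do i = 1, npts
           potential(i) = point_potential(x(i), y(i))
        end do
!$OMP END PARALLEL DO
      end subroutine square_charge_sketch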


OpenMP – how do threads interact?

1 OpenMP is a shared memory model
2 Unintended sharing of data causes race conditions: the program's outcome changes as the threads are scheduled differently
3 To control race conditions, use synchronization to protect data conflicts
4 Synchronization is expensive, so change how data is accessed to minimize the need for it
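A sketch of the classic race and its cure: several threads updating one shared accumulator. The REDUCTION clause gives each thread a private partial sum and combines them safely at the end.

      real function sum_array(a, n)
        integer, intent(in) :: n
        real,    intent(in) :: a(n)
        integer :: i
        real    :: s
        s = 0.0
        ! Without REDUCTION, concurrent updates of s would race.
!$OMP PARALLEL DO REDUCTION(+:s)
        do i = 1, n
           s = s + a(i)
        end do
!$OMP END PARALLEL DO
        sum_array = s
      end function sum_array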

OpenMP – data

1 Identify which data are shared between threads and which need a separate copy for each thread (see the scoping sketch after this list)

2 It's helpful (but not required) to make shared data explicitly global, in modules or common blocks, and thread-private data local and automatic

3 Dynamic allocation is OK (malloc, ALLOCATE)

4 Each thread gets its own private stack, but the heap is shared by all threads
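A sketch of explicit scoping: the arrays and their size are shared, while the loop index and a scratch variable get a per-thread copy; DEFAULT(NONE) forces every variable to be scoped deliberately.

      subroutine scale_demo(a, b, n, factor)
        integer, intent(in)  :: n
        real,    intent(in)  :: b(n), factor
        real,    intent(out) :: a(n)
        integer :: i
        real    :: tmp
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(a, b, n, factor) PRIVATE(i, tmp)
        do i = 1, n
           tmp  = factor * b(i)   ! tmp is private: one copy per thread
           a(i) = tmp
        end do
!$OMP END PARALLEL DO
      end subroutine scale_demo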

OpenMP – data scoping

1 Distinguish lexically explicit parallel regions from their “dynamic extent”: functions or subroutines called from within an explicit parallel region, which may contain no OpenMP directives at all, or only “orphaned” OpenMP directives (sketched below)

2 Lexically explicit: !$OMP PARALLEL to !$OMP END PARALLEL
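A sketch of both extents: the parallel region in driver is lexically explicit, while work lies in its dynamic extent and contains only an orphaned !$OMP DO, which binds to the parallel region of whoever called it.

      subroutine driver(a, n)
        integer, intent(in) :: n
        real, intent(inout) :: a(n)
!$OMP PARALLEL                   ! lexically explicit region
        call work(a, n)          ! work is in the dynamic extent
!$OMP END PARALLEL
      end subroutine driver

      subroutine work(a, n)
        integer, intent(in) :: n
        real, intent(inout) :: a(n)
        integer :: i
!$OMP DO                         ! orphaned directive: binds to the
        do i = 1, n              ! caller's enclosing parallel region
           a(i) = 2.0 * a(i)
        end do
!$OMP END DO
      end subroutine work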


Thread Safety

1 A thread-safe function can be called simultaneously from multiple threads and still give correct results
2 ifort serial defaults: local scalars are automatic (on the stack), but local arrays are allocated statically, as if they had the SAVE attribute
3 When compiling with -openmp, the default changes to -auto: all local variables are placed on the stack

Making a function thread safe

1 With the compiler: -auto, -openmp or -recursive make local variables automatic
2 In source code: declare the procedure RECURSIVE (see the sketch after this list), or give locals the AUTOMATIC attribute
3 In either case: data with the SAVE attribute, DATA initialization, or in COMMON remain shared and must be protected
4 OpenMP has various synchronization constructs to protect operations that are potentially unsafe
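A sketch of point 2: the RECURSIVE keyword guarantees that all locals live on the stack, so simultaneous calls from several threads cannot collide.

      recursive function poly(x) result(y)
        ! RECURSIVE forces stack allocation of locals, making the
        ! function reentrant without any compiler options.
        real, intent(in) :: x
        real :: y
        real :: t           ! automatic: one copy per call, per thread
        t = x * x
        y = t + 2.0*x + 1.0
      end function poly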

Thread Safe Libraries

1 The Intel® Math Kernel Library is thread-safe
2 The Intel Fortran run-time library comes in two versions, non-threaded and thread-safe; link the thread-safe version with -threads

Performance considerations

1 Start with optimized serial code, vectorized inner loops, etc. (-O3 -xsse4.2 -ipo …)
2 Ensure sufficient parallel work
3 Minimize data sharing between threads
4 Avoid false sharing of cache lines
5 Scheduling options
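A sketch of point 5: the SCHEDULE clause trades scheduling overhead against load balance. DYNAMIC suits loops whose iterations vary in cost; the chunk size of 8 is an arbitrary choice, and process_row is a hypothetical routine.

      subroutine uneven_work(n)
        integer, intent(in) :: n
        integer :: i
        external process_row   ! hypothetical routine of varying cost
        ! Threads grab chunks of 8 iterations as they finish,
        ! balancing the load at some scheduling cost.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 8)
        do i = 1, n
           call process_row(i)
        end do
!$OMP END PARALLEL DO
      end subroutine uneven_work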

Timers for threaded apps

1 The Fortran standard timer CPU_TIME returns “processor time”, which is summed over all threads, so it is not useful for measuring parallel speedup
2 The Fortran intrinsic subroutine SYSTEM_CLOCK returns data from the real-time (wall) clock
3 dclock (an Intel-specific function) can also be used
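A sketch of wall-clock timing of a threaded region with SYSTEM_CLOCK (the summation loop is just a placeholder workload):

      program wall_time_demo
        implicit none
        integer :: t0, t1, rate, i
        real    :: s
        call system_clock(t0, rate)     ! start of timed region
        s = 0.0
!$OMP PARALLEL DO REDUCTION(+:s)
        do i = 1, 10000000
           s = s + sin(real(i))         ! placeholder workload
        end do
!$OMP END PARALLEL DO
        call system_clock(t1)           ! end of timed region
        print *, 'sum =', s, ' elapsed:', real(t1 - t0) / real(rate), 's'
      end program wall_time_demo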

Thread Affinity Interface

1 Allows OpenMP threads to be bound to physical or logical cores

NUMA considerations

1 Memory should be allocated “close” to the core that will use it; on most operating systems a page is physically placed when it is first touched, so initialize data in the thread that will later compute on it
2 Remember to set KMP_AFFINITY so that threads stay bound to consistent cores
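A first-touch sketch: the initialization loop uses the same static schedule as the compute loops, so each thread places its own pages on its local memory node (assuming KMP_AFFINITY pins threads to cores).

      subroutine first_touch_init(a, n)
        integer, intent(in)  :: n
        real,    intent(out) :: a(n)
        integer :: i
        ! Touching a page allocates it on the toucher's NUMA node;
        ! later compute loops should use the same STATIC schedule.
!$OMP PARALLEL DO SCHEDULE(STATIC)
        do i = 1, n
           a(i) = 0.0
        end do
!$OMP END PARALLEL DO
      end subroutine first_touch_init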


Common problems

1 Insufficient stack size
2 For the whole program (shared + local data): raise the shell limit, e.g. ulimit -s (bash)
3 For an individual thread (thread-local data only): set the OMP_STACKSIZE (or Intel-specific KMP_STACKSIZE) environment variable

Tips for Debugging OpenMP apps

1 Run with OMP_NUM_THREADS=1; if the failure persists, the bug is probably not a threading issue
2 Build with -openmp-stubs -auto: OpenMP calls are resolved serially by a stubs library, while the memory model stays the same
3 If it works with -openmp-stubs alone but fails once -auto is added, the changed memory model is implicated, e.g. a local variable that needs the SAVE attribute
4 If debugging with PRINT statements, remember that output from different threads may be buffered and interleaved unpredictably
5 Debug with -O0 -openmp (OpenMP threading, unlike auto-parallelization, works with optimization disabled)

Floating-Point Reproducibility

1 Runs of the same executable with different numbers of threads may give slightly different answers
2 Floating-point reductions are still not strictly reproducible in OpenMP, even for the same number of threads, because the order in which partial sums are combined can vary between runs

Intel-specific Environment Variables

1 KMP_SETTINGS = 0 | 1
2 KMP_VERSION = off | on
3 KMP_LIBRARY = turnaround | throughput | serial
4 KMP_BLOCKTIME
5 KMP_AFFINITY (See main documentation for full API)
6 KMP_MONITOR_STACKSIZE
7 KMP_CPUINFO_FILE

Tools for Debugging OpenMP apps

1 The compiler source checker (‘parallel lint’)
2 Updated Intel Parallel Debugger, idb (Linux) and Intel Parallel Debugger Extension (on Windows)

Intel® Thread Checker
1 Unified set of tools that pinpoint hard-to-find errors in multi-threaded applications
2 Display data at the Linux command line or via a Windows GUI

Intel® Thread Profiler
1 Identifies performance bottlenecks in threaded applications, such as load imbalance, synchronization overhead and serial regions

Summary
Intel software tools provide extensive support for threading applications to take advantage of multi-core architectures.
Advice and background information are provided for a variety of issues that may arise when threading a Fortran application.





Tips for Debugging Run-time Failures in Applications Built with the Intel(R) Fortran Compiler

Your application builds successfully but crashes at run time. What next? Try some useful Intel compiler diagnostic options before launching into lengthy debugger sessions.

1) Build with /traceback (Windows*) or -traceback (Linux* or Mac OS* X) so that a run-time failure reports the routine, source file and line where it occurred.

2) Build with /gen-interfaces /warn:interfaces (Windows) or -gen-interfaces -warn interfaces (Linux or Mac OS X) so the compiler generates interface blocks and checks your calls against them, as sketched below.
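A sketch of the kind of error these options catch at compile time; the deliberately broken call below passes a scalar where the callee expects an array:

      subroutine takes_array(x)
        real :: x(10)
        x = 0.0
      end subroutine takes_array

      program mismatch
        real :: s
        call takes_array(s)   ! scalar actual vs. array dummy:
      end program mismatch    ! flagged by -warn interfaces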

3) Try building and running with /check (Windows) or -check (Linux and Mac OS X) to enable run-time checks, e.g. for out-of-bounds array accesses and uninitialized variables.

4) Build your program, including the main routine, with /fpe:0 (Windows) or -fpe0 (Linux or Mac OS X) so that floating-point exceptions (divide-by-zero, overflow, invalid operation) abort the program instead of silently propagating NaNs and infinities.
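A sketch of why this helps: compiled with -fpe0 (and ideally -traceback), the program below stops at the faulting division; without it, Infinity propagates silently.

      program fpe_demo
        implicit none
        real :: x, y
        x = 0.0
        y = 1.0 / x     ! with -fpe0 the program aborts here
        print *, y      ! without it, this prints Infinity
      end program fpe_demo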

5) If your application fails early on with a segmentation fault, you might be exceeding the default maximum stack size. On Linux or Mac OS X, try setting
ulimit -s unlimited (bash) or limit stacksize unlimited (C shell)

6) Use the compiler-provided interface modules. If you call run-time library functions, add
      USE IFPORT
to the calling scope. If you call OpenMP run-time library functions, add
      USE OMP_LIB
If you call functions from MKL or IMSL*, USE the corresponding module(s).

7) Look carefully for any error messages in your output log file.

8) If you are building an application using OpenMP*, check out the advice under “Tips for Debugging OpenMP Apps” at http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/


9) For Windows, see the section Building Applications / Debugging in the main compiler documentation. For Linux or Mac OS X, see the documentation for the Intel(R) Debugger (idb).
