One of the most common concerns for scientific applications is: how
fast do they run? This is far from a full reference on how to study
performance. It just provides a few questions you might want to ask
when starting the process.
First Steps
Understand the following aspects of your application. The better you
understand them, the better we can help you. The point is to find out
where the application spends its time. Then you can worry about
improving that one function that takes 60% of the runtime. With that
in mind, the first two points below are the most important, followed
by the third.
How are you measuring time?
Relying on the scheduler to report run times is neither robust nor
accurate. There are a variety of tools one can attach to an
application to track run times, but the most portable approach tends
to be the intrinsic 'time' routines of your language or library.
Recording the wall-clock time on entering and exiting a routine tends
to be robust. gprof is also easy to use and does not require altering
your source code, although you typically have to recompile with
profiling enabled (for example, -pg with GCC). Its output will be just
a little less tailored to your application.
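The entry/exit timing described above can be sketched as follows. This is a minimal illustration in Python, not a recommended tool: the names timed, elapsed, calls, and smooth are hypothetical, and smooth merely stands in for a real routine. In an MPI code written in C or Fortran, the same pattern is usually built on MPI_Wtime.

```python
import time
from collections import defaultdict
from functools import wraps

# Accumulated wall-clock seconds and call counts, keyed by routine name.
elapsed = defaultdict(float)
calls = defaultdict(int)

def timed(func):
    """Record the wall-clock time between entering and exiting func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions alike.
            elapsed[func.__name__] += time.perf_counter() - start
            calls[func.__name__] += 1
    return wrapper

@timed
def smooth(values):
    # Hypothetical stand-in for a real compute routine.
    return [(a + b) / 2.0 for a, b in zip(values, values[1:])]

result = smooth(list(range(100_000)))
for name in elapsed:
    print(f"{name}: {calls[name]} call(s), {elapsed[name]:.6f} s")
```

Because the timer wraps the routine rather than living inside it, it can be removed again without touching the routine body.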
Where are you measuring times?
Taking a single global time is important, but it is not necessarily
the whole story. Try to take time measurements in different routines
of the code. In particular, separate the time spent in actual
computation from the time spent in communication. Both gprof and the
time routines from the previous point can be used for this.
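One way to keep computation and communication time apart is to accumulate them under separate labels. The sketch below assumes a hypothetical region helper, and exchange_step only simulates communication with a short sleep; a real code would place its message-passing calls there.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per named region.
totals = {}

@contextmanager
def region(name):
    """Add the time spent inside the with-block to totals[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - start

def compute_step(data):
    # Hypothetical stand-in for the numerical work.
    return [0.5 * x + 1.0 for x in data]

def exchange_step(data):
    # Hypothetical stand-in for communication; the sleep simulates a wait.
    time.sleep(0.001)
    return data

data = [float(i) for i in range(10_000)]
with region("total"):
    for _ in range(5):
        with region("compute"):
            data = compute_step(data)
        with region("communication"):
            data = exchange_step(data)

for name in ("compute", "communication"):
    share = 100.0 * totals[name] / totals["total"]
    print(f"{name}: {totals[name]:.6f} s ({share:.1f}% of total)")
```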
What is your communication pattern?
Understand what your communication is, not just how long it takes. Is
it all-to-all communication? Nearest neighbor?
You need to understand the pattern to know what performance you can
reasonably expect.
Single CPU Performance
Single CPU performance is how well each process performs on its
processor, ignoring the effect of communication. Single CPU
performance is a great mystery. Strive to understand some basics,
since to truly understand the performance you would have to be the
compiler writer with a very good understanding of the hardware. That
being said, there are certain rules of thumb that can make it easier
for the compiler to do good things and for the hardware to work to
your advantage:
Algorithm improvement
This is the single greatest place for performance improvements.
Spatial and Temporal locality of data
Many assume that spatial locality of data is the most important.
Spatial locality refers to chunks of data that are actually adjacent
in memory. It is true that being able to copy a chunk of contiguous
data saves the time of having to seek through memory. Depending on
how much data you move, the savings can be considerable.
Temporal locality is probably more important than spatial locality.
If you load a chunk of data, perform an operation on it, then load
another (even adjacent) chunk, only to reload the first chunk later,
you can pay a large cost. You want to do as much as possible on a
chunk of data in one go. This prevents data from being swapped in and
out of cache, which is a huge time saver.
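The idea of doing as much as possible on a chunk in one go can be sketched like this. Python is used only to show the restructuring; the cache effect itself matters mainly in compiled code that works on large arrays. The function names and the chunk size of 1024 are illustrative assumptions.

```python
def scale_then_shift_separately(values):
    # Pass 1 touches every element. By the time pass 2 starts, the
    # early elements may already have been evicted from cache.
    scaled = [2.0 * v for v in values]
    return [s + 1.0 for s in scaled]

def scale_and_shift_by_chunk(values, chunk=1024):
    # Apply both operations to one chunk before moving on, so each
    # chunk is loaded once and reused while it is still "hot".
    out = []
    for start in range(0, len(values), chunk):
        block = values[start:start + chunk]
        block = [2.0 * v for v in block]
        block = [b + 1.0 for b in block]
        out.extend(block)
    return out

values = [float(i) for i in range(10_000)]
same = scale_then_shift_separately(values) == scale_and_shift_by_chunk(values)
print("results identical:", same)
```

Both versions perform the same arithmetic in the same order per element, so the results match exactly. Only the order in which memory is visited changes.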
Certainly, in specific cases there is more that can be done. In some
cases you might unroll loops. If you are on an Intel box you might
look into their limited vector directives. A particular
implementation of a math library function might be poor, etc. But
keep in mind that these alterations might not be portable, even
between versions of the same compiler. Also keep in mind that,
because of the mystery of what is happening under the hood with a lot
of compilers and chips, even changes you are sure are for the better
might result in strange behavior in other parts of the code.
Should you pay attention to manufacturers' 'peak' performance numbers?
The story is much longer than what I write here, but basically,
manufacturers' 'peak' numbers are measured using applications that
are almost perfect for maximizing FLOPS rates and bandwidth. It is
common to compare performance against these 'peak' values, but keep
them in perspective: for example, 30% of peak FLOPS is considered
incredibly good and uncommon for a real application.
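The comparison against peak is simple arithmetic, sketched below. The formula for theoretical peak (cores times clock rate times floating-point operations per cycle) is a common simplification, and every number here is hypothetical, not a measurement of any real machine or application.

```python
def theoretical_peak_flops(cores, clock_hz, flops_per_cycle):
    """Simplified peak: every core retires flops_per_cycle each cycle."""
    return cores * clock_hz * flops_per_cycle

def percent_of_peak(flop_count, seconds, peak_flops):
    """Achieved FLOPS rate as a percentage of the theoretical peak."""
    return 100.0 * (flop_count / seconds) / peak_flops

# Hypothetical node: 8 cores at 2.5 GHz, 4 flops per cycle per core.
peak = theoretical_peak_flops(cores=8, clock_hz=2.5e9, flops_per_cycle=4)

# Hypothetical run: 1.2e12 floating-point operations in 100 seconds.
pct = percent_of_peak(flop_count=1.2e12, seconds=100.0, peak_flops=peak)

print(f"peak: {peak:.3e} FLOPS, achieved: {pct:.1f}% of peak")
```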
Parallel Performance
Many start evaluating an application's performance by looking at the
efficiency of the communication. Here are a few of the basics you
need in order to understand parallel performance:
What is the communication pattern/algorithm?
How one parallelizes the main algorithm will favor certain
communication patterns. Some patterns perform better than others,
and sometimes algorithms can be changed to make use of a better
pattern.
Global and non-global communication
Frequency of messages
Distance messages are traveling
What are the sizes of messages being sent in the different
patterns?
The smaller the messages, the greater the relative overhead (cost) of
sending them.
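The cost of small messages can be illustrated with the common latency-plus-bandwidth model, in which sending n bytes takes a fixed startup time plus n divided by the bandwidth. The model is an assumption brought in for illustration, and the latency and bandwidth values below are hypothetical, not measurements of any particular network.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Modelled time for one message: fixed startup cost plus payload time."""
    return latency_s + nbytes / bandwidth_bytes_per_s

def effective_bandwidth(nbytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved for a message of this size."""
    return nbytes / transfer_time(nbytes, latency_s, bandwidth_bytes_per_s)

LATENCY = 2.0e-6    # hypothetical startup cost per message, in seconds
BANDWIDTH = 1.0e9   # hypothetical link bandwidth, in bytes per second

for nbytes in (8, 1024, 1024 * 1024):
    eff = effective_bandwidth(nbytes, LATENCY, BANDWIDTH)
    print(f"{nbytes:>8} bytes: {100.0 * eff / BANDWIDTH:6.2f}% of link bandwidth")

# Same payload, sent as 1000 small messages or as one large message.
many_small = 1000 * transfer_time(8, LATENCY, BANDWIDTH)
one_large = transfer_time(8000, LATENCY, BANDWIDTH)
print(f"1000 x 8 bytes: {many_small:.6f} s, 1 x 8000 bytes: {one_large:.6f} s")
```

Under this model the startup cost is paid once per message, which is why aggregating many small messages into fewer large ones often helps.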
What is the expected performance of the system you are using?
Tools like Jumpshot and FPMPI can be extremely useful in evaluating
these points. They allow you to get a good deal of information with
no changes to your code if you are using MPICH. All you do is link
in specific libraries.