Come 2008, the market and economic gurus predict a lot of volatility in everything related to money and forex. Not only that, technology will be at its best to make our lives better (remember Philips saying: Let's make things better) and give us more options to communicate and collaborate on a wider scale. With all this, the concern and efforts for a greener and safer earth will still be a "Work-In-Progress", and all of it for the good of humankind.
Not looking at the flip side, though (you can say I'm being too optimistic about a great 2008).
Well, all said, these are some of the websites and blogs I see myself carrying with me into 2008: a great compilation of topics covering a lot of the areas I go through on a daily basis. Most of the links are feeds, so you can add them directly to your reader (I use Google Reader).
A list of tech-blogs:
paulbridger.net - C++, patterns, design
Dr.Dobb's C++ Articles
DevX: Latest C++ Content
CodeGuru.com
Netotto Blog : Another Software Developer's Blog
Recording my programming path
Coding Misadventures
Monkey Bites
The ones on Network Security:
Dark Reading: Dark Reading News Analysis
SecGuru
SecurityFocus News
A list of blogs on productivity, professional achievement, and morale-boosting things:
Lifehacker: How To
Great Solutions to Team Challenges
Dumb Little Man - Tips for Life
Achieve IT!
A list of blogs on personal finance:
AllFinancialMatters
GetRichSlowly
I Will Teach You To Be Rich
Moneycontrol Pers Fin
Other than the above, some web links that are interesting:
Koders
TBB
Shuva's Photo blog
Discover Magazine: Updates on Science
The Site for Books and Readers
With this and a lot more to come in 2008... I wish you all a very happy and green 2008!
Monday, December 31, 2007
Tuesday, December 18, 2007
How to use TBB parallel_for
Well, now that we've installed and configured the TBB libraries on our (Linux) machine, we can start playing with the various parallelism constructs TBB provides and use them to gain meaningful efficiency in our day-to-day problem solving.
Why stress on "meaningful"? We'll talk about it later in this column.
Here's working code that compares execution times between a multiplication job done with a plain sequential for loop and with a TBB parallel_for. I've put in comments to explain the important statements in the code.
//These includes are the ones required to use the TBB library's parallel_for
#include "/tbb/tbb20_20070927oss_src/include/tbb/blocked_range.h"
#include "/tbb/tbb20_20070927oss_src/include/tbb/parallel_for.h"
#include "/tbb/tbb20_20070927oss_src/include/tbb/task_scheduler_init.h"
#include <iostream>
#include <cstdlib> // for atoi()
#include <sys/time.h> // for gettimeofday()
#define MULTIPLIER 3.1456
//Just something I found at koders for timing
#define TIMERSUB(a, b, result) \
do { \
(result)->tv_sec = (a)->tv_sec - (b)->tv_sec; \
(result)->tv_usec = (a)->tv_usec - (b)->tv_usec; \
if ((result)->tv_usec < 0) { \
--(result)->tv_sec; \
(result)->tv_usec += 1000000; \
} \
} while (0)
using namespace tbb;
using namespace std;
typedef long long mytime_t;
// This struct is the key piece we need to define for TBB's
// parallel_for: a user-defined type whose overloaded operator()
// wraps the serial functionality we want to break into parallel
// execution. Here I use a struct (a class would work too) that
// holds pointers to the input and output float arrays and whose
// operator() performs a serial for loop over a subrange of the
// input.
struct myNumbers {
float* ptr_input;
float* ptr_output;
//Just a constructor with an initialization list for use in the parallel_for call
myNumbers(float* input, float* output):ptr_input(input),ptr_output(output){}
// This is the actual body that parallel_for invokes via the TBB
// runtime. It comes as a struct/class definition because the
// compiler expands and inlines this code as part of template
// instantiation. The TBB runtime takes the blocked_range and
// splits the for loop across parallel threads to fit the number
// of processors/cores. The number of threads to break this
// operation into is calculated by the TBB runtime, so a developer
// using the TBB library can concentrate on the functionality
// without worrying about the parallelizing math involved.
void operator()(const blocked_range<int> &range) const {
for (int i = range.begin(); i != range.end(); i++)
ptr_output[i] = ptr_input[i] * MULTIPLIER;
}
};
int main(int argc, char* argv[]) {
// for timing execution
timeval t_start, t_end, t_result, tbb_start, tbb_end, tbb_result;
mytime_t singlethread_time, tbb_time;
int i = 0;
float* ptr_input;
float* ptr_outputSingle;
float* ptr_outputTBB;
// Initialize the TBB runtime...
task_scheduler_init init;
if( argc != 2 ) {
cout << "Usage: " << argv[0] << " <number of elements>" << endl;
return 1;
}
int numElements = atoi(argv[1]);
if( numElements <= 1 ) {
cout << "Array size " << numElements << " invalid: an integer > 1 is required" << endl;
return 1;
}
ptr_input = new float[numElements];
ptr_outputSingle = new float[numElements];
ptr_outputTBB = new float[numElements];
for(i = 0; i < numElements; i++) {
ptr_input[i] = i;
ptr_outputSingle[i] = 0;
ptr_outputTBB[i] = 0;
}
//Time the execution using plain sequential for
gettimeofday(&t_start,NULL);
for( i=0; i < numElements; i++ ) {
ptr_outputSingle[i] = ptr_input[i] * MULTIPLIER;
}
gettimeofday(&t_end,NULL);
TIMERSUB(&t_end,&t_start,&t_result); //TIMERSUB computes (a - b), so end goes first
singlethread_time = (mytime_t)(t_result.tv_sec * 1000000 + t_result.tv_usec);
//Time the execution using TBB parallel_for
gettimeofday(&tbb_start,NULL);
parallel_for(blocked_range<int>(0,numElements),
myNumbers(ptr_input,ptr_outputTBB), auto_partitioner());
gettimeofday(&tbb_end,NULL);
TIMERSUB(&tbb_end,&tbb_start,&tbb_result);
tbb_time = (mytime_t)(tbb_result.tv_sec * 1000000 + tbb_result.tv_usec);
//Verify that the outputs match
for(i=0; i < numElements; i++) {
if( ptr_outputSingle[i] != ptr_outputTBB[i] ) {
cout << ptr_input[i] << " * " << MULTIPLIER <<" = " <<
ptr_outputSingle[i] << " AND " << ptr_outputTBB[i] << endl;
}
}
cout << "Sequential for execution time: " << singlethread_time << " units"<< endl;
cout << "TBB parallel_for execution time: " << tbb_time << " units" << endl;
return 0;
}
I used a Makefile for this, but we can also build it directly as (assuming the file is named simple_for.cpp):
g++ -O2 -DNDEBUG -o ./simple_for simple_for.cpp -ltbb
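For reference, a minimal Makefile along the lines of what I used might look like this (just a sketch; the TBB_ROOT path is an assumption based on the install layout from the "Fun with Intel TBB!" post below, so adjust it for your box):
# Minimal Makefile sketch for simple_for.cpp.
# TBB_ROOT is assumed from my install layout; change it for your machine.
TBB_ROOT = /tbb/tbb20_20070927oss_src
CXX      = g++
CXXFLAGS = -O2 -DNDEBUG -I$(TBB_ROOT)/include
LDLIBS   = -ltbb

simple_for: simple_for.cpp
	$(CXX) $(CXXFLAGS) -o $@ $< $(LDLIBS)

clean:
	rm -f simple_for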
Now, let's talk about "meaningful" efficiency gains:
I ran this code on my Linux machine with the argument (numElements) set to 10, 100, and 1000, and found some performance improvement when using TBB parallel_for (assuming the execution timings are reported correctly). But when I ran it for numbers beyond 10000, I found that the serial for does it in less time. No no no buddy... don't even think that parallel_for has a constraint here in terms of breaking things up into smaller chunks of serial for loops. Intel (and others who champion parallelism) specifically take Amdahl's Law and Gustafson's Law into account when proposing TBB to developers, so there's a level of tuning available in TBB (based on the practical load and the processor configuration one is using). In this case I could overcome this limit of 10000 by providing a "grainsize" to the blocked_range() constructor as:
parallel_for(blocked_range<int>(0,numElements,10),myNumbers(ptr_input,ptr_outputTBB));
Here, the third argument to blocked_range() is the grainsize, and for large iteration counts I saw more performance improvement as I kept reducing it (finally to 10) starting from an initial grainsize of 1000. Also observe that I've dropped the auto_partitioner argument from the parallel_for call now that I specify a grainsize in the blocked_range constructor. Using a partitioner to decide the size of the parallel chunks for your processing subsystem is one of the new features in TBB: with auto_partitioner, the TBB runtime automatically chooses a chunk size optimized for parallelizing iterations on the underlying processing subsystem.
Refer to the TBB Getting Started doc for more details on how to select the right grainsize for your iterations and for partitioner details.
In short: grainsize specifies the number of iterations in a "reasonable size" chunk to feed a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately.
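To make the two variants easy to compare, here are both call forms side by side (a minimal sketch; the grainsize of 10 is just the value that worked best in my runs above, not a general recommendation):
// Variant 1: explicit grainsize; TBB splits the range into
// chunks of roughly 10 iterations each.
parallel_for(blocked_range<int>(0, numElements, 10),
myNumbers(ptr_input, ptr_outputTBB));
// Variant 2: no grainsize; auto_partitioner picks chunk sizes
// at runtime based on the underlying processing subsystem.
parallel_for(blocked_range<int>(0, numElements),
myNumbers(ptr_input, ptr_outputTBB), auto_partitioner());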
Yo! So we've had a great start with TBB parallel_for demonstrating its might in multicore/multi-processor environments. There's a lot more to the parallel algorithms in the TBB library. Not only that, there's a whole bunch of customized STL-like constructs that work in tandem with multi-threaded code without the developer having to maintain the threading infrastructure. Let me explore some more of these features next week when I come back after the Christmas vacation. Till then, happy parallelizing!
Thursday, December 13, 2007
Fun with Intel TBB!
Phew! With Linux there's always some amount of configuration and tweaking required before you can build source or make use of a new library... and I must tell you, I love this challenge! The best example is when you want a GNU app or framework to help you do something more with your Linux box. Most of the time we'll download source code from free software websites like SourceForge to get started. Then comes the process of configuring, making the source, and installing it. That's not all: sometimes you have to go a step further and comment out or fix some simple errors (like casting) in the C files of the app before you can successfully build it and get the required binaries.
Here I want to capture the experience I had installing TBB on my Linux box (running SLES 10 with a 2.6.16.21 kernel and gcc/g++ version 4.1.0).
Let's go step-by-step from here (a consolidated shell transcript follows the list):
1) copy the following tar.gz files to some folder like /tbb/
tbb20_20070927oss_src.tar.gz
tbb20_20070927oss_lin.tar.gz
2) Now, extract everything there itself using tar -zxvf filenames
3) This will give you two folders:
tbb20_20070927oss_src
tbb20_20070927oss_lin
4) From tbb20_20070927oss_lin, copy the folder ia32 into the tbb20_20070927oss_src directory (mine is a 32-bit platform on an Intel box)
5) If you're lucky enough, you'll get libtbb.so and the other libraries for your kernel+glibc version in one of the four folders inside tbb20_20070927oss_src/ia32
6) If not, we need to build libtbb.so (the crux of everything) for your platform, so "cd /tbb/tbb20_20070927oss_src/src/tbb/"
7) Run "make" here and see if your luck strikes and you get a libtbb.so without errors.
8) If not, then try either of these things or both:
(a) If you see a make error for task.cpp, you may be asked to fix this:
/src/tbb/task.cpp:396: warning: deprecated conversion from string constant
I know you can do this, so I won't fix it for you here ;)
(b) If it still doesn't work, figure out what else is preventing a successful make of libtbb.so and try to resolve it.
Lastly, you can try using the libtbb.so from any of the ia32 folders, like tbb/tbb20_20070927oss_src/ia32/cc4.1.0_libc2.4_kernel2.6.16.21/lib
9) Once you have the right versions of libtbb.so and libtbbmalloc.so for your platform, create soft links to them in /usr/lib/
10) Now we're ready to make one of the sample codes supplied with the TBB source.
Go to the sample code folder with "cd tbb/tbb20_20070927oss_src/examples/parallel_for/seismic" and do a make there.
11) Again, things are not that straightforward, buddy!
You need to either add to the Makefile the include path for the files included in Seismic.cpp (like /tbb/tbb20_20070927oss_src/include/tbb/parallel_for.h) or edit the .cpp file to use absolute paths to these .h files.
12) After fixing all these make dependencies, you'll be able to build the binary and see it running on your Linux machine, reporting the number of frames per second achieved with parallelism.
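Putting the steps together, here's roughly what the shell session looks like (a sketch assuming the same paths and versions as above; the folder name under ia32 is from my box, and yours may differ):
# Steps 1-3: unpack sources and prebuilt binaries under /tbb/
cd /tbb
tar -zxvf tbb20_20070927oss_src.tar.gz
tar -zxvf tbb20_20070927oss_lin.tar.gz
# Step 4: copy the prebuilt ia32 binaries into the source tree
cp -r tbb20_20070927oss_lin/ia32 tbb20_20070927oss_src/
# Steps 6-7: build libtbb.so from source if no prebuilt matches
cd /tbb/tbb20_20070927oss_src/src/tbb
make
# Step 9: soft-link the libraries into /usr/lib/
ln -s /tbb/tbb20_20070927oss_src/ia32/cc4.1.0_libc2.4_kernel2.6.16.21/lib/libtbb.so /usr/lib/libtbb.so
ln -s /tbb/tbb20_20070927oss_src/ia32/cc4.1.0_libc2.4_kernel2.6.16.21/lib/libtbbmalloc.so /usr/lib/libtbbmalloc.so
# Step 10: build and run the seismic sample
cd /tbb/tbb20_20070927oss_src/examples/parallel_for/seismic
make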
Now that we've got the machine running this example successfully, why not try our own parallel_for? It seems like a good starting point to go parallel the Intel way!
Coming up next -> How to use TBB parallel_for