I had been running GPU-Grid without any problems until the last few days, but I have now had over a dozen failed WUs, with these messages:
28/03/2010 11:56:29 GPUGRID Computation for task a62-TONI_HERG79a-15-100-RND2348_0 finished
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_1 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_2 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_3 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:30 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_0
28/03/2010 11:56:30 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_4
28/03/2010 11:56:31 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_0
28/03/2010 11:56:31 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_4
28/03/2010 11:56:31 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_7
28/03/2010 11:56:32 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_7
28/03/2010 11:57:20 GPUGRID Sending scheduler request: To fetch work.
28/03/2010 11:57:20 GPUGRID Reporting 1 completed tasks, requesting new tasks for GPU
28/03/2010 11:57:25 GPUGRID Scheduler request completed: got 1 new tasks
28/03/2010 11:57:27 GPUGRID Started download of a449-TONI_HERG79a-15-LICENSE
28/03/2010 11:57:27 GPUGRID Started download of a449-TONI_HERG79a-15-COPYRIGHT
28/03/2010 11:57:29 GPUGRID Finished download of a449-TONI_HERG79a-15-LICENSE
28/03/2010 11:57:29 GPUGRID Finished download of a449-TONI_HERG79a-15-COPYRIGHT
28/03/2010 11:57:29 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_1
28/03/2010 11:57:29 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_2
28/03/2010 11:57:33 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_1
28/03/2010 11:57:33 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_2
28/03/2010 11:57:33 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_3
28/03/2010 11:57:33 GPUGRID Started download of a449-TONI_HERG79a-15-pdb_file
28/03/2010 11:57:36 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_3
28/03/2010 11:57:36 GPUGRID Started download of a449-TONI_HERG79a-15-psf_file
28/03/2010 11:57:37 GPUGRID Finished download of a449-TONI_HERG79a-15-psf_file
28/03/2010 11:57:37 GPUGRID Started download of a449-TONI_HERG79a-15-par_file
28/03/2010 11:57:40 GPUGRID Finished download of a449-TONI_HERG79a-15-pdb_file
28/03/2010 11:57:40 GPUGRID Started download of a449-TONI_HERG79a-15-conf_file_enc
28/03/2010 11:57:41 GPUGRID Finished download of a449-TONI_HERG79a-15-conf_file_enc
28/03/2010 11:57:41 GPUGRID Started download of a449-TONI_HERG79a-15-metainp_file
28/03/2010 11:57:42 GPUGRID Finished download of a449-TONI_HERG79a-15-metainp_file
28/03/2010 11:57:42 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_7
28/03/2010 11:57:43 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_7
28/03/2010 11:57:52 GPUGRID Finished download of a449-TONI_HERG79a-15-par_file
28/03/2010 11:57:52 GPUGRID Starting a449-TONI_HERG79a-15-100-RND5529_0
28/03/2010 11:57:52 GPUGRID Starting task a449-TONI_HERG79a-15-100-RND5529_0 using acemd2 version 603
28/03/2010 11:58:30 GPUGRID Computation for task a449-TONI_HERG79a-15-100-RND5529_0 finished
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_1 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_2 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_3 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:31 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_0
28/03/2010 11:58:31 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_4
28/03/2010 11:58:33 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_0
28/03/2010 11:58:33 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_4
28/03/2010 11:58:33 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_7
28/03/2010 11:58:34 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_7
Any ideas, anyone?
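The log above shows a pattern that is easy to spot across a longer log. The a449 task starts at 11:57:52 and is reported finished at 11:58:30, only 38 seconds later, with output files _1 to _3 absent. That looks more like an early exit than a completed run. Below is a minimal Python sketch (not a BOINC or GPUGRID tool) that scans a message log for that pattern. It assumes the `DD/MM/YYYY HH:MM:SS GPUGRID <message>` line layout shown in this post. `scan_log` and `SAMPLE_LOG` are hypothetical names.

```python
import re
from datetime import datetime

# Excerpt of the message log above (the a449 task only).
SAMPLE_LOG = """\
28/03/2010 11:57:52 GPUGRID Starting task a449-TONI_HERG79a-15-100-RND5529_0 using acemd2 version 603
28/03/2010 11:58:30 GPUGRID Computation for task a449-TONI_HERG79a-15-100-RND5529_0 finished
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_1 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_2 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_3 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
"""

# Assumed line layout: "DD/MM/YYYY HH:MM:SS GPUGRID <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) GPUGRID (.*)$")
START_RE = re.compile(r"^Starting task (\S+)")
FINISH_RE = re.compile(r"^Computation for task (\S+) finished")
ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent")


def scan_log(lines):
    """Map each task to its run time in seconds (None if its start wasn't
    seen) and the list of output files BOINC reported as absent."""
    started = {}
    tasks = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or non-GPUGRID lines
        stamp = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
        msg = m.group(2)
        start = START_RE.match(msg)
        finish = FINISH_RE.match(msg)
        absent = ABSENT_RE.match(msg)
        if start:
            started[start.group(1)] = stamp
        elif finish:
            name = finish.group(1)
            info = tasks.setdefault(name, {"runtime_s": None, "absent": []})
            if name in started:
                info["runtime_s"] = (stamp - started[name]).total_seconds()
        elif absent:
            info = tasks.setdefault(absent.group(2), {"runtime_s": None, "absent": []})
            info["absent"].append(absent.group(1))
    return tasks


if __name__ == "__main__":
    for name, info in scan_log(SAMPLE_LOG.splitlines()).items():
        print(f"{name}: ran {info['runtime_s']} s, {len(info['absent'])} output files absent")
```

On the excerpt this prints `a449-TONI_HERG79a-15-100-RND5529_0: ran 38.0 s, 3 output files absent`. Pasting in the full log would also list the a62 task, with a run time of `None` because its start isn't in the excerpt.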
...and this, an example from one invalid WU:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.51 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: cannot open file "restart.coor"
</stderr_txt>
]]>
Validate state Invalid
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
Does the problem persist after rebooting?
P.S. Moving this to the "GPU" thread.
Toni
Before (working)
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.35 GHz
After (not working)
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.51 GHz <--------
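Here is the before/after comparison worked through. The reported clock rose from 1.35 GHz to 1.51 GHz, an increase of (1.51 - 1.35) / 1.35 ≈ 11.9%. The minimal Python sketch below pulls the `# Clock rate:` line out of an acemd stderr block and computes that change. It assumes the exact `# Clock rate: N GHz` format quoted here. `clock_ghz` and `percent_change` are hypothetical helper names.

```python
import re

# The two stderr excerpts quoted above.
BEFORE = '''# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.35 GHz'''
AFTER = '''# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.51 GHz'''

# Assumed format: a line reading exactly "# Clock rate: <number> GHz"
CLOCK_RE = re.compile(r"^# Clock rate: ([\d.]+) GHz", re.MULTILINE)


def clock_ghz(stderr_text):
    """Return the clock rate reported in a stderr block, or None if absent."""
    m = CLOCK_RE.search(stderr_text)
    return float(m.group(1)) if m else None


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    b, a = clock_ghz(BEFORE), clock_ghz(AFTER)
    print(f"{b} GHz -> {a} GHz: {percent_change(b, a):+.1f}%")
```

This prints `1.35 GHz -> 1.51 GHz: +11.9%`, meaning the card was reporting a clock roughly 12% higher when the failures started than when it was working.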
The TONI_HERG tasks seem to be particularly problematic: see the "hERG: information and issues" thread.
[But since the run of errors I posted there, I have more recently had some successful runs, with no change in the host configuration.]
Toni
He gets errors on all types of WUs. Please check clock rate.
Toni wrote: "He gets errors on all types of WUs. Please check clock rate."
The clock rate is clearly a problem. But the message log in the OP, and hence the issue which prompted him to post in the first place, is exclusively about TONI_HERG.
I deliberately didn't speculate on the cause of the problem; I just pointed out the correlation. From my POV, the jury's still out on whether T_H stresses GPUs more than other tasks, and hence selectively culls the weaker, hotter or badly configured specimens, or whether there's a bug in the application (a code path which is only followed by particular parameter sets, for example).
Toni
I didn't mean to be rude. Certain types of WUs may indeed turn out to be more sensitive to a variety of factors (including exposing rare bugs in driver/hardware combinations, which are close to impossible to spot).
My impression is that, at least since the new application, the global error rate of HERGs is in line with the others.
I had another task crash on me tonight. Guess which type it was...
a68-TONI_HERG77a-17-100-RND2481_1
In this case, very many thanks (and that's genuine, not sarcastic). The aftermath solved a SETI Beta problem which has been bugging me, and the BOINC Alpha bug-report mailing list, for the last three weeks. I learned something that was new to me, and that I think the BOINC developers have largely forgotten. It's in an area of code which is about to undergo major change: hopefully the write-up I've been able to submit as a result of this crash will allow safeguards to be built into the new code, replacing the old ones which will no longer function.