Message boards : Number crunching : Problem of misassignment of cuda4.2 vs cuda3.1 tasks
Author | Message |
---|---|
I have made some changes to the server to add some debugging code and some other smaller changes. | |
ID: 26125 | |
Thank you! When should we expect the change to be fully effective? Should we wait a day to make sure any older 3.1 tasks have cleared the queue? | |
ID: 26127 | |
It's in effect now for all new requests. | |
ID: 26129 | |
On a sample of one (http://www.gpugrid.net/results.php?hostid=93580), last week's 3.1 allocation has been replaced by 4.2 | |
ID: 26130 | |
Good for now. | |
ID: 26132 | |
This task is 3.1 but should be 4.2 | |
ID: 26134 | |
The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable. | |
ID: 26137 | |
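The behaviour described in the post above (a host counted as unreliable on cuda4.2 falls back to cuda3.1) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `AppVersionStats` class, the `consecutive_valid` counter, the `RELIABLE_THRESHOLD` value and the `pick_version` function are all hypothetical names, not the actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class AppVersionStats:
    """Hypothetical per-host record for one application version."""
    plan_class: str          # e.g. "cuda42" or "cuda31"
    consecutive_valid: int   # assumed reliability counter for this host


# Assumed cut-off: a version counts as reliable on this host once it has
# returned this many valid results in a row.
RELIABLE_THRESHOLD = 10


def is_reliable(stats: AppVersionStats) -> bool:
    return stats.consecutive_valid >= RELIABLE_THRESHOLD


def pick_version(versions: list[AppVersionStats]) -> str:
    """Return the first reliable version, in order of preference.

    If no version is reliable on this host, fall back to the most
    preferred one (the first in the list).
    """
    for stats in versions:
        if is_reliable(stats):
            return stats.plan_class
    return versions[0].plan_class


# A host with recent errors on cuda42 but a long clean run on cuda31.
host = [AppVersionStats("cuda42", 2), AppVersionStats("cuda31", 40)]
print(pick_version(host))
```

Under these assumptions, a host only gets cuda42 again once its counter for that version recovers, which would be consistent with hosts seeing a mix of both versions for a while.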
This host also gets 4.2 tasks. | |
ID: 26138 | |
Still getting a mix, e.g. http://www.gpugrid.net/results.php?hostid=124305 | |
ID: 26139 | |
Is a project reset needed following this morning's update? | |
ID: 26140 | |
It should not be required, but you never know. | |
ID: 26141 | |
"The problem seems to be that your machine is marked as unreliable with the cuda4.2 application, so the server decides to give the cuda3.1 one which is reliable." Could this be the result of the high error count with "ERROR: file deven.cpp line 1106: # Energies have become nan", which some people got with the cuda4.2 app? I had several myself with my GTX 470 (host 43404). That's not a good host to generalise from, because I run it under app_info.xml, but in case it helps, here are my observations. For over 3 months, I ran the cuda3.1 app with a count of 0.5, with tasks from other projects running alongside GPUGrid on the same GPU (see thread 2897). A few tasks failed, but no more than usual. Then I swapped to cuda4.2 in the same configuration. The failure rate soared, to over 50% by eye, and all errors were of the type 'Energies have become nan'. Finally, I set count=1 in app_info.xml (so that GPUGrid has sole use of the GPU while running, although it is swapped out periodically so other projects can run). Since making that change, I haven't had a single error. So perhaps other apps in GPU memory cause a problem? I see someone else was talking about memory being a possible suspect in the news threads. All of which leads me to suspect a buffer overflow, or use of uninitialised memory, in the cuda4.2 app. I recently helped a developer on another project pin down an error which was causing invalid data to be processed; his comment after he'd found the bug was: "I recall I always got some junk at the end of arrays (array size can be any but processing is vectorized to float4) ...." The test which let us track that one down was: "If the host is regularly producing errors, perform a complete cold restart (to zero GPU RAM), and then allow tasks to run while avoiding any application which might load large amounts of data into VRAM", so no games, video playback, photo editing etc. If the errors go away when VRAM is kept 'clean', that might be a pointer. | |
ID: 26142 | |
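The "junk at the end of arrays" remark quoted in the post above describes a common pattern: data is processed four values at a time (float4), so an array whose length is not a multiple of four leaves a partly filled last group. A minimal sketch of one usual remedy, padding the tail with a known value, is shown below. It is written in plain Python purely for illustration; `pad_to_multiple` and `sum_in_chunks` are hypothetical helpers, and nothing here is taken from the GPUGrid application or the other project's code.

```python
def pad_to_multiple(values, width=4, fill=0.0):
    """Return a copy of values padded with fill to a multiple of width."""
    remainder = len(values) % width
    if remainder == 0:
        return list(values)
    return list(values) + [fill] * (width - remainder)


def sum_in_chunks(values, width=4):
    """Sum values in fixed-size groups, as a vectorised kernel would.

    Padding first means the last group holds zeros in its unused slots
    rather than whatever happened to be left in memory.
    """
    padded = pad_to_multiple(values, width)
    total = 0.0
    for start in range(0, len(padded), width):
        chunk = padded[start:start + width]  # always exactly width items
        total += sum(chunk)
    return total


data = [1.0, 2.0, 3.0, 4.0, 5.0]   # length 5: not a multiple of 4
print(pad_to_multiple(data))
print(sum_in_chunks(data))
```

On a GPU, the unpadded slots would contain leftover memory contents, which is why a cold restart that zeroes VRAM can hide this kind of bug.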
I got this error a few times; I solved it by raising the voltage a bit. Overclocking less would probably help too. | |
ID: 26145 | |
This host is also getting a mix of cuda31 and cuda42 tasks. | |
ID: 26147 | |
Not had a 3.1 task since my last post, so looking promising. | |
ID: 26148 | |
We have now implemented a correction in the scheduler, suggested by David A., which according to him should fix the problem. | |
ID: 26149 | |
Any comment? Is the problem solved? | |
ID: 26151 | |
Just checked. Looks good. No new mixed tasks for me. | |
ID: 26152 | |
3 Jul 2012, 16:41:51 UTC: that's the date my last cuda31 task was sent. It's after your 10 o'clock. But I must wait for more WUs; the current one is cuda42, but this means nothing ^^ The GTX 285 is barely slowing down on cuda42 apps, so I need more time to wait :/ | |
ID: 26154 | |
This computer has not received any cuda 4.2 work units since updating the driver on 6/30/2012. The last one, which downloaded just a few minutes ago, was also cuda 3.1. Any suggestions? http://www.gpugrid.net/show_host_detail.php?hostid=79921 | |
ID: 26157 | |
Did you try a project reset? | |
ID: 26158 | |
I was sent 3.1 tasks at 7:58 UTC and 8:32 UTC. No more so far. | |
ID: 26159 | |
Looks promising. No mixed tasks so far. | |
ID: 26160 | |
Haven't received any more on my 570 | |
ID: 26161 | |
I've received a CUDA3.1 task today on one of my hosts. However, my hosts have been receiving far fewer CUDA3.1 tasks lately (btw, most of them are turned off because we have a heatwave here in Hungary). | |
ID: 26163 | |
Doh! Just received a 3.1 task at 21:19:29 UTC. | |
ID: 26165 | |
I just got a 3.1 a few hours ago. | |
ID: 26166 | |
Guys, | |
ID: 26177 | |
This problem can be handled on the cruncher's side with my workaround. | |
ID: 26178 | |
Yeah, I got one (3.1) on my 570 again. Of course it always sneaks in when I'm sleeping. I too would like to know if the workaround is acceptable. I will be putting it in place later myself, with your permission, GDF. | |
ID: 26182 | |
What percentage of 4.2 tasks do you get compared to 3.1? 95%, or much less? | |
ID: 26191 | |
2 from the last 20 for me. So presently 90% | |
ID: 26192 | |
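The figure in the post above (2 cuda3.1 tasks out of the last 20, so 90% cuda4.2) can be reproduced from a host's recent task list with a few lines. This is only a sketch: `cuda42_share` is a hypothetical helper, and the list of plan-class labels is assumed to have been copied by hand from the host's results page.

```python
def cuda42_share(plan_classes):
    """Return the percentage of tasks in the list that are cuda42."""
    if not plan_classes:
        raise ValueError("need at least one task")
    cuda42_count = sum(1 for p in plan_classes if p == "cuda42")
    return 100.0 * cuda42_count / len(plan_classes)


# Last 20 tasks: 18 were cuda42, 2 were cuda31.
recent = ["cuda42"] * 18 + ["cuda31"] * 2
print(cuda42_share(recent))
```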
Ok, | |
ID: 26193 | |