Author |
Message |
|
Looking at my host, 6 of 15 tasks failed (40%).
I believe this is not expected. I started to see this with the new batch (from 2024-11-17?). I had previously paused ACEMD3 tasks, but have since started running them again.
Luckily they fail early in the computation, so not many resources are wasted, but it should probably be investigated.
Host: https://www.gpugrid.net/show_host_detail.php?hostid=611890
|
|
|
|
I have the same problem |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1378 Credit: 8,076,190,190 RAC: 2,978,245 |
I have the same problem
You do not have the same problem referenced in this thread, since you haven't run any acemd3 tasks.
All your errors are from ATMML tasks. |
|
|
|
I upgraded my Game Ready drivers to v566.14 just in case.
Now I am at 15 errors out of 37 ACEMD3 tasks, so still about 40%.
The jobs fail rather early, so "this is fine", but it is still a waste of resources.
If I can provide anything to help debug this, please let me know. |
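For anyone who wants to track the same figure on their own host, here is a minimal sketch of the failure-rate arithmetic used above. The function name `failure_rate` is made up for illustration; the input pairs are the two counts reported by this poster.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the share of failed tasks as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * failed / total, 1)


# (failed, total) pairs as reported earlier in this thread
reports = [(6, 15), (15, 37)]
for failed, total in reports:
    print(f"{failed}/{total} failed: {failure_rate(failed, total)}%")
```

Both counts come out at roughly 40%, which suggests the driver update did not change the error rate.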
|
|
TofPete
Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 |
I have the same problem.
Most of the acemd3 tasks failed due to a memory leak or an unknown error.
It's a bit annoying that 32 of my 54 most recent tasks failed.
That's a failure rate of 59%.
Can anyone help me solve this?
|
|
|
den777
Joined: 29 Apr 13 Posts: 1 Credit: 71,060,506 RAC: 57,925 |
Same problem here.
The worst thing is that Windows shows a popup about a memory access violation, and until I manually click OK the task won't finish and just sits idle. |
|
|
|
Same problem - two error messages, the first being the major one:
(unknown error) (0) - exit code 195 (0xc3)
(unknown error) (87) - exit code 195 (0xc3)
Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
|
|
|
unhide your hosts so that the whole error can be seen. any "195" code is not helpful. that's just the generic error from the BOINC app or wrapper. the actual reason for failure could be more embedded in the stderr output and could very well be related to your hardware and software configuration (such as incorrect drivers)
____________
|
|
|
TofPete |
I have the same problem: the error rate is about 50% (16 errors out of 33 tasks), which is annoying!
I use this computer for other computing projects as well, but errors occur only with GPUGrid's ACEMD 3 tasks.
I can see an unknown error and a memory leak in the logs:
https://www.gpugrid.net/result.php?resultid=37934054
All operating system and graphics card driver updates are installed on my machine, so I don't know what else I can do to solve these memory leak errors.
How can I "unhide" my host to see more details about this problem?
|
|
|
|
How can I "unhide" my host to see more details about this problem?
Log in to your home page on this website (https://www.gpugrid.net/myaccount.php).
Under 'Preferences', choose GPUGRID preferences (https://www.gpugrid.net/prefs.php?subset=project).
[don't worry about the error messages - it still works]
Edit the top group - 'Primary (default) preferences'.
Check 'Should GPUGRID show your computers on its web site?' and update. That's all. |
|
|
|
My host has been crunching these units non-stop without issue.
I initially thought that perhaps this was a Windows-specific problem, since that was the common factor among the people in this thread who complained and had their hosts visible. However, the work units in question were re-assigned to other Windows hosts, and those others processed their respective tasks to completion without issue.
The other possibility that came to mind was the task failing due to running out of VRAM. It's a possibility (especially if the host is being used for other graphical things or is simultaneously running other GPU work units), but in past instances I've seen of this, the task failed with a specific out-of-memory message.
Given that these tasks aren't universally failing, there's also the possibility that those who are reporting issues simply have failing or unreliable hardware. A quick (if perhaps disruptive) way of testing this is to power- or clock-limit the GPU and see if the errors stop. |
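To check the VRAM and temperature possibilities mentioned above, one option is to poll `nvidia-smi` while a task runs. The sketch below is a hedged example, not a tested monitoring tool: the helper names (`parse_gpu_csv`, `query_gpus`) and the sample reading are made up for illustration, and it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` fields listed.

```python
import subprocess
from typing import Optional

# Fields requested from nvidia-smi; some cards report "[N/A]" for power.draw.
QUERY_FIELDS = ["temperature.gpu", "memory.used", "memory.total", "power.draw"]


def _to_float(value: str) -> Optional[float]:
    """Convert a reading to float, or None when the card does not report it."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append({k: _to_float(v) for k, v in zip(QUERY_FIELDS, values)})
    return rows


def query_gpus() -> list:
    """Run nvidia-smi and return the parsed readings (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative sample line (made-up values), parsed without touching the GPU.
sample = "76, 3120, 4096, [N/A]"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```

Calling `query_gpus()` every few seconds while an ACEMD3 task starts would show whether `memory.used` approaches `memory.total` before a crash. On cards and drivers that support it, `nvidia-smi -pl <watts>` (run with administrator rights) is one way to apply the power limit suggested above.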
|
|
kcharuso
Joined: 7 Oct 13 Posts: 3 Credit: 827,282,608 RAC: 340,380 |
My failure rate ranges from 1-2 tasks to 7-9 tasks a day for ACEMD 3. Tasks usually fail within a few minutes of starting, so not many resources are used; it would be better to have none, though. I also noticed that GPU time and run time are similar, differing by only a hundred seconds or so. Is this normal? |
|
|
TofPete |
I checked the setting mentioned ('Should GPUGRID show your computers on its web site?') and it is already checked.
|
|
|
TofPete |
I understand these suggestions, but my problem is that I don't know which settings need to be changed to stop these failures.
Sometimes I lose only a few minutes, but some tasks ran for about 9800 seconds before failing.
I think that 32 GB of RAM, a 3 GHz CPU and an Nvidia GTX 1050 Ti with 4 GB of VRAM should be enough for this kind of task. I run my CPU and video card at their normal settings; I don't overclock, etc. And the strange thing is that the problem occurs only with ACEMD3 tasks.
And the logs say that there were memory leaks.
Why?
What settings should I change to prevent the leaks?
|
|
|
Pascal
Joined: 15 Jul 20 Posts: 90 Credit: 2,185,398,912 RAC: 3,127,175 |
Hello,
You should try increasing the virtual memory, that is, the pagefile size in Windows, to around 50-60 GB.
https://forums.cnetfrance.fr/tutoriels-windows-10/575813-windows-10-augmenter-la-memoire-de-pagination-ou-memoire-virtuelle
https://answers.microsoft.com/fr-fr/windows/forum/all/restauration-pagefilesys/628b8a32-f8cd-4481-95a1-2ebd1ef08ce1
It worked for me before I switched to Linux.
____________
|
|
|
|
And I can see your computer (host 619264) and tasks just fine - not sure why others were having problems.
Your computer is completing ATM tasks OK, but failing ACEMD3 tasks. The logs show the underlying errors:
16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000
16:28:06 (6248): app exit status: 0xc0000005
09:46:12 (20156): bin/acemd.exe exited; CPU time 0.015625
09:46:12 (20156): app exit status: 0xc0000005
03:13:45 (10704): bin/acemd.exe exited; CPU time 0.000000
03:13:45 (10704): app exit status: 0xc0000005
17:47:38 (11904): bin/acemd.exe exited; CPU time 0.000000
17:47:38 (11904): app exit status: 0xc0000005
14:00:30 (25420): bin/acemd.exe exited; CPU time 0.000000
14:00:30 (25420): app exit status: 0xc0000005
The exit status (normally written 0xC0000005) is a Windows code defined as "STATUS_ACCESS_VIOLATION", which in full would be reported as 'The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.' - BOINC hasn't passed on those extra parameters.
Many online 'answers' to online searches will suggest that this could be caused by faulty computer RAM, but that's not the only answer - it can also be caused by bad programming. In your case, every example still visible occurs as the application starts or restarts. I'd recommend that you try to avoid pausing ACEMD3 tasks mid-run - try to let them run continuously to completion. See if that reduces the error rate to an acceptable level.
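As a rough illustration of the decoding described above, the sketch below pulls `app exit status` values out of stderr text and maps them to NTSTATUS names. The lookup table is a small hand-picked subset of well-known Windows codes, not a complete list, and the helper name `decode_exit_statuses` is made up for this example.

```python
import re

# A few well-known NTSTATUS values; illustrative only, not exhaustive.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

# Matches wrapper lines such as "16:28:06 (6248): app exit status: 0xc0000005"
EXIT_LINE = re.compile(r"app exit status:\s*(0x[0-9a-fA-F]+)")


def decode_exit_statuses(stderr_text: str) -> list:
    """Return (code, name) pairs for every exit status found in the text."""
    results = []
    for match in EXIT_LINE.finditer(stderr_text):
        code = int(match.group(1), 16)
        results.append((code, NTSTATUS_NAMES.get(code, "unknown")))
    return results


# Two lines copied from the log excerpt quoted above.
log = """16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000
16:28:06 (6248): app exit status: 0xc0000005"""
for code, name in decode_exit_statuses(log):
    print(f"0x{code:08X} {name}")
```

Running this over the full stderr of several failed results makes it easy to see whether every failure carries the same status or whether different errors are mixed together.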
|
|
|
Keith Myers |
I'd just increase the Windows pagefile size first to 50-60GB and reboot to see if that fixes the issue.
If that fails I would start investigating your memory for errors. |
|
|
Pascal |
To test the memory, you should use Memtest rather than the tool built into Windows. Memtest is more reliable.
https://www.memtest86.com/
____________
|
|
|
Pascal |
I also advise you to disable memory integrity.
After that, if the problem continues, it is beyond my knowledge.
https://www.malekal.com/desactiver-isolation-noyau-windows-11-10/
____________
|
|
|
Pascal |
That said, nobody is safe from work units that seem to have a bug; at least, I assume that is the cause here and not my PC.
https://www.gpugrid.net/workunit.php?wuid=31255135
Remember to check the operating temperature of your graphics card, in case the fan is worn out.
Thermal and Power Specs:
Maximum GPU Temperature: 97 °C
https://www.nvidia.com/en-us/geforce/10-series/
____________
|
|
|
TofPete |
Thanks for the replies.
Based on this report, I don't think the problem is with my computer, because many hosts have a similar problem.
However, I investigated my computer to check that everything is fine with it:
* GPU temperature is fine; the maximum is 76 °C
* CPU temperature is also fine; the maximum is 55 °C
* the pagefile size has just been increased; we will see if it helps...
* memory seems fine: I don't experience any problems in other apps that could be caused by a memory issue, but I will run a memtest
* I don't usually interrupt the calculation (BOINC is set to always run) and my computer is on all day, but I will keep an eye on this too
Anyway, this could also be a bug, because other hosts are affected as well and I cannot imagine that all of these computers have memory problems.
Is it possible to ask the ACEMD developers to check the code in parallel?
|
|
|
bibi
Joined: 4 May 17 Posts: 15 Credit: 16,497,175,243 RAC: 6,421,261 |
Two crashes from today
https://www.gpugrid.net/result.php?resultid=38242445
https://www.gpugrid.net/result.php?resultid=38238912
All acemd3 crashes in the last few days have the same problem signature:
Problem Event Name: APPCRASH
Application Name: acemd.exe
Application Version: 0.0.0.0
Application Timestamp: 66e42355
Fault Module Name: acemd.exe
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 66e42355
Exception Code: c0000005
Exception Offset: 0000000000075b6f
All crashes are on Windows 10. The tasks on WSL2 are running. |
|
|
TofPete |
These crashes are not from my machine.
My machine reported another one this morning:
https://www.gpugrid.net/result.php?resultid=38275319
As far as I can see, it crashed 5 seconds after the task started, and the logs give no reason for the crash (no memory leak entry as before).
The task was not interrupted, CPU and GPU temperatures are normal, memory is OK and the pagefile size was increased as suggested.
The only thing in common with the other crashes you mentioned is that my computer also runs Windows 10.
Any other ideas?
|
|
|
|
My observations: the last 2 failed runs ran for less than 3 minutes. You might just not worry about it. Neither of those 2 did any GPU processing, so they failed either during the initial CPU stage or while the work was being handed over to the GPU. Your GPU would be downright cold at that point in the process. I set my app config to reserve a full 1.0 CPU for each task, and I don't run CPU projects. Maybe those things don't help at all, but it seemed to me that I had more errors when I didn't. I have only been skimming the thread, so this may be a repeat: did you swap memory stick positions? |
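For readers unfamiliar with the "app config" mentioned above, BOINC reads an optional `app_config.xml` from the project's data directory. The sketch below generates a minimal one that reserves a full CPU core per GPU task. The application name `acemd3` is an assumption here (check the name your client reports for these tasks), and the helper `build_app_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cpu_usage: float = 1.0, gpu_usage: float = 1.0) -> str:
    """Build a minimal BOINC app_config.xml reserving CPU/GPU shares per task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# "acemd3" is an assumed application name, used here only as an example.
print(build_app_config("acemd3"))
```

The output would be saved as `app_config.xml` in the GPUGRID project folder, after which the client needs to re-read its config files (or be restarted) for the setting to take effect. Whether this reduces the error rate is, as the poster says, uncertain.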
|
|
TofPete |
I don't think that swapping memory sticks would change anything.
If there were a hardware memory problem, there should be problems with other applications as well. |
|
|