ACEMD3 High error rates


Author	Message
homer__simpsons Send message Joined: 17 Nov 15 Posts: 14 Credit: 136,767,025 RAC: 7,841 Level Scientific publications	Message 61952 - Posted: 24 Nov 2024 \| 13:58:39 UTC
	Looking at my host, 6 of 15 tasks failed (40%):
	https://www.gpugrid.net/result.php?resultid=36785317
	https://www.gpugrid.net/result.php?resultid=36785243
	https://www.gpugrid.net/result.php?resultid=36774440
	https://www.gpugrid.net/result.php?resultid=36774387
	https://www.gpugrid.net/result.php?resultid=36774151
	https://www.gpugrid.net/result.php?resultid=36759049
	I believe this is not expected. I started seeing this with the new batch (from 2024-11-17?). I had previously paused ACEMD3 tasks, but have since started running them again. Luckily they fail early in the computation, so not too many resources are wasted, but it should probably be investigated.
	Host: https://www.gpugrid.net/show_host_detail.php?hostid=611890

Paul Forsdick Send message Joined: 21 Feb 09 Posts: 1 Credit: 29,311,435 RAC: 139,634 Level Scientific publications	Message 61953 - Posted: 24 Nov 2024 \| 17:58:52 UTC - in response to Message 61952.
	I have the same problem

Keith Myers Send message Joined: 13 Dec 17 Posts: 1378 Credit: 8,076,190,190 RAC: 2,978,245 Level Scientific publications	Message 61956 - Posted: 25 Nov 2024 \| 8:04:55 UTC - in response to Message 61953.
	Quote: I have the same problem
	You do not have the problem referenced in this thread, since you haven't run any acemd3 tasks. All your errors are from ATMML tasks.

homer__simpsons Send message Joined: 17 Nov 15 Posts: 14 Credit: 136,767,025 RAC: 7,841 Level Scientific publications	Message 61975 - Posted: 29 Nov 2024 \| 10:41:01 UTC
	I upgraded my Game Ready drivers to v566.14 just in case. I am now at 15 errors out of 37 ACEMD3 tasks, so still about 40%. The jobs fail rather early, so "this is fine", but it is still a waste of resources. If I can provide anything to help debug this, please let me know.

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62012 - Posted: 10 Dec 2024 \| 8:44:09 UTC - in response to Message 61952.
	I have the same problem. Most of my acemd3 tasks fail due to a memory leak or an unknown error:
	memory leak: https://www.gpugrid.net/result.php?resultid=37054183
	unknown error: https://www.gpugrid.net/result.php?resultid=37053274
	It's a bit annoying that 32 of my 54 most recent tasks failed, a failure rate of 59%. Can anyone help me solve this?
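The failure rates quoted so far in the thread are plain ratios of failed to total tasks. A minimal Python sketch of that arithmetic, using only the counts reported in the posts above (the function name and the labels are illustrative and do not come from any BOINC tool):

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the share of failed tasks as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# (failed, total) counts exactly as reported by the posters
reports = {
    "homer__simpsons, 24 Nov 2024": (6, 15),
    "homer__simpsons, 29 Nov 2024": (15, 37),
    "TofPete, 10 Dec 2024": (32, 54),
}

for who, (failed, total) in reports.items():
    print(f"{who}: {failed}/{total} = {failure_rate(failed, total):.1f}%")
```

Comparing the percentage rather than the raw error count makes it easier to tell whether a change such as a driver update had any effect.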

den777 Send message Joined: 29 Apr 13 Posts: 1 Credit: 71,060,506 RAC: 57,925 Level Scientific publications	Message 62013 - Posted: 10 Dec 2024 \| 13:01:42 UTC
	Same problem here. The worst thing is that Windows shows a popup about a memory access violation, and until I manually click OK the task won't finish and just sits idle.

Michael H.W. Weber Send message Joined: 9 Feb 16 Posts: 73 Credit: 656,229,684 RAC: 22,080 Level Scientific publications	Message 62076 - Posted: 24 Dec 2024 \| 14:59:15 UTC Last modified: 24 Dec 2024 \| 14:59:55 UTC
	Same problem - two error messages, the first being the major one:
	(unknown error) (0) - exit code 195 (0xc3)
	(unknown error) (87) - exit code 195 (0xc3)
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1085 Credit: 40,330,187,595 RAC: 738,522 Level Scientific publications	Message 62077 - Posted: 24 Dec 2024 \| 18:39:12 UTC - in response to Message 62076.
	Unhide your hosts so that the whole error can be seen. Any "195" code is not helpful; that's just the generic error from the BOINC app or wrapper. The actual reason for the failure could be buried deeper in the stderr output and could very well be related to your hardware and software configuration (such as incorrect drivers).

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62160 - Posted: 22 Jan 2025 \| 8:49:43 UTC - in response to Message 62077.
	I have the same problem: the error rate is about 50% (16 errors / 33 total tasks), which is annoying! I use this computer for other computing projects as well, but only the ACEMD 3 tasks from GPUGrid produce errors. I can see an unknown error and a memory leak in the logs: https://www.gpugrid.net/result.php?resultid=37934054
	All operating system and graphics card driver updates are installed on my machine, so I don't know what else I can do to solve these memory leak errors. How can I "unhide" my host to see more details about this problem?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1631 Credit: 9,819,914,649 RAC: 5,216,304 Level Scientific publications	Message 62161 - Posted: 22 Jan 2025 \| 10:45:53 UTC - in response to Message 62160.
	Quote: How can I "unhide" my host to see more details about this problem?
	Log in to your home page on this website (https://www.gpugrid.net/myaccount.php).
	Under 'Preferences', choose GPUGRID preferences (https://www.gpugrid.net/prefs.php?subset=project). [Don't worry about the error messages - it still works.]
	Edit the top group - 'Primary (default) preferences'.
	Check 'Should GPUGRID show your computers on its web site?' and update.
	That's all.

William Albert Send message Joined: 22 Sep 24 Posts: 4 Credit: 159,119,851 RAC: 150,333 Level Scientific publications	Message 62162 - Posted: 22 Jan 2025 \| 17:13:05 UTC
	My host has been crunching these units non-stop without issue. I initially thought that perhaps this was a Windows-specific problem, since that was the common factor among the people in this thread who complained and had their hosts visible. However, the work units in question were re-assigned to other Windows hosts, and those hosts processed their respective tasks to completion without issue.
	The other possibility that came to mind was the task failing due to running out of VRAM. It's a possibility (especially if the host is being used for other graphical things or is simultaneously running other GPU work units), but in the past instances of this that I've seen, the task failed with a specific out-of-memory message.
	Given that these tasks aren't universally failing, there's also the possibility that those who are reporting issues simply have failing or unreliable hardware. A quick (if perhaps disruptive) way of testing this is to power- or clock-limit the GPU and see if the errors stop.
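To check the VRAM and temperature theories while a task is running, the NVIDIA driver's `nvidia-smi` tool can report memory use and GPU temperature. A minimal Python sketch, assuming `nvidia-smi` is on the PATH and every queried field comes back as a number (some cards report `[N/A]` for a field, which this sketch does not handle); the function names are illustrative:

```python
import subprocess

# Fields understood by nvidia-smi's --query-gpu option
QUERY = "memory.used,memory.total,temperature.gpu"


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'used, total, temp' rows (MiB, MiB, deg C) into dicts."""
    rows = []
    for line in text.strip().splitlines():
        used, total, temp = (int(field.strip()) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "temp_c": temp})
    return rows


def read_gpus() -> list[dict]:
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `read_gpus()` every few seconds while an ACEMD3 task starts up would show whether used memory approaches the card's total just before a failure.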

kcharuso Send message Joined: 7 Oct 13 Posts: 3 Credit: 827,282,608 RAC: 340,380 Level Scientific publications	Message 62163 - Posted: 23 Jan 2025 \| 1:15:59 UTC - in response to Message 62162. Last modified: 23 Jan 2025 \| 1:24:54 UTC
	My failure rate ranges from 1-2 to 7-9 tasks a day for ACEMD 3. The tasks usually fail within the first few minutes, so not many resources are wasted, though it would be better to have no failures at all. I also noticed that the GPU time and the run time were similar, differing by a hundred seconds or so. Is this normal?

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62165 - Posted: 23 Jan 2025 \| 17:15:38 UTC - in response to Message 62161.
	I checked the setting you mentioned and it's already enabled.

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62166 - Posted: 23 Jan 2025 \| 17:28:46 UTC - in response to Message 62162.
	I understand this, but my problem is that I don't know which settings need to be changed to stop these failures. Sometimes I lose only a few minutes, but some tasks ran for about 9800 seconds before failing. I think that 32 GB RAM, a 3 GHz CPU and an Nvidia GTX 1050 Ti with 4 GB VRAM should be enough for this kind of task. I run my CPU and video card at their normal settings; I don't overclock, etc. The strange thing is that the problem occurs only with ACEMD3 tasks, and the logs say that there were memory leaks. Why? What settings should I change to prevent the leaks?

Pascal Send message Joined: 15 Jul 20 Posts: 90 Credit: 2,185,398,912 RAC: 3,127,175 Level Scientific publications	Message 62167 - Posted: 23 Jan 2025 \| 18:05:01 UTC - in response to Message 62166.
	Hello, you should try increasing the virtual memory, i.e. the pagefile size in Windows, to around 50-60 GB.
	https://forums.cnetfrance.fr/tutoriels-windows-10/575813-windows-10-augmenter-la-memoire-de-pagination-ou-memoire-virtuelle
	https://answers.microsoft.com/fr-fr/windows/forum/all/restauration-pagefilesys/628b8a32-f8cd-4481-95a1-2ebd1ef08ce1
	It worked for me before I moved to Linux.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1631 Credit: 9,819,914,649 RAC: 5,216,304 Level Scientific publications	Message 62168 - Posted: 23 Jan 2025 \| 18:07:56 UTC - in response to Message 62165.
	And I can see your computer (host 619264) and tasks just fine - not sure why others were having problems. Your computer is completing ATM tasks OK, but failing ACEMD3 tasks. The logs show the underlying errors:
	16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000
	16:28:06 (6248): app exit status: 0xc0000005
	09:46:12 (20156): bin/acemd.exe exited; CPU time 0.015625
	09:46:12 (20156): app exit status: 0xc0000005
	03:13:45 (10704): bin/acemd.exe exited; CPU time 0.000000
	03:13:45 (10704): app exit status: 0xc0000005
	17:47:38 (11904): bin/acemd.exe exited; CPU time 0.000000
	17:47:38 (11904): app exit status: 0xc0000005
	14:00:30 (25420): bin/acemd.exe exited; CPU time 0.000000
	14:00:30 (25420): app exit status: 0xc0000005
	The exit status (normally written 0xC0000005) is a Windows code defined as "STATUS_ACCESS_VIOLATION", which in full would be reported as 'The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.' - BOINC hasn't passed on those extra parameters.
	Many online 'answers' to online searches will suggest that this could be caused by faulty computer RAM, but that's not the only answer - it can also be caused by bad programming. In your case, every example still visible occurs as the application starts or restarts. I'd recommend that you try to avoid pausing ACEMD3 tasks mid-run - try to let them run continuously to completion. See if that reduces the error rate to an acceptable level.
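For anyone wanting to pull these lines out of a task's stderr output automatically, here is a minimal Python sketch. It assumes the "app exit status" line format shown in the log excerpt above; the lookup table holds only a handful of well-known NTSTATUS values, and the function names are illustrative rather than part of BOINC:

```python
import re

# A few well-known Windows NTSTATUS values; extend as needed.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

# Matches e.g. "16:28:06 (6248): app exit status: 0xc0000005"
EXIT_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \((\d+)\): app exit status: (0x[0-9a-fA-F]+)"
)


def status_name(code: int) -> str:
    """Return the symbolic name for a status code if it is in the table."""
    return NTSTATUS_NAMES.get(code, "unknown (not in this table)")


def find_exit_statuses(stderr_text: str) -> list[tuple]:
    """Return (time, pid, code, name) for every exit-status line found."""
    results = []
    for match in EXIT_RE.finditer(stderr_text):
        code = int(match.group(3), 16)
        results.append(
            (match.group(1), int(match.group(2)), code, status_name(code))
        )
    return results


sample = (
    "16:28:06 (6248): bin/acemd.exe exited; CPU time 0.000000\n"
    "16:28:06 (6248): app exit status: 0xc0000005\n"
)
for time, pid, code, name in find_exit_statuses(sample):
    print(f"{time} pid={pid} status=0x{code:08X} {name}")
```

The generic BOINC exit code 195 (0xc3) reported earlier in the thread is not an NTSTATUS value, so it is deliberately absent from the table.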

Keith Myers Send message Joined: 13 Dec 17 Posts: 1378 Credit: 8,076,190,190 RAC: 2,978,245 Level Scientific publications	Message 62169 - Posted: 23 Jan 2025 \| 18:52:19 UTC
	I'd just increase the Windows pagefile size first to 50-60GB and reboot to see if that fixes the issue. If that fails I would start investigating your memory for errors.

Pascal Send message Joined: 15 Jul 20 Posts: 90 Credit: 2,185,398,912 RAC: 3,127,175 Level Scientific publications	Message 62170 - Posted: 23 Jan 2025 \| 19:14:00 UTC - in response to Message 62169.
	To test the memory, use Memtest rather than the tool built into Windows; Memtest is more reliable. https://www.memtest86.com/

Pascal Send message Joined: 15 Jul 20 Posts: 90 Credit: 2,185,398,912 RAC: 3,127,175 Level Scientific publications	Message 62171 - Posted: 23 Jan 2025 \| 19:19:32 UTC - in response to Message 62170.
	I also advise you to disable memory integrity. If the problem continues after that, it is beyond my knowledge. https://www.malekal.com/desactiver-isolation-noyau-windows-11-10/

Pascal Send message Joined: 15 Jul 20 Posts: 90 Credit: 2,185,398,912 RAC: 3,127,175 Level Scientific publications	Message 62172 - Posted: 23 Jan 2025 \| 19:23:35 UTC - in response to Message 62171. Last modified: 23 Jan 2025 \| 19:52:11 UTC
	That said, no one is safe from work units that seem to have a bug; at least, I assume that is the cause here and not my PC. https://www.gpugrid.net/workunit.php?wuid=31255135
	Remember to check the operating temperature of your graphics card in case the fan is worn out. NVIDIA's Thermal and Power Specs list a maximum GPU temperature of 97 C: https://www.nvidia.com/en-us/geforce/10-series/

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62173 - Posted: 25 Jan 2025 \| 11:29:35 UTC - in response to Message 62172.
	Thanks for the replies. Based on this report I don't think the problem is with my computer, because many hosts have a similar problem. However, I checked my computer to see whether everything is fine with it:
	* GPU temperature is fine; the maximum is 76 Celsius
	* CPU temperature is also fine; the maximum is 55 Celsius
	* the pagefile size has just been increased; we will see if it helps...
	* memory seems fine; I don't experience any problems with other apps that could be caused by a memory issue, but I will run a memtest
	* usually I don't interrupt the calculation (BOINC is set to always run) and my computer is on all day, but I will keep an eye on this too
	Anyway, this could also be a bug, because other hosts are affected as well and I cannot imagine that all of these computers have memory problems. Is it possible to ask the ACEMD developers to check the code in parallel?

bibi Send message Joined: 4 May 17 Posts: 15 Credit: 16,497,175,243 RAC: 6,421,261 Level Scientific publications	Message 62174 - Posted: 27 Jan 2025 \| 12:04:21 UTC Last modified: 27 Jan 2025 \| 12:11:09 UTC
	Two crashes from today:
	https://www.gpugrid.net/result.php?resultid=38242445
	https://www.gpugrid.net/result.php?resultid=38238912
	All crashes from acemd3 in the last few days have the same problem signature:
	Problem Event Name: APPCRASH
	Application Name: acemd.exe
	Application Version: 0.0.0.0
	Application Timestamp: 66e42355
	Fault Module Name: acemd.exe
	Fault Module Version: 0.0.0.0
	Fault Module Timestamp: 66e42355
	Exception Code: c0000005
	Exception Offset: 0000000000075b6f
	All crashes are on Windows 10. The tasks on WSL2 are running.

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62175 - Posted: 28 Jan 2025 \| 11:03:17 UTC - in response to Message 62174.
	These crashes are not from my machine. My machine returned another failure this morning: https://www.gpugrid.net/result.php?resultid=38275319
	As far as I can see, it crashed 5 seconds after the task started, and the logs give no reason for the crash (no memory leak entry as before). The task was not interrupted, CPU and GPU temperatures are normal, the memory is OK and the pagefile size was increased as suggested. The only thing in common with the other crashes you mentioned is that my computer also runs Windows 10. Any other ideas?

KeithBriggs Send message Joined: 29 Aug 24 Posts: 39 Credit: 1,931,411,489 RAC: 12,373,655 Level Scientific publications	Message 62176 - Posted: 28 Jan 2025 \| 16:51:27 UTC - in response to Message 62175.
	My observations: the last 2 failed runs ran for less than 3 minutes. You might just not worry about it. Neither of those 2 did any GPU processing, so they failed either during the initial CPU stage or while the work was being handed over to the GPU. Your GPU would be downright cold at that point in the process. I set my app config to reserve a full 1.0 CPU for each task, and I don't run CPU projects. Maybe those things don't help at all, but it seemed to me that I had more errors when I didn't. I have only been skimming the thread, so this may be a repeat: did you swap memory stick positions?
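The actual app config file is not shown in the post. As a rough illustration only, the Python sketch below generates an app_config.xml that reserves one full CPU core per GPU task. The element names are BOINC's app_config.xml fields; the application name `acemd3` and both usage values are assumptions, so check the application name your client reports before using it:

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cpu_usage: float, gpu_usage: float) -> str:
    """Build an app_config.xml document for one GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage 1.0 = one task per GPU; cpu_usage 1.0 = one full core per task
    ET.SubElement(versions, "gpu_usage").text = f"{gpu_usage:.1f}"
    ET.SubElement(versions, "cpu_usage").text = f"{cpu_usage:.1f}"
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# "acemd3" is an assumed application name, not confirmed in this thread
print(build_app_config("acemd3", cpu_usage=1.0, gpu_usage=1.0))
```

The resulting file belongs in the GPUGRID project folder inside the BOINC data directory, and the client needs to re-read its config files (or be restarted) before the setting takes effect.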

TofPete Send message Joined: 17 Mar 24 Posts: 15 Credit: 63,874,103 RAC: 45,377 Level Scientific publications	Message 62189 - Posted: 2 Feb 2025 \| 16:14:00 UTC - in response to Message 62176.
	I don't think swapping the memory sticks would change anything. If there were a hardware memory problem, other applications would have problems as well.
