Thanks again to the folks who responded to my post. I finally managed to get jobs terminated once they exceed their requested memory allocation. Here is the configuration I used:
slurm.conf:

    EnforcePartLimits=ALL
    TaskPlugin=task/cgroup
    JobAcctGatherType=jobacct_gather/cgroup
    SelectTypeParameters=CR_CPU_Memory
    MemLimitEnforce=yes
    KillOnBadExit=1

cgroup.conf:

    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    TaskAffinity=no
    MaxSwapPercent=10

Running a job that simply allocates RAM in a loop:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=1024MB
    ./bigmem 100000

produced the following error once the job exceeded 1GB RSS:

    slurmstepd: error: Detected 1 oom-kill event(s) in step 125.batch cgroup.
    Some of your processes may have been killed by the cgroup out-of-memory handler.
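(For reference: bigmem is just a local test binary and is not included in this thread; any process whose RSS grows without bound will trigger the same oom-kill. A rough bash-only stand-in, hypothetical and not the actual program, could look like this:)

    #!/bin/bash
    # Toy memory hog: keep growing a shell variable until the job
    # exceeds its --mem request and the cgroup OOM killer fires.
    chunk=$(head -c 1048576 /dev/zero | tr '\0' x)   # ~1 MiB of data
    data=""
    while true; do
        data+="$chunk"   # RSS grows by roughly 1 MiB per iteration
    done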
On Fri, Oct 25, 2019 at 11:26 AM Juergen Salk <juergen.s...@uni-ulm.de> wrote:

> Hi Mike,
>
> IIRC, I once did some tests with the very same configuration as
> yours, i.e. `JobAcctGatherType=jobacct_gather/linux´ and
> `JobAcctGatherParams=OverMemoryKill´, and got this to work as
> expected: jobs were killed when they exceeded the requested amount
> of memory. This was with Slurm 18.08.7. After some tests I went back
> to memory enforcement with cgroups, as this also takes into account
> memory that is consumed by writing data to a tmpfs filesystem, such
> as /dev/shm.
>
> I have now restored the old configuration that I think I used to
> experiment with Slurm's capabilities to enforce memory usage without
> cgroups, and tried again in my test cluster (now running Slurm
> 19.05.2). As far as I understand, the configuration above should
> also work with 19.05.2.
>
> But I was surprised to see that I can also reproduce the behavior
> that you described: a process that exceeds the requested amount of
> memory keeps happily running.
>
> Anyway, I think memory enforcement with cgroups is more reliable
> and, thus, more commonly used these days. Recently there was an
> interesting discussion on this list about how to get Slurm to cancel
> the whole job if the memory is exceeded (not just oom-kill some
> processes). Someone suggested setting `KillOnBadExit=1´ in
> slurm.conf. Someone else suggested using `set -o errexit´ (or
> `#!/bin/bash -e´ instead of plain `#!/bin/bash´) in the job scripts,
> so that the failure of any command within the script will cause the
> job to stop immediately.
>
> You may find the thread in the list archive if you search for
> "How to automatically kill a job that exceeds its memory limits".
>
> Best regards
> Jürgen
>
> --
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
>
> * Mike Mosley <mike.mos...@uncc.edu> [191025 09:17]:
> > Ahmet,
> >
> > Thank you for taking the time to respond to my question.
> >
> > Yes, the --mem=1GBB is a typo. It's correct in my script, I just
> > fat-fingered it in the email. :-)
> >
> > BTW, the exact version I am using is 19.05.2.
> >
> > Regarding your response, it seems that that might be more than what
> > I need. I simply want to enforce the memory limits as specified by
> > the user at job submission time. This seems to have been the
> > behavior in previous versions of Slurm. What I want is what is
> > described in the 19.05 release notes:
> >
> >     RELEASE NOTES FOR SLURM VERSION 19.05
> >     28 May 2019
> >
> >     NOTE: slurmd and slurmctld will now fatal if two incompatible
> >     mechanisms for enforcing memory limits are set. This makes
> >     incompatible the use of task/cgroup memory limit enforcing
> >     (Constrain[RAM|Swap]Space=yes) with
> >     JobAcctGatherParams=OverMemoryKill, which could cause problems
> >     when a task is killed by one of them while the other is at the
> >     same time managing that task. The NoOverMemoryKill setting has
> >     been deprecated in favor of OverMemoryKill, since now the
> >     default is NOT to have any memory enforcement mechanism.
> >
> >     NOTE: MemLimitEnforce parameter has been removed and the
> >     functionality that was provided with it has been merged into a
> >     JobAcctGatherParams. It may be enabled by setting
> >     JobAcctGatherParams=OverMemoryKill, so now job and steps
> >     killing by OOM is enabled from the same place.
> >
> > So, is it really necessary to do what you suggested to get that
> > functionality?
> >
> > If someone could post just a simple slurm.conf file that forces the
> > memory limits to be honored (and kills the job if they are
> > exceeded), then I could extract what I need from that.
> >
> > Again, thanks for the assistance.
> >
> > Mike
> >
> > On Thu, Oct 24, 2019 at 11:27 PM mercan <ahmet.mer...@uhem.itu.edu.tr> wrote:
> >
> > > Hi;
> > >
> > > You should set
> > >
> > >     SelectType=select/cons_res
> > >
> > > plus one of these:
> > >
> > >     SelectTypeParameters=CR_Memory
> > >     SelectTypeParameters=CR_Core_Memory
> > >     SelectTypeParameters=CR_CPU_Memory
> > >     SelectTypeParameters=CR_Socket_Memory
> > >
> > > to enable memory allocation tracking, according to the
> > > documentation:
> > >
> > > https://slurm.schedmd.com/cons_res_share.html
> > >
> > > Also, the line:
> > >
> > >     #SBATCH --mem=1GBB
> > >
> > > contains "1GBB". Is this the same in the job script?
> > >
> > > Regards;
> > >
> > > Ahmet M.
> > >
> > > On 24.10.2019 23:00, Mike Mosley wrote:
> > > > Hello,
> > > >
> > > > We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent
> > > > to migrate to it from Torque/Moab in the near future.
> > > >
> > > > One of the things our users are used to is that when their jobs
> > > > exceed the amount of memory they requested, the job is
> > > > terminated by the scheduler. We realize that Slurm prefers to
> > > > use cgroups to contain rather than kill the jobs, but initially
> > > > we need to have the kill option in place to transition our
> > > > users.
> > > >
> > > > So, looking at the documentation, it appears that in 19.05 the
> > > > following needs to be set to accomplish this:
> > > >
> > > >     JobAcctGatherParams = OverMemoryKill
> > > >
> > > > Other possibly relevant settings we made:
> > > >
> > > >     JobAcctGatherType = jobacct_gather/linux
> > > >     ProctrackType = proctrack/linuxproc
> > > >
> > > > We have avoided configuring any cgroup parameters for the time
> > > > being.
> > > >
> > > > Unfortunately, when we submit a job with the following:
> > > >
> > > >     #SBATCH --nodes=1
> > > >     #SBATCH --ntasks-per-node=1
> > > >     #SBATCH --mem=1GBB
> > > >
> > > > we see the RSS of the job steadily increase beyond the 1GB
> > > > limit and it is never killed. Interestingly enough, the proc
> > > > information shows the ulimit (hard and soft) for the process
> > > > set to around 1GB.
> > > >
> > > > We have tried various settings without any success. Can anyone
> > > > point out what we are doing wrong?
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > > > --
> > > > J. Michael Mosley
> > > > University Research Computing
> > > > The University of North Carolina at Charlotte
> > > > 9201 University City Blvd
> > > > Charlotte, NC 28223
> > > > 704.687.7065 | jmmos...@uncc.edu
> >
> > --
> > J. Michael Mosley
> > University Research Computing
> > The University of North Carolina at Charlotte
> > 9201 University City Blvd
> > Charlotte, NC 28223
> > 704.687.7065 | jmmos...@uncc.edu
>
> --
> GPG A997BA7A | 87FC DA31 5F00 C885 0DC3 E28F BD0D 4B33 A997 BA7A
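A note on Juergen's `set -o errexit´ suggestion above: combined with KillOnBadExit=1, a job script along these lines (a sketch reusing the toy job from the top of this message) would abort the whole job as soon as an oom-killed step exits non-zero:

    #!/bin/bash -e
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=1024MB

    # With -e (errexit), the script stops at the first failing
    # command, so nothing below an oom-killed step keeps running.
    ./bigmem 100000
    echo "only reached if bigmem stayed under the limit"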
--
J. Michael Mosley
University Research Computing
The University of North Carolina at Charlotte
9201 University City Blvd
Charlotte, NC 28223
704.687.7065 | jmmos...@uncc.edu