Fairshare and Job Priority
To ensure that every owner group gets its fair share of the cluster, we use SLURM's built-in job accounting and fairshare system. Every owner group is given a quantity of shares based on the number of SCUs it has purchased in the KU Community Cluster. An owner group's fairshare score is then calculated from its shares versus the amount of the cluster it has actually used. This score is used to assign priority to the group's jobs relative to other users on the cluster. This keeps individual owner groups from monopolizing the resources in the sixhour partition, which would be unfair to owner groups that have not used their fair share for quite some time.
Fairshare Score
To see your fairshare score, run the command sshare. For example:
Account              User       RawShares  NormShares  RawUsage    EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                       1.000000    2321244304  1.000000      0.500000
ku                              parent     1.000000    2321244304  1.000000      0.500000
crc                             2          0.005556    103440      0.000045      0.994453
crc                  r557e636   2          0.003704    103365      0.000045      0.991695
An Account is the owner group's name. The CRC owns 2 nodes in the cluster, and thus its RawShares is equal to 2. The NormShares value is simply the Account's RawShares divided by the total number of RawShares given to all Accounts on the cluster. The value shown for crc, 0.005556, corresponds to 360 total RawShares across all Accounts: 2 / 360 = 0.005556.
RawUsage is the number of CPU seconds the Account or User has used. RawUsage is also affected by the half-life that is set for the cluster, which is currently 7 days. Past usage decays over time: work done just now counts at full cost, work done 7 days ago counts half, work done 14 days ago one-fourth, and so on.
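The decay can be illustrated with a short calculation. This is a minimal sketch that assumes the usual meaning of a half-life, where usage loses half its weight every 7 days; the names `decay_weight` and `decayed_usage` and the sample usage history are illustrative and are not part of SLURM.

```python
HALF_LIFE_DAYS = 7.0  # half-life quoted above for this cluster


def decay_weight(age_days, half_life_days=HALF_LIFE_DAYS):
    """Weight applied to usage that happened `age_days` ago (assumed exponential decay)."""
    return 0.5 ** (age_days / half_life_days)


def decayed_usage(records, half_life_days=HALF_LIFE_DAYS):
    """Sum a list of (cpu_seconds, age_days) records after applying the decay."""
    return sum(cpu * decay_weight(age, half_life_days) for cpu, age in records)


if __name__ == "__main__":
    for age in (0, 7, 14, 21):
        print(f"{age:2d} days ago -> weight {decay_weight(age):.3f}")
    # Hypothetical history: 1000 CPU seconds today and 1000 CPU seconds a week ago.
    print(decayed_usage([(1000, 0), (1000, 7)]))
```

Under this assumption, old work never drops out of RawUsage abruptly; it simply matters less with each passing week.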
The next column is EffectvUsage. EffectvUsage is the Account's RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the fraction of the cluster's recent usage that the Account is responsible for. In this case, crc has an EffectvUsage of 0.000045, which means it has used 0.0045% of the cluster.
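As a quick check of that arithmetic, the sketch below recomputes EffectvUsage from the RawUsage values in the sshare output above. The helper name `effective_usage` is illustrative, and the total is taken from the root row.

```python
def effective_usage(raw_usage, total_raw_usage):
    """Fraction of the cluster's total RawUsage attributed to one Account or User."""
    return raw_usage / total_raw_usage


TOTAL_RAW_USAGE = 2321244304  # RawUsage of the root row in the sshare output

crc_account = effective_usage(103440, TOTAL_RAW_USAGE)  # crc Account row
crc_user = effective_usage(103365, TOTAL_RAW_USAGE)     # crc / r557e636 User row

if __name__ == "__main__":
    print(f"account: {crc_account:.6f} ({crc_account:.4%})")
    print(f"user:    {crc_user:.6f} ({crc_user:.4%})")
```

Both values round to the 0.000045 that sshare prints.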
Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula: f = 2^(-EffectvUsage/NormShares). From this one can see that there are five basic regimes for this score, which are as follows:
- 1.0: Unused. The User has not run any jobs recently.
- 1.0 > f > 0.5: Underutilization. The User is underutilizing their granted Share. For example, f of about 0.71 means a lab has recently used half of its Share of the resources (1:2).
- 0.5: Average utilization. The User on average is using exactly as much as their granted Share.
- 0.5 > f > 0: Over-utilization. The User has overused their granted Share. For example, f = 0.25 means a lab has recently used twice its Share of the resources (2:1).
- 0: No share left. The User has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
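The formula and the regimes above can be tried out with a few lines of Python. This is only a sketch: `fairshare` and `regime` are illustrative names, not SLURM functions, and the sample inputs are taken from the sshare output shown earlier.

```python
import math


def fairshare(effective_usage, norm_shares):
    """Fairshare score f = 2^(-EffectvUsage/NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


def regime(f):
    """Map a Fairshare score onto the regimes listed above."""
    if f >= 1.0:
        return "unused"
    if math.isclose(f, 0.5):
        return "average utilization"
    if f > 0.5:
        return "underutilization"
    if f > 0.0:
        return "over-utilization"
    return "no share left"


if __name__ == "__main__":
    # crc Account row: RawUsage 103440 of 2321244304 total, NormShares 0.005556
    f_account = fairshare(103440 / 2321244304, 0.005556)
    print(f"crc account: f = {f_account:.4f} ({regime(f_account)})")
    # Usage-to-share ratios of 0, 1:2, 1:1 and 2:1
    for ratio in (0.0, 0.5, 1.0, 2.0):
        f = fairshare(ratio, 1.0)
        print(f"usage/share = {ratio}: f = {f:.3f} ({regime(f)})")
```

The score depends only on the ratio of EffectvUsage to NormShares, so a small group and a large group that each use exactly their Share both land at 0.5.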
Since the usage of the cluster varies, the scheduler does not stop Users from using more than their granted Share in their Account. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus a User is essentially borrowing future computing time to use now. This will continue to drive down the User's Fairshare score, but the User's jobs will still be allowed to start. Eventually, another User with a higher Fairshare score will start submitting jobs, and that User's jobs will have a higher priority because they have not used their granted Share.
Job Priority
Job Priority is an integer that determines the position of a job in the pending queue relative to other jobs. It has 3 components. Each component is multiplied by a weighting factor that controls how prominent that component is in the scheduling of jobs.
- Partition: Jobs submitted to an owner group partition receive 20,000 priority versus 400 priority given to jobs in the sixhour partition. This ensures that any job submitted to an owner group partition will always be scheduled before a sixhour job, even if submitted after the sixhour job.
- Fairshare: The fairshare priority is based on the individual user's usage of the cluster.
- Age: Every job starts with an age priority of 0 when it is submitted. The age priority component increases while the job sits in the PENDING state waiting for resources to become free.
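How the three components combine can be sketched as a weighted sum. The partition values (20,000 and 400) come from the list above. The fairshare and age weights below are hypothetical placeholders, since the cluster's actual weights are not stated here, and `job_priority` is an illustrative helper rather than a SLURM function. The fairshare and age factors are assumed to be normalized to the range 0.0 to 1.0.

```python
PARTITION_PRIORITY = {"owner": 20_000, "sixhour": 400}  # values quoted above
FAIRSHARE_WEIGHT = 10_000  # hypothetical weight, not the cluster's real setting
AGE_WEIGHT = 1_000         # hypothetical weight, not the cluster's real setting


def job_priority(partition, fairshare_factor, age_factor):
    """Combine partition, fairshare, and age into one integer priority.

    fairshare_factor and age_factor are assumed to lie between 0.0 and 1.0.
    """
    total = (
        PARTITION_PRIORITY[partition]
        + FAIRSHARE_WEIGHT * fairshare_factor
        + AGE_WEIGHT * age_factor
    )
    return int(total)


if __name__ == "__main__":
    # A brand-new owner-partition job from a heavy user ...
    print(job_priority("owner", fairshare_factor=0.0, age_factor=0.0))
    # ... versus a long-waiting sixhour job from a user with an unused Share.
    print(job_priority("sixhour", fairshare_factor=1.0, age_factor=1.0))
```

With these placeholder weights the owner-partition job still ranks first, because the gap between the two partition values is larger than the most the other two components can contribute.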
100 PENDING Jobs
Only 100 jobs per user in the PENDING state will accrue age priority. This is to allow other jobs to cut in line in that partition if there are thousands of jobs pending from a single user.
You can view all PENDING jobs and their respective priorities using the sprio command.