Thanks to Jayantha Das, who shared some of the common commands we use on SQL Server on Linux Pacemaker clusters. This blog lists the Pacemaker settings that we can modify. Unless explicitly required, we recommend not modifying these values, because doing so can cause problems with the functionality of availability groups.
We can run "sudo pcs config" to list the critical cluster configuration settings:
 Master: ag_cluster-master
  Meta Attrs: failure-timeout=30s notify=true
  Resource: ag_cluster (class=ocf provider=mssql type=ag)
   Attributes: ag_name=ag1
   Operations: start interval=0s timeout=60 (ag_cluster-start-interval-0s)
               stop interval=0s timeout=10 (ag_cluster-stop-interval-0s)
               promote interval=0s timeout=60 (ag_cluster-promote-interval-0s)
               demote interval=0s timeout=10 (ag_cluster-demote-interval-0s)
               monitor interval=10 timeout=60 (ag_cluster-monitor-interval-10)
               monitor interval=11 role=Master timeout=60 (ag_cluster-monitor-interval-11)
               monitor interval=12 role=Slave timeout=60 (ag_cluster-monitor-interval-12)
 Resource: virtualip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=xx.xx.xx.xx
  Operations: start interval=0s timeout=20s (virtualip-start-interval-0s)
              stop interval=0s timeout=20s (virtualip-stop-interval-0s)
              monitor interval=10s timeout=20s (virtualip-monitor-interval-10s)
Interval
The interval for the above operations is how frequently (in seconds) to perform the operation. A value of 0 means never. A positive value defines a recurring action, which is typically used with monitor.
Timeout
The timeout for the above operations is how long to wait before declaring that the action has failed.
Role
The role for the above operations (like the ones on the monitor operations) means that the operation runs only on node(s) that the cluster thinks should be in the specified role. This only makes sense for recurring monitor operations. Allowed (case-sensitive) values are Stopped, Started, and, in the case of multi-state resources, Slave and Master.
The usual monitor actions are insufficient to monitor a multi-state resource, because Pacemaker needs to verify not only that the resource is active, but also that its actual role matches its intended one.
For example, define two monitoring actions: the usual one covers the slave role, and an additional one with role="Master" covers the master role. In the case of availability groups, the master is the primary node and the slave is a secondary node.
<op id="public-ip-slave-check" name="monitor" interval="60"/>
<op id="public-ip-master-check" name="monitor" interval="61" role="Master"/>
It is crucial that every monitor operation has a different interval! Pacemaker currently differentiates between operations only by resource and interval; so if (for example) a master/slave resource had the same monitor interval for both roles, Pacemaker would ignore the role when checking the status — which would cause unexpected return codes, and therefore unnecessary complications.
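Because a shared interval makes Pacemaker ignore the role, it can help to check the intervals before applying a change. The following is a minimal sketch, not part of pcs or the resource agent: duplicate_intervals is a hypothetical helper name, and it assumes the intervals are typed in by hand using the same notation (it compares them as text, so 10 and 10s would count as different).

```shell
# Print every interval value that appears more than once.
# Arguments: the interval values of one resource's monitor operations.
duplicate_intervals() {
    printf '%s\n' "$@" | sort | uniq -d
}

# Intervals of the three ag_cluster monitor operations shown earlier.
dups=$(duplicate_intervals 10 11 12)
if [ -n "$dups" ]; then
    echo "duplicate monitor interval(s): $dups"
else
    echo "all monitor intervals are distinct"
fi
```

With the values 10, 11 and 12 from the sample configuration, nothing is reported as a duplicate.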
The connection timeout can be found in the corosync logs:
monitor: 2019/09/26 13:48:55 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-monitor]; connection-timeout [30];
The SQL query timeout is the same as the connection timeout.
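To read the value out of such a line without scanning it by eye, a small filter is enough. This is only a sketch based on the log format shown above: extract_connection_timeout is a hypothetical helper name, and the sample line is copied from this post rather than read from a real log file, whose location depends on your distribution.

```shell
# Sample line in the format shown above.
log_line='monitor: 2019/09/26 13:48:55 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-monitor]; connection-timeout [30];'

# Read log lines on stdin and print the number inside "connection-timeout [...]".
# Lines without that field produce no output.
extract_connection_timeout() {
    sed -n 's/.*connection-timeout \[\([0-9][0-9]*\)].*/\1/p'
}

printf '%s\n' "$log_line" | extract_connection_timeout
```

For the sample line this prints 30. In practice you would pipe the relevant lines of your own log through the same function.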
The following table lists some of the critical parameters, with a brief description of each and how to use it.
| Name | Description | How to use it |
| --- | --- | --- |
| cluster-recheck-interval | Polling interval for time-based changes to options, resource parameters, and constraints. A Pacemaker cluster is an event-driven system, so it won’t recalculate the best place for resources to run unless something happens, such as a resource failure or a configuration change. If time-based rules are needed, the cluster-recheck-interval cluster option (which defaults to 15 minutes) is essential: it tells the cluster to periodically recalculate its ideal state. An example of a time-based rule is starting a resource only during a certain period of time. | sudo pcs property set cluster-recheck-interval=2min |
| corosync totem token timeout | Makes corosync resilient against intermittent network glitches. The default is 1000 ms (1 second) for a 2-node cluster, increasing by 650 ms for each additional member. To use a value other than the default, add or edit the token line in /etc/corosync/corosync.conf, e.g. totem { token: 5000 } | cat /etc/corosync/corosync.conf totem { version: 2 cluster_name: pcmk secauth: off transport: udpu token: 10000 } |
| start-failure-is-fatal | Indicates whether a failure to start a resource on a particular node prevents further start attempts on that node. When set to false, the cluster decides whether to try starting on the same node again based on the resource’s current failure count and migration threshold. | sudo pcs property set start-failure-is-fatal=true |
| failure-timeout | cluster-recheck-interval is the polling interval at which the cluster checks for changes in the resource parameters, constraints, or other cluster options. If a replica goes down, the cluster tries to restart the replica at an interval that is bound by the failure-timeout value and the cluster-recheck-interval value. For example, if failure-timeout is set to 60 seconds and cluster-recheck-interval is set to 120 seconds, the restart is tried at an interval that is greater than 60 seconds but less than 120 seconds. We recommend that you set failure-timeout to 60s and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended. | sudo pcs resource update resource_name meta failure-timeout=60s |
| monitor_policy | Monitoring policy options are: 1) SERVER_UNRESPONSIVE_OR_DOWN: Fail if the SQL Server instance is unresponsive (unable to establish a connection) or down (the process is not running) | sudo pcs resource update resource_name monitor_policy=2 (this shows up as health-threshold [2]; in the corosync logs) |
| connection_timeout | The timeout of the connection request that Pacemaker makes to SQL Server. The timeout can be set using the pcs command. | sudo pcs resource update resource_name connection_timeout=90 |
| demote action timeout (s) | The demote action demotes relevant resources that are running in master mode to slave mode. The timeout can be set using the pcs command. | sudo pcs resource update resource_name op demote timeout=20 |
| promote action timeout (s) | The promote action promotes the relevant resource to master mode. The timeout can be set using the pcs command. | sudo pcs resource update resource_name op promote timeout=60 |
| start action timeout (s) | The start action starts the resource. The timeout can be set using the pcs command. | sudo pcs resource update resource_name op start timeout=60 |
| stop action timeout (s) | The stop action stops the resource. The timeout can be set using the pcs command. | sudo pcs resource update resource_name op stop timeout=60 |
| monitor action timeout (s) | The monitor action checks the resource’s state. | sudo pcs resource update resource_name op monitor interval=11 role=Master timeout=11 (the monitor timeout should be greater than the connection timeout) |
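The token timeout rule in the table (a base value for a 2-node cluster plus a fixed increment per additional member) is simple arithmetic, sketched below. effective_token_ms is a hypothetical helper name, and applying the same increment on top of a custom token value is an assumption here, so check the corosync.conf man page for your corosync version before relying on it.

```shell
# Effective totem token timeout in milliseconds, following the rule quoted in the table.
# Arguments: node count, base token in ms (default 1000), per-node increment in ms (default 650).
effective_token_ms() {
    nodes=$1
    base=${2:-1000}
    step=${3:-650}
    if [ "$nodes" -le 2 ]; then
        # Clusters of up to two nodes use the base value unchanged.
        echo "$base"
    else
        # Each member beyond the second adds one increment.
        echo $(( base + (nodes - 2) * step ))
    fi
}

for n in 2 3 5; do
    echo "$n nodes: $(effective_token_ms "$n") ms"
done
```

With the defaults this gives 1000 ms for 2 nodes, 1650 ms for 3 nodes and 2950 ms for 5 nodes.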
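Two of the recommendations in the table relate values to each other: the monitor timeout should be greater than the connection timeout, and cluster-recheck-interval should be greater than failure-timeout. The sketch below checks both for values entered by hand. check_timeouts is a hypothetical helper name, it does not query the cluster, and it assumes all four values have already been converted to whole seconds.

```shell
# Return success (0) when the values follow both recommendations, otherwise print
# a message for each violated one and return 1.
# Arguments, all in seconds: monitor timeout, connection timeout,
# failure-timeout, cluster-recheck-interval.
check_timeouts() {
    status=0
    if [ "$1" -le "$2" ]; then
        echo "monitor timeout ($1 s) should be greater than connection timeout ($2 s)"
        status=1
    fi
    if [ "$4" -le "$3" ]; then
        echo "cluster-recheck-interval ($4 s) should be greater than failure-timeout ($3 s)"
        status=1
    fi
    return "$status"
}

# Values taken from the examples in this post: monitor timeout 60, connection-timeout 30,
# failure-timeout 60s, cluster-recheck-interval 2min (120 s).
check_timeouts 60 30 60 120 && echo "timeouts are consistent"
```

A monitor timeout of 11 seconds against a connection timeout of 30 seconds would be flagged by the first check.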