This article discusses system clock failures in CnosDB and proposes ways of handling them in both single-node and cluster environments.
Database fault tolerance: system clock failure
A discussion arising from a situation that may never occur
One day, CnosDB intern Xiao Shao was designing processing logic for time series data. The generation of time series data generally follows these rules:
1. Time series data generally consists of timelines, and behind each timeline is an independent data source.
2. Time series data is usually generated in real time, and the points on a timeline arrive in chronological order. The underlying storage design of most time series databases is built on this assumption.
3. In some special cases, the chronological order on a timeline may be broken: data generated some time ago may only arrive at the database now. Some databases refuse such writes because they are too troublesome to process.
The problem lies in those “special cases”. Xiao Shao proposed a possible situation that I had not thought of:
1. If the data sent by the client does not carry a timestamp, the data's timestamp is assigned from the database server's system clock.
2. If the database server's system clock is then modified, for example set back from 5 o'clock to 3 o'clock, out-of-order data will appear (see the sketch below).
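A minimal sketch of that fallback logic in Rust, assuming a hypothetical write path (effective_timestamp is illustrative, not an actual CnosDB function):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: if the incoming point carries no timestamp,
/// fall back to the database server's system clock.
fn effective_timestamp(point_ts: Option<u64>) -> u64 {
    point_ts.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the Unix epoch")
            .as_nanos() as u64
    })
}

fn main() {
    // A write that carries no timestamp silently depends on the server clock.
    println!("assigned ts = {}", effective_timestamp(None));
}
```

If the clock is set back between two such writes, the second point receives a smaller timestamp than the first, which is exactly the out-of-order situation described above.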
It has to be said that this is even more special than the usual “special cases”. A typical out-of-order data scenario looks like this: a remote device continuously sends data carrying its own timestamps to the database, and one of two things may happen:
1. A network failure causes the data to arrive in an order different from the order in which it was sent.
2. Data sent by the device is lost or corrupted before reaching the database, and the device resends it at a later time.
After some thought, I concluded that the situation Xiao Shao described should not be lumped in with the out-of-order data problem, but treated as a separate issue.
System clock failure
Compared with common, serious faults such as power outages and network failures, changes to the system clock are rarely considered. They mainly affect things such as log timestamps and do not directly touch the parts of the system that users interact with.
In recent years, with the popularization of the NewSQL concept, many distributed databases have chosen to rely on the system clock plus NTP to implement hybrid logical clocks, and in that setting system clock failures genuinely deserve wide discussion. However, CnosDB is a native time series database rather than a retrofitted general-purpose database, and it does not support transactions, so hybrid logical clocks are not discussed here.
Definition
Before the detailed discussion, let us give a clear definition of a system clock failure: a sudden, large change in the system clock of the environment where the database is deployed, caused by a malicious attack, a user misoperation, or the like. For example, the system clock reads 17:00 one second and 3:19 the next.
Since the database has no way to confirm whether such a change is correct and matches the user's expectations, the change is treated as a failure.
The situation in a cluster
The single-node case is relatively simple and follows the definition above, while the cluster case can be roughly divided into three types:
1. The system clock of only a minority of nodes changes.
2. The system clocks of more than half of the nodes change.
3. All nodes switch synchronously to the same new system clock.
Impact on time series databases
Just like the “special cases” Xiao Shao mentioned above, a system clock failure affects data that does not carry its own timestamp. Behind this lies the time series database's strategy for handling such data:
1. Require all data written by users to carry its own timestamp.
2. For data without a timestamp, automatically add the client's system timestamp.
3. For data without a timestamp, automatically add the database server's system timestamp.
In my observation, strategy three is currently the more mainstream choice, so a time series database had better consider how to handle system clock failures. Otherwise, once a problem occurs, a large amount of data with abnormal timestamps will be written into the database; beyond the impact on performance, the appearance of such “dirty” data from the user's perspective is a big trouble.
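Expressed as a hypothetical configuration in Rust (the names are illustrative, not CnosDB's actual API), the three strategies might look like this:

```rust
/// Hypothetical policy for points that arrive without a timestamp.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum MissingTimestampPolicy {
    /// Strategy 1: reject writes that do not carry a timestamp.
    Reject,
    /// Strategy 2: the client fills in its own system timestamp.
    ClientClock,
    /// Strategy 3: the server fills in its system timestamp; the
    /// mainstream choice, and the one exposed to server clock faults.
    ServerClock,
}

fn main() {
    println!("{:?}", MissingTimestampPolicy::ServerClock);
}
```

Only strategy three ties the correctness of stored timestamps to the database server's clock, which is why the rest of this article focuses on it.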
Fault handling
Based on past experience and knowledge, a sufficiently robust system should follow four principles when handling failures:
1. Fail-Fast: detect faults as early as possible.
2. Fail-Safe: minimize the impact of failures.
3. Fail-Over: when a failure occurs, switch to a standby replica/node that has not failed, which continues to provide service.
4. Fail-Back: after the primary replica/node recovers from the fault state to a normal state, switch service back to it.
Based on these principles, the following designs cover both the single-node and the cluster case.
Single node: detect the fault, then degrade the service
In a single-node environment, the system needs to perform the following operations (a code sketch of the degraded write path follows the list):
1. Periodically obtain the current system timestamp and compare it with the previously obtained value; if the deviation between the two exceeds a tolerance threshold, a clock failure is declared.
2. Users can check whether the current CnosDB node is in the clock fault state via an entry in a system table; this state survives restarts of the CnosDB process.
3. After a failure occurs, only writes that carry a user-specified timestamp are accepted, and writes without a timestamp are rejected.
4. A user with administrator privileges can execute a command to clear the clock fault state of the current CnosDB node.
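A minimal sketch of operations 3 and 4 in Rust, assuming the fault flag is held in memory (the design above additionally persists it across restarts); all names are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical in-memory mirror of clock_fault_state (error == true).
/// The design above additionally persists this flag across restarts.
static CLOCK_FAULT: AtomicBool = AtomicBool::new(false);

#[derive(Debug)]
enum WriteError {
    /// Raised for writes without a timestamp while the node is degraded.
    ClockFaultState,
}

/// Operation 3 above: during a clock fault, only writes that carry a
/// user-specified timestamp are accepted.
fn accept_write(point_ts: Option<u64>) -> Result<u64, WriteError> {
    match point_ts {
        Some(ts) => Ok(ts), // explicit timestamps are always allowed
        None if CLOCK_FAULT.load(Ordering::Relaxed) => Err(WriteError::ClockFaultState),
        None => Ok(SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the Unix epoch")
            .as_nanos() as u64),
    }
}

/// Operation 4 above: an administrator command clears the fault state.
fn admin_clear_clock_fault() {
    CLOCK_FAULT.store(false, Ordering::Relaxed);
}

fn main() {
    CLOCK_FAULT.store(true, Ordering::Relaxed); // simulate a detected fault
    assert!(accept_write(None).is_err()); // server-clock writes are refused
    assert!(accept_write(Some(1_700_000_000_000_000_000)).is_ok());
    admin_clear_clock_fault();
    assert!(accept_write(None).is_ok());
}
```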
When handling a single-node failure, we can define the user interface as follows:
Unified naming rules
clock_fault_XX
System variables/parameters
1. clock_fault_check_interval
a. The interval at which the system timestamp is obtained and checked.
b. Defaults to 5 minutes; can be changed online.
2. clock_fault_last_timestamp
a. The system timestamp obtained at the last check.
b. Read-only.
3. clock_fault_check_threshold
a. The trigger threshold for a clock failure.
b. When the difference between the current actual system timestamp and clock_fault_last_timestamp + clock_fault_check_interval exceeds this threshold, the current CnosDB node switches to the clock fault state (written out in the sketch after this list).
c. Defaults to 1 h; can be changed online.
4. clock_fault_state
a. Whether the current CnosDB node is in the clock fault state.
b. In the clock fault state this value is error; otherwise it is normal.
c. When CnosDB sets this value to error, it also records the event in the log.
d. It can be changed online, but only a CnosDB administrator can manually set it back to normal, which signifies that the administrator has fully understood the current situation and handled it accordingly.
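The trigger condition in item 3b can be written out directly. A sketch, with all values in nanoseconds and parameter names taken from the variables above:

```rust
/// Item 3b above: the node enters the clock fault state when the newly
/// observed system timestamp deviates from the expected value
/// (clock_fault_last_timestamp + clock_fault_check_interval) by more
/// than clock_fault_check_threshold. All values in nanoseconds.
fn clock_fault_detected(
    now: i64,
    last_timestamp: i64,  // clock_fault_last_timestamp
    check_interval: i64,  // clock_fault_check_interval
    check_threshold: i64, // clock_fault_check_threshold
) -> bool {
    let expected = last_timestamp + check_interval;
    (now - expected).abs() > check_threshold
}

fn main() {
    let minute = 60_000_000_000i64;
    let (interval, threshold) = (5 * minute, 60 * minute); // defaults: 5 min, 1 h
    // The clock was set back two hours between two checks:
    let last = 17 * 60 * minute; // 17:00, as nanoseconds since midnight
    let now = last + interval - 120 * minute;
    assert!(clock_fault_detected(now, last, interval, threshold));
}
```

With the default values, ordinary NTP slew corrections stay far below the 1 h threshold, while a clock jerked back by two hours trips the check on the next 5-minute cycle.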
System timeline
CLOCK_FAULT_TIMESTAMP: records the system timestamp obtained at each check. Users can query this timeline to determine the exact point at which a clock failure occurred.
Cluster: more complex fault detection and active/standby switching
In a cluster environment, with multiple nodes involved, the handling is more complicated than the single-node mechanism, so the following operations are added:
1. Each CnosDB node performs the single-node fault handling described above.
2. When a node is in the clock fault state, writes without specified timestamps are automatically routed to other nodes for execution.
3. The system clocks of all nodes in the cluster are checked periodically (a sketch of this computation follows the list):
a. Let N be the total number of hosts (physical machines, virtual machines, or containers) in the current cluster, and let μ be a subset of the cluster's hosts containing more than N/2 hosts.
b. If there exists a μ in which the system timestamps of any two hosts differ by no more than a preset threshold, this μ is called a clock reference host group.
c. The clock reference host group should contain as many hosts as possible.
d. If none of the nodes deployed on the clock reference host group is in the clock fault state, the set of these nodes becomes a clock reference node group.
e. When no clock reference node group can be found in the cluster, the entire cluster is judged to be in the clock abnormal state.
f. Only when a clock reference node group can be found in the cluster can the CnosDB cluster administrator release the cluster from the clock abnormal state.
g. When the difference between a node's system clock and the system clocks of the clock reference host group exceeds a preset threshold, the cluster sets that node's clock_fault_state to error (through an internal cluster account; ordinary users cannot perform this operation).
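A sketch of step 3 in Rust. Because within a set sorted by timestamp "every pair differs by at most the threshold" is equivalent to "max - min is at most the threshold", the largest candidate group can be found with a sliding window over the sorted timestamps. The function name and index-based interface are illustrative:

```rust
/// Hypothetical implementation of step 3: find the largest clock
/// reference host group, i.e. the largest subset of hosts whose system
/// timestamps pairwise differ by at most `threshold` (nanoseconds) and
/// which contains more than half of all hosts. Returns host indices.
fn clock_reference_host_group(timestamps: &[i64], threshold: i64) -> Option<Vec<usize>> {
    let n = timestamps.len();
    if n == 0 {
        return None;
    }
    // Sort host indices by their observed system timestamp.
    let mut idx: Vec<usize> = (0..n).collect();
    idx.sort_by_key(|&i| timestamps[i]);

    // In sorted order, "all pairwise differences <= threshold" reduces to
    // "max - min <= threshold", so a sliding window finds the largest group.
    let (mut lo, mut best) = (0usize, (0usize, 0usize));
    for hi in 0..n {
        while timestamps[idx[hi]] - timestamps[idx[lo]] > threshold {
            lo += 1;
        }
        if hi - lo > best.1 - best.0 {
            best = (lo, hi);
        }
    }
    let group = idx[best.0..=best.1].to_vec();
    // Rule (e): if no majority group exists, the whole cluster is judged
    // to be in the clock abnormal state (modeled here as None).
    if group.len() * 2 > n {
        Some(group)
    } else {
        None
    }
}

fn main() {
    let second = 1_000_000_000i64;
    // Hosts 0 and 1 agree within 10 s; host 2 is far off.
    let ts = [100 * second, 102 * second, 500 * second];
    assert_eq!(clock_reference_host_group(&ts, 10 * second), Some(vec![0, 1]));
}
```

In a real cluster, nodes already in the clock fault state would then be excluded from the result before forming the clock reference node group, as rule (d) requires.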
When handling cluster failures, we can define the user interface as follows:
Unified naming rules
cluster_clock_fault_XX
System variables/parameters
1. cluster_clock_fault_check_interval
a. The interval at which the clock_fault check runs across the cluster.
b. Defaults to 5 minutes; can be changed online.
2. cluster_clock_fault_check_threshold
a. The threshold used to determine whether a clock reference node group exists.
b. Defaults to 10 s; can be changed online.
3. cluster_clock_fault_state
a. When the current CnosDB cluster is in the clock abnormal state, this value is error; otherwise it is normal.
b. When CnosDB sets this value to error, it also records the event in the log.
c. It can be changed online, but only a CnosDB administrator can manually set it back to normal, which signifies that the administrator has fully understood the current situation and handled it accordingly.
System timelines
1. cluster_clock_fault_base_clock_node_list: records the list of nodes in the current clock reference node group.
2. cluster_clock_fault_base_clock_host_list: records the list of hosts in the current clock reference host group.
Conclusion
Although a system clock failure is not the first fault a time series database thinks about handling, and cluster-level clock fault detection and handling may not even need to be implemented directly in the time series database layer, the process of thinking it through and producing a general design was still very interesting, so I decided to share it with everyone.
If you have encountered real cases of this kind of problem, please tell us, so that the design in this article can become more rigorous and land in CnosDB sooner.