Cassandra 日志
Cassandra 对日志记录提供了丰富的支持,并试图在最大限度地为操作人员提供对数据库的洞察力的同时,限制日志中的噪音。
常见日志文件
Cassandra 有三个主要的日志,system.log
、debug.log
和 gc.log
,它们分别保存一般日志消息、调试日志消息和 Java 垃圾回收日志。
默认情况下,这些日志位于 ${CASSANDRA_HOME}/logs
中,但大多数 Linux 发行版将日志重新定位到 /var/log/cassandra
。操作人员可以使用提供的 logback.xml
文件调整此位置以及记录的级别。
system.log
此日志是默认的 Cassandra 日志,是开始任何调查的好地方。一些记录到此日志中的活动示例
-
未捕获的异常。这些对于调试错误非常有用。
-
GCInspector
消息,指示长时间的垃圾回收器暂停。当发生长时间暂停时,Cassandra 会打印暂停时间以及系统在暂停时所处的状态(线程状态)。这有助于缩小容量问题(要么是堆内存不足,要么是空闲 CPU 不足)。 -
有关节点加入和离开集群以及令牌元数据(数据所有权)更改的信息。这对于调试网络分区、数据移动等很有用。
-
键空间/表创建、修改、删除。
-
StartupChecks
,确保操作系统配置最佳以运行 Cassandra -
有关某些后台操作任务(例如索引重新分配)的信息。
与任何应用程序一样,查找 ERROR
或 WARN
行可能是第一步
$ # Search for warnings or errors in the latest system.log
$ grep 'WARN\|ERROR' system.log | tail
...
$ # Search for warnings or errors in all rotated system.log
$ zgrep 'WARN\|ERROR' system.log.* | less
...
debug.log
此日志包含额外的调试信息,这些信息在故障排除时可能很有用,但可能比正常的 system.log
噪声更大。一些记录到此日志中的活动示例
-
有关压缩的信息,包括何时开始、包含哪些 SSTable 以及何时完成。
-
有关将 memtable 刷新到磁盘的信息,包括何时发生、刷新的大小以及刷新影响的哪些 commitlog 段。
此日志可能非常嘈杂,因此强烈建议使用 grep
和其他日志分析工具进行深入分析。例如
# Search for messages involving a CompactionTask with 5 lines of context
$ grep CompactionTask debug.log -C 5
# Look at the distribution of flush tasks per keyspace
$ grep "Enqueuing flush" debug.log | cut -f 10 -d ' ' | sort | uniq -c
6 compaction_history:
1 test_keyspace:
2 local:
17 size_estimates:
17 sstable_activity:
gc.log
gc 日志是标准的 Java GC 日志。使用默认的 jvm.options
设置,您可以在此日志中获得大量有价值的信息,例如应用程序暂停时间以及暂停的原因。这可能有助于将吞吐量或延迟问题缩小到 JVM 调整不当。例如,您可以查看最后几次暂停
$ grep stopped gc.log.0.current | tail
2018-08-29T00:19:39.522+0000: 3022663.591: Total time for which application threads were stopped: 0.0332813 seconds, Stopping threads took: 0.0008189 seconds
2018-08-29T00:19:44.369+0000: 3022668.438: Total time for which application threads were stopped: 0.0312507 seconds, Stopping threads took: 0.0007025 seconds
2018-08-29T00:19:49.796+0000: 3022673.865: Total time for which application threads were stopped: 0.0307071 seconds, Stopping threads took: 0.0006662 seconds
2018-08-29T00:19:55.452+0000: 3022679.521: Total time for which application threads were stopped: 0.0309578 seconds, Stopping threads took: 0.0006832 seconds
2018-08-29T00:20:00.127+0000: 3022684.197: Total time for which application threads were stopped: 0.0310082 seconds, Stopping threads took: 0.0007090 seconds
2018-08-29T00:20:06.583+0000: 3022690.653: Total time for which application threads were stopped: 0.0317346 seconds, Stopping threads took: 0.0007106 seconds
2018-08-29T00:20:10.079+0000: 3022694.148: Total time for which application threads were stopped: 0.0299036 seconds, Stopping threads took: 0.0006889 seconds
2018-08-29T00:20:15.739+0000: 3022699.809: Total time for which application threads were stopped: 0.0078283 seconds, Stopping threads took: 0.0006012 seconds
2018-08-29T00:20:15.770+0000: 3022699.839: Total time for which application threads were stopped: 0.0301285 seconds, Stopping threads took: 0.0003789 seconds
2018-08-29T00:20:15.798+0000: 3022699.867: Total time for which application threads were stopped: 0.0279407 seconds, Stopping threads took: 0.0003627 seconds
这显示了许多有价值的信息,包括应用程序暂停了多长时间(意味着在例如 33 毫秒的 JVM 暂停期间,没有用户查询得到服务),以及进入安全点的花费时间。您可以使用此原始数据来例如获取最长的暂停
$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X gc.log.0.current | sort -k 1
2018-08-28T17:13:40.520-0700: 1.193: Total time for which application threads were stopped: 0.0157914 seconds, Stopping threads took: 0.0000355 seconds
2018-08-28T17:13:41.206-0700: 1.879: Total time for which application threads were stopped: 0.0249811 seconds, Stopping threads took: 0.0000318 seconds
2018-08-28T17:13:41.638-0700: 2.311: Total time for which application threads were stopped: 0.0561130 seconds, Stopping threads took: 0.0000328 seconds
2018-08-28T17:13:41.677-0700: 2.350: Total time for which application threads were stopped: 0.0362129 seconds, Stopping threads took: 0.0000597 seconds
2018-08-28T17:13:41.781-0700: 2.454: Total time for which application threads were stopped: 0.0442846 seconds, Stopping threads took: 0.0000238 seconds
2018-08-28T17:13:41.976-0700: 2.649: Total time for which application threads were stopped: 0.0377115 seconds, Stopping threads took: 0.0000250 seconds
2018-08-28T17:13:42.172-0700: 2.845: Total time for which application threads were stopped: 0.0475415 seconds, Stopping threads took: 0.0001018 seconds
2018-08-28T17:13:42.825-0700: 3.498: Total time for which application threads were stopped: 0.0379155 seconds, Stopping threads took: 0.0000571 seconds
2018-08-28T17:13:43.574-0700: 4.247: Total time for which application threads were stopped: 0.0323812 seconds, Stopping threads took: 0.0000574 seconds
2018-08-28T17:13:44.602-0700: 5.275: Total time for which application threads were stopped: 0.0238975 seconds, Stopping threads took: 0.0000788 seconds
在这种情况下,任何等待查询的客户端都会在 17:13:41 遇到 56 毫秒的延迟。
请注意,GC 暂停并非_仅垃圾回收,尽管通常来说,快速安全点的长时间暂停表明 JVM 堆内存不足或 JVM GC 算法调整不当。长时间暂停和缓慢的安全点通常表明 JVM 在进入安全点时遇到问题,这通常表明磁盘驱动器速度慢(Cassandra 大量使用内存映射读取,JVM 不知道这些读取可能存在磁盘延迟,因此 JVM 安全点逻辑无法很好地处理阻塞的内存映射读取)。
使用这些日志,您甚至可以使用类似 histogram.py 的工具获取暂停分布
$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | histogram.py
# NumSamples = 410293; Min = 0.00; Max = 11.49
# Mean = 0.035346; Variance = 0.002216; SD = 0.047078; Median 0.036498
# each ∎ represents a count of 5470
0.0001 - 1.1496 [410255]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.1496 - 2.2991 [ 15]:
2.2991 - 3.4486 [ 5]:
3.4486 - 4.5981 [ 1]:
4.5981 - 5.7475 [ 5]:
5.7475 - 6.8970 [ 9]:
6.8970 - 8.0465 [ 1]:
8.0465 - 9.1960 [ 0]:
9.1960 - 10.3455 [ 0]:
10.3455 - 11.4949 [ 2]:
我们可以看到,在这种情况下,虽然我们的平均性能非常好,但某些东西会导致多秒的 JVM 暂停……在这种情况下,主要是由慢速磁盘导致的安全点暂停
$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X gc.log.0.current| sort -k 1
2018-07-27T04:52:27.413+0000: 187831.482: Total time for which application threads were stopped: 6.5037022 seconds, Stopping threads took: 0.0005212 seconds
2018-07-30T23:38:18.354+0000: 514582.423: Total time for which application threads were stopped: 6.3262938 seconds, Stopping threads took: 0.0004882 seconds
2018-08-01T02:37:48.380+0000: 611752.450: Total time for which application threads were stopped: 10.3879659 seconds, Stopping threads took: 0.0004475 seconds
2018-08-06T22:04:14.990+0000: 1113739.059: Total time for which application threads were stopped: 6.0917409 seconds, Stopping threads took: 0.0005553 seconds
2018-08-14T00:04:06.091+0000: 1725730.160: Total time for which application threads were stopped: 6.0141054 seconds, Stopping threads took: 0.0004976 seconds
2018-08-17T06:23:06.755+0000: 2007670.824: Total time for which application threads were stopped: 6.0133694 seconds, Stopping threads took: 0.0006011 seconds
2018-08-23T06:35:46.068+0000: 2526830.137: Total time for which application threads were stopped: 6.4767751 seconds, Stopping threads took: 6.4426849 seconds
2018-08-23T06:36:29.018+0000: 2526873.087: Total time for which application threads were stopped: 11.4949489 seconds, Stopping threads took: 11.4638297 seconds
2018-08-23T06:37:12.671+0000: 2526916.741: Total time for which application threads were stopped: 6.3867003 seconds, Stopping threads took: 6.3507166 seconds
2018-08-23T06:37:47.156+0000: 2526951.225: Total time for which application threads were stopped: 7.9528200 seconds, Stopping threads took: 7.9197756 seconds
有时读取和理解 Java GC 日志很困难,但您可以获取原始 GC 文件并使用 GCViewer 等工具对其进行可视化,这些工具将 Cassandra GC 日志作为输入,并向您显示有关垃圾回收性能的详细可视化信息。这包括暂停分析以及吞吐量信息。对于稳定的 Cassandra JVM,您可能希望将暂停时间控制在 200 毫秒以下,GC 吞吐量大于 99%。
Java GC 暂停是 Cassandra 中尾部延迟的主要原因之一(与驱动器延迟一起),因此有时这些信息在调试尾部延迟问题时至关重要。
获取更多信息
如果默认的日志记录级别不足,nodetool
可以使用 nodetool setlogginglevel
命令为各种包和类设置更高或更低的日志记录级别。首先查看当前级别
$ nodetool getlogginglevels
Logger Name Log Level
ROOT INFO
org.apache.cassandra DEBUG
也许 Gossiper
出现了问题,我们希望将其启用为 TRACE
级别以获得更多洞察力
$ nodetool setlogginglevel org.apache.cassandra.gms.Gossiper TRACE
$ nodetool getlogginglevels
Logger Name Log Level
ROOT INFO
org.apache.cassandra DEBUG
org.apache.cassandra.gms.Gossiper TRACE
$ grep TRACE debug.log | tail -2
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:1234 - Updating
heartbeat state version to 2344 from 2343 for 127.0.0.2:7000 ...
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:923 - local
heartbeat version 2341 greater than 2340 for 127.0.0.1:7000
请注意,以这种方式进行的任何更改都会在下次 Cassandra 进程重启时恢复。要使更改永久生效,请将相应的规则添加到 logback.xml
中。
diff --git a/conf/logback.xml b/conf/logback.xml
index b2c5b10..71b0a49 100644
--- a/conf/logback.xml
+++ b/conf/logback.xml
@@ -98,4 +98,5 @@ appender reference in the root level section below.
</root>
<logger name="org.apache.cassandra" level="DEBUG"/>
+ <logger name="org.apache.cassandra.gms.Gossiper" level="TRACE"/>
</configuration>
请注意,如果您希望获得比此工具提供的更多信息,还有其他实时捕获选项可用,例如 packet-capture
。