Cassandra 日志

Cassandra 对日志记录提供了丰富的支持，并试图在最大限度地为操作人员提供对数据库的洞察力的同时，限制日志中的噪音。

常见日志文件

Cassandra 有三个主要的日志，system.log、debug.log 和 gc.log，它们分别保存一般日志消息、调试日志消息和 Java 垃圾回收日志。

默认情况下，这些日志位于 ${CASSANDRA_HOME}/logs 中，但大多数 Linux 发行版将日志重新定位到 /var/log/cassandra。操作人员可以使用提供的 logback.xml 文件调整此位置以及记录的级别。

`system.log`

此日志是默认的 Cassandra 日志，是开始任何调查的好地方。一些记录到此日志中的活动示例

未捕获的异常。这些对于调试错误非常有用。
GCInspector 消息，指示长时间的垃圾回收器暂停。当发生长时间暂停时，Cassandra 会打印暂停时间以及系统在暂停时所处的状态（线程状态）。这有助于缩小容量问题（要么是堆内存不足，要么是空闲 CPU 不足）。
有关节点加入和离开集群以及令牌元数据（数据所有权）更改的信息。这对于调试网络分区、数据移动等很有用。
键空间/表创建、修改、删除。
StartupChecks，确保操作系统配置最佳以运行 Cassandra
有关某些后台操作任务（例如索引重新分配）的信息。

与任何应用程序一样，查找 ERROR 或 WARN 行可能是第一步

$ # Search for warnings or errors in the latest system.log
$ grep 'WARN\|ERROR' system.log | tail
...

$ # Search for warnings or errors in all rotated system.log
$ zgrep 'WARN\|ERROR' system.log.* | less
...

`debug.log`

此日志包含额外的调试信息，这些信息在故障排除时可能很有用，但可能比正常的 system.log 噪声更大。一些记录到此日志中的活动示例

有关压缩的信息，包括何时开始、包含哪些 SSTable 以及何时完成。
有关将 memtable 刷新到磁盘的信息，包括何时发生、刷新的大小以及刷新影响的哪些 commitlog 段。

此日志可能非常嘈杂，因此强烈建议使用 grep 和其他日志分析工具进行深入分析。例如

# Search for messages involving a CompactionTask with 5 lines of context
$ grep CompactionTask debug.log -C 5

# Look at the distribution of flush tasks per keyspace
$ grep "Enqueuing flush" debug.log | cut -f 10 -d ' ' | sort | uniq -c
    6 compaction_history:
    1 test_keyspace:
    2 local:
    17 size_estimates:
    17 sstable_activity:

`gc.log`

gc 日志是标准的 Java GC 日志。使用默认的 jvm.options 设置，您可以在此日志中获得大量有价值的信息，例如应用程序暂停时间以及暂停的原因。这可能有助于将吞吐量或延迟问题缩小到 JVM 调整不当。例如，您可以查看最后几次暂停

$ grep stopped gc.log.0.current | tail
2018-08-29T00:19:39.522+0000: 3022663.591: Total time for which application threads were stopped: 0.0332813 seconds, Stopping threads took: 0.0008189 seconds
2018-08-29T00:19:44.369+0000: 3022668.438: Total time for which application threads were stopped: 0.0312507 seconds, Stopping threads took: 0.0007025 seconds
2018-08-29T00:19:49.796+0000: 3022673.865: Total time for which application threads were stopped: 0.0307071 seconds, Stopping threads took: 0.0006662 seconds
2018-08-29T00:19:55.452+0000: 3022679.521: Total time for which application threads were stopped: 0.0309578 seconds, Stopping threads took: 0.0006832 seconds
2018-08-29T00:20:00.127+0000: 3022684.197: Total time for which application threads were stopped: 0.0310082 seconds, Stopping threads took: 0.0007090 seconds
2018-08-29T00:20:06.583+0000: 3022690.653: Total time for which application threads were stopped: 0.0317346 seconds, Stopping threads took: 0.0007106 seconds
2018-08-29T00:20:10.079+0000: 3022694.148: Total time for which application threads were stopped: 0.0299036 seconds, Stopping threads took: 0.0006889 seconds
2018-08-29T00:20:15.739+0000: 3022699.809: Total time for which application threads were stopped: 0.0078283 seconds, Stopping threads took: 0.0006012 seconds
2018-08-29T00:20:15.770+0000: 3022699.839: Total time for which application threads were stopped: 0.0301285 seconds, Stopping threads took: 0.0003789 seconds
2018-08-29T00:20:15.798+0000: 3022699.867: Total time for which application threads were stopped: 0.0279407 seconds, Stopping threads took: 0.0003627 seconds

这显示了许多有价值的信息，包括应用程序暂停了多长时间（意味着在例如 33 毫秒的 JVM 暂停期间，没有用户查询得到服务），以及进入安全点的花费时间。您可以使用此原始数据来例如获取最长的暂停

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n  | tail | xargs -IX grep X gc.log.0.current | sort -k 1
2018-08-28T17:13:40.520-0700: 1.193: Total time for which application threads were stopped: 0.0157914 seconds, Stopping threads took: 0.0000355 seconds
2018-08-28T17:13:41.206-0700: 1.879: Total time for which application threads were stopped: 0.0249811 seconds, Stopping threads took: 0.0000318 seconds
2018-08-28T17:13:41.638-0700: 2.311: Total time for which application threads were stopped: 0.0561130 seconds, Stopping threads took: 0.0000328 seconds
2018-08-28T17:13:41.677-0700: 2.350: Total time for which application threads were stopped: 0.0362129 seconds, Stopping threads took: 0.0000597 seconds
2018-08-28T17:13:41.781-0700: 2.454: Total time for which application threads were stopped: 0.0442846 seconds, Stopping threads took: 0.0000238 seconds
2018-08-28T17:13:41.976-0700: 2.649: Total time for which application threads were stopped: 0.0377115 seconds, Stopping threads took: 0.0000250 seconds
2018-08-28T17:13:42.172-0700: 2.845: Total time for which application threads were stopped: 0.0475415 seconds, Stopping threads took: 0.0001018 seconds
2018-08-28T17:13:42.825-0700: 3.498: Total time for which application threads were stopped: 0.0379155 seconds, Stopping threads took: 0.0000571 seconds
2018-08-28T17:13:43.574-0700: 4.247: Total time for which application threads were stopped: 0.0323812 seconds, Stopping threads took: 0.0000574 seconds
2018-08-28T17:13:44.602-0700: 5.275: Total time for which application threads were stopped: 0.0238975 seconds, Stopping threads took: 0.0000788 seconds

在这种情况下，任何等待查询的客户端都会在 17:13:41 遇到 56 毫秒的延迟。

请注意，GC 暂停并非_仅垃圾回收，尽管通常来说，快速安全点的长时间暂停表明 JVM 堆内存不足或 JVM GC 算法调整不当。长时间暂停和缓慢的安全点通常表明 JVM 在进入安全点时遇到问题，这通常表明磁盘驱动器速度慢（Cassandra 大量使用内存映射读取，JVM 不知道这些读取可能存在磁盘延迟，因此 JVM 安全点逻辑无法很好地处理阻塞的内存映射读取）。

使用这些日志，您甚至可以使用类似 histogram.py 的工具获取暂停分布

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | histogram.py
# NumSamples = 410293; Min = 0.00; Max = 11.49
# Mean = 0.035346; Variance = 0.002216; SD = 0.047078; Median 0.036498
# each ∎ represents a count of 5470
    0.0001 -     1.1496 [410255]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.1496 -     2.2991 [    15]:
    2.2991 -     3.4486 [     5]:
    3.4486 -     4.5981 [     1]:
    4.5981 -     5.7475 [     5]:
    5.7475 -     6.8970 [     9]:
    6.8970 -     8.0465 [     1]:
    8.0465 -     9.1960 [     0]:
    9.1960 -    10.3455 [     0]:
   10.3455 -    11.4949 [     2]:

我们可以看到，在这种情况下，虽然我们的平均性能非常好，但某些东西会导致多秒的 JVM 暂停……在这种情况下，主要是由慢速磁盘导致的安全点暂停

$ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X  gc.log.0.current| sort -k 1
2018-07-27T04:52:27.413+0000: 187831.482: Total time for which application threads were stopped: 6.5037022 seconds, Stopping threads took: 0.0005212 seconds
2018-07-30T23:38:18.354+0000: 514582.423: Total time for which application threads were stopped: 6.3262938 seconds, Stopping threads took: 0.0004882 seconds
2018-08-01T02:37:48.380+0000: 611752.450: Total time for which application threads were stopped: 10.3879659 seconds, Stopping threads took: 0.0004475 seconds
2018-08-06T22:04:14.990+0000: 1113739.059: Total time for which application threads were stopped: 6.0917409 seconds, Stopping threads took: 0.0005553 seconds
2018-08-14T00:04:06.091+0000: 1725730.160: Total time for which application threads were stopped: 6.0141054 seconds, Stopping threads took: 0.0004976 seconds
2018-08-17T06:23:06.755+0000: 2007670.824: Total time for which application threads were stopped: 6.0133694 seconds, Stopping threads took: 0.0006011 seconds
2018-08-23T06:35:46.068+0000: 2526830.137: Total time for which application threads were stopped: 6.4767751 seconds, Stopping threads took: 6.4426849 seconds
2018-08-23T06:36:29.018+0000: 2526873.087: Total time for which application threads were stopped: 11.4949489 seconds, Stopping threads took: 11.4638297 seconds
2018-08-23T06:37:12.671+0000: 2526916.741: Total time for which application threads were stopped: 6.3867003 seconds, Stopping threads took: 6.3507166 seconds
2018-08-23T06:37:47.156+0000: 2526951.225: Total time for which application threads were stopped: 7.9528200 seconds, Stopping threads took: 7.9197756 seconds

有时读取和理解 Java GC 日志很困难，但您可以获取原始 GC 文件并使用 GCViewer 等工具对其进行可视化，这些工具将 Cassandra GC 日志作为输入，并向您显示有关垃圾回收性能的详细可视化信息。这包括暂停分析以及吞吐量信息。对于稳定的 Cassandra JVM，您可能希望将暂停时间控制在 200 毫秒以下，GC 吞吐量大于 99%。

Java GC 暂停是 Cassandra 中尾部延迟的主要原因之一（与驱动器延迟一起），因此有时这些信息在调试尾部延迟问题时至关重要。

获取更多信息

如果默认的日志记录级别不足，nodetool 可以使用 nodetool setlogginglevel 命令为各种包和类设置更高或更低的日志记录级别。首先查看当前级别

$ nodetool getlogginglevels

Logger Name                                        Log Level
ROOT                                                    INFO
org.apache.cassandra                                   DEBUG

也许 Gossiper 出现了问题，我们希望将其启用为 TRACE 级别以获得更多洞察力

$ nodetool setlogginglevel org.apache.cassandra.gms.Gossiper TRACE

$ nodetool getlogginglevels

Logger Name                                        Log Level
ROOT                                                    INFO
org.apache.cassandra                                   DEBUG
org.apache.cassandra.gms.Gossiper                      TRACE

$ grep TRACE debug.log | tail -2
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:1234 - Updating
heartbeat state version to 2344 from 2343 for 127.0.0.2:7000 ...
TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:923 - local
heartbeat version 2341 greater than 2340 for 127.0.0.1:7000

请注意，以这种方式进行的任何更改都会在下次 Cassandra 进程重启时恢复。要使更改永久生效，请将相应的规则添加到 logback.xml 中。

diff --git a/conf/logback.xml b/conf/logback.xml
index b2c5b10..71b0a49 100644
--- a/conf/logback.xml
+++ b/conf/logback.xml
@@ -98,4 +98,5 @@ appender reference in the root level section below.
   </root>

   <logger name="org.apache.cassandra" level="DEBUG"/>
+  <logger name="org.apache.cassandra.gms.Gossiper" level="TRACE"/>
 </configuration>

请注意，如果您希望获得比此工具提供的更多信息，还有其他实时捕获选项可用，例如 packet-capture。

Cassandra 文档

版本