Remoting Monitoring with OpenTelemetry - Coding Phase 1

    Remoting Monitoring with OpenTelemetry

    Goal

    Goal of Remoting Monitoring with OpenTelemetry

    The goal of this project:

    • collect telemetry data(metrics, traces, logs) of remoting module with OpenTelemetry.

    • send the telemetry data to OpenTelemetry Protocol endpoint

    Which OpenTelemetry endpoint to use and how to visualize the data are up to users.

    OpenTelemetry

    OpenTelemetry Logo

    An observability framework for cloud-native software

    OpenTelemetry is a collection of tools, APIs, and SDKs. You can use it to instrument, generate, collect, and export telemetry data(metrics, logs, and traces) for analysis in order to understand your software’s performance and behavior.

    Phase 1 summary

    User survey

    Our team conducted a user survey to understand the pain point regarding Jenkins remoting.

    Fig 1. What agent type/plugins do you use?

    End user survey result

    Fig 1 shows what types of agent users use, and 17 unique respondents out of 28 use docker for agent. So I’m planning to publish a docker image to demonstrate how we can build Docker image with our monitoring feature.

    This survey and investigation of JIRA tickets of past two years also tell me five common causes of agent unavailability.

    • Configuration mistakes

      • Jenkins agent settings, e.g. misuse of "tunnel connection through" option.

      • Platform settings, e.g. invalid port setting of Kubernetes' helm template.

      • Network settings, e.g. Load balancer misconfiguration.

    • Uncontrolled shutdown of nodes for downscaling.

    • Timeout during provisioning a new node.

    • Firewall, antivirus software or other network component kill the connection

    • Lack of hardware resources, e.g. memory, temp space, etc…​

    We also heard valuable user voice in the survey.

    What areas would you like to see better in Jenkins monitoring?

    I have created a bunch of adhoc monitoring jobs to check on the agent’s health and send e-mail. Would be nice to have this consolidated.

    Having archive of nodes with the access to their logs/events would have been nice.

    I hope that implementing these feature with OpenTelemetry, which is expected to become the industry standard for observability, will bring great monitoring experience to Jenkins community.

    Proof of Concept

    How to deliver the monitoring program to agents

    1. Sending monitoring program to the agent over remoting

    Sending monitoring program via remoting

    In my first implementation, I prepared a Jenkins plugin and send the monitoring program from Jenkins controller. However, this approach have following disadvantages.

    1. We cannot collect telemetry data before the initial connection. We are likely to encounter a problem while provisioning a new node, so it’s important to observe agents' telemetry data from the beginning.

    2. Some agent restarters (e.g. UnixSlaveRestarter) restart agent completely when reconnecting. It means that the agent lost monitoring program every time the connection closed, and we cannot collect telemetry data after the connection is lost before a new connection is established.

    So we decided to take the next approach.

    2. Install monitoring engine when provisioning a new agent

    Installing monitoring engine when provisioning

    In this approach, user will download the monitoring program called monitoring engine, which is a JAR file, and place it in the agent node when provisioning.

    How to instrument remoting to produce remoting trace

    Add instrumentation extension point to remoting

    This approach makes the agent launch command more complicated, and we have to overcome this problem.

    Current State

    Metrics

    We currently support the following metrics and planning to support more.

    metrics

    unit

    label

    key

    description

    system.cpu.load

    1

    System CPU load. See com.sun.management.OperatingSystemMXBean.getSystemCpuLoad

    system.cpu.load.average.1m

    System CPU load average 1 minute See java.lang.management.OperatingSystemMXBean.getSystemLoadAverage

    system.memory.usage

    bytes

    state

    used, free

    see com.sun.management.OperatingSystemMXBean.getTotalPhysicalMemorySize and com.sun.management.OperatingSystemMXBean.getFreePhysicalMemorySize

    system.memory.utilization

    1

    System memory utilization, see com.sun.management.OperatingSystemMXBean.getTotalPhysicalMemorySize and com.sun.management.OperatingSystemMXBean.getFreePhysicalMemorySize. Report 0% if no physical memory is discovered by the JVM.

    system.paging.usage

    bytes

    state

    used, free

    see com.sun.management.OperatingSystemMXBean.getFreeSwapSpaceSize and com.sun.management.OperatingSystemMXBean.getTotalSwapSpaceSize.

    system.paging.utilization

    1

    see com.sun.management.OperatingSystemMXBean.getFreeSwapSpaceSize and com.sun.management.OperatingSystemMXBean.getTotalSwapSpaceSize. Report 0% if no swap memory is discovered by the JVM.

    process.cpu.load

    %

    Process CPU load. See com.sun.management.OperatingSystemMXBean.getProcessCpuLoad.

    process.cpu.time

    ns

    Process CPU time. See com.sun.management.OperatingSystemMXBean.getProcessCpuTime.

    runtime.jvm.memory.area

    bytes

    type

    used, committed, max

    see MemoryUsage

    area

    heap, non_heap

    runtime.jvm.memory.pool

    bytes

    type

    used, committed, max

    see MemoryUsage

    pool

    PS Eden Space, G1 Old Gen…​

    runtime.jvm.gc.time

    ms

    gc

    G1 Young Generation, G1 Old Generation, …​

    see GarbageCollectorMXBean

    runtime.jvm.gc.count

    1

    gc

    G1 Young Generation, G1 Old Generation, …​

    see GarbageCollectorMXBean

    Traces

    We tried several approaches to instrument remoting module, but good approach is not established yet.

    Here is a draft documentation of the spans to collect. Google Doc

    Logs

    Coming soon!

    Metric and span demo visualization

    Our team created a demo example with Docker compose and visualized the metrics and spans.

    Click to open in new tab

    prometheus metric visualization jaeger span visualization

    Google Summer of Code Midterm Demo

    Our project demo starts with 8:20

    Next Step

    • Log support

    • Alpha release!

    About the Author
    Akihiro Kiuchi
    Akihiro Kiuchi

    GSoC 2021 student (Jenkins Remoting Monitoring). Akihiro is a student in the Department of information and communication engineering at the University of Tokyo.