Recent Blog Posts

Oct
21
A sustainable pattern with shared library
Table of Contents
- Context
- The Problems
- The Solution
- Shared Library
- Duplication
- Documentation
- Scalability
- Installation Agnostic
- Feature Toggling

This post describes how I use a shared library in Jenkins, typically with multibranch pipelines. When I am not forced to use multibranch, I implement the pipelines without it; I wrote about how I do that with my Generic Webhook Trigger Plugin in a previous post. The pattern described here is my second choice, for when I am not allowed to remove the Jenkinsfiles from the repositories entirely.

Context

Within an organization, you typically have a few different kinds of repositories, each repository versioning one application. You may use different techniques for different kinds of applications. The Jenkins organization on GitHub is an example, with 2300 repositories.

The Problems

- Large Jenkinsfiles in every repository containing duplicated code. It seems common that the Jenkinsfiles in every repository contain much more than just the things that are unique to that repository. The shared libraries feature may not be used, or it is used but not with an optimal pattern.
- Installation-specific Jenkinsfiles that only work with one specific Jenkins installation. Sometimes I see multiple Jenkinsfiles, one for each purpose or Jenkins installation.
- No documentation and/or no natural place to write documentation.
- Development is slow. Adding new features to repositories is a time-consuming task. I want to be able to push features to 1000+ repositories without having to update their Jenkinsfiles.
- No flexible way of doing feature toggling. When maintaining a large number of repositories it is sometimes nice to introduce a feature to a subset of those repositories. If that works well, the feature is introduced to all repositories.

The Solution

My solution is a pattern inspired by how the Jenkins organization on GitHub does it with its buildPlugin(). But it is not exactly the same.
Shared Library

Here is how I organize my shared libraries.

Jenkinsfile

I put this in the Jenkinsfiles:

    buildRepo()

Default Configuration

I provide a default configuration that any repository will get if no other configuration is given in buildRepo(). I create a vars/getConfig.groovy with:

    def call(givenConfig = [:]) {
      def defaultConfig = [
        /**
         * The Jenkins node, or label, that will be allocated for this build.
         */
        "jenkinsNode": "BUILD",
        /**
         * All config specific to NPM repo type.
         */
        "npm": [
          /**
           * Whether or not to run Cypress tests, if there are any.
           */
          "cypress": true
        ],
        "maven": [
          /**
           * Whether or not to run integration tests, if there are any.
           */
          "integTest": true
        ]
      ]
      // merge() is a helper that is not shown in this post, see:
      // https://e.printstacktrace.blog/how-to-merge-two-maps-in-groovy/
      def effectiveConfig = merge(defaultConfig, givenConfig)
      println "Configuration is documented here: https://whereverYouHos/getConfig.groovy"
      println "Default config: " + defaultConfig
      println "Given config: " + givenConfig
      println "Effective config: " + effectiveConfig
      return effectiveConfig
    }

Build Plan

I construct a build plan as early as possible, deciding up front what will be done in this build, so that the rest of the code becomes more streamlined. I try to rely as much as possible on conventions. I may provide configuration that lets users turn off features, but they are otherwise turned on if they are detected.
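Returning to getConfig.groovy for a moment: the merge helper it calls is not defined in this post. A minimal sketch of what such a helper could look like is below, assuming a recursive merge where nested maps are combined key by key and any other given value replaces the default. The name and signature are assumptions; in a shared library it would more likely live in its own vars/merge.groovy with a call method.

```groovy
// Hypothetical helper, not from the original post: recursively merge
// `overrides` into a copy of `defaults`. Nested maps are merged key by
// key; any other value found in `overrides` wins.
Map merge(Map defaults, Map overrides) {
    Map result = [:]
    result.putAll(defaults)
    overrides.each { key, value ->
        if (value instanceof Map && result[key] instanceof Map) {
            result[key] = merge((Map) result[key], (Map) value)
        } else {
            result[key] = value
        }
    }
    return result
}

def defaults = [jenkinsNode: 'BUILD', npm: [cypress: true], maven: [integTest: true]]
def effective = merge(defaults, [npm: [cypress: false]])

// Only the overridden key changes; the defaults themselves are untouched.
assert effective == [jenkinsNode: 'BUILD', npm: [cypress: false], maven: [integTest: true]]
assert defaults.npm.cypress == true
```

Building a new map instead of mutating the defaults keeps the "Default config" log line in getConfig.groovy truthful after the merge.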
I create a vars/getBuildPlan.groovy with:

    def call(effectiveConfig = [:]) {
      def derivedBuildPlan = [
        "repoType": "NOT DETECTED",
        // Carried over so the buildRepo* methods know which node to allocate.
        "jenkinsNode": effectiveConfig.jenkinsNode,
        "npm": [:],
        "maven": [:]
      ]
      node {
        deleteDir()
        checkout([$class: 'GitSCM',
          branches: [[name: '*/branchName']],
          extensions: [
            [$class: 'SparseCheckoutPaths', sparseCheckoutPaths:
              [[$class:'SparseCheckoutPath', path:'package.json,pom.xml']]
            ]
          ],
          userRemoteConfigs: [[credentialsId: 'someID', url: 'git@link.git']]
        ])
        if (fileExists('package.json')) {
          def packageJSON = readJSON file: 'package.json'
          derivedBuildPlan.repoType = "NPM"
          derivedBuildPlan.npm.cypress = effectiveConfig.npm.cypress && packageJSON.devDependencies.cypress
          derivedBuildPlan.npm.eslint = packageJSON.devDependencies.eslint
          derivedBuildPlan.npm.tslint = packageJSON.devDependencies.tslint
        } else if (fileExists('pom.xml')) {
          derivedBuildPlan.repoType = "MAVEN"
          derivedBuildPlan.maven.integTest = effectiveConfig.maven.integTest && fileExists('src/integtest')
        } else {
          throw new RuntimeException('Unable to detect repoType')
        }
        println "Build plan: " + derivedBuildPlan
        deleteDir()
      }
      return derivedBuildPlan
    }

Public API

This is the public API: it is what I want the users of this library to actually invoke. I implement a buildRepo() method that will use that default configuration. It can also be called with a subset of the default configuration to tweak it. I create a vars/buildRepo.groovy with:

    def call(givenConfig = [:]) {
      def effectiveConfig = getConfig(givenConfig)
      def buildPlan = getBuildPlan(effectiveConfig)
      // repoType is detected in the build plan, it is not part of the config.
      if (buildPlan.repoType == 'MAVEN') {
        buildRepoMaven(buildPlan)
      } else if (buildPlan.repoType == 'NPM') {
        buildRepoNpm(buildPlan)
      }
    }

A user can get all the default behavior with:

    buildRepo()

A user can also choose not to run Cypress, even if it exists in the repository:

    buildRepo([
      "npm": [
        "cypress": false
      ]
    ])

Supporting Methods

This is usually much more complex, but I put some code here just to have a complete implementation.
I create a vars/buildRepoNpm.groovy with:

    def call(buildPlan = [:]) {
      node(buildPlan.jenkinsNode) {
        stage("Install") {
          sh "npm install"
        }
        stage("Build") {
          sh "npm run build"
        }
        if (buildPlan.npm.tslint) {
          stage("TSlint") {
            sh "npm run tslint"
          }
        }
        if (buildPlan.npm.eslint) {
          stage("ESlint") {
            sh "npm run eslint"
          }
        }
        if (buildPlan.npm.cypress) {
          stage("Cypress") {
            sh "npm run e2e:cypress"
          }
        }
      }
    }

I create a vars/buildRepoMaven.groovy with:

    def call(buildPlan = [:]) {
      node(buildPlan.jenkinsNode) {
        if (buildPlan.maven.integTest) {
          stage("Verify") {
            sh "mvn verify"
          }
        } else {
          stage("Package") {
            sh "mvn package"
          }
        }
      }
    }

Duplication

The Jenkinsfiles are kept extremely small. They only need to change when, for some reason, they diverge from the default config.

Documentation

There is one single place where documentation is written: the getConfig.groovy file. It can be referred to whenever someone asks for documentation.

Scalability

This is a highly scalable pattern, both with regard to performance and to maintainability of the code. It scales in performance because the Jenkinsfiles can be used by any Jenkins installation, so you can scale by adding several completely separate Jenkins installations, not only nodes. It scales in code because it adds just a tiny Jenkinsfile to repositories. It relies on conventions instead, like the existence of attributes in package.json and the location of integration tests in src/integtest.

Installation Agnostic

The Jenkinsfiles do not point at any implementation of this API. They just invoke it, and it is up to the Jenkins installation to implement it with a shared library. It can even be used by something that is not Jenkins. Perhaps you decide to do something in a Docker container; you can still parse the Jenkinsfile with Groovy or (with some magic) with any language.
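To illustrate the point about parsing the Jenkinsfile outside Jenkins, here is a minimal sketch using plain Groovy's GroovyShell and Binding. It assumes the Jenkinsfile contains nothing but a buildRepo(...) call, and it supplies its own buildRepo closure that merely records the configuration; the variable names are hypothetical.

```groovy
// Record every configuration passed to buildRepo() instead of running a build.
def calls = []
def binding = new Binding()
binding.setVariable('buildRepo', { Map config = [:] -> calls << config })

// In practice this text would be read from the repository's Jenkinsfile.
def jenkinsfileText = 'buildRepo([ "npm": [ "cypress": false ] ])'

// Evaluating the script calls the closure bound under the name buildRepo.
new GroovyShell(binding).evaluate(jenkinsfileText)

assert calls == [[npm: [cypress: false]]]
```

Whatever evaluates the file decides what buildRepo means, which is why the same tiny Jenkinsfile can be served by different installations or tools.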
Feature Toggling

The shared library can do feature toggling by:

- Letting some feature be enabled by default for every repository with a name starting with x.
- Or adding some default config saying "feature-x-enabled": false, while some repos change their Jenkinsfiles to buildRepo(["feature-x-enabled": true]).

Whenever the feature feels stable, it can be enabled for everyone by changing only the shared library.
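The two toggling options can also be combined in one small decision function. The sketch below is an assumption about how that might look, not code from the library: the helper name is hypothetical, while the "feature-x-enabled" key and the "x" name prefix come from the examples above.

```groovy
// Hypothetical helper: decide whether feature x runs for a repository.
boolean isFeatureXEnabled(String repoName, Map givenConfig = [:]) {
    // An explicit setting in the repository's Jenkinsfile always wins.
    if (givenConfig.containsKey('feature-x-enabled')) {
        return givenConfig['feature-x-enabled'] as boolean
    }
    // Otherwise the feature is on only for repositories named x*.
    return repoName.startsWith('x')
}

assert isFeatureXEnabled('x-service')
assert !isFeatureXEnabled('billing')
assert isFeatureXEnabled('billing', ['feature-x-enabled': true])
assert !isFeatureXEnabled('x-service', ['feature-x-enabled': false])
```

Once the feature is considered stable, replacing the final return with `return true` in the shared library enables it everywhere without touching any Jenkinsfile.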
Tomas Bjerre
Dec
14
Generic Webhook Trigger Plugin
Table of Contents
- The Problem
  - Code Duplication And Security
  - A Branch Is Not A Feature
  - Documentation
- The Solution
  - Code Duplication And Security
  - A Branch Is Not A Feature
  - Documentation

This post describes some common problems I’ve had with Jenkins and how I solved them by developing the Generic Webhook Trigger Plugin.

The Problem

I was often struggling with the same issues when working with Jenkins:

- Code duplication and security - Jenkinsfiles in every repository.
- A branch is not a feature - Parameterized jobs on the master branch often mix parameters relevant for different features.
- Poorly documented trigger plugins - Properly documented services but poorly documented consuming plugins.

Code Duplication And Security

Having Jenkinsfiles in every Git repository allows developers to let those files diverge. Developers push forward with their projects and it is hard to maintain patterns to share code. I have, almost, solved code duplication with shared libraries, but they do not allow me to set up a strict pattern that must be followed. Any developer can still decide not to invoke the features provided by the shared library.

There is also the security aspect of letting developers run any code from the Jenkinsfiles. Developers might, for example, print passwords gathered from credentials. Letting developers execute any code on the Jenkins nodes just does not seem right to me.

A Branch Is Not A Feature

In Bitbucket there are projects, and each project has a collection of git repositories. Something like this:

    PROJ_1
      REPO_1
      REPO_2
    PROJ_2
      REPO_3

Let’s think about some features we want to provide for these repositories:

- Pull request verification
- Building snapshots (or pre-releases, if you will)
- Building releases

If the developers are used to the repositories being organized like this in Bitbucket, should we not organize them the same way in Jenkins? And if they browse Jenkins should they not find one job per feature, like pull-request, snapshot and release?
Each job with parameters only relevant for that feature. I think so! Like this:

    /                             - Jenkins root
    /PROJ_1                       - A folder, lists git repositories
    /PROJ_1/REPO_1                - A folder, lists jobs relevant for that repo.
    /PROJ_1/REPO_1/release        - A job, performs releases.
    /PROJ_1/REPO_1/snapshot       - A job, performs snapshot releases.
    /PROJ_1/REPO_1/pull-request   - A job, verifies pull requests.
    …

In this example, both snapshot and release jobs might work with the same git branch. The difference is the feature they provide. Their parameters can be well documented as you don’t have to mix parameters relevant for releases and those relevant for snapshots. This cannot be done with Multibranch Pipeline Plugin, where you specify parameters as properties per branch.

Documentation

Webhooks are often well documented in the services providing them. See: Bitbucket Cloud, Bitbucket Server, GitHub, GitLab, Gogs and Gitea, Assembla, Jira.

It bothered me that, even if I understood these webhooks, I was unable to use them, because I needed to do development in the plugin I was using in order to provide whatever value from the webhook to the build. That process could take months from PR to actual release. Such a simple thing should really not be an issue.

The Solution

My solution is pretty much back to basics: we have an automation server (Jenkins) and we want to trigger it on external webhooks. We want to gather information from that webhook and provide it to our build. In order to support it I have created the Generic Webhook Trigger Plugin. The latest docs are available in the repo and I also have a fully working example with GitLab implemented using configuration-as-code. See the repository here.

Code Duplication And Security

I establish a convention that all developers must follow, instead of letting the developers explicitly invoke the infrastructure from Jenkinsfiles. There are rules to follow, like:

- All git repositories should be built from the root of the repo.
- If it contains a gradlew:
  - Build is done with ./gradlew build
  - Release is done with ./gradlew release
  - … and so on
- If it contains a package.json:
  - Build is done with npm run build
  - Release is done with npm run release
  - … and so on

With these rules, pipelines can be totally generic and no Jenkinsfiles are needed in the repositories. Some git repositories may, for some reason, need to disable test cases. That can be solved by allowing repositories to add a special file, perhaps jenkins-settings.json, and letting the infrastructure discover and act on its content.

This also helps the developers even when not doing CI. When they clone a repository that is new and unknown to them, they will know what commands can be issued and what they mean.

A Branch Is Not A Feature

I implement:

- Jenkins job configurations - With Job DSL.
- Jenkins build process - With Pipelines and Shared Library.

By integrating with the git service from Job DSL I can automatically find the git repositories. I create jobs dynamically, organized in folders, and also invoke the git service to set up webhooks triggering those jobs. The jobs are ordinary pipelines, not multibranch, and they don’t use a Jenkinsfile from Git but instead a Jenkinsfile configured in the job using Job DSL, so that all job configurations and pipelines are under version control. This is all happening here.

Documentation

The plugin uses JSONPath, and also XPath, to extract values from the webhook and provide them to the build, letting the user pick whatever is needed from the webhook. It also has a regular expression filter to allow not triggering for some conditions.

The plugin is not very big, just being the glue between the webhook, JSONPath / XPath and regular expressions. All these parts are very well documented already and I do my best supporting the plugin. That way this is a very well documented solution to use!
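As an illustration of the JSONPath extraction and the regular expression filter described above, here is a minimal declarative Pipeline sketch using the plugin's GenericTrigger. The token value, the `$.ref` payload field, and the branch pattern are assumptions chosen for the example; the plugin's own documentation is the authority on the available options.

```groovy
pipeline {
    agent any
    triggers {
        GenericTrigger(
            // Pick values out of the JSON payload with JSONPath.
            // '$.ref' assumes a push-style payload with a top-level "ref" field.
            genericVariables: [
                [key: 'ref', value: '$.ref']
            ],
            // Hypothetical token that selects this job.
            token: 'my-job-token',
            causeString: 'Triggered on $ref',
            // Trigger only when the filter text matches the expression.
            regexpFilterText: '$ref',
            regexpFilterExpression: '^refs/heads/main$'
        )
    }
    stages {
        stage('Show') {
            steps {
                // Extracted values are contributed to the build as variables.
                sh 'echo "Building $ref"'
            }
        }
    }
}
```

The same trigger can be configured from Job DSL, which fits the setup above where the Jenkinsfile lives in the job rather than in the repository.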
Tomas Bjerre
Sept
10
Scaling Network Connections from the Jenkins Controller
Oleg Nenashev and I will be speaking at DevOps World | Jenkins World in San Francisco this year about Scaling Network Connections from the Jenkins Controller. Over the years there have been many efforts to analyze, optimize, and fortify the “Remoting channel” that allows a controller to orchestrate agent activity and receive build results. Techniques such as tuning the agent launcher can improve service, but qualitative change can only come from fundamentally reworking what gets transmitted and how.

In March, JENKINS-27035 introduced a framework for inspecting the traffic on a Remoting channel at a high level. Previously, developers could only use generic low-level tools such as Wireshark, which cannot identify the precise piece of Jenkins code responsible for traffic.

Over the past few months, the Cloud Native SIG has been making progress in addressing root causes. The Artifact Manager on S3 plugin has been released and integrated with Jenkins Evergreen, allowing upload and download of large artifacts to happen entirely between the agent and Amazon servers. Prototype plugins allow all build log content generated by an agent (such as in sh steps) to be streamed directly to external storage services such as AWS CloudWatch Logs. Work has also begun on uploading JUnit-format test results, which can sometimes get big, directly from an agent to database storage. All these efforts can reduce the load on the Jenkins controller and local network without requiring developers to touch their Pipeline scripts.

Other approaches are on the horizon. While “one-shot” agents run in fresh VMs or containers greatly improve reproducibility, they suffer from the need to transmit megabytes of Java code for every build, so Jenkins features will need to be built to precache most or all of it. Work is underway to use Apache Kafka to make channels more robust against network failures.
Most dramatically, the proposed Cloud Native Jenkins MVP would eliminate the bottleneck of a single Jenkins controller service handling hundreds of builds.

Come meet Jesse, Oleg, and other Cloud Native SIG members at Jenkins World on September 16-19th. Register with the code JWFOSS for a 30% discount off your pass.
Jesse Glick
Feb
22
Project Cheetah - Faster, Leaner Pipeline That Can Keep Up With Demand
Table of Contents
- Introducing "Project Cheetah"
- Yes, but what does it DO?
- How Do I Set Speed/Durability Settings?
  - 1. Globally, you can choose a global default durability setting:
  - 2. Each Pipeline can get a custom Durability Setting:
  - 3. Multibranch Projects can use a new BranchProperty to customize the Durability Setting.
- Will Performance-Optimized Mode Help Me?
- Other Goodies
- How Did You Do It?
- What Next?

Since it launched, Pipeline has had a bit of a Dr. Jekyll and Mr. Hyde performance problem. In certain circumstances, Pipeline can turn from a mild-mannered CI/CD assistant into a monster. It will happily eat storage read/write capacity like popcorn without caring about the other concerns of our friendly butler.

When combined with other additional factors, this can result in real-world stability problems. For example, combining slow storage with a spike in running Pipelines has brought down production Jenkins at more than one organization. Similarly, users see issues if a busy controller gets hit with an extra source of stress; past culprits have been heavy automated (ab)use of Jenkins APIs, now-solved user lookup bugs, backup jobs, and plugins run crazy that load excessive numbers of builds. Symptoms ranged from visible slowdowns in the UI to unresponsive jobs and "hung" controllers.

Now I’m not saying this to scare people or to criticize the technology we’ve built. Implementing Pipeline scalability best practices coupled with SSD storage keeps Jenkins in a happy place. We just need context on the weaknesses to see why it’s important to address them.

Introducing "Project Cheetah"

Today we’re announcing the first major results of "Project Cheetah", our long-running effort to address these challenges and improve Pipeline scalability. More broadly, Cheetah aims to help in 3 places:

- Small-scale containers: Pipeline needs to run leanly in resource-constrained containers, to enable easy scale-out without consuming excessive resources on shared container hosts.
- Enterprise systems: Pipeline needs to effectively serve high-scale Jenkins instances that are central to many large companies.
- General case: run Pipelines a bit more quickly on average, and allow users to get much stronger performance in worst-case scenarios.

These changes are implemented across many of the Pipeline plugins.

Yes, but what does it DO?

Project Cheetah offers several things, but the most important is Durability Settings for all Pipelines, and especially the Performance-Optimized setting. This setting avoids several potentially unexpected performance "surprises" that may strike users. In the general case, it greatly reduces the disk IO needs for Pipeline.

How much? Below is a graph of storage utilization with legacy Pipeline versions (think early 2017) and with the latest version using the Performance-Optimized mode. These are tested on an AWS instance backed by an EBS volume provisioned with 300 IOPS.

[Graph: storage utilization, before and after]

As you can see, storage utilization goes down by a lot. While the exact number will vary, across the benchmark test cases this results in Pipeline throughput of 2x to 6x the previous throughput before becoming IO-bound. This also increases stability of Jenkins controllers because they will tolerate unexpected load.

This comes with a major drop in CPU IOWait as well:

[Graph: CPU IOWait, before and after]

And of course the rate at which data is written to disk and the number of writes/s are also reduced:

[Graph: disk write rate and writes/s, before and after]

For enterprise users, timing stats often show that 10-20% of the time in normal builds is spent serializing the Program and writing the record of steps run ("FlowNodes"). The Performance-Optimized durability setting will cut this to almost nothing (for standard pipelines, 1/100 or less), so builds will complete faster, especially complex ones.

Please see the Pipeline Scalability documentation for deeper information on the new Durability Settings, how to use them, and which plugin versions are required to gain these features.
Also, users may see a reduction in hung Pipelines because new test utilities made it possible to identify and correct a variety of bugs.

How Do I Set Speed/Durability Settings?

There are 3 ways to configure the durability setting:

1. Globally, you can choose a global default durability setting:

Under "Manage Jenkins" > "Configure System", labelled "Pipeline Speed/Durability Settings". You can override these with the more specific settings below.

2. Each Pipeline can get a custom Durability Setting:

This is one of the job properties located at the top of the job configuration, labelled "Custom Pipeline Speed/Durability Level." This overrides the global setting. Or, use a "properties" step - the setting will apply to the NEXT run after the step is executed (same result).

Scripted:

    properties([durabilityHint('PERFORMANCE_OPTIMIZED')])

Declarative:

    pipeline {
        agent any
        stages {
            stage('Example') {
                steps {
                    echo 'Hello World'
                }
            }
        }
        options {
            durabilityHint('PERFORMANCE_OPTIMIZED')
        }
    }

3. Multibranch Projects can use a new BranchProperty to customize the Durability Setting.

Under the SCM you can configure a custom Branch Property Strategy and add a property for Custom Pipeline Speed/Durability Level. This overrides the global Durability Setting and will apply to each branch at the next run. You can also use a "properties" step to override the setting, but remember that you may have to run the step again to undo this.

Durability settings will take effect with the next applicable Pipeline run, not immediately. The setting will be displayed in the log.

There is a slight durability trade-off for using the Performance-Optimized mode; the appropriate section of the Pipeline Scalability documentation has the specifics. For most uses we do not expect this to be important, but there are a few specific cases where users may wish to use a slower/higher-durability setting. The Best Practices are documented.
We recommend using Performance-Optimized by default, but because it does represent a slight behavioral change, the initial "Cheetah" plugin releases default to maintaining the previous behavior. We expect to switch this default in the future, with appropriate notice, once people have a chance to get used to the new settings.

Will Performance-Optimized Mode Help Me?

- Yes, if your Jenkins instance uses NFS, magnetic storage, runs many Pipelines at once, or shows high iowait (above 5%).
- Yes, if you are running Pipelines with many steps (more than several hundred).
- Yes, if your Pipeline stores large files or complex data to variables in the script, keeps that variable in scope for future use, and then runs steps. This sounds oddly specific but happens more than you’d expect.
  - For example: a readFile step with a large XML/JSON file, or using configuration information from parsing such a file with one of the Utility Steps.
  - Another common pattern is a "summary" object containing data from many branches (logs, results, or statistics). Often this is visible because you’ll be adding to it often via add/append or Map.put() operations.
  - Large arrays of data or Maps of configuration information are another common example of this situation.
- No, if your Pipelines spend almost all their time waiting for a few shell/batch steps to finish. This ISN’T a magic "go fast" button for everything!
- No, if Pipelines are writing massive amounts of data to logs (logging is unchanged).
- No, if you are not using Pipelines, or your system is loaded down by other factors.
- No, if you don’t enable higher-performance modes for pipelines. See above for how!

Other Goodies

- Users can now set an optional job property so that individual Pipelines fail cleanly rather than resuming upon restarting the controller. This is useful for niche cases where some Pipelines are considered disposable and users would value a clean restart over Pipeline durability.
- We’ve reduced classloading and reflection quite significantly, which improves scaling and reduces CPU use:
  - Script Security (as of version 1.41) has gotten optimizations to reduce the performance overhead of Sandbox mode and eliminate lock contention so Pipeline multithreads better.
- Pipeline Step data uses up less space on disk (regardless of the durability setting) - this should be 30% smaller. Assume it’s a few MB per 1000 steps - but for every build after the change.
- Even in the low-performance/high-durability modes, some redundant writes have been removed, which decreases the number of writes by 10-20%.

How Did You Do It?

That’s probably material for another blog post or Jenkins World talk. The short answer is: first we built a tool to simulate a full production environment and provide detailed metrics collection at scale. Then we profiled Jenkins to identify bottlenecks and attacked them. Rinse and repeat.

What Next?

The next big change, which I’m calling Cheetah Part 2, is to address Pipeline’s logging. For every Step run, Pipeline writes one or more small log files. These log files are then copied into the build log content, but are retained to make it possible to easily fetch logs for each step. This copying process means every log line is written twice, greatly reducing performance, and writing to many small files is orders of magnitude slower than appending to one big log file.

We’re going to remove this duplication and data fragmentation and use a more efficient mechanism to find per-step logs. This should further improve the ability to run Pipelines on NFS mounts and hard-drive-backed storage, and should significantly improve performance at scale.

Besides this, there’s a variety of different tactical improvements to improve scaling behavior and reduce resource needs.

The Project Cheetah work doesn’t free users to completely ignore Pipeline scaling best practices and previous suggestions. Nor does it eliminate the need for efficient GC settings.
But this and other enhancements from the last year can significantly improve the storage situation for most users and reduce the penalties for worst-case behaviors. When you add all the pieces together, the result is a faster, leaner, more reliable Pipeline experience.
Sam Van Oort
Feb
1
Best Practices for Scalable Pipeline Code
This is a guest post by Sam Van Oort, Software Engineer at CloudBees and contributor to the Jenkins project.

Today I’m going to show you best practices to write scalable and robust Jenkins Pipelines. This is drawn from a combination of work with the internals of Pipeline and observations with large-scale users.

Pipeline code works beautifully for its intended role of automating build/test/deploy/administer tasks. As it is pressed into more complex roles and unanticipated uses, some users hit issues. In these cases, applying the best practices can make the difference between:

- A single controller running hundreds of concurrent builds on low end hardware (4 CPU cores and 4 GB of heap)
- Running a couple dozen builds and bringing a controller to its knees or crashing it…even with 16+ CPU cores and 20+ GB of heap!

This has been seen in the wild.

Fundamentals

To understand Pipeline behavior you must understand a few points about how it executes.

- Except for the steps themselves, all of the Pipeline logic, the Groovy conditionals, loops, etc. execute on the controller. Whether simple or complex! Even inside a node block!
- Steps may use executors to do work where appropriate, but each step has a small on-controller overhead too.
- Pipeline code is written as Groovy but the execution model is radically transformed at compile-time to Continuation Passing Style (CPS).
- This transformation provides valuable safety and durability guarantees for Pipelines, but it comes with trade-offs:
  - Steps can invoke Java and execute fast and efficiently, but Groovy is much slower to run than normal.
  - Groovy logic requires far more memory, because an object-based syntax/block tree is kept in memory.
- Pipelines persist the program and its state frequently to be able to survive failure of the controller.

From these we arrive at a set of best practices to make pipelines more effective.
Best Practices For Pipeline Code

- Think of Pipeline code as glue: just enough Groovy code to connect together the Pipeline steps and integrate tools, and no more.
  - This makes code easier to maintain, more robust against bugs, and reduces load on controllers.
- Keep it simple: limit the amount of complex logic embedded in the Pipeline itself (similarly to a shell script) and avoid treating it as a general-purpose programming language.
  - Pipeline restricts all variables to Serializable types, so keeping Pipeline logic simple helps avoid a NotSerializableException - see appendix at the bottom.
- Use @NonCPS-annotated functions for slightly more complex work. This means more involved processing, logic, and transformations. This lets you leverage additional Groovy & functional features for more powerful, concise, and performant code.
  - This still runs on controllers so be mindful of complexity, but it is much faster than native Pipeline code because it doesn’t provide durability and uses a faster execution model. Still, be mindful of the CPU cost and offload to executors for complex work (see below).
  - @NonCPS functions can use a much broader subset of the Groovy language, such as iterators and functional features, which makes them more terse and fast to write.
  - @NonCPS functions should not use Pipeline steps internally; however, you can store the result of a Pipeline step to a variable and use that as the input to a @NonCPS function.
    - Gotcha: It’s not guaranteed that use of a step will generate an error (there is an open RFE to implement that), but you should not rely on that behavior. You may see improper handling of exceptions, in particular.
  - While normal Pipeline is restricted to serializable local variables (see appendix at bottom), @NonCPS functions can use more complex, nonserializable types internally (for example regex matchers, etc). Parameters and return types should still be Serializable, however.
    - Gotcha: improper usages are not guaranteed to raise an error with normal Pipeline (optimizations may mask the issue), but it is unsafe to rely on this behavior.
- Prefer external scripts/tools for complex or CPU-expensive processing rather than Groovy language features. This offloads work from the controller to external executors, allowing for easy scale-out of hardware resources. It is also generally easier to test because these components can be tested in isolation without the full on-controller execution environment.
  - Many software vendors will provide easy command-line clients for their tools in various programming languages. These are often robust, performant, and easy to use. Plugins offer another option (see below).
  - Shell or batch steps are often the easiest way to integrate these tools, which can be written in any language. For example: sh "java -jar client.jar $endPointUrl $inputData" for a Java client, or sh "python jiraClient.py $issueId $someParam" for a Python client.
  - Gotcha: especially avoid Pipeline XML or JSON parsing using Groovy’s XmlSlurper and JsonSlurper! Strongly prefer command-line tools or scripts.
    - The Groovy implementations are complex and as a result more brittle in Pipeline use.
    - XmlSlurper and JsonSlurper can carry a high memory and CPU cost in pipelines.
    - xmllint and xmlstarlet are command-line tools offering XML extraction via xpath.
    - jq offers the same functionality for JSON.
    - These extraction tools may be coupled to curl or wget for fetching information from an HTTP API.
  - Examples of other places to use command-line tools:
    - Templating large files
    - Nontrivial integration with external APIs (for bigger vendors, consider a Jenkins plugin if a quality offering exists)
    - Simulations/complex calculations
    - Business logic
- Consider existing plugins for external integrations. Jenkins has a wealth of plugins, especially for source control, artifact management, deployment systems, and systems automation.
These can greatly reduce the amount of Pipeline code to maintain.
  - Well-written plugins may be faster and more robust than Pipeline equivalents.
  - Consider both plugins and command-line clients (above); one may be easier than the other.
  - Plugins may be of widely varying quality. Look at the number of installations and how frequently and recently updates appear in the changelog. Poorly-maintained plugins with limited installations may actually be worse than writing a little custom Pipeline code.
  - As a last resort, if there is a good-quality plugin that is not Pipeline-enabled, it is fairly easy to write a Pipeline wrapper to integrate it or write a custom step that will invoke it.
- Assume things will go wrong: don’t rely on workspaces being clean of the remnants from previous executions, clean explicitly where needed. Make use of timeouts and retry steps (that’s what they’re there for).
  - Within a git repository, git clean -fdx is a good way to accomplish this and reduces the amount of SCM cloning.
- DO use parameterized Pipelines and variables to make your Pipeline scripts more reusable. Passing in parameters is especially helpful for handling different environments and should be preferred to applying conditional lookup logic; however, try to limit parameterized pipelines invoking each other.
- Try to limit business logic embedded in Pipelines. To some extent this is inevitable, but try to focus on tasks to complete instead, because this yields more maintainable, reusable, and often more performant Pipeline code.
  - One code smell that points to a problem is many hard-coded constants.
  - Consider taking advantage of the options above to refactor code for better composability.
  - For complex cases, consider using Jenkins integration options (plugins, Jenkins API calls, invoking input steps externally) to offload implementation of more complex business rules to an external system if they fit more naturally there.
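Several of the practices above can be combined in a short scripted Pipeline. The sketch below is illustrative only: it assumes the workspace already holds a git checkout, that curl and jq are installed on the agent, and that an API_URL environment variable points at an endpoint returning JSON with a version field. All of those are hypothetical.

```groovy
node {
    // Clean leftovers from previous runs instead of assuming a fresh workspace.
    sh 'git clean -fdx'

    // Bound the total time spent and retry the flaky network call.
    timeout(time: 10, unit: 'MINUTES') {
        retry(3) {
            // Fetch on the agent with a command-line client.
            sh 'curl -fsS -o response.json "$API_URL"'
        }
    }

    // Parse JSON on the agent with jq rather than JsonSlurper on the
    // controller; only a short String comes back into Groovy.
    def version = sh(script: 'jq -r .version response.json', returnStdout: true).trim()
    echo "Remote version: ${version}"
}
```

The Groovy here is only glue: every expensive operation happens in a shell step on the agent.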
Please think of these as guidelines, not strict rules: Jenkins Pipeline provides a great deal of power and flexibility, and it’s there to be used. Breaking enough of these rules at scale can cause controllers to fail by placing an unsustainable load on them. For additional guidance, I also recommend the Jenkins World talk on how to engineer Pipelines for speed and performance.

Appendix: Serializable vs. Non-Serializable Types
To assist with Pipeline development, here are common serializable and non-serializable types, to help decide whether your logic can run as CPS code or should be in a @NonCPS function to avoid issues.

Common Serializable Types (safe everywhere):
- All primitive types and their object wrappers: byte, boolean, int, double, short, char
- Strings
- enums
- Arrays of serializable types
- ArrayLists and normal Groovy Lists
- Sets: HashSet
- Maps: normal Groovy Map, HashMap, TreeMap
- Exceptions
- URLs
- Dates
- Regex Patterns (compiled patterns)

Common non-Serializable Types (only safe in @NonCPS functions):
- Iterators: this is a common problem. You need to use a C-style loop, i.e. for(int i=0; i
- Regex Matchers (you can use the built-in functions in String, etc., just not the Matcher itself)
- Important: JsonObject, JsonSlurper, etc. in Groovy 2+ (used in some 2.x+ versions of Jenkins). This is due to an internal implementation change; earlier versions may serialize.
Sam Van Oort
Nov
21
Tuning Jenkins GC For Responsiveness and Stability with Large Instances
This is a cross post by Sam Van Oort, Software Engineer at CloudBees and contributor to the Jenkins project.

Today I’m going to show you how easy it is to tune Jenkins Java settings to make your controllers more responsive and stable, especially with large heap sizes.

The Magic Settings:
- Basics: -server -XX:+AlwaysPreTouch
- GC Logging: -Xloggc:$JENKINS_HOME/gc-%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=20m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy
- G1 GC settings: -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1
- Heap settings: set your minimum heap size (-Xms) to at least 1/2 of your maximum size (-Xmx).

Now, let’s look at where those came from! We’re going to focus on garbage collection (GC) here and dig fast and deep to strike for gold; if you’re not familiar with GC fundamentals take a look at this source. Because performance tuning is data driven, I’m going to use real-world data selected from three very large Jenkins instances that I help support.

What we’re not going to do: Jenkins basics, or play with max heap. See the section "what should I do before tuning." This is for cases where we really do need a big heap and can’t easily split our Jenkins controllers into smaller ones.

The Problem: Hangups
Symptom: Users report that the Jenkins instance periodically hangs, taking several seconds to handle normally fast requests. We may even see lockups or timeouts from systems communicating with the Jenkins controller (build agents, etc). In long periods of heavy load, users may report Jenkins running slowly.
Application monitoring shows that during lockups all or most of the CPU cores are fully loaded, but there’s not enough activity to justify it. Process and JStack dumps will reveal that the most active Java threads are doing garbage collection.

Instance A had this problem. Their Jenkins Java arguments are very close to the default, aside from sizing the heap:
- 24 GB max heap, 4 GB initial, default GC settings (ParallelGC)
- A few flags set (some coming in as defaults): -XX:-BytecodeVerificationLocal -XX:-BytecodeVerificationRemote -XX:+ReduceSignalUsage -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:-UseLargePagesIndividualAllocation

After enabling garbage collection (GC) logging and charting the GC pause durations, we see the following key stats:
- Throughput: 99.64% (percent of time spent executing application code, not doing garbage collection)
- Average GC time: 348 ms (ugh!)
- GC cycles over 2 seconds: 36 (2.7%)
- Minor/Full GC average time: 263 ms / 2.803 sec
- Object creation & promotion rate: 42.4 MB/s & 1.99 MB/s

Explanations: As you can see, young GC cycles very quickly clear away freshly-created garbage, but the deeper old-gen GC cycles run very slowly: 2-4 seconds. This is where our problems happen. The default Java garbage collection algorithm (ParallelGC) pauses everything when it has to collect garbage (often called a "stop the world pause"). During that period, Jenkins is fully halted. Normally (with small heaps) these pauses are too brief to be an issue. With heaps of 4 GB or larger, the time required becomes long enough to be a problem: several seconds over short windows, and over a longer interval you occasionally see much longer pauses (tens of seconds, or minutes). This is where the user-visible hangs and lock-ups happen. It also adds significant latency to those build/deploy tasks. In periods of heavy load, the system was even experiencing hangs of 30+ seconds for a single full GC cycle.
This was long enough to trigger network timeouts (or internal Jenkins thread timeouts) and cause even larger problems.

Fortunately there’s a solution: the concurrent low-pause garbage collection algorithms, Concurrent Mark Sweep (CMS) and Garbage First (G1). These attempt to do much of the garbage collection concurrently with application threads, resulting in much shorter pauses (at a slight cost in extra CPU use). We’re going to focus on G1, because it is slated to become the default in Java 9 and is the official recommendation for large heap sizes.

Let’s see what happens when someone uses G1 on a similarly-sized Jenkins controller with Instance B (17 GB heap):

Their settings:
- 16 GB max heap, 0.5 GB initial size
- Java flags (mostly defaults, except for G1): -XX:+UseG1GC -XX:+UseCompressedClassPointers -XX:+UseCompressedOops

And the GC log analysis. Key stats:
- Throughput: 98.76% (not great, but still only slowing things down a bit)
- Average GC time: 128 ms
- GC cycles over 2 seconds: 11 (0.27%)
- Minor/Full GC average time: 122 ms / 1.232 sec
- Object creation & promotion rate: 132.53 MB/s & 522 kB/s

Okay, much better: some improvement may be expected from a 30% smaller heap, but not as much as we’ve seen. Most of the GC pauses are well under 2 seconds, but we have 11 outliers: long Full GC pauses of 2-12 seconds. Those are troubling; we’ll take a deeper dive into their causes in a second. First, let’s look at the big picture and at how Jenkins behaves with G1 GC for a second instance.

G1 Garbage Collection with Instance C (24 GB heap):

Their settings:
- 24 GB max heap, 24 GB initial heap, 2 GB max metaspace
- Some custom flags: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseStringDeduplication -XX:+UseCompressedClassPointers -XX:+UseCompressedOops

Clearly they’ve done some garbage collection tuning and optimization. The AlwaysPreTouch flag pre-zeros allocated heap pages, rather than waiting until they’re first used.
This is suggested especially for large heap sizes, because it trades slightly slower startup times for improved runtime performance. Note also that they pre-allocated the whole heap. This is a common optimization.

They also enabled StringDeduplication, a G1 option introduced in Java 8 Update 20 that transparently replaces identical character arrays with pointers to the original, reducing memory use (and improving cache performance). Think of it like String.intern() but it silently happens during garbage collection. This is a concurrent operation added on to normal GC cycles, so it doesn’t pause the application. We’ll look at its impacts later.

Looking at the basics: it is a similar picture to Instance B, but it’s hidden by the sheer number of points (this is a longer period here, 1 month). Those same occasional Full GC outliers are present!

Key stats:
- Throughput: 99.93%
- Average GC time: 127 ms
- GC cycles over 2 seconds: 235 (1.56%)
- Minor/Full GC average time: 56 ms / 3.97 sec
- Object creation & promotion rate: 34.06 MB/s & 286 kB/s

Overall fairly similar to Instance B: ~100 ms GC cycles, and all the minor GC cycles are very fast. Object promotion rates sound similar.

Remember those random long pauses? Let’s find out what caused them and how to get rid of them. Instance B had 11 super-long pause outliers. Let’s get some more detail by opening the GC logs in GCViewer. This tool gives a tremendous amount of information. Too much, in fact: I prefer to use GCEasy.io except where needed. Since GC logs do not contain compromising information (unlike heap dumps or some stack traces), web apps are a great tool for analysis.

What we care about are the Full GC times in the middle (highlighted). See how much longer they are vs. the young and concurrent GC cycles up top (2 seconds or less)?

Now, I lied a bit earlier, sorry! For concurrent garbage collectors, there are actually 3 modes: young GC, concurrent GC, and full GC.
Concurrent GC replaces the Full GC mode in Parallel GC with a faster concurrent operation that runs in parallel with the application. But in a few cases, we are forced to fall back to a non-concurrent Full GC operation, which will use the serial (single-threaded) garbage collector. That means that even if we have 30+ CPU cores, only one is working. This is what is happening here, and on a large-heap, multicore system it is S L O W. How slow? 280 MB/s vs. 12487 MB/s for Instance B (for instance C, the difference is also about 50:1).

What triggers a full GC instead of concurrent:
- Explicit calls to System.gc() (most common culprit, often tricky to trace down)
- Metadata GC Threshold: Metaspace (used for Class data mostly) has hit the defined size to force garbage collection or increase it. Documentation is terrible for this; Stack Overflow will be your friend.
- Concurrent mode failure: concurrent GC can’t complete fast enough to keep up with objects the application is creating (there are JVM arguments to trigger concurrent GC earlier)

How do we fix this?

For explicit GC:
- -XX:+DisableExplicitGC will turn off Full GC triggered by System.gc(). Often set in production, but the below option is safer.
- We can trigger a concurrent GC in place of a full one with -XX:+ExplicitGCInvokesConcurrent. This will take the explicit call as a hint to do deeper cleanup, but with less performance cost.

Gotcha for people who’ve used CMS: if you have used CMS in the past, you may have used the option -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses, which does what it says. This option will silently fail in G1, meaning you still see the very long pauses from Full GC cycles as if it wasn’t set (no warning is generated). I have logged a JVM bug for this issue.

For the Metadata GC threshold: Increase your initial metaspace to the final amount to avoid resizing.
For example: -XX:MetaspaceSize=500M

Instance C also suffered the same problem with explicit GC calls, with almost all our outliers accounted for (230 out of 235) by slow, nonconcurrent Full GC cycles (all from explicit System.gc() calls, since they tuned metaspace).

Here’s what GC pause durations look like if we remove the log entries for the explicit System.gc() calls, assuming that they’ll blend in with the other concurrent GC pauses (not 100% accurate, but a good approximation):

Instance B: The few long Full GC cycles at the start are from metaspace expansion; they can be removed by increasing initial Metaspace size, as noted above. The spikes? That’s when we’re about to resize the Java heap, and memory pressure is high. You can avoid this by setting the minimum/initial heap to at least half of the maximum, to limit resizing.

Stats:
- Throughput: 98.93%
- Average GC time: 111 ms
- GC cycles over 2 seconds: 3
- Minor & Full or concurrent GC average time: 122 ms & 25 ms (yes, faster than minor!)
- Object creation & promotion rate: 132.07 MB/s & 522 kB/s

Instance C:

Stats:
- Throughput: 99.97%
- Average GC time: 56 ms
- GC cycles over 2 seconds: 0 (!!!)
- Minor & Full or concurrent GC average time: 56 ms & 10 ms (yes, faster than minor!)
- Object creation & promotion rate: 33.31 MB/s & 286 kB/s
- Side point: GCViewer is claiming GC performance of 128 GB/s (not unreasonable, we clear ~10 GB of young generation in under 100 ms usually)

Okay, so we’ve tamed the long worst-case pauses!

But What About Those Long Minor GC Pauses We Saw?
Okay, now we’re in the home stretch! We’ve tamed the old-generation GC pauses with concurrent collection, but what about those longer young-generation pauses? Let’s look at stats for the different phases and causes again in GCViewer.

Highlighted in yellow we see the culprit: the remark phase of G1 garbage collection. This stop-the-world phase ensures we’ve identified all live objects, and processes references (more info).
Let’s look at a sample execution to get more info:

2016-09-07T15:28:33.104+0000: 26230.652: [GC remark 26230.652: [GC ref-proc, 1.7204585 secs], 1.7440552 secs]
 [Times: user=1.78 sys=0.03, real=1.75 secs]

This turns out to be typical for the GC log: the longest pauses are spent in reference processing. This is not surprising because Jenkins internally uses references heavily for caching, especially weak references, and the default reference processing algorithm is single-threaded. Note that user (CPU) time matches real time, and it would be higher if we were using multiple cores.

So, we add the GC flag -XX:+ParallelRefProcEnabled which enables us to use the multiple cores more effectively.

Tuning young-generation GC further based on Instance C:
Back to GCViewer we go, to see what’s time consuming with the GC for Instance C.

That’s good, because most of the time is just sweeping out the trash (evacuation pause). But the 1.8 second pause looks odd. Let’s look at the raw GC log for the longest pause:

2016-09-24T16:31:27.738-0700: 106414.347: [GC pause (G1 Evacuation Pause) (young), 1.8203527 secs]
   [Parallel Time: 1796.4 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 106414348.2, Avg: 106414348.3, Max: 106414348.6, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 0.3, Avg: 1.7, Max: 5.7, Diff: 5.4, Sum: 14.0]
      [Update RS (ms): Min: 0.0, Avg: 7.0, Max: 19.6, Diff: 19.6, Sum: 55.9]
         [Processed Buffers: Min: 0, Avg: 45.1, Max: 146, Diff: 146, Sum: 361]
      [Scan RS (ms): Min: 0.2, Avg: 0.4, Max: 0.7, Diff: 0.6, Sum: 3.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
      [Object Copy (ms): Min: 1767.1, Avg: 1784.4, Max: 1792.6, Diff: 25.5, Sum: 14275.2]
      [Termination (ms): Min: 0.3, Avg: 2.4, Max: 3.5, Diff: 3.2, Sum: 19.3]
         [Termination Attempts: Min: 11, Avg: 142.5, Max: 294, Diff: 283, Sum: 1140]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.3, Sum: 0.8]
      [GC Worker Total (ms): Min: 1795.9, Avg: 1796.1, Max: 1796.2, Diff: 0.3, Sum: 14368.9]
      [GC Worker End (ms): Min: 106416144.4, Avg: 106416144.5, Max: 106416144.5, Diff: 0.1]

…oh, well dang. Almost the entire time (1.792 s out of 1.820) is walking through the live objects and copying them. And wait, what about this line, showing the summary statistics:

Eden: 13.0G(13.0G)->0.0B(288.0M) Survivors: 1000.0M->936.0M Heap: 20.6G(24.0G)->7965.2M(24.0G)]

Good grief, we flushed out 13 GB (!!!) of freshly-allocated garbage in one swoop and compacted the leftovers! No wonder it was so slow. I wonder how we accumulated so much…

Oh, right… we set up for 24 GB of heap initially, and each minor GC clears most of the young generation. Okay, so we’ve set aside tons of space for trash to collect, which means longer but less frequent GC periods. This also gets the best performance from Jenkins memory caches which are using WeakReferences (survives until collected by GC) and SoftReferences (more long-lived). Those caches boost performance a lot.

We could take actions to prevent those rare longer pauses. The best ways are to limit total heap size or reduce the value of -XX:MaxGCPauseMillis from its default (200). A more advanced way (if those don’t help enough) is to explicitly set the maximum size of the young generation smaller (say -XX:G1MaxNewSizePercent=45 instead of the default of 60). We could also throw more CPUs at the problem.

But if we look up, most pauses are around 100 ms (200 ms is the default value for MaxGCPauseMillis). For Jenkins on this hardware, this appears to work just fine and a rare longer pause is OK as long as they don’t get too big. Also remember, if this happens often, G1 GC will try to autotune for lower pauses and more predictable performance.

A Few Final Settings
We mentioned StringDeduplication was on with Instance C, what is the impact?
This only triggers on Strings that have survived a few generations (most of our garbage does not), has limits on the CPU time it can use, and replaces duplicate references to their immutable backing character arrays. For more info, look here. So, we should be trading a little CPU time for improved memory efficiency (similarly to string interning).

At the beginning, this has a huge impact:

[GC concurrent-string-deduplication, 375.3K->222.5K(152.8K), avg 63.0%, 0.0024966 secs]
[GC concurrent-string-deduplication, 4178.8K->965.5K(3213.2K), avg 65.3%, 0.0272168 secs]
[GC concurrent-string-deduplication, 36.1M->9702.6K(26.6M), avg 70.3%, 0.0965196 secs]
[GC concurrent-string-deduplication, 4895.2K->394.9K(4500.3K), avg 71.9%, 0.0114704 secs]

This peaks at an average of about 90%.

After running for a month, there is less of an impact, since many of the strings that can be deduplicated already are:

[GC concurrent-string-deduplication, 138.7K->39.3K(99.4K), avg 68.2%, 0.0007080 secs]
[GC concurrent-string-deduplication, 27.3M->21.5M(5945.1K), avg 68.1%, 0.0554714 secs]
[GC concurrent-string-deduplication, 304.0K->48.5K(255.5K), avg 68.1%, 0.0021169 secs]
[GC concurrent-string-deduplication, 748.9K->407.3K(341.7K), avg 68.1%, 0.0026401 secs]
[GC concurrent-string-deduplication, 3756.7K->663.1K(3093.6K), avg 68.1%, 0.0270676 secs]
[GC concurrent-string-deduplication, 974.3K->17.0K(957.3K), avg 68.1%, 0.0121952 secs]

However, it’s cheap to use: on average, each dedup cycle takes 8.8 ms and removes 2.4 kB of duplicates. The median takes 1.33 ms and removes 17.66 kB from the old generation. A small change per cycle, but in aggregate it adds up quickly: in periods of heavy load, this can save hundreds of megabytes of data. But that’s still small, relative to multi-GB heaps.

Conclusion: turn string deduplication on. String deduplication is fairly cheap to use, and reduces the steady-state memory needed for Jenkins.
That frees up more room for the young generation, and should overall reduce GC time by removing duplicate objects. I think it’s worth turning on.

Soft reference flushing: Jenkins uses soft references for caching build records and in pipeline FlowNodes. The only guarantee for these is that they will be removed instead of causing an OutOfMemoryError… however Java applications can slow to a crawl from memory pressure long before that happens. There’s an option that provides a hint to the JVM based on time & free memory, controlled by -XX:SoftRefLRUPolicyMSPerMB (default 1000). The SoftReferences become eligible for garbage collection after this many milliseconds have elapsed since last touch… per MB of unused heap (vs the maximum). The referenced objects don’t count towards that target. So, with 10 GB of heap free and the default 1000 ms setting, soft references stick around for ~2.8 hours (!).

If the system is continuously allocating more soft references, it may trigger heavy GC activity, rather than clearing out soft references. See the open bug JDK-6912889 for more details.

If Jenkins consumes excessive old generation memory, it may help to make soft references easier to flush by reducing -XX:SoftRefLRUPolicyMSPerMB from its default (1000) to something smaller (say 10-200). The catch is that SoftReferences are often used for objects that are relatively expensive to load, such as lazy-loaded build records and pipeline FlowNode data.

Caveats
G1 vs. CMS: G1 was available on later releases of JRE 7, but unstable and slow. If you use it you absolutely must be using JRE 8, and the later the release the better (it’s gotten a lot of patches). Googling around will show horrible G1 vs CMS benchmarks from around 2014: these are probably best ignored, since the G1 implementation was still immature then. There’s probably a niche for CMS use still, especially on midsized heaps (1-3 GB) or where settings are already tuned.
With appropriate tuning it can still perform generally well for Jenkins (which mostly generates short-lived garbage), but CMS will eventually suffer from heap fragmentation and need a slow, non-concurrent Full GC to clear this. It also needs considerably more tuning than G1.

General GC tuning caveats: No single setting is perfect for everybody. We avoid tweaking settings that we don’t have strong evidence for here, but there are of course many additional settings to tweak. One shouldn’t change them without evidence though, because it can cause unexpected side effects. The GC logs we enabled earlier will collect this evidence. The only setting that jumps out as a likely candidate for further tuning is G1 region size (too small and there are many humongous object allocations, which hurt performance). Running on smaller systems, I’ve seen evidence that regions shouldn’t be smaller than 4 MB because there are 1-2 MB objects allocated somewhat regularly, but it’s not enough to make solid guidance without more data.

What Should I Do Before Tuning Jenkins GC:
If you’ve seen Stephen Connolly’s excellent Jenkins World talk, you know that most Jenkins instances can and should get by with 4 GB or less of allocated heap, even up to very large sizes. You will want to turn on GC logging (suggested above) and look at stats over a few weeks (remember GCeasy.io). If you’re not seeing periodic longer pause times, you’re probably okay.

For this post we assume we’ve already done the basic performance work for Jenkins:
- Jenkins is running on fast, SSD-backed storage.
- We’ve set up build rotation for Jobs, to delete old builds so they don’t pile up.
- The weather column is already disabled for folders.
- All builds/deploys are running on build agents, not on the controller. If the controller has executors allocated, they are exclusively used for backup tasks.
- We’ve verified that Jenkins really does need the large heap size and can’t easily be split into separate controllers.
If not, we need to do that FIRST before looking at GC tuning, because those will have larger impacts.

Conclusions
We’ve gone from:
- Average 350 ms pauses (bad user experience) including less frequent 2+ second generation pauses
- To an average pause of ~50 ms, with almost all under 250 ms
- Reduced total memory footprint from String deduplication

How:
- Use Garbage First (G1) garbage collection, which performs generally very well for Jenkins. Usually there’s enough spare CPU time to enable concurrent running.
- Ensure explicit System.gc() and metaspace resizing do not trigger a Full GC, because this can trigger a very long pause.
- Turn on parallel reference processing for Jenkins to use all CPU cores fully.
- Use String deduplication, which generates a tidy win for Jenkins.
- Enable GC logging, which can then be used for the next level of tuning and diagnostics, if needed.

There’s still a little unpredictability, but using appropriate settings gives a much more stable, responsive CI/CD server… even up to 20 GB heap sizes!

Further Reading:
- G1GC fundamentals
- MechanicalSympathy: Garbage Collection Distilled
- Oracle Garbage First Garbage Collector Tuning

One additional thing: I’ve added -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 to our options above. This covers a complex and usually infrequent case where G1 self-tuning can trigger bad performance for Jenkins, but that’s material for another post…
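Two of the numbers used in this post are easy to reproduce by hand. The sketch below is an illustration only, assuming Java 8 style log lines written with -XX:+PrintGCDateStamps like the samples quoted earlier; the function names are made up for this example, and a dedicated tool such as GCViewer or GCeasy.io handles far more log variants. It pulls the total pause out of each timestamped GC entry, then checks the soft-reference retention arithmetic (free heap in MB multiplied by SoftRefLRUPolicyMSPerMB).

```python
import re

# Matches the trailing ", <seconds> secs]" of a Java 8 style GC log entry.
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")
# Top-level entries begin with an ISO date stamp (from -XX:+PrintGCDateStamps).
ENTRY_RE = re.compile(r"\d{4}-\d{2}-\d{2}T")


def pause_seconds(log_lines):
    """Return the total pause, in seconds, of each timestamped GC entry."""
    pauses = []
    for line in log_lines:
        if not ENTRY_RE.match(line):
            continue  # indented detail lines belong to the previous entry
        found = PAUSE_RE.findall(line)
        if found:
            pauses.append(float(found[-1]))  # the last match is the whole-entry total
    return pauses


def summarize(pauses, threshold=2.0):
    """Average pause in ms, plus how many pauses exceeded `threshold` seconds."""
    average_ms = 1000 * sum(pauses) / len(pauses) if pauses else 0.0
    return average_ms, sum(1 for p in pauses if p > threshold)


def soft_ref_retention_hours(free_heap_mb, ms_per_mb=1000):
    """Hours an untouched soft reference is retained for a given amount of free heap."""
    return free_heap_mb * ms_per_mb / 1000 / 3600


# The two sample entries quoted in this post, plus one detail line that is skipped.
sample = [
    "2016-09-07T15:28:33.104+0000: 26230.652: [GC remark 26230.652: "
    "[GC ref-proc, 1.7204585 secs], 1.7440552 secs]",
    "2016-09-24T16:31:27.738-0700: 106414.347: "
    "[GC pause (G1 Evacuation Pause) (young), 1.8203527 secs]",
    "   [Parallel Time: 1796.4 ms, GC Workers: 8]",
]
pauses = pause_seconds(sample)
average_ms, slow = summarize(pauses)
print(f"{len(pauses)} pauses, average {average_ms:.0f} ms, {slow} over 2 s")
print(f"10 GB free, default policy: {soft_ref_retention_hours(10 * 1024):.1f} hours")
```

With 10 GB (10240 MB) free and the default of 1000 ms per MB, the retention works out to 10240 seconds, which is the roughly 2.8 hours quoted above; dropping the setting to 100 would shrink that to about 17 minutes.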
Sam Van Oort
Jun
15
Jenkins Pipeline Scalability in the Enterprise
This is a guest post by Damien Coraboeuf, Jenkins project contributor and Continuous Delivery consultant.

Implementing a CI/CD solution based on Jenkins has become very easy. Dealing with hundreds of jobs? Not so much. Having to scale to thousands of jobs? Now this is a real challenge. This is the story of a journey to get out of the jungle of jobs…

Start of the journey
At the beginning of the journey there were several projects using roughly the same technologies. Those projects had several branches, for maintenance of releases and for new features. In turn, each of those branches had to be carefully built, deployed on different platforms and versions, promoted so they could be tested for functionality, performance and security, and then promoted again for actual delivery. Additionally, we had to offer the test teams the means to deploy any version of their choice on any supported platform in order to carry out some manual tests.

This represented, for each branch, around 20 jobs. Multiply this by the number of branches and projects, and there you are: more than two years after the start of the story, we had more than 3500 jobs. 3500 jobs. Half a dozen people to manage them all…

Preparing the journey
How did we deal with this load?
We were lucky enough to have several assets:
- time: we had time to design a solution before the scaling went really out of control
- forecast: we knew that the scaling would occur and we were not taken by surprise
- tooling: the Jenkins Job DSL was available, efficient and well documented

We also knew that, in order to scale, we’d have to provide a solution with the following characteristics:
- self-service: we could not have a team of 6 people become a bottleneck for enabling CI/CD in projects
- security: the solution had to be secure enough in order for it to be used by remote developers we never met and didn’t know
- simplicity: enabling CI/CD had to be simple so that people having never heard of it could still use it
- extensibility: no solution is a one-size-fits-all and must be flexible enough to allow for corner cases

All the mechanisms described in this article are available through the Jenkins Seed plugin.

Creating pipelines using the Job DSL and embedding the scripts in the code was simple enough. But what about branching? We needed a mechanism to allow the creation of pipelines per branch, by downloading the associated DSL and running it in a dedicated folder. But then, all those projects, all those branches, they were mostly using the same pipelines, give or take a few configurable items. Going this way would have led to a terrible duplication of code, transforming a job maintenance nightmare into a code maintenance nightmare.
Pipeline as configuration
Our trick was to transform this vision of "pipeline as code" into a "pipeline as configuration":
- by maintaining well documented and tested "pipeline libraries"
- by asking projects to describe their pipeline not as code, but as property files which would:
  - define the name and version of the DSL pipeline library to use
  - use the rest of the property file to configure the pipeline library, using as many sensible default values as possible

Piloting the pipeline from the SCM
Once this was done, the only remaining trick was to automate the creation, update, start and deletion of the pipelines using SCM events. By enabling SCM hooks (in GitHub, BitBucket or even in Subversion), we could:
- automatically create a pipeline for a new branch
- regenerate a pipeline when the branch’s pipeline description was modified
- start the pipeline on any other commit on the branch
- remove the pipeline when the branch was deleted

Once a project wants to join our ecosystem, the Jenkins team "seeds" the project into Jenkins, by running a job and giving a few parameters. It will create a folder for the project and grant proper authorisations, using Active Directory group names based on the project name. The hook for the project must be registered into the SCM and you’re up and running.

Configuration and code
Mixing the use of strong pipeline libraries configured by properties and the direct use of the Jenkins Job DSL is still possible. The Seed plugin supports all kinds of combinations:
- use of pipeline libraries only (this can even be enforced)
- use of a DSL script which can in turn use some classes and methods defined in a pipeline library
- use of a Job DSL script only

Usually, we tried to have a maximum reuse, through only pipeline libraries, for most of our projects, but in other circumstances, we were less strict and allowed some teams to develop their own pipeline script.

End of the journey
In the end, what did we achieve?
Self service ✔︎
- Pipeline automation from SCM: no intervention from the Jenkins team but for the initial bootstrapping
- Getting a project on board of this system can be done in a few minutes only

Security ✔︎
- Project level authorisations
- No code execution on the controller

Simplicity ✔︎
- Property files

Extensibility ✔︎
- Pipeline libraries
- Direct job DSL still possible

Seed and Pipeline plugin
Now, what about the Pipeline plugin? Both this plugin and the Seed plugin have common functionalities.

What we have found in our journey is that having a "pipeline as configuration" was the easiest and most secure way to get a lot of projects on board, with developers not knowing Jenkins and even less the DSL.

The outcome of the two plugins is different:
- one pipeline job for the Pipeline plugin
- a list of orchestrated jobs for the Seed plugin

If time allows, it would probably be a good idea to find a way to integrate the functionalities of the Seed plugin into the pipeline framework, and to keep what makes the strength of the Seed plugin:
- pipeline as configuration
- reusable pipeline libraries, versioned and tested

Links
You can find additional information about the Seed plugin and its usage at the following links:
- the Seed plugin itself
- JUC London, June 2015
- BruJUG Brussels, March 2016
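To make the "pipeline as configuration" idea described earlier more concrete, here is a minimal sketch of how a property file could name a pipeline library and version while everything else falls back to defaults. This is not the Seed plugin’s implementation: the key names (pipeline.library, pipeline.version), the default settings, and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical defaults that a pipeline library might ship with.
DEFAULTS = {"jdk": "8", "deploy.platforms": "linux", "notify": "false"}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def resolve_pipeline(text, defaults=DEFAULTS):
    """Return (library, version, settings) for one branch's property file."""
    props = parse_properties(text)
    # The library name and version are mandatory; a missing key raises KeyError.
    library = props.pop("pipeline.library")
    version = props.pop("pipeline.version")
    # Every remaining property overrides the matching default.
    settings = {**defaults, **props}
    return library, version, settings


branch_file = """
# Hypothetical property file stored next to the code on a branch
pipeline.library = java-service
pipeline.version = 1.4.2
jdk = 11
"""
library, version, settings = resolve_pipeline(branch_file)
print(library, version, settings)
```

The point of the design is that a project only states what differs from the defaults, so improving the shared library (or its defaults) reaches every branch without touching the property files.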
Damien Coraboeuf