About myself
Hello, my name is Payal Priyadarshini. I am pursing my major in Computer
Science & Engineering at the Indian Institute of Technology Kharagpur, India. I
am very proficient in writing code in Python, C++, Java and currently getting
familiar and hopefully good in Groovy too.
I have internship experiences in renowned institutions like Google and VMware
where I worked with some exciting technologies for example Knowledge Graphs,
BigTable, SPARQL, RDF in Google. I am a passionate computer science student who
is always interested in learning and looking for new challenges and
technologies.That’s how I came across to Google Summer of Code where I am
working on some exciting data mining problems which you are going to encounter
below in this blog.
Project Overview
Jenkins has collected anonymous usage information of more than 100,000
installations which includes set of plugins and their versions etc and also
release history information of the upgrades. This data collection can be used
for various data mining experiments. The main goal of this project is to perform
various analysis and studies over the available dataset to discover trends
in data usage. This project will help us to learn more about the Jenkins
usage by solving various problems, such as:
Plugin versions installation trends, will let us know about the versions installation behaviour of a given plugin.
Spotting downgrades, which will warn us that something is wrong with the version from which downgrading was performed.
Correlating what users are saying (community rating) with what users are doing (upgrades/downgrades).
Distribution of cluster size, where clusters represents jobs, nodes count which approximates the size of installation.
Finding set of plugins which are likely to be used together, will setup pillar for plugin recommendation system.
As a part of the Google Summer of Code 2016, I will be working on the above
mentioned problems. My mentors for the project are Kohsuke Kawaguchi and Daniel Beck. Some analyses has already been done over this
data but those are outdated as charts can be more clearer and interactive. This project aims to improvise existing
statistics and generating new ones discussed above.
Use Cases
This project covers wide-range of the use-cases that has been derived from the
problems mentioned above.
Use Case 1: Upgrade/Downgrade Analysis
Understanding the trend in upgrades and downgrades have lots of utilities, some
of them have already been explained earlier which includes measuring the
popularity, spotting downgrades, giving warning about the wrong versions quickly
etc.
Use Case 1.1: Plugin versions installation trends
Here we are analysing the trend in the different version installations for a
given plugin. This use-case will help us to know about:
Trend in the upgrade to the latest version released for a given plugin.
Trend in the popularity decrement of the previous versions after new version release.
Find the most popular plugin version at any given point of time.
Use Case 1.2: Spotting dowgrades
Here we are interested to know, how many installations are downgraded from any
given version to previously used version. Far fetched goal of this analysis is
to give warning when something goes wrong with the new version release, which
can be sensed using downgrades performed by users. This analysis can be
accomplished by studying the monotonic property of the version number vs.
timestamp graph for a given plugin.
Use Case 1.3: Correlation with the perceived quality of Jenkins release
To correlate what users are saying to what users are doing, we have community
ratings which tells us about the ratings and reviews of the releases and has
following parameters:
Used the release on production site w/o major issues.
Don’t recommend to other.
Tried but rolled it back to the previous version.
First parameters can be calculated from the Jenkins usage data and third
parameter is basically spotting downgrades(use case 1.2). But the second
parameter is basically an expression which is not possible to calculate. This
analysis is just to get a subjective idea about the correlation.
Use Case 2: Plugin Recommendation System
This section involves setting up ground work for the plugin recommendation
system. The idea is to find out the set of plugins which are most likely to be
used together. Here we will be following both content based filtering as well as
collaborative filtering approach.
Collaborative Filtering
This approach is based upon analysing large amount of information on
installation’s behaviours and activities. We have implicit form of the data
about the plugins, that is for every install ids, we know the set of plugins
installed. We can use this information to construct plugin usage graph where
nodes are the plugins and the edges between them is the number of installations
in which both plugins are installed together.
Content-based Filtering
This method is based on a properties or the content of the item for example
recommending items that are similar to the those that a user liked in the past
or examining in the present based upon some properties. Here, we are utilizing
Jenkins
plugin dependency graph to learn about the properties of a plugin. This graph
tells us about dependent plugins on a given plugin as well as its dependencies
on others. Here is an example to show, how this graph is use for content based
filetring, suppose if a user is using “CloudBees Cloud Connector”, then we can
recommend them for “CloudBees Registration Plugin” as both plugins are dependent
on “CloudBees Credentials Plugin”.
Additional Details
You may find the complete project proposal along with the detailed design of the
use-cases with their implementation details here in the
design
document.
A complete version of the use-case 1: Upgrade & Downgrade Analysis should be
available in late June and basic version of plugin recommendation system will be
available in late July.
I do appreciate any kind of feedback and suggestions. You may add comments in
the
design
doc. I will be posting updates about the statistics generation status on the
jenkins-dev mailing
list and jenkins-infra mailing list.
Links:
Design Doc
Google Summer of Code
Github infra-stats
Jenkins statistics
Jenkins Plugin Dependency Graph
Github GSoC Jenkins Usage Statistics Analysis