Building an ETL pipeline for Jira
This is the first part of a two-part blog post about extracting data from Jira and using it for further applications such as visualization, evaluation, and applying machine learning to gain valuable insights into the data.
Introduction
Jira is an issue tracking product with capabilities for agile project management, bug tracking, time logging and much more. mimacom has been using Jira for more than ten years now to manage its projects, so there is a sizeable amount of valuable data available in mimacom's Jira database. However, since its introduction at mimacom ten years ago, there have been several major updates to the Jira product, and the way it is used to manage projects changed from update to update (e.g. the introduction of new functionality like epics, sprints, components, etc.). For these reasons, the Jira data does not provide a very consistent record of mimacom's history in managing projects. Data fields are often unstructured and almost exclusively in textual form. Even the language differs for some projects, although it can be narrowed down to mostly German and English. Additionally, Jira allows the creation of custom fields, which provide more flexibility at the cost of a very generic yet complex relational data model that has continuously evolved since the initial release of the Jira product in 2002.
Jira's entity relationship diagram
In this post I will provide an overview of how we can build an ETL (Extract, Transform, Load) pipeline for Jira so that we can get better and easier access to the data.
Defining a target structure
To make better use of the data, we use a domain-driven design approach to define a clear domain model that serves as a basis for all further applications. The next steps will then be to extract all the data and transform it into this structure.
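To make this more concrete, here is a minimal sketch of what such a domain model could look like. The entity names follow those used later in this post (Issue, Sprint, Worklog, Resolution); anything beyond that, such as the exact fields of Worklog, is an illustrative assumption, and the real model contains considerably more detail.

import java.time.LocalDate

// Minimal sketch of the domain model, not the full implementation.
enum class Resolution { FIXED, WONT_FIX, DUPLICATE, NONE }

data class IssueId(val value: String)

data class Worklog(val author: String, val timeSpentMinutes: Long)

data class Sprint(val name: String, val number: Int?, val endDate: LocalDate?)

data class Issue(
    val id: IssueId,
    val title: String,
    val resolution: Resolution,
    val technologies: List<String>,
    val worklogs: List<Worklog>
    // ... reporter, status, estimates and more
)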
Extracting the data
There are various ways to extract data from Jira. Directly accessing the database is discouraged because the model is subject to change and there is no guarantee that queries will keep working in a future version of the product. Plugins such as the Jira CLI can extract almost any kind of Jira data in CSV format, but they are not free of charge, come with several usage restrictions, and are therefore not suitable for our application either.
For these reasons, we will use various Jira REST APIs that abstract the access to the underlying data.
The technology of choice to extract the data will be a Kotlin-powered Spring Boot app, because Kotlin is a really cool JVM language that provides numerous convenience features - you should try it!
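The project setup itself is not shown in this post; a Spring Initializr project with roughly the following dependencies is sufficient (Gradle Kotlin DSL shown here as an example, exact versions and coordinates may differ from the original project):

// build.gradle.kts (dependency section only)
dependencies {
    implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.boot:spring-boot-starter-data-mongodb")
    implementation("org.springframework.boot:spring-boot-starter-data-elasticsearch")
    implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
    implementation("org.jetbrains.kotlin:kotlin-reflect")
    implementation("org.jetbrains.kotlin:kotlin-stdlib-jdk8")
}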
Step 1: Create JSON mapper classes
Jira REST endpoints return data in the JSON format. To avoid painful parsing and mapping of the JSON data, we use the Json to Kotlin class IDEA Plugin and generate a Kotlin class to which the JSON response will automatically be mapped. We can then already do a first round of manual cleaning by removing the properties we are sure we do not need, such as roles, archive information and the 16x16 and 32x32 avatar URLs. After that, the generated class for Jira projects looks like this:
import com.fasterxml.jackson.annotation.JsonProperty

data class JiraProject(
    val id: Int,
    val key: String,
    val name: String,
    val description: String,
    val lead: Lead?,
    val projectCategory: ProjectCategory?,
    val projectTypeKey: String,
    val self: String,
    val avatarUrls: AvatarUrls?
) {
    data class Lead(
        val active: Boolean,
        val displayName: String,
        val key: String,
        val name: String,
        val self: String
    )

    data class ProjectCategory(
        val description: String,
        val id: String,
        val name: String,
        val self: String
    )

    data class AvatarUrls(
        @JsonProperty("48x48") val avatarUrl: String
    )
}
Note that the @JsonProperty annotation can be left out if the property's name is equal to the JSON field name.
The same procedure is applied for Issues, Comments, Worklogs and Sprints. For pageable content like issues, we additionally create a wrapper class that contains info about the request itself. We don't want this information in the JiraIssue class since it doesn't logically belong there.
data class JiraIssueWrapper(
    val expand: String?,
    val issues: List<JiraIssue>,
    val maxResults: Int,
    val startAt: Int,
    val total: Int
)
Step 2: Providing persistence in a staging database
To store the extracted data, we use a MongoDB database together with some Spring Data magic. All we need to do is define an interface extending the Spring Data MongoRepository, and Spring will automatically generate an implementation for us:
interface JiraProjectRepository : MongoRepository<JiraProject, String>
Easy as that! Once again, we do the same thing for all other entities of interest.
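For completeness, the remaining stage repositories that the extraction service below depends on are equally short (the entity class names and the String ID type are assumptions, analogous to JiraProject):

interface JiraIssueRepository : MongoRepository<JiraIssue, String>
interface JiraCommentRepository : MongoRepository<JiraComment, String>
interface JiraWorklogRepository : MongoRepository<JiraWorklog, String>
interface JiraSprintRepository : MongoRepository<JiraSprint, String>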
Of course we need a running MongoDB instance for this to work. I use Docker for this (publishing the default port so the application can reach it):
docker run -d -p 27017:27017 mongo:latest
Step 3: Writing an extraction service
We create a service and inject the previously created JiraProjectRepository as a dependency. We load connection properties defined in a resource file, add some methods to fetch different types of data from the APIs, and store the results using the MongoRepositories created in the previous step:
import org.springframework.beans.factory.annotation.Value
import org.springframework.boot.web.client.RestTemplateBuilder
import org.springframework.stereotype.Service
import org.springframework.web.client.RestTemplate
import org.springframework.web.client.getForObject
import java.util.logging.Logger

@Service
class JiraService(
    restTemplateBuilder: RestTemplateBuilder,
    private val jiraProjectRepository: JiraProjectRepository,
    private val jiraIssueRepository: JiraIssueRepository,
    private val jiraCommentRepository: JiraCommentRepository,
    private val jiraWorklogRepository: JiraWorklogRepository,
    private val jiraSprintRepository: JiraSprintRepository
) {

    @Value("\${jira.apiurl}")
    private lateinit var apiurl: String

    @Value("\${jira.username}")
    private lateinit var username: String

    @Value("\${jira.password}")
    private lateinit var password: String

    private val restTemplate: RestTemplate by lazy {
        restTemplateBuilder.basicAuthentication(username, password).build()
    }

    private val logger = Logger.getLogger(this.javaClass.name)

    fun extractAndSaveProjects() {
        val projects = extractProjects()
        jiraProjectRepository.saveAll(projects)
        logger.info("saved ${projects.size} projects")
    }

    fun extractAndSaveIssuesForProjects() {
        val projects = jiraProjectRepository.findAll()
        projects.forEach {
            val issues = extractIssuesForProject(it.id)
            jiraIssueRepository.saveAll(issues)
            logger.info("project ${it.name}: saved ${issues.size} issues")
        }
    }

    private fun extractProjects(): List<JiraProject> {
        val projectsUrl = "$apiurl/project?expand=lead,description"
        // the endpoint returns a plain JSON array, so we deserialize into an array and convert it
        val projects = restTemplate.getForObject<Array<JiraProject>>(projectsUrl)
        return projects?.toList() ?: emptyList()
    }

    private fun extractIssuesForProject(projectId: Int): List<JiraIssue> {
        val issuesUrl = "$apiurl/search?jql=project=\"$projectId\""
        val issues = mutableListOf<JiraIssue>()
        var startAt = 0
        val maxResults = 1000 // we cannot get more than 1000 per request, that's why we need to iterate
        do {
            val issueWrapper = restTemplate.getForObject<JiraIssueWrapper>("$issuesUrl&startAt=$startAt&maxResults=$maxResults")
            issueWrapper?.issues?.let { issues.addAll(it) }
            startAt += maxResults
        } while (startAt < (issueWrapper?.total ?: 0))
        return issues
    }

    // methods for Comments, Sprints, Worklogs etc. left out for brevity.
}
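The properties injected via @Value come from the resource file mentioned above, typically src/main/resources/application.properties. A minimal example with placeholder values might look like this (the MongoDB URI is only needed if the database does not run on the default localhost:27017):

# application.properties (placeholder values)
jira.apiurl=https://jira.example.com/rest/api/2
jira.username=etl-user
jira.password=secret
spring.data.mongodb.uri=mongodb://localhost:27017/jira-etl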
And this is all we need for the extraction process. Since the REST API limits page sizes and throughput, fetching ten years' worth of data is a rather time-consuming operation and takes an hour or two to complete. However, we now have a data source that provides the same data as the Jira REST API but is much easier to query and faster by several orders of magnitude!
Transforming the data into the target structure
After implementing the previously defined domain model in Kotlin, we are ready to transform the extracted data into it. We achieve this with transformer objects in Kotlin that describe the steps necessary to map a stored Jira object to a domain entity. These transformers can easily be nested, so that there is a clear separation of concerns and every transformer transforms exactly one thing. The next code listing shows a very simple transformer example.
object JiraIssueToResolutionTransformer : JiraTransformer<JiraIssue, Resolution> {

    override fun transform(source: JiraIssue): Resolution {
        val resolution = source.fields.resolution?.name?.toLowerCase()
        return when (resolution) {
            "fixed", "done" -> Resolution.FIXED
            "won't fix", "won't do", "invalid", "cannot reproduce", "incomplete" -> Resolution.WONT_FIX
            "duplicate" -> Resolution.DUPLICATE
            else -> Resolution.NONE
        }
    }
}
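The JiraTransformer interface that the transformers implement is not listed above; it can be as small as a single generic method:

interface JiraTransformer<S, T> {
    fun transform(source: S): T
}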
When all necessary transformers are implemented, we bundle them in a service that executes them in the proper order as illustrated in the following sequence diagram:
As can be derived from the diagram of the example issue transformation, the transformed issues are passed to the IssueApplicationService, which resides in the core part of the application and serves as an interface to the domain model that we defined. For now, calling addIssues() simply stores the issues in an abstract repository.
class IssueApplicationService(private val issueRepository: IssueRepository) {

    private val logger = Logger.getLogger(this.javaClass.name)

    fun addIssues(issues: List<Issue>) {
        issueRepository.saveAll(issues)
        logger.info("Saved ${issues.size} issues")
    }
}
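The bundling service itself is not listed in this post, but a heavily simplified sketch of what it could look like is shown below. Apart from JiraIssueToResolutionTransformer and IssueApplicationService, the names (JiraTransformationService, JiraIssueToIssueTransformer) are assumptions for illustration:

import org.springframework.stereotype.Service

@Service
class JiraTransformationService(
    private val jiraIssueRepository: JiraIssueRepository,
    private val issueApplicationService: IssueApplicationService
) {

    fun transformAndAddIssues() {
        // read the staged Jira issues and run them through the transformer chain
        val issues = jiraIssueRepository.findAll().map { jiraIssue ->
            // a hypothetical top-level transformer that delegates to nested transformers
            // such as JiraIssueToResolutionTransformer
            JiraIssueToIssueTransformer.transform(jiraIssue)
        }
        issueApplicationService.addIssues(issues)
    }
}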
Loading the transformed data to Elasticsearch
Now that the hard and time-consuming part is done, we can start experimenting with the fun stuff! There are three applications that are particularly interesting:
- Visualizing the data and providing an easy-to-use query platform (using Kibana and Elasticsearch)
- Calculating project-related KPIs (Key Performance Indicators) for deeper insights into team and company performance
- Using these ten years' worth of data for specific machine learning problems and eventually forecasting KPIs
Points 2 and 3 deserve their own blog post and will be covered in the next one. We can realize the first one very quickly though, now that our transformed data is ready.
Analogous to the MongoRepository, Spring Data Elasticsearch provides an ElasticsearchRepository interface that will automatically be implemented when we define an appropriate interface:
interface ElasticIssueRepository : ElasticsearchRepository<ElasticIssue, String>
Note the additional ElasticXXX classes that map the domain entities to corresponding Elasticsearch indexes. This is done to avoid polluting the domain model with Elasticsearch-related annotations. It also allows for an easy conversion to types that are supported by Elasticsearch (for example to make sure dates are properly displayed). Using Kotlin data classes, this is very straightforward to do:
import org.springframework.data.elasticsearch.annotations.Document
import java.time.format.DateTimeFormatter

@Document(indexName = "issue", type = "issue", refreshInterval = "30s")
class ElasticIssue(
    val id: String,
    val title: String,
    val isReportedByCustomer: Boolean,
    val sprint: ElasticSprint?,
    val worklogs: List<ElasticWorklog>,
    val technologies: List<String>,
    // ... more properties
) {
    constructor(issue: Issue, sprint: Sprint?) : this(
        id = issue.id.value,
        title = issue.title,
        isReportedByCustomer = issue.isReportedByCustomer(),
        sprint = if (sprint == null) null else ElasticSprint(sprint),
        worklogs = issue.worklogs.map { ElasticWorklog(it, issue) },
        technologies = issue.technologies,
        // ... more properties
    )

    class ElasticSprint(val name: String, val number: Int?, val type: String?, val endDate: String?) {
        constructor(sprint: Sprint) : this(
            sprint.name,
            sprint.number,
            sprint.type?.name,
            sprint.endDate?.format(DateTimeFormatter.ISO_DATE)
        )
    }
}
Almost done. We once again create a service to tie it all together:
@Service
class ElasticService(
    private val elasticIssueRepository: ElasticIssueRepository,
    private val issueApplicationService: IssueApplicationService
) {

    private val logger = Logger.getLogger(this.javaClass.name)

    fun saveIssuesToElasticsearch() {
        logger.info("Saving issues and worklogs to Elasticsearch...")
        val issues = issueApplicationService.getAllIssues()
        val elasticIssues = issues.map { ElasticIssue(it.key, it.value) }
        elasticIssueRepository.saveAll(elasticIssues)
        logger.info("Saved ${elasticIssues.size} elasticIssues to Elasticsearch")
    }
}
Et voilà! One call to saveIssuesToElasticsearch() will store everything to Elasticsearch.
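How the individual steps are triggered is not covered in this post; one simple option (shown here as a sketch, with the JiraTransformationService name again being an assumption) is a CommandLineRunner that runs the whole extract-transform-load cycle on application startup. A scheduler or a small REST controller would work just as well.

import org.springframework.boot.CommandLineRunner
import org.springframework.stereotype.Component

@Component
class EtlPipelineRunner(
    private val jiraService: JiraService,
    private val jiraTransformationService: JiraTransformationService,
    private val elasticService: ElasticService
) : CommandLineRunner {

    override fun run(vararg args: String?) {
        jiraService.extractAndSaveProjects()               // extract
        jiraService.extractAndSaveIssuesForProjects()
        jiraTransformationService.transformAndAddIssues()  // transform
        elasticService.saveIssuesToElasticsearch()         // load
    }
}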
Obviously you need an Elasticsearch instance running on your host. I defined a docker-compose.yml file to run both Elasticsearch and Kibana with the command docker-compose up -d:
version: '3.7'
services:
  elasticsearch:
    image: elasticsearch:6.5.0
    container_name: jira-etl-elasticsearch
    ulimits:
      nofile: 65536
    ports:
      - 9200:9200
      - 9300:9300
    environment:
      - cluster.name=elasticsearch
  kibana:
    image: kibana:6.5.0
    container_name: jira-etl-kibana
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch
We can now go to localhost:5601, open Kibana, and visually and interactively explore our data! Let's see, for example, how the technologies of the issues are distributed:
Conclusion
In this blog post we saw how an ETL pipeline for Jira can be built and used to transform data from a rather complex and heterogeneous format into a clear structure that can now be used for many other applications.
In the next post we will investigate how project-related KPIs can be calculated based on this dataset and how different machine learning algorithms can be applied to forecast certain KPIs, giving project managers a tool for more effective planning and controlling.