When your team grows and GitHub Projects is no longer sufficient to manage your team and all of its tickets, it's time to migrate to a more comprehensive project-management platform, and Jira could be one of the choices.
Most of the time you have to write a script to migrate them, because manually moving hundreds or even thousands of tickets could take a long time!
However, the GitHub API and Jira developer docs are not really clear. At the time of writing, GitHub has just updated its API, and the top of pretty much every page of its documentation warns that the content may not be up to date, while Jira has just updated its API to accept only Jira's internal account ID instead of the username (and email) when targeting any user, due to GDPR compliance.
Migrating from GitHub to Jira therefore became a matter of trial and error and research, and I couldn't find any really good articles or tutorials on the subject.
So I put together this tutorial to demonstrate how to implement such a script.
This tutorial illustrates an implementation for migrating GitHub issues to Jira, but it doesn't guarantee the completeness or robustness of the code.
Certain edge cases may not be covered, and it is your responsibility to make sure your code is production ready.
When talking to the GitHub API about a repo, you should always specify the owner and the repo; both are normally visible in the URL of any project.
For example, consider the new async Python backend framework FastAPI: for an arbitrary issue URL like https://github.com/tiangolo/fastapi/issues/2562, the owner is tiangolo and the repo is fastapi. This also applies to private organizations and repos.
More discussion can be found here.
Although you can access open-source repos without one, you should create a Personal Access Token (PAT) to access any repo (it's required for private repos anyway).
For details on how to generate a PAT, you can check this link.
# In this article I'll use the FastAPI repo as an example.
# Since it's an open-source project, we don't need the username and PAT here,
# so I'll comment out the authentication parts in the code but still show them.
USERNAME = "your github username"
PAT = "Your generated PAT"
GITHUB_OWNER = "tiangolo"
GITHUB_REPO = "fastapi"
First of all, check out the GitHub docs for retrieving issues.
The code is quite simple; let's first consider getting a single issue by its issue number.
# Import some libs for use in the script
import requests
from requests.auth import HTTPBasicAuth
import os
import json
import re
from pprint import pprint
def get_single_issue(issue_number):
"""Get specific issue data"""
url = f"https://api.github.com/repos/{GITHUB_OWNER}/{GITHUB_REPO}/issues/{issue_number}"
return requests.get(
url,
# auth=(USERNAME, PAT) # this is how you use the username and PAT
).json()
issue = get_single_issue(2562)
pprint(issue)
As you can see above, we have retrieved all the details about this issue.
But what if we want to get all the issues?
From the GitHub docs we can see that the endpoint is paginated: the default page size is 30 items and the maximum is 100 items per page.
One thing to NOTE is that GitHub considers PRs to be issues too, so if you want only the issues, excluding all the PRs, you have to filter them out yourself.
Here is how we do it:
def get_all_issues(pagination=100):
assert 0 < pagination <= 100 # pagination size needs to be set properly
# Traversing with Pagination to get all issues
url = f"https://api.github.com/repos/{GITHUB_OWNER}/{GITHUB_REPO}/issues"
data = {"per_page": pagination, "page": 1} # max 100 results per page starting from first page
response = requests.get(
url,
# auth=(USERNAME, PAT),
params=data,
)
# Get all the issues excluding the PRs
# NOTE that if the "pull_request" is not set for an issue then it is not a PR
issues = [issue for issue in response.json() if not issue.get("pull_request")]
while 'next' in response.links.keys():
response = requests.get(
response.links['next']['url'],
# auth=(USERNAME, PAT),
)
issues.extend([issue for issue in response.json() if not issue.get("pull_request")])
return issues
print(len(get_all_issues()))
At the time of writing there are 395 issues, so our logic above checks out.
On some occasions when we are doing a migration, we don't really want to migrate all of the issues but instead target specific ones, hence we need a way to get all issues with a particular label.
We could get all of the labels first.
def get_all_labels():
# Get all issue labels
url = f"https://api.github.com/repos/{GITHUB_OWNER}/{GITHUB_REPO}/labels"
return requests.get(
url,
# headers={'Authorization': PAT}
).json()
labels = get_all_labels()
# Let's see the details for one label
pprint(labels[0])
# Let's see all the label names
print([label['name'] for label in labels])
If we know which labels we need to export, that can make the entire process much easier.
This is similar to getting all the tickets, which means it is paginated too. However, there is a gotcha:
ATTENTION: you can't do OR for labels. Suppose you filter by the labels "name" and "age"; the returned result is the AND of them, not the OR. This was requested from GitHub FOUR years ago but still hasn't been addressed. Details can be found in this link.
def get_all_issues_by_label(label, pagination=100):
    assert 0 < pagination <= 100  # pagination size needs to be set properly
    assert label  # label cannot be None or empty
    # Get issues by label; this endpoint is paginated just like the one above
    url = f"https://api.github.com/repos/{GITHUB_OWNER}/{GITHUB_REPO}/issues"
    data = {"labels": label, "per_page": pagination, "page": 1}
    response = requests.get(
        url,
        # auth=(USERNAME, PAT),
        params=data
    )
    # PRs are considered issues too, so we should filter them out
    issues = [issue for issue in response.json() if not issue.get("pull_request")]
    while 'next' in response.links.keys():
        response = requests.get(
            response.links['next']['url'],
            # auth=(USERNAME, PAT),
        )
        issues.extend([issue for issue in response.json() if not issue.get("pull_request")])
    return issues
issues = get_all_issues_by_label('answered')
issue_numbers = [issue['number'] for issue in issues]
print(len(issue_numbers))
# As mentioned above, GitHub's label filter is AND, so if we want OR we
# have to loop over the labels. NOTE: an issue carrying both labels will
# appear twice, so dedupe by issue number if that matters for you.
issues = []
for label in ('answered', 'bug'):
issues.extend(get_all_issues_by_label(label))
print(len(issues))
Now that we know how to get issues, the next step is to figure out how to get all the comments.
The comments endpoint returns a detailed list of each comment.
# Since the comment url is in the issue's data, we can just use it to fetch the comments for the issue
comment_url = issue['comments_url']
comment_data = requests.get(
comment_url,
# auth=(USERNAME, PAT)
).json()
pprint(comment_data)
Now that we have enough information from GitHub, let's take a look at Jira.
After some research, I found that Jira accepts imports in both CSV and JSON.
However, there are two entry points: importing from CSV and importing from JSON.
Here we choose the CSV format for importing (JSON works the same way), and here is the documentation for importing data from CSV.
Most of the field mapping is pretty straightforward, but the following can be super tricky:
There are basically THREE project fields needed for auto-mapping with projects:
project
project_type
project_key
These fields are well defined for any Jira project; you can find them in your Jira project settings.
# We could initialize Jira mapping as follows
FIELDS = [
# Project
{'key': 'project', 'label': 'Project'},
{'key': 'project_type', 'label': 'Project Type'},
{'key': 'project_key', 'label': 'Project Key'},
# # Once issue_key is specified, its targeted for that specific issue
# # any further operations will be an update, so I just use it for testing
# {'key': 'issue_key', 'label': 'Issue Key'}, # Optional
# Issue
{'key': 'title', 'label': 'Summary'},
{'key': 'body', 'label': 'Description'},
{'key': 'assignee', 'label': 'Assignee'},
    {'key': 'user', 'label': 'Reporter'},  # Issue creator
{'key': 'created_at', 'label': 'Date Created'},
{'key': 'updated_at', 'label': 'Date Modified'},
{'key': 'labels', 'label': 'Labels'}, # each label separated by space
]
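To give a sense of how this mapping gets used, here is a hypothetical sketch of writing the CSV with Python's built-in csv module. It assumes each issue dict has already been enriched with Jira-ready values under the keys in FIELDS (transformed dates, account IDs, comment_N entries, and so on), so treat it as a starting point rather than a finished importer.
import csv

def write_issues_to_csv(issues, filename="jira_import.csv"):
    """Hypothetical sketch: one row per issue, one column per FIELDS entry"""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        # Header row uses the Jira-side labels
        writer.writerow([field['label'] for field in FIELDS])
        for issue in issues:
            # Each row pulls the GitHub-side key, defaulting to empty
            writer.writerow([issue.get(field['key'], "") for field in FIELDS])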
A lot of online resources suggest using username, userKey, or userEmail to identify Jira users, but these probably won't work anymore (or are at least unstable).
Therefore we have to convert usernames to user account IDs.
To get a list of Jira account IDs, we first have to generate an API token to talk to Jira. You can read this documentation to create a token.
Once you've created the token, you can follow this documentation to get user data.
Most of the time we have user emails that we can map, but in Jira you can't see user emails unless you follow the instructions in this article.
I found that requesting email access in Jira could require work from another department, so I worked around it a little.
# Suppose we have a map of GitHub login names to user emails like the following
GITHUB_LOGIN_ID_EMAIL_MAP = {
"your-github-login-name": "your-github-login-EMAIL",
}
JIRA_API_TOKEN = "your generated jira api token"
JIRA_LOGIN_EMAIL = "your jira login email"
def get_jira_account_ids_with_display_name():
url = "https://your-company-jira-url/rest/api/3/users/search"
auth = HTTPBasicAuth(JIRA_LOGIN_EMAIL, JIRA_API_TOKEN)
headers = {
"Accept": "application/json"
}
response = requests.request(
"GET",
url,
headers=headers,
auth=auth
)
users = json.loads(response.text)
    email_username_github_login_id_map = {v.split("@")[0]: k for k, v in GITHUB_LOGIN_ID_EMAIL_MAP.items()}
    not_mapped_users = {}
    print("Mapped users:")
    for user in users:
        try:
            print(f"{email_username_github_login_id_map[user['displayName'].split(' ')[0].lower()]} -> {user['accountId']}")
        except KeyError:
            not_mapped_users.update({user['displayName']: user['accountId']})
    if not_mapped_users:
        print("\nNot mapped users:")
        for k, v in not_mapped_users.items():
            print(f"{k}: {v}")
The idea above is that most of the time a company email follows a pattern based on the employee's name; in the example here it's the employee's first name.
The Jira display name often starts with the first name as well.
So given a list of emails, we can extract the username automatically and look up the Jira account ID.
This won't work if the email or the Jira display name doesn't follow the pattern; in that case you can adjust the logic or map those users manually.
# Using the above function, we can manually create a map from GitHub login ID to
# Jira account ID, similar to the following
GITHUB_LOGIN_ID_JIRA_ACCOUNT_ID_MAP = {
"your-github-login-name": "your mapped jira account id",
}
# THEN we can write a function to get the jira account id given github login id
def get_jira_account_id_from_github_login_id(github_user_data):
    if not github_user_data:
        return ""
    github_account = github_user_data['login']
    if GITHUB_LOGIN_ID_JIRA_ACCOUNT_ID_MAP.get(github_account):
        return GITHUB_LOGIN_ID_JIRA_ACCOUNT_ID_MAP[github_account]
    print(f"GitHub account <{github_account}> not in the map")
    return ""  # fall back to an empty string so callers don't get None
Jira supports all standard datetime formats, but you'll have to make sure the format you use is consistent everywhere (all datetime fields, Jira comments, etc.).
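The get_comments function later in this article uses a transform_to_datetime helper for exactly this. Here is a minimal sketch, assuming your Jira importer is configured for the dd/MM/yyyy HH:mm:ss format shown in the comment example below; adjust the output format to whatever your importer actually expects.
from datetime import datetime

def transform_to_datetime(github_timestamp):
    # GitHub timestamps are ISO 8601, e.g. "2020-12-28T10:21:32Z"
    parsed = datetime.strptime(github_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    # The output format here is an assumption; keep it consistent everywhere
    return parsed.strftime("%d/%m/%Y %H:%M:%S")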
Unlike other platforms, which use Markdown, Jira uses something called the Wiki Format (e.g. it wraps image URLs in ! characters), which means inline Markdown images won't display correctly. So we have to transform all the Markdown-format images into the Wiki format for them to display correctly in Jira.
def transfer_markdown_image_to_wiki_image(string):
    """
    Convert a markdown format image link to a wiki format image link
    https://community.atlassian.com/t5/Jira-questions/How-to-config-JIRA-display-inline-image-attach-image-in-BEST-FIT/qaq-p/701224
    """
    if '![' in string:
        def replace_markdown_link(match):
            text = match.group()
            img_link = re.findall('(http[s]?://[^)]+)', text)[0]
            return f"!{img_link}|width=100%!"
        # Match only image links (with the leading "!") so regular
        # markdown links are left untouched
        markdown_img_pattern = re.compile(r'!\[[^\]]*\]\(([^)]+)\)', re.S)
        string = markdown_img_pattern.sub(replace_markdown_link, string)
    return string
string = (
"This is our inline image from github "
'![some image](https://user-images.githubusercontent.com/some_id/image_name.png)'
)
print("Original string:", string)
print("\nTransformed string:", transfer_markdown_image_to_wiki_image(string))
In addition, inline images sometimes come in HTML format, so we also need to parse those.
def transfer_html_image_to_wiki_image(string):
"""
Convert html img tag to wiki format image link
    Since the default GitHub link follows the pattern below:
<img width="1439" alt="Screen Shot 2020-12-07 at 11 13 50 AM" src="https://user-images.githubusercontent.com/some_id/image_name.png">
we can simply use regex to match it
"""
if '<img' in string:
def replace_img_tag(match):
text = match.group()
            img_link = re.findall(r'(https://user-images\.githubusercontent\.com.*?\.png)', text)[0]
return f"!{img_link}|width=100%!"
img_tag_pattern = re.compile("(<img.*?>)", re.S)
string = img_tag_pattern.sub(replace_img_tag, string)
return string
string = (
"This is our inline image from github "
'<img width="1439" alt="Screen Shot 2020-12-07 at 11 13 50 AM" src="https://user-images.githubusercontent.com/some_id/image_name.png">'
)
print("Original string:", string)
print("\nTransformed string:", transfer_html_image_to_wiki_image(string))
Combining the two functions above, we get the final image-link transform function as follows:
def transform_image_link(string):
string = transfer_markdown_image_to_wiki_image(string)
string = transfer_html_image_to_wiki_image(string)
return string
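As with the individual helpers, a quick check with a string that mixes both styles (the URLs are just placeholders):
string = (
    "Markdown style "
    '![some image](https://user-images.githubusercontent.com/some_id/image_name.png)'
    " and HTML style "
    '<img width="1439" alt="screenshot" src="https://user-images.githubusercontent.com/some_id/image_name.png">'
)
print(transform_image_link(string))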
The labels are pretty straightforward; the main thing to remember is that the Jira CSV importer expects each issue's labels as a single space-separated string (as noted in the FIELDS mapping above).
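Since GitHub returns an issue's labels as a list of objects, a minimal sketch of flattening them into that string could look like this (transform_labels is a hypothetical helper, not part of the importer itself):
def transform_labels(issue):
    # GitHub returns labels as a list of objects with a 'name' key;
    # the Jira CSV importer expects one space-separated string
    return " ".join(label['name'] for label in issue.get('labels', []))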
This part gets really tricky; I spent lots of time researching before finally making it work. Jira expects each comment as a single string in the format:
"date time; commenter; comment content"
Some more details can be found here:
def get_comments(issue):
if int(issue.get('comments', 0)) > 0:
        # Each comment goes into its own CSV column
        # Get comment details
comment_url = issue['comments_url']
comment_data = requests.get(
comment_url,
# auth=(USERNAME, PAT)
).json()
        # Comment format expected by Jira:
        # to preserve the comment author/date, use the format
        # e.g. "05/05/2010 11:20:30;adam; This is a comment."
comments = {
f"comment_{i}": (
f"{transform_to_datetime(comment['updated_at'])}; " # comment_date, make sure format is consistent
f"{get_jira_account_id_from_github_login_id(comment['user'])}; " # commentor jira account id
f"{transform_image_link(comment['body'])}" # transform inline image style
) for i, comment in enumerate(comment_data)
}
        # Dynamically update the csv comment fields accordingly.
        # Use a set rather than a generator here: a generator would be
        # exhausted after the first membership check, so every later key
        # would wrongly look missing and get appended again.
        comment_keys = {
            field['key']
            for field in FIELDS  # Generated csv fields
            if field['key'].startswith("comment_")
        }
        for new_comment_key in comments.keys():
            if new_comment_key not in comment_keys:
                FIELDS.append({'key': new_comment_key, 'label': 'Comment Body'})
else:
comments = {"comment_0": ""} # default comment field name
return comments
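A quick sanity check with the issue we fetched earlier (the account IDs will come back empty unless you've filled in the map above):
comments = get_comments(issue)
pprint(comments)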
This is probably the trickiest field to get, since there is no easy way to retrieve this info from GitHub, while it's quite easy on the Jira side.
For GitHub tickets, the status from the API is simply an indication of the ticket state (i.e. open/closed etc.). The card on the right side of the ticket shows which column the ticket belongs to in the project, and there is no easy way to get that.
What I discovered is a way to traverse the ticket's timeline history and find which column it was moved to (the ticket has a timeline showing all operations on it, and if there are no operations it still shows the default action from when it was created).
def get_timeline(issue_number):
timeline_url = f"https://api.github.com/repos/{GITHUB_OWNER}/{GITHUB_REPO}/issues/{issue_number}/timeline"
timeline_headers = {
"Accept": "application/vnd.github.mockingbird-preview" # Need to add this header for this api
}
timeline = requests.get(
timeline_url,
# auth=(USERNAME, PAT),
headers=timeline_headers
).json()
return timeline
Let's again use ticket 2562 from the beginning of the article as an example:
get_timeline(2562)
The next step is to get all the events and see if there is a move or create action from a project.
All the event URLs are given in the payload, and we are looking for the event types moved_columns_in_project and added_to_project.
We look through the returned results in reverse order, from the latest to the earliest.
If we still don't find anything, that means there were no operations on the ticket and it is still in the default column, so we can just use the first timeline event.
def get_event_details(event):
event_url = event['url']
event_headers = {
"Accept": "application/vnd.github.starfox-preview+json" # Need to add this header for this api
}
event_detail = requests.get(
event_url,
# auth=(USERNAME, PAT),
headers=event_headers
).json()
return event_detail
def transform_issue_status(issue_number):
issue_status = ""
# First get timeline
timeline = get_timeline(issue_number)
    # Then we traverse the list in reverse to get the latest update
for index in range(len(timeline)-1, -1, -1):
event = timeline[index]
event_type = event['event']
if event_type == "moved_columns_in_project" or event_type == "added_to_project":
issue_status = get_event_details(event)['project_card']['column_name']
break # Quit once we found the latest status
if not issue_status:
# Status not found, default to the column when the ticket was created
if timeline:
issue_status = get_event_details(timeline[0])['project_card']['column_name']
return issue_status
We can't test this here, but it should work most of the time.
However, this is not a super robust solution due to time constraints; maybe there is a better way to do this.
Similarly, when exporting all the issues for a particular project, we can use the same approach.
def get_issues_by_project(project_id):
project_issues = []
for issue in get_all_issues():
timeline = get_timeline(issue['number'])
for index in range(len(timeline)-1, -1, -1):
event = timeline[index]
event_type = event['event']
if event_type == "moved_columns_in_project" or event_type == "added_to_project":
event_detail = get_event_details(event)
issue_project_id = event_detail['project_card']['project_id']
if issue_project_id == int(project_id):
project_issues.append(issue)
print(issue['number'], "-->", event_detail['project_card']['column_name'])
break
else:
# Status not found, default to the column when the ticket was created
if timeline and timeline[0].get('url'):
event_detail = get_event_details(timeline[0])
issue_project_id = event_detail.get('project_card', {}).get('project_id')
if issue_project_id == int(project_id):
project_issues.append(issue)
print(issue['number'], "-->", event_detail['project_card']['column_name'])
print([i['number'] for i in project_issues])
print(len(project_issues))
However, the above approach may not be very accurate when one issue is assigned to more than one project.
Here I'm just showing one way of making it happen.
This is the end of the tutorial; thanks for reading.
Hopefully this article can help you save some time :)