Failing to optimize diff generation with local operations

While looking at performance data for Stickler CI, I noticed that some reviews spent a surprising amount of time talking to the GitHub API. While Stickler CI spending a lot of time talking to GitHub isn’t shocking in itself, what caught my interest was the time spent fetching modified files. The pulls/:id/files endpoint is paginated, and review jobs would sometimes make 50+ API calls to retrieve all the files modified in a pull request.

All these network requests seemed fairly redundant, given that Stickler also has a clone of the repository on hand for running linters and automatic fixes. Stickler CI uses the textual diffs of a pull request extensively: diffs are used to apply linters and fixes only to files and chunks already changed in the pull request. This helps ensure that pull requests stay small and that feedback is only added for what was changed.

My hypothesis was that I could easily optimize away all of these network requests by generating diffs locally using git diff <base>...<head>. This would save a ton of network time and help deliver reviews even faster.
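A minimal sketch of the idea, using a throwaway repository (all file and branch names here are invented for illustration): a three-dot diff compares the head branch against its merge base with the base branch, which is exactly the set of changes a pull request introduces.

```shell
# Throwaway-repo sketch; all file and branch names are invented.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git config user.email ci@example.com
git config user.name ci
git checkout -qb main
echo one > app.txt
git add app.txt && git commit -qm "base commit"
git checkout -qb feature
echo two >> app.txt
echo new > extra.txt
git add . && git commit -qm "feature work"
# Three-dot diff: compare feature against its merge base with main,
# i.e. only what the pull request itself changed.
git diff --name-only main...feature
```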

Diff parsing isn’t that simple

Stickler has parsed diffs since the beginning, so I figured parsing more diffs would be simple. I started off parsing locally generated diffs with the existing diff parser. For each review, I would:

1. Clone the repository.
2. Fetch the diff from the API, as I always have.
3. Generate a local diff using the new code.
4. Compare the results. If they differed, log an error to Sentry.
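The comparison step can be sketched roughly as follows. The file lists here are hard-coded stand-ins for the paginated API response and the local `git diff --name-only` output, and the Sentry call is replaced with an echo:

```shell
# Hypothetical stand-ins: in the real job these would come from the
# paginated API response and from `git diff --name-only base...head`.
remote_files='src/app.php
src/util.php'
local_files='src/app.php
src/util.php'
# Sort both lists so ordering differences don't count as mismatches.
if [ "$(printf '%s\n' "$remote_files" | sort)" != \
     "$(printf '%s\n' "$local_files" | sort)" ]; then
  echo "diff mismatch: would log an error to Sentry"
else
  echo "diff sources agree"
fi
```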

I’m glad I put this process in place, as it helped prevent disastrous results. My diff parsing was missing the following scenarios:

1. Support for blob files. These needed to be ignored.
2. Files being renamed and modified at the same time. GitHub handles this automatically in the API results, but I would need to parse the additional attributes in the diff output.
3. Some pull requests have no files changed.
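The rename-and-modify case in item 2 shows up as extra headers in raw diff output: when rename detection fires, git emits `similarity index`, `rename from`, and `rename to` lines that a parser has to understand. A throwaway-repo sketch (names invented):

```shell
# Throwaway-repo sketch; file names are invented.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git config user.email ci@example.com
git config user.name ci
printf 'line1\nline2\nline3\nline4\n' > old_name.txt
git add . && git commit -qm "add file"
git mv old_name.txt new_name.txt
printf 'line5\n' >> new_name.txt   # rename *and* modify in one commit
git commit -qam "rename and edit"
# -M turns on rename detection; the headers now carry rename metadata.
git diff -M HEAD~1 HEAD
```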

After solving the above problems, I was still seeing a number of reviews with drastically different file counts. In one example the remote file count was 6, but the local file count was 860!

Shallow history and diffs

Tracking down this discrepancy proved tricky, as it was entangled with another optimization.
To help keep reviews snappy, Stickler does shallow clones. When a repository is cloned, a command similar to this is run:

    git clone --depth 50 <url> /some/path

By fetching a shallow clone, the time spent cloning a repository is dramatically reduced. This is critical for fast-moving repositories that have gigabytes of history. Because shallow clones don’t have all the necessary history, an additional fetch is made for each branch head. These additional fetches help ensure that a comparison can be made.
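A shallow clone can be demonstrated entirely locally. In this sketch (paths invented), the `file://` URL matters because `--depth` is ignored for plain local-path clones:

```shell
# Local sketch of a shallow clone; --depth is honored over file://
# but ignored for plain local-path clones.
set -e
work=$(mktemp -d)
cd "$work"
git init -q upstream && cd upstream
git config user.email ci@example.com
git config user.name ci
for i in 1 2 3; do git commit -q --allow-empty -m "commit $i"; done
cd ..
git clone -q --depth 1 "file://$work/upstream" shallow
cd shallow
git rev-list --count HEAD   # only one commit of history came over
```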

I had assumed that the --depth option only applied to the initial clone, and that subsequent fetches would retrieve the entire history of the branch. At least, that was my assumption based on local testing. However, the results I was seeing in production contradicted this assumption. In the real world, a shallow clone also does shallow fetches on branch heads. This means that both branch heads only had 50 commits of history. If the two branches didn’t have a common ancestor in those 50 commits, git diff would trace ancestry much farther back in history, resulting in the huge diffs I was seeing in production.
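The failure mode can be simulated without any network at all. In this sketch, an orphan branch stands in for a branch whose common ancestor was cut off by the shallow fetch depth: with no reachable common ancestor, there is no merge base, so a three-dot diff has nothing sensible to anchor on.

```shell
# Throwaway sketch: the orphan branch simulates a branch whose
# common ancestor lies beyond the shallow fetch depth.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo && cd repo
git config user.email ci@example.com
git config user.name ci
git checkout -qb main
git commit -q --allow-empty -m "main history"
git checkout -q --orphan feature
git commit -q --allow-empty -m "feature history"
# With no reachable common ancestor, merge-base fails (exit code 1).
git merge-base main feature || echo "no merge base"
```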

Getting consistent results

After more debugging and testing, the only surefire way to get consistent results between the local and remote diffs was to fetch the entire history of the repository. Increasing the depth was my first instinct, but that would only push the failure threshold farther away. Cloning the entire history would be reliable and always correct, but at the cost of serious performance penalties, as large repositories are often multiple gigabytes.
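For an existing shallow clone, one way to get the entire history is `git fetch --unshallow`, which converts the clone into a full one (and pays the full transfer cost the shallow clone avoided). A local sketch with invented paths:

```shell
# Local sketch (invented paths): convert a shallow clone into a
# full one with `git fetch --unshallow`.
set -e
work=$(mktemp -d)
cd "$work"
git init -q upstream && cd upstream
git config user.email ci@example.com
git config user.name ci
for i in 1 2 3; do git commit -q --allow-empty -m "commit $i"; done
cd ..
git clone -q --depth 1 "file://$work/upstream" clone && cd clone
git rev-list --count HEAD    # 1: shallow history
git fetch -q --unshallow     # pays the full-history transfer cost
git rev-list --count HEAD    # 3: full history now present
```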

Based on the distribution of files changed in pull requests, most of them do not fall into the problematic scenario where 50+ API requests are needed to fetch the entire diff from the API. However, there are many Stickler customers with repositories in the 3+ gigabyte range, and doing full clones would give them a terrible experience.

Parting thoughts

My simple optimization got messy as the real world added constraints. Looking at real-world data helped put my solution and its trade-offs into perspective. Not all performance problems are quick fixes, and sometimes the trade-offs are not worth the gains.

