A few weeks ago I got an email from a customer:

I just did an import and categorized the transactions. The month of November has transactions, but they don't show up in the budget. It says $0.00 for a category, but if I click on it they show up in the transactions list.

The budget should be pretty solid by now, so my first thought was either a problem specific to their filesystem (is the disk full?) or a rare edge case between the API or other less-used features.

However, after asking a few questions, it became clear this user had done nothing out of the ordinary. Like any new user, they added some transactions and categorized them. Their computer was normal. It should have worked.

Around the same time, a friend who uses the app messaged me saying he was seeing some funky budget values. Two users seeing a similar problem? My head started exploding with red flags. I recently went through and categorized lots of transactions, so I frantically opened the app to view my data and sure enough, the budget values were wrong. The transactions weren't showing up.

Mortified. If there's one thing a personal finance app should get right, it's summarizing your spending amounts in the budget. It was late at night and my head was buzzing with reasons for why this could happen. The world around me melted away — I couldn't think about anything else.

I needed to go to sleep soon, so I looked at error reports. Nothing out of the ordinary. I looked for any patterns to isolate the problem. This couldn't be a widespread problem because I talk to a couple users on a regular basis who have been happy with it. Feedback from new users recently haven't mentioned this. How could it have happened to 3 of us now?

I scanned the code related to the budget and how categorizing works. I saw a couple problems, but only things that should show up in extreme cases (the hard disk gets full, etc). Finally, I noticed one thing out of place: in a critical process, it does something specific only on the web version of the app. Actual persists data differently on the web, and it's not as well-tested as the other platforms (desktop and mobile). There's some extra code that persists changes to IndexedDB, and a problem there could cause transactions to update without the budget updating.

It must be a problem with the web app. I was fairly confident I used the web app to categorize, and I sent off emails to the users asking if they used the web app. While I waited to hear back, I scanned the error logs to look for any IndexedDB errors on the day that I categorized or when the user emailed me. Nothing. Why wasn't I seeing any errors? I told myself it must be IndexedDB failing, brushed off the lack of logged errors, and tried to go to sleep.

I didn't sleep very well.

The next day, I got a response and sure enough both users most likely used the web app. That was enough to support my theory, so I went from there.

Rethink your assumptions

We've all been here: something is wrong in production and you have no idea why it's happening. You might be staring at code thinking "this should never happen" and yet you're looking at errors from production saying that exact thing happened. Usually it's buried somewhere severals steps removed from the thing actually causing it.

The first thing you need to do is reproduce the problem locally. Without that, you're just guessing and the feedback loop of debugging in production could take weeks or months.

Reproducing an obscure error locally is a skill. Brute force doesn't work — you need to understand the system as a whole and piece together theories based on any data you have (error reports, logs, user-facing behavior, etc).

One thing that helps me is to always re-check my assumptions before diving too deep. If I'm convinced something is the problem, and if I haven't figured it out soon after, I look at all the data to make sure my theory makes sense. If something's off, I try to think of anything else that could be the problem. This avoids diving super deep into one theory and wasting time.

Reproducing the bug

Even though the bug already happened to me, any traces of it were long gone. I tried categorizing, undo, bulk categorizing, moving transactions between accounts, and it all worked. The budget showed the correct values.

I'm going to list all the theories I walked through. It only took me about a day to figure it out, but it could have taken much longer if I focused too much on any one of these.

Each of these is a deep technical dive into how Actual works. You don't have to read them: the short story is after running through many scenarios, I found one which reproduces the bug. It's a rare edge case that only happens on the web.

While working through this, I did find another logic bug regarding the budget cache that happens on all platforms. All of these fixes will be available in the next version 0.0.127. If you still see any problems after the next release, please don't hesitate to reach out!

Theory #1: IndexedDB failure

I don't trust IndexedDB. There's confusing documentation about browsers doing weird things if user's disk space runs low. Is it permanent storage or not? My feeling is that it's supposed to be permanent, but browers may purge it in rare situations. Whatever that means.

There's also the max limit on storage size. When you hit the limit, what happens? It should throw a QuotaExceededError, which makes sense. But the max limit is so different for each browser, and if it's small (which it used to be on Safari, something like 50MB) Actual could hit that.

I haven't optimized Actual's storage for the web yet. My database which has 5 years worth of data is ~11MB. That should fit nicely in IndexedDB, but how I'm storing it doesn't play nicely in some browsers (Safari) and seems to aggravate sqlite's WAL which backs IDB in Safari. Chrome seems better, but I haven't thoroughly test in all browsers yet.

Given all this, plus the fact that I see weird IndexedDB errors logged (especially from Safari), when I noticed a line in the sync system that saves to IndexedDB a red flag immediately popped up.

To understand the rest of the theories, let me explain a little bit how Actual works: all changes to data go through single code path. That code is responsible for committing the data and updating anything, like the budget, that depends on it. On desktop and mobile, native sqlite is used, so the (very simplified) code looks like this:

async function applyChanges(changes) {
  await commitToDB(changes)
  updateSpreadsheet(changes)
}

changes is a list of updates to apply, commitToDB actually changes the sqlite database, and updateSpreadsheet triggers any updates that need to respond to the changes. But there's a trick: for performance reasons, we cache the spreadsheet values so they aren't always recalculated. The whole process looks like this:

async function applyChanges(changes) {
  await commitToDB(changes)
  updateSpreadsheet(changes)
  
  // `updateSpreadsheet` can do async work, so at some
  // point in the future when it finishes it will trigger
  // a `saveSpreadsheetCache` which commits the cache
  // to the database
}

That's all well and good, but there's another wrinkle: on the web version we use an in-memory sqlite db because we can't use native files. This works well, but the obvious problem is that your changes aren't going to persist. To solve that, on the web we store your changes in IndexedDB and replay them when needed. So here's the code with that added:

async function applyChanges(changes) {
  await commitToDB(changes)
  
  if(web) {
    await persistChangesToIDB(changes)
  }
  
  updateSpreadsheet(changes)
  
  // `updateSpreadsheet` will commit the spreadsheet
  // cache at some time in the future
}

If you look at the code above, you'll see the big red flag if the web version has a problem with stale budget values. Obviously persistChangesToIDB is failing, which means the transactions get saved to the db but the spreadsheet doesn't update!

However, the behavior users saw didn't match what would happen in this case for several reasons:

  • commitToDB only commits it to a local in-memory db. The changes wouldn't persist. Users (and me) saw updated data after refreshing the page, so the changes persisted even though the budget cache did not.
  • If persistChangesToIDB was erroring, I would see errors in the logs. I saw nothing.

Theory #2: Silent errors

At this point, I don't think I had grasped the fact that persistChangesToIDB had to be working since changes persisted. The lack of errors bothered me a lot.

Here's the code for most of persistChangesToIDB. I wasn't checking the result of objectStore.put so I wondered if it was throwing an error and I wasn't handling it. At this point I tweeted looking for someone who could audit my IDB code to make sure I was doing it correctly.

I really wanted to bring someone on to help, but looking more at the code any errors should fire on transaction.onerror which I am handling.

In fact, I noticed my comment in the function: // Don't do anything if it fails. Apparently I already thought of this case, so even though errors would be suppressed, applyChanges should continue executing and update the spreadsheet.

Theory #3: Safari

At this point I started getting very convinced that it was a Safari-specific problem. The problem with this thinking is that Safari has a very small marketshare, and if I believed this I might take the bug less seriously. I almost added a bunch of logs and watched them over time to debug.

My theory was that in Safari, maybe persistChangesToIDB was somehow never completing. In that case applyChanges would never update the spreadsheet, you would see stale values, and the changes would still be persisted. But for whatever reason, maybe Safari wasn't firing transaction.oncomplete correctly.

Both users said they probably used Safari. Unfortunately I use Safari too, so I almost gave it up as a weird Safari quirk. Luckily, not only did it bother me to just blame Safari (it's never a compiler/browser/etc bug… most of the time), I was pretty sure I used Chrome when I did a bunch of categorizing.

I also realized a fatal flaw in this theory: applyChanges never runs concurrently. Because the workflow is so critical, it would be problematic to run it multiple times at once. The actual implementation is a lot more complex and makes transactional assumptions that require data to not change underneath it. That means if persistChangesToIDB never resolved, the entire system would hang because any updates would be blocked on the unresolved update. It was clear none of us saw this happen.

Theory #4: Persisting db problems

There's another trick I haven't explained yet: so we store all the changes… but what happens when they grow large? We don't want to store tens of thousands of messages forever. It would eventually slow down opening the app (because it replays them).

Every so often, the entire in-memory db is persisted into IndexedDB. For 5 years worth of data, it's an 10MB Uint8Array. Luckily browsers handle storing large data in IDB very well. When we persist the db, the entire db is saved to IDB and the changes are cleared out because we don't need them anymore. When the app loads, it reads the db snapshot from IDB and loads it, and then replays any changes stored in IDB.

We could periodically persist the db, but instead inside persistChangesToIDB it checks to see how many changes are in IDB and if the total number is over a threshold, instead of saving more changes it just persist the entire db. That means when you make an update, every now and then it will flush the whole db to disk. It's sort of like a write-ahead log. There are two cases this handles:

  • You make a bunch of small changes. At some point, one small change will make it go over the threshold and it'll flush.
  • You make one large batch change. In this case, the amount of changes it produces is so many that it goes over the threshold no matter what, and instead of saving them to IDB it just flushs the disk. I've found that's faster to flush the db to disk than to save 3,000 individual changes.

In pseudo-code persistChangesToIDB looks like this:

async function persistChangesToIDB(changes) {
  if((getExistingCount() + changes.length) > FLUSH_THRESHOLD) {
    await persistDatabase();
  }
  else {
    // Save changes to IDB
  }
}

I immediately noticed that persistDatabase does not suppress errors, so it could fail and cause stale data. However, any errors though should show up logs and I haven't seen any.

I opened the app and made changes that triggered a full db flush, but it all seemed to work. Slightly discouraged, with how complex this is it had to related to this somehow. I took a shower.

Theory #5: Out of order flushing

It really felt like theory #4 should have worked, and I was racking my brain to come up with a new theory. Looking at all my guesses, I checked them all off. Theory #4 still felt the most likely. Something felt off there but I couldn't put my finger on it.

Finally, it clicked! I had recently moved the db flushing inside of persistChangesToIDB. Previously, it would happen periodically (like every 5 minutes) but the problem with that is bulk editing would easily fill up the changes in IDB with way too many messages.

But I can't flush the db in persistChangesToIDB! Any guess why?

At that point, all changes have been committed to the db but the spreadsheet cache has not been updated yet. This means when the app is loaded next time, it will create the db from the snapshot. The db ignores changes that have already been applied — so it doesn't apply those changes again. But it still has an old cache! The spreadsheet cache never got updated in response to those changes.

I rushed to my computer, filled up the changes in IDB so that it would flush the db and then categorized some transactions and sure enough, it flushed the db and when I reloaded the app the budget values were wrong. Yes!! Or no!! It's bad that this happened, but I was so happy to reproduce it so it can be fixed.

The fix

There are many thing I want to improve here. The immediate fix is to avoid flushing the db at that point and go back to flushing it more rarely. To avoid filling the IDB with many changes I'm looking into storing a more compressed version of the changes in IDB.

This has highlighted a critical problem that exists on all platforms: update the db and the spreadsheet cache is not transactional. This case was a logic bug, but it's still possible for something to go wrong in between the time that the db commits changes and the spreadsheet cache is updated.

To fix that, I'm going to add a flag in the database that specifies whether or not the spreadsheet cache is valid. It will flip it false when committing changes to the db, and then flip it true when committing the spreadsheet cache. When the app boots up, it checks the flag and if it's false, it will ignore the cache and recalculate the entire spreadsheet to bring all values up-to-date. This adds a couple seconds to startup time, but should only happen in very rare case when something goes wrong.

Since writing this post, I have found one other logic bug regarding budget values that I will go into in another post.

All of these improvements will available in the next version 0.0.127. Showing you accurate is data is super critical and every release gets closer to an app that is 100% robust. If you still see any problems after the next release, please don't hesitate to reach out!