My financial data scraping system

This article is part of the Self-hosted Finances series.

    In my post Importing and cleaning my Mint transactions, I worked through loading and cleaning the data and solving for transfers.

    However, Mint and other financial scraping tools are not authoritative and don’t expose everything that the bank itself will provide. For example, Mint and Monarch don’t have detailed enough stock transaction and position data to identify cost basis and tax lots. Going directly to the bank can give me higher-precision timestamps, scans of checks, merchant addresses, and other attributes.

    Summary

    I’ve been working on this project for the better part of 2024 at this point. Since none of my banks or brokerages provided any sort of obvious API to access my data, I created a few different syncing jobs that use Playwright, a web browser testing and automation framework, to log in to my accounts and pull data. I then store all of that data in a custom PostgreSQL DB and load it into Firefly and Ghostfolio. I also wrote Apache Zeppelin notebooks to do ad-hoc analytics and test new algorithms.

    With this newfound data, I found it possible to do quick analytics. For example, I was able to easily write some Python to decide whether I was getting enough returns to justify a credit card’s annual fee.
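    As a rough illustration, here’s the shape of that check. This is a minimal sketch: the table, column names, and rewards numbers are hypothetical, not my actual schema or card.

    ```python
    # A rough sketch of the annual-fee check. The table, column names, and
    # rewards numbers are hypothetical, not my actual schema or card.
    import psycopg2

    ANNUAL_FEE = 95.00      # the card's annual fee
    CASHBACK_RATE = 0.02    # assumed flat rewards rate on purchases

    conn = psycopg2.connect("dbname=finances")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT COALESCE(SUM(amount), 0)
            FROM transactions
            WHERE account = %s
              AND date >= NOW() - INTERVAL '1 year'
              AND amount > 0  -- purchases only (sign convention is an assumption)
            """,
            ("my-rewards-card",),
        )
        yearly_spend = float(cur.fetchone()[0])

    rewards = yearly_spend * CASHBACK_RATE
    print(f"Spent {yearly_spend:.2f}, earned roughly {rewards:.2f} in rewards")
    print("Keep the card" if rewards > ANNUAL_FEE else "The fee isn't paying for itself")
    ```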

    The Architecture

    An architectural flow diagram showing the different components in use, explained in this section.

    The architecture breaks down into a few different components. First, there are the banks and brokerages that I want to scrape. I use Monarch to scrape all of the banks, and my own Playwright-based scrapers to pull data from the institutions I specifically wrote scrapers for. This all gets written into a few different PostgreSQL tables. The Postgres tables are there as a copy of the data because they contain more data than Firefly/Ghostfolio can store, and they provide a place for me to test new transfer-solving algorithms and rebuild the Firefly/Ghostfolio databases as I make improvements.
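    To give a sense of what that copy looks like, here’s a hypothetical sketch of one such table (the real schema differs). The raw JSONB column is what lets me keep fields that Firefly/Ghostfolio can’t represent and replay everything later.

    ```python
    # A hypothetical "copy of the data" table. The raw JSONB column keeps the
    # full scraped payload, including fields Firefly/Ghostfolio can't store,
    # so I can rebuild their databases later as my algorithms improve.
    import psycopg2

    conn = psycopg2.connect("dbname=finances")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS raw_transactions (
                id          BIGSERIAL PRIMARY KEY,
                institution TEXT           NOT NULL,
                account     TEXT           NOT NULL,
                posted_at   TIMESTAMPTZ    NOT NULL,
                amount      NUMERIC(14, 2) NOT NULL,
                description TEXT,
                raw         JSONB          NOT NULL,  -- everything the scraper saw
                scraped_at  TIMESTAMPTZ    NOT NULL DEFAULT NOW()
            )
            """
        )
    ```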

    Security

    Security is paramount given that I have to store my own account credentials to log into my accounts. I didn’t want to store my passwords in a Kubernetes Secret, which would have been too risky in case of a malicious actor. Instead, I opted to leverage Vault and OpenBao (an open-source Vault fork), a software-based secret storage system, to store everything sensitive, including the usernames, passwords, TOTP secrets (time-based one-time passwords: codes that rotate roughly every 30 seconds), and cookie jars.
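    For illustration, a sync job can pull its credentials out of Vault/OpenBao with the hvac client along these lines. The Vault URL, Kubernetes auth role, secret path, and field names here are placeholders, not my real layout.

    ```python
    # A minimal sketch of pulling credentials out of Vault/OpenBao with hvac.
    # The Vault URL, auth role, secret path, and field names are placeholders.
    import pathlib
    import hvac

    client = hvac.Client(url="https://vault.internal:8200")

    # Authenticate with the pod's Kubernetes service account token, since Vault
    # is configured to only trust certain namespaces.
    jwt = pathlib.Path(
        "/var/run/secrets/kubernetes.io/serviceaccount/token"
    ).read_text()
    client.auth.kubernetes.login(role="finance-scraper", jwt=jwt)

    secret = client.secrets.kv.v2.read_secret_version(path="scrapers/fidelity")
    creds = secret["data"]["data"]
    username = creds["username"]
    password = creds["password"]
    totp_seed = creds.get("totp_seed")  # only present for banks that support TOTP
    ```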

    Why cookie jars? As it turns out, banks get nervous when you keep logging in from a fresh browser each time. They’ll make you go through verification steps that may include an SMS or a mobile app push notification. So I sign in once manually and approve the request; then every time my sync job runs, I restore the entire cookie jar into the browser and save it again afterwards. This makes my scraping job look somewhat human.
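    With Playwright, that round trip can look roughly like this minimal sketch. Where the jar actually lives (a local file here; Vault in my setup) and the bank URL are placeholders.

    ```python
    # A sketch of the cookie-jar round trip using Playwright's storage_state.
    import os
    from playwright.sync_api import sync_playwright

    STATE_FILE = "cookie-jar.json"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Restore cookies + localStorage from the last run so the bank sees a
        # "known" browser instead of a fresh one (the first run starts empty).
        context = browser.new_context(
            storage_state=STATE_FILE if os.path.exists(STATE_FILE) else None
        )
        page = context.new_page()
        page.goto("https://bank.example.com/accounts")
        # ... scrape ...
        # Persist the possibly refreshed cookie jar for the next run.
        context.storage_state(path=STATE_FILE)
        browser.close()
    ```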

    All data in Vault is encrypted at rest and encrypted in transit, so no plain-text password is stored on disk. The decryption key is stored only in memory. If my server ever restarts, I have to re-enter the recovery key(s) to unlock the Vault database.

    Vault is then configured to only trust certain Kubernetes namespaces to read/write secrets. If you want to see exactly how this is configured, read my Vault install post.

    Why not scrape everything myself?

    In a previous post, I talked about some issues I had with Monarch around their web application loading ad scripts. This has not yet been fixed, and I hope they do fix it, so it might seem unusual that I’m still using it. Unfortunately, I realized that I couldn’t develop scrapers for every single institution. Some of my banks (e.g. Citi) required me to 2FA (two-factor authentication: a password is one factor, an extra code makes it two) using SMS every single time I logged in, which was impractical to automate. Monarch and their data providers (Plaid, MX, and Finicity) have the ability to do OAuth and API-based communication with banks, but that’s not possible for me without compliance reviews.

    Monarch handled the banks and credit cards, whereas I focused on the stock brokerages because Monarch wasn’t going to capture the data that I needed.

    Institutions

    Venmo

    Venmo was comparatively easy versus the other sites. I did this one mostly as a prototype for a simple, non-stock scrape target. Their login page seems to use reCAPTCHA, but it never prompted me with a captcha. What’s nice is they have a hidden JSON API for transactions, but it’s wonky. My Playwright script logs in to venmo.com, navigates to Statements, finds the Download CSV button, gets its URL, modifies it to request JSON format and the last 30 days, then downloads that data and updates my SQL table.
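    In outline, the job looks something like the sketch below. The selectors and query parameter names are illustrative placeholders; the real statement endpoint differs.

    ```python
    # A rough outline of the Venmo job. The selectors and query parameter
    # names are illustrative placeholders; the real statement endpoint differs.
    from datetime import date, timedelta
    from urllib.parse import parse_qs, urlencode, urlparse, urlunparse
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(storage_state="venmo-cookies.json")
        page = context.new_page()
        page.goto("https://venmo.com/account/statements")

        # Find the Download CSV button and borrow its URL.
        csv_url = page.get_attribute("a:has-text('Download CSV')", "href")

        # Rewrite the query string to ask for JSON and the last 30 days.
        parts = urlparse(csv_url)
        params = parse_qs(parts.query)
        params["format"] = ["json"]  # hypothetical parameter name
        params["startDate"] = [(date.today() - timedelta(days=30)).isoformat()]
        params["endDate"] = [date.today().isoformat()]
        json_url = urlunparse(parts._replace(query=urlencode(params, doseq=True)))

        # Fetch it through the browser context so the session cookies are reused.
        transactions = context.request.get(json_url).json()
        # ... upsert `transactions` into the Postgres table ...
        browser.close()
    ```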

    Fidelity

    This one consumed most of my time. During login, Fidelity would randomly require a 2FA verification if I ever changed from my server to my dev desktop even though I statically defined the User-Agent. My scraper jobs don’t handle 2FA very well yet. Scraping the data was painful because Fidelity’s website seems to be composed of several different web frameworks and engines.

    Some data, like positions and activity, was available as CSV exports, which I used when possible, but the positions data only contains information at the stock level, not at the lot level.

    Other data is returned as server-rendered HTML, and some of it comes from JSON APIs. Sometimes I couldn’t figure out how to replicate the web requests directly, so I would have to click a button and sniff the requests using Playwright’s expect_request mechanism. Other times I had to use BeautifulSoup to parse the HTML responses.
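    The sniffing approach looks roughly like this sketch; the URL glob and button text are placeholders.

    ```python
    # A sketch of sniffing the underlying API call with expect_request when I
    # can't reconstruct the request myself. The URL glob and button text are
    # placeholders.
    from playwright.sync_api import Page

    def sniff_positions(page: Page) -> dict:
        with page.expect_request("**/positions**") as request_info:
            page.click("text=Positions")   # trigger the XHR by clicking the UI
        request = request_info.value
        response = request.response()      # the response paired with that request
        return response.json()
    ```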

    The server-rendered HTML was problematic because it was designed for humans. For example, the tax lot table would contain icons for wash sales or include non-dates like the string Various and other random details that were hard to process programmatically.
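    Here’s a simplified sketch of what handling that looks like with BeautifulSoup. The selectors and column order are hypothetical; tolerating values like Various is the real point.

    ```python
    # A simplified sketch of parsing the tax-lot HTML with BeautifulSoup. The
    # selectors and column order are hypothetical; the real table is messier.
    from datetime import datetime
    from bs4 import BeautifulSoup

    def parse_tax_lots(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "html.parser")
        lots = []
        for row in soup.select("table.tax-lots tbody tr"):
            acquired, quantity, cost_basis = [
                td.get_text(strip=True) for td in row.select("td")[:3]
            ]
            lots.append({
                # "Various" isn't a date, so record it as unknown instead of crashing.
                "acquired": None if acquired == "Various"
                else datetime.strptime(acquired, "%m/%d/%Y").date(),
                "quantity": float(quantity.replace(",", "")),
                "cost_basis": float(cost_basis.replace("$", "").replace(",", "")),
            })
        return lots
    ```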

    Betterment

    Betterment was another (not so) fun scraping target. The plus side is that while they did require 2FA, they supported TOTP-based auth. I stored the TOTP seed in my Vault DB and used pyotp to generate the token. I’ve since learned that Vault supports a TOTP provider natively, which would be more secure because it doesn’t vend out the TOTP seed, it just vends out the OTP codes. That’s better because it reduces the risk of the seed being fully exposed.
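    Generating the code from the stored seed is essentially a one-liner with pyotp (the seed below is obviously fake; the real one comes out of Vault at runtime):

    ```python
    # Generating the 2FA code from the stored seed. The seed here is fake.
    import pyotp

    totp_seed = "JBSWY3DPEHPK3PXP"
    code = pyotp.TOTP(totp_seed).now()  # 6-digit code that rotates every ~30 seconds
    print(code)                         # the scraper types this into the 2FA field
    ```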

    Once logged in, it was difficult to get the data I needed. They had some basic CSV exports, but they were severely lacking in useful data. The transaction-level CSV is unsuitable because it doesn’t contain symbol information. Their reports don’t even provide quarterly gains data, which makes estimated tax payments hard.

    I even email them every few years, and they have never implemented any of my requests. They told me: “We are not able to provide API access to an individual, though as noted you can access information through programs such as TurboTax.” Clearly they have APIs; they just don’t make them available.

    The only way to get access to activity data was to scrape PDF files. Which, being a masochist, I did. Did you know that PDFs don’t have a DOM like HTML? It’s just absolute positioning of content. So how do you parse a table of content? You have to grab the (x, y) coordinates of the text that you expect to appear above the table and of the disclaimers that appear below it, grab all the text in between, and somehow turn that into a table.
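    As a minimal sketch of that approach, here’s the general shape using pdfplumber as a stand-in PDF library; the header and disclaimer strings that bracket the table are placeholders, not necessarily what my parser actually looks for.

    ```python
    # A minimal sketch of the coordinate-based approach, using pdfplumber as a
    # stand-in PDF library. The bracketing strings are placeholders.
    import pdfplumber

    HEADER_TEXT = "Activity"       # a word that appears just above the table
    FOOTER_TEXT = "Disclosures"    # a word that appears just below it

    with pdfplumber.open("betterment-statement.pdf") as pdf:
        page = pdf.pages[0]
        words = page.extract_words()  # each word carries its text and x/y position

        top = max(w["bottom"] for w in words if w["text"] == HEADER_TEXT)
        bottom = min(w["top"] for w in words if w["text"] == FOOTER_TEXT)

        # Keep only words vertically between the header and the disclaimers, then
        # group them into rows by y-coordinate and order each row by x-coordinate.
        rows: dict[int, list] = {}
        for w in words:
            if top < w["top"] < bottom:
                rows.setdefault(round(w["top"]), []).append(w)
        for _, row in sorted(rows.items()):
            print(" | ".join(w["text"] for w in sorted(row, key=lambda w: w["x0"])))
    ```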

    Citibank

    Citi was unworkable. They required an SMS OTP every single time I signed in, with no option to remember the device. While I could try to extract the OTP from the SMS messages on my phone using Home Assistant’s Android app sensor, I have not gone down that path given the fragility of that setup.

    What did I learn?

    Banks don’t make it easy to scrape your own data. They make it hard with 2FA using push notifications or SMS (which is no longer recommended by NIST). I also have to take care not to authenticate too often, which makes debugging difficult because I sometimes debug the script by running the entire sync job, which logs out and logs back in. My session saving only stores cookies and localStorage, but does not persist any session cookies or sessionStorage.

    Synchronizing data without strong identifiers is tricky. When scraping, I basically get a tuple of (date, amount, description). Without some kind of identifier that uniquely and stably identifies each transaction, it’s challenging to figure out when a transaction should be created, updated, or deleted. This is similar to the problem I had when identifying transfer pairs. Instead I have to match based on dates, amounts, and sometimes descriptions, but that often causes me to delete and recreate instead of updating. This isn’t a major problem because it self-corrects, but it is annoying.
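    The matching boils down to something like this sketch: key both sides by (date, amount) and treat anything unmatched as a create or a delete, which is exactly why edits often turn into delete-and-recreate.

    ```python
    # A sketch of the fuzzy diff I fall back on without stable IDs.
    from collections import Counter

    def diff_transactions(scraped: list[dict], stored: list[dict]):
        def key(txn):
            return (txn["date"], txn["amount"])

        scraped_keys = Counter(key(t) for t in scraped)
        stored_keys = Counter(key(t) for t in stored)

        to_create = list((scraped_keys - stored_keys).elements())  # new at the bank
        to_delete = list((stored_keys - scraped_keys).elements())  # gone from the bank
        return to_create, to_delete
    ```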

    Where’s the code?

    Right now, my code is pretty coupled to my system; however, I’m slowly moving it to my GitHub repo: ajacques/own-your-finance.

    What’s next?

    • Improve reliability - right now the scrapers don’t handle unexpected things well, like 2FA or slow or failed navigations. Next up is implementing a 2FA handler to fetch OTP codes
    • Handle more institutions - my goal is to support as many of the banks that I or my friends use as possible, to reduce dependency on aggregators
    • Better analytics - right now Firefly and Ghostfolio are fine, but they lack some of the features I’m looking for, like good charts and tables

    Comments

    Comments are currently unavailable while I move to this new blog platform. To give feedback, send an email to adam [at] this website url.
