Building a local RAG in Go

I had to go through an “AI” course recently for work, which included a couple of homework assignments. The intent was that these would be company-specific, but because 1 of these had to be posted on GitHub[1], I have a version without internal data. That was the RAG assignment (probably the closest thing the “AI” hype has to a redeeming value). In theory, posting this would seem academically dishonest, but 1) I had to make the repo public to submit it, 2) it wasn’t really “graded” so much as “checked to see you did it at all,” and 3) I assumed that everyone was just slopping it out as a vibe coding exercise anyways. Since they didn’t really seem to care if I learned anything, I don’t really care about posting the answer. For people who do want to learn and do it themselves, this isn’t really that complicated, and can be done in your spare time over a weekend.

The inspiration for this approach was a devopstoolbox video on libsql and its vector database capabilities (since vector databases are the backbone of a lot of RAG systems). That gives me a nice local database, and Ollama has been around since shortly after LLMs got popular. Put together, those 2 meant I could do this whole thing on my laptop.

As an aside, I generally think local LLMs are a better pursuit of your time than the “big guns.” So I figured if I had to do “AI” nonsense homework, I may as well a) do it locally, which I think is more valuable and useful than just integrating with a vendor, b) do it in Go, which I’m trying to work with whenever I can, and c) also do it this way because if I have to do LLM crap, the least I could do was try to get whatever joy out of it I could.

The assignment was to build a RAG that solves a problem at work[2]. I grabbed some support docs, put them into a .gitignored directory, and wrote a support lookup engine. This tool has 4 parts: main.go, which is the main app itself (a command line tool where you ask questions, search the support docs, and get the answer), the database code (for interacting with libsql), an ingestion script that loads data from the source documents into the database, and an LLM wrapper for calling Ollama to pretend that clankers can think.

RAG in a nutshell

If you’re not familiar with RAG, it stands for Retrieval Augmented Generation. In layman’s terms, it’s looking up additional information to pass to an LLM so it sounds like it’s talking out of its @$$ less. The most common example is full text search. Essentially you take raw text, break it up into chunks, do some LLM mathing (that every course on “AI” likes to skip over) to turn the text chunk into a numeric vector. Those are stored in databases (that support them). Then, your query is also converted into a vector, and the database is searched for the vector fields closest to the vector input (that math is regular old linear algebra). Based on those matches, you can link back to the original source to pass as context to an LLM for actual parsing and output.
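To make the “regular old linear algebra” part concrete, here is a minimal sketch of the similarity search done in plain Go instead of inside the database. The vectors are made up by hand and the `Chunk`, `CosineSimilarity`, and `TopK` names are mine for illustration; in the actual app, the vectors come from an embedding model and libsql does the searching.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Chunk pairs a piece of source text with its embedding vector.
// The vectors in main are hand-written stand-ins, not real embeddings.
type Chunk struct {
	Text   string
	Vector []float64
}

// CosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 if the lengths differ or either vector has zero magnitude.
func CosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// TopK returns up to k chunks ordered from most to least similar to query.
// The input slice is copied, so the caller's ordering is left alone.
func TopK(query []float64, chunks []Chunk, k int) []Chunk {
	ranked := make([]Chunk, len(chunks))
	copy(ranked, chunks)
	sort.SliceStable(ranked, func(i, j int) bool {
		return CosineSimilarity(query, ranked[i].Vector) >
			CosineSimilarity(query, ranked[j].Vector)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	chunks := []Chunk{
		{Text: "How to reset your password", Vector: []float64{0.9, 0.1, 0.0}},
		{Text: "Billing and invoices", Vector: []float64{0.0, 0.2, 0.9}},
		{Text: "Recovering a locked account", Vector: []float64{0.8, 0.3, 0.1}},
	}
	// Pretend this is the embedding of "I can't log in".
	query := []float64{1.0, 0.2, 0.0}
	for _, c := range TopK(query, chunks, 2) {
		fmt.Printf("%.3f  %s\n", CosineSimilarity(query, c.Vector), c.Text)
	}
}
```

A brute-force scan like this is fine for a handful of chunks; the point of a vector database is doing the same comparison without loading and sorting everything yourself.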

For the record, vector databases for arbitrary text searches are not the only form of RAG out there – there are also regular database queries, API calls, file loading, etc. Any lookup of data that gets passed to an LLM qualifies; vector search is just the most cited example.

On to the app

On startup, the app tries to connect to the database and ensures the tables it needs are present, creating them if they aren’t. Then it reads the data (not tracked in the repo), vectorizes it, and stores it in libsql. Cool, now we’re ready to ask questions. At this point, we start a loop that runs until you type the phrase “all done.” You ask a question, and the app looks for keywords in it that it can use to “categorize” the question and limit its vector search to any categories it can identify. The app searches the database for similar vectors and returns the associated documents, which are then passed to Ollama as additional context. All without calling a remote server, anywhere. That’s it, no magic. Most of the work was in ingesting the data and setting up the vector storage and querying; once that was solved, everything else was straightforward enough.

The biggest thing I would change would be the Ollama models I used. Originally, I grabbed the first model names I could find (1 for embeddings, and 1 for general text processing). They weren’t in any way coordinated and it showed. Switching to Qwen for both its embeddings model and its regular model would have made for better output.

As for how it works…well, it’s good enough to say I built a RAG system, but not good enough to go onto your servers just yet. In addition to fixing the models (which is a simple change in llm.go), there’s also some work to be done with chunking and backfilling[3]. To be fair, chunking and backfilling need to be re-evaluated and tuned on a per-dataset basis.

Also in the “good enough for a homework assignment but not for really using” category – I didn’t make ingestion idempotent. If you look at the Makefile, I remove the database on each run and re-populate it on startup. That’s handy for the initial building/debugging/tuning of breaking the documents into vectorized chunks for searching, but bad for a “real” production system. Remember, homework assignments typically hit “done” long before being production-ready. In this case, that was at a point where I still wanted to optimize for retries and refreshes over stability and permanence.
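If I did want idempotent ingestion, one common approach (not what the repo does) is to key each chunk by a hash of its source and content and skip anything already stored. The sketch below uses an in-memory map as a stand-in for the database table; `ChunkID`, `Store`, and `InsertIfNew` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ChunkID derives a stable ID from the source path and the chunk text, so
// the same chunk from the same file always maps to the same key.
func ChunkID(source, text string) string {
	sum := sha256.Sum256([]byte(source + "\x00" + text))
	return hex.EncodeToString(sum[:])
}

// Store is an in-memory stand-in for the chunks table. In a real database
// the ID would be a primary key or unique column.
type Store struct {
	rows map[string]string
}

// NewStore returns an empty Store.
func NewStore() *Store {
	return &Store{rows: make(map[string]string)}
}

// InsertIfNew stores the chunk only if its ID is unseen and reports whether
// it was inserted. A false result means the embedding step can be skipped.
func (s *Store) InsertIfNew(source, text string) bool {
	id := ChunkID(source, text)
	if _, exists := s.rows[id]; exists {
		return false
	}
	s.rows[id] = text
	return true
}

func main() {
	store := NewStore()
	// Running the same "ingestion" twice only inserts on the first pass.
	for pass := 1; pass <= 2; pass++ {
		inserted := store.InsertIfNew("docs/reset.md", "Open Settings, then Security.")
		fmt.Printf("pass %d inserted=%v\n", pass, inserted)
	}
}
```

This only handles re-running over unchanged documents; edited or deleted source files would need their stale chunks cleaned up separately.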

1 change I did have to make was to the embedding field in my database. The devopstoolbox video uses an F32_BLOB; I had to use F64_BLOB. I’m not sure if that’s because of my chunk size or my data, but just know you’re not limited to F32.

This isn’t the greatest RAG querying setup (and if you look at the evals when you run it, you can see that). The model upgrade would likely help some, as would tweaking how I chunk the data and how much overlap there is between chunks. The other thing is that in this instance, I’d drop the categories (again, they were force-fit in for homework requirements). That force-fitting did more harm than good, as the categories weren’t well thought-out, didn’t make for clear boundaries, and in general made it easy for the search to look in the wrong places.

Like I mentioned, RAG is probably the most useful thing to come out of the “AI” hype as it enables the “better search” experience people have working with LLMs, since it does a better job of matching things like the search query, even if the search terms you entered don’t match the page contents exactly. This seems like a better way of doing full-text search than the Lucene-based approach of “get machines with butt-tons of RAM and load everything in there.” In fact, the LLM part is optional, the vectorized search is what actually shows the most potential. Hopefully research into, and implementation of, vectorization will continue, even after the “AI” bubble inevitably bursts.

  1. I slapped this together for a “homework” submission, and haven’t updated the code since, so there’s a good bit of cleanup that it could certainly use that I don’t care enough to put into it ↩︎
  2. There was technically more, including evaluations (which print with the answer, both for the assignment demo and as general debugging tools). For some other requirement (I forget which), I also needed to implement categorization, so there are some categories slapped in there too. ↩︎
  3. I think that’s the term. Basically, the beginning of each chunk should overlap with the end of the previous chunk. This gets you better matches when you do your searches later. ↩︎