How to Build Your Own Local RAG System

Thomas Uhrig · November 14, 2025

AI is all around, but when it comes to actually using it, many organizations move slowly. Discussions about data protection, governance, integrations, and vendor evaluations often block concrete use cases. But here’s the surprising part:

You don’t need a big platform or a six-month project to get started with AI on your internal knowledge base.

You can build a fully local, privacy-friendly Retrieval-Augmented Generation (RAG) system in just a couple of hours. No external dependencies. No cloud vector databases. No proprietary frameworks. Just a local embedding model, your internal documents, and a Large Language Model (LLM) endpoint.

This post shows you how.

What We Are Building

So what is RAG? A RAG system answers questions with an LLM, but only after retrieving the most relevant pieces of your own documents and handing them to the model as context. These documents can be anything from text files to code or PDFs. In a typical company, though, it will pretty much be Confluence. That’s where your knowledge of the last couple of years is buried. So let’s give it back some life.

In practice, this results in a system like this:

  1. You ask a question
  2. Your system searches through your documents
  3. It picks the most relevant text chunks
  4. It builds a huge context prompt including the found document parts
  5. It asks the LLM
  6. You get a context-grounded answer

Technically, we will go through the following steps:

                     ┌────────────────────────┐
                     │   1. Extract Content   │
                     │  (Confluence API, PDFs)│
                     └─────────────┬──────────┘
                                   ▼
                     ┌────────────────────────┐
                     │     2. Clean & Chunk   │
                     │   HTML → text → chunks │
                     └─────────────┬──────────┘
                                   ▼
            ┌──────────────────────────────────────────┐
            │         3. Embed & Store Chunks          │
            │  local embeddings → JSON vector files    │
            └────────────────────────┬─────────────────┘
                                     ▼
                     ┌────────────────────────┐
                     │   4. Similarity Search │
                     └─────────────┬──────────┘
                                   ▼
          ┌────────────────────────────────────┐
          │      5. Build Prompt & Query LLM   │
          │  (inject context → call LLM)       │
          └──────────────────────────┬─────────┘
                                     ▼
                     ┌────────────────────────┐
                     │      Final Answer      │
                     └────────────────────────┘

Step 1: Extract Your Internal Content

Most companies use something like Confluence to share their internal knowledge. Whatever you use: if it has an API or can export PDFs, you can get the content out. Confluence in particular is very convenient because of its REST API:

GET /rest/api/content?spaceKey=ABC&limit=...&expand=body.storage

Using this endpoint, you can paginate through all pages of a space and download them as JSON. Here’s a minimal Java example that retrieves every page from a Confluence space and stores the results as individual JSON files:

var client = HttpClient.newHttpClient();
var mapper = new ObjectMapper();

var baseUrl = "https://your-confluence-domain/rest/api/content";
var spaceKey = "ABC";
int limit = 50;
int start = 0;

// make sure the output folder exists before writing the page files
new File("raw_pages").mkdirs();

while (true) {
    var url = baseUrl 
            + "?spaceKey=" + spaceKey 
            + "&limit=" + limit 
            + "&start=" + start 
            + "&expand=body.storage";

    var req = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Authorization", "Bearer <YOUR_TOKEN>")
            .build();

    var resp = client.send(req, HttpResponse.BodyHandlers.ofString());
    var root = mapper.readTree(resp.body());
    var results = root.get("results");

    if (results == null || !results.isArray() || results.size() == 0) {
        break;
    }

    for (var page : results) {
        var id = page.get("id").asText();
        var out = new File("raw_pages/" + id + ".json");
        mapper.writerWithDefaultPrettyPrinter().writeValue(out, page);
    }

    int size = results.size();
    if (size < limit) {
        break; // no more pages
    }
    start += limit;
}

After running this script, your local download folder might look like this:

raw_pages/
  123456.json
  123457.json
  123458.json
  123459.json
  ...

Each file contains a full Confluence page in JSON format — including its ID, title, metadata, and HTML content (body.storage.value).
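
For orientation, a (heavily trimmed) page file looks roughly like this; the exact fields depend on your Confluence version and the expand parameters used, and the values here are illustrative:

{
  "id": "123456",
  "type": "page",
  "title": "Booking Service Overview",
  "body": {
    "storage": {
      "value": "<p>Our booking service handles ...</p>",
      "representation": "storage"
    }
  }
}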

Step 2: Clean the HTML Content

Before we can embed the text, we should convert the HTML into clean, readable plain text:

  • remove tags
  • remove boilerplate (menus, macros, metadata)
  • normalize whitespace
  • keep only the meaningful textual content

The easiest way to do this in Java is using Jsoup, a lightweight HTML parser.

Here is a minimal snippet that takes the downloaded Confluence JSON files, extracts the HTML, and converts it to plain text:

var inputDir = new File("raw_pages");
var outputDir = new File("clean_pages");
outputDir.mkdirs();

for (var file : inputDir.listFiles((d, n) -> n.endsWith(".json"))) {
    var root = mapper.readTree(file);
    var id = root.get("id").asText();
    var body = root.path("body").path("storage").path("value");
    var html = body.isMissingNode() ? "" : body.asText();
    var cleanText = Jsoup.parse(html).text();
    cleanText = cleanText.replaceAll("\\s+", " ").trim();
    var out = new File(outputDir, id + ".txt");
    Files.writeString(out.toPath(), cleanText);
}

After running this step, your directory structure looks like this:

raw_pages/
  123456.json
  123457.json
  ...

clean_pages/
  123456.txt
  123457.txt
  ...

Each .txt file now contains a clean, normalized text representation of the corresponding Confluence page. This content is ready for chunking and embedding in the following steps.

Step 3: Chunk the Documents

Big documents don’t embed well. So we split them into small pieces — for example 300–600 characters each. Every chunk gets stored locally:

{pageId}_chunk_{n}.json

To do so, a few lines of plain Java are enough:

var inputDir = new File("clean_pages");   // contains 12345.txt etc.
var chunkDir = new File("chunks");
chunkDir.mkdirs();

int chunkSize = 600;
int overlap = 100;

for (var file : inputDir.listFiles((d, n) -> n.endsWith(".txt"))) {

    var pageId = file.getName().replace(".txt", "");
    var text = Files.readString(file.toPath());

    text = text.replaceAll("\\s+", " ").trim();

    int index = 0;
    int start = 0;

    while (start < text.length()) {

        int end = Math.min(start + chunkSize, text.length());
        String chunk = text.substring(start, end).trim();

        var node = mapper.createObjectNode();
        node.put("pageId", pageId);
        node.put("chunkIndex", index);
        node.put("text", chunk);

        var out = new File(chunkDir, pageId + "_chunk_" + index + ".json");

        mapper.writerWithDefaultPrettyPrinter().writeValue(out, node);

        index++;

        if (end == text.length()) {
            break; // reached the end of the text, otherwise the last chunk would repeat forever
        }
        start = end - overlap;  // sliding window
    }
}

This becomes your “source library”. Each chunk_*.json file contains:

  • the pageId
  • the chunkIndex
  • the cleaned text snippet

In the next step, we will embed these files as the foundation for your vector store. After chunking, your directory structure looks like this:

clean_pages/
  12345.txt
  12346.txt
  ...

chunks/
  12345_chunk_0.json
  12345_chunk_1.json
  12345_chunk_2.json
  12346_chunk_0.json
  ...

Step 4: Create Embeddings Locally

To perform semantic search (and find the most relevant documents to our question), we need a numerical representation (an embedding) for each text chunk. We can easily run an embedding model locally, entirely offline. For this example, we use the lightweight and well-established all-MiniLM-L6-v2 embedding model, which is small, fast, and works great for document search. You can download a full copy of the model as a ZIP file here:

👉 https://www.kaggle.com/datasets/sircausticmail/all-minilm-l6-v2zip

After downloading, unzip it into a folder like:

D:/embedding_models/all-MiniLM-L6-v2/

Once the model is available locally, we expose it via a tiny Python Flask service. This allows any application (e.g., our Java tooling) to request embeddings via a simple HTTP call.

from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer

app = Flask(__name__)

# Path to your downloaded model folder
MODEL_PATH = r"D:\embedding_models\all-MiniLM-L6-v2"

print("Loading model from:", MODEL_PATH)
model = SentenceTransformer(MODEL_PATH)

@app.route("/embed", methods=["POST"])
def embed():
    data = request.get_json()
    text = data.get("text", "")
    embedding = model.encode(text).tolist()
    return jsonify({"embedding": embedding})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5005)

Run the server (this requires the flask and sentence-transformers Python packages):

python embedding_service.py
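
Once the server is running, you can quickly verify it with a test request; the payload shape matches the /embed route above:

curl -X POST http://localhost:5005/embed \
     -H "Content-Type: application/json" \
     -d '{"text": "Hello world"}'

The response is a JSON object with a single embedding field containing the vector for the given text.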

To embed all your cleaned and chunked documents, you can call this local service from Java (MAPPER and CLIENT below are a Jackson ObjectMapper and a java.net.http.HttpClient, created once and reused). The code iterates over all chunk files, sends each chunk’s text to the local embedding service, and writes the result into a corresponding .embedding.json file.

var chunkDir = new File("chunks");
var embedDir = new File("embeddings");
embedDir.mkdirs();

for (var file : chunkDir.listFiles((d, n) -> n.endsWith(".json"))) {

    var root = MAPPER.readTree(file);
    var pageId = root.get("pageId").asText();
    int chunkIndex = root.get("chunkIndex").asInt();
    var text = root.get("text").asText();

    var body = MAPPER.createObjectNode();
    body.put("text", text);

    var req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:5005/embed"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
            .build();

    var resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
    var embedJson = MAPPER.readTree(resp.body());

    var out = MAPPER.createObjectNode();
    out.put("pageId", pageId);
    out.put("chunkIndex", chunkIndex);
    out.set("embedding", embedJson.get("embedding"));

    var outFile = new File(embedDir, pageId + "_chunk_" + chunkIndex + ".embedding.json");

    MAPPER.writerWithDefaultPrettyPrinter().writeValue(outFile, out);
}

After running this step, your directory structure looks like this:

chunks/
  12345_chunk_0.json
  12345_chunk_1.json

embeddings/
  12345_chunk_0.embedding.json
  12345_chunk_1.embedding.json

Each embedding file contains a single vector: a long list of numbers (384 dimensions for all-MiniLM-L6-v2) returned from the local embedding model. Now we have created the data basis for our RAG system: we have downloaded all documents from Confluence, we have cleaned and chunked them, and finally we have converted them into vectors.
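
For illustration, a single (heavily truncated) embedding file looks roughly like this; the values are made up and the real array has 384 entries:

{
  "pageId": "12345",
  "chunkIndex": 0,
  "embedding": [ 0.0123, -0.0456, 0.0789, ... ]
}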

Step 5: Similarity Search

Once every chunk has an embedding, the next step is to find the chunks that are most relevant to the user’s question.

The process is simple:

  • Embed the user’s query (using the same local embedding service)
  • Compare this query embedding with all stored chunk embeddings
  • Compute similarity between them
  • Sort the results
  • Pick the top k chunks (e.g., 10–20)

In Java, this can look like the following, reusing MAPPER and CLIENT from before:

List<Double> embed(String text) throws Exception {
    // call the local embedding server
    var body = "{\"text\": " + MAPPER.writeValueAsString(text) + "}";
    var req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:5005/embed"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    var resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());

    // the response is {"embedding": [0.01, -0.02, ...]}: convert the array into a List<Double>
    var vector = new ArrayList<Double>();
    for (var value : MAPPER.readTree(resp.body()).get("embedding")) {
        vector.add(value.asDouble());
    }
    return vector;
}

double cosine(List<Double> a, List<Double> b) {
    double dot=0, na=0, nb=0;
    for (int i=0; i<a.size(); i++) {
        dot += a.get(i)*b.get(i);
        na  += a.get(i)*a.get(i);
        nb  += b.get(i)*b.get(i);
    }
    return dot / (Math.sqrt(na)*Math.sqrt(nb));
}

// --- Similarity Search ---
var query = embed("How does the booking logic work?");
var scores = new HashMap<File, Double>();

for (var f : new File("embeddings").listFiles((d, n) -> n.endsWith(".embedding.json"))) {
    // load the stored embedding vector from the JSON file
    var vec = new ArrayList<Double>();
    for (var value : MAPPER.readTree(f).get("embedding")) {
        vec.add(value.asDouble());
    }
    scores.put(f, cosine(query, vec));
}

scores.entrySet().stream()
      .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
      .limit(5)
      .forEach(e -> System.out.println(e.getValue() + " -> " + e.getKey().getName()));

Now we have the chunks that are semantically closest to the user’s question. They will become the input context for the LLM in the next step.

📁 Example Output

0.8421  ->  12345_chunk_1.embedding.json
0.8012  ->  12345_chunk_0.embedding.json
0.7950  ->  98765_chunk_2.embedding.json
...

Step 6: Build the Prompt and Ask the LLM

Once we have identified the most relevant chunks through similarity search, we can assemble the final RAG prompt. This prompt hands the LLM exactly the pieces of information it needs to answer the user’s question, which keeps the answer grounded in your documentation instead of the model’s general training data and greatly reduces the risk of hallucination.

The structure is simple:

You are an internal assistant.
Answer the question using only the context below.
If the information is missing, say so.

### Context
[Chunk 1]
<text>

[Chunk 2]
<text>

...

### Question
<user question>

This prompt is then sent to your LLM endpoint (e.g., Azure OpenAI or any other model you have access to). Because the LLM receives the exact, clean, and relevant context, it can generate accurate, grounded answers based on your internal documentation.

var prompt =
    "You are an internal assistant...\n\n" +
    "### Context\n" +
    topChunks.stream().map(c -> c.text).collect(Collectors.joining("\n\n")) +  // top-k chunks from step 5
    "\n\n### Question\n" +
    userQuestion;

var response = callAzureOpenAI(prompt);  // or any other LLM endpoint

System.out.println(response);
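
The callAzureOpenAI method is not shown above because it depends on your setup. As a rough sketch, assuming an Azure OpenAI chat-completions endpoint (the resource name, deployment name, API version, and key are placeholders you need to replace), it could look like this:

String callAzureOpenAI(String prompt) throws Exception {
    // placeholders: adjust resource, deployment and api-version to your Azure OpenAI setup
    var url = "https://<your-resource>.openai.azure.com/openai/deployments/<your-deployment>"
            + "/chat/completions?api-version=2024-02-01";

    // request body: a single user message containing the full RAG prompt
    var body = MAPPER.createObjectNode();
    var message = body.putArray("messages").addObject();
    message.put("role", "user");
    message.put("content", prompt);

    var req = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/json")
            .header("api-key", "<YOUR_AZURE_OPENAI_KEY>")
            .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
            .build();

    var resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());

    // the generated answer is located at choices[0].message.content
    return MAPPER.readTree(resp.body())
                 .path("choices").path(0).path("message").path("content").asText();
}

Any other OpenAI-compatible endpoint (or a locally hosted model behind an HTTP API) works the same way; only the URL and the authentication header change.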

The result is a precise, context-aware answer grounded entirely in your own documents: no guesswork, far fewer hallucinations, and, depending on your LLM endpoint, no data leaving your own infrastructure.

Why This Works So Well

Most LLM tools focus on fancy interfaces, cloud services, and integrations. But at its core, RAG is extremely simple.

You need:

  • a way to download your data (e.g. via the Confluence REST API)
  • a way to clean and chunk text (Jsoup)
  • a way to embed text (e.g. all-MiniLM-L6-v2)
  • a way to compare embeddings
  • an LLM endpoint (e.g. Azure OpenAI)

Final Thoughts

LLMs are powerful, but the real magic comes when you mix them with your own knowledge. And the best part: you don’t need a huge infrastructure to get there. A local RAG system is one of the fastest and most effective ways to bring AI into everyday work.

If you want to try it yourself, start small: Pick one space, one folder, or one project — and build your own AI assistant around it.

It’s easier than you think. And highly addictive once it works.

Best regards,
Thomas