GraphiT: Teaching Language Models to Read Graphs
Based on the paper by Shima Khoshraftar, Niaz Abedini & Amir Hajian, “GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs”, ACM Web Conference 2025 (WWW ’25), arXiv:2502.10522
There is a peculiar tension at the heart of modern AI. Large language models have become extraordinarily good at understanding text, and graphs have long been one of the richest ways humans represent structured knowledge. Citation networks, social networks, knowledge bases: all of these encode not just what things are, but how they relate to each other. The trouble is that graphs are deeply unnatural for language models. A network of interconnected nodes does not arrive neatly as a paragraph. So what happens when you try to make an LLM work with graph-structured data?
This is the question that GraphiT sets out to answer, and the answer turns out to hinge not just on what you feed the model, but how you ask it.
The Problem: Graphs Don’t Speak LLM
In a text-attributed graph, every node carries a text description alongside its network connections. Think of a citation network like Cora: each node is a research paper, its text attribute is the title and abstract, and its edges connect papers that cite one another. The task of node classification is to predict the topic or category of each paper using both the text and the graph structure.
Graph Neural Networks (GNNs) have long been the state of the art here. They propagate information across edges through a principled message-passing mechanism, aggregating neighbourhood signals at each layer. But GNNs carry an important limitation: they rely on shallow text embeddings, i.e., bag-of-words or word2vec-style representations, to encode node text attributes. These cannot capture the contextual relationships between words that modern language models understand so naturally.
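For context, that message passing can be written down exactly. In the canonical GCN formulation (Kipf and Welling, 2017), each layer mixes a node’s features with its neighbours’ through the normalised adjacency matrix before a learned linear map and nonlinearity:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right), \qquad \tilde{A} = A + I$$

where $H^{(0)}$ is the matrix of initial node features. In the classic setup those initial features are exactly the shallow text embeddings in question, so the contextual information is lost before propagation even begins.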
The obvious response is to bring LLMs into the picture. Researchers have tried this in several ways. Hybrid systems like LEADING jointly train a language model and a GNN end to end, while ENGINE links them through a tunable side structure. These hybrids can be effective, but they are computationally intensive and require labelled training data, which is not always available.
The more appealing possibility is to use an LLM alone, without any GNN. Give the model a description of a node and its neighbourhood and simply ask: what category does this belong to? This approach is flexible and requires no task-specific training, but it introduces a new problem that turns out to be surprisingly consequential. The answer you get depends enormously on how you phrase the question. Prompts require constant manual tuning, the tweaks that work are slow to find, and they are nearly impossible to replicate systematically. Additionally, including full text summaries of all a node’s neighbours can quickly overwhelm the context window and make LLM API calls expensive.
GraphiT is a framework designed to tackle both of these problems at once: how to efficiently encode a graph into text, and how to automatically find the most effective way to prompt a language model about it.
The Framework: Three Stages
The GraphiT pipeline moves through three stages. First, nodes are encoded as compact text. Second, a prompt is automatically optimised on a tiny labelled subset. Third, the optimised prompt is applied to classify all remaining nodes. Figure 1 illustrates the end-to-end flow.
Figure 1 — The GraphiT Framework. Node features, including keyphrases extracted from 1-hop neighbours, are prepared for each node. A small labelled subset is passed to DSPy alongside an initial prompt; DSPy automatically optimises the prompt by refining instructions and selecting demonstrative examples. The optimised prompt is then used to classify all nodes at scale — no further labelling or fine-tuning required.
Stage 1: Node Encoding with Keyphrases
For any given node, you have two natural sources of information: the node’s own text, and the text of its neighbours in the graph. GraphiT exploits both, but crucially does so efficiently.
Including the full text of every neighbour is impractical — a node can have dozens of neighbours, each with a lengthy abstract. Even using summaries of each neighbour produces long prompts and risks what the authors call the “lost-in-the-middle” effect: LLMs struggle to attend to information buried deep in a long input context.
GraphiT takes a more surgical approach. Instead of summaries, it extracts keyphrases from the concatenated text of a node’s 1-hop neighbours using a semantic keyphrase extraction (KPE) algorithm. The method generates n-gram candidates (with n ∈ {1, 2, 3}), maps them to dense embeddings via a Transformer-based model (KeyBERT), and selects the top-ranked phrases by semantic similarity to the full text. A diversity module then removes redundant keyphrases, yielding a compact set of five phrases that together capture the neighbourhood’s key concepts.
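As a rough sketch, this step maps closely onto KeyBERT’s public API. The model choice, diversity setting, and neighbour text below are illustrative assumptions, not the authors’ exact configuration:

```python
# Sketch of the neighbourhood keyphrase-extraction (KPE) step with KeyBERT.
from keybert import KeyBERT

# Concatenated text attributes of a node's 1-hop neighbours (illustrative).
neighbour_text = (
    "This paper presents an algorithm using reinforcement learning at each "
    "node for packet routing in networks. It utilises local information and "
    "outperforms traditional methods, even in irregularly-connected network "
    "structures."
)

kw_model = KeyBERT()  # defaults to a small sentence-transformers encoder

# n-gram candidates with n in {1, 2, 3}, ranked by embedding similarity to
# the full text; MMR acts as the diversity module that prunes near-duplicates.
keyphrases = kw_model.extract_keywords(
    neighbour_text,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.5,
    top_n=5,
)
print([phrase for phrase, _score in keyphrases])
```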
Each node’s encoding therefore contains: the node’s own text attribute, the labels of its 1-hop neighbours (where known), and the keyphrases distilled from their text. Table 1 shows a real example from the Cora dataset.
Table 1 — Neighbour representation comparison (Cora dataset). The keyphrases capture the key concepts of a full summary in a fraction of the tokens.
| Representation | Content |
|---|---|
| Neighbours summary | This paper presents an algorithm using reinforcement learning at each node for packet routing in networks; it utilises local information and outperforms traditional methods with minimal routing times through experiments, even in irregularly-connected network structures. |
| Neighbours keyphrases | distributed reinforcement learning, network routing, routing policies, packet routing |
The keyphrase representation is not only shorter — it is better suited to scaling up to multi-hop neighbourhoods, where summarising extended graph context becomes increasingly unwieldy. The KPE step itself runs on small encoder models fast enough to operate on an ordinary laptop CPU, with no GPU required.
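To make the format concrete, here is a minimal sketch of how such an encoding might be assembled into a prompt fragment. The template and field names are hypothetical; the paper’s exact wording may differ:

```python
def encode_node(text: str, neighbour_labels: list[str],
                neighbour_keyphrases: list[str]) -> str:
    """Assemble the compact node encoding: the node's own text, the labels
    of its 1-hop neighbours (where known), and the distilled neighbour
    keyphrases. Hypothetical template, not the paper's exact format."""
    parts = [f"Text: {text}"]
    if neighbour_labels:
        parts.append("Neighbour labels: " + ", ".join(neighbour_labels))
    if neighbour_keyphrases:
        parts.append("Neighbour keyphrases: " + ", ".join(neighbour_keyphrases))
    return "\n".join(parts)

# Example built from the Cora row in Table 1.
print(encode_node(
    text="Packet routing in dynamically changing networks ...",
    neighbour_labels=["Reinforcement Learning"],
    neighbour_keyphrases=["distributed reinforcement learning",
                          "network routing", "routing policies",
                          "packet routing"],
))
```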
Stage 2: Automatic Prompt Optimisation with DSPy
With node encodings in hand, GraphiT turns to the second bottleneck: the prompt itself. Anyone who has used an LLM for a structured task knows that small changes in wording can swing accuracy by several percentage points. The authors address this with DSPy, a framework for programming rather than prompting language model behaviour — treating prompts as parameters to be systematically optimised rather than manually crafted.
In DSPy, the node classification task is defined as a program with a signature: a formal description of inputs (node features and a list of possible labels) and the desired output (the predicted label). The authors use chain-of-thought prompting to let the model reason step by step before committing to a label.
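In code, the signature and chain-of-thought module look roughly like this. The field names, descriptions, and example inputs are illustrative assumptions, and the snippet targets a recent DSPy release:

```python
import dspy

class ClassifyNode(dspy.Signature):
    """Classify a node of a text-attributed graph into one category."""
    node_features = dspy.InputField(desc="node text plus neighbour labels and keyphrases")
    label_options = dspy.InputField(desc="the list of possible labels")
    label = dspy.OutputField(desc="exactly one label from label_options")

# Chain-of-thought wrapper: the model reasons step by step before committing.
classifier = dspy.ChainOfThought(ClassifyNode)

# dspy.LM is the model interface in recent DSPy releases.
dspy.configure(lm=dspy.LM("openai/gpt-3.5-turbo-1106"))

pred = classifier(
    node_features="Text: Packet routing in dynamically changing networks ...\n"
                  "Neighbour labels: Reinforcement Learning\n"
                  "Neighbour keyphrases: network routing, packet routing",
    label_options="Neural Networks; Reinforcement Learning; Probabilistic Methods",
)
print(pred.label)
```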
Optimisation proceeds through two complementary mechanisms. COPRO (Coordinate-ascent Optimisation by Prompting) iteratively rewrites the task instruction based on performance on a held-out validation set, refining one prompt component at a time while holding others fixed. Separately, bootstrap few-shot random search mines the training data for successful predictions and incorporates the most informative ones as demonstrations in the prompt.
Critically, this entire optimisation process requires remarkably little labelled data: just 3 nodes per class for training and 2 per class for validation. Once optimised, the resulting prompt is applied to arbitrarily large test sets without modification.
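Continuing the hypothetical sketch above, both mechanisms are available as DSPy teleprompters; the hyperparameters below are illustrative rather than the paper’s, and `classifier` is the chain-of-thought program from the previous snippet:

```python
import dspy
from dspy.teleprompt import COPRO, BootstrapFewShotWithRandomSearch

def accuracy(example, prediction, trace=None):
    # Single-label exact match, i.e. RP@1.
    return example.label == prediction.label

# Tiny labelled splits: 3 nodes per class for training, 2 for validation.
trainset = [
    dspy.Example(node_features="Text: ...", label_options="...; ...",
                 label="Reinforcement Learning")
    .with_inputs("node_features", "label_options"),
    # ... remaining labelled nodes built the same way ...
]
valset = [...]  # constructed identically from the validation nodes

# Instruction refinement: COPRO proposes rewrites of the task instruction
# and keeps whichever scores best on the held-out examples.
copro = COPRO(metric=accuracy, breadth=10, depth=3)
optimised = copro.compile(classifier, trainset=trainset,
                          eval_kwargs={"num_threads": 4})

# Demonstration selection: bootstrap correct predictions from the training
# nodes, then randomly search over subsets to use as few-shot demos.
search = BootstrapFewShotWithRandomSearch(
    metric=accuracy, max_bootstrapped_demos=4, num_candidate_programs=8)
optimised = search.compile(optimised, trainset=trainset, valset=valset)
```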
Stage 3: Node Classification at Scale
Once encoding and optimisation are complete, classification is straightforward. Each test node is encoded in the compact format, the optimised prompt is applied, and gpt-3.5-turbo-1106 returns a predicted label. The whole pipeline is training-free from the model’s perspective — no fine-tuning, no gradient updates.
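In the same hypothetical sketch, this final stage is nothing more than a loop, reusing `optimised` and `encode_node` from above:

```python
# Apply the optimised program, unchanged, to every encoded test node.
label_options = "Neural Networks; Reinforcement Learning; Probabilistic Methods"
predictions = [
    optimised(node_features=encode_node(**node),
              label_options=label_options).label
    for node in test_nodes  # e.g. the sampled evaluation nodes per dataset
]
```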
Experiments: What the Numbers Say
The authors evaluate GraphiT on three standard citation network benchmarks:
- Cora — 2,708 nodes, 5,429 edges, 7 classes (ML sub-areas including neural networks, reinforcement learning, and probabilistic methods)
- PubMed — ~19,000 nodes, ~44,000 edges, 3 classes (experimental induced diabetes, type 1 diabetes, type 2 diabetes)
- Ogbn-arxiv — ~169,000 nodes, ~1 million edges, 40 classes (CS arXiv subject areas)
For each dataset, 200 nodes are randomly sampled from the test set and results are averaged over two samples. The primary metric is RP@1, which is equivalent to accuracy in single-label classification. Four methods are compared: GraphiT, a Vanilla LLM baseline (same encoding, no prompt optimisation), the Chen et al. (2024) unoptimised few-shot approach, and a standard GCN.
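Since each node carries exactly one true label, RP@1 here is simply the fraction of exact-match predictions over the $N = 200$ sampled nodes:

$$\mathrm{RP@1} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right]$$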
Table 2 — Node classification accuracy (%) across three datasets. Bold marks the best LLM-based result per dataset. GraphiT outperforms all LLM-based approaches on every dataset, and beats GCN on PubMed.
| Method | Cora | PubMed | Ogbn-arxiv |
|---|---|---|---|
| Vanilla LLM | 74.49 | 87.56 | 49.50 |
| Chen et al. (2024) | 74.00 | 90.75 | 55.00 |
| GraphiT | **79.84** | **93.28** | **57.25** |
| GCN | 82.20 | 81.01 | 73.10 |
GraphiT beats both LLM-based baselines on all three datasets, and surpasses GCN on PubMed by a meaningful margin (93.28% vs 81.01%). On Cora and Ogbn-arxiv, GCN still leads — the authors attribute this to GCN aggregating 2-hop neighbourhood information, while GraphiT currently uses only 1-hop. Extending to deeper neighbourhoods is flagged as a clear direction for future work.
Ablation Study: What Each Component Contributes
The ablation study progressively adds components to the node encoding for both the Vanilla LLM and GraphiT, starting from text attributes alone and building up to the full keyphrase-enriched representation.
Table 3 — Ablation study: accuracy (%) across encoding configurations. Each row adds one more source of information. Keyphrases match or closely approach the accuracy of full summaries while using far fewer tokens.
| Dataset | Graph information included | Vanilla LLM | GraphiT |
|---|---|---|---|
| Cora | Text attributes only | 57.65 | 59.18 |
| | + Neighbours labels | 71.68 | 78.31 |
| | + Neighbours labels + Neighbours summary | 72.19 | 80.10 |
| | + Neighbours labels + Neighbours keyphrases | 74.49 | 79.84 |
| PubMed | Text attributes only | 89.55 | 93.03 |
| | + Neighbours labels | 87.06 | 90.54 |
| | + Neighbours labels + Neighbours summary | 87.31 | 92.78 |
| | + Neighbours labels + Neighbours keyphrases | 87.56 | 93.28 |
| Ogbn-arxiv | Text attributes only | 40.00 | 45.25 |
| | + Neighbours labels | 49.75 | 55.00 |
| | + Neighbours labels + Neighbours summary | 49.00 | 58.50 |
| | + Neighbours labels + Neighbours keyphrases | 49.50 | 57.25 |
Several things stand out. Adding neighbour labels delivers the biggest single jump in most settings — knowing what category your neighbours belong to is highly informative, because connected nodes tend to share labels (the homophily assumption). Keyphrases then consistently match or closely approach the accuracy of full summaries: on Cora and PubMed they essentially tie, while on Ogbn-arxiv summaries edge slightly ahead (58.5% vs 57.25% for GraphiT) — though at substantially higher token cost. GraphiT’s prompt optimisation consistently outperforms the Vanilla LLM across every encoding configuration, confirming that the DSPy step contributes meaningfully on its own.
The Cost Argument: Fewer Tokens, Competitive Results
GraphiT’s keyphrase approach is not just about accuracy; it is also about efficiency. Figure 2 shows a histogram of the ratio of tokens used by neighbour summaries to tokens used by keyphrases, across all three datasets combined. Summaries require several times more tokens on average, and since LLM API pricing scales with context length, this difference translates directly into lower inference costs.
Figure 2 — Token Efficiency: Summary vs. Keyphrase. Histogram of the ratio of summary tokens to keyphrase tokens across all three datasets. The KPE method consistently requires a fraction of the tokens needed for summarisation, translating directly to lower LLM API costs with minimal impact on classification quality.
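The measurement is easy to approximate at home using the Table 1 texts and the tiktoken tokeniser for gpt-3.5 models (the paper’s exact counting method is an assumption here):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

summary = ("This paper presents an algorithm using reinforcement learning at "
           "each node for packet routing in networks; it utilises local "
           "information and outperforms traditional methods with minimal "
           "routing times through experiments, even in irregularly-connected "
           "network structures.")
keyphrases = ("distributed reinforcement learning, network routing, "
              "routing policies, packet routing")

ratio = len(enc.encode(summary)) / len(enc.encode(keyphrases))
print(f"The summary uses {ratio:.1f}x the tokens of the keyphrases.")
```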
Conclusion
GraphiT makes a simple but concrete argument: to use LLMs effectively on graphs, you need to solve two sub-problems jointly. Encode the graph compactly using keyphrases rather than full summaries, and optimise the prompt automatically rather than by hand. Address only one, and you leave significant performance on the table.
The result is a clean, reproducible pipeline that outperforms all LLM-based baselines on three datasets and beats GCN on PubMed, while using substantially less context than naive encoding approaches. The limitations are real and the authors are upfront about them — GNNs still win on larger graphs where multi-hop propagation matters most. But as a demonstration that careful representation and systematic prompting can substitute for model training in graph prediction tasks, GraphiT makes a convincing case.
It is, in the end, a story about how thinking carefully about representation and communication, two concepts that sound almost philosophical, turns out to be the key to making a very modern technology work on a very classical data structure.
GraphiT was presented at the ACM Web Conference 2025 (WWW ’25). The full paper is available at arXiv:2502.10522.