Serving Rerankers on Vast.ai with vLLM
Rerankers determine relevance between text pairs—matching search queries to documents, evaluating LLM outputs, or finding similar content. They perform detailed comparisons that capture nuanced relationships simple methods miss.
This guide covers deploying the BAAI/bge-reranker-base model on Vast.ai using vLLM, with both OpenAI and Cohere-compatible APIs.
When to Use Rerankers
Embedding models with cosine similarity are fast and cheap—they encode text once and compare vectors. But they compress meaning into fixed-size vectors, losing nuance. Rerankers process query-document pairs together through a cross-encoder, capturing subtle relationships embeddings miss.
| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Embeddings + cosine | Fast | Good | Initial retrieval, large candidate sets |
| Reranker | Slower | Better | Final ranking, top-k refinement |
The common pattern: use embeddings to retrieve a larger candidate set quickly, then rerank the top results for final ordering.
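This two-stage pattern can be sketched in a few lines. Here `embed_score` and `rerank_score` are hypothetical stand-ins for your embedding-similarity and reranker calls; the toy word-overlap scorer below exists only to make the example runnable:

```python
def retrieve_then_rerank(query, corpus, embed_score, rerank_score,
                         n_candidates=100, top_k=5):
    """Two-stage ranking: cheap scoring pass over the whole corpus,
    expensive reranker pass over the survivors only."""
    # Stage 1: score everything with the fast method, keep the best n_candidates
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: rerank just the candidates with the cross-encoder
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:top_k]

# Toy scorer standing in for both stages: shared-word count
def overlap(q, d):
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = ["deep learning uses neural networks", "the weather is nice",
        "learning from data"]
print(retrieve_then_rerank("what is deep learning", docs, overlap, overlap,
                           n_candidates=2, top_k=1))
```

In production, stage 1 would be a vector-store lookup and stage 2 a call to the reranker endpoint deployed below.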
Prerequisites
- Vast.ai account with credits
- Vast.ai CLI installed (`pip install vastai`)
Hardware Requirements
The BAAI/bge-reranker-base model (~278M parameters) has modest requirements:
- GPU RAM: 16GB (8GB may work for lower throughput)
- GPU: Single GPU, Turing architecture or newer
- Network: Static IP and at least one direct port
Setting Up the CLI
Install and configure the Vast.ai CLI:
```bash
pip install vastai
vastai set api-key YOUR_API_KEY
```
Finding an Instance
Search for suitable instances:
```bash
vastai search offers 'compute_cap >= 750 gpu_ram >= 16 num_gpus = 1 static_ip = true direct_port_count >= 1 verified = true rentable = true'
```
Deploying the Server
First, generate a secure API key to protect your endpoint:
```bash
VLLM_API_KEY=$(openssl rand -hex 32)
echo "Save this API key: $VLLM_API_KEY"
```
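If you prefer generating the key from Python, the standard-library `secrets` module produces an equivalent value:

```python
import secrets

# 32 random bytes rendered as 64 hex characters,
# same shape as `openssl rand -hex 32`
vllm_api_key = secrets.token_hex(32)
print(f"Save this API key: {vllm_api_key}")
```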
Create the instance with vLLM serving the reranker model:
```bash
INSTANCE_ID=<your-instance-id>

vastai create instance $INSTANCE_ID \
  --image vllm/vllm-openai:latest \
  --env "-p 8000:8000 -e VLLM_API_KEY=$VLLM_API_KEY" \
  --disk 40 \
  --args --model BAAI/bge-reranker-base
```
Verifying the Deployment
- Go to Instances in the Vast.ai console
- Wait for the image and model to download
- Find your instance's IP and external port from "Open Ports" (format: `XX.XX.XXX.XX:YYYY -> 8000/tcp`)
Test the endpoint:
```bash
VAST_IP_ADDRESS="your-ip"
VAST_PORT="your-port"
VLLM_API_KEY="your-api-key"

curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
    "model": "BAAI/bge-reranker-base",
    "query": "What is deep learning?",
    "documents": ["Deep learning is a type of machine learning"]
  }'
```
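The first requests may fail while the image and model are still downloading. A small poller like the one below waits until the server answers before you start sending traffic. It is a sketch assuming vLLM's `/health` route, which returns 200 once the server is up; the `http_get` parameter is injected so any HTTP client (e.g. `requests.get`) can be passed in:

```python
import time

def wait_until_ready(base_url, http_get, timeout=600, interval=5):
    """Poll the server's /health route until it returns HTTP 200
    or we hit the timeout. http_get(url) must return an object with
    a .status_code attribute, e.g. requests.get."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if http_get(f"{base_url}/health").status_code == 200:
                return True
        except Exception:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False

# Usage once the instance exists:
# import requests
# ready = wait_until_ready(f"http://{VAST_IP_ADDRESS}:{VAST_PORT}", requests.get)
```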
Using the Reranker
vLLM provides two API endpoints:
| Endpoint | API Style | Use Case |
|---|---|---|
| `/score` | OpenAI | Raw scores for custom ranking logic |
| `/rerank` | Cohere | Pre-sorted results for quick integration |
OpenAI-Compatible Endpoint (`/score`)
The `/score` endpoint returns raw relevance scores for each query-document pair, giving you full control over ranking logic:
```python
import requests

IP_ADDRESS = "your-ip"
PORT = "your-port"
VLLM_API_KEY = "your-api-key"

def openai_score(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    headers = {"Authorization": f"Bearer {VLLM_API_KEY}"}
    request = {
        "model": "BAAI/bge-reranker-base",
        "text_1": query,      # the query is scored against each entry in text_2
        "text_2": documents
    }
    response = requests.post(f"{base_url}/score", json=request, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} {response.text}")
    data = response.json()
    # Pair each document with its score, then sort from most to least relevant
    scores = [(doc, item["score"]) for doc, item in zip(documents, data["data"])]
    scores.sort(key=lambda x: x[1], reverse=True)
    for text, score in scores:
        print(f"Score: {score:.6f} | {text[:60]}...")
```
Example usage:
```python
query = "What is Deep Learning?"
documents = [
    "Deep learning is a subset of machine learning that uses neural networks with many layers",
    "The weather is nice today",
    "Deep learning enables computers to learn from large amounts of data",
    "I like pizza"
]
openai_score(query, documents)
```
Output:
```
Score: 0.999512 | Deep learning is a subset of machine learning...
Score: 0.176270 | Deep learning enables computers to learn from...
Score: 0.000037 | The weather is nice today...
Score: 0.000037 | I like pizza...
```
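In a RAG pipeline you would usually keep only documents above a relevance cutoff rather than printing them all. A minimal filter over (document, score) pairs like those above; the 0.1 threshold is an assumption to tune against your own data:

```python
def filter_relevant(scored_docs, threshold=0.1):
    """Keep (document, score) pairs at or above the cutoff, best first."""
    kept = [pair for pair in scored_docs if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

scored = [
    ("Deep learning is a subset of machine learning...", 0.999512),
    ("Deep learning enables computers to learn from...", 0.176270),
    ("The weather is nice today...", 0.000037),
    ("I like pizza...", 0.000037),
]
print(filter_relevant(scored))  # only the two deep-learning documents survive
```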
Cohere-Compatible Endpoint (`/rerank`)
The `/rerank` endpoint returns results pre-sorted by relevance in Cohere's response format. This is useful if you're migrating from Cohere or want ranked output without sorting it yourself.
Install the Cohere client:
```bash
pip install --upgrade cohere
```
```python
import cohere

IP_ADDRESS = "your-ip"
PORT = "your-port"
VLLM_API_KEY = "your-api-key"

def cohere_rerank(query, documents):
    base_url = f"http://{IP_ADDRESS}:{PORT}"
    co = cohere.ClientV2(VLLM_API_KEY, base_url=base_url)
    result = co.rerank(
        model="BAAI/bge-reranker-base",
        query=query,
        documents=documents
    )
    # Results arrive pre-sorted by relevance_score, highest first
    for doc in result.results:
        print(f"Score: {doc.relevance_score:.6f} | {doc.document.text[:60]}...")
```
The Cohere endpoint returns pre-sorted results and handles batching automatically.
Score Interpretation
| Score Range | Meaning |
|---|---|
| ~1.0 | Highly relevant, direct match |
| 0.1 - 0.5 | Moderately relevant |
| 0.01 - 0.1 | Tangentially related |
| < 0.001 | Irrelevant |
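The table above can be encoded as a small helper for logging or debugging. The cutoffs follow the table; how to label the gaps it leaves (0.5 to 1.0, and 0.001 to 0.01) is our own judgment call here:

```python
def interpret_score(score):
    """Map a reranker score to the rough relevance labels from the table."""
    if score >= 0.5:
        return "highly relevant"
    if score >= 0.1:
        return "moderately relevant"
    if score >= 0.01:
        return "tangentially related"
    return "irrelevant"

print(interpret_score(0.999512))  # highly relevant
print(interpret_score(0.000037))  # irrelevant
```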
Use Cases
- RAG Systems: Filter retrieved context before sending to LLM
- Semantic Search: Rerank initial retrieval results
- Duplicate Detection: Identify semantically similar content
- Content Recommendation: Match user queries to content
Additional Resources