Rate Limits

Apertis implements rate limits to ensure fair usage and maintain service quality for all users. This guide explains the different rate limits and how to handle them.

Rate Limit Overview

Rate limits are applied at multiple levels to protect the service:

| Level | Limit | Duration | Scope |
|---|---|---|---|
| Global API | 1,500 requests | 3 minutes | Per IP |
| API Key | 3,000 requests | 1 minute | Per Key |
| Per-Model | Varies by plan | 1 minute | Per Key + Per Model |
| Web Dashboard | 120 requests | 3 minutes | Per IP |
| Critical Operations | 20 requests | 20 minutes | Per IP |
| Log Queries | 30 requests | 60 seconds | Per User |

API Rate Limits

Global API Rate Limit

The primary rate limit for all API endpoints:

1,500 requests per 3 minutes per IP address

This applies to:

  • /v1/chat/completions
  • /v1/embeddings
  • /v1/images/generations
  • All other /v1/* endpoints

Per API Key Rate Limit

Each API key has its own rate limit:

3,000 requests per 1 minute per API key

This provides higher throughput for applications using a single API key.

Per-Model Rate Limit

Each API key is also rate-limited per model. This prevents a single key from overwhelming any specific model with rapid requests (e.g., automated agent loops).

Model name variants are unified into a single bucket — for example, claude-opus-4-6 and code:claude-opus-4-6 share the same rate limit counter. The :web and :free suffixes are also normalized.
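
As an illustration of that bucketing, here is a minimal sketch of the normalization as described above. This is not the service's actual code, and the function name is made up:

def rate_limit_bucket(model: str) -> str:
    """Map a model name variant to its shared rate-limit bucket."""
    if model.startswith("code:"):
        model = model[len("code:"):]       # Strip the tool prefix
    for suffix in (":web", ":free"):
        if model.endswith(suffix):
            model = model[: -len(suffix)]  # Strip normalized suffixes
    return model

# Both variants count against the same counter
assert rate_limit_bucket("code:claude-opus-4-6") == "claude-opus-4-6"
assert rate_limit_bucket("claude-opus-4-6:web") == "claude-opus-4-6"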

| Plan | RPM per Model per Key |
|---|---|
| Lite ($12/mo) | 15 |
| Pro ($25/mo) | 20 |
| Max ($200/mo) | 30 |
| PAYG | 500 |

Tip: Requests that are rate-limited return HTTP 429 immediately and do not consume any quota or credits. The request never reaches the upstream provider.

Effective Rate Limit

Your effective rate limit is the lowest of:

  • Global API limit (per IP)
  • API Key limit (per key)
  • Per-Model limit (per key + per model, based on your plan)
Effective Limit = min(Global Limit, API Key Limit, Per-Model Limit)

For example, on the Pro plan a single model is capped at 20 requests per minute per key, even though the key itself allows up to 3,000 requests per minute across all models.

Rate Limit Headers

Rate limit information is included in API response headers:

| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed in the current window |
| X-RateLimit-Remaining | Requests remaining in the window |
| X-RateLimit-Reset | Unix timestamp when the limit resets |

Example Response Headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 1500
X-RateLimit-Remaining: 1423
X-RateLimit-Reset: 1703894400

Handling Rate Limits

HTTP 429 Response

When you exceed the rate limit, the API returns HTTP 429 with a Retry-After header indicating how many seconds to wait:

HTTP/1.1 429 Too Many Requests
Retry-After: 45

{
  "error": {
    "message": "Rate limit exceeded for model: claude-opus-4-6. Maximum 20 requests per minute per API key. Please wait before retrying.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Retry Strategy

Implement exponential backoff for rate limit errors:

import time
import random

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.apertis.ai/v1"
)

def make_request_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4.1",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.2f}s...")
            time.sleep(wait_time)
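
Since 429 responses include a Retry-After header, you can prefer the server's hint over the computed backoff. A sketch, assuming the SDK's RateLimitError exposes the underlying HTTP response as e.response (recent openai-python versions do):

def backoff_seconds(e: RateLimitError, attempt: int) -> float:
    """Prefer the server's Retry-After hint; fall back to exponential backoff."""
    retry_after = e.response.headers.get("retry-after")
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After may be an HTTP date; fall through to backoff
    return (2 ** attempt) + random.uniform(0, 1)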

Node.js Example

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'sk-your-api-key',
  baseURL: 'https://api.apertis.ai/v1'
});

async function makeRequestWithRetry(messages, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: 'gpt-4.1',
        messages
      });
    } catch (error) {
      if (error.status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      // Exponential backoff with jitter
      const waitTime = Math.pow(2, attempt) + Math.random();
      console.log(`Rate limited. Waiting ${waitTime.toFixed(2)}s...`);
      await new Promise(r => setTimeout(r, waitTime * 1000));
    }
  }
}

Best Practices

1. Request Batching

Combine multiple operations into fewer requests:

# Instead of multiple single-message calls...
for message in messages:
    client.chat.completions.create(
        model="gpt-4.1",
        messages=[message]
    )  # Many requests

# ...batch related messages into a single call when possible
client.chat.completions.create(
    model="gpt-4.1",
    messages=messages
)  # Single request

2. Implement Request Queuing

Use a queue to manage request rate:

import asyncio
import time
from collections import deque

class RateLimitedQueue:
    def __init__(self, max_requests_per_minute=50):
        self.max_rpm = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests

    async def add_request(self, request_func):
        # Wait until we are under the per-minute budget
        while len(self.request_times) >= self.max_rpm:
            oldest = self.request_times[0]
            wait_time = 60 - (time.time() - oldest)
            if wait_time > 0:
                await asyncio.sleep(wait_time)
            self.request_times.popleft()

        # Execute the request and record its timestamp
        self.request_times.append(time.time())
        return await request_func()
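
Usage might look like this, assuming an async client (the key value is a placeholder). The lambda defers the call so the queue controls when it actually starts:

from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.apertis.ai/v1"
)

queue = RateLimitedQueue(max_requests_per_minute=50)

async def main():
    response = await queue.add_request(
        lambda: async_client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "Hello"}]
        )
    )
    print(response.choices[0].message.content)

asyncio.run(main())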

3. Use Streaming for Long Responses

Streaming doesn't change how requests count against rate limits, but it improves perceived latency for long responses:

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Write a long story"}],
    stream=True
)

for chunk in response:
    # delta.content can be None (e.g., on the final chunk), so guard it
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

4. Cache Responses

Cache identical requests to reduce API calls:

import hashlib
import json

# Simple in-memory cache keyed by a hash of the full request
_cache = {}

def cached_completion(messages, model="gpt-4.1"):
    # Hash the request so identical calls hit the cache instead of the API
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model=model,
            messages=messages
        )
    return _cache[key]

response = cached_completion([{"role": "user", "content": "What is the capital of France?"}])

5. Distribute Across Multiple Keys

For high-volume applications, use multiple API keys:

import itertools

api_keys = ["sk-key1", "sk-key2", "sk-key3"]
key_cycle = itertools.cycle(api_keys)

def get_client():
    return OpenAI(
        api_key=next(key_cycle),
        base_url="https://api.apertis.ai/v1"
    )
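
Each call then rotates to the next key (the key values above are placeholders):

response = get_client().chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Hello"}]
)

Keep in mind that key rotation only raises the per-key ceilings; the global limit (1,500 requests per 3 minutes) is per IP, so rotating keys from a single address does not bypass it.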

Special Rate Limits

Critical Operations

Stricter limits apply to security-sensitive operations:

| Operation | Limit | Duration |
|---|---|---|
| Registration | 20 | 20 minutes |
| Password Reset | 20 | 20 minutes |
| Email Verification | 20 | 20 minutes |

Log Query Rate Limit

For log and usage queries:

30 requests per 60 seconds per user
Note: Root users (administrators) are exempt from log query rate limits.

Upload/Download Rate Limit

For file operations:

10 requests per 60 seconds per IP

Rate Limits by Plan

Subscription Users

Subscription users have per-model RPM limits that vary by plan tier. See the Per-Model Rate Limit section above for details.

Global API (1,500/3min) and API Key (3,000/min) limits remain the same across all plans.

PAYG Users

PAYG users have a generous per-model RPM limit of 500 requests per minute, since usage is billed directly.

Enterprise

Contact sales for custom rate limits for enterprise applications.

Monitoring Your Usage

Dashboard Metrics

View your API usage in the dashboard:

  • Requests per minute/hour/day
  • Rate limit hits
  • Peak usage times

Programmatic Monitoring

Track rate limits in your application:

def check_rate_limit_headers(response):
    remaining = response.headers.get('X-RateLimit-Remaining')
    reset_time = response.headers.get('X-RateLimit-Reset')

    if remaining and int(remaining) < 100:
        print(f"Warning: Only {remaining} requests remaining")
        print(f"Resets at: {reset_time}")

Troubleshooting

Consistently Hitting Rate Limits

If you're frequently rate limited:

  1. Audit your request patterns - Are you making unnecessary calls?
  2. Implement caching - Cache repeated queries
  3. Optimize batch sizes - Combine requests where possible
  4. Consider upgrading - Higher plans have increased limits

Sudden Rate Limit Errors

If you suddenly start getting rate limited:

  1. Check for loops - Infinite loops can exhaust limits quickly
  2. Review recent code changes - New code may have introduced issues
  3. Check for concurrent users - Multiple processes sharing same key