Rate Limits

Apertis implements rate limits to ensure fair usage and maintain service quality for all users. This guide explains the different rate limits and how to handle them.

Rate Limit Overview

Rate limits are applied at multiple levels to protect the service:

| Level | Limit | Duration | Scope |
|---|---|---|---|
| Global API | 1,500 requests | 3 minutes | Per IP |
| API Key | 3,000 requests | 1 minute | Per Key |
| Per-Model | Varies by plan | 1 minute | Per Key + Per Model |
| Web Dashboard | 120 requests | 3 minutes | Per IP |
| Critical Operations | 20 requests | 20 minutes | Per IP |
| Log Queries | 30 requests | 60 seconds | Per User |

API Rate Limits

Global API Rate Limit

The primary rate limit for all API endpoints:

1,500 requests per 3 minutes per IP address

This applies to:

  • /v1/chat/completions
  • /v1/embeddings
  • /v1/images/generations
  • All other /v1/* endpoints

Per API Key Rate Limit

Each API key has its own rate limit:

3,000 requests per 1 minute per API key

This provides higher throughput for applications using a single API key.

Per-Model Rate Limit

Each API key is also rate-limited per model. This prevents a single key from overwhelming any specific model with rapid requests (e.g., automated agent loops).

Model name variants are unified into a single bucket — for example, claude-opus-4-6 and code:claude-opus-4-6 share the same rate limit counter. The :web and :free suffixes are also normalized.
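
As an illustration of that bucketing, here is a minimal sketch of the normalization as described above. This is not the service's actual code, and the function name is made up:

def rate_limit_bucket(model: str) -> str:
    """Map a model name variant to its shared rate-limit bucket."""
    if model.startswith("code:"):
        model = model[len("code:"):]       # Strip the tool prefix
    for suffix in (":web", ":free"):
        if model.endswith(suffix):
            model = model[: -len(suffix)]  # Strip normalized suffixes
    return model

# Both variants count against the same counter
assert rate_limit_bucket("code:claude-opus-4-6") == "claude-opus-4-6"
assert rate_limit_bucket("claude-opus-4-6:web") == "claude-opus-4-6"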

| Plan | RPM per Model per Key |
|---|---|
| Lite ($12/mo) | 15 |
| Pro ($25/mo) | 20 |
| Max ($200/mo) | 30 |
| PAYG | 500 |

Tip: Requests that are rate-limited return HTTP 429 immediately and do not consume any quota or credits. The request never reaches the upstream provider.

Effective Rate Limit

Your effective rate limit is the lowest of:

  • Global API limit (per IP)
  • API Key limit (per key)
  • Per-Model limit (per key + per model, based on your plan)
Effective Limit = min(Global Limit, API Key Limit, Per-Model Limit)

For example, on the Pro plan a single model is capped at 20 requests per minute per key, even though the key itself allows up to 3,000 requests per minute across all models.

Rate Limit Headers

Rate limit information is included in API response headers:

| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed in the current window |
| X-RateLimit-Remaining | Requests remaining in the window |
| X-RateLimit-Reset | Unix timestamp when the limit resets |

Example Response Headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 1500
X-RateLimit-Remaining: 1423
X-RateLimit-Reset: 1703894400

Handling Rate Limits

HTTP 429 Response

When you exceed the rate limit, the API returns HTTP 429 with a Retry-After header indicating how many seconds to wait:

HTTP/1.1 429 Too Many Requests
Retry-After: 45

{
  "error": {
    "message": "Rate limit exceeded for model: claude-opus-4-6. Maximum 20 requests per minute per API key. Please wait before retrying.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Retry Strategy

Implement exponential backoff for rate limit errors:

import time
import random

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.apertis.ai/v1"
)

def make_request_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4.1",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.2f}s...")
            time.sleep(wait_time)
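
Since 429 responses include a Retry-After header, you can prefer the server's hint over the computed backoff. A sketch, assuming the SDK's RateLimitError exposes the underlying HTTP response as e.response (recent openai-python versions do):

def backoff_seconds(e: RateLimitError, attempt: int) -> float:
    """Prefer the server's Retry-After hint; fall back to exponential backoff."""
    retry_after = e.response.headers.get("retry-after")
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After may be an HTTP date; fall through to backoff
    return (2 ** attempt) + random.uniform(0, 1)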

Node.js Example

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'sk-your-api-key',
  baseURL: 'https://api.apertis.ai/v1'
});

async function makeRequestWithRetry(messages, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: 'gpt-4.1',
        messages
      });
    } catch (error) {
      if (error.status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      // Exponential backoff with jitter
      const waitTime = Math.pow(2, attempt) + Math.random();
      console.log(`Rate limited. Waiting ${waitTime.toFixed(2)}s...`);
      await new Promise(r => setTimeout(r, waitTime * 1000));
    }
  }
}

Best Practices

1. Request Batching

Combine multiple operations into fewer requests:

# Instead of multiple single-message calls...
for message in messages:
    client.chat.completions.create(
        model="gpt-4.1",
        messages=[message]
    )  # Many requests

# ...batch related messages into a single call when possible
client.chat.completions.create(
    model="gpt-4.1",
    messages=messages
)  # Single request

2. Implement Request Queuing

Use a queue to manage request rate:

import asyncio
import time
from collections import deque

class RateLimitedQueue:
    def __init__(self, max_requests_per_minute=50):
        self.max_rpm = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests

    async def add_request(self, request_func):
        # Wait until we are under the per-minute budget
        while len(self.request_times) >= self.max_rpm:
            oldest = self.request_times[0]
            wait_time = 60 - (time.time() - oldest)
            if wait_time > 0:
                await asyncio.sleep(wait_time)
            self.request_times.popleft()

        # Execute the request and record its timestamp
        self.request_times.append(time.time())
        return await request_func()
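
Usage might look like this, assuming an async client (the key value is a placeholder). The lambda defers the call so the queue controls when it actually starts:

from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="sk-your-api-key",
    base_url="https://api.apertis.ai/v1"
)

queue = RateLimitedQueue(max_requests_per_minute=50)

async def main():
    response = await queue.add_request(
        lambda: async_client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "Hello"}]
        )
    )
    print(response.choices[0].message.content)

asyncio.run(main())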

3. Use Streaming for Long Responses

Streaming doesn't change how requests count against rate limits, but it improves perceived latency for long responses:

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Write a long story"}],
    stream=True
)

for chunk in response:
    # delta.content can be None (e.g., on the final chunk), so guard it
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

4. Cache Responses

Cache identical requests to reduce API calls:

import hashlib
import json

# Simple in-memory cache keyed by a hash of the full request
_cache = {}

def cached_completion(messages, model="gpt-4.1"):
    # Hash the request so identical calls hit the cache instead of the API
    key = hashlib.md5(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model=model,
            messages=messages
        )
    return _cache[key]

response = cached_completion([{"role": "user", "content": "What is the capital of France?"}])

5. Distribute Across Multiple Keys

For high-volume applications, use multiple API keys:

import itertools

api_keys = ["sk-key1", "sk-key2", "sk-key3"]
key_cycle = itertools.cycle(api_keys)

def get_client():
    return OpenAI(
        api_key=next(key_cycle),
        base_url="https://api.apertis.ai/v1"
    )
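
Each call then rotates to the next key (the key values above are placeholders):

response = get_client().chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Hello"}]
)

Keep in mind that key rotation only raises the per-key ceilings; the global limit (1,500 requests per 3 minutes) is per IP, so rotating keys from a single address does not bypass it.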

Special Rate Limits

Critical Operations

Stricter limits apply to security-sensitive operations:

| Operation | Limit | Duration |
|---|---|---|
| Registration | 20 | 20 minutes |
| Password Reset | 20 | 20 minutes |
| Email Verification | 20 | 20 minutes |

Log Query Rate Limit

For log and usage queries:

30 requests per 60 seconds per user
Note: Root users (administrators) are exempt from log query rate limits.

Upload/Download Rate Limit

For file operations:

10 requests per 60 seconds per IP

Rate Limits by Plan

Subscription Users

Subscription users have per-model RPM limits that vary by plan tier. See the Per-Model Rate Limit section above for details.

Global API (1,500/3min) and API Key (3,000/min) limits remain the same across all plans.

PAYG Users

PAYG users have a generous per-model RPM limit of 500 requests per minute, since usage is billed directly.

Enterprise

Contact sales for custom rate limits for enterprise applications.

Monitoring Your Usage

Dashboard Metrics

View your API usage in the dashboard:

  • Requests per minute/hour/day
  • Rate limit hits
  • Peak usage times

Programmatic Monitoring

Track rate limits in your application:

def check_rate_limit_headers(response):
    remaining = response.headers.get('X-RateLimit-Remaining')
    reset_time = response.headers.get('X-RateLimit-Reset')

    if remaining and int(remaining) < 100:
        print(f"Warning: Only {remaining} requests remaining")
        print(f"Resets at: {reset_time}")

Troubleshooting

Consistently Hitting Rate Limits

If you're frequently rate limited:

  1. Audit your request patterns - Are you making unnecessary calls?
  2. Implement caching - Cache repeated queries
  3. Optimize batch sizes - Combine requests where possible
  4. Consider upgrading - Higher plans have increased limits

Sudden Rate Limit Errors

If you suddenly start getting rate limited:

  1. Check for loops - Infinite loops can exhaust limits quickly
  2. Review recent code changes - New code may have introduced issues
  3. Check for concurrent users - Multiple processes sharing same key