Relevant APIs:
- OpenAI Completion and Chat API Endpoints:
api.openai.com/v1/*
Prerequisites
- You have the Lunar proxy installed and configured in your environment (see the tutorial).
- You have the Lunar Interceptor installed at the service from which you are consuming the OpenAI API.
- You are sending a user identifier in your requests to the OpenAI API - this can be any identifier (ID, email, anonymous ID) sent in any part of the request (most commonly in a header).
- [REQUEST] Rewrite Request - Copies the value from the header X-On-Behalf-Of to the header X-Lunar-User-Key that we will use in future processors, with a fallback value if X-On-Behalf-Of is undefined. Used to provide a fallback for services that don’t send an OBO header.
- [REQUEST] Count Tokenized Value - Tokenizes the value in req.body.input and counts the number of tokens, placing the result in the request header X-Lunar-Request-Tokens.
- [REQUEST] Get Quota Usage - Fetches the quota usage statistic for the past day for the user based on the determined header.
- [REQUEST] Check Rate Limit - Checks the fetched usage statistic against a provided rate limit (10,000). Excludes any requests that come with the “system” header. If the check passes, the request continues on the stream to the API; if it fails, the request is passed to a new stream (below).
- [RESPONSE] Count Tokenized Value - Tokenizes the value in res.body.input and counts the number of tokens, placing the result in the response header X-Lunar-Request-Tokens.
- [RESPONSE] Update Quota Usage - Updates the quota usage statistic based on the counted tokens in the response and request (added together).
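The processor chain above can be sketched in plain Python to make the order of operations concrete. This is only an illustrative model, not Lunar's implementation or configuration syntax: the in-memory `usage_by_user` store, the `FALLBACK_USER_KEY` value, the function names, the 4-characters-per-token estimate, and the reading of the “system” header as a header literally named `system` are all assumptions made for the example. It also does not expire usage after a day, which a real quota store would need to do.

```python
from collections import defaultdict
from math import ceil

DAILY_TOKEN_LIMIT = 10_000        # the limit used by the Check Rate Limit step
FALLBACK_USER_KEY = "anonymous"   # hypothetical fallback when no OBO header is sent

# Hypothetical in-memory stand-in for the quota usage statistic (no daily reset).
usage_by_user = defaultdict(int)


def count_tokens(text: str) -> int:
    """Rough estimate only: about one token per 4 characters."""
    return ceil(len(text) / 4)


def handle_request(headers: dict, body: dict) -> bool:
    """Run the [REQUEST] steps; return True if the request may go to the API."""
    # Rewrite Request: copy X-On-Behalf-Of into X-Lunar-User-Key, with a fallback.
    headers["X-Lunar-User-Key"] = headers.get("X-On-Behalf-Of", FALLBACK_USER_KEY)
    # Count Tokenized Value: tokenize the request input and record the count.
    headers["X-Lunar-Request-Tokens"] = str(count_tokens(body.get("input", "")))
    # Get Quota Usage: look up what this user has consumed so far.
    used = usage_by_user[headers["X-Lunar-User-Key"]]
    # Check Rate Limit: requests carrying the "system" header are excluded.
    if "system" in headers:
        return True
    return used < DAILY_TOKEN_LIMIT


def handle_response(request_headers: dict, response_text: str) -> None:
    """Run the [RESPONSE] steps: count response tokens and update the quota."""
    request_tokens = int(request_headers["X-Lunar-Request-Tokens"])
    response_tokens = count_tokens(response_text)
    # Update Quota Usage: request and response tokens are added together.
    usage_by_user[request_headers["X-Lunar-User-Key"]] += request_tokens + response_tokens
```

A caller would run `handle_request` before forwarding a call and `handle_response` once the reply arrives; a request that returns `False` is the one diverted to the new stream.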
When to Use This Flow:
This flow is useful for applications exposing AI capabilities to their users based on the OpenAI APIs. These capabilities can be autocomplete, content generation, content suggestion and media generation. By giving users unrestricted access to capabilities built on the OpenAI API, you are risking certain users abusing these features and sending an abundance of requests. This can lead to high API usage costs (as the OpenAI API charges per usage), and to your entire app losing access to the OpenAI API, as OpenAI limits API usage on a per-app basis.
This flow is tailored for OpenAI by limiting users based on the number of tokens in their requests, rather than by the number of requests they make. Traditional rate limiting solutions simply count the number of API calls made. This approach can be insufficient when it comes to the OpenAI APIs, as they are priced and limit usage based on the number of tokens in requests. A single request can include as few as 100 tokens or as many as 100,000 tokens, rendering request-based rate limiting ineffective.
With user-based rate limiting, you can limit how many tokens each individual user of your app can consume. This limit should be high enough that none of your users will hit it during normal usage, only if they are abusing the system.
About the OpenAI API:
The OpenAI API is a platform developed by OpenAI, a research organization in the field of artificial intelligence, to provide access to a range of advanced AI models, including versions of the Generative Pre-trained Transformer (GPT) and DALL·E. This API allows for the integration of AI functionalities into various applications, enabling developers, researchers, and businesses to leverage complex machine learning models without requiring deep expertise in the field. It facilitates tasks such as natural language understanding, text generation, and the creation of images from textual descriptions.
OpenAI was established in December 2015 by a group including Elon Musk and Sam Altman, with the goal of advancing digital intelligence in ways that can most benefit humanity. Initially launched as a non-profit, the organization has since adopted a "capped" profit model to scale its research efforts while committing to safety and ethical considerations in AI development. This approach allows OpenAI to attract funding and talent, supporting its mission to develop artificial general intelligence (AGI) responsibly.
The OpenAI API's design emphasizes user-friendliness and scalability, making sophisticated AI technologies accessible to a broader audience. It is employed across a variety of sectors, enhancing capabilities in fields such as healthcare, education, customer service, and more. OpenAI continually updates its models to reflect the latest advancements in AI research, ensuring that users have access to cutting-edge technologies. This commitment to innovation and accessibility helps to bridge the gap between AI research and practical applications, facilitating the integration of AI into everyday tools and services.
Adding User IDs in OpenAI Requests:
For this rate limit to work, each request must identify the user who triggered it, so that rate limits can be applied on a per-user basis. We recommend adding a header with a unique user identifier (a user ID is ideal). You can name the header X-On-Behalf-Of-User or any other name, as long as it matches the header that the Rewrite Request processor reads.
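As a minimal sketch of attaching the identifier, the helper below builds the headers for an outgoing OpenAI call. The function name `build_openai_headers` and the `header_name` parameter are hypothetical; the default of `X-On-Behalf-Of` is chosen here only because it is the header the Rewrite Request processor is described as reading, so pass a different name if your flow is configured for one.

```python
def build_openai_headers(api_key: str, user_id: str,
                         header_name: str = "X-On-Behalf-Of") -> dict:
    """Return request headers that carry a per-user identifier.

    header_name is an assumption: use whichever header your flow reads.
    """
    if not user_id:
        raise ValueError("user_id must not be empty")
    return {
        "Authorization": f"Bearer {api_key}",  # standard OpenAI API key header
        "Content-Type": "application/json",
        header_name: user_id,                  # who triggered this request
    }
```

The resulting dictionary can be passed to whichever HTTP client the service already uses.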
Understanding Token Based Rate Limiting:
OpenAI token counts depend on the length of both the prompt and the API response, with one token corresponding to roughly 4 characters. Thus, an effective rate limiting strategy must be able to inspect requests to and responses from the OpenAI API, count the number of tokens in them and use these as the basis for the limit.
An effective rate limiting policy for any OpenAI API must be based on the number of tokens, as both the cost and the overall limit for the OpenAI API are based on the number of tokens. The table below summarizes costs and rate limits associated with the different OpenAI APIs for different models:
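To illustrate the rule of thumb, the sketch below estimates the tokens consumed by one call from the character lengths of its prompt and response. It is only an approximation based on the 4-characters-per-token figure; an exact count requires the model's own tokenizer. The function names are hypothetical.

```python
from math import ceil

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by text and model


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a piece of text."""
    return ceil(len(text) / CHARS_PER_TOKEN)


def estimate_call_tokens(prompt: str, response: str) -> int:
    """Approximate the tokens one call consumes: prompt plus response."""
    return estimate_tokens(prompt) + estimate_tokens(response)
```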
Setting the right Rate Limit:
One critical decision you must make when deploying this flow is what to set your per-user rate limit to. Set it too low, and you might end up limiting your users during normal usage of the system. Set it too high, and you may leave yourself vulnerable to abuse. There are three elements worth considering when determining the appropriate rate limit:
Time period
Your first decision is what time period to limit for: per minute, per hour, per day, or per month. This greatly depends on the usage pattern for your product. Is it a service someone might use very intensely but infrequently? That may be cause for a longer rate limit period.
However, a service used constantly might benefit from shorter rate limits to prevent isolated spikes. We’ve found that very short periods (especially per second or per minute) tend to be irrelevant for AI APIs given they normally take longer to respond. For most AI tools, monthly rate limits tend to give the right balance.
Usage patterns
Once you have determined the appropriate time frame, you must decide what to set your limit to. We recommend looking at historical usage patterns for your product. Remember - the right rate limit should be above the normal usage pattern for your users. Consider how many tokens are used by users in the 95th percentile as a good minimum for the limit. If you have Lunar installed, you can see API usage patterns right from the dashboard. Click here to learn more.
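One way to derive that minimum from historical data is a nearest-rank percentile over per-user token totals for the chosen period. The sketch below assumes you already have one total per user; the function name and the sample figures in the usage note are made up for illustration.

```python
from math import ceil


def percentile_nearest_rank(values: list, pct: int) -> int:
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    rank = ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]
```

For example, `percentile_nearest_rank(monthly_tokens_per_user, 95)` gives a usage level that 95% of users stay at or below, which can serve as a floor when choosing the limit.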
Cost
Lastly, consider the cost implications of your rate limit. With the OpenAI API, each token has a cost (as of the time of writing this article, the GPT-4 API costs $10/million tokens). So if your monthly limit for a user is 500,000 tokens, each user may cost you up to $5/month in API usage fees. Consider this cost against your revenue per user to ensure it makes sense in your scenario.
Do remember that not all users are going to hit your rate limit every month (ideally, very few will). So even if the cost at the top of the limit seems high, it might be OK depending on your usage pattern. The limit should be seen as more of a stopgap against excessive losses.
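The worst-case arithmetic can be written out as a small helper, using the figures from this section (a 500,000-token monthly limit at $10 per million tokens). The function name is hypothetical, and the price is an input because it changes over time and differs by model.

```python
def max_cost_per_user(token_limit: int, usd_per_million_tokens: float) -> float:
    """Upper bound on what one user can cost per period if they hit the limit."""
    return token_limit * usd_per_million_tokens / 1_000_000


# Figures from the text: 500,000 tokens per month at $10 per million tokens.
worst_case = max_cost_per_user(500_000, 10.0)  # 5.0 dollars per user per month
```

Comparing this figure with revenue per user shows quickly whether a candidate limit is affordable.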
Most importantly - keep monitoring your rate limit performance after deploying it to see how many users hit it on a monthly basis. If too many are hitting it, you might want to reconsider the limit level.
Lunar offers a comprehensive dashboard for monitoring this and other aspects of your API usage. Learn more about monitoring with Lunar here.
About Rate Limiting:
API rate limiting is a crucial technique used in web development to control the amount of incoming requests a client can make to a server within a specific timeframe. This practice is essential for maintaining the stability, security, and reliability of web services by preventing any single user or service from overwhelming the system with too many requests. Rate limiting can be implemented in various ways, such as limiting requests based on IP address, API keys, or user accounts, and often involves setting thresholds for requests per second (RPS) or requests per minute (RPM).
The rationale behind API rate limiting is multifaceted. It helps protect against abusive behaviors like scraping, brute force attacks, and denial-of-service (DoS) attacks, ensuring that services remain available and responsive for all users. Moreover, it allows for fair usage among consumers by preventing any single user from monopolizing resources, thus ensuring a level playing field. Additionally, rate limiting can serve as a mechanism for API providers to tier their service offerings, enabling them to offer different usage limits at various pricing levels. By implementing rate limiting, API providers can safeguard their infrastructure, optimize resource allocation, and maintain a high quality of service for their users.
About Lunar.dev:
Lunar.dev is your go-to solution for egress API controls and API consumption management at scale.
With Lunar.dev, engineering teams of any size gain instant unified controls to effortlessly manage, orchestrate, and scale API egress traffic across environments, all without the need for code changes.
Lunar.dev is agnostic to any API provider and enables full egress traffic observability and real-time controls for cost spikes or issues in production, all through an egress proxy, an SDK installation, and a user-friendly UI management layer.
Lunar.dev offers solutions for quota management across environments, prioritizing API calls, centralizing API credentials management, and mitigating rate limit issues.