Deidamea Bajri
Dennis Heinz
Saman Khursheed
Serxhina Kutrolli
Stiliana Jano
Zeynep Eroglu
Keti Korini (Supervisor)
Christian Bizer (Supervisor)

Large Language Models (LLMs) have revolutionized natural language processing (NLP) tasks like text generation, translation, and question answering, yet limitations remain in areas such as reasoning and factual accuracy. Function calling, the ability of LLMs to interact with external APIs, offers a promising way to access real-time information and third-party services. However, how to evaluate the effectiveness of LLM function calls remains an open question.

This project introduces a novel benchmark designed to comprehensively assess the function-calling capabilities of LLMs across three core areas: function selection (identifying the appropriate functions based on user intent), parameter passing (constructing the necessary parameters), and answer generation (understanding the query and returning the desired answer). The benchmark includes 310 user questions focused on two use cases: New York City exploration (Airbnb listings, restaurants) and music exploration (albums, songs, artists). The questions are designed to evaluate the performance of LLMs on various functionalities: reasoning, information aggregation, data type handling (numbers, dates, locations, text), and user interaction (typos, extra context, criteria, entities, instructions). Evaluating GPT-4 on this benchmark revealed strengths in handling different languages and reasoning tasks, but challenges with diverse data formats, particularly dates.


1. Introduction

Large language models (LLMs) represent a significant advancement beyond early language models: they not only model and generate text proficiently but also address broader and more intricate general-purpose tasks. These models exhibit emergent capabilities that surpass those of smaller, pre-trained language models and have demonstrated efficacy in tackling multifaceted challenges across various domains, showcasing a depth of understanding and adaptability previously unseen in their predecessors. LLMs achieve this leap through substantial increases in computational power, the number of model parameters, and the size of the training dataset [14, 15]. They exhibit emergent abilities in various domains, including prompt-based task completion, instruction following without prior exemplars, and program execution, as demonstrated by recent advances in few-shot prompting, chain-of-thought prompting, and scratchpad-based execution in multi-step computational tasks. Additionally, ongoing research explores model calibration methodologies to assess the confidence and accuracy of language model predictions [16].

Despite their remarkable capabilities, Large Language Models (LLMs) are susceptible to various limitations and challenges. These include biases and inaccuracies in generated outputs, difficulties in comprehending complex logic and reasoning tasks, constraints in handling extensive datasets and long-term memory, and limitations in incorporating real-time or dynamic information [17]. Additionally, LLMs may exhibit sensitivity to prompts, struggle with text summarization tasks, and have difficulty mastering domain-specific knowledge or generating structured data. They also face challenges in generating truthful information, maintaining alignment with sources, and updating parametric knowledge in a timely manner. Furthermore, LLMs may exhibit inconsistencies between derived answers and reasoning processes, encounter difficulties in numerical computation, and tend to hallucinate facts. Addressing these challenges necessitates multifaceted approaches, including augmenting LLMs with external knowledge sources, fine-tuning LLMs with process-level feedback, using an ensemble of diverse reasoning paths, refining reasoning processes with self-reflection or external feedback, utilizing mathematical tools for numerical computation, tokenizing digits for improved arithmetic abilities, alignment tuning to maintain coherence with sources, and effective tool utilization to mitigate issues such as hallucinations in generated texts [18, 19].

Function calling addresses several of these challenges: it ensures a consistent response format and enables the use of external data sources, giving LLMs access to dynamic information within interactive user-agent chat contexts and thereby augmenting their adaptability and utility in practical applications [20, 21].

Figure 1: Illustration of the LLMs retrieving the user-created functions within the user prompts.

As illustrated in Figure 1, the function calling feature gives the model access to user-created functions that adhere to a specified structure and parameters. The LLM selectively invokes the pertinent function(s) for the user-generated prompt, fills in the designated arguments, and subsequently delivers a response based on the output generated by the invoked function(s). This mechanism makes LLMs more versatile and adaptable in responding to diverse user queries and prompts, leveraging the customized functionalities provided by users to enrich the model's capabilities [12].

Function calls in LLMs can be categorized based on the number of functions involved and their execution order. Single calls represent the simplest form, where the LLM executes a single, self-contained function within the prompt (e.g., "Provide the address of a specific Airbnb listing"). Multiple calls introduce more complexity, allowing for sequential or parallel execution. Sequential calls involve executing functions one after another, with the output of one feeding into the next (e.g., "Find the address of a specific Airbnb and then provide some restaurants nearby"). Parallel calls, where multiple functions execute concurrently, are a promising yet challenging area of research due to complexities in context management [12].
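To make this mechanism concrete, the sketch below shows a minimal function-calling loop built on the OpenAI Python SDK. The two tool definitions and the local dispatcher are hypothetical placeholders used only for illustration; they are not the benchmark's actual functions or released code.

```python
# Minimal sketch of a function-calling loop (assumes the OpenAI Python SDK >= 1.x).
# The tool schemas and execute_locally() below are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_airbnb_address",  # hypothetical function name
            "description": "Return the street address of an Airbnb listing.",
            "parameters": {
                "type": "object",
                "properties": {"listing_name": {"type": "string"}},
                "required": ["listing_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_restaurants_near_address",  # hypothetical function name
            "description": "Return restaurants located close to a street address.",
            "parameters": {
                "type": "object",
                "properties": {"address": {"type": "string"}},
                "required": ["address"],
            },
        },
    },
]

def execute_locally(name: str, args: dict) -> dict:
    # Stub: in a real setup this would query the underlying datasets.
    return {"function": name, "arguments": args, "result": "..."}

def answer(question: str) -> str:
    """Let the model call tools (single, sequential, or parallel) until it answers in text."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4-1106-preview", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # no further calls: final answer reached
            return msg.content
        messages.append(msg)            # keep the assistant's tool-call turn
        for call in msg.tool_calls:     # several entries here = parallel calls
            result = execute_locally(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
```

In this sketch, a sequential call corresponds to several iterations of the loop, whereas a parallel call appears as multiple tool_calls entries within a single model response.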

Nevertheless, ongoing discussions among researchers revolve around the performance of these models and their effectiveness in implementing such functionalities, particularly function calling, which facilitates connectivity with external API sources. When implementing function calling, the discussion also centers on identifying the main constraints of these models and investigating possible ways to overcome them. These efforts aim to enhance the performance of LLMs in various real-world tasks.

To address these inquiries, we introduce an innovative benchmark specifically designed to evaluate the function-calling capabilities of LLMs when interfacing with external APIs. This benchmark is crafted to assess the LLMs' ability to implement function calls for two distinct use cases. Beyond core function calling, our benchmark aims to capture the primary challenges encountered by the models by exposing them to a variety of user questions. These questions, grouped into distinct categories within the benchmark, will facilitate testing across different domains (New York City travel and music APIs) and enable us to assess in detail potential limitations when invoking various functions.

Ultimately, the limitations identified through running this benchmark can serve as valuable guidance for future research endeavors, enabling researchers to address and overcome these challenges in enhancing the capabilities of LLMs when interfacing with function-calling using external APIs and handling diverse user queries.

2. Datasets and Functions

In this section, we outline the datasets utilized in the creation of our benchmark for function calls using LLMs. A total of six repository datasets were selected to ensure comprehensive coverage and representativeness across both use cases, i.e., Music and Travel. For the Music use case, three datasets were chosen containing information about music albums, songs, and artists. For the Travel use case, three datasets were selected that include information about Airbnb listings, restaurants and cafes in New York City, and food ordering. During the first two phases (i.e., for single-parameter and multi-parameter function calls), we employed only two datasets. In the final phase, we introduced a third dataset to ensure connectedness between the datasets and to enrich the diversity of attributes available for querying in multiple function calling cases. Furthermore, for the Music use case, a sample of the first two datasets was used in this phase to ensure that it covers common entities and has overlapping attributes across all three datasets. Table 1 provides details about each dataset used for the respective use cases:

Table 1: Details of Datasets
(*: value in brackets indicates the sampled number of entities used in the multiple function calling phase)

| Use Case | Dataset Name | No. of Entities | No. of Attributes | Relevant Attributes for API Creation | Source |
|---|---|---|---|---|---|
| Music | Albums | 5119 (187)* | 8 | release name, artist name, release date, genres, descriptors, average rating, rating count, review count | Most Popular Albums: https://www.kaggle.com/datasets/tobennao/rym-top-5000/; Niche Albums (scraped data): https://rateyourmusic.com/ |
| Music | Songs | 1413 (123)* | 19 | track name, artist, album, release date, streams, bpm, danceability, energy, speechiness, instrumentalness, explicit, popularity, duration | Most Streamed Spotify Songs 2023: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023 |
| Music | Artists | 47 | 7 | name, formed/born, city, country, genre | Artists (scraped data): https://rateyourmusic.com/ |
| Travel | Airbnbs | 1000 | 16 | host name, neighbourhood group, precise geo-coordinates (latitude and longitude), room type, prices, reviews, last review date, average reviews per month | New York City Airbnb Open Data: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data |
| Travel | Restaurants | 1000 | 9 | restaurant name, cuisine, borough of location, street address, zip code, building, phone number, precise geo-coordinates (latitude and longitude) | New York City Restaurant Inspection Results: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data |
| Travel | Food Orders | 1898 | 7 | restaurant name, cuisine type, order cost, customer ratings, food preparation time, delivery time | NYC Restaurants Data - Food Ordering and Delivery: https://www.kaggle.com/datasets/ahsan81/food-ordering-and-delivery-app-dataset |

For the datasets specified in the table above, we created an API that includes several functions to enhance the LLM's ability to answer the user questions provided as prompts. For all three phases, functions with simple descriptions are created. Additionally, for phase 1 (single-parameter calls), functions with complex descriptions are also created to check whether the LLMs find it more challenging to answer the questions. By complex descriptions, we mean descriptions that are worded differently to add complexity, or that contain additional information that may confuse the LLMs when calling the correct function.

For each phase tested, we have created a certain number of functions. For the Travel use case, we created 14 functions for the single-parameter phase (each with both a simple and a complex description), 10 functions with 1 to 4 parameters for the multi-parameter phase, 11 functions with 1 to 4 parameters for the sequential multi-calls, and 7 functions with 1 to 2 parameters for the parallel multi-calls. For the Music use case, we created 10 single-parameter functions (again with both simple and complex descriptions), 16 functions with 2 to 5 parameters, and 11 functions with 1 to 2 parameters for the sequential and parallel multi-calls. For both use cases, the parameter data types used are strings, integers, floats, and booleans. We decided not to include a dedicated date data type for parameters, so as not to confuse the LLMs with different date formats; instead, the required date format is specified in the function descriptions. A few examples of functions created for both use cases are shown in Table 2.
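As an illustration, the sketch below shows how two of the functions from Table 2 could be expressed as JSON-schema tool specifications. The schema layout, the reworded complex description, and the concrete date format string are illustrative assumptions; the released function sets may differ in their exact wording.

```python
# Illustrative sketch of tool specifications for two Table 2 functions.
simple_tool = {
    "type": "function",
    "function": {
        "name": "get_host_name",
        "description": "Provide the host name for the Airbnb listing.",  # simple description
        "parameters": {
            "type": "object",
            "properties": {"listing_name": {"type": "string"}},
            "required": ["listing_name"],
        },
    },
}

complex_tool = {
    **simple_tool,
    "function": {
        **simple_tool["function"],
        # Complex variant: same function, reworded description (see Table 2).
        "description": "Provide the name of the Airbnb's owner.",
    },
}

date_tool = {
    "type": "function",
    "function": {
        "name": "filter_albums_by_date_range",
        # Dates stay plain strings; the expected format is stated only in the
        # description (the YYYY-MM-DD format here is an illustrative assumption).
        "description": "Filters and retrieves albums released within a specified "
                       "date range, inclusive of the start and end dates. "
                       "Dates must be given as YYYY-MM-DD strings.",
        "parameters": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string"},
                "end_date": {"type": "string"},
            },
            "required": ["start_date", "end_date"],
        },
    },
}
```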

We provide all function sets for download.

Table 2: Examples of Functions.

| Use Case | Function Name | Parameters | Data Types of Parameters | Description |
|---|---|---|---|---|
| Travel | get_host_name | listing_name | string | Provide the host name for the Airbnb listing. |
| Travel | get_host_name | listing_name | string | Provide the name of the Airbnb’s owner. |
| Travel | get_x_most_popular_places_in_neighbourhood_group_room_type | popularity, neighbourhood_group, room_type | integer, string, string | Get the x most popular Airbnbs in a neighborhood group and for a specific room type. |
| Travel | get_airbnb_by_price_min_nights_and_neighborhood_group | price, min_nights, neighbourhood_group | integer, integer, string | Get the listing by minimum price in US dollars, minimum nights and specific neighborhood group. |
| Travel | get_avg_delivery_time_by_restaurant_name | restaurant_name | string | Get average delivery time of a restaurant. |
| Music | top_rated_albums | n | integer | Returns the top-rated albums based on average rating. |
| Music | top_rated_albums | n | integer | Retrieves the records of most favoured music releases ranked by their scores. |
| Music | filter_albums_by_date_range | start_date, end_date | string, string | Filters and retrieves albums released within a specified date range. The range is inclusive of the start and end dates. |
| Music | songs_by_danceability_explicitness_speechiness | danceability_threshold, speechiness_threshold, explicit | integer, integer, boolean | Retrieves songs based on specified thresholds for danceability, speechiness, and explicit content criteria. |
| Music | artist_info | artist_name | string | Retrieves detailed information for a specified artist, including attributes like name, band status, formation/birth date, city, country, genres, and related artists. |

3. Question Set and Ground Truth

In order to test the ability of LLMs to call functions when needed, a total of 310 questions were created. For the Travel use case we tested 150 questions in total, and for the Music use case 160 questions in total, across all phases: single-parameter function calls, multi-parameter function calls, and multiple sequential and parallel function calls needed to give the final answer to the user. These questions can be used to evaluate a model's function calling ability, parameter filling, and, lastly, the final answer given. They are categorized into different test categories to help better analyze and interpret the results. The combined categories created across both use cases and all phases are as follows: (i) Selection, (ii) Languages, (iii) Data Types, (iv) Aggregation, (v) Extra Context, (vi) Reasoning, (vii) Simple Questions, (viii) Typos, (ix) Long-Tail Entities, and (x) Interpretation.

The number of questions tested for each category is listed below:

(i) Languages: We created this category to examine the LLMs' performance in handling questions in languages other than English. In this way, we can assess the LLM's multilingual capabilities and its capacity to return correct answers for widely spoken, less widely spoken, and non-Latin-script languages. The languages tested across all phases include German, Turkish, Urdu, Spanish, Greek, Albanian, and French for both use cases.

(ii) Selection: In the Selection of Records category, we evaluate GPT’s accuracy in selecting the requested records across several dimensions.

(iii) Aggregation: The Aggregation category includes questions that test the ability of LLMs to perform simple calculations (computing sums, differences, and averages), identify minimum and maximum values, and count records. It evaluates the LLM’s proficiency in aggregating information from multiple records to answer the user query. Typical cases being tested are: filtering by a lower or higher price, filtering by a price range, dividing a budget by the number of visitors, finding average prices for cuisines, summing costs, filtering by the highest number of reviews, averaging the streams of an artist’s songs, counting the number of albums released by an artist, etc.

(iv) Extra Context: The Extra Context category includes questions where the user provides either additional related context or unrelated context. The questions with related context may require semantic understanding of the query and are intended to assess the model's ability to understand and incorporate contextual cues. This category also includes questions with unrelated or misleading context that evaluate the model's capability to distinguish and disregard irrelevant information. The model should be able to select the correct parameter values and the right functions.

(v) Data Types: In the Data Types category, we created questions to test the ability of LLMs to understand several data types: currencies and their conversion to other currencies (e.g., from yuan to dollars), conversion within the same currency (e.g., cents to dollars), different date formats (e.g., from YYYY-MM-DD to DD-MM-YY), time measurement units (e.g., from minutes to hours), phone numbers with prefixes (e.g., +49 is Germany’s prefix), and the conversion of a textual representation of a number into its numerical value (e.g., one zero zero one nine to 10019).

(vi) Reasoning: This category was created to evaluate the ability of LLMs to perform logical reasoning and generate inferences. The questions in this category require the model to make recommendations based on the conditions stated in the query, or to reason about the meaning of attributes on their own or in relation to other attributes. This helps analyze whether the model is able to draw informed conclusions and identify correlations among different attributes in the dataset. It also tests the common-sense reasoning capability of the model. Common-sense reasoning cases relate to: the concept of time (e.g., "tonight" refers to one night of stay) and the selection of correct records according to a required minimum number, budgeting (e.g., when several people share a fixed budget, what the cost per person could be), wording (e.g., if the user is at a specific location in the US and asks for traditional food, this could mean American food), and popular locations and geographical proximity (e.g., if the user is at the Statue of Liberty, which borough they are located in).

(vii) Simple Questions: For this category, the user queries are more instructive and informative, making it easier for the model to call the correct functions with the correct parameters and parameter values to produce the final answer.

(viii) Typos: For this category, we created questions containing different types of intentionally misspelled input: typos in Airbnb or restaurant names, general spelling mistakes, typos in date formats, in currencies, in numbers or strings indicating prices, in artist names, linguistic typos, typos in genres, etc. The model should recognize these typos and correct them appropriately both when selecting the functions to call and when filling the parameters.

(ix) Long-Tail Entities: The "Long Tail Entities" category includes questions that inquire about some niche entities, like lesser-known artists, or non-popular music genres. This category focuses on the evaluation of the model's ability to provide accurate responses even for less common entities. Also included in this category are questions about confusing entities, like for example Airbnb listings with the same names but different attributes such as location and location details. In this case, we test whether the model distinguishes between these entities based on their other attributes.

(x) Interpretation: In the Interpretation category, we construct questions to test the capability of the model to interpret and comprehend the attributes in case of parameter filling as well as extracting meaningful insights from the resulting records. We focus on three dimensions: textual interpretation of attributes, temporal interpretation, and date interpretation (various date formats were tested).

After formulating the questions and categorizing them, we constructed the ground truth based on the data sources we have used. To test the accuracy of the LLMs, we have defined ground truths for three dimensions: Correct functions, Correct Parameter Values, and Overall Correct Answer. Table 3 provides a short overview of questions for each category with the respective ground truths including the expected functions to be called, parameter values to be filled, and correct answers. In some cases, we can have several correct function calls, as also indicated in the table, that generate the same correct answer. For the questions in the table that require multiple function calls, we have denoted them with “&” to show that all the functions are needed to receive a response.
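For illustration, a single ground-truth entry could be represented roughly as follows. Apart from ordered_items, which is described in Section 4.1, the field names below are our own illustrative choices and do not necessarily match the released file format.

```python
# Purely illustrative sketch of a ground-truth entry, using the Selection
# example from Table 3. Field names (except ordered_items) are assumptions.
ground_truth_entry = {
    "question": "Among the 15 songs with the longest durations, "
                "what is the most popular song?",
    "category": "Selection",
    # Several correct call paths may exist; paths that need more than one
    # function are written with "&" in Table 3 and modeled here as lists.
    "correct_function_paths": [["songs_by_longest_duration"]],
    "correct_parameters": {"songs_by_longest_duration": {"n": 15}},
    "answer": {"song": "2085"},
    "ordered_items": False,  # only relevant when the answer is a list
}
```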

We provide all question sets including their ground truth for download.

Table 3: Examples of questions for each category.
Each example lists the question, the expected function calls, the expected parameter values, and the correct answer.

Languages (Travel)
- Question: "Po verdallisem neper new york dhe nuk po gjej ndonje vend per te ndenjur sonte. Nuk kam shume para ne xhep keshtu qe me duhet nje vend i lire per fjetur? Me jep dot nja 5 sugjerime qe kushtojne rreth 30 dollare per nate?" (Translation: I am wandering around New York and cannot find a place to stay tonight. I do not have much money with me, so I need a cheap place to stay. Can you give me 5 suggestions that cost around 30 dollars per night?)
- Functions: get_listing_by_lower_price
- Parameters: price: 30
- Answer: "Huge room in great area 25 minutes from Manhattan", "Room in Bushwick Bk available June", "Sunny spacious room full of good energy", etc.

Languages (Music)
- Question: "Ich mag das Album von Coldplay, das 2011 veröffentlicht wurde, sehr. Finden Sie ihre Alben und können Sie mir dann die Künstler empfehlen, die dem gleichen Genre wie das genannte Album angehören." (Translation: I really like the album by Coldplay which was released in 2011. Find their albums and then recommend me artists of the same genre as that of the mentioned album.)
- Functions: albums_by_artist & artists_by_genres
- Parameters: artist_name: Coldplay & genres: ["Pop Rock", "Dream Pop"]
- Answer: "The Beatles", "The Rolling Stones", "Bob Dylan", "Taylor Swift", "Lady Gaga", "Miley Cyrus", "Shakira", "Olivia Rodrigo", "Elton John"

Selection (Music)
- Question: Among the 15 songs with the longest durations, what is the most popular song?
- Functions: songs_by_longest_duration
- Parameters: n: 15
- Answer: "2085"

Data Types (Travel)
- Question: I am currently in New York, I was wandering around Manhattan and I have only 10 dollars and 10 euros so please help me find a place for tonight.
- Functions: get_airbnb_by_price_range_neighbourhood or get_airbnb_by_price_and_neighborhood_group
- Parameters: min_price: 0, max_price: 120, neighbourhood_group: Manhattan or price: 20, neighbourhood_group: Manhattan
- Answer: "Calm bed great area", "Room with a view"

Aggregation (Travel)
- Question: I plan to budget 500 dollars for my upcoming New York trip, with 80 dollars designated for accommodation, 220 dollars for clothing, and 200 dollars for dining at various restaurants. Could you offer me diverse lodging recommendations that fit within this financial plan? Please return at most 10 entries.
- Functions: get_listing_by_lower_price
- Parameters: price: 80
- Answer: "Amazing huge furnished room!", "Sunny and spacious 1-bedroom in Brooklyn", "Huge room in great area 25 minutes from Manhattan", etc.

Extra Context (Travel)
- Question: I know that I have a friend from high school, she was from California and she now lives in Brooklyn, New York. Recently, I heard she got some new apartments in that area, but I couldn’t find them. Could you please help me find some apartments in that area?
- Functions: get_listing_by_neighbourhood_group
- Parameters: neighbourhood_group: Brooklyn
- Answer: "Sunny and spacious 1-bedroom in Brooklyn", "Large Private Room in Greenpoint Williamsburg", "Spacious Brooklyn Brownstone 3BR+", etc.

Extra Context (Music)
- Question: The Michelin star, which the most respected restaurants and chefs can have, has been used to measure the most successful food services in Europe and the world for more than 100 years. While having dinner with my friends in a restaurant in Paris, we started discussing our favourite artists. My best friend likes Bollywood songs and her favourite singer is Arjit Singh, while my other friends are fans of BTS. I personally like Taylor Swift and Selena. Can you recommend 3 songs by my best friend's favourite singer which we can enjoy while tasting an epic menu at a Michelin star restaurant?
- Functions: songs_by_artist
- Parameters: artist_name: Arijit Singh
- Answer: "Phir Aur Kya Chahiye", "Apna Bana Le", "Jhoome Jo Pathaan", "Kesariya"

Reasoning (Travel)
- Question: I have 300 dollars with me but I was planning to spend one third on accommodation in New York, do you have some recommendations for me? Please return at most 10 entries.
- Functions: get_listing_by_price or get_listing_by_lower_price
- Parameters: price: 100
- Answer: "Amazing huge furnished room!", "Sunny and spacious 1-bedroom in Brooklyn", "Large Private Room in Greenpoint Williamsburg", etc.

Reasoning (Music)
- Question: I am hosting a birthday party for my child. I prefer the danceability score to be higher than 90 to be suitable for a birthday party. Can you please recommend some latest songs which are suitable for children?
- Functions: songs_by_danceability_explicitness
- Parameters: danceability_threshold: 90, explicit: false
- Answer: "YOU THE VIBE", "TRAKA", "DOGGY DOGGY", "MOLI", "PALETA PA TO EL MUNDO", "RICO FEO", "TEKIRIKI"

Simple Questions (Travel)
- Question: As I am staying for seven weekdays in New York, 5 days I will be staying at a relative's house and afterwards I would like to find a very nice Airbnb in Queens to spend the rest of the days with my wife. Each of us have a budget of 220 bucks per night so we could stay together in a nice, clean Airbnb. Considering my requirements and after finding the certain Airbnbs, please find the corresponding proximity data such as latitude and longitude and use them to provide me with its specific address if possible?
- Functions: get_airbnb_by_price_min_nights_and_neighborhood_group & get_long_lat_by_airbnb & get_airbnb_address_by_lat_long
- Parameters: price: 440, min_nights: 2, neighbourhood_group: Queens & listing: rooms for rent in Queens with piano, City Skyline Views from every room! Nice Private Room Beauty in Queens & latitude: 40.70163, longitude: -73.90867
- Answer: "TURNBULL AVENUE"

Typos (Travel)
- Question: I want to find a shered place to stay, can you give me some suggestions?
- Functions: get_listing_by_room_type
- Parameters: room_type: Shared room
- Answer: "Bedroom 7 bed A.", "Williamsburg Loft!! Bedford L 1blk!", "Calm bed great area.", etc.

Typos (Music)
- Question: I am a fan of Dua Lipa. Can you please provide me with a list of her songs.
- Functions: songs_by_artist
- Parameters: artist_name: Dua Lipa
- Answer: "Dance The Night", "Cold Heart", "One Kiss", "Don't Start Now", "No Lie", "Sweetest Pie", "Levitating", "Potion"

Long-Tail Entities (Travel)
- Question: I am close to 'DUNKIN', BASKIN ROBBINS'. Is it sure that I will find coffee or tea there?
- Functions: get_cuisine
- Parameters: restaurant: DUNKIN', BASKIN ROBBINS
- Answer: "Donuts", "CoffeeTea"

Long-Tail Entities (Music)
- Question: Provide all albums that are footwork but not wonky, alternative R&B, conscious hip hop, instrumental hip hop, deconstructed club.
- Functions: albums_by_genres2
- Parameters: genres_in: ["wonky"], genres_out: ["footwork", "alternative R&B", "conscious hip hop", "instrumental hip hop", "deconstructed club"]
- Answer: "No Love Deep Web", "WLFGRL", "Double Cup", "Galaxy Garden", "Room(s)", "So It Goes"

Interpretation (Music)
- Question: Among the 10 longest songs, how many can I listen to fully in an hour of my time?
- Functions: songs_by_longest_duration
- Parameters: n: 10
- Answer: 9

4. Benchmark Evaluation

In this section, we introduce our methodology for evaluating the LLM responses and the experimental results on the different question categories of the benchmark.

4.1 Evaluation of LLM Response

To assess the LLM responses, we evaluate the response along three dimensions: functions called, parameters filled, and final answer given.

Evaluation of final answer. To evaluate the final answer given, the LLM's response, which is returned as JSON, is compared to the expected JSON answer in the ground truth. The comparison is done by exactly matching their key/value pairs. The expected JSON values can include strings, numbers, booleans, and lists. For lists, the evaluation considers the order of items if the boolean field ordered_items is set in the ground truth; if the order is unspecified, the lists are sorted before comparison. The response is evaluated as "Correct" if all key/value pairs match and as "Incorrect" otherwise.
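A minimal sketch of this exact-match check, under the assumption that both answers are available as parsed JSON objects (the helper names are ours, not the released evaluation code):

```python
import json

def normalize(value, ordered_items: bool):
    """Sort list values when the ground truth does not fix their order."""
    if isinstance(value, list) and not ordered_items:
        return sorted(value, key=lambda item: json.dumps(item, sort_keys=True))
    return value

def answer_is_correct(llm_answer: dict, expected: dict, ordered_items: bool = False) -> bool:
    """Exact match of all key/value pairs between the LLM answer and the ground truth."""
    if llm_answer.keys() != expected.keys():
        return False
    return all(
        normalize(llm_answer[key], ordered_items) == normalize(expected[key], ordered_items)
        for key in expected
    )
```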

Evaluation of functions called. To evaluate the performance of LLMs in calling the correct sequence of functions along with their parameters, we compare the path taken by the model to produce the final answer with the correct paths in the ground truth. If multiple function paths can answer the question correctly, we first identify which path in the ground truth has the most functions in common with the functions called by the LLM and compare against that path. We distinguish three cases in the evaluation of function calls:

The accuracy of function calling is calculated as: $$Acc_{function} = {\#correct\_functions\_in\_answers \over \# functions\_in\_ground\_truth}$$
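The following sketch illustrates one way this metric could be computed when several correct paths exist; the variable names and the path-selection heuristic are illustrative assumptions rather than the released evaluation code.

```python
def function_accuracy(called_functions: list[str], ground_truth_paths: list[list[str]]) -> float:
    """Accuracy of function calling against the best-matching ground-truth path."""
    # Pick the ground-truth path sharing the most functions with the LLM's calls.
    best_path = max(ground_truth_paths, key=lambda path: len(set(path) & set(called_functions)))
    correct = sum(1 for function in best_path if function in called_functions)
    return correct / len(best_path)
```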

Evaluation of parameters filled. Similarly to the evaluation of the functions called, to evaluate the accuracy of parameter filling we compare the passed parameters to the correct parameters in the ground truth. A string parameter is deemed correct if both key and value match. If the parameter is a number (integer or float), we round it to two decimals and check whether it equals the parameter value in the ground truth. As in function calling, we distinguish three evaluation cases:

The accuracy of parameter filling is calculated as: $$Acc_{parameter} = {\#correct\_parameters\_in\_answers \over \# parameters\_in\_ground\_truth}$$
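Analogously, the parameter check could be sketched as follows; the helper names are illustrative, and comparing booleans by exact equality is an assumption on our part.

```python
def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def parameter_matches(called_value, expected_value) -> bool:
    """Strings and booleans must match exactly; numbers are rounded to two decimals."""
    if _is_number(called_value) and _is_number(expected_value):
        return round(float(called_value), 2) == round(float(expected_value), 2)
    return called_value == expected_value

def parameter_accuracy(called_params: dict, expected_params: dict) -> float:
    """Share of ground-truth parameters that were filled with a matching value."""
    correct = sum(
        1 for key, expected in expected_params.items()
        if key in called_params and parameter_matches(called_params[key], expected)
    )
    return correct / len(expected_params)
```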

4.2 Experimental Results

In this section, we present the experimental results of our project, focusing on the accuracy of function calls, parameter filling, and final answers generated by the GPT-4 model (gpt-4-1106-preview). Detailed statistics are provided in the following tables, showcasing the model's performance in various function-calling scenarios. These include the accuracy of the model when calling functions with a single parameter (both with simple and complex function descriptions), calling functions with multiple parameters, and multiple function calling (sequential and parallel). The statistical analysis presented in the following tables offers insights into the model's effectiveness in interpreting and responding to queries covering the different categories.

Table 4: Results of Single Parameter Function Calling Phase (Simple Description). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 100 | 100 | 64 |
| Interpretation | 100 | 100 | 86 |
| Reasoning | 100 | 100 | 80 |
| Aggregation | 100 | 100 | 71 |
| Data Types | 100 | 77.78 | 77.78 |
| Languages | 100 | 87.5 | 75 |
| Typos | 100 | 66.67 | 55.56 |
| Long-Tail Entities | 33.33 | 33.33 | 33.33 |
| Extra Context | 83.33 | 66.67 | 44.44 |
| Overall | 95 | 85 | 66 |
Table 5: Results of Single Parameter Function Calling Phase (Complex Description). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 90.91 | 90.91 | 63.63 |
| Interpretation | 100 | 100 | 57.14 |
| Reasoning | 100 | 100 | 80 |
| Aggregation | 100 | 100 | 85.71 |
| Data Types | 100 | 100 | 100 |
| Languages | 93.75 | 87.5 | 87.5 |
| Typos | 100 | 66.67 | 55.56 |
| Long-Tail Entities | 33.33 | 33.33 | 33.33 |
| Extra Context | 88.89 | 72.22 | 50 |
| Overall | 93.81 | 86.6 | 71.13 |
Table 6: Results of Multiple Parameter Function Calling Phase. Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 89 | 90 | 53 |
| Interpretation | 80 | 89 | 40 |
| Reasoning | 82 | 83 | 45 |
| Aggregation | 95 | 95 | 81 |
| Data Types | 100 | 66.67 | 18.18 |
| Languages | 100 | 92.31 | 77.78 |
| Typos | 92 | 82.86 | 58.33 |
| Long-Tail Entities | 83.33 | 88.23 | 83.33 |
| Extra Context | 86.67 | 84.09 | 53.33 |
| Overall | 90 | 85.68 | 59 |

Table 4 depicts the accuracy of single-parameter function calling when simple function descriptions are employed. The model performed distinctly worse in the Long-Tail Entities category with respect to function calls than in the other categories, and was less accurate in determining parameters in several categories, including Typos, Long-Tail Entities, and Extra Context. Comparing the results in Tables 4 and 5, it can be seen that the model performed almost equally well for functions with simple and complex descriptions; the overall accuracy is nearly the same in both cases.

Table 6 shows the results of multi-parameter function invocation. We can observe that the model's performance in providing correct answers declines noticeably when invoking functions with multiple parameters compared to a single parameter. There is a marked drop in the accuracy of parameter filling, particularly within the Data Types category, in comparison to the other categories. In delivering the correct final answer, the model's overall performance notably lags behind its success in both parameter filling and function calling.

Table 7: Results of Multiple Function Calling (Parallel). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 75 | 75 | 50 |
| Reasoning | 100 | 95.45 | 77.77 |
| Aggregation | 100 | 100 | 50 |
| Languages | 100 | 100 | 57.14 |
| Extra Context | 100 | 82.35 | 50 |
| Simple Questions | 100 | 100 | 100 |
| Overall | 97.8 | 93.54 | 60.97 |
Table 8: Results of Multiple Function Calling (Sequential). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 91.67 | 91.67 | 66.67 |
| Reasoning | 71.43 | 66.67 | 44.44 |
| Aggregation | 78.57 | 68.29 | 50 |
| Languages | 72.72 | 48.38 | 36.36 |
| Data Types | 75 | 75 | 50 |
| Extra Context | 90 | 87.09 | 53.85 |
| Simple Questions | 63.63 | 50 | 37.5 |
| Overall | 76.74 | 65.57 | 50.88 |

The findings presented in Tables 7 and 8 offer valuable insight into the effectiveness of parallel invocation of multiple functions compared to sequential invocation. Table 7 shows that within the Selection and Extra Context categories, the model encountered challenges in comprehending context and accurately transferring parameter values, particularly in comparison to the other categories. In the overall evaluation, the model struggles to determine the final answer, particularly for Selection, Aggregation, and Extra Context, when compared to its accuracy in function and parameter determination. From Table 8, we can infer that the model finds it difficult to correctly fill parameters in sequential function calling, leading to a decline in the accuracy of the final answers. The results of the Simple Questions category indicate that despite the additional instructions provided in the prompts, the model often remained unable to produce the correct final answer.

4.3 Error Analysis

After conducting a detailed analysis of the cases where the GPT-4 model failed to retrieve the correct function calls or parameters, or to provide accurate final answers, we identified recurring patterns and grouped them into distinct error categories for both use cases. These error categories are classified into three main groups: instances where the model fails to retrieve the correct functions, instances where the accurate parameters are not retrieved, and situations where the model provides incorrect final answers. The latter group encompasses scenarios where the sequence order of function calls may be incorrect, or the model fails to reason adequately even when functions and parameters are retrieved accurately.

In the following section, we will provide a detailed breakdown of each error category, focusing on the specific types of errors made by the algorithms.

4.3.1 Error Categories for Incorrect Function Calls

There are several instances where the GPT-4 model fails to call the appropriate functions or to provide the correct list of functions. By identifying the patterns and limitations in each instance, we have clustered them into four error categories, each representing similar types of limitations encountered while running the user prompts.

(i) Incomplete Function Invocation: This error category comprises 7 questions where the model fails to call the second or third function due to lack of reasoning or misunderstanding of the provided user request. This could occur if the model does not recognize the need for calling the second or the third function or incorrectly assumes that the task can be completed with just one function call. It may also arise when the user provides excessive numerical details in the input prompt, causing the model to overlook other aspects of the user's request. This also includes the instance where the model relies on existing knowledge and prefers not to call the artist_info function when information about famous artists is required.

(ii) Wrong Function Invocation: This error category comprises 11 questions where the model fails to call the correct function due to not understanding the provided user request. The model either does not correctly understand which parameters need to be used to call the function requested in the user query or, in some cases, relies on existing knowledge and prefers not to call the correct function.

(iii) Additional Function Calls: There are around 3 instances where the GPT-4 model calls multiple functions in addition to the correct one. The model cannot determine which function needs to be called because the query targets multiple pieces of information, so it calls all potentially relevant functions in order to retrieve the information requested in the user query.

(iv) Failure to Pass Entire List between Functions (List Passing Error): This error category encompasses 2 failed answers, which we consider significant due to their uniqueness and their demonstration of important algorithm limitations. Although these instances stand alone, they shed light on a broader issue observed during experimentation with other instances: the consistent failure of the model to pass the full list of values from the output of one function to the input of another. In these failed instances, the algorithms fail to transfer the entire list of elements outputted by the first function to the second function during the sequential multiple-function calling, resulting in incomplete data processing and potential inaccuracies.

Figure 2: Distribution of the questions in the error categories for incorrect function calling.

Based on the error categories assigned for incorrect function calls, we have gained insight into the main limitations contributing to most of the model failures in calling the right functions. According to Figure 2, the primary issue with the function calling functionality of the GPT model lies in sequential multiple function calling. In these instances, the model fails to call the correct sequence of functions or disregards the need to call the second or third function. This failure is typically attributed to a lack of reasoning and of understanding that a sequence of functions, rather than a single function, is required.

4.3.2 Error Categories for Incorrect Parameter Calls

There are approximately 60 instances where the GPT-4 model fails to retrieve the correct parameters for the called functions. By identifying patterns in the limitations that lead the model to fill in incorrect parameter values for the user prompts, we have grouped similar limitations into distinct error categories. As a result, we have identified eight error categories, each representing a cluster of questions where the model failed to retrieve the right parameters due to similar limitations.

(i) Parameter Hallucination due to Wording/Extra Context: This category encompasses 11 instances where the model, tested for function calling abilities, fails to retrieve the correct parameter fillings and instead completely hallucinates the parameters. Parameters generated by the model are entirely unrelated to the task at hand or the information provided in the user query, indicating a disconnect between the generated output and the intended context.

This phenomenon is believed to occur due to the model's inability to correctly understand the entirety of the user request, particularly when the request is verbose, includes excessive details, or provides extensive related context. In such cases, the algorithms become confused and fail to grasp the essence of the request, resulting in inaccurate parameter filling and consequently, a wrong or completely unrelated answer to the user.

(ii) Misidentification of Numeric Entities in User Prompt: This error category contains 2 instances. In these instances, the model fails to accurately identify parameter fillings, specifically numeric entities provided by the user in the prompt. The model misinterprets numeric sequences in the input text, such as phone numbers and postal codes, as non-numeric entities, such as listing names or other textual elements. This leads to errors in identifying relevant parameter values, resulting in incomplete information being processed.

(iii) Misinterpretation of Non-Latin Characters: This error category contains around 5 instances. The model misinterprets characters from non-Latin scripts, leading to errors in the comprehension and analysis of text inputs. The model encounters difficulties in accurately translating the user's request from one language to another, resulting in incorrect parameter values being passed to the function.

(iv) Incorrect String Handling: There are around 13 instances where the GPT4 model encounters difficulties in handling string values in the prompts. This error occurs in three cases: Firstly, when the user provides context that may confuse the algorithms, leading to incorrect extraction of string elements for parameter values based on the provided context. Secondly, when the model fails to handle string synonyms, resulting in extracted parameters deviating from the expected string representation format in the dataset used, thereby introducing inaccuracies when the function is called. Lastly, when dealing with user typos, the algorithms struggle to identify and correct them, resulting in incorrect parameter values being selected.

(v) Currency and Conversion Error: In this error category, there are 5 instances where the model fails to retrieve the correct parameter value due to limitations in currency conversions. This error occurs in two situations: firstly, when the algorithms are unable to perform accurate currency conversion from the currency format provided by the user to the currency type used in the dataset API (e.g., yuan to dollars); secondly, when the algorithms inaccurately convert currency values provided in different units by the user’s prompt (e.g., dollars and cents) to the currency unit of the same currency type used in the dataset API.

(vi) Limitations in Text-to-Number Conversion: There are a total of 3 cases where the model’s algorithm retrieves incorrect parameter fillings due to limitations in text-to-number conversion. The algorithms struggle to recognize and convert textual representations of time, such as "week", into their corresponding numeric values (e.g., 7 days). Additionally, they encounter errors or failures when attempting to perform division or other mathematical operations with numerical values represented as text within user prompts.

(vii) Date Inference Error: According to the data presented, the primary limitation of the GPT4 model arises when dealing with various date formats. There are approximately 18 instances where the model fails to retrieve the correct date value as the parameter. This error category encompasses questions where the model returns inaccurate answers due to difficulties in inferring dates from user inputs or contextual cues. These errors may occur when the algorithms encounter discrepancies in converting textual date representations provided in the prompts to the standardized date format required by the API dataset, resulting in errors in parameter value representation. Additionally, errors may occur when the algorithms fail to accurately recognize and interpret the date format when presented with dates in different formats in the user prompt, leading to errors in parameter value retrieval.

(viii) Incomplete Parameter Values Passing: In this category, there are 3 instances. In these cases, the LLM does not receive all the necessary parameters or arguments required to execute the set of functions correctly. This can happen due to a misunderstanding of the function requirements or a lack of context.

Figure 3: Distribution of the questions in the respective error categories for incorrect parameter calls.

As depicted in Figure 3, our analysis reveals that the primary cause of the highest number of failures in the GPT4 model is the presence of dates in various formats within user prompts. Additionally, we observe that the second most prominent limitation occurs when users include additional context (related and unrelated) or synonyms in their prompts, which the model struggles to interpret accurately. This often results in incorrect parameter fillings and subsequently, incorrect answers. These findings underscore the importance of addressing these specific limitations to enhance the model's performance in handling diverse user inputs effectively.

4.3.3 Error Categories for Incorrect Answers

There are approximately 64 instances where the GPT-4 model fails to provide correct answers. By analyzing the underlying patterns and types of errors observed in these instances, we categorized them into 8 distinct error categories. Each error category comprises questions where the model provided inaccurate responses due to similar limitations or recurring patterns.

(i) Arithmetic Anomaly (Summation/Average): This error category contains 13 instances where the GPT-4 model encounters difficulty in executing arithmetic computations accurately, leading to inaccurate responses in the case of summation and average operations. The model's arithmetic performance is especially affected when dealing with long lists of numerical data. For example, when dealing with a long list of customer ratings, song stream counts, or any other numerical attribute, the model struggles to maintain precision in its calculations.

(ii) Sorting Anomaly (Numerical Values): This error category comprises 9 questions where the GPT4 model returned inaccurate answers due to probable anomalies in sorting numerical values, particularly when requested to identify specific positions of any attribute such as the third highest or lowest value. Processing multiple numeric values simultaneously poses challenges for the algorithms, leading to inaccuracies or failures in determining relationships between the items.

(iii) Sorting Anomaly (Dates): This error category encompasses 6 instances where the GPT4 model provides inaccurate responses due to potential anomalies in sorting date values, particularly when queried to return results based on specific criteria such as the third most recent or least recent date. It can lead to discrepancies in the ordering of dates and consequently inaccurate responses in case of specific chronological criteria-based queries.

(iv) Counting Discrepancy: This error category consists of 1 instance. In this instance, the model fails to provide a correct response due to discrepancies or inaccuracies in counting elements or results, such as the number of songs in the records returned by the called function.

(v) Multi-Information Retrieval Error: This error category includes 12 instances where the GPT4 model encounters challenges or inaccuracies in retrieving multiple pieces of information or results. Instead of retrieving relevant information that satisfies the criteria specified by the user, the model returns responses with errors, such as selecting records that meet only one of the multiple specified criteria. Examples include instances where the model fails to retrieve all relevant records that meet specified criteria or where it overlooks certain criteria altogether, resulting in incomplete or inconsistent responses.

(vi) Query Misinterpretation: This error category comprises 7 questions where the GPT4 model returns inaccurate answers due to difficulty in interpreting user queries. It includes cases where the model fails to grasp the context of user queries, misinterprets specific keywords or phrases, or the intent behind the question by the user. This category also caters to instances where the model incorrectly parses the provided details or fails to recognize a parameter value due to the given extra context in the user’s prompt, leading to errors in identifying the relevant parameter values, resulting in incomplete information being processed. Also, there are cases where the model misinterprets numeric sequences in the input text, such as phone numbers and postal codes, as non-numeric entities, such as listing names or other textual elements.

(vii) Inexact Matching/Lack of Contextual Relevance & Common-sense Reasoning/Lack of Specificity: In this category, there are a total of 14 questions. The responses of the GPT4 model lack contextual relevance and specificity, leading to inaccuracies or inadequacies in addressing user queries. This category also encompasses instances where the provided information only partially aligns with the user query, resulting in incomplete responses or additional irrelevant information. Furthermore, the algorithms struggle to select the appropriate functions or provide the right final answer due to a lack of common-sense reasoning abilities. (e.g., the algorithm, lacking common-sense reasoning abilities, fails to understand that a baby typically doesn't require a separate accommodation booking or incur additional charges)

(viii) Reliance on Previous Knowledge: There are 2 instances where the GPT4 model returns inaccurate answers due to the use of its previous knowledge base rather than relying on getting information from the results of called functions. Algorithms encounter difficulties when dealing with well-known entities, leading to failures in identifying relevant functions and parameters for popular restaurant names, albums, or artists. They provide responses based on pre-existing knowledge, resulting in potentially inaccurate or incomplete answers.

Figure 4: Distribution of the questions in the respective error categories for incorrect final answers.

Even when the correct functions and parameters are called, the model occasionally fails to provide the right final answer to users. According to Figure 4, this can be attributed to various limitations, including the failure to retrieve all relevant records that meet the specified criteria or the overlooking of certain criteria altogether, resulting in incomplete or inconsistent responses. Another common limitation is the inability of the model to perform arithmetic operations such as summation and averaging over long lists of values. Additionally, issues such as dealing with nuanced context and the lack of common-sense reasoning contribute to query misinterpretation and inaccurate results.

5. Code and Data

We offer the Mannheim Function Calling Benchmark for public download and make the code for running the experiments available on GitHub. The table below lists for download all question and function sets as well as all test configuration files needed to run the experiments testing the different combinations of question and function sets. We also offer the logs of the previous model runs for download. Lastly, the last row of the table contains the datasets used for the creation of the benchmark.

| Content | File | Size |
|---|---|---|
| Question and function sets | Question and function sets.zip | 77 KB |
| Test configuration files | Configuration Files.zip | 7 KB |
| Model output logs | Model Output Logs.zip | 123 KB |
| Datasets | Datasets.zip | 586 KB |

6. Existing Benchmarks

In this section, we present and compare related work on benchmarks, dataset creation, and models that use different tools to enhance LLMs with additional information and capabilities, following the same line of thought as function-calling-enabled LLMs.

APIBench [5] introduces a new benchmark to help Large Language Models (LLMs) improve their accuracy and flexibility when working with various tools through APIs and API documentation. By combining self-instruct fine-tuning and retrieval methods, LLMs are trained on a large dataset of APIs gathered from major model hubs like TorchHub, TensorHub, and HuggingFace. This dataset covers a wide range of domains, including multimodal data, computer vision, natural language processing, audio, tabular data, and reinforcement learning. Each API call is detailed in JSON objects, including information like domain, framework, functionality, and example code. Synthetic user prompts, generated using the self-instruct approach and GPT-4, accompany each dataset entry to task the model with creating real-world use cases involving the APIs [6]. Evaluation involves matching AST sub-trees to determine which API the LLM selects, with a focus on compatibility with the reference API. Experiments comparing Gorilla's performance against other models in a zero-shot setting assess different retrieval methods and Gorilla's adaptability to changes in API documentation at test time. Gorilla's retriever-aware training proves highly adaptable to such changes, maintaining accuracy and relevance over time, while also avoiding hallucination and meeting specified constraints. However, it's worth noting that ML APIs might produce biased predictions if trained on biased data.

API-Bank [8] aims to address three key questions regarding the effectiveness of LLMs in utilizing tools, methods to enhance their tool utilization ability, and the obstacles they face in effectively leveraging tools. To evaluate LLMs' tool utilization effectiveness, the API-Bank evaluation system is implemented, incorporating 73 commonly used APIs and 314 tool-use dialogues with 753 manually annotated API calls. To enhance LLMs' tool utilization ability, a comprehensive tool-augmented LLM training dataset is developed using a novel method called Multi-agent, comprising five collaborative agents. This dataset covers three different API usage abilities and emphasizes domain diversity, API authenticity, API diversity, and evaluation authenticity. The study also conducts experimental analyses to understand the main challenges faced by LLMs, like GPT-4 and their own model Lynx when utilizing APIs. Annotated dialogues in the evaluation data cover Call, Retrieval+Call, and Plan+Retrieval+Call abilities, with Lynx being fine-tuned using the APIBank training dataset and benchmarked against various LLMs. Model performance is evaluated based on API call correctness and the quality of LLM-generated responses, with six primary error types classified and assessed. Limitations include the implementation being in English only, the use of a small model for fine-tuning, and the potential for future work in other languages and with larger scale models [8].

ToolQA [1] is a dataset designed to assess the Language Model's (LLM) ability to utilize external tools and generate knowledge for improved question answering. It minimizes overlap with pre-training data and includes 8 domains and 13 types of tools to retrieve information. The process involves three phases: reference data collection, human-guided question generation with LLMs, and programmatic answer generation. Different LLM models, including standard and tool-augmented versions, are used for easy and hard questions. ToolQA focuses on the final correct answer rather than intermediate tool use processes. The dataset employs reference corpora defined by contextual dimensions and answer templates generated by ChatGPT. Answers are sampled from retrieved data, and accurate answers are created using operators and tool chains for multi-step reasoning. Various tools are utilized for text retrieval, database operations, code interpretation, mathematical computations, graph data, and parsing feedback. The analysis identifies incorrect tool calls and data sources, categorized into three main error types.

ToolBench [2] is a benchmark devised to assess how open-source Language Models (LLMs) can be improved with tool manipulation capabilities akin to closed LLM APIs, with practical human oversight to avoid exposing enterprise-internal workflows. It integrates a variety of software tools for real-world tasks, incorporating both existing and newly acquired datasets, and employs test cases for quantitative evaluation, setting it apart from other benchmarks. Task complexity is gauged based on API intricacy and the need for advanced reasoning, with success rate serving as the primary evaluation metric. To enhance open-source LLMs, three techniques are employed: Model Alignment, Incontext demonstration retriever, and System Prompt [3, 11]. These methods aim to align LLMs with API usage examples, enhance argument population, and control the natural language style of responses, respectively. Evaluation involves identifying challenges such as incorrect API selection, difficulty in argument population, and non-executable code generation, which are addressed by tuning with API usage examples. Advanced reasoning remains a challenge for open-source models.

To provide a more general and concentrated picture of the comparisons done above, we summarize the comparisons in Table 9.

Table 9: Comparison of Benchmarks

| Comparison Points | ToolQA | ToolBench | APIBench | API-Bank | Our Benchmark |
|---|---|---|---|---|---|
| The goal | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources |
| Type of assessment | Question answering | API calls | API calls | API calls | Question answering |
| Answer evaluation | Final correct answer | Success rate (in code generation) | Accuracy metric (in code generation) | Accuracy metric (API call) & ROUGE-L metric (responses after API call) | Final correct answer |
| Human intervention | Human templates | Human templates | Human templates & Self-Instruct method | Human templates & Multi-agent method | Human templates |
| Answer retrieval | Operations/functions for multi-step reasoning | Functions/techniques for multi-step reasoning | Functions/techniques & retrievers | Functions/techniques for multi-step reasoning | Operations/functions for multi-step reasoning |
| Task challenge | No distinction in questions or API call complexity (advanced reasoning) | API complexity to measure the difficulty of choice of API calls (no advanced reasoning) | API usage abilities in dialogues, single and multiple calls | API calls with various constraints | No distinction in questions or API call complexity (advanced reasoning) |
| Type of enhancement | Specific to different domains | Specific to different domains | Specific to different domains and major ML hubs | Specific to different domains and principles | No direct specification, use case independent |
| LLM knowledge | API calls only, no internal LLM knowledge | API calls only, no internal LLM knowledge | API calls only, no internal LLM knowledge | API calls and internal LLM knowledge | API calls and internal LLM knowledge |
| Type of data used | Several types of data: tabular, text corpora, and graphs | New and existing datasets | Tool-augmented dataset with Self-Instruct method | Tool-augmented dataset with Multi-agent method | Tabular data only, existing datasets |
| Components passed to models | Questions, answers, reference corpora, and available tools | Instruction in natural language as a goal, API documentation | Instruction in natural language as a goal, API documentation | Instruction in natural language as a goal, API documentation | Question set file, functions, and ground truths |
| Intermediary evaluation | No (only final answer) | Yes (intermediate steps included: API calls, parameters, etc.) | Final API call using AST sub-tree matching | Yes (intermediate steps included: API calls, parameters, etc.) | Yes (intermediate steps included: API calls, parameters, etc.) |

7. References

[1] Zhuang, Y., Yu, Y., Wang, K., et al. (2024). Toolqa: A dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems, 36.
[2] Xu, Q., Hong, F., Li, B., et al. (2023). On the tool manipulation capability of open-source large language models. arXiv:2305.16504
[3] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp. 27730–27744.
[4] Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073
[5] Patil, S.G., Zhang, T., Wang, X., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. ArXiv, abs/2305.15334.
[6] Wang, Y., Kordi, Y., Mishra, S., et al. (2022). Self-instruct: Aligning language model with self-generated instructions. arXiv:2212.10560.
[7] Taori, R., Gulrajani, I., Zhang, T., et al. (2023). Stanford Alpaca: An instruction-following llama model.
[8] Li, M., Song, F., Bowen, Y., et al. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Conference on Empirical Methods in Natural Language Processing.
[9] Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
[10] Du, Z., Qian, Y., Liu, X., et al. (2022). GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335.
[11] Glaese, A., McAleese, N., Trębacz, M., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv:2209.14375.
[12] Kim, S., Moon, S., Tabrizi, R., et al. (2024). An LLM compiler for parallel function calling. arXiv:2312.04511 [cs.CL]
[13] Srinivasan, V. K., Dong, Z., Zhu, B., et al. (2023). NexusRaven: A commercially-permissive language model for function calling. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
[14] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. ArXiv, abs/2001.08361.
[15] Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. ArXiv, abs/2203.15556.
[16] Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. ArXiv, abs/2206.07682.
[17] Chang, Y., Wang, X., Wang, J., et al. (2023). A Survey on Evaluation of Large Language Models. ArXiv, abs/2307.03109.
[18] Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. ArXiv, abs/2302.04761.
[19] Mialon, G., Dessì, R., Lomeli, M., et al. (2023). Augmented Language Models: a Survey. ArXiv, abs/2302.07842.
[20] Langchain. (2023). Parallel Function Calling for Structured Data Extraction.
[21] OpenAI. (2023). Function calling.