Deidamea Bajri
Dennis Heinz
Saman Khursheed
Serxhina Kutrolli
Stiliana Jano
Zeynep Eroglu
Keti Korini (Supervisor)
Christian Bizer (Supervisor)

Large Language Models (LLMs) have revolutionized natural language processing (NLP) tasks like text generation, translation, and question answering, yet limitations remain in areas such as reasoning and factual accuracy. Function calling, the ability of LLMs to interact with external APIs, offers a promising way to access real-time information and third-party services. However, how to evaluate the effectiveness of LLM function calls remains an open question.

This project introduces a novel benchmark designed to comprehensively assess the function-calling capabilities of LLMs across three core areas: function selection (identifying the appropriate functions based on user intent), parameter passing (constructing the necessary parameters), and answer generation (understanding the query and returning the desired answer). The benchmark includes 310 user questions focused on two use cases: New York City exploration (Airbnb listings, restaurants) and music exploration (albums, songs, artists). The questions are designed to evaluate the performance of LLMs on various functionalities: reasoning, information aggregation, data type handling (numbers, dates, locations, text), and user interaction (typos, extra context, criteria, entities, instructions). Evaluating GPT-4 on this benchmark revealed strengths in handling different languages and reasoning tasks, but challenges with diverse data formats, particularly dates.


1. Introduction

Large language models (LLMs) represent a significant advancement beyond early language models: they not only model and generate text proficiently but also address broader and more intricate general-purpose tasks. These models exhibit emergent capabilities that surpass those of smaller, pre-trained language models and have demonstrated efficacy in tackling multifaceted challenges across various domains, showcasing a depth of understanding and adaptability previously unseen in their predecessors. LLMs achieve this leap through substantial increases in computational power, the number of model parameters, and the size of the training dataset [14, 15]. They exhibit emergent abilities in various domains, including prompt-based task completion, instruction following without prior exemplars, and program execution, as demonstrated by recent advances in few-shot prompting, chain-of-thought prompting, and scratchpad-based execution in multi-step computational tasks. Additionally, ongoing research explores model calibration methodologies to assess the confidence and accuracy of language model predictions [16].

Despite their remarkable capabilities, Large Language Models (LLMs) are susceptible to various limitations and challenges. These include biases and inaccuracies in generated outputs, difficulties in comprehending complex logic and reasoning tasks, constraints in handling extensive datasets and long-term memory, and limitations in incorporating real-time or dynamic information [17]. Additionally, LLMs may exhibit sensitivity to prompts, struggle with text summarization tasks, and have difficulty mastering domain-specific knowledge or generating structured data. They also face challenges in generating truthful information, maintaining alignment with sources, and updating parametric knowledge in a timely manner. Furthermore, LLMs may exhibit inconsistencies between derived answers and reasoning processes, encounter difficulties in numerical computation, and tend to hallucinate facts. Addressing these challenges necessitates multifaceted approaches, including augmenting LLMs with external knowledge sources, fine-tuning LLMs with process-level feedback, using an ensemble of diverse reasoning paths, refining reasoning processes with self-reflection or external feedback, utilizing mathematical tools for numerical computation, tokenizing digits for improved arithmetic abilities, alignment tuning to maintain coherence with sources, and effective tool utilization to mitigate issues such as hallucinations in generated texts [18, 19].

Function calling addresses several of these challenges: it ensures a consistent response format and enables the use of external data sources, giving LLMs access to dynamic information within interactive user-agent chat contexts and thereby augmenting their adaptability and utility in practical applications [20, 21].

Figure 1: Illustration of the LLMs retrieving the user-created functions within the user prompts.

As illustrated in Figure 1, the function calling feature gives the model access to user-created functions that adhere to a specified structure and parameters. The LLM selectively invokes the pertinent function(s) for the user-generated prompt, fills in the designated arguments, and subsequently delivers a response based on the output generated by the invoked function(s). This mechanism makes LLMs more versatile and adaptable in responding to diverse user queries and prompts, leveraging the customized functionalities provided by users to enrich the model's capabilities [12].

Function calls in LLMs can be categorized based on the number of functions involved and their execution order. Single calls represent the simplest form, where the LLM executes a single, self-contained function within the prompt (e.g., "Provide the address of a specific Airbnb listing"). Multiple calls introduce more complexity, allowing for sequential or parallel execution. Sequential calls involve executing functions one after another, with the output of one feeding into the next (e.g., "Find the address of a specific Airbnb and then provide some restaurants nearby"). Parallel calls, where multiple functions execute concurrently, are a promising yet challenging area of research due to complexities in context management [12].
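To make this mechanism concrete, the sketch below shows a minimal function-calling loop built on the OpenAI Python SDK. The two tool definitions and the local dispatcher are hypothetical placeholders used only for illustration; they are not the benchmark's actual functions or released code.

```python
# Minimal sketch of a function-calling loop (assumes the OpenAI Python SDK >= 1.x).
# The tool schemas and execute_locally() below are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_airbnb_address",  # hypothetical function name
            "description": "Return the street address of an Airbnb listing.",
            "parameters": {
                "type": "object",
                "properties": {"listing_name": {"type": "string"}},
                "required": ["listing_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_restaurants_near_address",  # hypothetical function name
            "description": "Return restaurants located close to a street address.",
            "parameters": {
                "type": "object",
                "properties": {"address": {"type": "string"}},
                "required": ["address"],
            },
        },
    },
]

def execute_locally(name: str, args: dict) -> dict:
    # Stub: in a real setup this would query the underlying datasets.
    return {"function": name, "arguments": args, "result": "..."}

def answer(question: str) -> str:
    """Let the model call tools (single, sequential, or parallel) until it answers in text."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4-1106-preview", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # no further calls: final answer reached
            return msg.content
        messages.append(msg)            # keep the assistant's tool-call turn
        for call in msg.tool_calls:     # several entries here = parallel calls
            result = execute_locally(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
```

In this sketch, a sequential call corresponds to several iterations of the loop, whereas a parallel call appears as multiple tool_calls entries within a single model response.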

Nevertheless, ongoing discussions among researchers revolve around the performance of these models and their effectiveness in implementing such functionalities, particularly function calling, which facilitates connectivity with external API sources. When implementing function calling, the discussion also centers on identifying the main constraints of these models and investigating possible ways to overcome them. These efforts aim to enhance the performance of LLMs in various real-world tasks.

To address these inquiries, we introduce an innovative benchmark specifically designed to evaluate the function-calling capabilities of LLMs when interfacing with external APIs. This benchmark is crafted to assess the LLMs' ability to implement function calls for two distinct use cases. Beyond core function calling, our benchmark aims to capture the primary challenges encountered by the models by exposing them to a variety of user questions. These questions, grouped into distinct categories within the benchmark, will facilitate testing across different domains (New York City travel and music APIs) and enable us to assess in detail potential limitations when invoking various functions.

Ultimately, the limitations identified through running this benchmark can serve as valuable guidance for future research endeavors, enabling researchers to address and overcome these challenges in enhancing the capabilities of LLMs when interfacing with function-calling using external APIs and handling diverse user queries.

2. Datasets and Functions

In this section, we outline the datasets utilized in the creation of our benchmark for function calls using LLMs. A total of six repository datasets were selected to ensure comprehensive coverage and representativeness across both use cases, i.e., Music and Travel. For the Music use case, three datasets were chosen containing information about music albums, songs, and artists. For the Travel use case, three datasets were selected that include information about Airbnb listings, restaurants and cafes in New York City, and food ordering. During the first two phases (i.e., for single-parameter and multi-parameter function calls), we employed only two datasets. In the final phase, we introduced a third dataset to ensure connectedness between the datasets and to enrich the diversity of attributes available for querying in multiple function calling cases. Furthermore, for the Music use case, a sample of the first two datasets was used in this phase to ensure that it covers common entities and has overlapping attributes across all three datasets. Table 1 provides details about each dataset used for the respective use cases:

Table 1: Details of Datasets
(*: value in brackets indicates the sampled number of entities used in the multiple function calling phase)

| Use Case | Dataset Name | No. of Entities | No. of Attributes | Relevant Attributes for API Creation | Source |
|---|---|---|---|---|---|
| Music | Albums | 5119 (187)* | 8 | release name, artist name, release date, genres, descriptors, average rating, rating count, review count | Most Popular Albums: https://www.kaggle.com/datasets/tobennao/rym-top-5000/; Niche Albums (scraped data): https://rateyourmusic.com/ |
| Music | Songs | 1413 (123)* | 19 | track name, artist, album, release date, streams, bpm, danceability, energy, speechiness, instrumentalness, explicit, popularity, duration | Most Streamed Spotify Songs 2023: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023 |
| Music | Artists | 47 | 7 | name, formed/born, city, country, genre | Artists (scraped data): https://rateyourmusic.com/ |
| Travel | Airbnbs | 1000 | 16 | host name, neighbourhood group, precise geo-coordinates (latitude and longitude), room type, prices, reviews, last review date, average reviews per month | New York City Airbnb Open Data: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data |
| Travel | Restaurants | 1000 | 9 | restaurant name, cuisine, borough of location, street address, zip code, building, phone number, precise geo-coordinates (latitude and longitude) | New York City Restaurant Inspection Results: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data |
| Travel | Food Orders | 1898 | 7 | restaurant name, cuisine type, order cost, customer ratings, food preparation time, delivery time | NYC Restaurants Data - Food Ordering and Delivery: https://www.kaggle.com/datasets/ahsan81/food-ordering-and-delivery-app-dataset |

For the datasets specified in the table above, we created an API that includes several functions to enhance the LLM's ability to answer the user questions provided as prompts. For all three phases, functions with simple descriptions are created. Additionally, for phase 1 (single-parameter calls), functions with complex descriptions are also created to check whether the LLMs find it more challenging to answer the questions. By complex descriptions, we mean descriptions that are worded differently to add complexity, or that contain additional information that may confuse the LLMs when calling the correct function.

For each phase tested, we have created a certain number of functions. For the Travel use case, we created 14 functions for the single-parameter phase (each with both a simple and a complex description), 10 functions with 1 to 4 parameters for the multi-parameter phase, 11 functions with 1 to 4 parameters for the sequential multi-calls, and 7 functions with 1 to 2 parameters for the parallel multi-calls. For the Music use case, we created 10 single-parameter functions (again with both simple and complex descriptions), 16 functions with 2 to 5 parameters, and 11 functions with 1 to 2 parameters for the sequential and parallel multi-calls. For both use cases, the parameter data types used are strings, integers, floats, and booleans. We decided not to include a dedicated date data type for parameters, so as not to confuse the LLMs with different date formats; instead, the required date format is specified in the function descriptions. A few examples of functions created for both use cases are shown in Table 2.
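As an illustration, the sketch below shows how two of the functions from Table 2 could be expressed as JSON-schema tool specifications. The schema layout, the reworded complex description, and the concrete date format string are illustrative assumptions; the released function sets may differ in their exact wording.

```python
# Illustrative sketch of tool specifications for two Table 2 functions.
simple_tool = {
    "type": "function",
    "function": {
        "name": "get_host_name",
        "description": "Provide the host name for the Airbnb listing.",  # simple description
        "parameters": {
            "type": "object",
            "properties": {"listing_name": {"type": "string"}},
            "required": ["listing_name"],
        },
    },
}

complex_tool = {
    **simple_tool,
    "function": {
        **simple_tool["function"],
        # Complex variant: same function, reworded description (see Table 2).
        "description": "Provide the name of the Airbnb's owner.",
    },
}

date_tool = {
    "type": "function",
    "function": {
        "name": "filter_albums_by_date_range",
        # Dates stay plain strings; the expected format is stated only in the
        # description (the YYYY-MM-DD format here is an illustrative assumption).
        "description": "Filters and retrieves albums released within a specified "
                       "date range, inclusive of the start and end dates. "
                       "Dates must be given as YYYY-MM-DD strings.",
        "parameters": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string"},
                "end_date": {"type": "string"},
            },
            "required": ["start_date", "end_date"],
        },
    },
}
```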

We provide all function sets for download.

Table 2: Examples of Functions.

| Use Case | Function Name | Parameters | Data Types of Parameters | Description |
|---|---|---|---|---|
| Travel | get_host_name | listing_name | string | Provide the host name for the Airbnb listing. |
| Travel | get_host_name | listing_name | string | Provide the name of the Airbnb’s owner. |
| Travel | get_x_most_popular_places_in_neighbourhood_group_room_type | popularity, neighbourhood_group, room_type | integer, string, string | Get the x most popular Airbnbs in a neighborhood group and for a specific room type. |
| Travel | get_airbnb_by_price_min_nights_and_neighborhood_group | price, min_nights, neighbourhood_group | integer, integer, string | Get the listing by minimum price in US dollars, minimum nights and specific neighborhood group. |
| Travel | get_avg_delivery_time_by_restaurant_name | restaurant_name | string | Get average delivery time of a restaurant. |
| Music | top_rated_albums | n | integer | Returns the top-rated albums based on average rating. |
| Music | top_rated_albums | n | integer | Retrieves the records of most favoured music releases ranked by their scores. |
| Music | filter_albums_by_date_range | start_date, end_date | string, string | Filters and retrieves albums released within a specified date range. The range is inclusive of the start and end dates. |
| Music | songs_by_danceability_explicitness_speechiness | danceability_threshold, speechiness_threshold, explicit | integer, integer, boolean | Retrieves songs based on specified thresholds for danceability, speechiness, and explicit content criteria. |
| Music | artist_info | artist_name | string | Retrieves detailed information for a specified artist, including attributes like name, band status, formation/birth date, city, country, genres, and related artists. |

3. Question Set and Ground Truth

In order to test the ability of LLMs to call functions when needed, a total of 310 questions were created. For the Travel use case we tested 150 questions in total, and for the Music use case 160 questions in total, across all phases: single-parameter function calls, multi-parameter function calls, and multiple sequential and parallel function calls needed to give the final answer to the user. These questions can be used to evaluate a model's function calling ability, parameter filling, and, lastly, the final answer given. They are categorized into different test categories to help better analyze and interpret the results. The combined categories created across both use cases and all phases are as follows: (i) Selection, (ii) Languages, (iii) Data Types, (iv) Aggregation, (v) Extra Context, (vi) Reasoning, (vii) Simple Questions, (viii) Typos, (ix) Long-Tail Entities, and (x) Interpretation.

The number of questions tested for each category is listed below:

(i) Languages: We created this category to examine the LLMs' performance in handling questions in languages other than English. In this way, we can assess the LLM's multilingual capabilities and its capacity to return correct answers for widely spoken, less widely spoken, and non-Latin-script languages. The languages tested across all phases include German, Turkish, Urdu, Spanish, Greek, Albanian, and French for both use cases.

(ii) Selection: In the Selection of Records category, we evaluate GPT’s accuracy in selecting the requested records across several dimensions.

(iii) Aggregation: The Aggregation category includes questions that test the ability of LLMs to perform simple calculations (computing sums, differences, and averages), identify minimum and maximum values, and count records. It evaluates the LLM’s proficiency in aggregating information from multiple records to answer the user query. Typical cases being tested are: filtering by a lower or higher price, filtering by a price range, dividing a budget by the number of visitors, finding average prices for cuisines, summing costs, filtering by the highest number of reviews, averaging the streams of an artist’s songs, counting the number of albums released by an artist, etc.

(iv) Extra Context: The Extra Context category includes questions where the user provides either additional related context or unrelated context. The questions with related context may require semantic understanding of the query and are intended to assess the model's ability to understand and incorporate contextual cues. This category also includes questions with unrelated or misleading context that evaluate the model's capability to distinguish and disregard irrelevant information. The model should be able to select the correct parameter values and the right functions.

(v) Data Types: In the Data Types category, we created questions to test the ability of LLMs to understand several data types: currencies and their conversion to other currencies (e.g., from yuan to dollars), conversion within the same currency (e.g., cents to dollars), different date formats (e.g., from YYYY-MM-DD to DD-MM-YY), time measurement units (e.g., from minutes to hours), phone numbers with prefixes (e.g., +49 is Germany’s prefix), and the conversion of a textual representation of a number into its numerical value (e.g., one zero zero one nine to 10019).

(vi) Reasoning: This category was created to evaluate the ability of LLMs to perform logical reasoning and generate inferences. The questions in this category require the model to make recommendations based on the conditions stated in the query, or to reason about the meaning of attributes on their own or in relation to other attributes. This helps analyze whether the model is able to draw informed conclusions and identify correlations among different attributes in the dataset. It also tests the common-sense reasoning capability of the model. Common-sense reasoning cases relate to: the concept of time (e.g., "tonight" refers to one night of stay) and the selection of correct records according to a required minimum number, budgeting (e.g., when several people share a fixed budget, what the cost per person could be), wording (e.g., if the user is at a specific location in the US and asks for traditional food, this could mean American food), and popular locations and geographical proximity (e.g., if the user is at the Statue of Liberty, which borough they are located in).

(vii) Simple Questions: For this category, the user queries are more instructive and informative, making it easier for the model to call the correct functions with the correct parameters and parameter values to produce the final answer.

(viii) Typos: For this category, we created questions containing different types of intentionally misspelled input: typos in Airbnb or restaurant names, general spelling mistakes, typos in date formats, in currencies, in numbers or strings indicating prices, in artist names, linguistic typos, typos in genres, etc. The model should recognize these typos and correct them appropriately both when selecting the functions to call and when filling the parameters.

(ix) Long-Tail Entities: The "Long Tail Entities" category includes questions that inquire about some niche entities, like lesser-known artists, or non-popular music genres. This category focuses on the evaluation of the model's ability to provide accurate responses even for less common entities. Also included in this category are questions about confusing entities, like for example Airbnb listings with the same names but different attributes such as location and location details. In this case, we test whether the model distinguishes between these entities based on their other attributes.

(x) Interpretation: In the Interpretation category, we construct questions to test the capability of the model to interpret and comprehend the attributes in case of parameter filling as well as extracting meaningful insights from the resulting records. We focus on three dimensions: textual interpretation of attributes, temporal interpretation, and date interpretation (various date formats were tested).

After formulating the questions and categorizing them, we constructed the ground truth based on the data sources we have used. To test the accuracy of the LLMs, we have defined ground truths for three dimensions: Correct functions, Correct Parameter Values, and Overall Correct Answer. Table 3 provides a short overview of questions for each category with the respective ground truths including the expected functions to be called, parameter values to be filled, and correct answers. In some cases, we can have several correct function calls, as also indicated in the table, that generate the same correct answer. For the questions in the table that require multiple function calls, we have denoted them with “&” to show that all the functions are needed to receive a response.
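For illustration, a single ground-truth entry could be represented roughly as follows. Apart from ordered_items, which is described in Section 4.1, the field names below are our own illustrative choices and do not necessarily match the released file format.

```python
# Purely illustrative sketch of a ground-truth entry, using the Selection
# example from Table 3. Field names (except ordered_items) are assumptions.
ground_truth_entry = {
    "question": "Among the 15 songs with the longest durations, "
                "what is the most popular song?",
    "category": "Selection",
    # Several correct call paths may exist; paths that need more than one
    # function are written with "&" in Table 3 and modeled here as lists.
    "correct_function_paths": [["songs_by_longest_duration"]],
    "correct_parameters": {"songs_by_longest_duration": {"n": 15}},
    "answer": {"song": "2085"},
    "ordered_items": False,  # only relevant when the answer is a list
}
```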

We provide all question sets including their ground truth for download.

Table 3: Examples of questions for each category.
Each example lists the question, the expected function calls, the expected parameter values, and the correct answer.

Languages (Travel)
- Question: "Po verdallisem neper new york dhe nuk po gjej ndonje vend per te ndenjur sonte. Nuk kam shume para ne xhep keshtu qe me duhet nje vend i lire per fjetur? Me jep dot nja 5 sugjerime qe kushtojne rreth 30 dollare per nate?" (Translation: I am wandering around New York and cannot find a place to stay tonight. I do not have much money with me, so I need a cheap place to stay. Can you give me 5 suggestions that cost around 30 dollars per night?)
- Functions: get_listing_by_lower_price
- Parameters: price: 30
- Answer: "Huge room in great area 25 minutes from Manhattan", "Room in Bushwick Bk available June", "Sunny spacious room full of good energy", etc.

Languages (Music)
- Question: "Ich mag das Album von Coldplay, das 2011 veröffentlicht wurde, sehr. Finden Sie ihre Alben und können Sie mir dann die Künstler empfehlen, die dem gleichen Genre wie das genannte Album angehören." (Translation: I really like the album by Coldplay which was released in 2011. Find their albums and then recommend me artists of the same genre as that of the mentioned album.)
- Functions: albums_by_artist & artists_by_genres
- Parameters: artist_name: Coldplay & genres: ["Pop Rock", "Dream Pop"]
- Answer: "The Beatles", "The Rolling Stones", "Bob Dylan", "Taylor Swift", "Lady Gaga", "Miley Cyrus", "Shakira", "Olivia Rodrigo", "Elton John"

Selection (Music)
- Question: Among the 15 songs with the longest durations, what is the most popular song?
- Functions: songs_by_longest_duration
- Parameters: n: 15
- Answer: "2085"

Data Types (Travel)
- Question: I am currently in New York, I was wandering around Manhattan and I have only 10 dollars and 10 euros so please help me find a place for tonight.
- Functions: get_airbnb_by_price_range_neighbourhood or get_airbnb_by_price_and_neighborhood_group
- Parameters: min_price: 0, max_price: 120, neighbourhood_group: Manhattan or price: 20, neighbourhood_group: Manhattan
- Answer: "Calm bed great area", "Room with a view"

Aggregation (Travel)
- Question: I plan to budget 500 dollars for my upcoming New York trip, with 80 dollars designated for accommodation, 220 dollars for clothing, and 200 dollars for dining at various restaurants. Could you offer me diverse lodging recommendations that fit within this financial plan? Please return at most 10 entries.
- Functions: get_listing_by_lower_price
- Parameters: price: 80
- Answer: "Amazing huge furnished room!", "Sunny and spacious 1-bedroom in Brooklyn", "Huge room in great area 25 minutes from Manhattan", etc.

Extra Context (Travel)
- Question: I know that I have a friend from high school, she was from California and she now lives in Brooklyn, New York. Recently, I heard she got some new apartments in that area, but I couldn’t find them. Could you please help me find some apartments in that area?
- Functions: get_listing_by_neighbourhood_group
- Parameters: neighbourhood_group: Brooklyn
- Answer: "Sunny and spacious 1-bedroom in Brooklyn", "Large Private Room in Greenpoint Williamsburg", "Spacious Brooklyn Brownstone 3BR+", etc.

Extra Context (Music)
- Question: The Michelin star, which the most respected restaurants and chefs can have, has been used to measure the most successful food services in Europe and the world for more than 100 years. While having dinner with my friends in a restaurant in Paris, we started discussing our favourite artists. My best friend likes Bollywood songs and her favourite singer is Arjit Singh, while my other friends are fans of BTS. I personally like Taylor Swift and Selena. Can you recommend 3 songs by my best friend's favourite singer which we can enjoy while tasting an epic menu at a Michelin star restaurant?
- Functions: songs_by_artist
- Parameters: artist_name: Arijit Singh
- Answer: "Phir Aur Kya Chahiye", "Apna Bana Le", "Jhoome Jo Pathaan", "Kesariya"

Reasoning (Travel)
- Question: I have 300 dollars with me but I was planning to spend one third on accommodation in New York, do you have some recommendations for me? Please return at most 10 entries.
- Functions: get_listing_by_price or get_listing_by_lower_price
- Parameters: price: 100
- Answer: "Amazing huge furnished room!", "Sunny and spacious 1-bedroom in Brooklyn", "Large Private Room in Greenpoint Williamsburg", etc.

Reasoning (Music)
- Question: I am hosting a birthday party for my child. I prefer the danceability score to be higher than 90 to be suitable for a birthday party. Can you please recommend some latest songs which are suitable for children?
- Functions: songs_by_danceability_explicitness
- Parameters: danceability_threshold: 90, explicit: false
- Answer: "YOU THE VIBE", "TRAKA", "DOGGY DOGGY", "MOLI", "PALETA PA TO EL MUNDO", "RICO FEO", "TEKIRIKI"

Simple Questions (Travel)
- Question: As I am staying for seven weekdays in New York, 5 days I will be staying at a relative's house and afterwards I would like to find a very nice Airbnb in Queens to spend the rest of the days with my wife. Each of us have a budget of 220 bucks per night so we could stay together in a nice, clean Airbnb. Considering my requirements and after finding the certain Airbnbs, please find the corresponding proximity data such as latitude and longitude and use them to provide me with its specific address if possible?
- Functions: get_airbnb_by_price_min_nights_and_neighborhood_group & get_long_lat_by_airbnb & get_airbnb_address_by_lat_long
- Parameters: price: 440, min_nights: 2, neighbourhood_group: Queens & listing: rooms for rent in Queens with piano, City Skyline Views from every room! Nice Private Room Beauty in Queens & latitude: 40.70163, longitude: -73.90867
- Answer: "TURNBULL AVENUE"

Typos (Travel)
- Question: I want to find a shered place to stay, can you give me some suggestions?
- Functions: get_listing_by_room_type
- Parameters: room_type: Shared room
- Answer: "Bedroom 7 bed A.", "Williamsburg Loft!! Bedford L 1blk!", "Calm bed great area.", etc.

Typos (Music)
- Question: I am a fan of Dua Lipa. Can you please provide me with a list of her songs.
- Functions: songs_by_artist
- Parameters: artist_name: Dua Lipa
- Answer: "Dance The Night", "Cold Heart", "One Kiss", "Don't Start Now", "No Lie", "Sweetest Pie", "Levitating", "Potion"

Long-Tail Entities (Travel)
- Question: I am close to 'DUNKIN', BASKIN ROBBINS'. Is it sure that I will find coffee or tea there?
- Functions: get_cuisine
- Parameters: restaurant: DUNKIN', BASKIN ROBBINS
- Answer: "Donuts", "CoffeeTea"

Long-Tail Entities (Music)
- Question: Provide all albums that are footwork but not wonky, alternative R&B, conscious hip hop, instrumental hip hop, deconstructed club.
- Functions: albums_by_genres2
- Parameters: genres_in: ["wonky"], genres_out: ["footwork", "alternative R&B", "conscious hip hop", "instrumental hip hop", "deconstructed club"]
- Answer: "No Love Deep Web", "WLFGRL", "Double Cup", "Galaxy Garden", "Room(s)", "So It Goes"

Interpretation (Music)
- Question: Among the 10 longest songs, how many can I listen to fully in an hour of my time?
- Functions: songs_by_longest_duration
- Parameters: n: 10
- Answer: 9

4. Benchmark Evaluation

In this section, we introduce our methodology for evaluating the LLM responses and the experimental results on the different question categories of the benchmark.

4.1 Evaluation of LLM Response

To assess the LLM responses, we evaluate the response along three dimensions: functions called, parameters filled, and final answer given.

Evaluation of final answer. To evaluate the final answer given, the LLM's response, which is returned as JSON, is compared to the expected JSON answer in the ground truth. The comparison is done by exactly matching their key/value pairs. The expected JSON values can include strings, numbers, booleans, and lists. For lists, the evaluation considers the order of items if the boolean field ordered_items is set in the ground truth; if the order is unspecified, the lists are sorted before comparison. The response is evaluated as "Correct" if all key/value pairs match and as "Incorrect" otherwise.
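A minimal sketch of this exact-match check, under the assumption that both answers are available as parsed JSON objects (the helper names are ours, not the released evaluation code):

```python
import json

def normalize(value, ordered_items: bool):
    """Sort list values when the ground truth does not fix their order."""
    if isinstance(value, list) and not ordered_items:
        return sorted(value, key=lambda item: json.dumps(item, sort_keys=True))
    return value

def answer_is_correct(llm_answer: dict, expected: dict, ordered_items: bool = False) -> bool:
    """Exact match of all key/value pairs between the LLM answer and the ground truth."""
    if llm_answer.keys() != expected.keys():
        return False
    return all(
        normalize(llm_answer[key], ordered_items) == normalize(expected[key], ordered_items)
        for key in expected
    )
```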

Evaluation of functions called. To evaluate the performance of LLMs in calling the correct sequence of functions along with their parameters, we compare the path taken by the model to produce the final answer with the correct paths in the ground truth. If multiple function paths can answer the question correctly, we first identify which path in the ground truth has the most functions in common with the functions called by the LLM and compare against that path. We distinguish three cases in the evaluation of function calls:

The accuracy of function calling is calculated as: $$Acc_{function} = {\#correct\_functions\_in\_answers \over \# functions\_in\_ground\_truth}$$
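The following sketch illustrates one way this metric could be computed when several correct paths exist; the variable names and the path-selection heuristic are illustrative assumptions rather than the released evaluation code.

```python
def function_accuracy(called_functions: list[str], ground_truth_paths: list[list[str]]) -> float:
    """Accuracy of function calling against the best-matching ground-truth path."""
    # Pick the ground-truth path sharing the most functions with the LLM's calls.
    best_path = max(ground_truth_paths, key=lambda path: len(set(path) & set(called_functions)))
    correct = sum(1 for function in best_path if function in called_functions)
    return correct / len(best_path)
```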

Evaluation of parameters filled. Similarly to the evaluation of the functions called, to evaluate the accuracy of parameter filling we compare the passed parameters to the correct parameters in the ground truth. A string parameter is deemed correct if both key and value match. If the parameter is a number (integer or float), we round it to two decimals and check whether it equals the parameter value in the ground truth. As in function calling, we distinguish three evaluation cases:

The accuracy of parameter filling is calculated as: $$Acc_{parameter} = {\#correct\_parameters\_in\_answers \over \# parameters\_in\_ground\_truth}$$
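Analogously, the parameter check could be sketched as follows; the helper names are illustrative, and comparing booleans by exact equality is an assumption on our part.

```python
def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def parameter_matches(called_value, expected_value) -> bool:
    """Strings and booleans must match exactly; numbers are rounded to two decimals."""
    if _is_number(called_value) and _is_number(expected_value):
        return round(float(called_value), 2) == round(float(expected_value), 2)
    return called_value == expected_value

def parameter_accuracy(called_params: dict, expected_params: dict) -> float:
    """Share of ground-truth parameters that were filled with a matching value."""
    correct = sum(
        1 for key, expected in expected_params.items()
        if key in called_params and parameter_matches(called_params[key], expected)
    )
    return correct / len(expected_params)
```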

4.2 Experimental Results

In this section, we present the experimental results of our project, focusing on the accuracy of function calls, parameter filling, and final answers generated by the GPT-4 model (gpt-4-1106-preview). Detailed statistics are provided in the following tables, showcasing the model's performance in various function-calling scenarios. These include the accuracy of the model when calling functions with a single parameter (both with simple and complex function descriptions), calling functions with multiple parameters, and multiple function calling (sequential and parallel). The statistical analysis presented in the following tables offers insights into the model's effectiveness in interpreting and responding to queries covering the different categories.

Table 4: Results of Single Parameter Function Calling Phase (Simple Description). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 100 | 100 | 64 |
| Interpretation | 100 | 100 | 86 |
| Reasoning | 100 | 100 | 80 |
| Aggregation | 100 | 100 | 71 |
| Data Types | 100 | 77.78 | 77.78 |
| Languages | 100 | 87.5 | 75 |
| Typos | 100 | 66.67 | 55.56 |
| Long-Tail Entities | 33.33 | 33.33 | 33.33 |
| Extra Context | 83.33 | 66.67 | 44.44 |
| Overall | 95 | 85 | 66 |
Table 5: Results of Single Parameter Function Calling Phase (Complex Description). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 90.91 | 90.91 | 63.63 |
| Interpretation | 100 | 100 | 57.14 |
| Reasoning | 100 | 100 | 80 |
| Aggregation | 100 | 100 | 85.71 |
| Data Types | 100 | 100 | 100 |
| Languages | 93.75 | 87.5 | 87.5 |
| Typos | 100 | 66.67 | 55.56 |
| Long-Tail Entities | 33.33 | 33.33 | 33.33 |
| Extra Context | 88.89 | 72.22 | 50 |
| Overall | 93.81 | 86.6 | 71.13 |
Table 6: Results of Multiple Parameter Function Calling Phase. Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 89 | 90 | 53 |
| Interpretation | 80 | 89 | 40 |
| Reasoning | 82 | 83 | 45 |
| Aggregation | 95 | 95 | 81 |
| Data Types | 100 | 66.67 | 18.18 |
| Languages | 100 | 92.31 | 77.78 |
| Typos | 92 | 82.86 | 58.33 |
| Long-Tail Entities | 83.33 | 88.23 | 83.33 |
| Extra Context | 86.67 | 84.09 | 53.33 |
| Overall | 90 | 85.68 | 59 |

Table 4 depicts the accuracy of single-parameter function calling when simple function descriptions are employed. The model performed distinctly worse in the Long-Tail Entities category with respect to function calls than in the other categories, and was less accurate in determining parameters in several categories, including Typos, Long-Tail Entities, and Extra Context. Comparing the results in Tables 4 and 5, it can be seen that the model performed almost equally well for functions with simple and complex descriptions; the overall accuracy is nearly the same in both cases.

Table 6 shows the results of multi-parameter function invocation. We can observe that the model's performance in providing correct answers declines noticeably when invoking functions with multiple parameters compared to a single parameter. There is a marked drop in the accuracy of parameter filling, particularly within the Data Types category, in comparison to the other categories. In delivering the correct final answer, the model's overall performance notably lags behind its success in both parameter filling and function calling.

Table 7: Results of Multiple Function Calling (Parallel). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 75 | 75 | 50 |
| Reasoning | 100 | 95.45 | 77.77 |
| Aggregation | 100 | 100 | 50 |
| Languages | 100 | 100 | 57.14 |
| Extra Context | 100 | 82.35 | 50 |
| Simple Questions | 100 | 100 | 100 |
| Overall | 97.8 | 93.54 | 60.97 |
Table 8: Results of Multiple Function Calling (Sequential). Accuracy in %.

| Categories | Function | Parameter | Answer |
|---|---|---|---|
| Selection | 91.67 | 91.67 | 66.67 |
| Reasoning | 71.43 | 66.67 | 44.44 |
| Aggregation | 78.57 | 68.29 | 50 |
| Languages | 72.72 | 48.38 | 36.36 |
| Data Types | 75 | 75 | 50 |
| Extra Context | 90 | 87.09 | 53.85 |
| Simple Questions | 63.63 | 50 | 37.5 |
| Overall | 76.74 | 65.57 | 50.88 |

The findings presented in Tables 7 and 8 offer valuable insight into the effectiveness of parallel invocation of multiple functions compared to sequential invocation. Table 7 shows that within the Selection and Extra Context categories, the model encountered challenges in comprehending context and accurately transferring parameter values, particularly in comparison to the other categories. In the overall evaluation, the model struggles to determine the final answer, particularly for Selection, Aggregation, and Extra Context, when compared to its accuracy in function and parameter determination. From Table 8, we can infer that the model finds it difficult to correctly fill parameters in sequential function calling, leading to a decline in the accuracy of the final answers. The results of the Simple Questions category indicate that despite the additional instructions provided in the prompts, the model often remained unable to produce the correct final answer.

4.3 Error Analysis

After conducting a detailed analysis of the cases where the GPT-4 model failed to retrieve the correct function calls or parameters, or to provide accurate final answers, we identified recurring patterns and grouped them into distinct error categories for both use cases. These error categories are classified into three main groups: instances where the model fails to retrieve the correct functions, instances where the accurate parameters are not retrieved, and situations where the model provides incorrect final answers. The latter group encompasses scenarios where the sequence order of function calls may be incorrect, or the model fails to reason adequately even when functions and parameters are retrieved accurately.

In the following section, we will provide a detailed breakdown of each error category, focusing on the specific types of errors made by the algorithms.

4.3.1 Error Categories for Incorrect Function Calls

There are several instances where the GPT-4 model fails to call the appropriate functions or to provide the correct list of functions. By identifying the patterns and limitations in each instance, we have clustered them into four error categories, each representing similar types of limitations encountered while running the user prompts.

(i) Incomplete Function Invocation: This error category comprises 7 questions where the model fails to call the second or third function due to lack of reasoning or misunderstanding of the provided user request. This could occur if the model does not recognize the need for calling the second or the third function or incorrectly assumes that the task can be completed with just one function call. It may also arise when the user provides excessive numerical details in the input prompt, causing the model to overlook other aspects of the user's request. This also includes the instance where the model relies on existing knowledge and prefers not to call the artist_info function when information about famous artists is required.

(ii) Wrong Function Invocation: This error category comprises 11 questions where the model fails to call the correct function due to not understanding the provided user request. The model either does not correctly understand which parameters need to be used to call the function requested in the user query or, in some cases, relies on existing knowledge and prefers not to call the correct function.

(iii) Additional Function Calls: There are around 3 instances where the GPT-4 model calls multiple functions in addition to the correct one. The model cannot determine which function needs to be called because the query targets multiple pieces of information, so it calls all potentially relevant functions in order to retrieve the information requested in the user query.

(iv) Failure to Pass Entire List between Functions (List Passing Error): This error category encompasses 2 failed answers, which we consider significant due to their uniqueness and their demonstration of important algorithm limitations. Although these instances stand alone, they shed light on a broader issue observed during experimentation with other instances: the consistent failure of the model to pass the full list of values from the output of one function to the input of another. In these failed instances, the algorithms fail to transfer the entire list of elements outputted by the first function to the second function during the sequential multiple-function calling, resulting in incomplete data processing and potential inaccuracies.

Figure 2: Distribution of the questions in the error categories for incorrect function calling.

Based on the error categories assigned for incorrect function calls, we have gained insight into the main limitations contributing to most of the model failures in calling the right functions. According to Figure 2, the primary issue with the function calling functionality of the GPT model lies in sequential multiple function calling. In these instances, the model fails to call the correct sequence of functions or disregards the need to call the second or third function. This failure is typically attributed to a lack of reasoning and of understanding that a sequence of functions, rather than a single function, is required.

4.3.2 Error Categories for Incorrect Parameter Calls

There are approximately 60 instances where the GPT-4 model fails to retrieve the correct parameters for the called functions. By identifying patterns in the limitations that lead the model to fill in incorrect parameter values for the user prompts, we have grouped similar limitations into distinct error categories. As a result, we have identified eight error categories, each representing a cluster of questions where the model failed to retrieve the right parameters due to similar limitations.

(i) Parameter Hallucination due to Wording/Extra Context: This category encompasses 11 instances where the model, tested for function calling abilities, fails to retrieve the correct parameter fillings and instead completely hallucinates the parameters. Parameters generated by the model are entirely unrelated to the task at hand or the information provided in the user query, indicating a disconnect between the generated output and the intended context.

This phenomenon is believed to occur due to the model's inability to correctly understand the entirety of the user request, particularly when the request is verbose, includes excessive details, or provides extensive related context. In such cases, the algorithms become confused and fail to grasp the essence of the request, resulting in inaccurate parameter filling and consequently, a wrong or completely unrelated answer to the user.

(ii) Misidentification of Numeric Entities in User Prompt: This error category contains 2 instances. In these instances, the model fails to accurately identify parameter fillings, specifically numeric entities provided by the user in the prompt. The model misinterprets numeric sequences in the input text, such as phone numbers and postal codes, as non-numeric entities, such as listing names or other textual elements. This leads to errors in identifying relevant parameter values, resulting in incomplete information being processed.

(iii) Misinterpretation of Non-Latin Characters: This error category contains around 5 instances. The model misinterprets characters from non-Latin scripts, leading to errors in the comprehension and analysis of text inputs. The model encounters difficulties in accurately translating the user's request from one language to another, resulting in incorrect parameter values being passed to the function.

(iv) Incorrect String Handling: There are around 13 instances where the GPT4 model encounters difficulties in handling string values in the prompts. This error occurs in three cases: Firstly, when the user provides context that may confuse the algorithms, leading to incorrect extraction of string elements for parameter values based on the provided context. Secondly, when the model fails to handle string synonyms, resulting in extracted parameters deviating from the expected string representation format in the dataset used, thereby introducing inaccuracies when the function is called. Lastly, when dealing with user typos, the algorithms struggle to identify and correct them, resulting in incorrect parameter values being selected.

(v) Currency and Conversion Error: In this error category, there are 5 instances where the model fails to retrieve the correct parameter value due to limitations in currency conversions. This error occurs in two situations: firstly, when the algorithms are unable to perform accurate currency conversion from the currency format provided by the user to the currency type used in the dataset API (e.g., yuan to dollars); secondly, when the algorithms inaccurately convert currency values provided in different units by the user’s prompt (e.g., dollars and cents) to the currency unit of the same currency type used in the dataset API.

(vi) Limitations in Text-to-Number Conversion: There are a total of 3 cases where the model’s algorithm retrieves incorrect parameter fillings due to limitations in text-to-number conversion. The algorithms struggle to recognize and convert textual representations of time, such as "week", into their corresponding numeric values (e.g., 7 days). Additionally, they encounter errors or failures when attempting to perform division or other mathematical operations with numerical values represented as text within user prompts.

(vii) Date Inference Error: According to the data presented, the primary limitation of the GPT4 model arises when dealing with various date formats. There are approximately 18 instances where the model fails to retrieve the correct date value as the parameter. This error category encompasses questions where the model returns inaccurate answers due to difficulties in inferring dates from user inputs or contextual cues. These errors may occur when the algorithms encounter discrepancies in converting textual date representations provided in the prompts to the standardized date format required by the API dataset, resulting in errors in parameter value representation. Additionally, errors may occur when the algorithms fail to accurately recognize and interpret the date format when presented with dates in different formats in the user prompt, leading to errors in parameter value retrieval.

(viii) Incomplete Parameter Values Passing: In this category, there are 3 instances. In these cases, the LLM does not receive all the necessary parameters or arguments required to execute the set of functions correctly. This can happen due to a misunderstanding of the function requirements or a lack of context.

Figure 3: Distribution of the questions in the respective error categories for incorrect parameter calls.

As depicted in Figure 3, our analysis reveals that the primary cause of the highest number of failures in the GPT4 model is the presence of dates in various formats within user prompts. Additionally, we observe that the second most prominent limitation occurs when users include additional context (related and unrelated) or synonyms in their prompts, which the model struggles to interpret accurately. This often results in incorrect parameter fillings and subsequently, incorrect answers. These findings underscore the importance of addressing these specific limitations to enhance the model's performance in handling diverse user inputs effectively.

4.3.3 Error Categories for Incorrect Answers

There are approximately 64 instances where the GPT-4 model fails to provide correct answers. By analyzing the underlying patterns and types of errors observed in these instances, we categorized them into 8 distinct error categories. Each error category comprises questions where the model provided inaccurate responses due to similar limitations or recurring patterns.

(i) Arithmetic Anomaly (Summation/Average): This error category contains 13 instances where the GPT-4 model encounters difficulty in executing arithmetic computations accurately, leading to inaccurate responses in the case of summation and average operations. The model's arithmetic performance is especially affected when dealing with long lists of numerical data. For example, when dealing with a long list of customer ratings, song stream counts, or any other numerical attribute, the model struggles to maintain precision in its calculations.

(ii) Sorting Anomaly (Numerical Values): This error category comprises 9 questions where the GPT4 model returned inaccurate answers due to probable anomalies in sorting numerical values, particularly when requested to identify specific positions of any attribute such as the third highest or lowest value. Processing multiple numeric values simultaneously poses challenges for the algorithms, leading to inaccuracies or failures in determining relationships between the items.

(iii) Sorting Anomaly (Dates): This error category encompasses 6 instances where the GPT4 model provides inaccurate responses due to potential anomalies in sorting date values, particularly when queried to return results based on specific criteria such as the third most recent or least recent date. It can lead to discrepancies in the ordering of dates and consequently inaccurate responses in case of specific chronological criteria-based queries.

(iv) Counting Discrepancy: This error category consists of 1 instance. In this instance, the model fails to provide a correct response due to discrepancies or inaccuracies in counting elements or results, such as the number of songs in the records returned by the called function.

(v) Multi-Information Retrieval Error: This error category includes 12 instances where the GPT4 model encounters challenges or inaccuracies in retrieving multiple pieces of information or results. Instead of retrieving relevant information that satisfies the criteria specified by the user, the model returns responses with errors, such as selecting records that meet only one of the multiple specified criteria. Examples include instances where the model fails to retrieve all relevant records that meet specified criteria or where it overlooks certain criteria altogether, resulting in incomplete or inconsistent responses.

(vi) Query Misinterpretation: This error category comprises 7 questions where the GPT4 model returns inaccurate answers due to difficulty in interpreting user queries. It includes cases where the model fails to grasp the context of user queries, misinterprets specific keywords or phrases, or the intent behind the question by the user. This category also caters to instances where the model incorrectly parses the provided details or fails to recognize a parameter value due to the given extra context in the user’s prompt, leading to errors in identifying the relevant parameter values, resulting in incomplete information being processed. Also, there are cases where the model misinterprets numeric sequences in the input text, such as phone numbers and postal codes, as non-numeric entities, such as listing names or other textual elements.

(vii) Inexact Matching/Lack of Contextual Relevance & Common-sense Reasoning/Lack of Specificity: In this category, there are a total of 14 questions. The responses of the GPT4 model lack contextual relevance and specificity, leading to inaccuracies or inadequacies in addressing user queries. This category also encompasses instances where the provided information only partially aligns with the user query, resulting in incomplete responses or additional irrelevant information. Furthermore, the algorithms struggle to select the appropriate functions or provide the right final answer due to a lack of common-sense reasoning abilities. (e.g., the algorithm, lacking common-sense reasoning abilities, fails to understand that a baby typically doesn't require a separate accommodation booking or incur additional charges)

(viii) Reliance on Previous Knowledge: There are 2 instances where the GPT4 model returns inaccurate answers due to the use of its previous knowledge base rather than relying on getting information from the results of called functions. Algorithms encounter difficulties when dealing with well-known entities, leading to failures in identifying relevant functions and parameters for popular restaurant names, albums, or artists. They provide responses based on pre-existing knowledge, resulting in potentially inaccurate or incomplete answers.

Figure 4: Distribution of the questions in the respective error categories for incorrect final answers.

Even when the correct functions and parameters are called, the model occasionally fails to provide the right final answer to users. According to Figure 4, this can be attributed to various limitations, including the failure to retrieve all relevant records that meet the specified criteria or the overlooking of certain criteria altogether, resulting in incomplete or inconsistent responses. Another common limitation is the inability of the model to perform arithmetic operations such as summation and averaging over long lists of values. Additionally, issues such as dealing with nuanced context and the lack of common-sense reasoning contribute to query misinterpretation and inaccurate results.

5. Code and Data

We offer the Mannheim Function Calling Benchmark for public download and make the code for running the experiments available on GitHub. The table below lists for download all question and function sets as well as all test configuration files needed to run the experiments testing the different combinations of question and function sets. We also offer the logs of the previous model runs for download. Lastly, the last row of the table contains the datasets used for the creation of the benchmark.

| Content | File | Size |
|---|---|---|
| Question and function sets | Question and function sets.zip | 77 KB |
| Test configuration files | Configuration Files.zip | 7 KB |
| Model output logs | Model Output Logs.zip | 123 KB |
| Datasets | Datasets.zip | 586 KB |

6. Existing Benchmarks

In this section, we present and compare related work on benchmarks, dataset creation, and models that use different tools to enhance LLMs with additional information and capabilities, following the same line of thought as function-calling-enabled LLMs.

APIBench [5] introduces a new benchmark to help Large Language Models (LLMs) improve their accuracy and flexibility when working with various tools through APIs and API documentation. By combining self-instruct fine-tuning and retrieval methods, LLMs are trained on a large dataset of APIs gathered from major model hubs like TorchHub, TensorHub, and HuggingFace. This dataset covers a wide range of domains, including multimodal data, computer vision, natural language processing, audio, tabular data, and reinforcement learning. Each API call is detailed in JSON objects, including information like domain, framework, functionality, and example code. Synthetic user prompts, generated using the self-instruct approach and GPT-4, accompany each dataset entry to task the model with creating real-world use cases involving the APIs [6]. Evaluation involves matching AST sub-trees to determine which API the LLM selects, with a focus on compatibility with the reference API. Experiments comparing Gorilla's performance against other models in a zero-shot setting assess different retrieval methods and Gorilla's adaptability to changes in API documentation at test time. Gorilla's retriever-aware training proves highly adaptable to such changes, maintaining accuracy and relevance over time, while also avoiding hallucination and meeting specified constraints. However, it's worth noting that ML APIs might produce biased predictions if trained on biased data.

API-Bank [8] aims to address three key questions regarding the effectiveness of LLMs in utilizing tools, methods to enhance their tool utilization ability, and the obstacles they face in effectively leveraging tools. To evaluate LLMs' tool utilization effectiveness, the API-Bank evaluation system is implemented, incorporating 73 commonly used APIs and 314 tool-use dialogues with 753 manually annotated API calls. To enhance LLMs' tool utilization ability, a comprehensive tool-augmented LLM training dataset is developed using a novel method called Multi-agent, comprising five collaborative agents. This dataset covers three different API usage abilities and emphasizes domain diversity, API authenticity, API diversity, and evaluation authenticity. The study also conducts experimental analyses to understand the main challenges faced by LLMs, like GPT-4 and their own model Lynx when utilizing APIs. Annotated dialogues in the evaluation data cover Call, Retrieval+Call, and Plan+Retrieval+Call abilities, with Lynx being fine-tuned using the APIBank training dataset and benchmarked against various LLMs. Model performance is evaluated based on API call correctness and the quality of LLM-generated responses, with six primary error types classified and assessed. Limitations include the implementation being in English only, the use of a small model for fine-tuning, and the potential for future work in other languages and with larger scale models [8].

ToolQA [1] is a dataset designed to assess the Language Model's (LLM) ability to utilize external tools and generate knowledge for improved question answering. It minimizes overlap with pre-training data and includes 8 domains and 13 types of tools to retrieve information. The process involves three phases: reference data collection, human-guided question generation with LLMs, and programmatic answer generation. Different LLM models, including standard and tool-augmented versions, are used for easy and hard questions. ToolQA focuses on the final correct answer rather than intermediate tool use processes. The dataset employs reference corpora defined by contextual dimensions and answer templates generated by ChatGPT. Answers are sampled from retrieved data, and accurate answers are created using operators and tool chains for multi-step reasoning. Various tools are utilized for text retrieval, database operations, code interpretation, mathematical computations, graph data, and parsing feedback. The analysis identifies incorrect tool calls and data sources, categorized into three main error types.

ToolBench [2] is a benchmark devised to assess how open-source Language Models (LLMs) can be improved with tool manipulation capabilities akin to closed LLM APIs, with practical human oversight to avoid exposing enterprise-internal workflows. It integrates a variety of software tools for real-world tasks, incorporating both existing and newly acquired datasets, and employs test cases for quantitative evaluation, setting it apart from other benchmarks. Task complexity is gauged based on API intricacy and the need for advanced reasoning, with success rate serving as the primary evaluation metric. To enhance open-source LLMs, three techniques are employed: Model Alignment, Incontext demonstration retriever, and System Prompt [3, 11]. These methods aim to align LLMs with API usage examples, enhance argument population, and control the natural language style of responses, respectively. Evaluation involves identifying challenges such as incorrect API selection, difficulty in argument population, and non-executable code generation, which are addressed by tuning with API usage examples. Advanced reasoning remains a challenge for open-source models.

To provide a more general and concentrated picture of the comparisons done above, we summarize the comparisons in Table 9.

Table 9: Comparison of Benchmarks

| Comparison Points | ToolQA | ToolBench | APIBench | API-Bank | Our Benchmark |
|---|---|---|---|---|---|
| The goal | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources | LLMs enhanced using external sources |
| Type of assessment | Question answering | API calls | API calls | API calls | Question answering |
| Answer evaluation | Final correct answer | Success rate (in code generation) | Accuracy metric (in code generation) | Accuracy metric (API call) & ROUGE-L metric (responses after API call) | Final correct answer |
| Human intervention | Human templates | Human templates | Human templates & Self-Instruct method | Human templates & Multi-agent method | Human templates |
| Answer retrieval | Operations/functions for multi-step reasoning | Functions/techniques for multi-step reasoning | Functions/techniques & retrievers | Functions/techniques for multi-step reasoning | Operations/functions for multi-step reasoning |
| Task challenge | No distinction in questions or API call complexity (advanced reasoning) | API complexity to measure the difficulty of choice of API calls (no advanced reasoning) | API usage abilities in dialogues, single and multiple calls | API calls with various constraints | No distinction in questions or API call complexity (advanced reasoning) |
| Type of enhancement | Specific to different domains | Specific to different domains | Specific to different domains and major ML hubs | Specific to different domains and principles | No direct specification, use case independent |
| LLM knowledge | API calls only, no internal LLM knowledge | API calls only, no internal LLM knowledge | API calls only, no internal LLM knowledge | API calls and internal LLM knowledge | API calls and internal LLM knowledge |
| Type of data used | Several types of data: tabular, text corpora, and graphs | New and existing datasets | Tool-augmented dataset with Self-Instruct method | Tool-augmented dataset with Multi-agent method | Tabular data only, existing datasets |
| Components passed to models | Questions, answers, reference corpora, and available tools | Instruction in natural language as a goal, API documentation | Instruction in natural language as a goal, API documentation | Instruction in natural language as a goal, API documentation | Question set file, functions, and ground truths |
| Intermediary evaluation | No (only final answer) | Yes (intermediate steps included: API calls, parameters, etc.) | Final API call using AST sub-tree matching | Yes (intermediate steps included: API calls, parameters, etc.) | Yes (intermediate steps included: API calls, parameters, etc.) |

7. References

[1] Zhuang, Y., Yu, Y., Wang, K., et al. (2024). Toolqa: A dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems, 36.
[2] Xu, Q., Hong, F., Li, B., et al. (2023). On the tool manipulation capability of open-source large language models. arXiv:2305.16504
[3] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp. 27730–27744.
[4] Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073
[5] Patil, S.G., Zhang, T., Wang, X., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. ArXiv, abs/2305.15334.
[6] Wang, Y., Kordi, Y., Mishra, S., et al. (2022). Self-instruct: Aligning language model with self-generated instructions. arXiv:2212.10560.
[7] Taori, R., Gulrajani, I., Zhang, T., et al. (2023). Stanford Alpaca: An instruction-following llama model.
[8] Li, M., Song, F., Bowen, Y., et al. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Conference on Empirical Methods in Natural Language Processing.
[9] Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
[10] Du, Z., Qian, Y., Liu, X., et al. (2022). GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335.
[11] Glaese, A., McAleese, N., Trębacz, M., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv:2209.14375.
[12] Kim, S., Moon, S., Tabrizi, R., et al. (2024). An LLM compiler for parallel function calling. arXiv:2312.04511 [cs.CL]
[13] Srinivasan, V. K., Dong, Z., Zhu, B., et al. (2023). NexusRaven: A commercially-permissive language model for function calling. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
[14] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. ArXiv, abs/2001.08361.
[15] Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. ArXiv, abs/2203.15556.
[16] Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. ArXiv, abs/2206.07682.
[17] Chang, Y., Wang, X., Wang, J., et al. (2023). A Survey on Evaluation of Large Language Models. ArXiv, abs/2307.03109.
[18] Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. ArXiv, abs/2302.04761.
[19] Mialon, G., Dessì, R., Lomeli, M., et al. (2023). Augmented Language Models: a Survey. ArXiv, abs/2302.07842.
[20] Langchain. (2023). Parallel Function Calling for Structured Data Extraction.
[21] OpenAI. (2023). Function calling.