Figure 1: Mental model of an AI feature
- Business users can access organisational data in a direct and timely way.
- This relieves data scientists and analysts from the burden of ad-hoc requests from business users and allows them to focus on advanced data challenges.
- This allows the business to leverage its data in a more fluid and strategic way, finally turning it into a solid basis for decision making.
- You are offering a scalable data/BI product and want to enable more users to access their data in a non-technical day, thus growing both the usage and the user base. As an example, ServiceNow has integrated data queries into a larger conversational offering, and Atlan has recently announced natural-language data exploration.
- You are looking to build something in the data/AI space to democratise data access in companies, in which case you could potentially consider an MVP with Text2SQL at the core. Providers like AI2SQL and Text2sql.ai are already making an entrance in this space.
- You are working on a custom BI system and want to maximise and democratise its use in the individual company.
Figure 2: In this Text2SQL representation, data-related elements and relations are marked in yellow.
- A SQL query can be mapped to many different questions in natural language; for example, the above query semantics can be expressed by: “show me the names and numbers of followers per user”, “how many followers are there for each user?” etc.
- SQL syntax is highly versatile, and almost every question can be represented in SQL in multiple ways. The simplest example are different orderings of WHERE clauses. On a more advanced stance, everyone who has done SQL query optimisation will know that many roads lead to the same result, and semantically equivalent queries might have completely different syntax.
- Columns and tables of the database
- Relations between tables (foreign keys)
- Database content
Figure 3: In this Text2SQL representation, algorithm-related elements and relations are marked in yellow.
- SQL syntax and semantics
- Database structure
- Natural Language Understanding (NLU)
- Mapping between natural language and SQL queries (syntactic, lexical and semantic)
Figure 4: The Text2SQL algorithm has to deal with many different variants of a question
- Exact-set match accuracy: the generated and target SQL queries are split into their constituents, and the resulting sets are compared for identity. The shortcoming here is that it only accounts for order variations in the SQL query, but not for more pronounced syntactic differences between semantically equivalent queries.
- Execution accuracy: the datasets resulting from the generated and target SQL queries are compared for identity. With good luck, queries with different semantics can still pass this test on a specific database instance. For example, assuming a database where all users are aged over 30, the following two queries would return identical results despite having different semantics:
- select * from user
- select * from user where age > 30
- Test-suite accuracy: test-suite accuracy is a more advanced and less permissive version of execution accuracy. For each query, a set (”test suite”) of databases is generated that are highly differentiated with respect to the variables, conditions and values in the query. Then, execution accuracy is tested on each of these databases. While requiring additional effort to engineer the test-suite generation, this metric also significantly reduces the risk of false positives in the evaluation.
Figure 5: In this Text2SQL representation, UX-related elements and relations are marked in yellow.
- SQL errors: the generated query is not valid — either the SQL is invalid, or it cannot be executed against the specific database due to lexical or semantic flaws. In this case, no result can be returned to the user.
- Semantic errors: the generated query is valid but it does not reflect the semantics of the question, thus leading to a wrong returned dataset.
4. Non-functional requirements
- No information
- Database schema
- Database content
- Text2SQL allows to implement intuitive and democratic data access in a business, thus maximising the value of the available data.
- Text2SQL data consist of questions at the input, and SQL queries at the output. The mapping between questions and SQL queries is many-to-many.
- It is important to provide information about the database as part of the prompt. Additionally, the database structure can be optimised to make it easier for the algorithm to learn and understand it.
- On the input, the main challenge is the linguistic variability of natural-language questions, which can be approached using LLMs that were pre-trained on a wide variety of different text styles
- The output of Text2SQL should be a valid SQL query. This constraint can be incorporated by “injecting” SQL knowledge into the algorithm; alternatively, using an iterative approach, the query can be checked and improved in multiple steps.
- Due to the potentially high impact of “silent failures” which return wrong data for decision-making, failure management is a primary concern in the user interface.
- In an “augmented” fashion, users can be actively involved in iterative validation and improvement of SQL queries. While this makes the application less fluid, it also reduces failure rates, allows users to explore data in a more flexible way and creates valuable signals for further learning.
- The major non-functional requirements to consider are accuracy, scalability, speed, explainability, privacy and adaptability over time. The main trade-offs consist between accuracy on the one hand, and scalability, speed and privacy on the other hand.