Detecting anti-patterns in SQL Queries using Text Classification Techniques

A major problem with using relational databases, is writing efficient SQL queries. Some common errors known as anti-patterns are frequent in SQL queries and can seriously impact the query execution time and sometimes, the database general performance. This paper shows how ma-chine learning techniques can be leveraged to detect anti-patterns in SQL queries by approaching the problem as a text classification problem. Our result is a model based on a convolutional neural network that can be used to classify a SQL query into zero, one or many anti-patterns classes. Keywords— SQL, relational database, text classification techniques.1


INTRODUCTION
With the increasing amount of information stored in relational databases, it is necessary to write SQL queries that execute faster.Anti-patterns in SQL are common mistakes that if avoided, can make a query executes faster.For example, when query-ing an indexed column, replacing the OR operator with the IN operator, will result in better perfor-mance, because the IN operator leverages the in-dex.Thus, using the OR operator in this case, is an anti-pattern.SELECT u r l FROM p i c t u r e s WHERE i d = 10 OR i d = 20 can be rewritten as SELECT u r l FROM p i c t u r e s WHERE i d IN ( 1 0 , 2 0 ) By detecting the anti-patterns in a query, we can rewrite it into a better version.In this paper, we approach the problem as a multi-class multi-label classification problem.Our solution is schema-independent, meaning that the decision made by the neural network doesn't depend on the database logical or physical structure.The dataset used has been built from SQL queries provided by Sky-Server from Sloan Digital Sky Survey (SDSS).
SkyServer, the portal from the SDSS catalog, provides data access tools for astronomers and sci-entific education.Through SkyServer, users can use the SQL language to query the Sloan Digi-tal Server database.Since 2001, the portal has seen more than 280 million SQL queries submitted by users and those queries have been opened to the public through the different data releases.We fetch 1 million queries from SkyServer, that we filter, process and transform.The final dataset of usage contains 363616 unique SELECT queries.
Following a supervised learning approach, the SQL queries from SkyServer are used as input data; we manually label the data by associating each SQL query with a list of anti-patterns it con-tains.
Our model is based on a convolutional neural network trained to classify a query into multiple categories.We use the one-hot encoding technique to encode the queries as word vectors.For encoding the anti-patterns classes we use a one dimensional tensor with each class represented as an in-teger.
We explore some of the important work in the field of SQL anti-patterns detection in section 2. In section 3, we explain in details the process fol-lowed to build the dataset.Then, we discus s our model architecture in section 4. In section 5, we analyze the results from our experiments.Finally in the conclusion, we compare our work to the existing solutions and explore the possible future work.

II.
RELATED WORK Common mistakes in SQL has been already in the interest of researchers before the appearance of the ISO SQL-92 standard.In 1985, Welty studied how human factors can affect users in using SQL and found that user performance could be significantly improved.Later, Brass et  SQLLint tool which was able to au-tomatically identify these errors in (syntactically correct) SQL statements.The tool seems to be unsupported today.There is another online tool named SQLLint, but it is a SQL beautifier.
There are also books in this area.The Art of SQL and Refactoring SQL Applications pro-vide guidelines to write efficient queries, while the book of Bill Karwin collects antipatterns that should be avoided.
In a paper, Ahadi et al.,presented a large-scale analysis of students semantic mistakes in writing SQL SELECT statements.They collected data from over 2,300 students across nine years and summarized typical mistakes of the students.They found that most of the mistakes were made in queries which require a JOIN, a subquery or a GROUP BY operator.We argue that queries typ-ically use more complex syntax in practice com-pared to student projects.Hence, the situation can be even worse.
In the realm of embedded SQL, Christensen et al. proposed a technique and a tool (JSA, Java String Analyzer) to extract string expressions from Java code statically.As a potential application of their approach, they check the syntax of dynami-cally generated SQL strings.They limit their ap-proach to the syntactic validation of the queries.
Wassermann et al. propose a static string analy -sis technique to identify possible errors in dynam-ically generated SQL code.With the implemen-tation of a CFLreachability algorithm they detect type errors (e.g., concatenating a character to an integer value).Their approach works with ex-tracted query strings of valid SQL syntax.In a tool demo paper, they present their prototype tool called JDBC Checker.
Recently, Anderson and Hills studied query construction patterns in PHP.They analyzed query strings embedded in PHP code with the help of the PHP AiR framework.
Quality assessment of embedded SQL was pro-posed by Brink et al. in 2007.They analyzed em-bedded query strings in PL/SQL, Cobol, and Vi-sual Basic programs while they propose a generic approach which could be applied to Java too.They investigate relationships which could be detected through embedded queries (e.g., access, dupli-cation, control dependencies) and they propose quantitative query measures for quality assess -ment.
Many static techniques which try to deal with embedded query strings do it with the purpose of SQL injection detection.Yeole and Meshram pub-lished a survey of these techniques.SQL injection detection is different as the goal is specifically to determine whether a query could be affected by user input.Some papers also tackle SQL fault localization techniques.A dynamic approach was proposed by Clark et al. to localize SQL faults in database applications.They provide command-SQL tuples to show the SQL statements executed at database-interaction points.
A recent work of Delplanque et al. targets the database to assess the quality of the schema and to detect design smells in it.They implement a tool called DBCritics which can analyze PostgreSQL schema dumps and identify design smells such as missing primary keys or foreign key references.
A tool which also has to be mentioned here is the Eclipse plugin called Alvor and JSA [17], this plug -in analyzes the string expressions in Java code.What is more, Alvor checks syntax correct-ness, semantics correctness, and object availability by comparing the extracted queries against its in-ternal SQL grammar and by checking SQL state-ments against an actual database.

III. DATASET 3.1 Collecting the queries
We start building our dataset, by fetching 1 mil-lion successful SQL queries from the SkyServer catalog.

SELECT TOP 1000000
s t a t e m e n t FROM S q l L o g WHERE e r r o r = 0 Some of these queries need to be filtered out, in order to build a more focused dataset.

Filtering
From the fetched queries, we remove the dupli-cates, so the dataset contains unique queries only.
a l l Q u e r i e s = l i s t ( s e t ( v a l u e s ) ) As we focus on query anti-patterns, we remove all of the non SELECT queries.
i m p o r t r e a l l Q u e r i e s = l i s t ( f i l t e r ( l a m b d a i t e m : r e .s e a r c h ( " ˆ s e l e c t " , i t e m .l o w e r ( ) ) , a l l Q u e r i e s ) ) In the end, the dataset is reduce from 1000000 to 318188 queries.

Transforming
In order to eliminate irrelevant information and reduce the size of our dataset vocabulary, we re-place all of the schema-related terms contained in the queries with standard words.Thus the queries contain almost only standard SQL keywords.3.4 Annotating Following a supervised learning approach, having SQL queries as input, we need to map each query to a set of anti-patterns as labels.

International Journal of Advanced Engineering Research and Science (IJAERS)
Our work is based on 16 common anti-patterns.To each of the query, we match a single anti-pattern.In fact, a single query can contain several anti-patterns, but for simplicity purpose, we only consider the most dominant anti-pattern.We ex-plain in detail each anti-pattern in the Appendix section.In this paper, we use word embedding based on word2vec (Mikolov et al., 2013).To encode the SQL queries of our dataset, we choose to use the pre-trained google word2vec embedding.The model is trained on 100 billion words from Google News by using the Skip-gram method and maxi-mizing the average log probability of all the words using a softmax function.Our result model contains 123.852 tokens.

Convolutionnal Neural Network
The convolution neural network is a state-of-the-art method to model semantic representations of sentences.
The convolution action has been com-monly used to synthesize lexical n-gram informa-tion.In our model, we use three different convo-lutional filters with varying convolution window size to form parallel CNNs so that they can learn multiple type of embedding of local regions so as to complement each other to improve model accuracy.The final output is the concatenation of the output of each.

International Journal of Advanced Engineering Research and Science (IJAERS)
[Vol- V. EXPERIMENTS 5.1 Settings For all our experiments, we use the Stochastic Gradient Descent optimization algorithm with a learning rate of 0.1 and a weight decay of 0.95.We conduct the experiments with 50 epochs and we use mini-batches of size 64.We evaluate the model every 100 steps.We use google pretrained word2vec thus the dimension of each word vector is 300.
We study the sensivity of the proposed model to the convolutional region size, the number of con-volutional feature and the dropout rate.We found that we achieve the best performance when we use the settings values listed in the Table I.
Our model is developed in Python with Tensor-flow and Numpy libraries.The experiments are conducted on a MAC OS PC with 2.9 GHz Intel Core i7 processor and 8 GB RAM.

Validation method
For validating our model we use the iterated K-Fold validation model.
The dataset is split into 10 mini-datasets, which are used to validate each subset repeatedly.

Results
We compare our results with Sqlcheck .Sqlcheck is a lint tool that relies on syntax checking logic, to detect antipatterns in SQL queries.We run SQL check on each of our dataset query, and store the results, which we then compare to our CNN re-sults.SqlCheck Our model 80 83.2 Fig. 5: Experimental Settings After running the experiments, our model can detect anti-pattern in a query with an accuracy of 83.2.

VI. CONCLUSION AND FUTURE WORK
In this work, we experimented using text classi-fication techniques to detect anti-patterns in SQL queries.The model uses a neural network with a custom dataset built from SkyServer catalog SQL queries.Experimental results demonstrate that, our model is quite accurate and can outperform lint syntax checking software.
For the future, we could focus on rewriting queries based on the anti-patterns detected.

ACKNOWLEDGEMENTS
I would like to express my gratitude to my super-visor and corresponding author on this work, Xie Hongwei PhD, who has been a great mentor dur-ing the years of research.
I would also like to thank my friend Miguel Kakanakou, for his very helpful advice and ex-perience sharing during the whole process of this work.

APPENDIX
Anti-patterns explanation select * When you SELECT *, you're often retrieving more columns from the database than your appli-cation really needs to function.This causes more data to move from the database server to the client, slowing access and increasing load on your ma-chines, as well as taking more time to travel across the network.
Consider a scenario where you want to tune a query to a high level of performance.If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might.
When you SELECT *, it's possible to retrieve two columns of the same name from two different tables.This can often crash your data consumer null usage NULL is not the same as zero.A number ten greater than an unknown is still an unknown.NULL is not the same as a string of zero length.Combining any string with NULL in standard SQL returns NULL.NULL is not the same as false.Boolean expressions with AND, OR, and NOT also produce results that some people find confusing not null usage When we declare a column as NOT NULL, it should be because it would make no sense for the row to exist without a value in that column.
string concatenation You may need to force a column or expression to be nonnull for the sake of simplifying the query logic, but you don't want that value to be stored.Use COALESCE function to construct the con-catenated expression so that a null-valued column doesn't make the whole expression become null.

International Journal of Advanced Engineering Research and Science (IJAERS) [Vol-6, Issue-4, Apr-2019] https://dx.doi.org/10.22161/ijaers.6.4.35 ISSN: 2349-6495(P) | 2456-1908(O)
The most important disadvantage of pattern-matching operators is that they have poor per-formance.A second problem of simple pattern-matching using LIKE or regular expressions is that it can find unintended matches.Although using wildcards and unnamed columns satisfies the goal of less typing, this habit creates several hazards.This can break application refac-toring and can harm performance having clause usage Rewriting the query's HAVING clause into a pred-icate will enable the use of indexes during query processing.& join usage The DISTINCT keyword removes duplicates after sorting the tuples.Instead, consider using a sub query with the EXISTS keyword, you can avoid having to return an entire table. distinct