SQL education: A systematic mapping study and future research agenda

Structured Query Language (SQL) skills are crucial in software engineering and computer science. However, teaching SQL effectively requires both pedagogical skill and considerable knowledge of the language. Educators and scholars have proposed numerous considerations for the betterment of SQL education, yet these considerations may be too numerous and scattered among different fora for educators to find and internalize, as no systematic mappings or literature reviews regarding SQL education have been conducted. The two main goals of this mapping study are to provide an overview of educational SQL research topics, research types and publication fora, and to collect and propagate SQL teaching practices for educators to utilize. Additionally, we present a short future research agenda based on insights from the mapping process. We conducted a systematic mapping study complemented by snowballing techniques to identify applicable primary studies. We classified the primary studies according to research type, and utilized directed content analysis to classify the primary studies by their topic. Out of our selected 89 primary studies, we identified six recurring topics: (i) student errors in query formulation; (ii) characteristics and presentation of the exercise database; (iii) specific and (iv) non-specific teaching approach suggestions; (v) patterns and visualization; and (vi) easing teacher workload. We list 66 teaching approaches the primary studies argued for (and in some cases against). For researchers, we provide a systematic map of educational SQL research, and future research agenda. For educators, we present an aggregated body of knowledge on teaching practices in SQL education over a time frame of 30 years. In conclusion, we suggest that replication studies, studies on advanced SQL concepts, and studies on aspects other than data retrieval are needed to further educational SQL research.


INTRODUCTION
Among the core topics in software engineering, computer science, and information systems curricula in higher education are databases and Structured Query Language (SQL) [62,103,105].Since SQL is prevalent in database systems, SQL skills are also valued in the software industry, and consequently, teaching SQL effectively is essential in training future software professionals.However, teaching databases requires considerable subject knowledge in addition to pedagogical skill [102].Additionally, there are several approaches to teaching SQL, and especially for an inexperienced database course teacher, differentiating between patterns (i.e., an effective teaching approach) and anti-patterns (i.e., what merely looks like an effective teaching approach) is difficult [89].
SQL is commonly divided into at least two sublanguages, Data Manipulation Language (DML, e.g., SELECT, INSERT, UPDATE, DELETE) and Data Definition Language (DDL, e.g, CREATE, ALTER, DROP) [28,60].Additionally, as the revisions of the SQL standard have introduced more features, two more sublanguages, Data Control Language (DCL, e.g., GRANT, REVOKE) and Transaction Control Language (TCL or TxCL, e.g., BEGIN, COMMIT, ROLLBACK), are sometimes discussed in literature.The origins of the sublanguage names DCL and TCL remain unclear, as these names are not explicitly mentioned in the SQL standard.Nevertheless, we have found this division into four sublanguages helpful and rather intuitive, and utilize it in this study.

SQL in Higher Education
SQL teaching in higher education is both long-lived and widespread.In the information technology subfields, SQL is explicitly mentioned in software engineering (SE) [103], computer science (CS) [62], and information systems (IS) undergraduate curricula guidelines [105], and additionally in areas such as business analytics [113].These three information technology curricula guidelines expectedly overlap [64], and recommend SQL education on a relatively high level.SE guidelines recommend DML, DDL, and indexes, views, sequences, joins, and triggers in the context of database design.IS guidelines recommend DML, DDL, DCL, and transactions.CS guidelines provide the finest level of detail among the three, and recommend DDL, primary and foreign key attribute and schema definition, query formulation, UPDATE, integrity constraints, selection, projection, aggregate functions, GROUP BY, subqueries, division, stored procedures, and transaction control.
As these guidelines are merely guidelines, and presented at a high level, it is unclear how comprehensively and in depth SQL is covered in courses.Our impression, based on the primary studies and our teaching experience, is that basic DML is commonly discussed.Basic DML includes SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, INSERT INTO, VALUES, UPDATE and DELETE clauses, different types of joins, including the different variations of JOIN, certain predicates like IN, EXISTS, LIKE, BETWEEN, and IS NULL, and standard SQL aggregate functions MIN, MAX, AVG, SUM, and COUNT.However, advanced concepts like recursion, common table expressions, or derived tables are seldom included.It is unclear whether these advanced features are not widely known to educators and curricula designers, or have they been omitted from course contents by design.Basic DDL is commonly discussed, and includes CREATE, ALTER, and DROP statements on tables, views, and users (i.e., roles).Table creation includes column name and data type definitions, primary and foreign keys, and the CHECK constraint.However, more advanced concepts such as assertion, trigger, and procedure manipulations are rare.DCL is discussed to a lesser extent than DML and DCL, even though this sublanguage is relatively small and simple.If TCL is included in a course, it is often discussed with examples outside SQL, such as simple read(a), write(b) [e.g., 39, p. 669ff.],even though transactions are often defined using SQL.After SQL, database education may focus on non-relational extensions [107], other data models, and data analytics [109].

Learning Context
In the aforementioned information technology curricula guidelines, SQL is not taught in isolation, but as a part of a database course.Before learning SQL, students need to know, at the very least, about the relational data model, and possibly the theoretical foundations of relational query languages.Nowadays, students learn SQL in digital, interactive environments using an exercise database [23].This kind of environment can simply be a DBMS to which a student submits queries and receives output.
Manuscript submitted to ACM Alternatively, the environment may be a DBMS's SQL interface embedded in a web page, and the web page fitted with auxiliary elements, for example, a representation of the underlying database schema, a natural language request (i.e., data demand) to which a students must write an SQL equivalent, and the correct result table [101].The database may be represented at the conceptual level as an Entity-Relationship (ER) diagram [37], or at the logical level as a database schema diagram [43].For both levels, numerous additional or alternative notations can be utilized, for example, Unified Modeling Language class diagrams, enhanced/extended ER, and Logical Data Structures.When a student submits a query, the DBMS outputs either a result table, or an error message.Commonly, SQL errors have been divided into syntax and semantic errors [95].More recently, research has identified the concept of complications, for example, tautologies and unnecessary elements in queries, which do not affect the result table but performance and readability [17].Furthermore, semantic errors have been further divided into errors which are evident without knowledge of the underlying data demand, and errors which are only recognizable if the data demand is known [17].The former kind of semantic errors are called semantic, and the latter logical [101].
Finally, more advanced environments may be used [19,20].These environments provide different additional features, for example, non-binary grading of queries [1], personalized feedback [77], and visualized query execution [49].These more advanced environments are outside the scope this study.

Research Questions
We divided our research questions into two categories.Research questions 1 and 2 are closely related to typical outcomes of the systematic mapping process in software engineering [83], and research question 3 and 4 related to the proposed SQL teaching practices, and to the nature of the proposing studies: RQ3: Which practices have been proposed for teaching SQL?In addition to lectures, textbooks, and practical exercises, studies have identified and proposed practices (e.g., new tools, teaching methods, or increased emphasis on specific topics) to more effectively teach SQL.Answers are presented in Section 5.
RQ4: What kind of evidence is presented to support the proposed practices?Whereas some SQL education studies report practices as results, others merely suggest different practices in their respective discussion sections.We want to differentiate between educated opinions and scientifically supported (or contested) propositions.Answers are presented in Section 5 and Appendix C.

Search Strategy
We searched four digital libraries without applying date or publication type restrictions: ACM Digital Library, IEEExplore, ISI Web of Science, and Scopus, which include arguably the most recognized computing science education research fora such as ACM Transactions on Computing Education, Computer Science Education, and the SIGCSE and ITiCSE Manuscript submitted to ACM  2).The AIS eLibrary search was limited to peer-reviewed repositories.Both the AIS eLibrary and JISE search strings were more inclusive than the others due to the relative small size of the databases, although the former is only small if limited to peer-reviewed repositories.Google Scholar was considered, but a preliminary search returned too many results to inspect in a feasible timeframe.
The study selection process is illustrated in Fig. 1.The rectangles labeled A1 and A2 indicate the authors who performed the corresponding step.We performed backward snowballing (i.e., following reference lists to find relevant studies) [114] twice.Both authors studied the papers independently, and marked them both according to the research type facet classification (Table 3), and whether the paper should be included or excluded and why.We then compared our notes.In case of disagreement, we discussed until we reached a consensus on whether to include or exclude a paper, and how to classify the paper according to the research type facets.The comparison and discussion step was performed twice, and 89 primary studies were selected.

Study Selection
The searches yielded many papers concerning machine learning, SQL injection, and SQL learning environments.
These papers were excluded because different learning environments were outside the scope of this study, learning in machine learning is not related closely enough to human learning in the context of this study, and education regarding SQL injection is more concerned with the design of the application program rather than SQL.To a lesser extent, the searches returned papers concerned with procedural extensions of SQL (e.g., T-SQL and PL/SQL), query optimization, NoSQL, data warehousing, and web development, all of which were excluded.We were relatively unanimous in our study inclusion/exclusion discussions, yet one study [106] was particularly difficult.The study explores the effects of task complexity and time limitations on query writing, and gives every implication that the query language  [36,68,85,120, 123] E4 do not mention SQL, or SQL is merely an example or a vehicle [9,11,27,55,67,92,93,96,111] E5 concerned with SQL alternatives rather than teaching SQL [22,26,31,33,35, 81] E6 focus on describing an SQL learning environment [2,19,20,76,78,82,86] E7 lack sufficient detail to suggest a detailed teaching approach [3,52,57,69,79,[116][117][118]122] under study is indeed SQL.However, SQL is not mentioned in the paper, with the exception of a table summarizing prior work.The study was not included, along with several other borderline exclusions [7, 11, 19, 20, 23, 27, 31-33, 45, 46, 50, 52, 55, 61, 63, 73, 79, 93, 97, 104].Finally, we recognize that it is increasingly common for ICT research to report implications for research, industry, and teaching.It follows that there are most likely papers that report implications to SQL education, but were not found by our search criteria.As indications for finding these implications are not often found in abstracts, reading, for example, all relational database related research was not feasible, and not done in this study.
Inclusion (I1-I2) and exclusion (E1-E7) criteria are presented in Table 2.We decided to exclude papers published before 1989.Although this year marked the publication of the first revision to the SQL standard, SQL-89, this was not an educated choice as much as conveniently including 30 years of SQL education research.
Manuscript submitted to ACM

Data Extraction
We extracted basic reference information from the database searches: names of the authors, title, publication year, name of the publication forum, and issue number, volume, and page numbers, where applicable.Once the primary studies were selected, we classified each paper according to the research type and research topic facets.We also marked why a paper was excluded according to our predefined exclusion criteria, the number of citations from Google Scholar, and the number of participants in each study.
As Wieringa et al. [112] point out, it is possible that one study covers more than one research type facet.For example, Taipalus et al. [101] present a query concept framework based on their teaching experience, an error categorization framework based on a qualitative study, and opinions on how SQL should be taught.By these three aspects, their study could be classified as a philosophical paper, evaluation research, or an opinion paper.We classified each primary study according to what we perceived as their primary contribution.As discussed in Wohlin et al. [115], these classifications merely represent an overview of the type of research in a given mapping.
After the final 89 primary studies were selected, the first author classified the studies into categories according to their topics.This was done based on full text reading, and according to directed content analysis [56] with the utilization of prior knowledge on SQL education research categories.Using preconceived topic categories, the first author classified each primary study into a category.If a study was concerned with a topic which did not fit to any category, a new category was considered.Topic categories are reported in Section 4.2, and this category scheme is used to structure Section 5, in which we report the teaching approaches in more detail.

Threats to Validity
3.6.1 Descriptive Validity.Descriptive validity concerns the objectivity and accuracy of the data gathering.We utilized a data collection form described in Section 3.4 to increase the objectivity and accuracy of the classification and study selection.Both authors used the same form when selecting the studies and classifying the research type.
3.6.2Theoretical Validity.Theoretical validity concerns the selection and classification of the data.We tried to minimize the possibility of missing relevant studies by searching several databases, and by performing backward snowballing twice (Fig. 1).The first snowballing yielded 16 additional studies, and the second 3, yet after closer inspection, not all these studies were included in the final selection.As Petersen et al. [84] point out, researcher bias is a known threat to validity in the study selection phase.We tried to mitigate this by performing research type classification and applying Manuscript submitted to ACM Table 3. Research type facet in educational research (adapted from Wieringa et al. [112] and Petersen et al. [83])

Category Description
Evaluation research Hypotheses are tested on (or phenomena studied among) their natural target populations, and preferably in as natural environments as possible.This means that if the hypotheses are concerned with evaluating a new method for teaching students, the natural target is a student or a novice in a given technique (e.g., a language), and the most natural testing environment should be the environment the students would be using regardless of the research setting.This is not always possible, and varying degrees of unnatural elements must often be included.Sufficient quantifiable evidence is presented.

Solution proposal
Paper presents a new or significantly improved solution for a common and recognized problem.The topic may be related to concepts that are difficult for students to learn, teacher's workload, or curriculum improvement.Solid arguments for (and preferably against) the proposal are presented.exclusion criteria E3-E7 independently, and comparing results afterwards.Topic classification (Section 4.2) was done solely by the first author, and is the main threat to validity in this regard.

Interpretive Validity.
Interpretive validity concerns researchers' biases in the interpretation of the data.The first author is an author of several selected primary studies, which may induce bias in interpretation.The second author, however, is not, and there were no disagreements on whether to include or exclude those studies.The first author's experience in educational research concerning SQL was also considered helpful in the study selection and classification processes, although this may have biased the interpretation of primary study results.
3.6.4Repeatability and Generalizability.In order to increase the repeatability of our results, we followed systematic mapping guidelines proposed in Petersen et al. [83] and complemented later in Petersen et al. [84].We also reported threats to validity, and how we tried to mitigate them.However, study selection and classification involve human judgment, and another group of researchers might select at least slightly different set of primary studies.

Publication Fora
Out of the 89 primary studies, 38 (43%) were journal articles published in 18 different journals.50 (56%) studies were presented in 31 different conferences, and one primary study was a workshop paper.As can be observed in Table 4, the studies subject to this mapping study have been published in various journals and conference proceedings, and Manuscript submitted to ACM searching teaching approach proposals should not be limited to merely educational fora.Appendix A lists the primary studies and their corresponding identifiers.
Publication fora and citations among primary studies are illustrated in Fig. 2. We have clustered the primary studies according to the journal (JISE, JCSC) or the organization (ACM, IEEE, Elsevier, and top IS) the forum is associated with.In addition to the AIS senior scholars' basket of journals1 in IS, the top IS cluster contains primary studies from associated fora (CAIS and ICIS) with the exception of JISE, which formed a large enough cluster on its own.The color of an edge corresponds to the citing publication, for example, a blue edge between a red and a blue node indicates that the blue node is citing the red.Alternatively, a clockwise curving edge from node x to node y indicates that x cites y.With a few exceptions, the top IS studies cite each other extensively, while citing among ACM studies varies.JISE, IEEE, Elsevier and JCSC studies cite each other relatively seldom.A small percentage of JISE studies cite top IS studies and some ACM and Elsevier studies.Top IS studies cite some Elsevier studies, but nothing else.ACM studies cite mostly Elsevier and JISE studies, but not IEEE or top IS studies.JISE studies cite mostly top IS, Elsevier, and ACM studies, but not IEEE studies.IEEE studies cite some ACM studies, but nothing else.Two of the JCSC studies cite a total of three ACM studies.Some Elsevier studies cite some top IS studies.Publications per year are presented in Fig. 3.
The graph in Fig. 2 can be considered an indication of potentially untapped relevant primary research between clusters, yet it should be interpreted with caution.First, the edges only represent citations among primary studies, and are not an indication of how many citations a study has received (which, in itself, is not an indication of, for example, quality of a study).Second, the age of a study has a natural effect on the number of citations.The number of citations overall are presented in Table 5, to give an indication of the most commonly cited primary studies in SQL education.It is worth noting that even though a primary study is not cited among the selected primary studies, it may have received scientific attention outside these primary studies, as is the case with, for example, PS75.Finally, all SQL related research is not, intuitively, relevant to each other, but the graph propounds the view that researchers are studying similar aspects of SQL education without knowledge of each other.We give examples that support this argument in Section 7.3.

Classification
We presented our adapted research type facet classifications in Table 3.Additionally, we classified the primary studies according to their topics, which we next describe briefly, and in detail in Section 5.It is worth noting that the categories overlap, and that a number of studies were candidates to more than one category.The names and descriptions of the categories are based on full-text reading of the primary studies, and constructed using directed content analysis [56].The summary of primary study distribution between these two classifications is presented in Fig. 4  Studies presenting a specific teaching approach (9 papers): these studies present a teaching approach which concerns a specific subset of SQL, for example, how relational division, outer join, or existence negation should be taught.Most of these studies are solution proposals and opinion papers.
Studies presenting a non-specific teaching approach (22 papers): these studies discuss a more general teaching approach which should or could be used in teaching all SQL in a given course.The studies propose, for example, group learning Studies discussing patterns and visualization (12 papers): these studies mainly propose solutions on how to visualize the query execution process to students, whether to use visual query builders to facilitate SQL learning, planning queries before writing using a specialized notation, and utilizing steps and natural language patterns in query formulation.
Most of these studies are solution proposals, and many overlap with the previously described category.
Finally, a number of studies proposed approaches to ease teacher workload (15 papers): these studies proposed solutions concerning, for example, automated exercise generation, automated grading and feedback, and pointed educators to materials available online.Arguably, as the teacher workload lightens, educators can focus more on difficult concepts regarding SQL.

SQL TEACHING PRACTICES
All the teaching considerations listed in this section are not actionable advice per se, but, for example, concerned with the most common errors students commit.These insights may be utilized by the teacher to focus on certain query concepts during lectures or in exercise design.Furthermore, these errors can be utilized in exercise database data generation, so that at least incorrect queries with the most common logical errors return data that is different from the correct result table.Finally, these errors may be used to guide digital learning environment development, so that feedback for the most common errors may be generated.It is worth noting that we have applied the nomenclature discussed in Section 2 to all the following teaching practice presentations.
Teaching considerations regarding student errors, the exercise database, specific and non-specific teaching approaches, patterns and visualization, and teacher workload are compiled into Tables 6, 7, 8, 9, 10, and 11, respectively.The teaching approaches are not in any particular order regarding arguments for and against.In other words, it is arbitrary whether an approach is presented as Teach x [PS01; argued against in PS02] or Do not teach x [PS02; argued against in PS01].
Appendix C lists the number of participants in each primary study.
Manuscript submitted to ACM

Patterns and Anti-patterns
Even though we listed numerous teaching approaches in the previous section, it remains unclear which approaches are patterns and which are anti-patterns, and in which contexts.As may be observed in the previous section, we do not differentiate between approaches based on objectively interpreted results and subjective discussion.Consequently, we Manuscript submitted to ACM  advise a level of caution when interpreting the reported teaching approaches in the previous section, and the number of corresponding participants Appendix C.
As the nature of opinion papers and experience reports is as their names suggest, these approaches are seldom tested in a scientific setting.As an example, Matos and Grasser [70] suggested a teaching approach for teaching relational division which is easier for students to understand.The authors report no numbers concerning how many students found the approach easier.However, by comparing the proposed teaching approach and the commonly used alternative, the benefits are apparent; in addition to being computationally faster, the approach of using GROUP BY with HAVING is arguably easier to read than multiple existence negations, at least in our opinion.In contrast, Borthick et al. [15] Manuscript submitted to ACM studied how the database normalization level affects errors committed in query writing, and found out that end-users commit fewer errors in queries against a database adhering to the first normal form than end-users against a database adhering to the third normal form.The hypotheses were tested with 80 undergraduate and masters level students.
Based on reported quantifiable evidence supporting the views presented in these two studies, it might be compelling to advise the use of lower normal form databases over higher ones, and to dismiss the one regarding relational division.
Although fewer errors might be a desirable goal to strive for, lower normal forms in database education present significant downsides.Students learn bad design practices which later need to be unlearned, the database is subject to anomalies [38], and requires more disk space due to redundancy.Finally, it is not clear whether students should strive for fewer errors, (although other database end-users arguably should), as errors are arguably an efficient way through which students learn, as argued in SQL education research [54,100] as well as broader educational contexts [74].

Natural and Unnatural Learning Environments
A recurring theme in the primary studies, regardless of the research topic, was argumentation for [5,8,101,121] and against [75,88,101] natural learning environments.A natural learning environment better reflects industry, i.e., students' future work environments.Environmental traits differ between workplaces, job titles, and used technologies.For the sake of discussion, we state that in a workplace there is no known correct result table for a query [5], the data demand is ambiguous [24], the datasets are complex [51], and the business domain is unfamiliar [58].In contrast, peers are often present to offer help, use of textbooks and the internet is naturally not forbidden, and the query may be formulated as many times as necessary in a feasible timeframe.In an unnatural learning environment, these characteristics are reversed.The underlying arguments for natural environments are that students need to be prepared for their future work, and the arguments against are usually that natural environments hinder the learning of SQL (e.g., perceived confidence and success rates decrease).In teaching, these two approaches are usually mixed to varying degrees, for example, Taipalus et al. [101] report giving students the correct result table but designing exercise database data to contain no anomalies, yet Wagner et al. [110] report utilizing low quality data.
If the goal of SQL education is to prepare students to effectively work in their future work environments, learning should take place in more natural environments, and there is no need to exclusively choose a natural, mixed, or unnatural environment.SQL should first be taught in an unnatural environment [12,14], and when the syntax and semantics are mastered to a degree, natural elements such as data demand ambiguity may be introduced gradually [108], or a natural environment used in the final exam.Naturally, grading team performance is more difficult to the teachers, and students should be prepared to work independently in their future workplace, even though help is available.We discuss natural environments more in Section 7.1.
Although Lertnattee and Pamonsinlapatham [66] argue for using SQLite due to its relatively easy configuration, teachers should be aware that SQLite 3 contains features2 which, in our experience, confuse students.For example, in SQLite 3, data types have little meaning (strings can be stored in INT columns), some arguably important SQL concepts are not implemented (ALL, RIGHT OUTER JOIN), PRIMARY KEY does not imply NOT NULL, and strict grouping is not enforced.

Decay
In Section 3.3, we wrote that we rather conveniently chose to include SQL education research from a timeframe of 30 years.However, we advise caution when interpreting the results from the older primary studies, as these teaching considerations decay over time.Both the SQL standard and its implementations develop over time, as do the technologies in IT field in general.For example, a learning environment from 1990s appears naïve in terms of features, and the general look of the user interface.Some, mostly older works study the effects of a conceptual database structure representation instead of logical representation [61], while others criticize the very purpose of such a research setting [90].More importantly, examining some older studies raises questions whether the SQL language itself has changed too much for a teaching approach to hold true anymore.This point is further emphasized with the notion that the SQL standard has never been a simple source to interpret.Three examples follow.
First, a seminal study from 1995 [95] considered "omitting the FROM clause" a semantic rather than a syntax error, even though (at least current) SQL standard considers the FROM clause mandatory in a query.Furthermore, the study demonstrated all table joins with explicit WHERE clause conditions, without the use of JOIN predicate or subqueries.This might be an educated approach, a coincidence, or resulting from the fact that these concepts were introduced in the SQL-92 standard.At least one study [90] suggests that separating expressions and joins in their respective clauses reduces some types of query formulation errors.It is unclear why Smelcer [95] demonstrates table joins using only explicit join conditions in the WHERE clause, but this is a reason to infer that the students who participated in the study were taught table joins with explicit WHERE clause conditions.
Second, another study from 1993 [119]  Third, a study from 1988 [21] demonstrates how the aggregate function SUM handled NULL at the time.The study demonstrated how SUM would return NULL if even one of the items was NULL.Nowadays, the standard has been revised, and in most implementations, SUM handles NULL similar to zero.Rather than criticism toward the aforementioned studies, we are trying to communicate that even though the language we are using today has the same name as decades ago, SQL has undergone notable changes, and for this reason alone older studies should be given closer scrutiny.

Research Dearths
Concepts beyond SELECT have received little attention in educational research.The studied query concepts [4,6,101], and formulated error frameworks [17,101] focus solely on data retrieval.Intuitively, the transition from SELECT to UPDATE and DELETE is relatively easy [49], as the query concepts in the WHERE clause are the same.In terms of SQL, DCL and TCL concepts are relatively simple, and the difficulty comes from the design of privileges and transactions rather than implementation.However, DDL statements and INSERT are both a fundamental and important part of SQL which have not been studied in detail.
Advanced SQL features have not been studied in educational contexts.If we consider the SQL concepts reported in the primary studies, most of them could be based on the SQL-92 standard, and in some cases, even on SQL-89.Since Manuscript submitted to ACM then, numerous features have been added to the SQL standard, and they remain untapped from a research perspective.Such features are, for example, online analytical processing aggregate functions, the WINDOW clause, table functions, multisets, the MERGE statement, and generated columns added in SQL:1999 and SQL:2003 [47].Additionally, SQL:2011 introduced both temporal [65] and non-temporal features [124] such as pipelined DML and enhancements to several older concepts.As we mentioned in Section 2.2, it is unclear whether knowledge about these features need to be propagated, or have they been omitted from course contents on purpose.Finally, in addition to software development, further research could also explore how SQL has extended to adjacent fields such as data science [18], broader contexts in general [44], and what types of SQL extensions have been introduced to better fit field specific needs [80].
Are unnatural environments beneficial remains an open question.Studies in which students or novices are aided by for example, automated feedback, simpler data, or unambiguous data demands achieve higher success rates in query formulation.This, however, does not necessarily reflect their future work environments.Furthermore, a recent study [99] discovered that as the logical complexity of the exercise database increases, students are less likely to succeed in query formulation.The same study, however, cautions the use of success rates alone in evaluating different teaching approaches; it is possible that although the students who fail in query formulation with a complex database, are more prepared for natural environments than students who succeed in query formulation with a simple database.Studies that test student skill in natural environments are needed, preferably so that one group of students learns SQL in an unnatural environment, and another in a natural environment, after which both groups are tested in a natural environment.Furthermore, as unnatural environments are intuitively targeted to help poor performing students, Russell and Cumming [91] raise an important concern that a certain level of simplification may impede both the learning, and the ardor towards the IT field of brighter students.

Replication
As presented in Fig. 4, there were no replication studies among the primary studies.While experience reports, opinion papers, solutions proposals, and philosophical papers are problematic to replicate due to their nature, even the most fundamental evaluation research studies [14,34,95] remain without replication.This is problematic, as central premises of subsequent studies are occasionally based on the results of the fundamental studies.The lack of replication studies in computing education in general has only recently received scientific attention [53].Partly because of the lack of replication, we argue that educational SQL research is not mature enough to distinguish between patterns and anti-patterns.Moreover, a particularly insightful study by Rho and March [90] noted that some studies evaluated SQL on such a simple level, that the ceiling effect (i.e., variance in an independent variable is not measurable due to simplicity of the task) might explain the lack of differences in the results.Replication studies are not needed only because of reliability and the ceiling effect, but also because of obsolescence, as discussed in Section 6.3.
Beyond replication, and to uncover patterns and anti-patterns, it is crucial to evaluate proposed teaching approaches, especially those of solution proposals and opinion papers, in a scientific environment.Preferably, these evaluations should be done by researchers independent of the original authors, as it is common that reported solutions are considered helpful by the original authors.With propagation concerns [13] in mind, approaches supported by scientific evidence are likely to receive more attention among practitioners.

Building upon Existing Body of Knowledge
Based on the insights from the mapping process summarized in Fig. 2 and discussed in Section 4.1, we urge researchers to utilize and build upon existing body of knowledge in new approach proposals, and to critically evaluate all approaches.
Manuscript submitted to ACM Matos and Grasser [70] authored a study showcasing a new approach for teaching relational division.The study was published in the summer 2002 issue in JISE.Dadashzadeh [40] authored a study expanding and generalizing similar approach to other set comparison queries.This study shows relational division similarly to Matos and Grasser, and was published in the winter 2003 issue in JISE.Finally, McCann [72] authored a study presenting relational division similarly to Matos and Grasser, and this study was presented in the FIE conference in November 2003.Neither of the two latter studies cited Matos and Grasser, even though relational division is a specific concept.This might be a result of all the studies published within a short timeframe, but also due to potentially fragmented educational research fora.Regarding primary study citations in Fig. 2, is it that, for example, ACM studies in general considered top IS studies, but did not find them relevant, or is it that they did not find them?Did they not find them because of different nomenclature, or did they not utilize searches which included them?Would their research settings and conclusions have been different in this regard?As IEEE studies are seldom cited among the primary studies, we might have missed them if we did not know about IEEE beforehand.That being said, there might be relevant pockets of research that we have missed.Based on our results, we advise educational researchers and reviewers to utilize and search prior works widely, as educational considerations may be found in numerous fora.
Most of the 29 opinion papers and experience reports did not discuss potential downsides of their proposed or tested approaches, and only one [10] had a section dedicated to discussing disadvantages.We urge authors of studies of this nature to either critically evaluate their approaches, or discuss why the approach does not need critical evaluation.In comparison, even a course given as a textbook based exam (and nothing else) has positive implications, as students can study with a flexible schedule, and choose learning strategies based on their own preferences.With this in mind, one critical factor to discuss is time.Elements cannot be added to a course without expanding it or removing other elements.Expanding a course arguably has potential downsides, and, for example, sacrificing best database design practices to more efficiently teach SQL is not a desired goal in a database course.Alternatively, a teaching approach may be replaced altogether, as presented by for example, Matos and Grasser [70] and Matos et al. [71].

CONCLUSION
In this study, we set out to systematically map educational SQL research, and to list teaching approaches proposed in scientific literature.Our mapping shows that primary studies are published in numerous fora, not all of which are educational in nature.Recurring themes in educational SQL research are improved teaching approaches, students errors, the exercise database and related concepts, and easing teacher workload, and all types of research are represented, with the exception of replication studies.Furthermore, based on the 89 primary studies, we listed 66 teaching approaches to help educators teach SQL more efficiently.For researchers, and in addition to the systematic mapping, we proposed future research avenues, and general suggestions on how to conduct educational SQL research.

RQ1:
In which fora is SQL education research published?There are no publication fora which are specifically focused on SQL education, or even SQL in general.However, both computing education and database research fora in general are plentiful.Answers are presented in Section 4.1.RQ2: What types of research are represented and to what extent?Educational research is diverse by nature, and while some studies test clearly formulated hypotheses, others report opinions and experiences.We want to understand the SQL education landscape to identify potential dearths in research.Answers are presented in Section 4.2 and Appendix B.

Fig. 2 .
Fig. 2. Publication fora clusters and citations among primary studies -studies published in ACM journals or conference proceedings are clustered top right (purple), Journal of Computing Sciences in Colleges bottom right (orange), IEEE bottom center (yellow), top IS fora top left (red), Journal of Information Systems Education top center (blue), Elsevier center (green), and other fora (gray); size of a node represents in-degree, color of an edge corresponds to citing publication, edge curves clockwise from the citing to the cited publication, and numbers refer to primary study IDs

Fig. 4 .
Fig. 4. Number of primary studies in each research type facet (x-axis) and topic facet (y-axis) intersection demonstrated erroneous queries with subqueries formulated with NOT EXISTS, in which the subqueries' SELECT clauses contains multiple column names, and stated "Both cases contain errors of form.The subqueries used with EXISTS (NOT EXISTS) should use the SELECT * ... format." Nowadays, it is more of a widely accepted practice to use simple (NOT) EXISTS subquery SELECT clauses such as SELECT * or SELECT 1, but effectively it does not matter what is selected, and even division by zero is accepted by DBMSs.

Table 1 .
Search strings conferences.Due to the pervasive nature of SQL in the information and communication technology (ICT) field, we also searched two information systems focused databases: AIS eLibrary and the database of Journal of Information Systems Education (JISE).Search strings are presented in Table 1.The database searches yielded a total of 2,709 studies, 414 from ACM Digital Library, 646 from IEEExplore, 228 from Web of Science, 1,361 from Scopus, 46 from AIS eLibrary, and 14 from JISE.Additionally, we had 16 papers [4-6, 16, 17, 25, 40-42, 70, 71, 94, 98, 100, 101, 121] which we knew well enough to deem them suitable for closer criteria evaluation (cf.Table Rather than trying to estimate how natural the research settings of the primary studies were, we adapted the research type facet classification to better fit educational research (Table Replication studyPaper replicates a previously reported research setting as accurately as possible, or with premeditated alterations (e.g., a different teacher, students who major in a different subject, or undergraduate instead of graduate students).The goal of the study is to check the validity of the previous study, or to study generalizability of the results.Sufficient quantifiable evidence is presented.Philosophical paper Paper presents a new conceptual framework, taxonomy, general teaching approach, new, improved or adapted research method, or simply summarizes existing work in a form of systematic literature review or systematic mapping.Depending on the type of philosophical paper, the paper may utilize existing literature, or be based on professional opinions or experiences.
Opinion paperPaper expresses opinions of the author or a third party.These opinions may be concerned with, for example, whether something should be taught, how it should be taught, or to whom it should be taught.Typically no scientific evidence is presented.Experience reportPaper describes how something was done, for example, a course or a curriculum implementation.The new setting should be described in sufficient detail, so that others may replicate it.Paper should report what worked and what did not.

Table 5 .
Primary studies, number of citations from Google Scholar in Sep.2019, and citations divided by publication age in full years -PS09 was not indexed by Google Scholar Studies concerning the exercise database and elements closely related to it (20 papers): these studies evaluate, report experiences, and present opinions and solutions in regard to what kind of an exercise database is efficient in facilitating SQL learning.The studies discuss how to visually present the exercise database schema to students, how to express the data demands, what kind of database business domains should be used, how realistic the database should be in terms of data, and whether the students should be made aware if their SQL queries are logically correct.Most of these studies are evaluative in nature.

Table 6 .
Teaching approaches or considerations regarding student errors semantic errors and complications can be used to support discussion with students on bad query writing practices [PS18].This list is complemented with syntax and logical errors [PS80], and together give high level representation of what kinds of errors students can commit.Both of these lists are too long to discuss here.SE2 Self-join is the most difficult query concept overall [PS02, PS03, PS04, PS79], and these queries fail due to logical errors [PS02], namely join errors: joins are formed with incorrect tables, columns, or comparison operators, a join is missing, or a join is extraneous and needs to be omitted [PS79].
SE5 Logical errors are the most common class of errors overall [PS79, PS80], and the most difficult class of errors to fix [PS04, PS79].40% of errors students commit are semantic or logical in nature, and occur in the SELECT and WHERE clauses [PS04].SE6The most frequent errors that student cannot fix are, in order, illegal or insufficient grouping, common syntax errors, inconsistent expression, inconsistent joins, missing joins, expression errors such as missing or extraneous expressions, or expressions in incorrect clause, and projection errors such as missing or extraneous columns in the main the SELECT

Table 7 .
[PS07]ng approaches or considerations regarding the exercise databases as an event-based ER or state-based ER does not affect query accuracy or student confidence[PS07], but in regard to query formulation success rates, it is better to represent a database schema rather than a list of database contents or an ER diagram [PS31].Furthermore, database representation semantics [PS51], symbols [PS51; argued against in PS10], and foreign key constraint representation [PS51] all have influence on query formulation success.DB8The three most important factors in query formulation success rates and time needed are, in order, data model representation realism, high expressive ease, and query complexity.Data model representation realism refers to which level the data model is represented, and the levels are, in ascending order of realism, physical, logical, and conceptual.Expressive ease is concerned with the language used, were it SQL, natural language, or something else [PS23].DB10 Provide an interface (or a cheatsheet) that allows students to see SQL keywords and database object names to reduce typing errors [PS05].Consider highlighting relevant parts of the data model for each data demand [PS82].DB11 If data demand complexity increases, success rates fall, more time is needed, and students feel less confident [PS14].DB12 Allowing students reuse similar queries in exercises leads to faster query formulation, but results in more errors, and a poorer relationship between confidence and query correctness [PS08].DB13 Students should not execute queries in the same exercise database, because modifications affect others [PS44].DB14 Use complex [PS41, PS62, PS83; argued against in PS57 because students cannot manually check problems with erroneous queries against complex data] and low quality [PS83] exercise data because students need to gain understanding of complex environments, and that real data contains errors and missing values.Furthermore, use databases with business domains which are novel to the students so that students learn the importance of domain knowledge and can recognize abstract patterns and utilize them in different domains [PS46].More realistic databases are perceived more interesting an useful by students [PS88].DB15 Provide students with the correct result table [PS68, PS80; argued against in PS03 as students may use brute force to write correct queries], or the number of rows in the correct result table [PS68].If these are not provided, students should validate their results by manually writing tests [PS20].Students should understand that query evaluation against a single dataset is not enough [PS39].
DB9If incongruence (i.e., how well or poorly real world constructs match their equivalents in the database) increases, success rates fall, more time is needed, and students feel less confident [PS14].However, best design practices (e.g., database normalization) should not be sacrificed in order to reach more ontological clarity, as the implications for benefits are conflicted [PS15, PS16, PS17; argued against in PS74].

Table 8 .
A list of teaching approaches or considerations regarding a specific teaching approachIDTeaching approach or consideration SA1 Teach relational division with GROUP BY and HAVING, rather than multiple existence negations.This is easier for students to learn [PS54, PS56], and the written queries are computationally faster [PS54; the latter point is argued against in PS56].This work is extended from teaching relational division to teaching set comparison with a general approach [PS27].SA2 Teach OUTER JOIN according to ANSI SQL-92, i.e., with OUTER JOIN rather than UNION or derived tables.This is perceived easiest and it is computationally faster than the alternatives [PS55].SA3 Teach existence negation with an English-like query language before teaching the SQL equivalent [PS48].SA4 Explain the differences in the logic of NOT EXISTS and NOT IN subqueries [PS18].SA5 Teach strict grouping [PS21, PS80].Effectively, this means that if the main SELECT clause contains at least one aggregate function, and at least one grouping column, all and only the grouping columns must be included in the GROUP BY clause.SA6 Teach integrity constraints by dividing them into five classes: dynamic, domain, tuple, relation, and database integrity constraints [PS29].SA7 If you use Microsoft Access to teach SQL, and want to teach recursive joins which are not supported, stored procedures can be used to complement SQL [PS28].SA8 Teach transaction control using real SQL examples, and not simple READ(a) and WRITE(b) that are usually found in database textbooks [PS37].SA9 Teaching SQL after QBE yields better results than teaching SQL first [PS86; the use of QBE is argued against in PS69 because mental models must be changed when switching to SQL].

Table 9 .
[PS01]ng approaches or considerations regarding a non-specific teaching approach final exam, organize intermediary assessments which can be taken after a certain number of exercises have been passed[PS70].This helps especially weaker students[PS70].Giving the assessments in a digital learning environment positively affects grades[PS01].SQL skills should not be assessed through SQL code alone, but also with multiple choise questions [PS13].Brighter students' motivation suffers if a course is not challenging enough [PS70].NA8 Demonstrate difficult SQL concepts with animations [PS33, PS60, PS89].NA9 Encourage students against unnecessary SQL elements, even though such omittances affect readability [PS61].NA10 Use SQLite to teach SQL, because it is lightweight and students do not need to configure anything [PS52; argued against in PS80].

Table 10 .
Teaching approaches or considerations regarding patterns and visualization ID Teaching approach or consideration PV1 Utilize a template to help students write more complex SQL queries [PS06, PS20, PS82].This increases success rates [PS82], and decreases errors in the FROM and ORDER BY clauses, but not in the GROUP BY clause [PS21].PV2 Have students plan more complex queries to ease cognitive load [PS78].A planning notation is introduced and described [PS78].As data demand complexity increases, a priori planning decreases the number of errors more and more [PS21].PV3 Teach students how to identify certain natural language patterns (e.g., never, all, sum) and their corresponding SQL clauses, constructs, and keywords [PS66, PS77].PV4 Teach SQL query formulation in steps (i.e., procedurally) [PS04, PS21, PS66, PS77].Alternatively, introduce both procedural and set-based query formulation approaches at the start of a course.Students can choose which to use [PS65].Procedural approach to query formulation is more natural to students, but fails at complex queries [PS65].PV5 Visualize query execution [PS22, PS30, PS42].It is helpful if students can visualize the query step by step, and go forward and backward, similar to programming language debuggers [PS22, PS42].If possible present the query simultaneously visually and textually [PS39].PV6 If students are likely to never write complex SQL, alternatives such as QBE should be considered, as it is faster to utilize, and perceived more comfortable [PS45].

Table 11 .
Teaching approaches or considerations regarding teacher workload ID Teaching approach or consideration WL1 A list of 14 small SQL course modules is presented [PS85].The list is too long to be presented here.SQL concepts are divided into basic, advanced, and expert level modules [PS38] These modules may be used as, for example, a structure for short online lectures [PS81].WL2 Learning environments that allow teachers to monitor student activity, and also allow students to give feedback to the teacher [PS01] are available.Furthermore, large online learning environments with exercises and exercise databases are available without fee [PS84].WL3 Exercise database datasets can be generated automatically [PS12, PS25, PS26], and tested against expected erroneous queries automatically [PS12].A query should be tested against multiple datasets [PS01], and discrepancies can be used to automatically provide feedback [PS01, PS49, PS87].Alternatively, a query's correctness can be evaluated using string metrics [PS76] or XML transformations [PS39].WL4 As an alternative to automatic exercise database generation, students may be required to create their own exercise databases and grant appropriate privileges [PS44].WL5 Data demands can be automatically generated based on correct SQL queries [PS47].WL6 Utilize examinations and exercises which can be automatically graded [PS81].WL7 Students should be given the opportunity to select themselves how complex queries they want to practice writing (query concepts, number of tables etc.), and these exercises can be automatically generated [PS35].Furthermore, students should be allowed to choose a level of hints which the system suggests [PS50].WL8 SQL taught through game based learning significantly increases student performance when compared to textbooks [PS75].
Number and type of participants in each primary study; primary studies that are not listed involved no participants, or did not report participant numbers