Tagged: Data science, Databases, SQL
- This topic has 0 replies, 1 voice, and was last updated 10 months, 2 weeks ago by
Simileoluwa.
- AuthorPosts
- February 28, 2020 at 9:28 pm #86854Participant@simileoluwa
SQL (Structured Query Language) is a standard database language that is used to establish, maintain and extract data from relational databases. We produce large chunks of data daily and data being generated in real-time have the potentials of driving organizational decision-making processes. By precisely recording data, refreshing and monitoring them on an efficient, steady basis, organizations can address their difficulties on one hand and utilize the massive potential offered by databases on the other.
SQL is a very important asset in a Data Scientists arsenal as it is critical in evaluating, imputing, extracting, manipulating and modifying data stored in a system. The effectiveness of any data-driven decision-making process is solely dependent on the quality of the dataset and the quality of the dataset is in turn dependent on the ability of the Data Scientist to select, manipulate, modify and extract the right data needed for processes, thus the importance of being skilled in SQL as a programming language.
Analytical tools such as Python and R require one to load the dataset into the local computer and on the RAM, however, what if the data of interest is too large (over 50 million rows) and thus can’t be loaded, but rather it sits somewhere on a database management system (DBMS) in a cloud storage. This means that Scientists need to shuffle between Integrated Development Environment (IDE) such as Jupyter lab, Rstudio, MySQL workbench, etc. to accomplish different tasks. This process can be tedious especially when you want to use different programming languages for the task at hand and this is what the Rstudio extensions, packages aim to achieve.
Setting up your Database Manager in R Studio
Rstudio, an IDE for R programming provides different methods of using SQL in it, without having to download IDE’s designed for SQL processes, although it doesn’t perform all the full functions of IDE’s solely designed for SQL purposes. It can effectively perform data querying processes. In this article, we will introduce two ways of writing SQL queries in Rstudio, these processes provide effective methods of ensuring that you can query humongous datasets without downloading them first.
Using Database Connection packages: Several packages can be used to connect to a database and then you can from your R script send queries to the database. In this tutorial, we will be introduced to a few. The first is the odbc package in R, with it through a few lines of code, you can connect to a database as shown below:
123456library(odbc)con <- dbConnect(odbc(),Driver = "SQL Server",Server = "localhost\\SQLEXPRESS",Database = "datawarehouse",Trusted_Connection = "True")Note that these are just dummy credentials. You must provide these credentials peculiar to your database, in other for you to establish a connection. You can visit here to know the specifics of how to connect to certain databases and the R package needed to establish the connection.
To further explain this concept, we will establish a connection using another package in R, however, we will make our local memory the database:123456library(DBI)# Create an ephemeral in-memory RSQLite databasecon <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")dbListTables(con)dbWriteTable(con, "mtcars", mtcars)dbListTables(con)dbListFields(con, "mtcars")dbReadTable(con, "mtcars")The mtcars dataset is preestablished in R studio, the code above makes the memory storage of R the database connected to, in this case, however, we utilize another connector provided for by the DBI package. This is such that one can select any dataset in that memory (database) and perform queries on it using SQL syntax. We can perform SQL queries on the dataset retrieved as shown below:
123# You can fetch all results:res <- dbSendQuery(con, "SELECT * FROM mtcars") #Extract the all columnsdbFetch(res) #print all results from the query aboveNotice the queries are wrapped in quotation marks and also embedded in a function. This is how you will continue any query you would like to send to the database.
The second method is quite easy, such that all you need to do once you have your R studio, is to start a new SQL script. R studio allows you to have a script solely dedicated to SQL queries alone. You can write your SQL script in here, the same way you would write it in an IDE dedicated solely for SQL purposes as shown in the image below:
Conclusion
We have seen how to configure your R studio to connect to a database and as well have an SQL script sole for database queries. These improvements reduce the need to shuffle between IDE’s, thus it provides ease. The processes of Extraction, Transformation, and Loading (ETL) are made very easy for Analysts.
- AuthorPosts
- You must be logged in to reply to this topic.