- This topic has 0 replies, 1 voice, and was last updated 1 year, 10 months ago by Oluwole.
- February 27, 2020 at 10:30 pm #86764Spectator@oluwole
Microsoft Excel has been in the data science game since the 80s. It is perhaps the most popular tool for handling data and dealing with tables, pivots, and charts. So the odds are that if you are a data scientist, Excel was your first love.
Although it is hyped to a lesser degree compared to tools like R, Python, Spark, Hadoop, JupyterLab, etc., Excel remains relevant in many ways. It’s been around for so long because it excels at its purpose.
Microsoft considered names like “Mr. Spreadsheet” and “Master Plan” before deciding on “Excel”, a double entendre on the many cells that make up the spreadsheets and the excellence of the program.
Indeed, there are many arguments for and against the notion that Excel is a powerful tool for data science, but we would not get into that. However, it is imperative to note that Excel’s functionalities are often downplayed. Its capabilities include executing SQL queries on tables and data sets, fetching data from various data sources (including Azure cloud or Hadoop), scripting with programming languages (including Python), performing advanced analytics, and editing 2-dimensional data. Excellent, right?
There is a lot to explore in Excel if you haven’t already. In this post, our focus is on some excel functions that Data Scientists find useful. They are broadly divided into three categories;
- Generating Insights
- Excel Functions
- Data Cleaning
The objective of any data analysis is to seek insights and draw meaningful conclusions. Excel equips us with tools such as charts and pivot tables that make it easier to find answers to questions we have about our data.
Excel’s Pivot Table feature is one of its most potent. Pivot tables are useful in extracting significance from a large, detailed dataset. They provide summary statistics such as sums and means and presents them in a meaningful way.
Let’s see how to create a Pivot Table.
Click on any single cell inside the data set. On the Insert tab, in the Tables group, select PivotTable. A dialogue box appears, as shown below. The Table/Range tab is usually automatically selected, but if it isn’t, you can drag over the desired area to select it manually. You can choose to create the table on a New or an Existing Worksheet. Click on OK to create the PivotTable.
A PivotTable Fields pane appears, and to get the total amount of each product sold, drag the following fields to these areas;
- Product field to the Rows area
- Country field to the Filters area
- Amount field to the Values area
The result is a Pivot Table that shows that Burger produced the most revenue.
We can change the field settings in each of the PivotTable Fields areas to suit our analysis. Say you wanted to return the mean, rather than the sum of the amounts of the products, here’s how to go about it.
In the Values area, click on the Sum of Amount and scroll to Value Field Settings. Here, you can toggle the Average option, amongst others.
The functionality also extends to creating Pivot charts for better visualization.
Excel charts or graphs are useful tools that present data from a worksheet in visual formats. Excel provides several chart types such as Bar, Column, Line, and Bar charts. Creating Charts in Excel is easy.
Let’s see how to do so using our dataset.
Highlight the area of the data you want to include in your analysis. Under the Insert ribbon, you’d see different chart types in the Chart group, including Pivot Charts. Excel recommends the chart types to use based on the nature of your data, and these can be accessed using the Recommended Charts selection.
A dialogue pops up upon clicking on the Recommended Charts, and you have some graphs, such as a bar chart, in this case. Choosing the desired chart type, and clicking on OK displays the figure on your sheet.
There is an endless list of Excel functions, from the simple to the complex. Here are three important ones;
- VLOOKUP AND HLOOKUP
- Logical Functions
VLOOKUP AND HLOOKUP
Both features are similar and are used to search for a specific category or value in a table and return the opposing value from adjacent rows or columns. VLOOKUP searches in a vertical direction while HLOOKUP searches in a horizontal direction. Hence, the “V” in VLOOKUP stands for vertical, and the “H” in HLOOKUP represents horizontal.
You can read more about VLOOKUP here. The syntax of HLOOKUP is similar to that of VLOOKUP, and it takes in four arguments;
- Value – the value to search for in the first row of the table
- Table – two or more rows of data sorted in ascending order
- Index_number – the row number in the table from which the matching value should be retrieved
- Approximate match – Entering FALSE gives an exact match while entering TRUE gives an approximate match. TRUE is the default value.
The HLOOKUP syntax is thus:12#HLookup syntaxHLOOKUP( value, table, index_number, [approximate_match] )
Let’s see how this works. From the spreadsheet above, the following HLOOKUP examples would return:123456=HLOOKUP(4,A1:K6,4,FALSE)#Returns the value in 4th rowResult: 460=HLOOKUP(6,A1:K6,3,FALSE)#Returns the value in 3rd rowResult: Food
It is also possible to use these functions across different worksheets by including the sheet name in the second parameter.
Across various programming languages (including English), concatenation means linking things (strings, values, etc.) together in a chain or series. Excel’s CONCATENATE function does likewise. It connects data or values across columns and rows together. The “&” operator can also be used to concatenate data across cells. It is generally preferred to the CONCATENATE function because it is shorter and serves the same purpose.
The syntaxes for the CONCATENATE function and “&” operator to join two values together are as follows:12345678#To concatenate the values in A1 and B1= CONCATENATE(A1, B1)#or=A1&B1#To concatenate the values in A1 and B1 separated with space= CONCATENATE(A1, “ “, B1)#or=A1&” “&B1
The following screenshot demonstrates how the results may look like:
IF, AND, OR, and NOT are particularly useful logical functions in data analysis. IF is used to specify selection criteria and return a particular output if the criteria are met or otherwise. The OR and AND operators are intuitive. OR checks if any of the arguments are TRUE while AND checks if all the cases are TRUE. Both operators return either TRUE or FALSE. The NOT operator does the job of changing TRUE to FALSE and vice-versa.
Let’s see the IF function in operation. The sample syntax is as follows:
It returns Expensive if the Price is greater than 500, and Cheap if it is less than 500.
Microsoft Excel provides data cleaning functionalities such as sorting, filtering, and conditional formatting.
When working with a large amount of data, sometimes you may want to group the data and sort it in some way. This can be achieved manually, although it is a time-consuming and challenging activity. Excel provides a sorting function that takes care of such needs in a matter of seconds. The data can be sorted in ascending order, descending order, alphabetical order, amongst others.
Here is how to sort your data:
- Select a cell in the column containing the data you intend to sort
- Select the Home Tab
- In the Editing group, select the Sort and Filter option and choose the order in which you want your information sorted
- The shortcut key ALT + H + S can also be used to access the Sort function
Filtering your data helps you subgroup it in any way you desire. For large datasets, it helps you narrow your focus to what is relevant to your analysis. It also makes the data more convenient to handle.
Here is how to apply the filter to your data:
- Click on a cell in your work area
- Select the Home Tab
- In the Editing group, select the Filter option and choose the order in which you want your data sorted
- Click on any drop-down in the column header to see the possibilities for filtering the data
- The shortcut key ALT + H + S + H can also be used to access the Sort function
It is imperative, when handling datasets, to distinguish between the more critical and less critical data. Conditional formatting enables you to format cells (say highlight them with a particular colour), depending on the cell’s value.
Many rules can be applied in conditional formatting. They include:
- Highlight Cell Rules – greater than, less than, equal to, between, etc.
- Top/Bottom Rules – top 10 items, bottom 10 items, top 10%, bottom 10%, etc.
- Data Bars – gradient fill, and solid fill
- Colour Scales
- Icon Sets
To achieve conditional formatting, say you want to format cells that are less than a 500 in the Amounts column;
- Select the range of the cells under consideration.
- On the Home tab, click on Conditional Formatting in the Styles group.
- Click Highlight Cell Rules, Less Than.
- Enter the value and select a formatting style, i.e., colour scheme
- Click OK
The result is that excel highlights the cells that are less than 500, and this helps in visually categorizing your data.
Your data may come in the form of texts, and they may be separated with delimiters (such as a comma or a semi-colon). Excel has a feature that enables the conversion of this text into columns.
To convert a text, akin to the one above, to columns, Highlight the cells to be converted, and under the Data ribbon, click on Text to Columns in the Data Tools group. A dialogue pops up, and you have to choose the file type, either Delimited or Fixed Width. Our choice is Delimited because our data is separated by a semi-colon, which is a delimiter. There is also a preview of the selected data.
Click on Next to specify the type of Delimiter, which in this case, is a semi-colon. Click on Next and select the data format; in this case, General works fine. Then click on Finish to finalize the conversion process, and your data should be arrayed in columns as shown;
Complex statistical and engineering analysis can be done in Excel using a Microsoft Excel add-in called Analysis ToolPak. It provides useful data analysis functions such as Anova, Correlation, Regression, Percentile, F-Tests, and Descriptive Statistics. The Analysis ToolPak can be accessed by clicking Data Analysis in the Analysis group on the Data tab.
Microsoft Excel remains to date, a powerful tool for handling data. It provides us with features such as sorting, filtering, visualizing, formatting, concatenating, and more. However, its merits considerably wane when the data to be analyzed is either big or unstructured, and the nature of the analysis is advanced.
- You must be logged in to reply to this topic.