Sunday, 30 April 2017

How Computers Work

Here is probably the best explanation of how computers work, ever:

Saturday, 29 April 2017

Programming Languages

I thought I'd publish a very quick overview of programming languages, simplified for non-programmers.

Language Hierarchy 

I'm going to outline, from the CPU up, what the 'language hierarchy' is

You start at the CPU - this is what does all your actual processing

Then you have the assembler, which interacts with the CPU. If you like the idea of writing in 1s and 0s, you can write in the assembler language, which isn't quite writing in 1s and 0s, but it comes close - it will take you a week to create a single button, but it gives you a huge amount of control.

On top of the assembler you have languages like C, which makes it easier to program common tasks that would take you forever to code if you were using only the assembler

On top of C, and really in the place of C in many cases, you have languages like C++, which are used for heavy duty processing. C++ is much faster than languages like Java, because it is closer to the CPU than Java is. C++ is a standard.

On top of C++ are built many other languages. C++ was used to create Java, for example, and there are plenty of other high-level languages such as PHP, Ruby, Python, C# (C-sharp) and so on. These languages are far more productive than C++, meaning, you can get more done by writing less code. Whereas it would take you a week to create a button writing out 1s and 0s for the assembler, it can take you 3 seconds to create a button using the Java UI tools.

And there you have it - a hierarchy of programming.

Compiled Language

When a programmer writes code, they usually have to "compile" it to turn it into the 1s and 0s that the computer can use. The word compile, in this context, just means convert. If you've ever run a .exe file on your Windows computer, this is simply a file filled with 1s and 0s, which has been produced by the developer, who compiled their code into this format.

Web coding doesn't usually require compiling, because it runs from the web-server, which does the work of turning HTML into 1s and 0s. You just need to make sure your code is in HTML. Often web servers will convert other languages into HTML on the fly, so you can write in PHP, for example, and this will run straight from the web server.

Which Language To Learn/Use

Different languages are good for different use cases, and you really need to look into what the pros and cons are of each. Some are very similar. For example, C# was originally Microsoft's attempt to compete with Java. Initially they improved on Java, but then Java improved and they have been facing off. As stated above, C++ is used for lower level programming, which is good for intensive applications, such as gaming engines. C# is also used in game creation using the Unity framework. PHP is a web language for creating web applications. A note for newbies: a web-app doesn't just mean some special app you find on a website. A website is a web-app.

C# is used in the .NET framework, a software framework developed by Microsoft. The idea behind .NET is that you can write apps in various languages and they will all run using .NET.

Intro to Programming Languages

Here is a really good intro to programming languages, given by the creator of C++, Bjarne Stroustrup: https://www.youtube.com/watch?v=JBjjnqG0BP8

Friday, 28 April 2017

SQL Bits 2017

A couple of weeks ago I got to go to SQL Bits, 2017, the largest SQL Server conference in Europe. I had a great time there, and came away with a couple of book recommendations:


  • Show Me the Numbers by Stephen Few - this will help you with the presentation side of your reports and dashboards 
  • R for Data Science by Hadley Wickham - this is an introductory book on writing R
 


Saturday, 11 March 2017

When to use Many to Many Relationships in Multi-dimensional SSAS

In SSAS, there is one thing that is really quite confusing: many-to-many relationships. This is where you have two dimensions that have a many-to-many relationship between them, for example, a Student may have many classes, and a class may have many students. If you want to relate class name to student name in a pivot table, how can you do that? Well, you need a bridge table, also called a Factless Fact table, and then you need to import this as a fact table, create a measure group, which is hidden, called Count, and then in the Dimension Usage tab of the designer, define a many-to-many relationship between the two tables.

Many to many relationships cannot be created between two fact tables, in SSAS. However, you could import those Fact tables as dimensions, along with a bridge table of the kind mentioned above, and treat your facts as dimension attributes. The problem with this is that you won't be able to do summing and all the usual aggregations that you can in a fact table.

Types of Hierarchy in a Data Warehouse

In Data Modeling, there are various types of hierarchy (such as a balanced hierarchy), and then there are various ways to implement a particular type of hierarchy in a database (such as a parent-child hierarchy). This post aims to explain both.

Types of Hierarchy

Balanced Hierarchy

In a balanced Hierarchy all the branches descend to the same level.

Example 1: Year > Quarter > Month > Day
Example 2: Division > Branch > Sales person 

Here are two representations of a balanced hierarchy (both are balanced): 

Un-Balanced Hierarchy

In an un-balanced hierarchy all the branches do not descend to the same level.

Example: A director with two employees who report to him, one of whom is a manager with multiple employees reporting to him, whereas the other has no employees reporting to him.



Ragged Hierarchy

In a ragged hierarchy, branches extend to different levels.

Example: In a company where everybody's pay is set according to what pay band they fall into, a CEO might be in pay band A, and he has Directors reporting to him who are in pay band B, and he also has a PA reporting to him in pay band C.


Multi-Parent Hierarchy

Many may not consider the following to be an example of a real hiearchy, but since we might refer to it as a hierarchy in the real world, I thought I would include it. In a multi-parent hierarchy, a lower level has multiple parents.

Example: Division > Branch > Sales Person, where the same sales person works at two branches.


Hierarchy Implementations

That's all very well and good, but how do we implement these in the database?

Parent-Child Hierarchy Structure

Here is an example of how you might implement any of the types of hierarchy listed above, except a multi-parent hierarchy. This is called a Parent-Child hierarchy, and I have chosen to implement the ragged hierarchy above:

Parent Child
A A
A B1
A B2
A1 C1
B1 C1
B1 C2

To implement a multi-parent hierarchy, you would need to have a multiple rows for the same member with multiple parents.

Natural Hierarchy Structure

A Natural Hierarchy structure cannot represent a multi-parent or a ragged hierarchy. Obviously, you would usually not start with Level 1, since they are all the same. So if Level 1 was the company, and Level 2 was the regional divisions withing the company, and Level 3 was the branch within the regional division, you would just start with Level 2, the regional divisions.

Level 1 Level 2 Level 3
A B1 C1
A B1 C2
A B2 C3
A B2 C4

Many-to-Many Hierarchy Structure

To represent a multi-parent hierarchy, you need a many-to-many structure, and in cases where a node has multiple parents, you need three three tables, one for the parents, one for the child, and one connecting both together with foreign keys from both (this is called a Factless Fact or a Bridge table, in Kimball methodology).

In SSAS, you import a bridge table into SSAS as a Fact table, and use a many-to-many relationship to link together the two dimensions, with a Measure Group of Count, which is hidden.

Facts, Dimensions, Measures, Attributes, Members

With all the dimensional modeling terms out there, it can be easy to get confused. Here is a quick guide to some of them for people who think in tables (which, I'm guessing, is most people who work with databases):

Fact can either refer to a fact table, or a row within that table, or a column within that table, or even a cell within that table. It is a very broad term. But with relation to the data within the table, a fact is always a number: 'a numeric fact about an event that happened'. Fact tables have foreign keys to the primary keys within dimension tables, which link together to describe the facts. So if the fact table records at each row a sale line, the fact might be the line total, and to find out what branch the sale line was recorded at, you'd have to link to the Branch dimension.

Measures are measurements of facts. So if your fact table records sale lines, then the measure might be $4.00 for a particular sale line, or $400,000 for all sale lines this week, i.e. a roll-up of all the sale lines. So a measure can refer to either a particular cell within the fact table, or a roll-up of cells.

Dimension can refer to either a dimension table, or a column within that table. It means something to group your facts into. So if your fact is a sale line, then a dimension might be the branch that sale line was recorded at.

Attributes refer to the columns within the dimension table.

Members refer to the cells within a particular attribute.

Degenerate Dimensions are columns within a fact table that are not numeric or foreign keys. They would normally go in a dimension of their own, but because there are so many of them, the 'carnality' (i.e. the number of distinct values within the column) is so high (i.e. many unique values compared to the number of rows within the column) they are put in the fact table, to save an expensive join to the dimension tables. Joins only work well when the number of rows in one table result set (table expression) is much smaller than the number of rows in another.

Example

So when a manager comes to you and says they want to see total sales by branch by year, they are telling you they want one measure (Total Sales), which means a 'Fact' will have to exist in a 'Fact table' called Sales, and this can be at any grain (sale line, sale, sale person, branch sales) so long as you can sum it up to give the total amount sales in a branch. They are also telling you they want two dimensions, which in our case will be represented in two dimension tables, a date table, and a branch table.

But note, your foreign key in your Fact table will have to join to these two dimension tables, so you need to ensure that your dimension tables go down to a low enough grain. If your fact contains the sales person FK, not the branch FK, your Branch dimension will need to go down to the Sales person grain.

Reasons Not to Use Junk Dimensions

Junk dimensions can be a pain in the neck. Often people will say that it is better just to add extra columns to the fact table, especially if the junk dimension represented only 'Yes' or 'No'. Here are some reasons why:

  • If you ever want to update junk dimensions, it can involve a change in the meaning of existing members (cells) in the junk dimension.
  • It is hard to maintain a surrogate key for a junk dimension
  • If you ever want to move an attribute (column) of a junk dimension into a fully fledged dimension, this will break all the reports that previously used that key from the junk dimension.
Nevertheless, junk dimensions do have these advantages:
  • Junk dimensions reduce the number of dimensions within a cube, making it easier to browse
  • Reducing the number of dimensions means the cube performs better during querying

When to Snowflake

According to standard Kimbal methodology, Star schemas should really be used for everything, except when it is impossible to represent the data in a Star structure, such as when you have many-to-many relationships. Here you need a Factless Fact (or Bridge) table, made up only of Foreign Keys.

The other time you might need an almost Factless Fact table is to represent the current status of data in a Star schema, such as when you want to represent a current list of all products available. This would require a column ProductAvailable in your Fact table, so that it can be summed, and a foreign key relationship to the Product Dimension. You may as well stick this column into the Product Dimension and then import that table into SSAS as both a Fact and a Dimension.

Sunday, 26 February 2017

Indexes over simplified

There is a ton of literature on how SQL Server indexes work, and I won't repeat all that here. But I want to over-simplify it to help beginners understand what kinds of indexes they should be adding to their queries.

Lets say you have a query like this:

SELECT BusinessEntityID
     , FirstName
FROM Person.Person
WHERE LastName 'Smith'

What type of index should you add to this query? 

Well, because its looking at three columns (BusinessEntityID, FirstName and LastName), you'll want an index to cover all three. 

But what type? There are two types of covering index:
  • Key Column cover: A covering index where the key to the index is made up of all the columns
  • Include Column cover: A covering index where the key to the index is made up of only one column (you can specify more), but the rest are included.
A Key column index over Lastname and FirstName (the order is important) looks like this:
  • Jones
    • Abraham
    • Isaac
  • Smith
    • Bob
    • Trudy
An include index with a key of Lastname including Firstname, looks like this:
  • Jones - Isaac
  • Jones - Abraham
  • Smith - Bob
  • Smith - Trudy
In short, a Key Column index is sorted by both LastName, and then FirstName. But it includes the firstname within the index structure itself, so that after having found the key (LastName), it doesn't have to go back to the table to find the other data (FirstName).

So for the following query, the ideal index would have its key on LastName, since that is what SQL Server will have to search through the data to find first, since we're filtering on it, and then once its found all rows with LastName of 'Smith', SQL Server will have to fetch the FirstName and BusinessEntityID for these rows, so we can include them in the index, to make it faster for SQL Server to return them.

SELECT BusinessEntityID
     , FirstName
FROM Person.Person
WHERE LastName 'Smith'

We could also have created an index with three columns LastName>FirstName>BusinessEntityID. This would be just as fast. In fact, it would help out in other scenarios too! For example, if we added an "ORDER BY LastName, FirstName" clause to the end. Because the FirstName would be indexed (i.e. arranged in order), not merely included.

But what if we change the query to this?

SELECT LastName 
     , FirstName
FROM Person.Person
WHERE FirstName 'Bob'

Because the key column index is ordered by LastName first, but the query needs to retrieve rows with FirstName first, the query will have to scan through every row of the index to find the data, making it less efficient.

But even though this is filtering on FirstName, it can actually still use the rowcover index we created. Of course, this won't be as fast as the query before, but SQL Server will still be able to make use of this index, making it a better option than having a key on Lastname and including firstname, which would not be able to be used if we put Firstname in the WHERE clause.

Friday, 3 February 2017

Database Corruption

Got database corruption? Want to check if you've got database corruption? Want to know what to do about database corruption? Here is the link for you!

This link is amazing! After following the link, scroll down to the bottom to find additional links to the following 10 parts:

Part I: What Is Corruption?
Part II: Simulating Corruption
Part III: Preventing Corruption
Part IV: CHECKSUM Page Verification
Part V: Storage Problem Alerts
Part VI: Regular Corruption Checks
Part VII: Backups
Part VIII: The Importance of Backups
Part IX: Responding to Corruption
Part X: Page Level Retore Operations
Part XI: Full Recovery Operations
Part XII: Recovery Sample
Part XIII: Recap

Here is the link:

http://sqlmag.com/blog/sql-server-database-corruption-part-i-what-corruption

Tuesday, 31 January 2017

Power BI Dashboards - No Drill Down - Aghhh!

A new feature was added to Power BI last year: drill down in charts. Microsoft didn't give us drill down in tables, but they did in charts.

But even with charts, there's a snag: the feature disappears when you pin a report to a dashboard. To access it, you need to click the Title of the report in the Dashbord, which will take the user through to see the report. 

Saturday, 14 January 2017

Simple Speak Vs Business Speak

The man said, “I am a digital and social media strategist. I deliver programs, products and strategies to our corporate clients across the spectrum of communications functions.”

When asked by the journalist what this meant, he replied, “I teach big companies how to use facebook.”

I like to use simple speak rather than its bloated counterpart, because people actually understand me when I do. This even applies to technical terminology. For example, I call a Sharepoint Document Library, a folder. I like to call a precedent constraint in SSIS, a line.

"Business Intelligence" itself is too much of a mouthful for me. I prefer to say, "Turning data into information" or, "making data easier to understand."

Why use simple speak in the context of business intelligence, you ask? Because I like to make data easier to understand.

Here is a video outlining the matter perfectly:

Tuesday, 3 January 2017

Agile, Waterfall or Neither for SSRS Reports?

Anyone who is familiar the Waterfall project methodology knows that part of it involves collecting all requirements up front. Compare that to Agile where the project consumes requirements in a more segmented way, through 'user stories'. In my view, these methodologies may have their use in other forms of programming, such as web-development, but when it comes to report writing, neither are ideal (though Agile comes closest).

In BI requirements always change. In fact, its unfair to demand that business user know exactly what they want. The probably do not know the data structure behind the data. It may be that what they want cannot be produced, and this is not realized until later in the development. Once the user has seen the result of the initial database query, they may realize (as they investigate the result) that they needed something quite different, or something extra. They simply couldn't have known that before. BI is all about investigation. That makes it different from other types of development.

In my view, those 'BI Professionals' who demand a REQUIREMENTS DOCUMENT up front, are amateurs. They have often come from development backgrounds, and have great skills in coding. But they don't know about the business realities of BI.

Being a great BI Professional is about so much more than just coding. You have to be able to work with the business; find out what they want, why they want it, and be good at explaining data realities to them, translating technical realities about the data into understandable language.

BI development is a learning process - that's why its called Business Intelligence, not Business Reporting or Business Dashboarding. Its not as simple as creating a report, a dashboard, a cube, a model, or writing a query. That often comes at the end of the process, once the data, usually copied into Excel and looked over by the user several times, has been verified and agreed to. Only then should it be instantiated into a report (turned into a report) or dashboard or tabular model or cube.