A computer

Excel cube program. Creating an OLAP cube using Microsoft Query

OLAP (On-Line Analytical Processing) is an electronic data analytics method that represents the organization of data into hierarchical categories using pre-calculated totals. OLAP data is organized hierarchically and stored in cubes rather than tables. OLAP cubes are a multidimensional dataset with axes on which parameters are plotted and cells containing parameter-dependent aggregate data. Cubes are designed for complex, multidimensional analysis of large amounts of data, because they provide only summary results for reports, instead of a large number of individual records.

The concept of OLAP was described in 1993 by the well-known database researcher and author of the relational data model, E. F. Codd. Currently, OLAP support is implemented in many DBMS and other tools.

An OLAP cube contains two types of data:

Total values, values for which you want to sum up, representing calculated data fields;

descriptive information that measurements or dimensions. Descriptive information is usually broken down into levels of detail. For example: "Year", "Quarter", "Month" and "Day" in the dimension "Time". Breaking down fields into detail levels allows report users to select the level of detail they want to view, starting with a high-level summary and then moving on to a more detailed view, and vice versa.

Microsoft Query tools also let you create OLAP cubes from a query that loads data relational base data, such as Microsoft Access, while converting a linear table into a structural hierarchy (cube).

The OLAP Cube Creation Wizard is a built-in Microsoft Query tool. To create an OLAP cube based on a relational database, you must complete the following steps before running the wizard.

1. Define the data source (see Figure 6.1).

2. C help from Microsoft Query to create a query, including only those fields that will be either data fields or dimension fields of the OLAP cube, if the field in the cube is used more than once, then it must be included in the query the required number of times.

3. At the last step of the Query Creation Wizard, set the radio button to Create an OLAP cube from a given query(see Fig. 6.2) or after the query has been created using the Query tools directly in the menu File choose a team Create OLAP cube , which will launch the OLAP Cube Creation Wizard.

The OLAP Cube Creation Wizard has three steps.

At the first step of the wizard (see Figure 6.6), the data fields– Calculated fields for which you want to define totals.

Rice. 6.6. Defining data fields

Suggested calculated fields (usually numeric fields) are placed by the wizard in upper part list, flags and defines the final function of these fields, usually − Sum. When selecting data fields, at least one field must be selected as a calculated field, and at least one field must remain unchecked to define the dimension.

When creating an OLAP cube, four summary functions can be used − Sum, Number(number of values), Minimum, Maximum for numeric fields and one function Number for all other fields. If you want to use several different summary functions for the same field, that field must be included in the query as many times as needed.

Calculated field name can be changed in a column Data field name.

At the second step of the wizard, descriptive data and their dimensions are defined (see Figure 6.7). To select a dimension field, you need from the list Source fields drag the desired top-level dimension field to the list measurements to the area marked as Drag fields here to create a dimension. To create an OLAP cube, you must define at least one dimension. At the same step of the wizard, using context menu you can change the dimension or level field name.

Rice. 6.7. Definition of dimension fields

Fields that contain isolated or distinct data and do not belong to a hierarchy can be defined as single-level dimensions. However, using a cube will be more efficient if some of the fields are organized into levels. To create a level as part of a dimension, drag a field from the list Source fields on a field that is a dimension or level. Fields containing more than details, should be located at lower levels. For example, in Figure 6.7, the field Position is the field level Department Name.

To move a field to a lower or higher high level, you need to drag it to a lower or higher box within the dimension. The or buttons are used to display or hide the levels, respectively.

If date or time fields are used as the top-level dimension, the Create OLAP Cube Wizard automatically creates levels for those dimensions. The user can then select which levels should be present in the reports. For example, you can select weeks, quarters, and years, or months (see Figure 6.7).

Keep in mind that the wizard automatically creates levels for date and time fields only when you create the top-level dimension; when you add these fields as sublevels of an existing dimension, no automatic levels are created.

At the third step of the wizard, the type of cube created by the wizard is determined, while three options are possible (see Figure 6.8).

Rice. 6.8. Selecting the type of cube to be created at the third step of the wizard

· The first two options involve creating a cube each time the report is opened (if the cube is viewed from Excel, then we are talking about a pivot table). In this case, the request file and the file *.oqy cube definitions A containing instructions for creating the cube. The *.oqy file can be opened in Excel program To create reports based on the cube, and if you need to make changes to the cube, you can open it with Query to run the Create Cube Wizard again.

By default, cube definition files, as well as query files, are stored in the user's profile folder in Application Data\Microsoft\Que-ries. When saving the *.oqy file to standard folder, the cube definition file name is displayed on the tab OLAP cubes when opening a new query in Microsoft Query or when selecting a command Create request(menu Data, submenu Importing external data) in Microsoft Excel.

In the case of choosing the third option of the cube type Saving a cube file containing all the data for a cube, all the data for the cube is retrieved and a cube file with extension * is created in the location specified by the user .cub in which this data is stored. Creation given file does not happen immediately when the button is clicked Ready; the file is created either when the cube definition is saved to a file, or when a report is generated from the cube.

The choice of cube type is determined by several factors: the amount of data that the cube contains; the type and complexity of the reports that will be generated based on the cube; system resources (memory and disk space), etc.

Separate file cube *.cub should be created in the following cases:

1) for interactive reports that change frequently if there is sufficient disk space;

2) when you need to save the cube on a network server to provide access to it for other users when creating reports. A cube file can provide specific data from the source database, while omitting secret or sensitive data that you want to prevent other users from accessing.

OLAP is not a single software product, not a programming language, and not even a specific technology. If you try to cover OLAP in all its manifestations, then this is a set of concepts, principles and requirements that underlie software products that make it easier for analysts to access data. Let's find out why analysts need something special facilitate data access.

The fact is that analysts are special consumers of corporate information. The task of an analyst is to find patterns in large data sets. Therefore, the analyst will not pay attention to the single fact that on Thursday the fourth day a batch of black ink was sold to the counterparty Chernov - he needs information about hundreds and thousands similar events. Single facts in the database may be of interest, for example, to an accountant or the head of the sales department, in whose competence the transaction is located. One record is not enough for an analyst - for example, he may need all the transactions of a given branch or representative office for a month or a year. At the same time analyst discards unnecessary details like the buyer's TIN, his exact address and phone number, contract index and the like. At the same time, the data that an analyst needs to work necessarily contain numerical values - this is due to the very essence of his activity.

So, an analyst needs a lot of data, this data is selective and also has a nature " attribute set - number". The latter means that the analyst works with tables of the following type:

Here " The country", "Product", "Year" are attributes or measurements, a " Volume of sales" - thus a numerical value or measure. The task of the analyst, we repeat, is to identify persistent relationships between attributes and numerical parameters.. Looking at the table, you can see that it can be easily translated into three dimensions: on one of the axes we put countries, on the other - goods, on the third - years. And the values in this three-dimensional array will be the corresponding sales volumes.

3D representation of the table. The gray segment shows that there are no data for Argentina in 1988

It is precisely such a three-dimensional array in terms of OLAP that is called a cube. In fact, from the point of view of strict mathematics, such an array will not always be a cube: for a real cube, the number of elements in all dimensions must be the same, while OLAP cubes do not have such a restriction. However, despite these details, the term "OLAP cubes", due to its brevity and imagery, has become generally accepted. An OLAP cube doesn't have to be 3D at all. It can be both two-dimensional and multidimensional, depending on the problem being solved. Especially seasoned analysts may need about 20 measurements - and serious OLAP products are designed for just such a number. More simple desktop applications support about 6 dimensions.

measurements OLAP cubes are made up of so-called marks or members. For example, the dimension "Country" consists of the labels "Argentina", "Brazil", "Venezuela", and so on.

Not all elements of the cube should be filled in: if there is no information about sales of rubber products in Argentina in 1988, the value in the corresponding cell will simply not be determined. It is also not necessary that an OLAP application store data necessarily in a multidimensional structure - the main thing is that for the user this data looks exactly like that. By the way special ways compact storage of multidimensional data, "vacuum" (unfilled elements) in cubes does not lead to a waste of memory.

However, the cube itself is not suitable for analysis. If it is still possible to adequately represent or depict a three-dimensional cube, then with six or nineteen dimensions the situation is much worse. So before use ordinary cubes are extracted from a multidimensional cube two-dimensional tables. This operation is called "cutting" the cube. Again, this term is figurative. The analyst, as it were, takes and "cuts" the dimensions of the cube according to the marks of interest to him. In this way, the analyst receives a two-dimensional slice of the cube and works with it. In about the same way, lumberjacks count annual rings on a saw cut.

Accordingly, as a rule, only two dimensions remain "uncut" - according to the number of dimensions of the table. It happens that only the dimension remains "uncut" - if the cube contains several types of numerical values, they can be plotted according to one of the dimensions of the table.

If you take a closer look at the table that we depicted first, you can see that the data in it, most likely, are not primary, but are obtained as a result of summation for smaller items. For example, a year is divided into quarters, quarters into months, months into weeks, weeks into days. A country is made up of regions, and regions are made up of localities. Finally, in the cities themselves, districts and specific retail outlets can be distinguished. Products can be combined into product groups and so on. In terms of OLAP, such multilevel joins are quite logically called hierarchies. OLAP tools make it possible at any time to move to the desired level of the hierarchy. Moreover, as a rule, several types of hierarchies are supported for the same elements: for example, day-week-month or day-decade-quarter. The source data is taken from the lower levels of the hierarchies and then summarized to obtain the values of the higher levels. In order to speed up the transition process, the summed values for different levels are stored in a cube. Thus, what looks like one cube from the user's side, roughly speaking, consists of many more primitive cubes.

Hierarchy Example

This is one of the essential points that led to the emergence of OLAP - productivity and efficiency. Let's imagine what happens when an analyst needs to get information, and OLAP tools are not available in the enterprise. The analyst independently (which is unlikely) or with the help of a programmer makes an appropriate SQL query and receives the data of interest in the form of a report or exports it to a spreadsheet. There are a lot of problems with this. Firstly, the analyst is forced to do something other than his job (SQL programming) or wait for the programmers to do the task for him - all this negatively affects labor productivity, the assault, the heart attack and stroke level increase, and so on. Secondly, a single report or table, as a rule, does not save the giants of thought and the fathers of Russian analysis - and the whole procedure will have to be repeated again and again. Thirdly, as we have already found out, analysts do not ask for trifles - they need everything at once. This means (although technology is advancing by leaps and bounds) that the corporate relational DBMS, which the analyst refers to, can think deeply and for a long time, blocking the rest of the transactions.

The concept of OLAP appeared precisely to solve such problems. OLAP cubes are essentially meta-reports. By cutting meta-reports (cubes, that is) by dimensions, the analyst actually receives the "regular" two-dimensional reports of interest to him (these are not necessarily reports in the usual sense of the term - we are talking about data structures with the same functions). The advantages of cubes are obvious - data needs to be requested from a relational DBMS only once - when building a cube. Since analysts, as a rule, do not work with information that is supplemented and changed on the fly, the generated cube is relevant for quite a long time. Thanks to this, interruptions in the operation of the relational DBMS server are not only eliminated (there are no queries with thousands and millions of response lines), but the data access speed for the analyst himself is also dramatically increased. In addition, as already noted, performance is also improved by calculating subsums of hierarchies and other aggregated values at the time of cube construction. That is, if initially our data contained information about the daily revenue for a specific product in a single store, then when forming the cube, the OLAP application calculates the totals for different levels of hierarchies (weeks and months, cities and countries).

Of course, you have to pay for increasing productivity in this way. It is sometimes said that the data structure simply "explodes" - an OLAP cube can take up tens or even hundreds of times more space than the original data.

Answer the questions:

What cube OLAP?

What labels specific dimension? Give examples.

Can they measures in an OLAP cube, contain non-numeric values.

Perhaps, for some, the use of OLAP technology (On-line Analytic Processing) when building reports will seem like some kind of exotic, so the use of OLAP-CUBE for them is not at all one of the most important requirements for automating budgeting and management accounting.

In fact, it is very convenient to use the multidimensional CUBE when working with management reporting. When developing budget formats, you may encounter the problem of multivariate forms (more on this can be found in Book 8 "Technology for setting budgeting in a company" and in the book "Setting and automating management accounting").

This is due to the fact that the effective management of the company requires more and more detailed management reporting. That is, the system uses more and more different analytical slices (in information systems analysts are defined by a set of directories).

Naturally, this leads to the fact that managers want to receive reports in all analytical sections of interest to them. And this means that the reports need to somehow be forced to “breathe”. In other words, we can say that in this case we are talking about the fact that, in terms of meaning, the same report should provide information in various analytical sections. Therefore, static reports no longer suit many modern managers. They need the dynamics that a multidimensional CUBE can provide.

Thus, OLAP technology has already become an indispensable element in modern and promising information systems. Therefore, when choosing a software product, you need to pay attention to whether it uses OLAP technology.

And you need to be able to distinguish real CUBEs from imitations. Pivot tables in MS Excel are one such imitation. Yes, this tool looks like a CUBE, but in fact it is not, since these are static, not dynamic tables. In addition, they have a much worse implementation of the ability to build reports that use elements from hierarchical directories.

To confirm the relevance of the use of the CUB in the construction of management reporting, one can cite the simplest example with a sales budget. In this example, the following analytical slices are relevant for the company: products, branches, and distribution channels. If these three analytics are important for the company, then the sales budget (or report) can be displayed in several ways.

It should be noted that if you create budget lines based on three analytical slices (as in the example under consideration), this allows you to create fairly complex budget models and compile detailed reports using the KUB.

For example, a sales budget can be compiled using only one analytics (reference book). An example of a sales budget based on a single "Products" analytic is shown in figure 1.

Rice. 1. An example of a sales budget built on the basis of one analytic "Products" in an OLAP-CUBE

The same sales budget can be compiled using two analytics (reference books). An example of a sales budget built on the basis of two analytics "Products" and "Affiliates" is presented on figure 2.

Rice. 2. An example of a sales budget built on the basis of two analytics "Products" and "Affiliates" in the OLAP-CUBE of the "INTEGRAL" software package

If there is a need to build more detailed reports, then the same sales budget can be compiled using three analytics (reference books). An example of a sales budget built on the basis of three dimensions "Products", "Affiliates" and "Distribution channels" is presented in figure 3.

Rice. 3. An example of a sales budget built on the basis of three analytics "Products", "Affiliates" and "Distribution channels" in the OLAP-CUBE of the "INTEGRAL" software package

It should be recalled that the KUB used to generate reports allows you to display data in a different sequence. On the figure 3 the sales budget is first "deployed" by product, then by branch, and then by distribution channel.

The same data can be presented in a different sequence. On the figure 4 the same sales budget is "rolled out" first by product, then by distribution channel, and then by branch.

Rice. 4. An example of a sales budget built on the basis of three analytics "Products", "Distribution channels" and "Affiliates" in the OLAP-CUBE of the INTEGRAL software package

On the figure 5 the same sales budget is "rolled out" first by branch, then by product, and then by distribution channel.

Rice. 5. An example of a sales budget built on the basis of three analytics "Branches", "Products" and "Distribution channels" in the OLAP-CUBE of the INTEGRAL software complex

In fact, these are not all possible options for deriving a sales budget.

In addition, you need to pay attention to the fact that the KUB allows you to work with hierarchical structure reference books. In the presented examples, the hierarchical directories are "Products" and "Distribution channels".

From the user's point of view, he this example receives several management reports (see Rice. 1-5), and in terms of settings in software product is one report. Just with the help of the CUBE, it can be viewed in several ways.

Naturally, in practice it is possible to a large number of output options for various management reports, if their articles are based on one or more analytics. And the set of analytics itself depends on the needs of users for detailing. True, one should not forget that, on the one hand, the more analysts, the more detailed reports can be built. But, on the other hand, it means that the financial model of budgeting will be more complex. In any case, if the KUB is available, the company will be able to view the necessary reporting in various options, in accordance with the analytical sections of interest.

It is necessary to mention a few more features of the OLAP-CUBE.

There are several dimensions in a multidimensional hierarchical OLAP-CUBE: row type, date, rows, lookup 1, lookup 2 and lookup 3 (see Fig. Rice. 6). Naturally, the report displays as many buttons with directories as there are in the budget line containing maximum amount reference books. If there is not a single directory in any line of the budget, then the report will not contain any buttons with directories.

Initially, the OLAP-CUBE is built on all dimensions. By default, when a report is initially built, the dimensions are located exactly in those areas, as shown in figure 6. That is, such a dimension as "Date" is located in the area of vertical dimensions (dimensions in the area of columns), the dimensions "Rows", "Lookup 1", "Lookup 2" and "Lookup 3" - in the area of horizontal measurements (dimensions in the area rows) and the "Row Type" dimension in the area of "unexpanded" dimensions (dimensions in the page area). If a dimension is in the last area, then the data in the report will not be "expanded" by that dimension.

Each of these dimensions can be placed in any of the three areas. After the measurements are transferred, the report is instantly rebuilt according to the new measurement configuration. For example, you can swap the date and strings with directories. Or you can transfer one of the reference books to the vertical measurement area (see Fig. Rice. 7). In other words, the report in the OLAP-CUBE can be "twisted" and choose the version of the report output that is most convenient for the user.

Rice. 7. An example of rebuilding a report after changing the measurement configuration of the "INTEGRAL" software package

The measurement configuration can be changed either in the main form of the KUB or in the editor of the map of changes (see. Rice. eight). In this editor, you can also drag and drop measurements from one area to another with the mouse. In addition, you can swap measurements in the same area.

In addition, in the same form, you can configure some measurement parameters. For each dimension, you can customize the location of the totals, the sort order of the elements and the names of the elements (see. Rice. eight). You can also specify what name of the elements to display in the report: abbreviated (Name) or full (FullName).

Rice. 8. Editor of the map of measurements of the software complex "INTEGRAL"

Measurement parameters can be edited directly in each of them (see. Rice. nine). To do this, click on the icon located on the button next to the name of the measurement.

Rice. 9. An example of editing a directory 1 Products and services in

With this editor, you can select the elements that you want to show in the report. By default, all items are displayed in the report, but if necessary, some items or folders can be omitted. For example, if you need to display only one product group in the report, then all the rest must be unchecked in the dimension editor. After that, the report will contain only one product group (see Fig. Rice. ten).

You can also sort items in this editor. In addition, elements can be rearranged different ways. After such a regrouping, the report is instantly rebuilt.

Rice. 10. An example of displaying only one product group (folder) in the report in the "INTEGRAL" software package

In the dimension editor, you can quickly create your own groups, drag elements from directories there, etc. By default, only the Other group is automatically created, but you can create other groups as well. Thus, using the dimension editor, you can configure which elements of reference books and in what order should be displayed in the report.

It should be noted that all such rearrangements are not recorded. That is, after the report is closed or after it is recalculated, all directories will be displayed in the report in accordance with the configured methodology.

In fact, all such changes could have been made initially when setting up the strings.

For example, using restrictions, you can also specify which elements or groups of directories should be displayed in the report, and which should not.

Note: the topic of this article is discussed in more detail at workshops "Business Budget Management" and "Setting up and automation of management accounting" conducted by the author of this article - Alexander Karpov.

If the user almost regularly needs to display in the report only certain elements or directory folders, then it is better to make such settings in advance when creating report lines. If, however, different combinations of elements of directories in reports are important for the user, then no restrictions should be set when setting up the methodology. All such restrictions can be quickly configured using the dimension editor.

What is OLAP today, in general, every specialist knows. At least the concepts of "OLAP" and "multidimensional data" are firmly connected in our minds. Nevertheless, the fact that this topic is being raised again, I hope, will be approved by the majority of readers, because in order for the idea of \u200b\u200bto not become outdated over time, you need to periodically communicate with smart people or read articles in a good publication...

Data warehouses (OLAP's place in information structure enterprises)

The term "OLAP" is inextricably linked with the term "data warehouse" (Data Warehouse).

Here is a definition formulated by the "founding father" of data warehouses, Bill Inmon: "A data warehouse is a domain-specific, time-bound and immutable collection of data to support the process of making managerial decisions."

The data in the storage comes from operational systems (OLTP systems), which are designed to automate business processes. In addition, the repository can be replenished from external sources, such as statistical reports.

Why build data warehouses - after all, they contain obviously redundant information that already "lives" in the databases or files of operating systems? The answer can be short: it is impossible or very difficult to analyze the data of operational systems directly. This is due to various reasons, including the fragmentation of data, storing them in different DBMS formats and in different "corners" corporate network. But even if all the data in the enterprise is stored on a central database server (which is extremely rare), the analyst will almost certainly not understand their complex, sometimes confusing structures. The author has a rather sad experience of trying to "feed" hungry analysts with "raw" data from operational systems - it turned out to be too tough for them.

Thus, the task of the repository is to provide "raw materials" for analysis in one place and in a simple, understandable structure. Ralph Kimball, in the preface to his book "The Data Warehouse Toolkit", writes that if, after reading the entire book, the reader understands only one thing, namely that the warehouse structure should be simple, the author will consider his task completed.

There is another reason that justifies the appearance of a separate storage - complex analytical queries for operational information slow down the current work of the company, blocking tables for a long time and seizing server resources.

In my opinion, storage is not necessarily a giant accumulation of data - the main thing is that it is convenient for analysis. Generally speaking, a separate term is intended for small storages - Data Marts (data kiosks), but in our Russian practice you will not often hear it.

OLAP- handy tool analysis

Centralization and convenient structuring are far from all that an analyst needs. After all, he still needs a tool for viewing, visualizing information. Traditional reports, even those based on single repository, deprived of one thing - flexibility. They cannot be "twisted", "expanded" or "collapsed" to get the desired view of the data. Of course, you can call a programmer (if he wants to come), and he (if he is not busy) will make a new report quite quickly - say, within an hour (I write and I don’t believe it myself - it doesn’t happen so quickly in life; let's give him three hours) . It turns out that an analyst can check no more than two ideas per day. And he (if he is a good analyst) can come up with several such ideas per hour. And the more "slices" and "slices" of data the analyst sees, the more ideas he has, which, in turn, require more and more new "slices" for verification. I wish he had such a tool that would allow him to expand and collapse data simply and conveniently! OLAP is one such tool.

Although OLAP is not a necessary attribute of a data warehouse, it is increasingly used to analyze the information accumulated in this data warehouse.

The components included in a typical storage are shown in fig. one.

Rice. 1. Data warehouse structure

Operational data is collected from various sources, cleaned, integrated and put into a relational store. At the same time, they are already available for analysis using various reporting tools. Then the data (in whole or in part) is prepared for OLAP analysis. They can be loaded into a special OLAP database or left in a relational store. Its most important element is metadata, i.e. information about the structure, placement and transformation of data. Thanks to them, the effective interaction of various storage components is ensured.

Summing up, we can define OLAP as a set of tools for multidimensional analysis of data accumulated in a warehouse. Theoretically, OLAP tools can be applied directly to operational data or exact copies(so as not to interfere with operational users). But by doing so, we run the risk of stepping on the rake already described above, i.e., starting to analyze operational data that are not directly suitable for analysis.

Definition and basic concepts of OLAP

To begin with, let's decipher: OLAP is Online Analytical Processing, that is, online data analysis. The 12 defining principles of OLAP were formulated in 1993 by E. F. Codd, the "inventor" of relational databases. Later, its definition was reworked into the so-called FASMI test, which requires an OLAP application to provide the ability to quickly analyze shared multidimensional information ().

FASMI test

Fast(Fast) - the analysis should be carried out equally quickly on all aspects of the information. Acceptable response time is 5 seconds or less.

analysis(Analysis) - it should be possible to carry out the basic types of numerical and statistical analysis, predefined by the application developer or arbitrarily defined by the user.

shared(Shared) - Multiple users must have access to the data, while access to sensitive information must be controlled.

Multidimensional(Multidimensional) is the main, most essential characteristic of OLAP.

information(Information) - the application must be able to access any necessary information, regardless of its volume and storage location.

OLAP = Multidimensional View = Cube

OLAP provides a convenient, high-speed means of accessing, viewing and analyzing business information. The user gets a natural, intuitive data model, organizing them in the form of multidimensional cubes (Cubes). The axes of the multidimensional coordinate system are the main attributes of the analyzed business process. For example, for sales it can be a product, region, type of buyer. Time is used as one of the measurements. At the intersections of the axes - measurements (Dimensions) - there are data that quantitatively characterize the process - measures (Measures). These can be sales volumes in pieces or in monetary terms, stock balances, costs, etc. A user analyzing information can "cut" the cube in different directions, get summary (for example, by years) or, conversely, detailed ( weekly) information and perform other manipulations that come to his mind in the process of analysis.

As measures in the three-dimensional cube shown in Fig. 2, sales amounts are used, and time, product and store are used as measurements. Measurements are presented at specific grouping levels: products are grouped by category, stores are grouped by country, and transaction times are grouped by month. A little later we will consider the levels of grouping (hierarchy) in more detail.

Rice. 2. Cube example

"Cutting" the cube

Even a three-dimensional cube is difficult to display on a computer screen so that the values of the measures of interest can be seen. What can we say about cubes with more than three dimensions? To visualize the data stored in the cube, as a rule, the usual two-dimensional, i.e., tabular, views are used, which have complex hierarchical row and column headers.

A two-dimensional representation of a cube can be obtained by "cutting" it across one or more axes (dimensions): we fix the values of all dimensions, except for two, and we get a regular two-dimensional table. The horizontal axis of the table (column headers) represents one dimension, the vertical axis (row headers) represents another dimension, and the table cells represent measure values. In this case, the set of measures is actually considered as one of the dimensions - we either select one measure for display (and then we can place two dimensions in the headers of rows and columns), or we show several measures (and then one of the axes of the table will be occupied by the names of the measures, and the other - the value of a single "uncut" dimension).

Take a look at fig. 3 - here is a two-dimensional slice of the cube for one measure - Unit Sales (pieces sold) and two "uncut" dimensions - Store (Store) and Time (Time).

Rice. 3. Two-dimensional cube slice for one measure

On fig. 4 shows only one "uncut" dimension - Store, but it displays the values of several measures - Unit Sales (pieces sold), Store Sales (sales amount) and Store Cost (store expenses).

Rice. 4. 2D Cube Slicing for Multiple Measures

A two-dimensional representation of a cube is also possible when more than two dimensions remain "uncut". In this case, two or more dimensions of the "cut" cube will be placed on the slice axes (rows and columns) - see fig. 5.

Rice. 5. Two-dimensional slice of a cube with several dimensions on the same axis

Tags

Values "set aside" along dimensions are called members or labels (members). Labels are used both for "cutting" the cube and for limiting (filtering) the selected data - when in a dimension that remains "uncut" we are not interested in all values, but in their subset, for example, three cities out of several dozen. The label values appear in the 2D cube view as row and column headings.

Hierarchies and levels

Labels can be combined into hierarchies consisting of one or more levels. For example, the labels of the dimension "Store" (Store) are naturally combined into a hierarchy with levels:

Country (Country)

State (State)

City (City)

Store (Shop).

According to the levels of the hierarchy, aggregate values are calculated, such as sales for the USA ("Country" level) or for California ("State" level). More than one hierarchy can be implemented in one dimension - say for time: (Year, Quarter, Month, Day) and (Year, Week, Day).

OLAP Application Architecture

Everything that was said above about OLAP, in fact, referred to the multidimensional presentation of data. Roughly speaking, neither the end user nor the developers of the tool that the client uses care about how data is stored.

Multidimensionality in OLAP applications can be divided into three levels:

Multidimensional data representation - end-user tools that provide multidimensional visualization and data manipulation; the multidimensional representation layer abstracts from the physical structure of the data and treats the data as multidimensional.
Multidimensional processing - a tool (language) for formulating multidimensional queries (the traditional relational SQL language is unsuitable here) and a processor that can process and execute such a query.
Multidimensional Storage - Tools physical organization data that provide efficient execution of multidimensional queries.

The first two levels are mandatory in all OLAP tools. The third level, although widely used, is not required, since data for multidimensional representation can also be retrieved from ordinary relational structures; the multidimensional query processor in this case translates multidimensional queries into SQL queries that are executed by a relational DBMS.

Specific OLAP products are typically either a multidimensional data presentation tool, an OLAP client (for example, Pivot Tables in Excel 2000 from Microsoft or ProClarity from Knosys), or a multidimensional backend DBMS, OLAP server (for example, Oracle Express Server or Microsoft OLAP Services).

The multidimensional processing layer is usually built into the OLAP client and/or OLAP server, but can be isolated in its purest form, such as Microsoft's Pivot Table Service component.

Technical aspects of multidimensional data storage

As mentioned above, OLAP analysis tools can also extract data directly from relational systems. This approach was more attractive at a time when OLAP servers were not on the price lists of leading database vendors. But today, Oracle, Informix, and Microsoft offer full-fledged OLAP servers, and even those IT managers who do not like to plant a "zoo" of software in their networks different manufacturers, can buy (more precisely, apply with a corresponding request to the company's management) an OLAP server of the same brand as the main database server.

OLAP servers, or multidimensional database servers, can store their multidimensional data in different ways. Before considering these methods, we need to talk about such important aspect like storage aggregates. The fact is that in any data warehouse - both conventional and multidimensional - along with detailed data retrieved from operational systems, summary indicators (aggregated indicators, aggregates) are also stored, such as the sums of sales volumes by months, by categories goods, etc. Aggregates are stored explicitly for the sole purpose of speeding up query execution. After all, on the one hand, as a rule, a very large amount of data is accumulated in the storage, and on the other hand, analysts in most cases are interested not in detailed, but in generalized indicators. And if millions of individual sales had to be summed up each time to calculate the amount of sales for the year, the speed would most likely be unacceptable. Therefore, when loading data into a multidimensional database, all total indicators or part of them are calculated and saved.

But, as you know, you have to pay for everything. And you have to pay for the speed of processing queries to summary data by increasing the amount of data and the time it takes to load them. Moreover, an increase in volume can become literally catastrophic - in one of the published standard tests a complete count of aggregates for 10 MB of original data required 2.4 GB, i.e. the data grew 240 times! The degree of data "swelling" when calculating aggregates depends on the number of cube dimensions and the structure of these dimensions, i.e., the ratio of the number of "fathers" and "children" at different levels of the dimension. To solve the problem of storage of aggregates are sometimes used complex schemes, which allow, when calculating far from all possible aggregates, to achieve a significant increase in the performance of query execution.

Now about the various options for storing information. Both detail data and aggregates can be stored in either relational or multidimensional structures. Multidimensional storage allows you to treat data as a multidimensional array, which provides the same fast calculation of totals and various multidimensional transformations on any of the dimensions. Some time ago, OLAP products supported either relational or multidimensional storage. Today, as a rule, the same product provides both of these types of storage, as well as a third type - mixed. The following terms apply:

MOLAP(Multidimensional OLAP) - both detailed data and aggregates are stored in a multidimensional database. In this case, the greatest redundancy is obtained, since multidimensional data completely contains relational data.
ROLAP(Relational OLAP) - detailed data remains where they "lived" originally - in a relational database; aggregates are stored in the same database in specially created service tables.
HOLAP(Hybrid OLAP) - detailed data stays in place (in a relational database), while aggregates are stored in a multidimensional database.

Each of these methods has its advantages and disadvantages and should be used depending on the conditions - the amount of data, the power of the relational DBMS, etc.

When storing data in multidimensional structures, there is potential problem"swelling" due to storage empty values. After all, if a place is reserved in a multidimensional array for all possible combinations of measurement labels, and only a small part is actually filled (for example, a number of products are sold only in a small number of regions), then most of the cube will be empty, although the place will be occupied. Modern OLAP products are able to cope with this problem.

To be continued. In the future, we will talk about specific OLAP products produced by leading manufacturers.

In the previous article of this series (see No. 2'2005), we talked about the main innovations of analytical services SQL Server 2005. Today we will take a closer look at the tools for creating OLAP solutions included in this product.

Briefly about the basics of OLAP

Before starting a conversation about the tools for creating OLAP solutions, let us recall that OLAP (On-Line Analytical Processing) is a technology for complex multidimensional data analysis, the concept of which was described in 1993 by E.F. Codd, the famous author of the relational data model. Currently, OLAP support is implemented in many DBMS and other tools.

OLAP cubes

What is OLAP data? To answer this question, consider a simple example. Let's assume that in the corporate database of a certain enterprise there is a set of tables containing information about the sales of goods or services, and based on them, the Invoices view was created with the fields Country (country), City (city), CustomerName (name of the client company), Salesperson (manager by sales), OrderDate (order date), CategoryName (product category), ProductName (product name), ShipperName (carrier company), ExtendedPrice (payment for the product), while the last of the listed fields, in fact, is the object of analysis .

Selecting data from such a view can be done using the following query:

SELECT Country, City, CustomerName, Salesperson,

OrderDate, CategoryName, ProductName, ShipperName, ExtendedPrice

FROM Invoices

Suppose we are interested in what is the total cost of orders placed by customers from different countries. To get an answer to this question, you need to make the following request:

SELECT Country, SUM (ExtendedPrice) FROM Invoices

GROUP BY Country

The result of this query will be a one-dimensional set of aggregate data (in this case, sums):

Country	SUM (ExtendedPrice)
Argentina	7327.3
Austria	110788.4
Belgium	28491.65
Brazil	97407.74
Canada	46190.1
Denmark	28392.32
Finland	15296.35
France	69185.48
	209373.6
	...

If we want to know what is the total cost of orders placed by customers from different countries and delivered by different delivery services, we must execute a query that contains two parameters in the GROUP BY clause:

SELECT Country, ShipperName, SUM (ExtendedPrice) FROM Invoices

GROUP BY COUNTRY, ShipperName

Based on the results of this query, you can create a table like this:

Such a set of data is called a pivot table.

SELECT Country, ShipperName, SalesPerson SUM (ExtendedPrice) FROM Invoices

GROUP BY COUNTRY, ShipperName, Year

Based on the results of this query, you can build a three-dimensional cube (Fig. 1).

By adding additional parameters for analysis, you can create a cube with theoretically any number of dimensions, while along with the sums in the cells of the OLAP cube, the results of calculating other aggregate functions (for example, averages, maximum, minimum values, the number of records of the original view, corresponding to this set parameters). The fields on which results are calculated are called cube measures.

Hierarchies in dimensions

Suppose we are interested not only in the total cost of orders placed by customers in different countries, but also the total cost of orders placed by customers in different cities one country. In this case, you can take advantage of the fact that the values applied to the axes have different levels of detail - this is described within the concept of a hierarchy of changes. Let's say countries are on the first level of the hierarchy, cities are on the second. Note that, beginning with SQL Server 2000, analytics services support so-called unbalanced hierarchies, which contain, for example, members whose "children" are not in adjacent levels of the hierarchy or are missing for some change members. A typical example of such a hierarchy is taking into account the fact that in different countries there may or may not be such administrative-territorial units as a state or region, located in a geographical hierarchy between countries and cities (Fig. 2).

Note that recently it has become common practice to single out typical hierarchies, for example, containing geographic or temporal data, and also to support the existence of several hierarchies in one dimension (in particular, for the calendar and fiscal year).

Creating OLAP Cubes in SQL Server 2005

SQL Server 2005 cubes are created using SQL Server Business Intelligence Development Studio. This tool is a special version visual studio 2005, designed to solve this class of problems (and in the presence of an already installed development environment, the list of project templates is replenished with projects designed to create solutions on SQL based Sever and its analytical services). In particular, the Analysis Services Project template (Figure 3) is designed to create solutions based on analytical services.

To create an OLAP cube, you first need to decide on the basis of what data to form it. Most often, OLAP cubes are built on the basis of relational data stores with star or snowflake schemas (we talked about them in the previous part of the article). The SQL distribution package contains an example of such a repository - the AdventureWorksDW database, to use which as a source you need to find the Data Sources folder in the Solution Explorer, select the New Data Source context menu item and sequentially answer the questions of the corresponding wizard (Fig. 4).

Then it is recommended to create a Data Source View - a view on the basis of which the cube will be created. To do this, select the appropriate item from the context menu of the Data Source Views folder and answer the wizard's questions in sequence. The result of these actions will be a data schema, with the help of which a representation of data sources will be built, while in the resulting schema, instead of the original ones, you can specify "friendly" table names (Fig. 5).

The cube described in this way can be transferred to the server of analytical services by selecting the Deploy option from the context menu of the project, and viewing its data (Fig. 7).

When creating cubes, many features are currently used new version SQL Server, such as the view of data sources. The description of the initial data for constructing a cube, as well as the description of the cube structure, is now made using the Visual Studio tool familiar to many developers, which is a considerable advantage of the new version of this product - the study of analytical solutions by developers of new tools in this case is minimized.

Note that in the created cube, you can change the composition of measures, delete and add dimension attributes, and add calculated attributes of dimension members based on existing attributes (Fig. 8).

Rice. 8. Adding a Computed Attribute

In addition, in SQL Server 2005 cubes, you can automatically group or sort dimension members by attribute value, define relationships between attributes, implement many-to-many relationships, define key business indicators, and perform many other tasks (details on how all these steps are performed can be found in the SQL Server Analysis Services Tutorial help system this product).

In the subsequent parts of this publication, we will continue our acquaintance with SQL Server 2005 analytical services and find out what's new in the field of Data Mining support.