
The topic of data warehouse development seems to have entered a new cycle. New technologies, approaches and tools are emerging, and studying, testing and sensibly applying them makes it possible to create genuinely interesting and useful solutions, bring them into production, and enjoy seeing your work used in practice and delivering real benefit.

Epilogue

In preparing this article, I tried to focus primarily on architects, analysts and developers who work directly with data warehouses. But it turned out that I inevitably took the topic a little wider, and other categories of readers came into view. Some points will seem controversial, some unclear, some obvious: people differ in experience, background and position.
For example, typical managers' questions such as "when should we bring in architects?", "when should we do the architecture?" or "won't architecture be too expensive?" sound rather strange to us developers and designers, because for us a system's architecture exists from the moment the system is born, whether we are aware of it or not. And even if there is no formal architect role on the project, a competent developer always switches on an "internal architect".

In the grand scheme of things it does not matter who the architect is; what matters is that someone asks these questions and explores the answers to them. If the architect role is explicitly assigned, it only means that this person bears primary responsibility for the system and its evolution.
Why did the topic of "antifragility" seem relevant to me here?

"The uniqueness of antifragility is that it allows us to work with the unknown, to do something in conditions where we do not understand what exactly we are doing - and to succeed" /Nassim N. Taleb/
Therefore, a crisis and a high degree of uncertainty are not an excuse for the absence of architecture, but factors that reinforce the need for it.


5.1. Organization of data in corporate information systems.

Considering a CIS at the most simplified level, we can say that it comprises a corporate computer network and a specialized application software package for solving problems in the subject area. In turn, both the application software and the computer network rely on information about the state and development of the systems they control and manage. Historically, a CIS consists of separate, branched subsystems of individual enterprises, interconnected and often forming a hierarchical system. It is natural to assume that such subsystems have both their own data sources and their own storage locations for related data. When they are united into a single system, questions arise about the correct joint use of data that is physically stored in different places. Therefore, in order to successfully manage a production association equipped with a CIS, a reliable system for collecting, storing and processing data is needed. In other words, a unified information infrastructure is required that supports strategic BI (Business Intelligence) projects, or an integrated database for storing and using data. The main goal of data integration is to obtain a single and complete picture of the state of corporate business data. Integration itself is a complex process, in which it is advisable to single out:

Methods,

Technologies,

Products,

Applications.

Methods are approaches to data integration.

Technologies are processes that implement particular methods of data integration.

Products are commercial solutions that support one or another data integration technology.

Applications are ready-made technical solutions supplied by developers in accordance with the wishes of specific customers.

Depending on the complexity of corporate information systems and the tasks they are designed to solve, the organization of data in them differs somewhat. In particular, for a CIS designed to support effective management of the business processes of individual branches and of the corporation as a whole, it is customary to speak of corporate databases. For corporate information systems used at the highest levels of management, mostly associated with operational analysis and decision-making in the planning, design and forecasting of various management activities, the term data warehouse is used. It is worth noting that the phrase "integrated storage of information" applies to both.

5.2. Enterprise databases and their requirements

Being a system-wide integrated data store, the corporate database is designed to provide the information needed for effective management of all business processes and divisions of the corporation. Data integration involves the creation of a new structure that organically incorporates data from the databases of the individual divisions, so such a structure must meet certain requirements:

· Simple and user-friendly data entry into the database,

· Storage of data in a form that does not lead to excessive data growth,

· Availability of shared information to employees of all divisions of the corporation, subject to mandatory delimitation of access rights,

· Fast search and selection of the required information,

· Sorting and filtering of the necessary data,

· Grouping of similar data,

· Intermediate and final calculations over fields,

· Transformation and clear presentation of output data,

· Scalability,

· Protection against accidental failures, permanent data loss and unauthorized access.

In addition, when integrating separate (distributed) databases into a single corporate database, it is important that the user can work with the database as if it were not distributed.

An integrated corporate database can be created by various methods, the main ones being:

· Consolidation,

· Federation,

· Propagation.

5.3. Characteristics of integration solutions for corporate databases

Consolidation. Consolidation usually means bringing together data of the same kind. The term is widely used in the banking sector, where an annual consolidated balance sheet is prepared, presenting all the assets and liabilities of the parent bank together with its branches.

Applied to a corporation, this method copies and collects data from the primary databases (DB-Slave) into a single storage location (DB-Master). As a rule, the server of the central (head) office is chosen as such a storage location (Fig. 5.1).

Fig.5.1. Data Consolidation Method

The data in the DB-Master is used for reporting, analysis, development and decision-making, and also serves as a data source for other branches of the corporation.

The most common technologies supporting consolidation solutions are:

· Extraction, transformation and loading - ETL (Extract, Transform, Load);

· Enterprise content management - ECM (Enterprise Content Management).

The advantages of the consolidation method are:

1. The ability to transform (restructure, reconcile, clean and/or aggregate) significant amounts of data as they are transferred from the primary systems to the final storage location, thanks to ETL technology;

2. The ability to manage unstructured data, such as documents, reports and pages, thanks to ECM technology solutions.

To work with the consolidated CIS database, special business applications are used that allow users to create queries against the database, build reports and, on their basis, perform data analysis.

The disadvantage of integration through consolidation is that the consolidated data in the integrated storage location cannot be updated synchronously with updates in the primary systems.

There is a time lag between the moment data is updated in the primary systems and the moment it is updated in the final storage location.

This lag can range from a few seconds to several hours or even days.
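
To make the consolidation step concrete, the following is a minimal sketch of a periodic ETL job in Python. The branch databases (branch_a.db, branch_b.db), the sales table with columns id, amount and sale_date, and the use of sqlite3 are all illustrative assumptions made for the example, not part of any particular ETL product.

    # Minimal ETL consolidation sketch (assumed schema: each branch database has a
    # "sales" table with columns id, amount, sale_date; names are illustrative).
    import sqlite3

    BRANCH_DBS = ["branch_a.db", "branch_b.db"]   # DB-Slave sources (assumed paths)
    MASTER_DB = "master.db"                       # DB-Master consolidated store

    def consolidate():
        master = sqlite3.connect(MASTER_DB)
        master.execute("""CREATE TABLE IF NOT EXISTS sales_consolidated (
                              branch TEXT, id INTEGER, amount REAL, sale_date TEXT,
                              PRIMARY KEY (branch, id))""")
        for path in BRANCH_DBS:
            src = sqlite3.connect(path)
            # Extract
            rows = src.execute("SELECT id, amount, sale_date FROM sales").fetchall()
            # Transform: a trivial cleaning step - drop records without an amount
            rows = [(path, i, a, d) for (i, a, d) in rows if a is not None]
            # Load: upsert into the master store
            master.executemany(
                "INSERT OR REPLACE INTO sales_consolidated VALUES (?, ?, ?, ?)", rows)
            src.close()
        master.commit()
        master.close()

    if __name__ == "__main__":
        consolidate()   # typically scheduled as a nightly batch, hence the time lag

Because such a job runs on a schedule rather than inside the primary systems' transactions, the time lag described above is inherent to the approach.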

Federation. Federation usually means a union. A similar term is often used in politics when describing the structure of a state (for example, Germany, Russia, the USA).

Data federation in a corporate database means creating a virtual (apparent) picture that combines several primary data sources into a single virtual whole (see Fig. 5.2). Data federation itself consists in extracting data from the primary systems in response to external requests. The corporate database integrated by the federation method is managed by a federation processor.

Fig.5.2. Data federation method

When a business application turns to the virtual database for data, it forms a request against the virtual picture. Based on this request, the federation processor extracts data from the relevant primary systems, integrates it in accordance with the virtual picture, and returns the result to the business application that issued the request. All the necessary data transformations are carried out as the data is extracted from the primary systems.

The federated approach to data integration is supported by Enterprise Information Integration (EII) technology.

A feature of the federated solution is that to access the primary data, the federation processor uses metadata (knowledge), which includes information on the composition and characteristics of the virtual picture, on data volumes, on the semantic relationships between the data and on the ways of accessing them; this metadata helps the federated solution optimize access to the primary systems.
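
Below is a minimal sketch, in Python, of how such a federation processor might answer a request against a virtual table. The virtual picture is represented as plain metadata; the source databases (crm.db, billing.db), table and column names are illustrative assumptions, not a description of any specific EII product.

    # Sketch of a federation "processor": the virtual picture is just metadata that
    # maps one logical table onto several primary sources; nothing is copied ahead
    # of time. Source names, tables and columns are assumptions for illustration.
    import sqlite3

    VIRTUAL_CUSTOMERS = {                 # metadata describing the virtual picture
        "columns": ["customer_id", "name", "region"],
        "sources": [
            {"db": "crm.db",     "query": "SELECT customer_id, name, region FROM customers"},
            {"db": "billing.db", "query": "SELECT cust_id, full_name, area FROM clients"},
        ],
    }

    def federated_query(region):
        """Answer a request against the virtual table by visiting each primary system."""
        result = []
        for source in VIRTUAL_CUSTOMERS["sources"]:
            conn = sqlite3.connect(source["db"])
            for row in conn.execute(source["query"]):
                record = dict(zip(VIRTUAL_CUSTOMERS["columns"], row))  # transform on extraction
                if record["region"] == region:
                    result.append(record)
            conn.close()
        return result   # integrated answer returned to the calling business application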

The main advantages of the federated approach are:

· the ability to access current data without creating an additional new database,

· its usefulness after company acquisitions or mergers,

· its indispensability when, for security or licensing reasons, data may not be copied from the primary systems,

· its suitability when local divisions of the corporation need a high degree of autonomy combined with flexible centralized control of their activities,

· its high usefulness for large transnational corporations.

The disadvantages of the approach include:

· Reduced performance due to the additional cost of accessing multiple data sources,

· Suitability mainly for extracting small amounts of data,

· High demands on the quality of the primary data.

Propagation. Propagation usually means the transfer of replicated objects to another location. Data propagation is the reproduction of the primary databases and their movement from one place to another. With this method, business applications operate online and move data to the destinations when certain events occur. For such a technical solution the question of updating the data becomes important; updates can be performed in synchronous or asynchronous mode. Synchronous mode assumes that the updates in the primary system and in the end system occur within the same physical transaction.
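
The following sketch, in Python, contrasts the two update modes in a simplified form. The customers table and the queue used as a stand-in for an EAI/EDR transport are illustrative assumptions; the synchronous variant only approximates a single physical transaction by committing the two connections together.

    # Sketch of event-driven data propagation. Synchronous mode applies the change
    # to source and target inside one logical unit of work; asynchronous mode only
    # queues the event for later delivery. Queue and table names are illustrative.
    import queue
    import sqlite3

    event_queue = queue.Queue()           # stand-in for an EAI/EDR transport

    def update_customer_sync(source_db, target_db, customer_id, new_address):
        src, dst = sqlite3.connect(source_db), sqlite3.connect(target_db)
        try:
            src.execute("UPDATE customers SET address=? WHERE id=?", (new_address, customer_id))
            dst.execute("UPDATE customers SET address=? WHERE id=?", (new_address, customer_id))
            src.commit()
            dst.commit()                  # both changes become visible together (simplified)
        except Exception:
            src.rollback()
            dst.rollback()
            raise
        finally:
            src.close()
            dst.close()

    def update_customer_async(source_db, customer_id, new_address):
        src = sqlite3.connect(source_db)
        src.execute("UPDATE customers SET address=? WHERE id=?", (new_address, customer_id))
        src.commit()
        src.close()
        # the change reaches the destination later, when a delivery worker drains the queue
        event_queue.put(("customers", customer_id, new_address))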

Examples of technologies that support the implementation of the data propagation method are:

· Enterprise application integration - EAI (Enterprise Application Integration),

· Enterprise data replication - EDR (Enterprise Data Replication).

The generalized structure of the data propagation method is shown in Fig. 5.3.

Fig.5.3. Data propagation method

A distinctive feature of the data propagation method is the guaranteed delivery of data to the destination system with a minimum delay, close to real time.

The combination of integration (EAI) and replication (EDR) technologies in this method provides the following advantages:

· High performance,

· The possibility of restructuring and cleaning the data,

· Load balancing through the creation of backups and data recovery.

Hybrid approach. The reality of economic activity is such that no two enterprises, let alone two corporations, are identical. This circumstance leaves its mark on the process of creating and populating a CIS, and it applies fully to the methods of integrating data in databases. For this reason, many corporate information systems use a so-called hybrid approach in their data integration applications, combining several integration methods at once. Examples of this approach are technologies that provide a consistent picture of customer information:

· Customer data integration in CDI (Customer Data Integration) systems,

· Customer data integration in CRM (Customer Relationship Management) modules.

In particular, CDI can be implemented in various ways.

The simplest way is to create a consolidated customer database that contains data from the primary systems. The staleness of the information can then be controlled by using different consolidation modes, operational or batch, depending on how often the information is updated.

The second way is data federation, in which virtual business views of the customer data contained in the primary systems are created, while the metadata file contains common key elements that can be used to link the customer information.

Thus, common customer data (for example, contact details), being the most static, can be consolidated, while more dynamic data (such as order details) can be federated.

Moreover, the hybrid approach can be extended using the data dissemination method. For example, a client using the services of an Internet store changes his details during the service. These changes can be sent to the consolidated part of the database, and from there propagated to all primary systems containing store customer data.

Keeping in mind the advantages and disadvantages of each method, it is advisable to be creative in applying and combining them.

For example, data federation is useful when the cost of consolidating data outweighs the business benefits that consolidation provides; prompt processing of requests and preparation of reports is exactly such a situation.

The practical application of the data propagation method is also very diverse, both in terms of performance and in terms of the ability to restructure and clean the data.

5.4. The concept and structural solutions of data warehouses

A data warehouse is a subject-oriented, integrated store of information that accumulates external and operational data, as well as data from other systems, on the basis of which decision-making and data analysis processes are built.

Unlike databases and data banks, data warehouses are based not on internal but on external data sources: various information systems, electronic archives, public electronic catalogs, directories and collections.

The concept of data warehouses is based on two main ideas:

1. Integration of disparate detailed data (describing specific facts, properties, events, etc.) into a single repository.

2. Separation of the data sets and applications used for operational processing from those used for analysis.

A data warehouse is set up when it is necessary to obtain:

· Integration of current and historical data values,

· Consolidation of data from disparate sources,

· Creation of a reliable data platform for analytical purposes,

· Data consistency across the organization,

· Easier introduction of corporate data standards without changing the existing operational systems,

· A broad historical picture and the ability to analyze development trends.

Historically, data warehouses have been built using one-, two- and three-tier schemes.

Single-tier schemes were originally intended for the simplest architectures, including functional DSS with an underdeveloped information infrastructure, where analysis is carried out directly on data from the operational systems according to the principle: data - presentation forms.

The advantages of such schemes are:

Fast data transfer from operational systems to a specialized system without intermediate links,

· Minimum costs due to the use of a single platform.

Disadvantages:

Narrow range of issues to be resolved due to a single source of data,

· Poor data quality due to the lack of a cleaning step.

Two-level schemes provide a chain: data - data marts - presentation forms. They are used in corporations with a large number of independent divisions using their own information technologies.

Advantages:

The data marts used are designed to answer a specific set of questions,

· The data in the marts can be optimized, which improves performance.

Disadvantages:

Difficulty in ensuring data consistency because the same data is repeated in many marts,

Potential difficulty of populating the marts when there is a large number of data sources,

· Since data is not consolidated at the corporate level, there is no single picture of the business.

Evolution has led to full-fledged data warehouses for modern corporate systems being built on a three-tier architecture (see Fig. 5.4).

The first tier contains various data recording systems that serve as data sources: enterprise resource planning (ERP) systems, reference (operational) systems, external sources or systems supplying data from news agencies, and so on.

The second tier contains the central repository, into which data from all the first-tier sources is collected, as well as an operational data store, which performs two functions:

The operational store is a source of analytical information used for operational management,

· In the operational store, data is prepared for subsequent loading into the central repository. Data preparation means checking and transforming the data in view of the different schedules according to which data arrives from the first tier.

The third tier is a collection of domain-specific data marts.

Data marts are relatively small, functionally oriented data stores whose contents support the analytical tasks of individual divisions of the corporation. In essence, data marts are subsets of the warehouse data. At the same time, end users can access the detailed warehouse data when the data mart does not contain enough data, as well as to obtain a fuller picture of the state of the business.

Fig.5.4. Data warehouse architecture

The main technological operations of data warehouses organized in this way are listed below (a small sketch of the cleaning step follows the list):

· data extraction - the process of transferring data from heterogeneous sources into the operational store,

· data transformation - the modification of data according to special rules with its subsequent transfer to the central repository,

· data cleaning - the elimination of duplicates among the data coming from different sources,

· data refresh - the propagation of data updates to the base tables and derived data held in the warehouse.
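
As a minimal illustration of the cleaning operation, the sketch below collapses duplicate records for the same business key that arrive from different source systems, keeping the most recently extracted version. The key, the field names and the extracted_at timestamp are assumptions made for the example.

    # Minimal sketch of the cleaning step: records for the same business key arrive
    # from several sources and are collapsed to a single row before loading into
    # the central repository. Key and field names are illustrative assumptions.
    def clean(records, key=("customer_id",)):
        """Drop duplicates coming from different source systems, keeping the newest row."""
        best = {}
        for rec in records:
            k = tuple(rec[f] for f in key)
            # assume every record carries a load timestamp set during extraction
            if k not in best or rec["extracted_at"] > best[k]["extracted_at"]:
                best[k] = rec
        return list(best.values())

    rows = [
        {"customer_id": 7, "name": "Ivanov",    "extracted_at": "2024-01-10", "source": "ERP"},
        {"customer_id": 7, "name": "Ivanov I.", "extracted_at": "2024-02-01", "source": "CRM"},
    ]
    print(clean(rows))   # one record per customer survives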

Advantages:

Populating the data marts is simplified because a single source of cleaned data is used,

· The data marts are synchronized with the corporate picture of the business, which makes it easy to expand the central repository and add new data marts,

· Guaranteed performance.

Disadvantages:

The presence of data redundancy, which increases the requirements on the data storage technology.

5.5. Database management systems and data access technologies in CIS

A database management system (DBMS) is a set of language and software tools designed to create, maintain and share a database among one or many users.

Currently, the most widely used DBMSs are built on the relational data model, which is described by the strict mathematical apparatus of relation theory.

A feature of the DBMSs operating in a CIS is that they have to manage databases stored on geographically distributed media.

To avoid unnecessary duplication or copying of data in the CIS, the main emphasis is placed on remote data processing. The databases in a CIS contain data needed by many users. Simultaneous access by several users to the database is possible when a DBMS that serves the users and works with a single database is installed on the local computer network.

The main technological solutions for multi-user work with databases are the file/server and client/server technologies. Taking the more suitable of these technologies, client/server, the CIS organizes specialized systems for processing distributed databases. In this case, distributed databases are managed in such a way that the data is distributed at the physical rather than the logical level, and the database itself is treated as a single "superschema". In a distributed database, the administrator's functions are shared between the administrator of the integrated database and the administrators of the local databases. The administrator of the integrated database controls the delimitation of access rights of different users to the database and ensures the integrity and safety of the data, as well as its protection from simultaneous modification by several users. Access control is carried out in accordance with the rights granted to individual users in the network operating system.

A characteristic feature of programs created with a DBMS for working with remote and distributed corporate databases is the use of the open database connectivity interface, ODBC (Open Database Connectivity). All data transfer functions are assigned to the ODBC interface, which acts as a bridge between the DBMS of the integrated database and the DBMSs of the client applications. At the same time, the clients' DBMSs can interact not only with their local databases but also with the data located in the integrated database. A client can send requests to the DBMS of the integrated database, receive data in response, and send back its own updated data.
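
As a sketch of such client access through ODBC, the Python example below uses the pyodbc package. The DSN name, credentials, table and columns are assumptions for illustration; the actual connection string depends on the ODBC driver configured on the client machine.

    import pyodbc

    # connect through an ODBC data source configured on the client (DSN name is assumed)
    conn = pyodbc.connect("DSN=CorporateDB;UID=report_user;PWD=secret")
    cursor = conn.cursor()

    # read data from the integrated database
    cursor.execute("SELECT branch_id, SUM(amount) FROM sales GROUP BY branch_id")
    for branch_id, total in cursor.fetchall():
        print(branch_id, total)

    # send the client's own updated data back to the integrated database
    cursor.execute("UPDATE sales SET verified = 1 WHERE branch_id = ?", 42)
    conn.commit()
    conn.close()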

Industry Data Models

The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important for business development. In today's business environment, it is absolutely essential to have a clear understanding of the relationships between the various components and a good understanding of the big picture of the organization. Identification of all the details and relationships using models allows the most efficient use of time and tools for organizing the work of the company.

Data models are abstract models that describe how data is represented and accessed. Data models define data elements and relationships between them in a given area. A data model is a navigational tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real information. This improves communication within the organization and thus creates a more flexible and stable application environment.

The data model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, binary file, or text, where the value can be ambiguous).

As a rule, higher-level models (more general in content) are distinguished from lower-level models (correspondingly more detailed). The upper level of modeling is the so-called conceptual data model, which gives the most general picture of the functioning of an enterprise or organization. The conceptual model includes the main concepts or subject areas that are critical to the functioning of the organization; usually there are no more than 12-15 of them. Such a model describes the classes of entities important to the organization (business objects), their characteristics (attributes) and the associations between pairs of these classes (i.e., relationships). Since the terminology of business modeling has not yet fully settled, in various English-language sources conceptual data models may also be called subject area models or subject enterprise data models.

The next hierarchical level is the logical data model. It may also be referred to as the enterprise data model or business model. These models contain data structures, their attributes and business rules, and represent the information used by the enterprise from a business perspective. In such a model, data is organized in the form of entities and the relationships between them. The logical model represents the data in a way that is easily understood by business users. A data dictionary can be singled out in the logical model - a list of all entities with their exact definitions, which allows different categories of users to have a common understanding of all the input and output information flows of the model. The next, lower level of modeling is the physical implementation of the logical model using specific software tools and technical platforms.

The logical model contains the detailed enterprise business solution, which usually takes the form of a normalized model. Normalization is the process that ensures that each data element in the model has only one value and depends completely and uniquely on the primary key. Data elements are organized into groups according to their unique identification. The business rules that govern the data elements must be fully included in the normalized model, with a preliminary check of their validity and correctness. For example, a data element such as Customer Name would most likely be split into First Name and Last Name and grouped with other relevant data elements into a Customer entity with a primary key of Customer ID.
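
To make the Customer example tangible, the following is a tiny sketch of that normalized entity as DDL, expressed through Python's sqlite3 module. The column names follow the text above; the use of sqlite3 and everything else in the snippet is an illustrative assumption.

    # A sketch of the Customer example above in DDL form: the single "Customer Name"
    # element is split into first and last name and grouped under the Customer ID
    # primary key.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,     -- unique identification
            first_name  TEXT NOT NULL,           -- was part of "Customer Name"
            last_name   TEXT NOT NULL            -- was part of "Customer Name"
        )
    """)
    # each data element now has a single value that depends only on the primary key
    conn.execute("INSERT INTO customer VALUES (1, 'Ivan', 'Petrov')")
    conn.commit()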

The logical data model is independent of application technologies such as database, networking, or reporting tools and their physical implementation. An organization can only have one enterprise data model. Logical models typically include thousands of entities, relationships, and attributes. For example, a data model for a financial institution or a telecommunications company may contain about 3,000 industry concepts.

It is important to distinguish between logical and semantic data models. The logical data model represents the corporate business solution, while the semantic data model represents an applied business solution. The same corporate logical data model can be implemented using different semantic models, i.e. semantic models can be considered the next level of modeling, approaching the physical models. Moreover, each of these models represents a separate "slice" of the corporate data model in accordance with the requirements of various applications. For example, in the corporate logical data model the Client entity will be fully normalized, while in a semantic model for a data mart it may be represented as a multidimensional structure.

A company has two ways to create an enterprise logical data model: build it itself or use a ready-made industry model (industry logical data model). Here the difference in terms reflects only different approaches to building the same logical model. If a company independently develops and implements its own logical data model, such a model is usually simply called the corporate logical model. If the organization decides to use a finished product from a professional supplier, then we can speak of an industry logical data model. The latter is a ready-made logical data model that reflects the functioning of a particular industry with a high degree of accuracy. An industry logical model is a domain-specific and integrated view of all the information that must be in an enterprise data warehouse to answer both strategic and tactical business questions. Like any other logical data model, the industry model does not depend on applied solutions. It also does not include derived data or other calculations for faster data retrieval. As a rule, most of the logical structures of such a model translate well into an effective physical implementation. Such models are developed by many vendors for a wide variety of areas: finance, manufacturing, tourism, healthcare, insurance, etc.

An industry logical data model contains information that is common to the whole industry and therefore cannot be a complete solution for a company. Most companies have to extend the model by an average of 25%, adding data elements and expanding definitions. The ready-made models contain only the key data elements, and the remaining elements must be added to the appropriate business objects when the model is installed in the company.

Industry logical data models contain a significant number of abstractions. Abstraction refers to the union of similar concepts under common names such as Event or Participant. This adds flexibility to industry models and makes them more unified. Thus, the concept of Event is applicable to all industries.

Business Intelligence expert Steve Hoberman outlines five factors to consider when deciding whether to purchase an industry data model. The first is the time and resources needed to build the model. If an organization needs to achieve results quickly, then the industry model will give an advantage. Using an industry model may not immediately provide a picture of the entire organization, but it can save a significant amount of time. Instead of actual modeling, time will be spent linking existing structures to the industry model, as well as discussing how best to customize it to the needs of the organization (for example, which definitions should be changed and which data elements should be added).

The second factor is the time and money needed to keep the model running. If an enterprise data model is not part of a methodology that keeps it accurate and up-to-date, then the model will become obsolete very quickly. The industry data model can prevent this risk as it is kept up to date by external resources. Of course, changes occurring within the organization must be reflected in the model by the company itself, but industry changes will be reproduced in the model by its supplier.

The third factor is risk assessment and modeling experience. Creating an enterprise data model requires skilled resources from both the business and the IT staff. As a rule, managers know well either the work of the organization as a whole or the activities of a particular department. Few of them have both broad (company-wide) and deep (department-level) knowledge of their business. Most managers usually know only one area well. Therefore, in order to obtain a corporate-wide picture, significant business resources are required. This also increases the requirements for the IT staff. The more business resources required to create and test a model, the more experienced the analysts need to be. They must not only know how to obtain information from business personnel, but also be able to find common ground in controversial areas and present all this information in an integrated way. The person who creates the model (in many cases the same analyst) must have good modeling skills. Creating corporate logical models requires modeling "for the future" and the ability to convert complex business literally into "squares and lines".

On the other hand, an industry model allows you to draw on the experience of third-party specialists. Industry logical models are built with proven modeling methodologies by teams of experienced professionals, which helps avoid the common and costly problems that can arise when developing enterprise data models within an organization.

The fourth factor is the existing application infrastructure and vendor relationships. If an organization already uses many tools from the same vendor and has established relationships with them, then it makes sense to order the industry model from them as well. Such a model will be able to work freely with other products of the same supplier.

The fifth factor is intra-industry exchange of information. If a company needs to share data with other organizations operating in the same field, then an industry model can be very helpful in this situation. Organizations within the same industry use similar structural components and terminology. Nowadays, in most industries, companies are forced to share data to run their business successfully.

The industry models offered by professional vendors are the most effective. The high efficiency of their use is achieved due to a significant level of detail and accuracy of these models. They usually contain many data attributes. In addition, the creators of these models not only have extensive modeling experience, but also are well versed in building models for a particular industry.

Industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a necessary condition for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of the organizations surveyed found integration to be a significant barrier to the adoption of new applications. Conversely, implementing data integration brings the company significant returns.

The industry data model, in addition to linking with existing systems, provides great benefits for enterprise-wide projects such as enterprise resource planning (ERP), master data management, business intelligence, data quality improvement and employee development.

Thus, industry logical data models are an effective tool for integrating data and obtaining a holistic picture of the business. The use of logical models seems to be a necessary step towards the creation of corporate data warehouses.

Publications

  1. Steve Hoberman. Leveraging the Industry Logical Data Model as Your Enterprise Data Model
  2. Claudia Imhoff. Fast-Tracking Data Warehousing & Business Intelligence Projects via Intelligent Data Modeling




THEME V CORPORATE DATABASES

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet/Intranet technologies and corporate database access solutions.

V.1. DATA ORGANIZATION IN CORPORATE SYSTEMS. CORPORATE DATABASES

The corporate database is the central link of a corporate information system and makes it possible to create a single information space for the corporation (Figure 1.1).

There are various definitions of databases.

A database (DB) is understood as a set of information logically related in such a way as to constitute a single body of data stored in the storage devices of a computer. This body of data serves as the input for the tasks solved during the operation of automated control systems, data processing systems, and information and computing systems.

Briefly, a database can be defined as a collection of logically related data intended for shared use.

A database can also be understood as a collection of data stored together with minimal redundancy such that it can be used optimally for one or more applications.

The purpose of creating databases as a form of data storage is to build a data system that does not depend on the algorithms (software) adopted, the technical means used, or the physical location of the data in the computer. The database assumes multi-purpose use (several users, many forms of documents and queries of a single user).

Basic database requirements:

  • Completeness of data presentation. The data in the database should adequately represent all information about the object and should be sufficient for the data processing system.
  • Database integrity. The data must be preserved during its processing by the data processing system and in any situations that arise in the course of work.
  • Flexibility of the data structure. The database should allow data structures to be changed without violating its integrity and completeness when external conditions change.
  • Realizability. There must be an objective representation of the various objects, their properties and relationships.
  • Availability. Differentiated access to the data must be provided.
  • Minimal redundancy. The database should have minimal redundancy in representing data about any object.

Knowledge is understood as a set of facts, patterns and heuristic rules with which a problem can be solved.

A knowledge base (KB) is a collection of databases and of the rules used, obtained from decision makers. The knowledge base is an element of expert systems.

Different ways of representing data should be distinguished.

Physical data is data stored in the computer's memory.

The logical representation of data corresponds to the user's view of the physical data. The difference between the physical and the corresponding logical representation of data is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that combines, in one form or another, all the necessary data and knowledge about the organization being automated. In corporate information systems the related concept of integrated databases has emerged, in which the principle of single entry and multiple use of information is implemented.

Fig. 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases can be concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored in the storage devices of a single computer. Fig. 1.2 shows the scheme of a server application accessing databases on various platforms.

Fig.1.2. Diagram of a heterogeneous centralized database

The centralization of information processing made it possible to eliminate such shortcomings of traditional file systems as incoherence, inconsistency and redundancy of data. However, as databases grow, and especially when they are used in geographically dispersed organizations, problems arise. For example, for a concentrated database located at a telecommunications network node through which various departments of an organization access the data, the following difficulties arise as the volume of information and the number of transactions grow:

  • Large data exchange flow;
  • High network traffic;
  • Low reliability;
  • Low overall performance.

Although it is easier to ensure the security, integrity, and consistency of information during updates in a concentrated database, these problems create certain difficulties. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • Higher degree of processing simultaneity due to load sharing;
  • Improved use of data in the field when performing remote queries;
  • lower costs;
  • Easy to manage local databases.

The costs of creating a network with workstations (small computers) at its nodes are much lower than the costs of creating a similar system using a mainframe. Figure 1.3 shows a logical diagram of a distributed database.

Fig.1.3. Distributed corporate database.

We give the following definition of a distributed database.

A distributed database is a collection of information, files (relations) stored in different nodes of an information network and logically linked in such a way as to constitute a single body of data (the link may be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected but physically located on several machines that are part of the same computer network.

The most important requirements for the characteristics of a distributed database are as follows:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • Portability;
  • Location transparency;
  • Autonomy of distributed database nodes (Site Autonomy);
  • Processing of distributed requests;
  • Execution of distributed transactions;
  • Support for a homogeneous security system.

Location transparency allows users to work with databases without knowing anything about their location. The autonomy of the distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) that accesses objects (tables or views) of different databases. When distributed transactions are executed, concurrency control is exercised over all the databases involved. Oracle7 uses two-phase information transfer technology to perform distributed transactions.
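
To illustrate the two-phase idea behind distributed transactions, here is a schematic Python sketch of a coordinator that first asks every participant to prepare and only then commits everywhere. It is a simplified teaching sketch, not Oracle's implementation, and the Participant class is purely illustrative.

    # Schematic two-phase commit coordinator (simplified: error handling and
    # recovery after coordinator failure are omitted).
    class Participant:
        def __init__(self, name):
            self.name = name
        def prepare(self):       # phase 1: promise to be able to commit
            print(f"{self.name}: prepared")
            return True
        def commit(self):        # phase 2a
            print(f"{self.name}: committed")
        def rollback(self):      # phase 2b
            print(f"{self.name}: rolled back")

    def two_phase_commit(participants):
        # Phase 1: ask every involved database to prepare
        if all(p.prepare() for p in participants):
            # Phase 2: all voted yes - commit everywhere
            for p in participants:
                p.commit()
            return True
        # at least one participant cannot commit - roll back everywhere
        for p in participants:
            p.rollback()
        return False

    two_phase_commit([Participant("oracle_node"), Participant("db2_node")])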

The databases that make up a distributed database need not be homogeneous (i.e. run by the same DBMS) or run in the same operating system environment and/or on the same type of computers. For example, one database could be an Oracle database on a SUN computer running SunOS (UNIX), a second database could be run by a DB2 DBMS on an IBM 3090 mainframe with the MVS operating system, and a third by a SQL/DS DBMS, also on an IBM mainframe but with the VM operating system. Only one condition is obligatory: all machines with databases must be accessible over the network they are part of.

The main task of a distributed database is to distribute data over the network and provide access to it. This problem can be solved in the following ways (a small sketch of the splitting options follows the list):

  • Each node stores and uses its own set of data, which is available for remote queries. Such a distribution is called partitioned.
  • Some data that is frequently used at remote sites may be duplicated. Such a distribution is called partially duplicated.
  • All data is duplicated in each node. Such a distribution is called fully duplicated.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), and the resulting subsets are stored in different nodes along with the unsplit data. Such a distribution is called split (fragmented).
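
The sketch below, in Python, shows the horizontal and vertical splitting options on a toy employee relation; the records, field names and the choice of which fragment stays where are illustrative assumptions.

    # Sketch of horizontal and vertical splitting (fragmentation) of one relation
    # before placing the fragments on different nodes.
    employees = [
        {"id": 1, "name": "Ivanov", "dept": "HQ",     "salary": 900},
        {"id": 2, "name": "Petrov", "dept": "BRANCH", "salary": 700},
    ]

    def split_horizontal(rows, predicate):
        """Select a subset of records (e.g. only the branch's own employees)."""
        return [r for r in rows if predicate(r)]

    def split_vertical(rows, fields=("id", "name")):
        """Select a subset of attribute fields; the key is kept so fragments can be rejoined."""
        return [{f: r[f] for f in fields} for r in rows]

    branch_fragment = split_horizontal(employees, lambda r: r["dept"] == "BRANCH")
    public_fragment = split_vertical(employees, ("id", "name"))   # salary stays at HQ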

When creating a distributed database at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual schema for the entire network. This provides logical data transparency for the user, who can then formulate a query against the entire database from a separate terminal as if working with a centralized database.
  • A schema is needed to locate data on the network. This will provide transparency in data placement so that the user does not have to specify where to forward the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The problem of heterogeneity is relatively easy to solve if the distributed database is heterogeneous in hardware but homogeneous in software (the same DBMS in every node). If different DBMSs are used in the nodes of a distributed system, means of converting data structures and languages are needed. This should provide transparency of conversion across the distributed database nodes.
  • It is necessary to solve the problem of managing dictionaries. To provide all kinds of transparency in a distributed database, programs that manage numerous dictionaries and reference books are needed.
  • It is necessary to define methods for executing queries in a distributed database. Methods for executing queries in a distributed database differ from similar methods in centralized databases, since individual parts of queries must be executed at the location of the corresponding data and transfer partial results to other nodes; at the same time, coordination of all processes should be ensured.
  • It is necessary to solve the problem of parallel execution of queries. In a distributed database, a complex mechanism for managing concurrent processing is needed, which, in particular, must ensure synchronization when information is updated, which guarantees data consistency.
  • The need for a developed methodology for the distribution and placement of data, including splitting, is one of the main requirements for a distributed database.

One of the actively developing new areas of computer systems architecture, and a powerful tool for non-numerical information processing, is database machines. Database machines are used to solve non-numerical tasks such as storing, searching and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about the objects of the surrounding world, numerical and non-numerical processing put different content into the concept of data. Numerical processing uses objects such as variables, vectors, matrices, multi-dimensional arrays, constants and so on, while non-numerical processing uses objects such as files, records, fields, hierarchies, networks, relations and so on. Non-numerical processing is concerned directly with information about objects (for example, a particular employee or a group of employees) rather than with the employee file itself: it does not index the employee file to select a particular person; it is interested in the content of the desired record. Huge volumes of information are usually subjected to non-numerical processing. In various applications, operations such as the following can be performed on this data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all customers;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find the description of the desired contract in a file containing legal documents;
  • view all files containing descriptions of patents and find a patent (if any) similar to the one proposed again.

Database machines are implemented using parallel and associative architectures as an alternative to the uniprocessor von Neumann structure, which allows working with large amounts of information in real time.

Database machines are gaining importance in connection with the exploration and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Information warehouses. Today, many recognize that most companies already operate several databases and, in order to work successfully with information, they need not just different types of databases but different generations of DBMS. According to statistics, each organization uses an average of 2.5 different DBMSs. The need to "isolate" the business of companies, or rather the people engaged in this business, from the technological features of databases and to provide users with a single view of corporate information regardless of where it is physically stored has become obvious. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is the creation of a single logical representation of the data contained in different types of databases, or, in other words, a single corporate data model.

A new round of DW development became possible thanks to the improvement of information technology in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with intuitive graphical interfaces were created, making it easy to build complex database queries. Various middleware provided communication between different types of databases, and, finally, information storage devices fell sharply in price.

A data bank may be present in the structure of a corporation.

A data bank is a functional and organizational component of automated control systems and information and computing systems that provides centralized information support to a group of users or to a set of tasks solved in the system.

A data bank is considered an information and reference system whose main purposes are:

  • to accumulate and maintain in working order the body of information that constitutes the information base of the entire automated system or of some set of tasks solved in it;
  • to issue the data required by a task or a user;
  • to provide collective access to the stored information;
  • to ensure the necessary management of the use of the information contained in the information base.

Thus, a modern data bank is a complex software and hardware complex, which includes technical, system and network tools, databases and DBMS, information retrieval systems for various purposes.

V .2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

A DBMS is a set of software and language tools designed to create, maintain and use databases.

The database management system provides data processing systems with access to databases. As already noted, DBMSs play an important role in the creation of corporate information systems, and an especially important role in the creation of information systems that use distributed information resources based on modern networked computing technologies.

The main feature of modern DBMSs is that they support technologies such as:

  • Client/server technology.
  • Support for database languages: a schema definition language (SDL - Schema Definition Language), a data manipulation language (DML - Data Manipulation Language), the integrated language SQL (Structured Query Language), QBE (Query-By-Example) and QMF (Query Management Facility), an advanced peripheral tool for query specification and report generation for DB2, etc.
  • Direct management of data in external memory.
  • Memory buffer management.
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW.
  • Data protection and integrity. Use of the system is allowed only to users who have the right to access the data. When users perform operations on the data, the consistency of the stored data (integrity) is maintained. This is important in corporate multi-user information systems.
  • Journaling.

Modern DBMS must meet the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful support for the conceptual data model to display custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Minimal data redundancy. Unlike file systems, a database must be a single set of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must keep users from violating the integrity of the database.
  • Managing concurrent work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be performed in a certain order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS should support both centralized and distributed databases and thus become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two of the most significant features that determine the types of DBMS. According to them, the DBMS can be considered from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • DBMS "desktop". These products are primarily focused on working with personal data (desktop data). They have command sets for sharing common databases, but they are small in size (small office type). First of all, it is a DBMS like Access, dBASE, Paradox, ExPro. Why Access, dBASE, Paradox, ExPro have poor access to corporate data. The fact is that there is no easy way to overcome the barrier between personal and corporate data. And the point is not even that the mechanism of a personal data DBMS (or a small office) is focused on accessing data through many gateways, gateway products, etc. The problem is that these mechanisms typically involve full file transfers and a lack of extensive index support, resulting in queues to the server that are practically a halt in large systems.
  • Specialized high-performance multi-user DBMS. Such DBMSs are characterized by the presence of a multi-user system kernel, a data manipulation language, and the following functions that are typical for developed multi-user DBMSs:
  • organizing a buffer pool;
  • the presence of a system for processing transaction queues;
  • the presence of mechanisms for multi-user data blocking;
  • transaction logging;
  • availability of access control mechanisms.

DBMSs of this type, such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium and others, provide a wide range of services for processing corporate databases.

When working with databases, the mechanism of transactions is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements that is executed as a whole (all or nothing) and transfers the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity . The transaction is executed as an atomic operation - either the entire transaction is executed, or the entire transaction is not executed.
  • (C) Consistency. A transaction moves a database from one consistent (consistent) state to another consistent (consistent) state. Within a transaction, database consistency can be broken.
  • (I) Isolation . Transactions of different users should not interfere with each other (for example, as if they were performed strictly in turn).
  • (D) Durability. If the transaction is completed, then the results of its work should be saved in the database, even if the system crashes at the next moment.

A transaction usually starts automatically from the moment the user connects to the DBMS and continues until one of the following events occurs (a minimal sketch of commit and rollback behaviour follows the list):

  • A COMMIT WORK command has been issued (to commit a transaction).
  • ROLLBACK WORK command issued.
  • The user has disconnected from the DBMS.
  • There was a failure of the system.
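
The sketch below illustrates the "all or nothing" behaviour with explicit commit and rollback, using Python's sqlite3 module as a stand-in DBMS; the accounts table and the transfer scenario are assumptions made for the example.

    # Either both updates of a transfer become durable (COMMIT WORK) or neither does
    # (ROLLBACK WORK), so the database moves between consistent states.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    conn.commit()

    def transfer(amount, src, dst):
        try:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            balance = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.commit()      # COMMIT WORK: both updates become durable together
        except Exception:
            conn.rollback()    # ROLLBACK WORK: return to the previous consistent state
            raise

    transfer(30.0, 1, 2)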

For the user, a transaction usually appears atomic. In reality, it is a complex mechanism of interaction between the user (application) and the database. Enterprise systems software uses a real-time transaction processing engine (Online Transaction Processing, OLTP); in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a great deal of information. These systems are designed (and appropriately optimized) for processing large amounts of data, complex transactions, and intensive read/write operations.

Unfortunately, the information placed in the databases of OLTP systems is not very suitable for use by ordinary users (because of the high degree of table normalization, specific data presentation formats, and other factors). Therefore, data from the different information pipelines is sent (that is, copied) to a warehouse for storage, sorting and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by information (data) warehouses.

The delivery of information to the end user is handled by systems for real-time analytical data processing (On-line Analytical Processing, OLAP), which provide extremely easy access to data through convenient tools for generating queries and analyzing results. In OLAP systems, the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for fast data extraction and collection of summarized information, and they are aimed at ordinary users (they have an intuitive interface). If an OLTP system answers simple questions like "what was the level of sales of product N in region M in January 199x?", OLAP systems are ready for more complex user requests, for example: "Give an analysis of sales of product N for all regions according to the plan for the second quarter compared with the previous two years."
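
As a sketch of the contrast, the two query strings below (held in Python for consistency with the other examples) show what such requests might look like in SQL. The sales table, its columns and the period encoding are illustrative assumptions: the OLTP query touches a single fact, while the OLAP query aggregates across regions and years.

    # Illustrative OLTP point query versus OLAP aggregate query (assumed schema).
    OLTP_QUERY = """
        SELECT amount
        FROM sales
        WHERE product = 'N' AND region = 'M' AND period = '199x-01'
    """

    OLAP_QUERY = """
        SELECT region,
               SUM(CASE WHEN year = 2024 THEN amount END) AS plan_q2,
               SUM(CASE WHEN year = 2023 THEN amount END) AS prev_year,
               SUM(CASE WHEN year = 2022 THEN amount END) AS two_years_ago
        FROM sales
        WHERE product = 'N' AND quarter = 2
        GROUP BY region
    """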

Client/server architecture

In modern distributed information processing systems, client/server technology takes center stage. In client-server architectures, data processing is divided between a client computer and a server computer, which communicate over a network. This separation of data processing is based on the grouping of functions: typically, a database server computer is allocated to perform database operations, while the client computer runs the application programs. Figure 2.1 shows a simple client-server system that includes a computer acting as the server and another computer acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client-server architecture system: a database server computer connected over the network to IBM-compatible client PCs running the applications.

The main function of the client computer is to run the application (user interface and presentation logic) and communicate with the server when required by the application.

A server is an object (a computer) that provides services to other objects at their request.

As the term implies, the main function of the server computer is to serve the needs of the client. The term "server" is used to refer to two different groups of functions: a file server and a database server (hereinafter, these terms mean, depending on the context, either the software that implements these groups of functions or the computers running this software). File servers are not designed to perform database operations; their main function is to share files among several users, i.e. to provide many users with simultaneous access to files on the file server computer. An example of a file server is Novell's NetWare operating system. A database server can be installed and run on a file server computer: the Oracle DBMS, in the form of an NLM (NetWare Loadable Module), runs in the NetWare environment on a file server.

The local network server must have resources that correspond to its functional purpose and the needs of the network. Note that due to the orientation towards the open systems approach, it is more correct to speak of logical servers (meaning a set of resources and software tools that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is expedient to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the application programs using it.

One of the important requirements for the server is that the operating system hosting the database server must be multitasking (and preferably, though not necessarily, multiuser). For example, the Oracle DBMS installed on a personal computer running MS-DOS (or PC-DOS), which does not meet the multitasking requirement, cannot be used as a database server. The same Oracle DBMS installed on a computer with the multitasking (though not multiuser) OS/2 operating system can be a database server. Many varieties of UNIX, MVS, VM, and some other operating systems are both multitasking and multiuser.

Distributed Computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on several machines for end users using various means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

Client/server technology rests on such basic technologies as:

  • Technologies of operating systems, the concept of interaction of open systems, creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical user interface (GUI) technologies;
  • Etc.

Advantages of client-server technology:

  • Client/server technology allows computing in heterogeneous computing environments. Platform independence: access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL/DS, Oracle, Sybase.
  • Load balancing between client and server.
  • Performing computations where they can be done most efficiently;
  • Provides an efficient scaling capability;
  • Cross-platform computing. Cross-platform computing is defined simply as the implementation of technologies in heterogeneous computing environments. The following capabilities should be provided here:
  • The application must run on multiple platforms;
  • On all platforms it should have the same interface and logic of operation;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • It should have simple and consistent support.

Distributed Computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downscaling. Downscaling is the transfer of mainframe applications to small computer platforms.

  • Reduce infrastructure and hardware costs. Cost-Effective: The availability of low-cost computing hardware and the growing prevalence of local area networks make client-server technology more cost-effective than other data processing technologies. Equipment can be upgraded as needed.

  • Reducing the overall application execution time;
  • Reducing client memory usage;
  • Reducing network traffic.

  • Ability to work with multimedia: To date, a lot of programs for working with multimedia for PCs have been created. There are either no such programs for terminal-host configuration, or they are very expensive.
  • The ability to use more computing resources for database operations: since applications run on the client computers, additional resources (compared with the terminal-host configuration), such as CPU time and main memory, are freed up on the server computer for database operations.
  • Increased programmer productivity: programmer productivity is increased by using tools such as SQL*Forms and CASE, which allow applications to be developed faster than programming languages such as C, PL/1, or COBOL.
  • Increasing end user productivity: Nowadays, many end users have embraced systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, etc.

The back-end interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. An illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

SQL language. The high-level query language SQL (Structured Query Language) is used to implement queries to databases, serving as both a data definition and a data manipulation language, and has been adopted as a standard. SQL was originally adopted as the data language of IBM's software products and of IBM's relational DBMS SYSTEM R. An important feature of SQL is that the same language is presented through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of a set of built-in SQL language features provided specifically for constructing interactive applications, where an interactive application is a program written to support database access by an end user working at an interactive terminal. SQL provides functions for defining, manipulating, and managing database data and is transparent to the user with respect to the underlying DBMS.
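The two interfaces can be illustrated roughly as follows. The first statement is ordinary interactive SQL typed at a terminal; the second fragment sketches the programming interface using dynamic SQL, written here as Oracle PL/SQL with EXECUTE IMMEDIATE (the customer table and the city value are hypothetical):

  -- Interactive interface: a query typed directly by the user
  SELECT customer_name, city
  FROM customer
  WHERE city = 'Moscow';

  -- Application programming interface (dynamic SQL), sketched in Oracle PL/SQL:
  DECLARE
      v_city  VARCHAR2(50) := 'Moscow';
      v_count NUMBER;
  BEGIN
      -- the statement text is built by the program and executed at run time
      EXECUTE IMMEDIATE
          'SELECT COUNT(*) FROM customer WHERE city = :1'
          INTO v_count
          USING v_city;
      DBMS_OUTPUT.PUT_LINE('Customers in ' || v_city || ': ' || v_count);
  END;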

Rice. 2.3. Scheme for executing user requests to distributed databases.

The internal structure of databases is determined by the data models used. The conceptual model has greater abstraction capabilities and richer semantics than the external models. External models are often called syntactic or operational models, referring to their syntactic nature and their use as the means of user interaction with the database. In information modeling there are various levels of abstraction, from the conceptual model down to the physical data model, and these affect the architecture of the DBMS.

The data model consists of three components:

  • A data structure that represents the database from the user's point of view.
  • The valid operations that can be performed on this data structure. It must be possible to work with the structure using DDL and DML operations; a rich structure is worthless if you cannot manipulate its contents.
  • Constraints for integrity control. The data model must be provided with means to preserve its integrity and protect it. As an example, consider the following two constraints (a SQL sketch of both follows this list):
  • Each subtree must have a source node. Hierarchical databases cannot store child nodes without a parent node.
  • In a relational database there cannot be identical tuples. For a file, this requirement means that all records must be unique.
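A hedged relational sketch of both constraints (the department and employee tables are hypothetical): the primary key forbids identical tuples, and the foreign key plays the role of the "child must have a parent" rule.

  CREATE TABLE department (
      dept_id   INTEGER     NOT NULL PRIMARY KEY,
      dept_name VARCHAR(80) NOT NULL
  );

  CREATE TABLE employee (
      emp_id    INTEGER      NOT NULL,
      full_name VARCHAR(100) NOT NULL,
      dept_id   INTEGER      NOT NULL,
      PRIMARY KEY (emp_id),                                  -- no two identical tuples
      FOREIGN KEY (dept_id) REFERENCES department (dept_id)  -- no "child" row without its parent
  );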

One of the most important characteristics of the DBMS is the ability to link objects.

There are the following types of links between objects (a sketch of their typical relational implementation follows the list):

  • One-to-One (1:1). One object of one set can be associated with one object of another set.
  • One-to-Many (1:M). One object of one set can be related to many objects of another set.
  • Many-to-Many (M:N). One object of one set can be associated with many objects of another set, but at the same time, one object of another set can be associated with many objects of the first set.
  • Branched. One object of one set can be associated with objects of many sets.
  • Recursive. One object of a given set can be associated with an object of the same set.
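In relational terms these link types are usually implemented with primary and foreign keys. A hedged sketch with illustrative tables follows (a 1:1 link is simply a foreign key that is also declared unique):

  -- One-to-Many (1:M): each order belongs to exactly one customer
  CREATE TABLE customer (
      customer_id   INTEGER PRIMARY KEY,
      customer_name VARCHAR(100) NOT NULL
  );

  CREATE TABLE orders (
      order_id    INTEGER PRIMARY KEY,
      customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
      order_date  DATE    NOT NULL
  );

  -- Many-to-Many (M:N): resolved through an intersection (associative) table
  CREATE TABLE product (
      product_id   INTEGER PRIMARY KEY,
      product_name VARCHAR(100) NOT NULL
  );

  CREATE TABLE order_line (
      order_id   INTEGER NOT NULL REFERENCES orders (order_id),
      product_id INTEGER NOT NULL REFERENCES product (product_id),
      quantity   INTEGER NOT NULL,
      PRIMARY KEY (order_id, product_id)
  );

  -- Recursive: an object linked to an object of the same set
  CREATE TABLE category (
      category_id        INTEGER PRIMARY KEY,
      parent_category_id INTEGER REFERENCES category (category_id)
  );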

There are the following main data models:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET / INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the client-server architecture is that, in accordance with the concept of open systems, they are required to be portable across the widest possible class of hardware and software solutions for open systems. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempts to create systems that support all possible protocols lead to their being overloaded with network details at the expense of functionality.

An even more complex aspect of this problem is related to the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, representation of numbers, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

The general solution to the problem of mobility of systems based on the client-server architecture is to rely on software packages that implement remote procedure call (RPC) protocols. With these tools, calling a service on a remote host looks like an ordinary procedure call. The RPC tools, which naturally contain all the information about the specifics of the local network equipment and network protocols, translate the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, the RPC programs convert client data formats into intermediate machine-independent formats and then into the server data formats. The same transformations are performed in reverse when the response parameters are passed back.


The purpose of the lecture

After studying the material of this lecture, you will know:

  • what an enterprise data model is;
  • how to convert an enterprise data model into a data warehouse model;
  • the essential elements of an enterprise data model;
  • the presentation layers of an enterprise data model;
  • the algorithm for converting an enterprise data model into a multidimensional data warehouse model;

and learn:

  • develop data warehouse models based on an organization's enterprise data model;
  • develop a star schema using CASE tools;
  • partition the tables of a multidimensional model using CASE tools.

Enterprise data model

Introduction

The core of any data warehouse is its data model. Without a data model, it is very difficult to organize data in a data warehouse. Therefore, data warehouse developers must spend time and effort developing such a model; this task falls on the shoulders of the data warehouse designer.

Compared with the design of OLTP systems, the methodology for designing a data warehouse has a number of distinctive features related to the orientation of the storage data structures towards the problems of analysis and information support for decision making. The data model of the data warehouse should provide an effective solution to exactly these problems.

The starting point in the design of a data warehouse can be the so-called enterprise data model (corporate data model, EDM), which is created in the process of designing the organization's OLTP systems. When designing an enterprise data model, an attempt is usually made to create, on the basis of business operations, a data structure that collects and synthesizes all the information needs of the organization.

Thus, the enterprise data model contains the information necessary to build a data warehouse model. Therefore, at the first stage, if such a model exists in the organization, the data warehouse designer can start designing the warehouse by solving the problem of transforming the enterprise data model into a data warehouse model.

Enterprise data model

How is the problem of converting an enterprise data model into a data warehouse model solved? To solve it, you first need to have such a model, i.e. the enterprise data model must be built and documented. You then need to understand what from this model should be transformed into the warehouse model, and how.

Let us clarify the concept of an enterprise data model. By a corporate data model we understand a multi-level, structured description of the organization's subject areas, the data structures of those subject areas, business processes and business procedures, the data flows adopted in the organization, state diagrams, data-process matrices, and other model representations that are used in the activities of the organization. Thus, in a broad sense, an enterprise data model is a set of models of various levels that characterize (model at some level of abstraction) the activities of the organization, i.e. the content of the corporate model depends directly on which model structures a given organization has included in it.

The main elements of an enterprise data model are:

  • description of the subject areas of the organization (definition of areas of activity);
  • relationships between the subject areas defined above;
  • information data model (ERD-model or "entity-relationship" model);
  • for each subject area description:
    • entity keys;
    • entity attributes;
    • subtypes and supertypes;
    • relationships between entities;
    • attribute groupings;
    • relationships between subject areas;
  • functional model or business process model;
  • data flow diagrams;
  • state diagrams;
  • other models.

Thus, the enterprise data model contains the entities, attributes, and relationships that represent the information needs of the organization. Fig. 16.1 shows the main elements of an enterprise data model.

Presentation layers of the enterprise data model

The enterprise data model is subdivided according to subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may group entities that describe products or services.

Each logical model must correspond to an existing subject area enterprise data model. If the logical model does not meet this requirement, a model that defines the subject area must be added to it.

An enterprise data model usually has several presentation layers. The high-level enterprise data model is a description of the main subject areas of the organization and their relationships at the level of entities. Fig. 16.2 shows a fragment of a top-level enterprise data model.

Fig. 16.2. A fragment of a top-level enterprise data model.

The diagram in the figure shows four subject areas: "Customer" (Customer), "Check" (Account), "Order" (Order) and "Product" (Product). Typically, at the top level of the model only the direct connections between subject areas are shown, which, for example, capture the following fact: the buyer pays the invoice for an order of goods. Detailed information and indirect relationships are not given at this level of the corporate model.

The next, middle level of the enterprise data model shows detailed information about the objects of the subject areas, i.e. entity keys and attributes, their relationships, subtypes and supertypes, etc. For each subject area of the top-level model there is one middle-level model. Fig. 16.3 depicts the middle-level presentation of the corporate model for a fragment of the "Order" subject area.

From fig. 16.3 it can be seen that the "Order" subject area includes several entities, defined through their attributes, and the relationships between them. The presented model allows you to answer questions such as the date of the order, who placed the order, who sent the order, who receives the order, and a number of others. From the diagram it can also be seen that in this organization there are two types of orders - orders for an advertising campaign (Commercial) and retail orders (Retail).
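Purely as an illustration of the conversion problem stated earlier (and not a reproduction of the model in fig. 16.3), the "Order" subject area could be mapped to a star schema along the following lines; every table and column name below is hypothetical:

  -- Dimension tables derived from the "Customer", "Product" and calendar entities
  CREATE TABLE dim_customer (
      customer_key  INTEGER PRIMARY KEY,
      customer_name VARCHAR(100),
      region        VARCHAR(50)
  );

  CREATE TABLE dim_product (
      product_key   INTEGER PRIMARY KEY,
      product_name  VARCHAR(100),
      product_group VARCHAR(50)
  );

  CREATE TABLE dim_date (
      date_key       INTEGER PRIMARY KEY,
      calendar_date  DATE,
      calendar_month INTEGER,
      calendar_year  INTEGER
  );

  -- Fact table derived from the "Order" subject area
  CREATE TABLE fact_order (
      customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
      product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
      date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
      order_type   VARCHAR(20),       -- e.g. 'Commercial' or 'Retail'
      quantity     INTEGER,
      amount       DECIMAL(12,2)
  );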

Note that an enterprise data model can represent various aspects of the organization's activities with varying degrees of detail and completeness. If the corporate model represents all aspects of the organization, it is also called an organization data model (enterprise data model).

From the point of view of designing a data warehouse, an important factor in the decision to build a data warehouse model from the enterprise data model is the state of completeness of the enterprise data model.

The corporate data model of an organization is evolutionary in nature, i.e. it is constantly developing and being improved. Some subject areas of the enterprise data model may be well developed, while for others work may not yet have begun. If a fragment of a subject area has not been worked out in the enterprise data model, there is no way to use this model as a starting point for designing the data warehouse.

The incompleteness of the corporate model can be compensated for in data warehouse design as follows. Since the development of a data warehouse is usually divided in time into a sequence of stages, its design can be synchronized with the completion of the development of individual fragments of the organization's enterprise data model.

The lowest presentation layer of the corporate data model displays information about the physical characteristics of the database objects corresponding to the logical data model of the middle presentation layer of the enterprise data model.

Zaitsev S.L., Ph.D.

Repeating groups

Repeating groups are attributes for which a single entity instance can have more than one value. For example, a person may have more than one skill. If, in terms of business requirements, we need to know the skill level of each, and each person can have only two skills, we can create the entity shown in fig. 1.6: the entity PERSON with two attributes to store the skills and the skill level of each.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person might have. In real life, some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Notice the added Skill ID, which uniquely identifies each SKILL.

Fig. 1.7. The model reduced to first normal form.
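A hedged relational rendering of figures 1.6 and 1.7 (all names are illustrative): the first table shows the repeating group, the second the same data reduced to first normal form.

  -- Before 1NF: a fixed number of skill "slots" as a repeating group
  CREATE TABLE person_with_repeating_group (
      person_id     INTEGER PRIMARY KEY,
      person_name   VARCHAR(100),
      skill_1       VARCHAR(50),
      skill_level_1 INTEGER,
      skill_2       VARCHAR(50),
      skill_level_2 INTEGER
  );

  -- In 1NF: each skill of a person becomes a separate row,
  -- keyed by the combination of Person ID and Skill ID
  CREATE TABLE skill (
      person_id   INTEGER NOT NULL,
      skill_id    INTEGER NOT NULL,
      skill_name  VARCHAR(50),
      skill_level INTEGER,
      PRIMARY KEY (person_id, skill_id)
  );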

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, then that attribute is considered redundant. The logical model should not contain redundant data.

Redundancy requires additional space, but while memory efficiency is important, the real problem lies elsewhere. Guaranteed synchronization of redundant data comes with an overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on both Person ID and Skill ID. This means that a SKILL cannot exist until there is a PERSON having that skill. It also makes it harder to change the Skill Name: you need to find each Skill Name entry and change it for every Person who has that skill.

Figure 1.8 shows the model in second normal form. Note that the SKILL entity has been added and the skill Name attribute has been moved to it. The skill level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved to another entity. This provides the flexibility to add as many skills as needed and to change the Skill Name or Skill Description in one place.
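Continuing the same hypothetical sketch, second normal form splits the 1NF table so that the skill name depends only on Skill ID, while the skill level stays at the intersection of PERSON and SKILL:

  -- The skill name now depends only on the skill's own key
  -- (this replaces the 1NF table from the previous sketch)
  CREATE TABLE skill (
      skill_id   INTEGER PRIMARY KEY,
      skill_name VARCHAR(50) NOT NULL
  );

  -- The skill level remains at the intersection of PERSON and SKILL
  CREATE TABLE person_skill (
      person_id   INTEGER NOT NULL,
      skill_id    INTEGER NOT NULL REFERENCES skill (skill_id),
      skill_level INTEGER,
      PRIMARY KEY (person_id, skill_id)
  );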

Each attribute depends on a key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School Name and Geographic Area are present in the PERSON table but do not describe the person. To achieve third normal form, these attributes must be moved to an entity where they will depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School Name and Geographic Region are moved to an entity where their values depend on the key.
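In the same hypothetical sketch, third normal form moves the school facts to their own entity, where they depend on the school's own key; the many-to-many relationship between PERSON and SCHOOL remains and is resolved in the next step:

  -- School facts no longer live on PERSON; they depend on the school's own key
  CREATE TABLE school (
      school_id         INTEGER PRIMARY KEY,
      school_name       VARCHAR(100) NOT NULL,
      geographic_region VARCHAR(50)
  );

  -- PERSON now carries only its own descriptive attributes:
  -- person(person_id, person_name, ...)   -- sketch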

Many-to-Many Relationships

Many-to-many relationships reflect the reality of the environment. Note that in figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL. The relationship accurately reflects the fact that a PERSON can study at many SCHOOLS and a SCHOOL can teach many PERSONS. To achieve fourth normal form, an associative entity is created that eliminates the many-to-many relationship by generating a separate entry for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity, in which a separate entry is assigned to each unique combination of SCHOOL and PERSON.
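A hedged sketch of the associative entity from fig. 1.10, assuming the hypothetical person(person_id, ...) and school(school_id, ...) tables from the previous sketches:

  -- One row per unique combination of PERSON and SCHOOL
  CREATE TABLE person_school (
      person_id INTEGER NOT NULL REFERENCES person (person_id),
      school_id INTEGER NOT NULL REFERENCES school (school_id),
      PRIMARY KEY (person_id, school_id)
  );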

Formal definitions of normal forms

The following definitions of normal forms may seem intimidating. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book does not cover a detailed discussion of normal forms, modellers are encouraged to delve deeper into the subject.

In a given relation R, attribute Y is functionally dependent on attribute X. Symbolically, R.X -> R.Y (read as "R.X functionally determines R.Y") if and only if each value of X in R is associated with exactly one value of Y in R (at any given time). Attributes X and Y may be composite (Date C.J. An Introduction to Database Systems. 6th edition. Williams: 1999, 848 pp.).

A relation R is in first normal form (1NF) if and only if all its domains contain only atomic values ​​(Date, ibid.).

A relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is completely dependent on the primary key (Date, ibid.).

A relation R is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is not transitively dependent on the primary key (Date, ibid.).

The relation R conforms to Boyce-Codd normal form (BCNF) if and only if each determinant is a candidate for use as a key.

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency) - multi-valued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of an attribute depends on only part of the primary key.

FD (functional dependency) - functional dependency. In a functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency) - join dependency. In a join dependency, the primary key of the parent entity is traceable to at least the third level of descendants while retaining the ability to be used in the original key join.

A relation R is in fourth normal form (4NF) if and only if, whenever there is an MVD in R, say A ->> B, all attributes of R are also functionally dependent on A. In other words, the only dependencies (FD or MVD) in R are of the form K -> X (i.e., a functional dependency of an attribute X on a candidate key K). Accordingly, R meets the requirements of 4NF if it complies with BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form: a relation R satisfies the join dependency (JD) * (X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.

There are many other normal forms for complex data types and specific situations that are beyond the scope of our discussion. Every enthusiast of model development would like to explore other normal forms.

Business Normal Forms

In his book, Clive Finklestein (Finklestein Cl. An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989) took a different approach to normalization. He defines business normal forms in terms of reduction to those forms. Many modellers find this approach more intuitive and pragmatic.

First Business Normal Form (1BNF) maps repeating groups to another entity. This entity gets its own name and primary (composite) key attributes from the original entity and its repeating group.

Second Business Normal Form (2BNF) maps attributes that partially depend on a primary key to another entity. The primary (composite) key of this entity is the primary key of the entity it originally resided in, along with additional keys on which the attribute is entirely dependent.

The third business normal form (3BNF) maps attributes that do not depend on a primary key to another entity, where they are completely dependent on that entity's primary key.

Fourth Business Normal Form (4BNF) maps attributes that depend on the value of the primary key or are optional to a secondary entity, where they depend entirely on the value of the primary key, or where they must (mandatory) be present in that entity.

Fifth Business Normal Form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed logical data model

The completed logical model must satisfy the requirements of the third business normal form and include all entities, attributes, and relationships necessary to support the data requirements and business rules associated with the data.

All entities must have names that describe the content and a clear, concise, complete description or definition. In one of the following publications, an initial set of recommendations for the correct formation of names and descriptions of entities will be considered.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its values, a logical data type, and a clear, short, complete description or definition. In one of the following publications, we will consider an initial set of recommendations for the correct formation of attribute names and descriptions.

Relationships should include a verb construction that describes the relationship between entities, along with characteristics such as plurality, the need for existence, or the possibility of non-existence of the relationship.

NOTE: The plurality of a relationship describes the maximum number of secondary entity instances that can be associated with an instance of the original entity. The necessity of existence or possibility of absence of a relationship is used to define the minimum number of secondary entity instances that can be associated with an instance of the original entity.

Physical data model

After creating a complete and adequate logical model, you are ready to make a decision on the choice of implementation platform. The choice of platform depends on the requirements for data use and the strategic principles of the organization's architecture. Platform selection is a complex issue that is beyond the scope of this book.

In ERwin, the physical model is a graphical representation of the actual database. The physical database will consist of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and the data usage requirements. The physical model for IMS will be very different from the same model for Sybase. The physical model for OLAP reports will look different than the model for OLTP (Online Transaction Processing).

The data modeler and the database administrator (DBA) use the logical model, the usage requirements, and the strategic principles of the corporate architecture to develop the physical data model. You can denormalize the physical model to improve performance and create views to support usage requirements. The following sections detail the processes of denormalization and view creation.
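Two of the physical-level techniques just mentioned, shown as a hedged sketch on hypothetical orders and customer tables (the column names and the status value are illustrative only):

  -- Denormalization: redundantly storing the customer name on the orders table
  -- so frequent reports avoid a join (the load procedures must keep it in sync)
  ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);

  -- A view supporting a usage requirement: users see orders already joined and filtered
  CREATE VIEW v_open_orders AS
  SELECT o.order_id,
         o.order_date,
         c.customer_name,
         o.amount
  FROM orders o
       JOIN customer c ON c.customer_id = o.customer_id
  WHERE o.status = 'OPEN';   -- hypothetical status column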

This section provides an overview of the process of building a physical model, collecting requirements for using data, and defining the components of a physical model and reverse engineering. These issues will be covered in more detail in future publications.

Collection of data usage requirements

Typically, you collect data usage requirements early on during interviews and work sessions. At the same time, the requirements should define the use of data by the user as fully as possible. Superficial attitude and gaps in the physical model can lead to unplanned costs and delay the project. Usage requirements include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored), which allow the administrator to represent the physical volume of the database

    Estimating the number of users who need concurrent access to data to help you design your database for an acceptable level of performance

    Summary, consolidated, and other calculated or derived data that may be considered candidates for storage in persistent data structures

    Requirements for generating reports and standard queries to help the database administrator build indexes

    Views (permanent or virtual) that will assist the user in performing data merging or filtering operations.

In addition to the chairperson, secretary, and users, the usage requirements session should include the modeler, database administrator, and database architect. User requirements for historical data should be discussed. The length of time that data is stored has a significant impact on the size of the database. Often, older data is stored in aggregate form, and atomic data is archived or deleted.

Users should bring sample queries and reports with them to the session. Reports must be strictly defined and must include the atomic values used for any summary and aggregate fields.
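For example (a hedged sketch on a hypothetical sales table), a strictly defined standard report and the index it suggests to the database administrator:

  -- A standard report: monthly sales totals by region
  SELECT region_id,
         EXTRACT(YEAR FROM sale_date)  AS sales_year,
         EXTRACT(MONTH FROM sale_date) AS sales_month,
         SUM(amount)                   AS total_amount
  FROM sales
  GROUP BY region_id, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date)
  ORDER BY region_id, sales_year, sales_month;

  -- The access pattern of this report suggests a supporting index:
  CREATE INDEX ix_sales_region_date ON sales (region_id, sale_date);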

Components of the physical data model

The components of the physical data model are tables, columns, and relationships. Entities in the logical model will most likely become tables in the physical model. Logical-model attributes will become columns. Logical relationships will become relationship integrity constraints. Some logical relationships cannot be implemented in a physical database.

Reverse engineering

When a logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin, this process is called reverse engineering. Reverse engineering can be done in several ways. The modeler can explore the data structures in the database and recreate the tables in a visual modeling environment. You can import data definition language (DDL) into a tool that supports reverse engineering (e.g., ERwin). Advanced tools such as ERwin include functions that provide ODBC communication with an existing database to create a model by directly reading the data structures. Reverse engineering using ERwin will be discussed in detail in a future publication.

Use of corporate functional boundaries

When building a logical model, it is important for the modeler to ensure that the new model matches the enterprise model. Using corporate functional boundaries means modeling data in terms used within a corporation. The way data is used in a corporation is changing faster than the data itself. In each logical model, the data must be represented holistically, regardless of the business domain it supports. Entities, attributes, and relationships should define business rules at the corporate level.

NOTE: Some of my colleagues refer to these corporate functional boundaries as real-world modeling. Real-world modeling encourages the modeler to view information in terms of its real-life relationships and interconnections.

The use of corporate functional boundaries for a properly constructed data model provides a framework to support the information needs of any number of processes and applications, enabling a corporation to more effectively exploit one of its most valuable assets, information.

What is an enterprise data model?

Enterprise Data Model (EDM) contains entities, attributes, and relationships that represent the information needs of a corporation. EDM is usually subdivided into subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may group entities that describe products or services.

Each logical model must correspond to an existing enterprise data model domain. If the logical model does not meet this requirement, a model that defines the subject area must be added to it. This comparison ensures that the corporate model is improved or adjusted and all logical modeling efforts are coordinated within the corporation.

EDM also includes specific entities that define the scope of values ​​for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain the integrity of relationships. These entities are identified by several different names, such as code tables, link tables, type tables, or classification tables. We will use the term "corporate business object". An enterprise business object is an entity that contains a set of attribute values ​​that are independent of any other entity. Enterprise business objects within a corporation should be used consistently.

Building an Enterprise Data Model by Scaling

There are organizations where the corporate model from start to finish was built as a result of a single concerted effort. On the other hand, most organizations build fairly complete enterprise models by building up.

Growth means building something up, layer by layer, just like an oyster grows a pearl. Each data model created provides input to the formation of the EDM. Building an EDM in this way requires additional modeling steps to add new data structures and domains or extend existing data structures. This makes it possible to build an enterprise data model by building up, iteratively adding levels of detail and refinement.

The concept of modeling methodology

There are several methodologies for visual data modeling. ERwin supports two:

    IDEF1X (Integration Definition for Information Modeling - integrated description of information models).

    IE (Information Engineering - information engineering).

IDEF1X is a good methodology, and its notation is widely used.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a highly structured set of modeling construct types and results in a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the modeler to assign characteristics to entities that may not correspond to the realities of the world around them. For example, IDEF1X requires all entity subtypes to be exclusive. This leads to the fact that a person cannot be both a client and an employee. While real practice tells us otherwise.

Information engineering

Clive Finklestein is often referred to as the father of information engineering, although James Martin shared similar concepts with him (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983.). Information engineering uses a business driven approach to manage information and uses a different notation to represent business rules. IE serves as an extension and development of the notation and basic concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure to support information requirements by integrating corporate strategic planning with the information systems being developed. Such integration makes it possible to more closely link the management of information resources with the long-term strategic prospects of the corporation. This business-driven approach leads many modelers to choose IE over other methodologies that primarily focus on solving immediate development problems.

IE provides a workflow that leads a corporation to identify all of its information needs to collect and manage data and identify relationships between information objects. As a result, information requirements are articulated based on management directives and can be directly translated into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool like ERwin is only part of the problem. In addition, you must understand when data modeling tasks are performed and how information requirements and business rules are collected to be represented in the data model. Conducting work sessions provides the most favorable conditions for collecting information requirements in an environment that includes subject matter experts, users, and information technology specialists.

Building a good data model requires the analysis and research of information requirements and business rules collected during work sessions and interviews. The resulting data model should be compared to the enterprise model, if possible, to ensure that it does not conflict with existing object models and includes all required objects.

The data model consists of logical and physical models representing the information requirements and business rules. The logical model must be reduced to third normal form. Third normal form limits insert, update, and delete anomalies in the data structures, supporting the "one fact in one place" principle. The collected information requirements and business rules should be analyzed and researched, and compared with the enterprise model to ensure that they do not conflict with existing object models and that they include all required objects.

In ERwin, the data model includes both the logical and the physical model. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and relationship integrity constraints.

In one of the following publications, the issues of identifying entities, determining entity types, choosing entity names and descriptions, as well as some tricks to avoid the most common modeling errors associated with the use of entities, will be considered.



This article will focus on the architecture of data warehouses. What to be guided by when building it, what approaches work - and why.

"The fairy tale is a lie - but there is a hint in it ..."

Grandfather planted ... storage. And the storehouse grew big and big. I just didn't really know how it worked. And grandfather started a review. The grandfather called the grandmother, granddaughter, cat and mouse for a family council. And he says the following topic: “Our storage has grown. Data from all systems flock, tables are visible and invisible. Users prepare their reports. It seems that everything is fine - to live and live. Yes, only one sadness - no one knows how it works. It requires disks apparently-invisibly - you will not get enough! And then there are users who come to me with different complaints: either the report freezes, or the data is outdated. And sometimes it’s quite a disaster - we come with reports to the tsar-father, but the numbers don’t agree with each other. The hour is not even - the king will be angry - then do not demolish your head - neither for me, nor for you. So I decided to gather you and consult: what are we going to do?

He cast his eyes over the assembly and asked:
- Here you are, grandmother, do you know how our storage is arranged?
- No, grandfather, I don't know. And how should I know? Over there, what brave lads are guarding him! Some mustaches! Don't step up. I went to visit them somehow, baked pies. And they ate some pies, wiped their mustaches and said: “Why did you come, grandmother? What is your storage? You tell us what kind of report you need - we will do it for you! You most importantly bring pies more often! Painfully, they taste delicious.”
- And you, my beloved granddaughter, do you know how our storage is arranged?
- No, grandfather, I don't know. Gave me some access to it. I connected, I look - and there are tables - apparently invisible. And different schemes are hidden. Eyes widen.... I was confused at first. And then I looked closely - some of them are empty, others are filled, but only half. Also, the data seems to be repeated. No wonder you can't stock up on disks with such redundancy!
- Well, you, cat, what can you say about our storage? Is there something good in it?
- Yes, how not to say, grandfather - I will say. At the request of my granddaughter, I tried to make a pilot pilot in a separate scheme - a small display case. In order to understand what kind of trade is beneficial for our state - what products are good for merchants, they pay tribute - the treasury is replenished. And which ones are bad. And I began to pick up data from this repository. Collected facts. And he began to try to compare them against products. And what, grandfather, I saw - the products seem to be the same, but you look at the signs - they are different! I then began to comb them with my granddaughter's comb. He scratched, scratched - and led to a certain uniformity, caressing the eye. But early I rejoiced - the next day I launched my scripts to update the wonderful data in the window - and everything went away for me! "How so?" - I think, - the granddaughter is going to be upset - today it would be necessary to show our pilot to the minister. How do we go about it - with such data?
- Sad tales you tell, cat, sad tales. Well, and you, little mouse - did you not try to learn about the warehouse? You are a lively, nimble, sociable one! What will you tell us?
- How could I not try, grandfather - I am a quiet mouse, but a nimble one. Once the granddaughter asked the cat to get hold of the data model of our state repository. And the cat, of course, came to me: "Mouse," he says, "all hope is on you!" Well, what good deed will good people (and cats) not do? Off I went to the castle where the head of the repository keeps the data model locked in a safe. And I hid. I waited for him to take that model out of the safe. As soon as he stepped out for coffee - I jumped onto the table. I look at the model - and I cannot understand a thing! How so? I do not recognise our warehouse! We have thousands upon thousands of tables and endless streams of data - and here everything is harmonious and beautiful... I looked at that model a while longer - and back into the safe it went.
- Yes, very strange things, you told us, mouse.
Grandpa thought hard.
- What shall we do, my friends? You cannot live long with a repository like this... The users will soon run out of patience altogether.

Whatever our grandfather from the fairy tale decides - to build a new warehouse or to try to resuscitate the existing one - conclusions must be drawn before we "roll up our sleeves" again.
Let's set aside the organizational aspects - such as the danger of concentrating expertise in a narrow, closed group, the lack of control processes and of transparency in the architecture of the systems used in the enterprise, and so on.
Today I would like to focus on building the architecture of a particular system (or group of systems) - the data warehouse: what should be kept in focus first of all when an organization sets out to build such a complex and expensive system.

Debriefing

None of us, working on creating and developing a system, wants it to turn out a "temporary shack", or a solution that will "wither away" in a year or two because it cannot meet the requirements and expectations of the customers and the business. However strong today's shift towards "agile methodologies" may be, it is far more pleasant to feel like a craftsman who makes violins than an artisan who whittles sticks for disposable drums.
Our intention sounds natural: to build solid, high-quality systems that will not demand regular "night vigils with a file", that we will not be ashamed of in front of end users, and that will not look like a "black box" to the "uninitiated".

First, let's list the typical problems that we regularly encounter when working with storages. Let's just write down what we have - so far without trying to streamline and formalize.

  1. In principle, we have a good storage: if you do not touch it, then everything works. True, as soon as a change is required, “local collapses” begin.
  2. Data is loaded daily, according to the regulations, as one large process, within an 8-hour window. And that suits us. But if a failure occurs, manual intervention is required - and after that everything may behave unpredictably for a long time, because a human is now involved in the process.
  3. Rolled release - expect problems.
  4. If a single source fails to deliver its data on time, all processes wait.
  5. Data integrity is controlled by the database - so our processes crash when it is broken.
  6. We have a very large warehouse: 2000 tables in one common schema, and 3000 more spread across many other schemas. We have only a vague idea of how they are organized and why they appeared. So it is often hard for us to reuse anything, and many problems get solved all over again - because that is easier and faster than digging through "someone else's code". As a result, we get discrepancies and duplicated functionality.
  7. We expect the sources to provide quality data. But it turns out they do not. As a result, we spend a lot of time reconciling our final reports. And we have become very good at it - we even have a streamlined process for it. True, it takes time. But the users are used to it...
  8. Users do not always trust our reports and demand justification for this or that figure. Sometimes they are right and sometimes they are not. But it is very hard for us to justify the numbers, because we have no means of "end-to-end analysis" (data lineage).
  9. We could bring in additional developers. But we have a problem - how do we get them up to speed? What is the most efficient way to parallelize the work?
  10. How to develop the system gradually, without going into the development of the “core of the system” for a whole year?
  11. The data warehouse is supposed to rest on a corporate model. But we know for certain (we saw it at bank XYZ) that a model can be built indefinitely (at XYZ they walked around discussing business entities for six months without any movement). Why have it at all? Or maybe it is better without one, if it causes so many problems? Maybe it could be generated somehow?
  12. We have decided to maintain a model. But how do we develop the warehouse data model systematically? Do we need "rules of the game", and what might they be? What will that give us? And what if we get the model wrong?
  13. Should we store the data, or the history of its changes, if "the business does not need it"? We would not like to "store garbage" or complicate the use of this data for real tasks. Should the warehouse keep history at all? What kind? How does a warehouse work with time?
  14. Do we need to try to unify the data in the warehouse if we already have a reference data (NSI) management system? If there is MDM, does that mean the whole master data problem is now solved?
  15. We are expected to replace key accounting systems soon. Should the data store be ready for a source change? How to achieve this?
  16. Do we need metadata? What shall we understand by this? Where exactly can they be used? How can it be implemented? Do they need to be kept "in one place"?
  17. Our Customers are extremely unstable in their requirements and desires - something is constantly changing. In general, our business is very dynamic. While we are doing something, it already becomes unnecessary. How can we make sure that we produce results as quickly as possible - like hot cakes?
  18. Users demand speed. But we cannot run our main load processes frequently, because that loads the source systems (and hurts their performance) - so we add extra data streams that pick up, point by point, just what we need. True, that means a lot of streams. And part of the data we will throw away later anyway. And on top of that there will be a convergence problem. But there is no other way...
That is already quite a lot, and the list is not complete - it is easy to extend. We will not hide it away in a drawer but hang it somewhere visible, keeping these issues in the focus of our attention as we work.
Our task is to develop a comprehensive solution as a result.

Antifragility

Looking at our list, one conclusion can be drawn. It is not difficult to create some kind of “database for reporting”, throw data there, or even build some kind of routine data update processes. The system begins to live somehow, users appear, and with them obligations and SLAs, new requirements arise, additional sources are connected, methodologies change - all this must be taken into account in the development process.

After some time, the picture is as follows:
"Here's the vault. And it works if you don't touch it. Problems arise when we have to change something.”

A change arrives whose impact we cannot evaluate or comprehend (because we did not build such tools into the system from the start) - so, to avoid risk, we do not touch what is there but build one more extension on the side, and another, and another - turning our solution into slums, or, as they say in Latin America, "favelas", where even the police are afraid to go.
There is a feeling of losing control over one's own system, of chaos. More and more hands are needed to maintain the existing processes and to solve problems. And changes become harder and harder to make. In other words, the system becomes non-resilient to stress and non-adaptive to change. On top of that, there is a strong dependence on the characters who "know the fairway", since no one else has a "map".

This property of an object - to collapse under the influence of chaos, random events and upheavals - is what Nassim Nicholas Taleb calls fragility. He also introduces the opposite concept: antifragility - when an object is not destroyed by stress and accidents but derives direct benefit from them ("Antifragile: Things That Gain from Disorder").
In other words, it can be called adaptability, or resilience to change.

What does this mean in this context? What are the "sources of chaos" for IT systems? And what does it mean to "capitalize on chaos" in terms of IT architecture?
The first thought that comes to mind is the changes that come from outside. What is the outside world for the system? For storage in particular. Of course, first of all - changes from the data sources for the warehouse:

  • changing the formats of incoming data;
  • replacement of some data source systems with others;
  • changing rules/platforms for system integration;
  • changing the interpretation of data (formats are saved, the logic of working with data changes);
  • changing the data model, if the integration is done at the data level (parsing the database transaction log files);
  • growth in data volumes - while the source system held little data and its load was small, you could pull the data at any time with an arbitrarily heavy query; now that the data and the load have grown, strict restrictions apply;
  • etc.
The source systems themselves, the composition of information and its structure, the type of integration interaction, as well as the very logic of working with data can change. Each system implements its own data model and approaches to working with them that meet the goals and objectives of the system. And no matter how hard they try to unify industry models and reference practices, nuances will inevitably emerge anyway. (And besides, the process of industry unification itself, for various reasons, is not moving forward much.)
The culture of working with corporate data - the presence and control of information architecture, a single semantic model, master data management systems (MDM) somewhat facilitate the task of consolidating data in the warehouse, but do not exclude its necessity.

No less critical changes are initiated by storage consumers (requirement changes):

  • previously, there was enough data to build a report - now it was necessary to connect additional fields or a new data source;
  • previously implemented data processing methods are outdated - algorithms and everything that it affects need to be reworked;
  • Previously, everyone was satisfied with the current value of the dictionary attribute on the information panel - now the value that is relevant at the time of the occurrence of the analyzed fact / event is required;
  • there was a requirement for the depth of the history of data storage, which was not there before - to store data not for 2 years, but for 10 years;
  • previously there was enough data as of the “end of the day / period” state - now the data state is needed “intraday”, or at the time of a certain event (for example, making a decision on a loan application - for Basel II);
  • earlier we were satisfied with reporting on data for yesterday (T-1) or later, now we need T0;
  • etc.
Both integration interactions with source systems and requirements from data warehouse consumers are external factors for the data warehouse: one source system replaces another, data volumes grow, incoming data formats change, user requirements change, etc. And all these are typical external changes for which our system - our repository - must be ready. With the right architecture, they shouldn't kill the system.

But that is not all.
Speaking of variability, we recall external factors first of all. After all, inside we control everything - or so it seems. Yes and no. Yes, most of the factors outside our zone of influence are external. But there is also "internal entropy". And it is precisely because of its presence that we sometimes need to return "to point zero" and start the game over.
In life we often tend to start from scratch. Why do we tend to do this? And is it really so bad?
Applied to IT, for the system itself this can be very good - the ability to reconsider individual decisions, especially when we can do it locally. Refactoring is the process of untangling the "web" that periodically forms as the system develops. Returning "to the beginning" can be useful. But it has a price.
With proper architecture management, this price is reduced - and the process of system development itself becomes more controllable and transparent. A simple example: if the principle of modularity is observed, it is possible to rewrite a separate module without affecting external interfaces. And this cannot be done with a monolithic structure.

The antifragility of a system is determined by its architecture. And it is this property that makes it adaptive.
When we talk about an adaptive architecture, we mean that the system is able to adapt to changes - not that we constantly change the architecture itself. On the contrary, the more stable and resilient the architecture, and the fewer requirements that entail its revision, the more adaptive the system.

Solutions that require a revision of the entire architecture will have a much higher price. And for their adoption, you need to have very good reasons. For example, such a reason could be a requirement that cannot be implemented within the current architecture. Then they say - there was a requirement that affects the architecture.
Thus, we also need to know our “antifragility limits”. Architecture is not developed "in a vacuum" - it is based on current requirements and expectations. And if the situation fundamentally changes - we must understand that we have gone beyond the current architecture - and we need to revise it, develop a different solution - and think about transition paths.
For example, suppose we assumed that the warehouse would always need data as of the end of the day, and that we would collect it daily through the systems' standard interfaces (a set of views). Then requests come from the risk management department: data is needed not at the end of the day, but at the moment a lending decision is made. There is no point trying to "stretch the unstretchable" - we simply need to acknowledge this fact, the sooner the better, and start working on an approach that will solve the problem.
There is a very fine line here - if we only consider "requirements in the moment" and do not look a few steps ahead (and several years ahead), then we increase the risk of encountering an architecture-affecting requirement too late - and the cost of our change will be very high. Looking a little ahead - within the boundaries of our horizon - has never harmed anyone.

The system from the "warehouse fairy tale" is precisely an example of a very shaky system built on fragile design approaches. And when that is the case, destruction comes rather quickly for this particular class of systems.
Why can I say so? The topic of storage is not new. The approaches and engineering practices that have been developed during this time have been aimed at precisely this - maintaining the viability of the system.
To take a simple example: one of the most common reasons warehouse projects fail to take off is the attempt to build the warehouse on top of source systems that are themselves still under development, without agreed integration interfaces - pulling data straight out of the tables. As a result, the project goes into development, during which time the source database changes, and the load flows into the warehouse stop working. It is too late to redo anything. And if you have not protected yourself by building several layers of tables inside the warehouse, you may have to throw everything away and start over. This is just one example, and one of the simplest.

Taleb's criterion for fragile and antifragile is simple. The chief judge is time. If the system stands the test of time and shows its "survivability" and "indestructibility" - it has the property of antifragility.
If, when designing a system, we take into account antifragility as a requirement, then this will encourage us to use such approaches to building its architecture that will make the system more adaptive to both “chaos from the outside” and “chaos from the inside”. And ultimately the system will have a longer lifespan.
None of us wants to make "temporaries". And do not delude yourself that there is no other way now. Looking a few steps ahead is normal for a person at any time, especially in times of crisis.

What is a data warehouse and why do we build it

An article on warehouse architecture assumes that the reader not only knows what a warehouse is, but also has some experience with such systems. Nevertheless, I thought it necessary to go back to the origins, to the beginning of the path, because that is where the "fulcrum" of development lies.

How did people come to the conclusion that data warehouses are needed? And how are they different from just "a very large database"?
A long time ago, when there were simply "business data processing systems" in the world, there was no division of IT systems into classes such as front-end OLTP systems, back-office DSS, text processing systems, data warehouses, and so on.
This was the time when Michael Stonebraker created the first relational DBMS, Ingres.
And this was the time when the era of personal computers burst into the computer industry like a whirlwind and forever turned all the ideas of the IT community of that time.

Back then it was easy to find enterprise applications written on desktop-class DBMSs such as Clipper, dBase and FoxPro. The market for client-server applications and DBMSs was just gaining momentum. One after another, database servers appeared that would occupy their niche in the IT space for a long time - Oracle, DB2, and others.
The term "database application" was in circulation. What did such an application include? Simplified: some input forms through which users could enter information simultaneously, some calculations launched "on a button" or "on a schedule", and some reports that could be viewed on screen or saved as files and sent to print.
"Nothing special - just a simple application, just a database," one of my early mentors remarked. "Is it nothing special?" - I thought then.

If you look closely, there are features after all. As the number of users grows, the volume of incoming information increases and the load on the system rises, its designers resort to certain "tricks" to keep performance at an acceptable level. The very first is splitting the monolithic "business data processing system" into an accounting application that supports users working on-line, and a separate application for batch data processing and reporting. Each of these applications has its own database and is even hosted on a separate database server instance, with different settings for the different nature of the load - OLTP and DSS. And data flows are built between them.

This is all? It would seem that the problem is solved. What happens next?
And then companies grow, and their information needs multiply. The number of interactions with the outside world grows too. And in the end there is not one big application that fully automates all processes, but several different ones from different vendors. The number of systems that generate information - data source systems - keeps increasing. And sooner or later there is a need to see and compare information coming from different systems. This is how a new class of systems, the data warehouse, appears in the company.
The generally accepted definition of this class of systems is as follows.

A data warehouse (Data Warehouse) is a subject-oriented information database, specially designed and intended for preparing reports and for business analysis in order to support decision-making in an organization.
Thus, the consolidation of data from different systems - the ability to look at it in a single, unified way - is one of the key properties of systems of the data warehouse class. This is the reason warehouses appeared in the course of the evolution of IT systems.

Key Features of Data Warehouses

Let's take a look in more detail. What are the key features of these systems? What makes data warehouses different from other enterprise IT systems?

First, these are large volumes. Very large. VLDB is what leading vendors call such systems when they give recommendations on using their products. Data flows from all company systems into this one big database and is stored there "forever and unchanged", as the textbooks say (in practice, life is more complicated).

Secondly, it is historical data. Data warehouses are sometimes called "corporate memory". Working with time in a warehouse is quite interesting. In accounting systems, the data is current at the present moment; the user performs an operation and the data is updated, while the history of changes may or may not be preserved, depending on accounting practice. Take, for example, a bank account balance. We may be interested in the balance "now", at the end of the day, or at the moment of some event (for example, when a score is calculated). If the first two are solved quite simply, the last will most likely require special effort. When working with the warehouse, users can access past periods, compare them with the current one, and so on. It is these time-related capabilities - obtaining the state of the data at various points on the time axis, to a certain depth into the past - that significantly distinguish data warehouses from accounting systems.
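To make the "state at a point on the time axis" idea concrete, here is a minimal sketch in Python of an "as of" lookup over a versioned table. Everything in it - the table account_balance_hist, the valid_from/valid_to interval columns, the sample values - is invented for the illustration and is not a prescription for how a particular warehouse must store history.

    import sqlite3

    # A minimal sketch of a point-in-time ("as of") lookup over a versioned table.
    # Table and column names (account_balance_hist, valid_from, valid_to) are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE account_balance_hist (
            account_id  TEXT,
            balance     NUMERIC,
            valid_from  TEXT,   -- ISO timestamps for simplicity
            valid_to    TEXT    -- '9999-12-31' marks the current version
        )
    """)
    conn.executemany(
        "INSERT INTO account_balance_hist VALUES (?, ?, ?, ?)",
        [
            ("40817-001", 100.0, "2024-01-01", "2024-01-15"),
            ("40817-001", 250.0, "2024-01-15", "9999-12-31"),
        ],
    )

    def balance_as_of(account_id: str, moment: str):
        """Return the balance valid at the given moment (e.g. when a score was calculated)."""
        row = conn.execute(
            """
            SELECT balance
            FROM account_balance_hist
            WHERE account_id = ?
              AND valid_from <= ?
              AND valid_to   >  ?
            """,
            (account_id, moment, moment),
        ).fetchone()
        return row[0] if row else None

    print(balance_as_of("40817-001", "2024-01-10"))  # -> 100.0
    print(balance_as_of("40817-001", "2024-02-01"))  # -> 250.0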

Thirdly, it is consolidation and unification of data. To make joint analysis possible, the data must be brought to a common form - a unified data model - and the facts mapped to unified reference data. There can be several aspects and difficulties here. First of all, conceptual ones: different people from different departments may understand different things under the same term, and, conversely, call the same thing by different names. How do you provide a "single view" while preserving the specific perspective of a particular group of users?

Fourthly, it is work with data quality. As data is loaded into the warehouse it is cleansed, and general conversions and transformations are performed. The general transformations must be done in one place and then reused to build various reports. This avoids the discrepancies that so irritate business users - especially managers who are brought figures from different departments that do not agree with each other. Poor data quality breeds errors and discrepancies in reports, and the consequence is a drop in user trust in the whole system and in the whole analytical service.

Architectural concept

Everyone who has dealt with a warehouse has most likely observed some kind of "layered structure", because it is this architectural paradigm that has taken root for systems of this class. And not by accident. Warehouse layers can be seen as separate components of the system, each with its own tasks, areas of responsibility and "rules of the game".
A layered architecture is a means of dealing with the complexity of the system: each subsequent layer is abstracted from the intricacies of the internal implementation of the previous one. This approach allows tasks of the same type to be identified and solved in a uniform way, without reinventing the wheel from scratch every time.
A schematic conceptual architectural diagram is shown in the figure. This is a simplified diagram that reflects only the key idea - the concept, but without the "anatomical details" that will arise with a deeper study of the details.

As shown in the diagram, we conceptually distinguish the following layers: three main layers containing data storage areas (shown as filled rectangles) and data loading software (shown as arrows of the same color), plus an auxiliary service layer, which nevertheless plays a very important connecting role - managing data loading and quality control.

Primary Data Layer - the primary data layer (also staging or the operational layer) - is intended for loading data from source systems and storing the primary information without transformations: in its original quality and with a complete history of changes.
The task of this layer is to abstract the subsequent layers from the physical organization of the data sources, from the data collection methods, and from the methods of extracting the delta of changes.

Core Data Layer - the warehouse core - is the central component of the system, and it is what distinguishes the warehouse from a mere "batch integration platform" or a "big data dump": its main role is consolidating data from different sources and reducing it to uniform structures and keys. It is when loading into the core that the main data quality work and the general transformations - which can be quite complex - are carried out.
The task of this layer is to abstract its consumers from the peculiarities of the logical structure of the data sources and from the need to reconcile data from different systems, and to ensure the integrity and quality of the data.
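To make "reduction to uniform structures and keys" more tangible, here is a minimal sketch of a key cross-reference as it might look in the core. The function and table names are hypothetical; real implementations usually keep such a cross-reference in a database table and add matching (MDM/deduplication) rules on top.

    # Minimal sketch of key unification in the core layer (names are hypothetical).
    # Each source system has its own customer key; the core maps every (source, key)
    # pair to one surrogate key so facts from different systems can be joined.
    from itertools import count

    surrogate_seq = count(1)
    key_xref = {}  # (source_system, source_key) -> core surrogate key

    def resolve_customer_key(source_system: str, source_key: str) -> int:
        """Return the unified surrogate key, creating a new one on first sight."""
        xref_key = (source_system, source_key)
        if xref_key not in key_xref:
            key_xref[xref_key] = next(surrogate_seq)
        return key_xref[xref_key]

    # The same real-world customer arriving from two systems under different keys
    # still needs matching rules (MDM or dedup logic) to end up with one surrogate:
    print(resolve_customer_key("CRM", "C-1042"))      # 1
    print(resolve_customer_key("BILLING", "984312"))  # 2 (unless matched to CRM C-1042)
    print(resolve_customer_key("CRM", "C-1042"))      # 1 again - the key is stable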

Data Mart Layer - analytical data marts - is the component whose main function is converting data into structures convenient for analysis (if BI works with the marts, this is usually a dimensional model), or into whatever the consumer system requires.
As a rule, data marts take data from the core - a reliable, verified source - i.e. they use the service of that component to bring data to a single form. We will call such marts regular. In some cases marts take data directly from staging, operating on primary data (in source keys). This approach is usually used for local tasks where consolidation of data from different systems is not required and where speed matters more than data quality. Such marts we call operational. Some analytical indicators have very complex calculation methods, so for such non-trivial calculations and transformations so-called secondary marts are built.
The task of the data mart layer is preparing data according to the requirements of a specific consumer - a BI platform, a group of users, or an external system.

The layers described above consist of a persistent data storage area plus a software module for loading and transforming the data. The division into layers and areas is logical; the physical implementation of these components can vary - you can even use different platforms to store or transform data on different layers, if that is more efficient.
The storage areas contain technical structures (buffer tables) used during data transformation, and target tables, which are what the consuming component accesses. It is good practice to "cover" the target tables with views: this simplifies subsequent maintenance and development of the system. The data in the target tables of all three layers is marked with special technical fields (meta-attributes), which support the data loading processes and make an information audit of data flows in the warehouse possible.
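As an illustration of "technical meta-attributes plus a covering view", here is a small sketch. The specific set of fields (load_process_id, load_ts, delete_flag) and all the names are assumptions made for the example, not a fixed standard.

    import sqlite3

    # Sketch: a target table carries technical meta-attributes, and consumers read it
    # through a view. Field names (load_process_id, load_ts, delete_flag) are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE core_account (
            account_sk      INTEGER PRIMARY KEY,
            account_number  TEXT,
            open_date       TEXT,
            -- technical meta-attributes used by load control and audit
            load_process_id INTEGER,   -- which load process wrote this row
            load_ts         TEXT,      -- when it was written
            delete_flag     INTEGER DEFAULT 0
        );

        -- Consumers are given the view, not the table: the table can later be
        -- restructured (partitioned, split, renamed) without breaking them.
        CREATE VIEW v_core_account AS
            SELECT account_sk, account_number, open_date
            FROM core_account
            WHERE delete_flag = 0;
    """)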

A separate component (or set of components) provides service functions for all layers. One of its key tasks - a control function - is to provide "single rules of the game" for the system as a whole, while leaving the freedom to implement each of the layers described above in different ways, including different technologies for loading and processing data, different storage platforms, and so on. We will call it the service layer (Service Layer). It contains no business data, but it has storage structures of its own: a metadata area and an area for data quality work (and possibly other structures, depending on the functions assigned to it).

Such a clear division of the system into separate components significantly increases the controllability of the system development:

  • the complexity of the task handed to the developer of a particular component is reduced (they do not have to simultaneously solve integration with external systems, think through data cleansing procedures, and work out the optimal presentation of data for consumers) - the task becomes easier to decompose, estimate and deliver in small increments;
  • various performers (even teams or contractors) can be brought into the work, because this approach lets tasks be parallelized effectively while reducing their mutual influence on each other;
  • the presence of persistent staging allows data sources to be connected quickly, without designing the entire core or the marts for the whole subject area; the remaining layers can then be built gradually, according to priorities (and the data will already be in the warehouse, available to system analysts, which greatly eases the subsequent development of the warehouse);
  • the presence of the core allows all the data quality work (and possible misses and errors) to be hidden from the marts and from the end user, and, most importantly, using this component as the single data source for the marts avoids data convergence problems, because the common algorithms are implemented in one place;
  • separating out the data marts makes it possible to account for the differences and specifics in how users of different departments understand the data, while designing them for BI requirements allows not only aggregated figures to be published but also data reliability to be ensured by providing drill-down to the primary indicators;
  • the presence of the service layer enables end-to-end data analysis (data lineage), unified data audit tools, common approaches to extracting the delta of changes, to data quality work, load management, error monitoring and diagnostics - and speeds up problem resolution.
This approach to decomposition also makes the system more resilient to change (compared to a "monolithic structure") - it ensures its antifragility:
  • changes coming from the source systems are handled in staging; in the core, only the flows affected by those staging tables are modified, and the effect on the marts is minimal or absent;
  • changes in consumer requirements are handled mostly in the marts (unless they require additional information that is not yet in the warehouse).
Next, we will go through each of the above components and look at them in a little more detail.

System core

Let's start "from the middle" - with the core of the system, the middle layer. In the diagram it is labeled Core Layer. The core performs the role of data consolidation: reduction to single structures, reference data and keys. This is where the main data quality work is carried out - cleansing, transformation, unification.

The presence of this component allows the data flows that transform primary data from the source systems into a single format - following common rules and algorithms - to be reused, rather than re-implementing the same functionality separately for each application mart, which, besides wasting resources, would also lead to discrepancies in the data.
The warehouse core is implemented using a data model that, in the general case, differs both from the models of the source systems and from the formats and structures of the consumers.

The warehouse core model and the enterprise data model

The main task of the middle warehouse layer is stability. That is why the main focus here is on the data model. It is commonly referred to as the "enterprise data model". Unfortunately, a certain halo of myths and absurdities has grown up around it, which sometimes leads to abandoning its construction altogether - and in vain.

Myth 1. An enterprise data model is a huge model consisting of thousands of entities (tables).
Actually. In any subject area, in any business domain, in the data of any company, even the most complex, there are few basic entities - 20-30.

Myth 2. There is no need to develop any "own model" - we buy an industry reference model - and do everything according to it. We spend money - but we get a guaranteed result.
Actually. Reference models can indeed be very useful, because they capture industry experience in modeling the area. You can draw ideas, approaches and naming practices from them, and check the "depth of coverage" of the area so as not to miss something important. But we are unlikely to be able to use such a model "out of the box", as-is. This is the same myth as, say, buying an ERP (or CRM) system and implementing it without any "tailoring to yourself". The value of such models is born in adapting them to the realities of this particular business, this particular company.

Myth 3. Development of the storage core model can take many months, during which time the project will actually be frozen. In addition, it requires an insane amount of meetings and the participation of many people.
Actually. The repository model can be developed iteratively, piece by piece, along with the repository. For uncovered areas, "extension points" or "stubs" are placed - i.e. some "universal constructions" are applied. At the same time, you need to know when to stop so that you don’t get a super-universal thing of 4 tables, into which it is difficult both to “put data” and (even more difficult) to get it. And which is extremely non-optimal in terms of performance.

Developing the model will take time. But this is not time spent "drawing entities" - it is the time needed to analyze the subject area and understand how the data is structured. That is why analysts are closely involved in this process, along with various business experts - and selectively, not by organizing meetings with an insane number of people or mailing out huge questionnaires.
Quality business and system analysis is the key to building a storage core model. You need to understand a lot of things: where (in what systems) data is generated, how it is arranged, in what business processes it circulates, etc. Qualitative analysis has never harmed any system. Rather, on the contrary, problems arise from “blank spots” in our understanding.

Developing a data model is not a process of inventing and coming up with something new. In fact, the data model in the company already exists. And the process of its design is more like "excavations". The model is gently and carefully brought to light from the "ground" of corporate data and clothed in a structured form.

Myth 4. In our company, the business is so dynamic, and everything is changing so quickly that it is useless for us to make a model - it will become outdated before we put this part of the system into operation.
Actually. Recall that the key factor in the core is stability. And above all, the topology of the model. Why? Because it is this component that is central and affects everything else. Stability is also a requirement for the kernel model. If the model becomes obsolete too quickly, then it is incorrectly designed. For its development, the wrong approaches and “rules of the game” were chosen. It is also a question of qualitative analysis. The key entities of the corporate model change extremely rarely.
But if, for a company that sells, say, confectionery, it occurs to us to create "Sweets", "Cakes" and "Pies" instead of a "Products" directory - then, when pizza appears in the list of goods, yes, a lot of new tables will have to be added. That is purely a matter of approach.

Myth 5. Creating a corporate model is a very serious, complex and responsible business. And it's scary to make a mistake.
Actually. The core model, although it should be stable, is still not "cast in metal". Like any other design decision, its structure can be reviewed and modified - just do not forget about that required stability. It does not mean at all that you "cannot breathe" on it, and it does not mean that temporary solutions and "stubs", planned for later rework, are unacceptable.

Myth 6. If we have a data source - for example, a reference data (NSI) system or a master data management (MDM) system - then in a good design it should correspond to the corporate model (especially if it was designed recently and has not had time to acquire "side effects", "traditions" and makeshift additions). Does it follow that in this case we do not need a core model at all?
Actually. Yes, in this case building the warehouse core model is much easier, because we follow a ready-made top-level conceptual model. But the core model is not eliminated at all. Why? Because when building the model of a particular system certain rules still apply: what types of tables to use (for each entity), how to version the data, with what granularity to keep history, what meta-attributes (technical fields) to use, and so on.

Besides, no matter how wonderful and comprehensive our NSI and MDM system is, there will, as a rule, be nuances connected with the existence of local directories "about the same thing" in other accounting systems. And this problem, whether we like it or not, will have to be solved in the warehouse - because this is where reporting and analytics are assembled.

Primary data layer (or historizable staging or operational layer)

In the diagram it is designated as Primary Data Layer. The role of this component is integration with source systems, loading and storing primary data, and preliminary data cleansing - checking compliance with the format and logical control rules fixed in the "interface interaction agreement" with the source.
In addition, this component solves a task that is very important for the warehouse - extracting the "true delta of changes" - regardless of whether the source can track changes in its data and how (by what criterion they can be "caught"). Once the data has landed in staging, the question of delta extraction is already settled for all subsequent layers, thanks to the meta-attribute marking.

The data in this layer is stored in structures that are as close as possible to the source system - in order to keep the primary data as close as possible to their original form. Another name for this component is "operational layer".
Why not just use the established term "staging"? The thing is that earlier, before the "era of big data and VLDB", disk space was very expensive, and primary data, if it was stored at all, was stored only for a limited time; the name "staging" often refers to a cleanable buffer.
Now technology has moved on, and we can afford not only to store all the primary data but to historize it at the finest granularity possible. This does not mean we should not control the growth of data, and it does not remove the need to manage the information lifecycle, optimizing the cost of storage depending on the "temperature" of use - that is, moving "cold", less frequently used data to cheaper media and storage platforms.

What gives us the presence of "historic staging":

  • the possibility of making mistakes (in structures, in transformation algorithms, in the granularity of history keeping) - having completely historizable primary data in the storage availability zone, we can always reload our tables;
  • an opportunity to think - we can take our time developing a large fragment of the core in this iteration of the warehouse's development, because the data will be in our staging in any case, with a uniform time horizon (there will be a single "starting point of history");
  • the possibility of analysis - we will save even those data that are no longer in the source - they could be overwritten there, go to the archive, etc. – with us, they remain available for analysis;
  • the possibility of an information audit - thanks to the most detailed primary information, we can later figure out how the load ran and why we ended up with these particular numbers (for this you also need the meta-attribute marking and the corresponding metadata the load runs on - that is handled on the service layer).
What difficulties may arise in the construction of "historic staging":
  • it would be convenient to set requirements for the transactional integrity of this layer, but practice shows that this is difficult to achieve (this means that in this area we do not guarantee the referential integrity of parent and child tables) - integrity alignment occurs on subsequent layers;
  • this layer contains very large volumes (the largest in storage - despite all the redundancy of analytical structures) - and you need to be able to handle such volumes - both in terms of loading and in terms of queries (otherwise, you can seriously degrade the performance of the entire storage).
What else can be said about this layer.
Firstly, if we move away from the paradigm of "end-to-end loading processes", then the rule "the caravan moves at the speed of the last camel" no longer applies to us - or rather, we abandon the "caravan" principle and switch to the "conveyor" principle: take the data from the source, put it into our layer, and be ready to take the next portion. It means that
1) we don't wait for processing to happen on other layers;
2) we do not depend on the schedule of data provision by other systems.
Simply put, we schedule a load process that takes data from one source through a specific connection method to it, checks, extracts the delta - and puts the data into staging target tables. And that's all.
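A minimal sketch of such an atomic staging load step is shown below. The way the delta is detected (a hash of business attributes), the meta-attribute names and the function signatures are all assumptions made for the illustration; a real loader would also persist the batch and the known hashes in the database.

    import hashlib
    from datetime import datetime, timezone

    def row_hash(row: dict, business_cols: list) -> str:
        """Hash of business attributes - one possible way to detect changed rows."""
        payload = "|".join(str(row.get(c)) for c in sorted(business_cols))
        return hashlib.sha256(payload.encode()).hexdigest()

    def load_to_staging(source_rows, known_hashes: set,
                        business_cols: list, process_id: int):
        """Extract the 'true delta' and mark it with meta-attributes.

        source_rows  - rows already read from the source via its connection method
        known_hashes - hashes of rows previously loaded into the staging target table
        """
        now = datetime.now(timezone.utc).isoformat()
        delta = []
        for row in source_rows:
            h = row_hash(row, business_cols)
            if h in known_hashes:
                continue  # unchanged - not part of the delta
            delta.append({**row,
                          "meta_row_hash": h,
                          "meta_load_process_id": process_id,
                          "meta_load_ts": now})
        return delta  # in a real loader this batch is inserted into the staging table

    # Usage sketch:
    rows = [{"account": "A1", "balance": 100}, {"account": "A2", "balance": 5}]
    print(load_to_staging(rows, known_hashes=set(),
                          business_cols=["account", "balance"], process_id=42))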

Secondly, these processes, as we can see, are arranged very simply - trivially, one might say, in terms of logic. This means they can be optimized and parameterized very well, reducing the load on our system and speeding up the connection of new sources (development time).
For this to happen, you need to know very well the technological features of the platform on which this component works - and then you can make a very effective tool.

Analytical data mart layer

The data mart layer (Data Mart Layer) is responsible for preparing and delivering data to end consumers - people or systems. At this level the consumer's requirements are taken into account as much as possible, both logical (conceptual) and physical. The service should provide exactly what is needed: no more, no less.

If the consumer is an external system, then, as a rule, it dictates the data structures it needs and the rules for collecting the information. A good approach is one in which the consumer is responsible for collecting the data correctly: the warehouse prepares and populates the mart and provides the means for incremental collection (meta-attribute marking for subsequent delta extraction), and the consuming system then manages, and is responsible for, how it uses that mart. There are peculiarities, though: when the consuming system has no active data collection component, either an external component is needed to perform the integrating function, or the warehouse itself acts as an "integration platform" and ensures the correct incremental delivery of data onward, outside the warehouse. Many nuances surface here, and the rules of interface interaction must be thought through and understood by both parties (as always with integration). As a rule, routine cleanup/archiving is applied to such marts (it is rarely necessary to keep this "transit data" for long).

Of greatest importance from the analytical standpoint are the marts "for people" - more precisely, for the BI tools they work with.
There is, however, a category of "especially advanced users" - analysts and data scientists - who need neither BI tools nor routine processes for feeding external specialized systems. They need some kind of "common mart" and their own "sandbox" where they can create tables and transformations at their discretion. In this case the warehouse's responsibility is to populate these common marts in accordance with the regulations.
Separately, we can single out consumers such as data mining tools - deep data analysis. They have their own data preparation requirements and are also used by data scientists. For the warehouse, the task again reduces to supporting a service that loads certain marts in an agreed format.

But let's return to the analytical marts - they are what interest warehouse designers in this data layer.
In my opinion, the best time-tested approach to designing data marts - the one that virtually all BI platforms are now "tuned" for - is Ralph Kimball's approach, known as dimensional modeling. There are a great many publications on the subject: the basic rules can be found, for example, in the writings of Margy Ross, and of course in the work of the guru of dimensional modeling himself. Another helpful resource is Kimball's design tips.
The dimensional approach to building marts has been described and worked out so well - both by the method's evangelists and by the leading software vendors - that there is no point dwelling on it here in any detail; the original source is always preferable.

I would like to place only one emphasis. "Reporting and analytics" comes in different forms. There is "heavy reporting": pre-ordered reports generated as files and delivered to users over the provided delivery channels. And there are information panels - BI dashboards - which are essentially web applications, and the response time requirements for them are the same as for any other web application. That means the normal refresh time for a BI panel is seconds, not minutes. It is important to keep this in mind when designing the solution. How do we achieve it? The standard optimization method: we look at what the response time is made of and at what we can influence. Where does most of the time go? Into physical (disk) reads from the database and into transferring data over the network. How do we reduce the volume of data read and transferred per request? The answer is obvious and simple: either aggregate the data or apply a filter to the large fact tables participating in the query, and exclude joins of large tables with each other (references to fact tables should go only through dimensions).
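A small sketch of this rule - "filter through dimensions and aggregate, do not join large tables with each other" - on a toy star schema. The schema and the data are invented; the point is only the shape of the query a BI panel should generate.

    import sqlite3

    # Sketch of the "filter through dimensions + aggregate" rule for BI marts.
    # Schema and data are invented; the query touches the large fact table only
    # through a dimension filter and returns aggregates, not row-level detail.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, calendar_month TEXT);
        CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales  (date_sk INTEGER, product_sk INTEGER, amount NUMERIC);

        INSERT INTO dim_date    VALUES (1, '2024-01'), (2, '2024-02');
        INSERT INTO dim_product VALUES (10, 'cakes'), (11, 'pies');
        INSERT INTO fact_sales  VALUES (1, 10, 100), (1, 11, 40), (2, 10, 70);
    """)

    query = """
        SELECT d.calendar_month, p.category, SUM(f.amount) AS total_amount
        FROM fact_sales f
        JOIN dim_date    d ON d.date_sk    = f.date_sk
        JOIN dim_product p ON p.product_sk = f.product_sk
        WHERE d.calendar_month = :month          -- filter set by a BI panel control
        GROUP BY d.calendar_month, p.category    -- aggregate instead of returning detail
    """
    for row in conn.execute(query, {"month": "2024-01"}):
        print(row)   # ('2024-01', 'cakes', 100), ('2024-01', 'pies', 40)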

What is BI for? Why is it convenient? Why is the dimensional model effective?
BI allows the user to perform so-called ad hoc queries. What does that mean? It means we do not know the exact query in advance, but we do know which indicators, in which dimensions, the user may request. The user forms such a query by selecting the appropriate BI filters. The task of the BI developer and the mart designer is to build the application logic so that the data is either filtered or aggregated, avoiding the situation where too much data is requested and the application "hangs". Users usually start with aggregated figures and then drill down into more detailed data, setting the necessary filters along the way.

It is not always enough simply to build the "right star" to get a structure convenient for BI. Sometimes you need to apply denormalization somewhere (while keeping an eye on how it will affect loading), somewhere to build secondary marts and aggregates, and somewhere to add indexes or projections (depending on the DBMS).

Thus, through “trial and error”, you can get a structure that is optimal for BI - which will take into account the features of both the DBMS and the BI platform, as well as the user's requirements for data presentation.
If we take the data from the "core", then this reworking of the marts is local in nature and in no way affects the complex processing of the primary data received directly from the source systems - we merely "reshape" the data into a format convenient for BI. And we can afford to do this many times, in different ways, for different requirements. It is much easier and faster to do this on top of the core data than to assemble it from the "primary" data (whose structure and rules, as we know, can also "drift").

Service layer

The service layer (Service Layer) is responsible for implementing the common (service) functions that can be used to process data in the various warehouse layers: load management, data quality management, problem diagnostics and monitoring tools, and so on.
The presence of this layer provides transparency and structure to the data flows in the warehouse.

This layer includes two data storage areas:

  • metadata area - used for the data loading control mechanism;
  • data quality area - to implement off-line data quality checks (i.e. those that are not built directly into ETL processes).
The load management process can be built in different ways. One possible approach is this: we split the entire set of warehouse tables into modules. A module may include tables of only one layer. The tables in each module are loaded by a separate process; let's call it a control process. The control process is launched on its own schedule. It orchestrates calls to atomic processes, each of which loads one target table, and it also contains some common steps.
Obviously, it is easy enough to divide the staging tables into modules - by source system, or rather by their connection points. For the core this is harder, because there we have to ensure data integrity, which means taking dependencies into account - i.e. there will be conflicts that need to be resolved, and there are different ways to resolve them. One option is sketched below.
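Here is a minimal sketch of such a control process for one module, with dependency-aware ordering of the atomic loads. The table names, the dependency map and the error handling are illustrative assumptions; a real orchestrator would read this metadata from the service layer rather than hard-code it.

    from graphlib import TopologicalSorter  # Python 3.9+

    # Sketch of a control process for one module: it orders the atomic loads by
    # their dependencies and runs them, stopping the module on a critical error.
    def load_table(table: str) -> None:
        print(f"loading {table}")          # placeholder for the real atomic load process

    # target table -> tables it depends on (within the core, dependencies matter)
    dependencies = {
        "core_customer": set(),
        "core_account":  {"core_customer"},
        "core_balance":  {"core_account"},
    }

    def run_control_process(deps: dict) -> None:
        for table in TopologicalSorter(deps).static_order():
            try:
                load_table(table)
            except Exception as exc:        # classify severity in a real implementation
                print(f"critical error in {table}: {exc}; stopping the module")
                raise

    run_control_process(dependencies)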

An important point in load management is working out a unified approach to error handling. Errors are classified by level of criticality. When a critical error occurs, the process must stop as soon as possible, because its occurrence signals a significant problem that can lead to data corruption in the warehouse. So load management is not only about starting processes but also about stopping them - and about preventing an untimely start (by mistake).

A special metadata structure is created for the service layer to work. This area will store information about loading processes, loaded data sets, checkpoints that are used to maintain the increment (which process has read up to which point) and other service information necessary for the system to function.
It is important to note that all target tables in all layers are marked with a special set of meta-fields, one of which is the id of the process that updated the row. For tables inside the warehouse, this process marking provides a unified way to extract the delta of changes later. When loading data into the primary data layer the situation is more complicated - the delta extraction algorithm can differ for different loaded objects. On the other hand, the logic of processing the accepted changes and applying them to the target tables of the core and the marts is much more complex than for staging, where everything is fairly trivial and easy to parameterize into reusable typical steps (procedures).
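Below is a minimal sketch of this incremental extraction between layers: reading only the rows written by processes the consumer has not yet seen, with a checkpoint per reader. The names (stg_account, load_process_id) and the in-memory checkpoint store are assumptions for the example; in practice the checkpoints live in the service layer's metadata area.

    import sqlite3

    # Sketch: each row carries the id of the load process that wrote it, and the
    # reader keeps a checkpoint of the last process id it has already consumed.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE stg_account (account TEXT, balance NUMERIC, load_process_id INTEGER);
        INSERT INTO stg_account VALUES ('A1', 100, 1), ('A2', 5, 1), ('A1', 120, 2);
    """)

    checkpoints = {}   # (reader, table) -> last load_process_id already consumed

    def read_delta(reader: str, table: str):
        last_id = checkpoints.get((reader, table), 0)
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE load_process_id > ?", (last_id,)
        ).fetchall()
        if rows:
            # advance the checkpoint only once the delta has been processed successfully
            checkpoints[(reader, table)] = max(r["load_process_id"] for r in rows)
        return rows

    print(len(read_delta("core_loader", "stg_account")))   # 3 rows on the first pass
    print(len(read_delta("core_loader", "stg_account")))   # 0 - nothing new since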

I do not set the task here to fully cover this topic - the organization of loading - I only place accents that are worth paying attention to.
The approach described above is just one option. It is quite adaptive, and its "conceptual prototype" was the Toyota assembly line and the just-in-time system. That is, we move away from the widespread paradigm of exclusively "nightly data loading" and load in small portions during the day, as the data becomes ready in the various sources: whatever has arrived gets loaded. Many processes run in parallel, and the "hot tail" of fresh data constantly "flickers" and evens out after a while. We have to take this feature into account and, if necessary, build custom mart "slices" where everything is already consistent. In other words, it is impossible to achieve both timeliness and consistency (integrity) at the same time; we need a balance - in one place one matters more, in another the other.

It is extremely important to provide logging and monitoring facilities. A good practice is to use typed events, for which you can set parameters and configure a notification system - a subscription to certain events. It is crucial that when the system administrator's intervention is required, they learn about it as early as possible and receive all the necessary diagnostic information. The logs can also be used for after-the-fact problem analysis and for investigating incidents, including data quality incidents.

Designing and maintaining the warehouse data model

Why is it important to pay attention to data model design when developing any system that involves a database (and a warehouse especially)? Why not just sketch a set of tables anywhere - even in a text editor? Why do we need these pictures?
Oddly enough, even experienced developers raise such questions.
Actually, yes - nothing prevents you from sketching out some tables and starting to use them. If... if, at the same time, the developer holds in their head (!) a coherent overall picture of the structure they are sculpting. But what if there are several developers? What if someone else will use these tables? And what if time passes, the person leaves this area, and then returns to it again?

Can you figure it out without a model? In principle, yes: figure it out, sketch "pictures on a piece of paper", and poke around in the data itself. But it is much easier, clearer and faster to use a ready-made artifact - the data model - and to understand the "logic of its structure", i.e. to have common rules of the game.

And the most important thing is not even that. Most importantly, when designing a model we are forced (simply without options!) to study the subject area more closely and deeply - the specifics of the data structure and of its use in various business cases. The questions we would easily "push aside" as complex or "fuzzy" by just throwing tables together without trying to design a model, we are now forced to pose and resolve during analysis and design - not later, when we are building reports and figuring out every time "how to reconcile the irreconcilable" and "reinventing the wheel".

This approach is one of those engineering practices that make it possible to create antifragile systems. Since they are understandable, transparent, easy to develop, and their “boundaries of fragility” are immediately visible, one can more accurately assess the “scale of the disaster” when new requirements appear and the time required for a redesign (if necessary).
Thus, the data model is one of the main artifacts that must be maintained during the development of the system. In a good way, it should be “on the table” for every analyst, developer, etc. – all those involved in system development projects.

Designing data models is a separate and very extensive topic. There are two main approaches in warehouse design.
For the core, the "entity-relationship" approach works well - a normalized (3NF) model built from a study of the subject area, or more precisely of its selected part. This is where the same "corporate model" discussed above comes into play.

For designing the analytical marts, the dimensional model is suitable. This approach maps well onto how business users think: it is a model that is simple and convenient for human perception, because people operate with understandable, familiar concepts - metrics (indicators) and the dimensions by which they are analyzed. This also lets us build the requirements-gathering process simply and clearly: we draw up a set of "dimension and indicator matrices" while talking with representatives of the various departments, and then bring it all together into one structure, the "analysis model" - we form the "dimension bus" and define the facts on it. Along the way we work out hierarchies and aggregation rules. A sketch of such a matrix follows below.
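A possible sketch of such a "dimension and indicator matrix" (bus matrix) in code form. The facts and dimensions here are purely illustrative, and in practice the matrix usually lives in a spreadsheet or a modeling tool rather than in code.

    from collections import Counter

    # Sketch of a "dimensions and indicators matrix" (bus matrix) collected from
    # interviews with different departments; entity names are purely illustrative.
    bus_matrix = {
        #  fact (indicator group)   : dimensions it is analysed by
        "sales_amount":      {"date", "product", "store", "customer"},
        "loan_applications": {"date", "product", "branch", "customer"},
        "stock_balance":     {"date", "product", "store"},
    }

    # Conformed dimensions - those shared by several facts - form the "dimension bus"
    # of the analysis model and are the first candidates for unified reference data.
    usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
    conformed = [dim for dim, n in usage.items() if n > 1]
    print(sorted(conformed))   # ['customer', 'date', 'product', 'store']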

From there it is very easy to move on to the physical model, adding optimization elements that take into account the features of the DBMS. For Oracle, for example, this would be partitioning, a set of indexes, and so on. For Vertica, other techniques are used: sorting, segmentation, partitioning.
Targeted denormalization may also be required - when we deliberately introduce redundancy into the data to improve query performance at the cost of complicating data updates (because the redundancy must be taken into account and maintained during loading). To improve performance we may also have to create additional aggregate tables, or use additional DBMS features such as projections in Vertica.
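As a hedged illustration of how one logical table can map onto different physical optimizations, the sketch below assembles two simplified DDL variants as plain strings; the table, columns and clauses are approximate examples rather than verified production syntax.

```python
# One logical fact table, two DBMS-specific physical variants (approximate syntax).
COLUMNS = "sale_date DATE, product_id INT, store_id INT, amount NUMERIC(18,2)"

oracle_ddl = f"""
CREATE TABLE f_sales ({COLUMNS})
PARTITION BY RANGE (sale_date)
(PARTITION p_2023 VALUES LESS THAN (DATE '2024-01-01'))
"""  # plus a set of indexes, e.g. on product_id and store_id

vertica_ddl = f"""
CREATE TABLE f_sales ({COLUMNS})
ORDER BY sale_date, product_id
SEGMENTED BY HASH(store_id) ALL NODES
"""  # Vertica relies on sort order, segmentation and projections instead of indexes

print(oracle_ddl, vertica_ddl, sep="\n")
```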

So, when modeling warehouse data, we actually solve several problems:

  • building a conceptual (logical) model of the core - system and business analysis, studying the subject area, going into detail and taking into account the nuances of "live data" and its use in the business;
  • building an analysis model, and then a conceptual (logical) model of the marts;
  • building physical models - managing data redundancy and optimizing for a particular DBMS for queries and data loading.
When developing conceptual models, we may not take into account the features of a particular DBMS for which we are designing a database structure. Moreover, we can use one conceptual model to create several physical ones - for different DBMS.

Let's summarize.

  • A data model is not a set of "pretty pictures", and designing it is not the process of drawing them. The model reflects our understanding of the subject area, and compiling it is the process of studying and researching that area. This is where the time goes - not on "drawing and coloring".
  • A data model is a design artifact, a way to share information in a structured way between team members. To do this, it must be understandable to everyone (this is provided by the notation and explanation) and accessible (published).
  • The data model is not created once and frozen, but is created and developed in the process of system development. We ourselves set the rules for its development. And we can change them if we see how to make them better, simpler, more efficient.
  • The data model (physical) allows you to consolidate and use a set of best practices aimed at optimization - i.e. use the techniques that have already worked for this DBMS.

Features of data warehouse projects


Let's dwell on the features of the projects within which a company builds and develops data warehouses, and look at them from the point of view of the architectural aspect: why it is important to build the architecture of such projects from the very beginning. It is the presence of a well-thought-out architecture that gives a data warehouse project flexibility, allows work to be distributed effectively between performers, and makes the result easier to predict and the process more predictable.

Data Warehouse is custom software

A data warehouse is always a "custom development", not a boxed solution. Yes, there are industry-specific BI applications that include a reference data model, pre-configured ETL processes from common sources (for example, ERP systems), and a set of typical BI dashboards and reports. But in practice a warehouse is rarely implemented as a "box". I have been working with warehouses for about 10 years and have never seen such a story. There are always nuances associated with the unique features of the company - both the business and the IT landscape. Therefore, it is somewhat reckless to hope that the architecture will be provided by the "vendor" supplying the solution. The architecture of such systems often matures within the organization itself, or is formed by specialists of the contractor company that is the main contractor for the project.

The data warehouse is an integration project

The data warehouse loads and processes information from many source systems, and in order to maintain "friendly relations" with them you need to treat them extremely carefully. Among other things, it is necessary to minimize the load on the source systems, take into account their "availability windows", choose interaction interfaces that suit their architecture, and so on. Then the warehouse will be able to collect data as early as possible and with the required frequency. Otherwise, you will be "switched over" to a backup replica, which is not refreshed with the most operational frequency.
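A minimal sketch (the window times and source names are invented) of how a loading process could respect a source system's agreed "availability window" before extracting data.

```python
from datetime import datetime, time
from typing import Optional

# Hypothetical extraction windows agreed with the source-system owners.
EXTRACTION_WINDOWS = {
    "billing": (time(1, 0), time(5, 0)),    # may be queried only between 01:00 and 05:00
    "crm":     (time(22, 0), time(23, 59)),
}

def within_window(source: str, now: Optional[datetime] = None) -> bool:
    """Return True if the source may be queried at the current moment."""
    start, end = EXTRACTION_WINDOWS[source]
    current = (now or datetime.now()).time()
    return start <= current <= end

def extract(source: str) -> None:
    if not within_window(source):
        print(f"{source}: outside the agreed window, extraction postponed")
        return
    print(f"{source}: extracting delta...")   # the actual extraction logic would go here

extract("billing")
```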
In addition, the "human factor" must be taken into account. Integration is not only the interaction of machines. It is also communication between people.

Data Warehouse is a team project


In a large company, such a system can rarely be done solely by one team. As a rule, several teams work here, each of which solves a specific problem.

The architecture should provide the possibility of organizing their parallel work, while maintaining its integrity and avoiding duplication of the same functionality in different places, by different people. In addition to unnecessary labor costs, such duplication can lead to discrepancies in the data later.

In addition, when so many people and teams, often geographically dispersed, are involved in developing the system, the question inevitably arises of how to build communication and information exchange between them. The more standard and understandable the approaches and practices used, the easier, more convenient and more efficient it is to organize such work. Among other things, it is worth thinking about the set of "working artifacts", among which the number one artifact for data warehouses is the data model (see the previous section).

The data warehouse has a longer lifespan compared to other systems

Let me clarify - the statement is true for a "live", working storage, integrated with key sources, possessing historical data and providing information and analytical services to many divisions of the company.

What grounds do I have for believing so?
Firstly, building a storage is a very resource-intensive process: in addition to the actual costs of equipment, licenses for the necessary technological software and development, almost all systems and divisions of the company are also involved in this. To repeat this whole process from scratch again is a very bold undertaking.

Secondly, if the warehouse has the right architecture, it can easily survive changes of source systems, the emergence of new requirements from end users, and growth in data volumes.
If the architecture is correct and the information flows are transparent, then such a system can be developed for a long time without the risk of grinding to a halt when making changes because the impact is hard to assess.

Gradual iterative development

The last thing the Customer would like, getting involved in the story with the storage, is to freeze his requirements for a year or two, until the full corporate data model is designed, all sources are connected in full, etc.

The data warehouse in the eyes of Customers often looks like an absolute monster - the tasks, goals and horizon of the system development are so voluminous. And often the Customer is afraid that “at the expense of his budget” the IT department will solve some “own tasks”. And again, we are faced with the issue of interaction between people and the ability to calmly state one's position and negotiate.

Competent architectural approaches allow you to develop the system iteratively, increasing the functionality gradually, without going into “development” for several years before starting to give results.

Although it should be noted that "miracles do not happen" - the "start" also takes time. For warehouses it can be quite long, since these are large volumes of data, including historical data for old periods when the rules for processing information may have differed from the current ones. Therefore, sufficient time is required for analysis, for interaction with source systems, and for a series of "trials and errors", including load tests on real data.

Data Warehouses - "multi-project story"

It is difficult to single out a single business customer for a data warehouse. And it is believed (not without reason) that the key factor in the success of the storage project is the support of the company's management - directly the first person.
A repository is rarely built and developed within a single project. As a rule, there are various needs for data consolidation and analytics, behind them are different Customers and user groups. Therefore, the repository is often developed within the framework of several parallel projects.

Balance of innovation and proven solutions

The topic of data warehouses is quite "ancient" (if such a word is applicable to such a young industry as IT) and rather conservative. Nevertheless, progress does not stand still, and the limitations that previously existed because of expensive and slow disks, expensive memory, etc., have now been removed. At the same time, it is time to reconsider some architectural approaches - and this applies both to technological platforms and to the architecture of the applied systems that are built on them.

It is important to strike a balance here and maintain a fairly thrifty approach both to resources and to the stored information. Otherwise, you can very quickly turn the warehouse into a semi-structured "garbage dump" which, if it can be sorted out at all, will take quite a lot of effort.
Yes, we have more opportunities now, but this does not mean that we should reject all the practices that have been developed and tested by time, whose use is clear, and "go off the rails" led only by the foggy ghost of "innovation".
Keeping a balance means using new methods and approaches where they open up new opportunities, while continuing to use old proven ones for the pressing problems that no one has cancelled.
What can we do as developers and designers of applied solutions? First of all, know and understand the technological changes in the platforms we work on - their capabilities, features and limits of application.

Let's look at the DBMS as the most critical and important technology platform for a warehouse.
Recently there has been a clear drift of relational databases, originally created as "universal", towards specialization. For a long time leading vendors have been releasing different editions for applications of different classes (OLTP, DSS & DWH). In addition, additional capabilities for working with text, geo-data, etc. have appeared.

But the matter did not end there - products began to appear that were focused from the outset on a certain class of tasks, i.e. specialized DBMS. They may or may not use the relational model. What matters is that they are initially "tuned" not just for storing and processing "business information" in general, but for specific tasks.

Apparently, centralization and specialization are two complementary trends that periodically replace each other, ensuring development and balance - as do evolutionary (gradual) development and cardinal changes. Back in the 1990s, Michael Stonebraker was one of the authors of the Third-Generation Database System Manifesto, which clearly stated the idea that the world did not need another revolution in the database world. However, about 10 years later he published works announcing the prerequisites for the beginning of a new era in the DBMS world - based precisely on specialization.
He points out that widespread universal DBMS are built on a "one-size-fits-all" architecture that takes into account neither the changes in hardware platforms nor the division of applications into classes for which a better solution can be devised than implementing universal requirements.
He then began developing a number of projects in line with this idea. One of them is C-Store, a columnar DBMS designed in a shared-nothing (SN) architecture and created specifically for data-warehouse-class systems. This product was later commercialized as HP Vertica.

It seems that now the topic of the development of data warehouses has slipped into a new round of development. New technologies, approaches and tools are emerging. Their study, testing and reasonable application allows us to create really interesting and useful solutions. And bring them to implementation, enjoying the fact that your developments are used in real work and bring benefits.

Epilogue

In preparing this article, I tried to focus primarily on architects, analysts and developers who directly work with data warehouses. But it turned out that I inevitably “took the topic a little wider” - and other categories of readers fell into the field of vision. Some points will seem controversial, some are not clear, some are obvious. People are different - with different experiences, backgrounds and positions.
For example, typical questions of managers are “when to attract architects?”, “When should I do architecture?”, “Architecture – won't it be too expensive?” sound rather strange for us (developers, designers), because for us the architecture of the system appears with its birth - it doesn’t matter if we realize it or not. And even if there is no formal role of an architect in the project, a normal developer always “turns on his internal architect”.

In the grand scheme of things, it doesn't matter who the architect is, what matters is that someone asks these questions and explores the answers to them. If the architect is clearly singled out, this only means that he is primarily responsible for the system and its development.
Why did the topic of “antifragility” seem relevant to me in relation to this subject?

“The uniqueness of antifragility is that it allows us to work with the unknown, to do something in conditions where we do not understand what exactly we are doing – and to succeed”/Nassim N.Taleb/
Therefore, the crisis and a high degree of uncertainty are not an excuse for the lack of architecture, but factors that reinforce its need.




THEME V CORPORATE DATABASES

LECTURE 8

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet / Intranet technologies and corporate database access solutions.

V.1. DATA ORGANIZATION IN CORPORATE SYSTEMS. CORPORATE DATABASES

The corporate database is the central link of the corporate information system and makes it possible to create a single information space of the corporation (Fig. 1.1).

There are various definitions of databases.

A database (DB) is understood as a set of information logically related in such a way as to constitute a single body of data stored in the storage devices of a computer. This body of data serves as the input for the tasks solved in the course of operation of automated control systems, data processing systems, and information and computing systems.

You can briefly formulate the term database as a collection of logically related data intended for sharing.

A database can also be understood as a collection of data stored together with minimal redundancy such that it can be used optimally for one or more applications.

The purpose of creating databases as a form of data storage is to build a data system that does not depend on the adopted algorithms (software), the technical means used, or the physical location of the data in the computer. A database assumes multi-purpose use (several users, many forms of documents and queries of one user).

Basic database requirements:

  • Completeness of data presentation. The data in the database should adequately represent all information about the object and should be sufficient for the data processing system (ODS).
  • Database integrity. The data must be preserved during processing by the ODS and in any situations that arise in the course of work.
  • Flexibility of the data structure. The database should allow data structures to be changed without violating its integrity and completeness when external conditions change.
  • Realizability. There must be an objective representation of the various objects, their properties and relationships.
  • Availability. Differentiation of access to data must be provided.
  • Minimal redundancy. The database should have minimal redundancy in representing data about any object.

Knowledge is understood as a set of facts, patterns and heuristic rules with which you can solve a problem.

A knowledge base (KB) is a collection of databases and the rules in use, obtained from decision makers. The knowledge base is an element of expert systems.

Different ways of presenting data should be distinguished.

Physical data is data stored in the computer's memory.

The logical representation of data corresponds to the user's view of the physical data. The difference between a physical and a corresponding logical representation of data is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that combines, in one form or another, all the necessary data and knowledge about the automated organization. In corporate information systems the concept of integrated databases has become widespread, in which the principle of single entry and multiple use of information is implemented.

Fig. 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases can be concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored in the storage devices of one computer. Fig. 1.2 shows a diagram of a server application for accessing databases on various platforms.

Fig.1.2. Diagram of a heterogeneous centralized database

The centralization of information processing made it possible to eliminate such shortcomings of traditional file systems as incoherence, inconsistency and data redundancy. However, as databases grow, and especially when used in geographically dispersed organizations, problems arise. For example, for concentrated databases located in a telecommunications network node, through which various departments of an organization access data, with an increase in the volume of information and the number of transactions, the following difficulties arise:

  • Large data exchange flow;
  • High network traffic;
  • Low reliability;
  • Low overall performance.

Although it is easier to ensure the security, integrity, and consistency of information during updates in a concentrated database, these problems create certain difficulties. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • A higher degree of processing concurrency due to load sharing;
  • Better use of data locally when performing remote (distributed) queries;
  • Lower costs;
  • Easier management of local databases.

The costs of creating a network with workstations (small computers) at its nodes are much lower than the costs of creating a similar system using a mainframe. Figure 1.3 shows a logical diagram of a distributed database.

Fig.1.3. Distributed corporate database.

We give the following definition of a distributed database.

A distributed database is a collection of information and files (relations) stored in different nodes of an information network and logically linked in such a way as to constitute a single body of data (the link can be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected but physically located on several machines that are part of the same computer network.

The most important requirements for the characteristics of a distributed database are as follows:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • Portability;
  • Location transparency;
  • Autonomy of distributed database nodes (Site Autonomy);
  • Processing of distributed requests;
  • Execution of distributed transactions;
  • Support for a homogeneous security system.

Location transparency allows users to work with databases without knowing anything about their location. The autonomy of distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) that accesses objects (tables or views) of different databases. When distributed transactions are executed, concurrency control is exercised over all the databases involved. Oracle7 uses two-phase commit technology to perform distributed transactions.

The databases that make up a distributed database need not be homogeneous (i.e. run by the same DBMS) or run on the same operating system environment and/or on the same type of computers. For example, one database could be an Oracle database on a SUN computer running SUN OS(UNIX), a second database could be run by a DB2 DBMS on an IBM 3090 mainframe running an MVS operating system, and a third database could be run by a SQL/DS DBMS also on IBM mainframe, but with a VM operating system. Only one condition is obligatory - all machines with databases must be accessible over the network they are part of.

The main task of a distributed database is to distribute data over the network and provide access to it. There are the following ways to solve this problem (a small sketch follows the list):

  • Each node stores and uses its own set of data, which is available for remote queries. Such a distribution is called partitioned.
  • Some data that is frequently used at remote sites may be duplicated. Such a distribution is called partially duplicated.
  • All data is duplicated in each node. Such a distribution is called fully duplicated.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), and the resulting subsets are stored in different nodes along with the unsplit data. Such a distribution is called split (fragmented).
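A toy Python sketch of the last option (the relation and node names are invented): a horizontal split selects subsets of records, a vertical split selects subsets of attributes together with the key.

```python
# Toy illustration of horizontal and vertical fragmentation of one relation.
employees = [
    {"id": 1, "name": "Ivanov",  "dept": "HQ",     "salary": 1000},
    {"id": 2, "name": "Petrov",  "dept": "BRANCH", "salary": 900},
    {"id": 3, "name": "Sidorov", "dept": "BRANCH", "salary": 950},
]

# Horizontal split: each node stores the records it uses most.
node_hq     = [r for r in employees if r["dept"] == "HQ"]
node_branch = [r for r in employees if r["dept"] == "BRANCH"]

# Vertical split: a subset of attributes (plus the key) is stored separately,
# e.g. salaries are kept only on the payroll node.
payroll_node = [{"id": r["id"], "salary": r["salary"]} for r in employees]
public_node  = [{"id": r["id"], "name": r["name"], "dept": r["dept"]} for r in employees]

print(node_hq, node_branch, payroll_node, public_node, sep="\n")
```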

When creating a distributed database at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual scheme for the entire network. This will provide logical data transparency for the user, as a result of which he will be able to form a request to the entire database, being at a separate terminal (it works, as it were, with a centralized database).
  • A schema is needed to locate data on the network. This will provide transparency in data placement so that the user does not have to specify where to forward the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The problem of heterogeneity is relatively easy to solve if the distributed database is heterogeneous in terms of hardware but homogeneous in terms of software (the same DBMS in the nodes). If different DBMS are used in the nodes of a distributed system, means of converting data structures and languages are needed. This should provide transparency of the transformation in the distributed database nodes.
  • It is necessary to solve the problem of managing dictionaries. To provide all kinds of transparency in a distributed database, programs that manage numerous dictionaries and reference books are needed.
  • It is necessary to define methods for executing queries in a distributed database. Methods for executing queries in a distributed database differ from similar methods in centralized databases, since individual parts of queries must be executed at the location of the corresponding data and transfer partial results to other nodes; at the same time, coordination of all processes should be ensured.
  • It is necessary to solve the problem of parallel execution of queries. In a distributed database, a complex mechanism for managing concurrent processing is needed, which, in particular, must ensure synchronization when information is updated, which guarantees data consistency.
  • The need for a developed methodology for the distribution and placement of data, including splitting, is one of the main requirements for a distributed database.

One of the actively developing new areas of computer systems architecture, which provides a powerful tool for non-numerical information processing, is database machines. Database machines are used to solve non-numerical tasks such as storing, searching and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, different content is put into the concept of data in numerical and non-numerical processing. Numerical processing uses objects such as variables, vectors, matrices, multi-dimensional arrays, constants, and so on, while non-numerical processing uses objects such as files, records, fields, hierarchies, networks, relations, and so on. Non-numerical processing is concerned directly with information about objects (for example, a particular employee or group of employees) rather than with the employee file itself: it does not index the employee file to select a particular person, but is interested in the content of the desired record. Huge volumes of information are usually subjected to non-numerical processing. In various applications, for example, the following operations can be performed on such data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all customers;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find the description of the desired contract in a file containing legal documents;
  • view all files containing descriptions of patents and find a patent (if any) similar to the one proposed again.

To implement database machines, parallel and associative architectures are used as an alternative to the uniprocessor von Neumann structure, allowing large amounts of information to be processed in real time.

Database engines are gaining importance in connection with the exploration and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Data warehouses. Today many recognize that most companies already operate several databases and, in order to work with information successfully, they need not just different types of databases but different generations of DBMS. According to statistics, each organization uses on average 2.5 different DBMS. The need to "isolate" the business of companies - or rather, the people engaged in this business - from the technological features of databases, and to provide users with a single view of corporate information regardless of where it is physically stored, has become obvious. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is to create a single logical representation of the data contained in different types of databases - in other words, a single corporate data model.

A new round of DW development became possible thanks to the improvement of information technology in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, making it easy to build complex database queries. Various middleware software provided communication between different types of databases, and, finally, information storage devices fell sharply in price.

The structure of a corporation may include a data bank.

A data bank is a functional and organizational component of automated control systems and information and computing systems that provides centralized information support for a group of users or for a set of tasks solved in the system.

A data bank is considered an information and reference system, the main purpose of which is:

  • in the accumulation and maintenance in working condition of a set of information constituting the information base of the entire automated system or a certain set of tasks solved in it;
  • in the issuance of the data required by the task or the user;
  • in providing collective access to stored information;
  • in ensuring the necessary management of the use of information contained in the infobase.

Thus, a modern data bank is a complex software and hardware complex, which includes technical, system and network tools, databases and DBMS, information retrieval systems for various purposes.

V.2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

DBMS - a set of software and language tools designed to create, maintain and use databases.

The database management system provides data processing systems with access to databases. As already noted, DBMS play an important role in the creation of corporate information systems and an especially important role in the creation of information systems that use distributed information resources based on modern network computer technologies.

A distinguishing feature of modern DBMS is their support for technologies such as:

  • Client/server technology;
  • Database languages: the schema definition language (SDL - Schema Definition Language), the data manipulation language (DML - Data Manipulation Language), the integrated SQL language (Structured Query Language), QBE (Query-By-Example) and QMF (Query Management Facility) - an advanced peripheral tool for query specification and report generation for DB2, etc.;
  • Direct management of data in external memory;
  • Memory buffer management;
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW;
  • Data protection and integrity: the use of the system is allowed only to users who have the right to access the data, and when users perform operations on data the consistency of the stored data (integrity) is maintained, which is important in corporate multi-user information systems;
  • Logging (journaling).

Modern DBMS must meet the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful support for the conceptual data model to display custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Data redundancy. Unlike file systems, a database must be a single set of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must prevent users from violating the integrity of the database.
  • Managing concurrent work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be performed in a certain order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS should support both centralized and distributed databases and thus become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two of the most significant features that determine the types of DBMS. According to them, the DBMS can be considered from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • "Desktop" DBMS. These products are primarily focused on working with personal data (desktop data). They have command sets for sharing common databases, but only small ones (of the small-office type). First of all, these are DBMS like Access, dBASE, Paradox, ExPro. Why do Access, dBASE, Paradox and ExPro have poor access to corporate data? The fact is that there is no easy way to overcome the barrier between personal and corporate data. The point is not even that the mechanism of a personal-data (or small-office) DBMS is focused on accessing data through numerous gateways and gateway products, etc. The problem is that these mechanisms typically involve full file transfers and lack extensive index support, resulting in queues to the server that practically bring large systems to a halt.
  • Specialized high-performance multi-user DBMS. Such DBMSs are characterized by the presence of a multi-user system kernel, a data manipulation language, and the following functions that are typical for developed multi-user DBMSs:
  • organizing a buffer pool;
  • the presence of a system for processing transaction queues;
  • the presence of mechanisms for multi-user data blocking;
  • transaction logging;
  • availability of access control mechanisms.

DBMS such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium and others provide a wide range of services for processing corporate databases.

When working with databases, the transaction mechanism is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements that is executed as a whole (all or nothing) and takes the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity . The transaction is executed as an atomic operation - either the entire transaction is executed, or the entire transaction is not executed.
  • (C) Consistency. A transaction moves a database from one consistent (consistent) state to another consistent (consistent) state. Within a transaction, database consistency can be broken.
  • (I) Isolation . Transactions of different users should not interfere with each other (for example, as if they were performed strictly in turn).
  • (D) Durability. If the transaction is completed, then the results of its work should be saved in the database, even if the system crashes at the next moment.

A transaction usually starts automatically from the moment the user connects to the DBMS and continues until one of the following events occurs (a small sketch follows the list below):

  • A COMMIT WORK command has been issued (to commit a transaction).
  • ROLLBACK WORK command issued.
  • The user has disconnected from the DBMS.
  • There was a failure of the system.
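Here is a small sketch using Python's built-in sqlite3 module: the transfer either commits as a whole or is rolled back, illustrating the "all or nothing" behaviour; the account table and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(amount: int) -> None:
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
        conn.commit()          # analogue of COMMIT WORK
        print("committed")
    except sqlite3.IntegrityError:
        conn.rollback()        # analogue of ROLLBACK WORK: nothing is applied
        print("rolled back")

transfer(30)     # succeeds, both updates are applied together
transfer(1000)   # violates the CHECK constraint, the whole transaction is undone
print(list(conn.execute("SELECT * FROM account")))
```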

For the user, a transaction usually has an atomic character; in fact, it is a complex mechanism of interaction between the user (application) and the database. Enterprise software uses real-time transaction processing systems (On-Line Transaction Processing, OLTP); in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a lot of information. These systems are designed (and appropriately optimized) for processing large amounts of data, complex transactions, and intensive read/write operations.

Unfortunately, the information stored in the databases of OLTP systems is not very suitable for use by ordinary users (because of the high degree of table normalization, specific data presentation formats, and other factors). Therefore, data from the various information pipelines is sent (that is, copied) to a warehouse for storage, sorting and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by data warehouses.

Delivery of information to the end user is handled by real-time analytical processing systems (On-Line Analytical Processing, OLAP), which provide extremely easy access to data through convenient tools for generating queries and analyzing results. In OLAP systems the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for the speed of data extraction and the collection of summarized information, and are aimed at ordinary users (they have an intuitive interface). If an OLTP system answers simple questions like "what was the level of sales of product N in region M in January 199x?", then OLAP systems are ready for more complex user requests, for example: "Give an analysis of sales of product N for all regions according to the plan for the second quarter compared with the previous two years."
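A hypothetical sketch of such an OLAP-style request using pandas (the data, product and region codes are invented): detailed rows are aggregated across the region and quarter dimensions.

```python
import pandas as pd

# Invented detailed (OLTP-like) data: one row per sale.
df = pd.DataFrame({
    "region":  ["M", "M", "K", "K", "M"],
    "product": ["N", "N", "N", "N", "N"],
    "quarter": ["2024Q1", "2024Q2", "2024Q1", "2024Q2", "2024Q2"],
    "sales":   [100, 120, 80, 95, 30],
})

# OLAP-style view: sales of product N by region and quarter.
report = pd.pivot_table(df[df["product"] == "N"],
                        values="sales", index="region",
                        columns="quarter", aggfunc="sum")
print(report)
```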

Client/server architecture

In modern distributed information processing systems, client/server technology takes center stage. In client-server architecture systems, data processing is divided between a client computer and a server computer, which communicate over a network. This division of data processing is based on the grouping of functions. Typically, a dedicated database server computer performs database operations, while application programs run on the client computers. Figure 2.1 shows a simple client-server system that includes a computer acting as a server and another computer acting as its client. Each machine performs different functions and has its own resources.

[Fig. 2.1 depicts a database server computer connected over a network to several IBM-compatible PC clients running applications.]

Fig. 2.1. Client-server architecture system

The main function of the client computer is to run the application (user interface and presentation logic) and communicate with the server when required by the application.

A server is an object (computer) that provides services to other objects at their request.

As the term implies, the main function of the server computer is to serve the needs of the client. The term "Server" is used to refer to two different groups of functions: a file server and a database server (hereinafter, these terms mean, depending on the context, either the software that implements these groups of functions, or computers with this software). File servers are not designed to perform database operations, their main function is to share files among several users, i.e. providing simultaneous access of many users to files on a computer - a file server. An example of a file server is Novell's NetWare operating system. The database server can be installed and run on a file server computer. Oracle DBMS in the form of NLM (Network Loadable Module) runs in a NetWare environment on a file server.

The local network server must have resources that correspond to its functional purpose and the needs of the network. Note that due to the orientation towards the open systems approach, it is more correct to speak of logical servers (meaning a set of resources and software tools that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is expedient to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the application programs using it.

One of the important server requirements is that the operating system in which the database server is hosted must be multitasking (and preferably, but not necessarily, multiuser). For example, the Oracle DBMS installed on a personal computer with an MS-DOS (or PC-DOS) operating system that does not meet the requirement for multitasking cannot be used as a database server. And the same Oracle DBMS installed on a computer with a multitasking (though not multiuser) OS / 2 operating system can be a database server. Many varieties of UNIX, MVS, VM, and some other operating systems are both multitasking and multiuser.

Distributed Computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on several machines for end users using various means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

At the core of client/server technology are such basic technologies as:

  • Technologies of operating systems, the concept of interaction of open systems, creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical user interface technologies ( GUI);
  • Etc.

Advantages of client-server technology:

  • The client/server technology allows computing on heterogeneous computing environments. Platform Independence: Access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL/DS, Oracle, Sybase.
  • Load balance between client and server.
  • Performing calculations where it happens most efficiently;
  • Provides efficient scaling capability;
  • Cross platform computing. Cross-platform computing is defined simply as the implementation of technologies in heterogeneous computing environments. The following options should be provided here:
  • The application must run on multiple platforms;
  • On all platforms, it should have the same interface and logic of work;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • It should have simple and consistent support.

Distributed Computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downscaling. Downscaling is the transfer of mainframe applications to small computer platforms.

  • Reduce infrastructure and hardware costs. Cost-Effective: The availability of low-cost computing hardware and the growing prevalence of local area networks make client-server technology more cost-effective than other data processing technologies. Equipment can be upgraded as needed.

  • Reducing the overall application execution time;
  • Reduced client memory usage;
  • Reducing network traffic.

  • Ability to work with multimedia: To date, a lot of programs for working with multimedia for PCs have been created. There are either no such programs for terminal-host configuration, or they are very expensive.
  • The ability to use more computing resources for database operations: since applications run on client computers, additional resources (compared to the terminal-host configuration), such as CPU and RAM, are freed up on the server computer for database operations.
  • Increased programmer productivity: programmer productivity is increased by using tools such as SQL*Forms and CASE, which allow applications to be developed faster than with programming languages such as C, PL/1 or COBOL.
  • Increasing end user productivity: Nowadays, many end users have embraced systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, etc.

The back-end interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. An illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

The SQL language. The high-level query language SQL (Structured Query Language) is used to implement queries to databases such as NMD, NDL and PJD, and has been adopted as a standard. The SQL language was originally adopted as the data language of IBM's software products and of the relational DBMS SYSTEM R from IBM. An important feature of SQL is that the same language is available through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of a number of built-in SQL language features provided specifically for building interactive applications, where an interactive application is a program written to support access to the database by an end user working at an interactive terminal. SQL provides functions for defining, manipulating and managing database data and is transparent to the user with respect to the underlying DBMS.
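The difference between the interactive interface and the application programming interface can be illustrated with Python's DB-API (sqlite3 is used here only as a stand-in; the table and values are invented): the statement and its parameters are assembled and executed at run time by the program rather than typed at a terminal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER, name TEXT, price REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "bolt", 0.5), (2, "nut", 0.2), (3, "washer", 0.1)])

# "Dynamic SQL" from the application's point of view: the statement and its
# parameters are supplied at run time, not entered at an interactive terminal.
max_price = 0.3
rows = conn.execute("SELECT name, price FROM product WHERE price <= ?",
                    (max_price,)).fetchall()
print(rows)   # [('nut', 0.2), ('washer', 0.1)]
```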

Fig. 2.3. Scheme for executing user requests to distributed databases.

The internal structure of databases is determined by the data models used. The conceptual model has more abstraction capabilities and richer semantics than external models. External models are often called syntactic or operational models, referring to the syntactic nature of the management and application as a means of user interaction with the database. In information modeling, there are various levels of abstraction, from the level of the conceptual model to the level of the physical data model, that affect the architecture of the DBMS.

The data model has three components:

  • A data structure that represents the database from the user's point of view.
  • Valid operations that can be performed on the data structure. It must be possible to work with this structure using various DDL and DML operations. A rich structure is worthless if you cannot manipulate its contents.
  • Constraints for integrity control. The data model must be provided with means to preserve its integrity and protect it. As an example, consider the following two constraints:
  • Each subtree must have a source node. Hierarchical databases cannot store child nodes without a parent node.
  • In a relational database, there cannot be identical tuples. For a file, this means that all records must be unique.

One of the most important characteristics of the DBMS is the ability to link objects.

There are the following types of links between objects (a sketch of the many-to-many case follows the list):

  • One-to-One (1:1). One object of one set can be associated with one object of another set.
  • One-to-Many (1:M). One object of one set can be related to many objects of another set.
  • Many-to-Many (M:N). One object of one set can be associated with many objects of another set, but at the same time, one object of another set can be associated with many objects of the first set.
  • Branched. One object of one set can be associated with objects of many sets.
  • Recursive. One object of a given set can be associated with an object of the same set.
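A small sketch (sqlite3, invented tables) of how the many-to-many (M:N) link is usually realized in a relational database through a junction table holding pairs of foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
-- The M:N link is stored as pairs of foreign keys in a junction table.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(id),
    course_id  INTEGER REFERENCES course(id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ivanov'), (2, 'Petrov');
INSERT INTO course  VALUES (10, 'Databases'), (20, 'Networks');
INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Which students take 'Databases'?
rows = conn.execute("""
    SELECT s.name FROM student s
    JOIN enrollment e ON e.student_id = s.id
    JOIN course c     ON c.id = e.course_id
    WHERE c.title = 'Databases'
""").fetchall()
print(rows)   # [('Ivanov',), ('Petrov',)]
```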

There are the following main data models:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET / INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the "client-server" architecture is that, in accordance with the concept of open systems, they are required to be mobile in the widest possible class of open systems hardware and software solutions. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempting to create systems that support all possible protocols leads to their overload with network details at the expense of functionality.

An even more complex aspect of this problem is related to the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, representation of numbers, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

A common solution to the problem of mobility for systems based on the client-server architecture is to rely on software packages that implement remote procedure call (RPC) protocols. With these tools, calling a service on a remote host looks like a normal procedure call. The RPC tools, which naturally contain all the information about the specifics of the local network equipment and network protocols, translate the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, RPC programs convert client data formats to intermediate machine-independent formats and then convert to server data formats. When passing response parameters, similar transformations are performed.
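A minimal sketch using Python's standard xmlrpc modules (the port and function are arbitrary): the client calls a remote procedure as if it were a local one, and the RPC layer marshals the arguments into a machine-independent intermediate format; the server is started in a background thread only to keep the example self-contained.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: expose an ordinary function as a remotely callable procedure.
server = SimpleXMLRPCServer(("localhost", 8099), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the call looks like a normal procedure call; conversion of the
# parameters to an intermediate format (XML) is done by the RPC layer.
proxy = ServerProxy("http://localhost:8099")
print(proxy.add(2, 3))   # -> 5

server.shutdown()
```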
