
Conceptual data model of a corporate data warehouse. The enterprise data model includes a description of the organization's subject areas, their entities, attributes, and relationships.

The purpose of the lecture

After studying the material of this lecture, you will know:

  • what an enterprise data model is;
  • how to convert an enterprise data model into a data warehouse model;
  • the essential elements of an enterprise data model;
  • the presentation layers of an enterprise data model;
  • the algorithm for converting an enterprise data model into a multidimensional data warehouse model;

and learn how to:

  • develop data warehouse models based on an organization's enterprise data model;
  • develop a star schema using CASE tools;
  • partition the tables of a multidimensional model using CASE tools.

Enterprise data model

Introduction

The core of any data warehouse is its data model. Without a data model, it is very difficult to organize data in a data warehouse. Therefore, data warehouse (DW) developers must spend time and effort developing such a model. The development of the DW model falls on the shoulders of the DW designer.

Compared with the design of OLTP systems, the methodology for designing a data warehouse has a number of distinctive features related to the orientation of the warehouse data structures toward analysis tasks and information support for the decision-making process. The data model of the data warehouse should provide an effective solution to these problems.

The starting point in the design of a data warehouse can be the so-called enterprise data model (also called the corporate data model, EDM), which is created in the process of designing an organization's OLTP systems. When designing an enterprise data model, an attempt is usually made to create, on the basis of business operations, a data structure that collects and synthesizes all the information needs of the organization.

Thus, the enterprise data model contains the information needed to build a DW model. Therefore, if such a model exists in the organization, the data warehouse designer can begin the design of the data warehouse by solving the problem of transforming the enterprise data model into a DW model.

Enterprise data model

How is the enterprise data model converted into a DW model? To solve this problem, the enterprise data model must first exist, i.e. it must be built and documented. You also need to understand what parts of this model should be transformed into the DW model, and how.

Let us clarify the concept of an enterprise data model. An enterprise data model is understood as a multi-level, structured description of the organization's subject areas, the data structures of those subject areas, business processes and business procedures, the data flows adopted in the organization, state diagrams, data-process matrices, and other model representations used in the organization's activities. Thus, in a broad sense, an enterprise data model is a set of models of various levels that characterize (model at some level of abstraction) the organization's activities; the content of the corporate model therefore depends directly on which model structures a given organization has included in it.

The main elements of an enterprise data model are:

  • description of the subject areas of the organization (definition of areas of activity);
  • relationships between the subject areas defined above;
  • information data model (ERD-model or entity-relationship model);
  • for each subject area, a description of:
    • entity keys;
    • entity attributes;
    • subtypes and supertypes;
    • relationships between entities;
    • attribute groupings;
    • relationships between subject areas;
  • functional model or business process model;
  • data flow diagrams;
  • state diagrams;
  • other models.

Thus, the enterprise data model contains the entities, attributes, and relationships that represent the information needs of the organization. Fig. 16.1 shows the main elements of the enterprise data model.
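
To make the structure of these elements more concrete, the following minimal sketch shows one possible way to catalogue an enterprise data model programmatically. The subject areas, entities, and attributes used here are hypothetical examples, not part of any real corporate model.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Attribute:
        name: str
        is_key: bool = False          # part of the entity key?

    @dataclass
    class Entity:
        name: str
        attributes: List[Attribute] = field(default_factory=list)
        supertype: Optional[str] = None  # name of the supertype entity, if any

    @dataclass
    class SubjectArea:
        name: str
        entities: List[Entity] = field(default_factory=list)
        related_areas: List[str] = field(default_factory=list)

    # A fragment of a hypothetical enterprise data model.
    edm = [
        SubjectArea(
            name="Order",
            entities=[
                Entity("Order", [Attribute("Order ID", is_key=True),
                                 Attribute("Order Date")]),
                Entity("Retail Order", supertype="Order"),
            ],
            related_areas=["Customer", "Product"],
        ),
    ]

    for area in edm:
        print(area.name, "->", ", ".join(area.related_areas))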

Presentation layers of the enterprise data model

The enterprise data model is subdivided into subject areas, which represent groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions, such as contract management, while others may group entities that describe products or services.

Each logical model must correspond to an existing subject area of the enterprise data model. If a logical model does not meet this requirement, a model defining the subject area must be added to the enterprise data model.

An enterprise data model usually has several presentation levels. The high level of the enterprise data model is a description of the organization's main subject areas and their relationships at the entity level. Fig. 16.2 shows a fragment of the top level of an enterprise data model.


Fig. 16.2. A fragment of the top level of an enterprise data model.

The diagram in the figure shows four subject areas: "Customer", "Account", "Order", and "Product". Typically, the top level of the model shows only the direct relationships between subject areas, which, for example, capture the following fact: the customer pays the invoice for an order of goods. Detailed information and indirect relationships are not given at this level of the corporate model.

The next, middle level of the enterprise data model shows detailed information about the objects of each subject area, i.e. entity keys and attributes, their relationships, subtypes and supertypes, etc. For each subject area of the top-level model there is one middle-level model. Fig. 16.3 depicts the middle presentation level of the corporate model for a fragment of the "Order" subject area.

Fig. 16.3 shows that the "Order" subject area includes several entities, defined through their attributes, and the relationships between them. The presented model answers questions such as the date of an order, who placed it, who shipped it, who receives it, and a number of others. The diagram also shows that in this organization there are two types of orders: commercial orders (Commercial) and retail orders (Retail).
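
As a preview of the conversion discussed in this lecture, the sketch below shows one possible star schema that could be derived from the "Order" subject area: an order fact table surrounded by customer, product, and date dimensions. The table and column names are assumptions for illustration, not the model from Fig. 16.3, and SQLite is used only because it needs no setup.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables derived from the Customer, Product and Order areas.
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
        CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, order_date TEXT);

        -- Fact table: one row per order line, measures plus foreign keys.
        CREATE TABLE fact_order (
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            product_key  INTEGER REFERENCES dim_product(product_key),
            date_key     INTEGER REFERENCES dim_date(date_key),
            order_type   TEXT,      -- 'Commercial' or 'Retail'
            quantity     INTEGER,
            amount       REAL
        );
    """)
    print("star schema created")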

Note that an enterprise data model can represent various aspects of an organization's activities, with varying degrees of detail and completeness. If the corporate model represents all aspects of the organization's activities, it is also called an organization-wide data model (enterprise data model).

From the point of view of data warehouse design, an important factor in deciding to derive a data warehouse model from the enterprise data model is the degree of completeness of the enterprise data model.

An organization's enterprise data model evolves: it is constantly being developed and improved. Some subject areas of the enterprise data model may be well elaborated, while for others the work may not yet have begun. If a fragment of a subject area has not been worked out in the enterprise data model, there is no way to use the model as a starting point for designing that part of the data warehouse.

The incompleteness of the corporate model can be compensated for in DW design as follows. Since the development of a data warehouse is usually divided into a sequence of stages in time, its design process can be synchronized with the completion of development of individual fragments of the organization's enterprise data model.

The lowest presentation layer of the corporate data model displays information about the physical characteristics of the database objects that correspond to the logical data model of the middle presentation layer of the enterprise data model.
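
For instance, at this level a logical entity from the middle layer acquires concrete physical characteristics: column data types, nullability, keys, and indexes. The sketch below is a hypothetical fragment with invented names, again using SQLite purely for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Physical definition of the logical "Order" entity: types, keys, constraints.
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            order_date  TEXT    NOT NULL,
            order_type  TEXT    CHECK (order_type IN ('Commercial', 'Retail'))
        );
        -- A physical-level decision: an index to support frequent lookups by customer.
        CREATE INDEX idx_orders_customer ON orders (customer_id);
    """)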

Increasingly, IT professionals are turning their attention to data management solutions based on industry-standard data models and business solution templates. Ready-to-load, comprehensive physical data models and business intelligence reports for specific areas of activity make it possible to unify the information component of the enterprise and significantly speed up business processes. Solution templates allow service providers to take advantage of non-standard information hidden in existing systems, thereby reducing project timelines, costs, and risks. For example, real projects show that data model and business solution templates can reduce development effort by 50%.

An industry logical model is a domain-specific, integrated and logically structured view of all the information that must be in a corporate data warehouse to answer both strategic and tactical business questions. The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important for business development. In today's business environment, it is absolutely essential to have a clear understanding of the relationships between the various components and a good understanding of the big picture of the organization. Identification of all the details and relationships using models allows the most efficient use of time and tools for organizing the work of the company.

Data models are abstract models that describe how data is represented and accessed. Data models define data elements and relationships between them in a given area. A data model is a navigational tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real information. This improves communication within the organization and thus creates a more flexible and stable application environment.


An example of the "GIS for authorities and local self-government" model.

Today, it is strategically important for software and service providers to be able to respond quickly to changes in the industry associated with technological innovations, the removal of government restrictions, and the growing complexity of supply chains. Along with changes in the business model, the complexity and cost of the information technology needed to support the company's activities are growing. Data management is particularly difficult in an environment where corporate information systems, as well as the functional and business requirements for them, are constantly changing.

Industry data models are intended to facilitate and optimize this process and to bring the IT approach up to the modern level.

Industry data models from Esri

Data models for the Esri ArcGIS platform are working templates for use in GIS projects and for creating data structures for various application areas. Building a data model involves creating a conceptual design, a logical structure, and a physical structure, which can then be used to build a personal or corporate geodatabase. ArcGIS provides tools for creating and managing a database schema, and data model templates are used to quickly start a GIS project for various applications and industries. Esri, together with the user community, has spent a significant amount of time developing a number of templates that help you quickly start designing an enterprise geodatabase. These projects are described and documented at support.esri.com/datamodels . Below, in the order in which they appear on that site, are the semantic translations of Esri's industry model names (a minimal sketch of creating a geodatabase follows the list):

  • Address register
  • Agriculture
  • Meteorology
  • Basic Spatial Data
  • Biodiversity
  • Inner space of buildings
  • Greenhouse gas accounting
  • Maintenance of administrative boundaries
  • Military establishment. Intelligence service
  • Energy (including the new ArcGIS MultiSpeak protocol)
  • Ecological buildings
  • Ministry of Emergency Situations. Fire protection
  • Forest cadastre
  • Forestry
  • Geology
  • National level GIS (e-gov)
  • Groundwater and waste water
  • Healthcare
  • Archeology and protection of memorial sites
  • National security
  • Hydrology
  • International Hydrographic Organization (IHO). S-57 format for ENC
  • Irrigation
  • Land Registry
  • Municipal government
  • Maritime navigation
  • State cadastre
  • Oil and gas structures
  • Pipelines
  • Raster stores
  • Bathymetry, seabed topography
  • Telecommunications
  • Transport
  • Plumbing, sewerage, utilities
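
As a rough illustration of how such a template-driven workflow might begin, the sketch below creates an empty file geodatabase and a single feature class. It assumes an ArcGIS installation with the arcpy site package; the folder, geodatabase, feature class, and field names are invented for this example and are not taken from any Esri template.

    # A minimal sketch, assuming ArcGIS with the arcpy site package is available.
    # Names (C:/gis, demo.gdb, Buildings, FLOORS) are illustrative only.
    import arcpy

    gdb = arcpy.management.CreateFileGDB(r"C:/gis", "demo.gdb").getOutput(0)

    # A simple polygon feature class that an industry template schema could refine.
    fc = arcpy.management.CreateFeatureclass(
        gdb, "Buildings", "POLYGON",
        spatial_reference=arcpy.SpatialReference(4326)  # WGS 84
    ).getOutput(0)

    arcpy.management.AddField(fc, "FLOORS", "SHORT")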

These models contain all the necessary features of the industry standard, namely:

  • are freely available;
  • are not tied to the technology of the "selected" manufacturer;
  • created as a result of the implementation of real projects;
  • created with the participation of industry experts;
  • designed to provide information interaction between various products and technologies;
  • do not contradict other standards and regulatory documents;
  • used in implemented projects around the world;
  • are designed to work with information throughout the life cycle of the system being created, not just of the project itself;
  • expandable to the needs of the customer without losing compatibility with other projects and / or models;
  • accompanied by additional materials and examples;
  • used in guidelines and technical materials of various industrial companies;
  • have a large community of participants, and access to the community is open to everyone;
  • are referenced extensively in publications of recent years.

Esri is part of an expert group of independent bodies that recommend various industry models for use, such as PODS (Pipeline Open Data Standards, an open standard for the oil and gas industry; there is currently an implementation of PODS as a PODS Esri Spatial 5.1.1 geodatabase), or the geodatabase (GDB) from ArcGIS for Aviation, which takes into account ICAO and FAA recommendations as well as the AIXM 5.0 navigation data exchange standard. In addition, there are recommended models that strictly adhere to existing industry standards, such as S-57 and ArcGIS for Maritime (marine and coastal features), as well as models created from the work of Esri Professional Services that have become "de facto" standards in their areas. For example, GIS for the Nation and Local Government have influenced the NSDI and INSPIRE standards, while Hydro and Groundwater (hydrology and groundwater) are actively used in the freely available professional package ArcHydro and in commercial third-party products. It should be noted that Esri also supports "de facto" standards such as NHDI. All proposed data models are documented and ready for use in enterprise IT processes. Accompanying materials for the models include:

  • UML diagrams of entity relationships;
  • data structures, domains, directories;
  • ready-made geodatabase templates in ArcGIS GDB format;
  • sample data and sample applications;
  • examples of data loading scripts, examples of analysis utilities;
  • reference books on the proposed data structure.

Esri summarizes its experience in building industry models in the form of books and localizes published materials. Esri CIS has localized and published the following books:

  • Geospatial Service Oriented Architecture (SOA);
  • Designing geodatabases for transport;
  • Corporate geoinformation systems;
  • GIS: new energy of electric and gas enterprises;
  • Oil and gas digital map;
  • Modeling our world. Esri Geodatabase Design Guide;
  • Thinking about GIS. GIS planning: a guide for managers;
  • Geographic information systems. Basics;
  • GIS for administrative and economic management;
  • Web GIS. Principles and application;
  • Systems Design Strategies, 26th edition;
  • 68 issues of the ArcReview magazine with publications by companies and users of GIS systems;
  • ... and many other thematic notes and publications.

For example, the book "Modeling Our World..." (in translation) is a comprehensive guide and reference on GIS data modeling in general and the geodatabase data model in particular. The book shows how to make the right data modeling decisions, decisions that affect every aspect of a GIS project: from database design and data collection to spatial analysis and visualization. It describes in detail how to design a geographic database appropriate to the project, set up database functionality without programming, manage workflow in complex projects, model a variety of network structures such as river, transport, or electrical networks, integrate satellite imagery into geographic analysis and mapping, and create 3D GIS data models. The book "Designing Geodatabases for Transportation" contains methodological approaches that have been tested on a large number of projects and that fully comply with the legislative requirements of Europe and the United States, as well as with international standards. And the book "GIS: New Energy of Electric and Gas Enterprises" uses real-world examples to show the benefits that an enterprise GIS can bring to an energy supplier, covering aspects such as customer service, network operation, and other business processes.


Some of the books, both translated and original, have been published in Russian by Esri CIS and DATA+. They cover both conceptual issues related to GIS technology and many applied aspects of modeling and deploying GIS of various scales and purposes.

We will consider the use of industry models using the BISDM (Building Interior Space Data Model) version 3.0 data model as an example. BISDM is a development of the more general BIM (Building Information Model) and is intended for use in the design, construction, operation, and decommissioning of buildings and structures. Implemented in GIS software, it allows geodata to be exchanged effectively with other platforms and enables interaction with them. It belongs to the general FM (facility infrastructure management) task group. The main advantages of the BISDM model are that its use allows you to:

  • organize the exchange of information in a heterogeneous environment according to uniform rules;
  • get a "physical" embodiment of the BIM concept and the recommended rules for managing a construction project;
  • maintain a single repository using GIS tools throughout the entire life cycle of the building (from design to decommissioning);
  • coordinate the work of various specialists in the project;
  • visualize the planned schedule and construction stages for all participants;
  • give a preliminary estimate of the cost and construction time (4D and 5D data);
  • control the progress of the project;
  • ensure the quality operation of the building, including maintenance and repairs;
  • become part of the asset management system, including the functions of analyzing the efficiency of space use (renting, storage facilities, employee management);
  • calculate and manage the energy efficiency of the building;
  • simulate the movement of human flows.

BISDM defines the rules for working with spatial data at the level of internal premises in a building, including the purpose and types of use, laid communications, installed equipment, accounting for repairs and maintenance, logging incidents, relationships with other company assets. The model helps to create a unified repository of geographic and non-geographic data. The experience of the world's leading companies was used to isolate entities and model at the GDB level (geodatabase) the spatial and logical relationships of all physical elements that form both the building itself and its interior. Following the principles of BISDM allows you to significantly simplify the tasks of integration with other systems. At the first stage, this is usually integration with CAD. Then, during the operation of the building, data exchange with ERP and EAM systems (SAP, TRIRIGA, Maximo, etc.) is used.


Visualization of BISDM structural elements using ArcGIS.

When BISDM is used, the customer/owner of the facility receives an end-to-end exchange of information: from the idea of creating the facility, to the development of a complete design, to construction control with up-to-date information available by the time the facility is put into operation, to control of parameters during operation, and even during reconstruction or decommissioning of the facility. Following the BISDM paradigm, the GIS and the GDB created with it become a common data repository for related systems. The GDB often contains data created and operated by third-party systems; this must be taken into account when designing the architecture of the system being created.

At a certain stage, the accumulated "critical mass" of information allows you to move to a new qualitative level. For example, upon completion of the design phase of a new building, it is possible to automatically visualize 3D survey models in GIS, compile a list of equipment to be installed, calculate the kilometers of engineering networks to be laid, perform a number of verifications, and even give a preliminary financial estimate of the project cost.

Once again, when BISDM and ArcGIS are used together, it becomes possible to automatically build 3D models from the accumulated data, since the GDB contains a complete description of the object, including z-coordinates, floor assignment, types of element connections, equipment installation methods, materials, available paths for personnel movement, the functional purpose of each element, and so on. It should be noted that after the initial import of all design materials into the BISDM GDB, additional content is needed for:

  • placement of 3D models of objects and equipment at designated places;
  • collecting information about the cost of materials and the procedure for their laying and installation;
  • patency control according to the dimensions of the installed non-standard equipment.

Through the use of ArcGIS, the import of additional 3D objects and reference books from external sources is simplified. The ArcGIS Data Interoperability module allows you to create procedures for importing such data and placing it correctly within the model. All formats used in the industry are supported, including IFC, AutoCAD Revit, and Bentley MicroStation.

Industry data models from IBM

IBM provides a set of storage management tools and models for a variety of industries:

  • IBM Banking and Financial Markets Data Warehouse (finance)
  • IBM Banking Data Warehouse
  • IBM Banking Process and Service Models
  • IBM Health Plan Data Model (health)
  • IBM Insurance Information Warehouse (insurance)
  • IBM Insurance Process and Service Models
  • IBM Retail Data Warehouse (retail)
  • IBM Telecommunications Data Warehouse (telecommunications)
  • InfoSphere Warehouse Pack:
    - for Customer Insight (to understand customers)
    - for Market and Campaign Insight (to understand the company and the market)
    - for Supply Chain Insight (for understanding suppliers).

For example, the IBM Banking and Financial Markets Data Warehouse model is designed to address the specific challenges of the banking industry in terms of data, and IBM Banking Process and Service Models in terms of processes and SOA (service-oriented architecture). For the telecommunications industry, the IBM Information FrameWork (IFW) and IBM Telecommunications Data Warehouse (TDW) models are offered. They help to significantly speed up the creation of analytical systems and to reduce the risks associated with developing business intelligence applications, managing corporate data, and organizing data warehouses, taking into account the specifics of the telecommunications industry. IBM TDW capabilities cover the entire spectrum of the telecommunications market: from Internet providers and cable network operators offering wired and wireless telephony, data transmission, and multimedia content services, to multinational companies providing telephone, satellite, long-distance, and international communication services, as well as operators of global networks. Today, TDW is used by large and small wireline and wireless communication providers worldwide.

The tool called InfoSphere Warehouse Pack for Customer Insight provides structured and easily deployed business content for a growing number of business projects and industries, including banking, insurance, finance, health insurance programs, telecommunications, retail, and distribution. For business users, InfoSphere Warehouse Pack for Market and Campaign Insight helps maximize the effectiveness of market intelligence and marketing campaigns through a step-by-step, business-specific development process. With InfoSphere Warehouse Pack for Supply Chain Insight, organizations can obtain current information on supply chain operations.


Esri's position within the IBM solution architecture.

Of particular note is IBM's approach to utilities and utility companies. To meet growing customer demands, utility companies need a more flexible architecture than they use today, as well as an industry-standard object model to facilitate the free exchange of information. This will improve the communication capabilities of energy companies, enabling them to communicate more cost-effectively, and give new systems better visibility into all required resources, no matter where they are located within the organization. The basis for this approach is SOA (Service-Oriented Architecture), a component model that establishes a correspondence between the functions of departments and the services of various applications that can be reused. The "services" of these components communicate through interfaces without hard coupling, hiding from the user the full complexity of the systems behind them. In this mode, enterprises can easily add new applications regardless of the software vendor, operating system, programming language, or other internal characteristics of the software. The concept is implemented on the basis of SOA in SAFE (Solution Architecture for Energy), which enables the utility industry to gain a standards-based, holistic view of its infrastructure.

Esri ArcGIS® is a globally recognized software platform for geographic information systems (GIS) that enables the creation and management of the digital assets of electric power, gas transmission, distribution, and telecommunications networks. ArcGIS allows you to conduct the most complete inventory of the components of the electrical distribution network, taking into account their spatial location. ArcGIS significantly extends the IBM SAFE architecture by providing the tools, applications, workflows, analytics, and information and integration capabilities needed to manage a smart grid. ArcGIS within IBM SAFE allows you to obtain information from various sources about infrastructure objects, assets, customers, and employees with accurate data about their location, as well as to create, store, and process geo-referenced information about enterprise assets (poles, pipelines, wires, transformers, cable ducts, etc.). ArcGIS within the SAFE infrastructure allows you to dynamically link key business applications by combining data from GIS, SCADA, and customer service systems with external information such as traffic, weather conditions, or satellite imagery. Utilities use this combined information for a variety of purposes, from building an overall picture of the operational environment to site inspections, maintenance, network analysis, and planning.

The information components of a power supply enterprise can be modeled using several levels, which range from the lowest, physical, level to the top, most complex level of business process logic. These layers can be integrated to meet typical industry requirements, such as automated meter reading and supervisory control and data acquisition (SCADA). By building the SAFE architecture, utility companies are making significant strides in advancing an industry-wide open object model called the Common Information Model (CIM) for energy and utilities. This model provides the necessary basis for moving many enterprises toward a service-oriented architecture, as it encourages the use of open standards for structuring data and objects. By having all systems use the same objects, the confusion and inflexibility associated with different implementations of the same objects is reduced to a minimum. Thus, the definition of the "customer" object and other important business objects will be unified across all systems of the power supply company. With CIM, service providers and service consumers can share a common data structure, making it easier to outsource costly business components, as CIM establishes a common base on which information sharing is built.

Conclusion

Comprehensive industry data models give companies a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to adopting new applications. Conversely, implementing data integration brings a company tangible returns and increased efficiency.

A well-built model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, binary file, or text, where the meaning can be ambiguous). The most effective industry models are offered by professional vendors, including Esri and IBM. The high returns from using their models are achieved through their significant level of detail and accuracy. They usually contain many data attributes. In addition, experts from Esri and IBM not only have extensive modeling experience but are also well versed in building models for a particular industry.


Zaitsev S.L., Ph.D.

Repeating groups

Repeating groups are attributes for which a single entity instance can have more than one value. For example, a person may have more than one skill. If, in terms of business requirements, we need to know everyone's skill level, and each person can have only two skills, we can create the entity shown in Fig. 1.6. Here the PERSON entity has two pairs of attributes for storing skills and the corresponding skill levels.

Fig. 1.6. This example uses repeating groups.

The problem with repeating groups is that we cannot know exactly how many skills a person might have. In real life, some people have one skill, some have several, and some have none yet. Figure 1.7 shows the model reduced to first normal form. Notice the added Skill ID, which uniquely identifies each SKILL.

Fig. 1.7. The model reduced to first normal form.
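
A minimal sketch of the same idea in code, using invented attribute names: the repeating group packs two skills into one PERSON record, while the 1NF version stores one skill per row, identified by a Skill ID.

    # Repeating group: a fixed number of skill "slots" per person (awkward to extend).
    person_with_repeating_group = {
        "person_id": 1, "name": "Ivanov",
        "skill_1": "SQL", "level_1": 3,
        "skill_2": "ERwin", "level_2": 2,
    }

    # First normal form: one row per (person, skill), no fixed slots.
    person_skill_1nf = [
        {"person_id": 1, "skill_id": 10, "skill_name": "SQL",   "level": 3},
        {"person_id": 1, "skill_id": 11, "skill_name": "ERwin", "level": 2},
    ]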

One fact in one place

If the same attribute is present in more than one entity and is not a foreign key, then that attribute is considered redundant. The logical model should not contain redundant data.

Redundancy requires additional space, but while memory efficiency is important, the real problem lies elsewhere. Guaranteed synchronization of redundant data comes with an overhead, and you always run the risk of conflicting values.

In the previous example, SKILL depends on both Person ID and Skill ID. This means that a SKILL cannot exist until there is a PERSON who has that skill. It also makes it harder to change the Skill Name: you have to find every Skill Name entry and change it for each PERSON who has that skill.

Figure 1.8 shows the model in second normal form. Note that a SKILL entity has been added and the Skill Name attribute has been moved into it. The skill level remains, accordingly, at the intersection of PERSON and SKILL.

Fig. 1.8. In second normal form, the repeating group is moved to another entity. This provides the flexibility to add as many skills as needed and to change the Skill Name or Skill Description in one place.
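
Continuing the sketch above, second normal form splits out a SKILL entity so that the skill name is stored once; the intersection rows keep only the keys and the skill level. All names remain invented.

    # 2NF: Skill Name depends only on Skill ID, so it moves to its own entity.
    skills = {
        10: {"skill_name": "SQL",   "skill_description": "Query language"},
        11: {"skill_name": "ERwin", "skill_description": "Modeling tool"},
    }

    # Intersection of PERSON and SKILL: the full key (person_id, skill_id) plus the level.
    person_skill = [
        {"person_id": 1, "skill_id": 10, "level": 3},
        {"person_id": 1, "skill_id": 11, "level": 2},
    ]

    # Renaming a skill now touches exactly one place.
    skills[11]["skill_name"] = "CA ERwin"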

Each attribute depends on a key

Each attribute of an entity must depend on the primary key of that entity. In the previous example, School Name and Geographic Area are present in the PERSON table but do not describe a person. To achieve third normal form, these attributes must be moved to an entity where they depend on the key. Figure 1.9 shows the model in third normal form.

Fig. 1.9. In third normal form, School Name and Geographic Area are moved to an entity where their values depend on the key.
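
In the same spirit, a sketch of the third-normal-form step: School Name and Geographic Area no longer sit in the PERSON record but in a SCHOOL entity on whose key they depend. All names and values are illustrative.

    # 3NF: attributes that describe a school move to a SCHOOL entity,
    # where they depend on the school's own key.
    schools = {
        100: {"school_name": "Polytechnic", "geographic_area": "North"},
        200: {"school_name": "Evening College", "geographic_area": "South"},
    }
    persons = {
        1: {"name": "Ivanov"},   # person attributes only
    }
    # The PERSON-SCHOOL relationship itself is still many-to-many here;
    # it is resolved in the next step (fourth normal form).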

Many-to-Many Relationships

Many-to-many relationships reflect the reality of the environment. Note that in Figure 1.9 there is a many-to-many relationship between PERSON and SCHOOL. The relationship accurately reflects the fact that a PERSON can study at many SCHOOLS and a SCHOOL can have many PERSONS studying at it. To achieve fourth normal form, an associative entity is created that resolves the many-to-many relationship by generating a separate entry for each unique combination of school and person. Figure 1.10 shows the model in fourth normal form.

Fig. 1.10. In fourth normal form, the many-to-many relationship between PERSON and SCHOOL is resolved by introducing an associative entity, in which a separate entry is created for each unique combination of SCHOOL and PERSON.
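
A sketch of the associative entity: each unique combination of person and school becomes one row, which is what resolves the many-to-many relationship. The data values are invented.

    # 4NF: an associative entity PERSON_SCHOOL, one row per unique
    # (person_id, school_id) combination.
    person_school = [
        {"person_id": 1, "school_id": 100},
        {"person_id": 1, "school_id": 200},   # one person, many schools
        {"person_id": 2, "school_id": 100},   # one school, many persons
    ]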

Formal definitions of normal forms

The following definitions of normal forms may seem intimidating. Think of them simply as formulas for achieving normalization. Normal forms are based on relational algebra and can be interpreted as mathematical transformations. Although this book does not cover a detailed discussion of normal forms, modellers are encouraged to delve deeper into the subject.

In a given relation R, attribute Y is functionally dependent on attribute X, symbolically R.X -> R.Y (read "R.X functionally determines R.Y"), if and only if each value of X in R is associated with exactly one value of Y in R (at any given time). Attributes X and Y may be composite (Date C.J. An Introduction to Database Systems. 6th edition. Williams, 1999, 848 pp.).
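
The definition can be checked mechanically. The small helper below, written only for this illustration and limited to single (non-composite) attributes, tests whether X -> Y holds in a relation represented as a list of rows: for every value of X there must be exactly one associated value of Y.

    def holds_fd(relation, x, y):
        """Return True if attribute x functionally determines attribute y."""
        seen = {}
        for row in relation:
            if row[x] in seen and seen[row[x]] != row[y]:
                return False        # same X value, two different Y values
            seen[row[x]] = row[y]
        return True

    r = [
        {"skill_id": 10, "skill_name": "SQL"},
        {"skill_id": 11, "skill_name": "ERwin"},
        {"skill_id": 10, "skill_name": "SQL"},
    ]
    print(holds_fd(r, "skill_id", "skill_name"))   # True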

A relation R is in first normal form (1NF) if and only if all its domains contain only atomic values (Date, ibid.).

A relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is completely dependent on the primary key (Date, ibid.).

A relation R is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is not transitively dependent on the primary key (Date, ibid.).

A relation R is in Boyce-Codd normal form (BCNF) if and only if every determinant is a candidate for use as a key.

NOTE Below is a brief explanation of some of the abbreviations used in Date's definitions.

MVD (multi-valued dependency) - multi-valued dependency. Used only for entities with three or more attributes. In a multivalued dependency, the value of an attribute depends on only part of the primary key.

FD (functional dependency) - functional dependency. In a functional dependency, the value of an attribute depends on the value of another attribute that is not part of the primary key.

JD (join dependency) - join dependency. In a join dependency, the primary key of the parent entity is traceable to at least the third level of descendants while retaining the ability to be used in the original key join.

A relation R is in fourth normal form (4NF) if and only if, whenever there is an MVD in R, say A →→ B, all attributes of R are also functionally dependent on A. In other words, the only dependencies (FD or MVD) in R are of the form K → X (i.e. a functional dependency of an attribute X on a candidate for use as a key, K). Equivalently, R meets the requirements of 4NF if it is in BCNF and all MVDs are in fact FDs (Date, ibid.).

For fifth normal form: a relation R satisfies the join dependency (JD) *(X, Y, ..., Z) if and only if R is equivalent to the join of its projections onto X, Y, ..., Z, where X, Y, ..., Z are subsets of the set of attributes of R.

There are many other normal forms for complex data types and specific situations that are beyond the scope of this discussion. Readers enthusiastic about model development are encouraged to explore the other normal forms on their own.

Business Normal Forms

In his book, Clive Finklestein (Finklestein Cl. An Introduction to Information Engineering: From Strategic Planning to Information Systems. Reading, Massachusetts: Addison-Wesley, 1989) took a different approach to normalization. He defines business normal forms in terms of reduction to those forms. Many modellers find this approach more intuitive and pragmatic.

First Business Normal Form (1BNF) maps repeating groups to another entity. This entity gets its own name and primary (composite) key attributes from the original entity and its repeating group.

Second Business Normal Form (2BNF) maps attributes that partially depend on a primary key to another entity. The primary (composite) key of this entity is the primary key of the entity it originally resided in, along with additional keys on which the attribute is entirely dependent.

The third business normal form (3BNF) moves attributes that do not depend on the primary key to another entity, where they are completely dependent on the primary key of this entity.

Fourth Business Normal Form (4BNF) maps attributes that depend on the value of the primary key or are optional to a secondary entity, where they depend entirely on the value of the primary key, or where they must (mandatory) be present in that entity.

Fifth Business Normal Form (5BNF) appears as a structural entity if there is a recursive or other dependency between instances of a secondary entity, or if a recursive dependency exists between instances of its primary entity.

Completed logical data model

The completed logical model must satisfy the requirements of the third business normal form and include all entities, attributes, and relationships necessary to support the data requirements and business rules associated with the data.

All entities must have names that describe the content and a clear, concise, complete description or definition. In one of the following publications, an initial set of recommendations for the correct formation of names and descriptions of entities will be considered.

Entities must have a complete set of attributes, so that every fact about each entity can be represented by its attributes. Each attribute must have a name that reflects its values, a logical data type, and a clear, concise, complete description or definition. In one of the following publications, we will consider an initial set of recommendations for the correct formation of attribute names and descriptions.

Relationships should include a verb construction that describes the relationship between entities, along with characteristics such as plurality, the need for existence, or the possibility of non-existence of the relationship.

NOTE The plurality of a relationship describes the maximum number of secondary entity instances that can be associated with an instance of the original entity. The necessity of existence or possibility of absence of a relationship is used to define the minimum number of instances of a secondary entity that can be associated with an instance of the original entity.

Physical data model

After creating a complete and adequate logical model, you are ready to make a decision on the choice of implementation platform. The choice of platform depends on the requirements for data use and the strategic principles of the organization's architecture. Platform selection is a complex issue that is beyond the scope of this book.

In ERwin, the physical model is a graphical representation of the actual database. The physical database will consist of tables, columns, and relationships. The physical model depends on the platform chosen for implementation and the data usage requirements. The physical model for IMS will be very different from the same model for Sybase. The physical model for OLAP reports will look different than the model for OLTP (Online Transaction Processing).

The data modeler and the database administrator (DBA) use the logical model, the usage requirements, and the corporate architecture's strategic principles to develop the physical data model. You can denormalize the physical model to improve performance and create views to support usage requirements. The following sections detail the process of denormalization and view creation.
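
As a small illustration of both techniques (names and figures invented, SQLite used only for convenience): a view supports a common usage requirement, and a denormalized summary table trades redundancy for read performance.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL, order_date TEXT);

        -- A view to support a usage requirement (totals per customer).
        CREATE VIEW v_customer_totals AS
            SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
            FROM orders GROUP BY customer_id;

        -- A denormalized summary table: precomputed totals, refreshed periodically.
        CREATE TABLE customer_totals AS SELECT * FROM v_customer_totals;
    """)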

This section provides an overview of the process of building a physical model, collecting requirements for using data, and defining the components of a physical model and reverse engineering. These issues will be covered in more detail in future publications.

Collection of data usage requirements

Typically, you collect data usage requirements early on, during interviews and work sessions. The requirements should define the user's use of the data as fully as possible. A superficial treatment of usage requirements and gaps in the physical model can lead to unplanned costs and delay the project. Usage requirements include:

    Access and performance requirements

    Volumetric characteristics (an estimate of the amount of data to be stored), which allow the administrator to gauge the physical volume of the database (a rough sizing sketch is given at the end of this section)

    Estimating the number of users who need concurrent access to data to help you design your database for an acceptable level of performance

    Summaries, aggregates, and other calculated or derived data that may be considered candidates for storage in persistent data structures

    Requirements for generating reports and standard queries to help the database administrator build indexes

    Views (permanent or virtual) that will assist the user in performing data aggregation or filtering operations.

In addition to the chairperson, secretary, and users, the usage requirements session should include the modeler, database administrator, and database architect. User requirements for historical data should be discussed. The length of time that data is stored has a significant impact on the size of the database. Often, older data is stored in aggregate form, and atomic data is archived or deleted.

Users should bring sample queries and reports with them to the session. Reports must be strictly defined and must include the atomic values used for any summary and aggregate fields.
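
To illustrate the volumetric characteristics mentioned in the list above, here is a back-of-the-envelope sizing estimate; every number in it is a made-up assumption, not a recommendation.

    # Rough database sizing: all figures below are hypothetical assumptions.
    rows_per_year   = 5_000_000        # expected fact rows per year
    avg_row_bytes   = 120              # estimated average row width
    years_retained  = 3                # online retention period
    index_overhead  = 0.4              # indexes as a fraction of table size

    table_bytes = rows_per_year * avg_row_bytes * years_retained
    total_bytes = table_bytes * (1 + index_overhead)
    print(f"Estimated size: {total_bytes / 1024**3:.1f} GiB")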

Components of the physical data model

The components of the physical data model are tables, columns, and relationships. Entities in the logical model are likely to become tables in the physical model. Logical attributes will become columns. Logical relationships will become relationship integrity constraints. Some logical relationships cannot be implemented in a physical database.

Reverse engineering

When the logical model is not available, it becomes necessary to recreate the model from the existing database. In ERwin, this process is called reverse engineering. Reverse engineering can be done in several ways. The modeler can explore the data structures in the database and recreate the tables in a visual modeling environment. You can import data definition language (DDL) into a tool that supports reverse engineering (e.g. ERwin). Advanced tools such as ERwin include functions that provide ODBC communication with an existing database to create a model by directly reading the data structures. Reverse engineering using ERwin will be discussed in detail in a future publication.
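
The idea can be illustrated outside ERwin as well: the short sketch below reads the catalog of a SQLite database to recover table and column definitions. It only illustrates the principle; ERwin's own reverse engineering works through ODBC or DDL import, as described above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

    # Recover the structure from the database catalog.
    for (table_name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"):
        print("Table:", table_name)
        for cid, col, col_type, notnull, default, pk in conn.execute(
                f"PRAGMA table_info({table_name})"):
            print("  column:", col, col_type, "PK" if pk else "")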

Use of corporate functional boundaries

When building a logical model, it is important for the modeler to ensure that the new model matches the enterprise model. Using corporate functional boundaries means modeling data in terms used within the corporation. The way data is used in a corporation is changing faster than the data itself. In each logical model, the data must be represented holistically, regardless of the business domain it supports. Entities, attributes, and relationships should define business rules at the corporate level.

NOTE Some of my colleagues refer to these corporate functional boundaries as real-world modeling. Real-world modeling encourages the modeler to view information in terms of its real-world relationships and interconnections.

The use of corporate functional boundaries for a properly constructed data model provides a framework to support the information needs of any number of processes and applications, enabling a corporation to more effectively exploit one of its most valuable assets, information.

What is an enterprise data model?

Enterprise Data Model (EDM) contains entities, attributes, and relationships that represent the information needs of a corporation. EDM is usually subdivided into subject areas, which are groups of entities related to supporting specific business needs. Some subject areas may cover specific business functions such as contract management, while others may group entities that describe products or services.

Each logical model must correspond to an existing enterprise data model domain. If the logical model does not meet this requirement, a model that defines the subject area must be added to it. This comparison ensures that the corporate model is improved or adjusted and all logical modeling efforts are coordinated within the corporation.

EDM also includes specific entities that define the scope of values ​​for key attributes. These entities have no parents and are defined as independent. Independent entities are often used to maintain the integrity of relationships. These entities are identified by several different names, such as code tables, reference tables, type tables, or classification tables. We will use the term "corporate business object". An enterprise business object is an entity that contains a set of attribute values ​​that are independent of any other entity. Enterprise business objects within a corporation should be used consistently.

Building an Enterprise Data Model by Scaling

In some organizations, the corporate model has been built from start to finish as the result of a single concerted effort. Most organizations, however, build fairly complete enterprise models incrementally, by building them up.

Growth means building something sequentially, layer by layer, just like an oyster grows a pearl. Each data model created provides input to the formation of the EDM. Building an EDM in this way requires additional modeling steps to add new data structures and domains or extend existing data structures. This makes it possible to build an enterprise data model by building up, iteratively adding levels of detail and refinement.

The concept of modeling methodology

There are several methodologies for visual data modeling. ERwin supports two:

    IDEF1X (Integration Definition for Information Modeling - integrated description of information models).

    IE (Information Engineering - information engineering).

IDEF1X is a good methodology, and its notation is widely used.

Integrated description of information models

IDEF1X is a highly structured data modeling methodology that extends the IDEF1 methodology adopted as a FIPS (Federal Information Processing Standards) standard. IDEF1X uses a highly structured set of modeling construct types and results in a data model that requires an understanding of the physical nature of the data before such information can be made available.

The rigid structure of IDEF1X forces the modeler to assign characteristics to entities that may not correspond to real-world conditions. For example, IDEF1X requires all entity subtypes to be exclusive. This means that a person cannot be both a client and an employee, although real practice tells us otherwise.

Information engineering

Clive Finklestein is often called the father of information engineering, although James Martin developed similar concepts together with him (Martin, James. Managing the Database Environment. Upper Saddle River, New Jersey: Prentice Hall, 1983). Information engineering uses a business-driven approach to information management and a different notation to represent business rules. IE serves as an extension and development of the notation and basic concepts of the ER methodology proposed by Peter Chen.

IE provides the infrastructure to support information requirements by integrating corporate strategic planning with the information systems being developed. Such integration makes it possible to more closely link the management of information resources with the long-term strategic prospects of the corporation. This business-driven approach leads many modelers to choose IE over other methodologies that primarily focus on solving immediate development problems.

IE provides a workflow that leads a corporation to identify all of its information needs to collect and manage data and identify relationships between information objects. As a result, information requirements are articulated based on management directives and can be directly translated into a management information system that will support strategic information needs.

Conclusion

Understanding how to use a data modeling tool like ERwin is only part of the problem. In addition, you must understand when data modeling tasks are performed and how information requirements and business rules are collected to be represented in the data model. Conducting work sessions provides the most favorable conditions for collecting information requirements in an environment that includes subject matter experts, users, and information technology specialists.

Building a good data model requires the analysis and research of information requirements and business rules collected during work sessions and interviews. The resulting data model should be compared to the enterprise model, if possible, to ensure that it does not conflict with existing object models and includes all required objects.

The data model consists of logical and physical models representing information requirements and business rules. The logical model must be reduced to third normal form. Third normal form constrains the data structures to avoid insert, update, and delete anomalies and to support the "one fact in one place" principle. The collected information requirements and business rules should be analyzed and researched, and compared with the enterprise model to ensure that they do not conflict with existing object models and that they include all required objects.

In ERwin, a data model includes both logical and physical models. ERwin implements the ER approach and allows you to create logical and physical model objects to represent information requirements and business rules. Logical model objects include entities, attributes, and relationships. Physical model objects include tables, columns, and relationship integrity constraints.

In one of the following publications, the issues of identifying entities, determining entity types, choosing entity names and descriptions, as well as some tricks to avoid the most common modeling errors associated with the use of entities, will be considered.


Increasingly, IT professionals are turning their attention to data management solutions based on industry standard data models and business decision templates. Ready-to-load complex physical data models and business intelligence reports for specific areas of activity allow you to unify the information component of the enterprise and significantly speed up business processes. Solution templates allow service providers to leverage the power of non-standard information hidden in existing systems, thereby reducing project timelines, costs and risks. For example, real projects show that data model and business decision templates can reduce development effort by 50%.

An industry logical model is a domain-specific, integrated and logically structured view of all the information that must be in a corporate data warehouse to answer both strategic and tactical business questions. The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important for business development. In today's business environment, it is absolutely essential to have a clear understanding of the relationships between the various components and a good understanding of the big picture of the organization. Identification of all the details and relationships using models allows the most efficient use of time and tools for organizing the work of the company.

Data models are abstract models that describe how data is represented and accessed. Data models define data elements and relationships between them in a given area. A data model is a navigational tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real information. This improves communication within the organization and thus creates a more flexible and stable application environment.


An example of a GIS for authorities and local self-government model.

Today, it is strategically important for software and service providers to be able to quickly respond to changes in the industry associated with technological innovations, the removal of government restrictions and the complexity of supply chains. Along with changes in the business model, the complexity and cost of information technology necessary to support the activities of the company is growing. Data management is especially difficult in an environment where corporate information systems and their functional and business requirements are constantly changing.

To help facilitate and optimize this process, in translating the IT approach to the modern level, industry data models are called upon.

Industry data models from the companyEsri

Data models for the Esri ArcGIS platform are working templates for use in GIS projects and creating data structures for various application areas. Building a data model involves creating a conceptual design, logical structure, and physical structure that can then be used to build a personal or corporate geodatabase. ArcGIS provides tools for creating and managing a database schema, and data model templates are used to quickly launch a GIS project across a variety of applications and industries. Esri, along with the user community, has spent a significant amount of time developing a number of templates that can help you quickly start designing an enterprise geodatabase. These projects are described and documented at support.esri.com/datamodels . Below, in the order in which they appear on this site, are the semantic translations of Esri's industry model names:

  • Address register
  • Agriculture
  • Meteorology
  • Basic Spatial Data
  • Biodiversity
  • Inner space of buildings
  • Greenhouse gas accounting
  • Maintenance of administrative boundaries
  • Military establishment. Intelligence service
  • Energy (including the new ArcGIS MultiSpeak protocol)
  • Ecological buildings
  • Ministry of Emergency Situations. fire protection
  • Forest cadastre
  • Forestry
  • Geology
  • National level GIS (e-gov)
  • Groundwater and waste water
  • healthcare
  • Archeology and protection of memorial sites
  • National security
  • Hydrology
  • International Hydrographic Organization (IHO). S-57 format for ENC
  • Irrigation
  • Land Registry
  • municipal government
  • Maritime navigation
  • State cadastre
  • Oil and gas structures
  • Pipelines
  • Raster stores
  • Bathymetry, seabed topography
  • Telecommunications
  • Transport
  • Plumbing, sewerage, utilities

These models contain all the necessary features of the industry standard, namely:

  • are freely available;
  • are not tied to the technology of the "selected" manufacturer;
  • created as a result of the implementation of real projects;
  • created with the participation of industry experts;
  • designed to provide information interaction between various products and technologies;
  • do not contradict other standards and regulatory documents;
  • used in implemented projects around the world;
  • are designed to work with information throughout the life cycle of the system being created, and not the project itself;
  • expandable to the needs of the customer without losing compatibility with other projects and / or models;
  • accompanied by additional materials and examples;
  • used in guidelines and technical materials of various industrial companies;
  • have a large community of participants, with access to the community open to everyone;
  • are widely referenced in data model publications of recent years.

Esri takes part in expert groups of independent bodies that recommend various industry models for use, such as PODS (Pipeline Open Data Standards, an open standard for the oil and gas industry; there is currently an implementation of PODS as an Esri geodatabase, PODS Esri Spatial 5.1.1), or the geodatabase (GDB) from ArcGIS for Aviation, which takes into account ICAO and FAA recommendations as well as the AIXM 5.0 navigation data exchange standard. In addition, there are recommended models that strictly adhere to existing industry standards, such as S-57 and ArcGIS for Maritime (marine and coastal features), as well as models that grew out of the work of Esri Professional Services and became "de facto" standards in their areas. For example, GIS for the Nation and Local Government influenced the NSDI and INSPIRE standards, while Hydro and Groundwater are heavily used in the freely available ArcHydro professional package and in commercial products of third-party firms. It should be noted that Esri also supports "de facto" standards such as NHDI. All proposed data models are documented and ready for use in enterprise IT processes. Accompanying materials for the models include:

  • UML diagrams of entity relationships;
  • data structures, domains, directories;
  • ready-made geodatabase templates in ArcGIS GDB format;
  • sample data and sample applications;
  • examples of data loading scripts, examples of analysis utilities;
  • reference books on the proposed data structure.

Esri summarizes its experience in building industry models in the form of books and localizes published materials. Esri CIS has localized and published the following books:

  • Geospatial Service Oriented Architecture (SOA);
  • Designing geodatabases for transport;
  • Corporate geoinformation systems;
  • GIS: new energy of electric and gas enterprises;
  • Oil and gas on a digital map;
  • Modeling our world. Esri Geodatabase Design Guide;
  • Thinking about GIS. GIS planning: a guide for managers;
  • Geographic information systems. Basics;
  • GIS for administrative and economic management;
  • Web GIS. Principles and application;
  • Systems Design Strategies, 26th edition;
  • 68 issues of the ArcReview magazine with publications by companies and users of GIS systems;
  • ... and many other thematic notes and publications.

For example, the book "Modeling Our World" (in translation) is a comprehensive guide and reference on GIS data modeling in general and the geodatabase data model in particular. The book shows how to make the right data modeling decisions, decisions that affect every aspect of a GIS project: from database design and data collection to spatial analysis and visualization. It describes in detail how to design a geographic database appropriate for the project, set up database functionality without programming, manage workflow in complex projects, model a variety of network structures such as river, transport or electric networks, integrate satellite imagery into geographic analysis and mapping, and create 3D GIS data models. The book "Designing Geodatabases for Transportation" contains methodological approaches that have been tested on a large number of projects and fully comply with the legislative requirements of Europe and the United States, as well as with international standards. And the book "GIS: New Energy of Electric and Gas Enterprises" uses real-world examples to show the benefits that an enterprise GIS can bring to an energy supplier, including aspects such as customer service, network operation, and other business processes.


Some of the books, translated and original, published in Russian by Esri CIS and DATA+. They cover both conceptual issues related to GIS technology and many applied aspects of modeling and deploying GIS of various scales and purposes.

We will consider the use of industry models using the BISDM (Building Interior Space Data Model) version 3.0 data model as an example. BISDM is a development of the more general BIM (Building Information Model) concept and is intended for use in the design, construction, operation and decommissioning of buildings and structures. Used in GIS software, it allows geodata to be exchanged effectively with other platforms and supports interaction with them. It belongs to the general FM task group (facility infrastructure management). The main advantages of the BISDM model are that its use allows you to:

  • organize the exchange of information in a heterogeneous environment according to uniform rules;
  • get a "physical" embodiment of the BIM concept and the recommended rules for managing a construction project;
  • maintain a single repository using GIS tools throughout the entire life cycle of the building (from design to decommissioning);
  • coordinate the work of various specialists in the project;
  • visualize the planned schedule and construction stages for all participants;
  • give a preliminary estimate of the cost and construction time (4D and 5D data);
  • control the progress of the project;
  • ensure the quality operation of the building, including maintenance and repairs;
  • become part of the asset management system, including the functions of analyzing the efficiency of space use (renting, storage facilities, employee management);
  • calculate and manage the energy efficiency of the building;
  • simulate the movement of human flows.

BISDM defines the rules for working with spatial data at the level of the interior premises of a building, including purpose and types of use, laid communications, installed equipment, accounting for repairs and maintenance, incident logging, and relationships with other company assets. The model helps create a unified repository of geographic and non-geographic data. The experience of the world's leading companies was used to identify the entities and to model, at the GDB (geodatabase) level, the spatial and logical relationships of all the physical elements that form both the building itself and its interior. Following the principles of BISDM significantly simplifies integration with other systems: at the first stage this is usually integration with CAD, and later, during the operation of the building, data exchange with ERP and EAM systems (SAP, TRIRIGA, Maximo, etc.).
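As an illustration only, the following minimal sketch (Python; the class names, attributes and sample values are invented and do not reproduce the actual BISDM schema) shows the kind of building / floor / space / equipment hierarchy that such a model formalizes and that a GDB would store:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Equipment:            # an installed asset inside a space
        asset_id: str
        kind: str               # e.g. "HVAC", "elevator"
        install_date: str

    @dataclass
    class Space:                # an interior room or zone
        space_id: str
        purpose: str            # intended use, e.g. "office", "storage"
        area_m2: float
        equipment: List[Equipment] = field(default_factory=list)

    @dataclass
    class Floor:
        level: int
        spaces: List[Space] = field(default_factory=list)

    @dataclass
    class Building:
        building_id: str
        floors: List[Floor] = field(default_factory=list)

        def total_area(self) -> float:
            # roll up space areas, e.g. for space-use efficiency analysis
            return sum(s.area_m2 for f in self.floors for s in f.spaces)

    # usage: one building, one floor, one office with an HVAC unit
    b = Building("HQ-1", [Floor(1, [Space("1.01", "office", 42.5,
                                          [Equipment("EQ-7", "HVAC", "2020-03-01")])])])
    print(b.total_area())       # 42.5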


Visualization of BISDM structural elements using ArcGIS.

When BISDM is used, the customer or owner of the facility obtains an end-to-end exchange of information: from the idea of creating the facility, through the development of the complete design and construction control with up-to-date information available by the time the facility is put into operation, to monitoring of parameters during operation and even during reconstruction or decommissioning of the facility. Following the BISDM paradigm, the GIS and the GDB created with its help become a common data repository for related systems. Often the GDB contains data created and operated by third-party systems. This must be taken into account when designing the architecture of the system being created.

At a certain stage, the accumulated "critical mass" of information allows you to move to a new qualitative level. For example, upon completion of the design phase of a new building, it is possible to automatically visualize 3D survey models in GIS, compile a list of equipment to be installed, calculate the kilometers of engineering networks to be laid, perform a number of verifications, and even give a preliminary financial estimate of the project cost.

Once again, when BISDM and ArcGIS are used together, it becomes possible to build 3D models automatically from the accumulated data, since the GDB contains a complete description of the object, including z-coordinates, floor assignment, types of element connections, equipment installation methods, materials, available personnel movement paths, the functional purpose of each element, and so on. It should be noted that after the initial import of all design materials into the BISDM GDB, additional content is needed for:

  • placement of 3D models of objects and equipment at designated places;
  • collecting information about the cost of materials and the procedure for their laying and installation;
  • checking clearances against the dimensions of installed non-standard equipment.

Through the use of ArcGIS, the import of additional 3D objects and reference books from external sources is simplified. The ArcGIS Data Interoperability module allows you to create procedures for importing such data and placing it correctly within the model. All formats used in the industry are supported, including IFC, AutoCAD Revit, and Bentley MicroStation.

Industry data models from IBM

IBM provides a set of storage management tools and models for a variety of industries:

  • IBM Banking and Financial Markets Data Warehouse (finance)
  • IBM Banking Data Warehouse
  • IBM Banking Process and Service Models
  • IBM Health Plan Data Model (health)
  • IBM Insurance Information Warehouse (insurance)
  • IBM Insurance Process and Service Models
  • IBM Retail Data Warehouse (retail)
  • IBM Telecommunications Data Warehouse (telecommunications)
  • InfoSphere Warehouse Pack:
    - for Customer Insight (to understand customers)
    - for Market and Campaign Insight (to understand campaigns and the market)
    - for Supply Chain Insight (for understanding suppliers).

For example, the IBM Banking and Financial Markets Data Warehouse model is designed to address the specific challenges of the banking industry in terms of data, and IBM Banking Process and Service Models in terms of processes and SOA (service-oriented architecture). For the telecommunications industry, the IBM Information FrameWork (IFW) and IBM Telecommunications Data Warehouse (TDW) models are offered. They help to significantly speed up the creation of analytical systems and to reduce the risks associated with developing business intelligence applications, managing corporate data and organizing data warehouses, taking into account the specifics of the telecommunications industry. IBM TDW capabilities cover the entire spectrum of the telecommunications market: from Internet providers and cable network operators offering wired and wireless telephony, data transmission and multimedia content, to multinational companies providing telephone, satellite, long-distance and international communication services, as well as global network operators. Today, TDW is used by large and small wired and wireless service providers around the world.

The InfoSphere Warehouse Pack for Customer Insight tool provides structured and easily deployed business content for a growing number of business projects and industries, including banking, insurance, finance, health insurance programs, telecommunications, retail and distribution. InfoSphere Warehouse Pack for Market and Campaign Insight helps business users maximize the effectiveness of market intelligence and marketing campaigns through a step-by-step, business-specific development process. Via InfoSphere Warehouse Pack for Supply Chain Insight, organizations can obtain current information on supply chain operations.


Esri's position within the IBM solution architecture.

Of particular note is IBM's approach to utilities and utility companies. To meet growing customer demands, utility companies need a more flexible architecture than they use today, as well as an industry-standard object model to facilitate the free exchange of information. This will improve the communication capabilities of energy companies, enabling them to communicate more cost-effectively, and give the new systems better visibility into all required resources, no matter where they are located within the organization. The basis for this approach is SOA (Service Oriented Architecture), a component model that establishes a correspondence between the functions of departments and the services of various applications that can be reused. The "services" of these components communicate through interfaces without hard binding, hiding from the user the full complexity of the systems behind them. In this mode, enterprises can easily add new applications regardless of the software vendor, operating system, programming language, or other internal characteristics of the software. The SAFE (Solution Architecture For Energy) concept is implemented on the basis of SOA; it enables the utility industry to gain a standards-based, holistic view of its infrastructure.

Esri ArcGIS® is a globally recognized software platform for geographic information systems (GIS) that supports the creation and management of digital assets of electric power, gas transmission, distribution, and telecommunications networks. ArcGIS allows you to conduct the most complete inventory of the components of the electrical distribution network, taking into account their spatial location. ArcGIS significantly extends the IBM SAFE architecture by providing the tools, applications, workflows, analytics, and information and integration capabilities needed to manage the smart grid. ArcGIS within IBM SAFE allows you to obtain information from various sources about infrastructure objects, assets, customers and employees with accurate data about their location, as well as to create, store and process geo-referenced information about enterprise assets (poles, pipelines, wires, transformers, cable ducts, etc.). ArcGIS within a SAFE infrastructure allows you to dynamically combine key business applications by combining data from GIS, SCADA, and customer service systems with external information such as traffic, weather conditions, or satellite imagery. Utilities use this combined information for a variety of purposes, from an overall picture of the operational environment to site inspections, maintenance, network analysis and planning.

The information components of a power supply enterprise can be modeled using several levels, which range from the lowest - physical - to the top, most complex level of business process logic. These layers can be integrated to meet typical industry requirements such as automated measurement logging and supervisory control and data acquisition (SCADA) control. By building the SAFE architecture, utility companies are making significant strides in advancing an industry-wide open object model called the Common Information Model (CIM) for Energy and Utilities. This model provides the necessary basis for moving many enterprises towards a service-oriented architecture, as it encourages the use of open standards for structuring data and objects. By having all systems use the same objects, the confusion and inelasticity associated with different implementations of the same objects will be reduced to a minimum. Thus, the definition of the “customer” object and other important business objects will be unified in all systems of the power supply company. With CIM, service providers and service consumers can now share a common data structure, making it easier to outsource costly business components as CIM establishes a common base on which to build information sharing.

Conclusion

Comprehensive industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to new application adoption. On the contrary, the implementation of data integration brings the company tangible income and increased efficiency.

A well-built model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, binary file, or text, where the value can be ambiguous). The most effective industry models are offered by professional vendors, including Esri and IBM. The high returns from using their models are achieved due to their significant level of detail and accuracy. They usually contain many data attributes. In addition, experts from Esri and IBM not only have extensive modeling experience, but are also well versed in building models for a particular industry.


Database architecture

The CMD schema is a description of the structure of the data model from the administrator's point of view.

The AMD schema is a description of the internal, or physical, model. It stores a description of the physical placement of data on storage media, including direct indications of where the data is located in memory (volumes, disks).

The CMD schema describes the structure of data, records, and fields.

DBMS support three main types of data models:

1. The hierarchical model. It assumes a root record from which branches descend.

Not all objects are conveniently described in this way: the hierarchy allows no arbitrary links between branches, and a large redundancy of information is characteristic.

2. The network model. It allows all the complexities of relationships to be represented correctly.

The model is convenient for representing links with data from the external environment, but less convenient for description in the database, which forces the user to do additional work studying how to navigate the links.

3. The relational model. It is based on the mathematical term relation, which in the simplest case is a table, for example a rectangular two-dimensional one.

The relational data structure was developed in the late 60s by a number of researchers, of which the most significant contribution was made by IBM employee Edgar Codd. With the relational approach, data is presented in the form of two-dimensional tables - the most natural for a person. At the same time, for data processing, Codd suggested using the apparatus of set theory - union, intersection, difference, Cartesian product.

Data type: this concept has the same meaning as in programming languages (i.e., the data type defines the internal representation in computer memory and the way a data instance is stored, as well as the set of values the instance can take and the set of valid operations on it). All modern databases support special data types designed to store integers, fractional floating-point numbers, characters and strings, and calendar dates. Many database servers implement other types as well; for example, the InterBase server has a special data type for storing large binary arrays of information (BLOBs).

A domain is the potential set of values of a simple data type; it is similar to a data subtype in some programming languages. A domain is defined by two elements: a data type and a boolean expression that is applied to the data. If this expression evaluates to true, then the data instance belongs to the domain.
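A minimal sketch of this definition (Python; the Domain class and the sample predicate n > 0 are illustrative, not part of any DBMS API):

    class Domain:
        """A domain = a base data type plus a boolean condition on values."""
        def __init__(self, base_type, predicate):
            self.base_type = base_type
            self.predicate = predicate

        def contains(self, value) -> bool:
            # a value belongs to the domain if it has the right type
            # and the boolean expression evaluates to true for it
            return isinstance(value, self.base_type) and self.predicate(value)

    positive_int = Domain(int, lambda n: n > 0)
    print(positive_int.contains(5))    # True
    print(positive_int.contains(-3))   # False
    print(positive_int.contains("5"))  # False: wrong base type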

A relation is a two-dimensional table of a special kind, consisting of a header and a body.

The header is a fixed set of attributes, each of which is defined on some domain, and there is a one-to-one correspondence between the attributes and the domains defining them.


For example, an attribute may be defined on a domain whose data type is integer and whose boolean condition is n > 0. The header does not change over time, unlike the body of the relation. The relation body is a collection of tuples, each of which is a set of attribute-value pairs.

The cardinality of a relation is the number of its tuples, and the degree of a relation is the number of its attributes.

The degree of a relation is constant for a given relation, while its cardinality changes over time. The cardinality of a relation is also called its cardinal number.
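These definitions can be restated in a short sketch (Python; the relation contents are invented):

    # header: a fixed set of attributes; body: a set of tuples (no duplicates)
    header = ("part_code", "material_code", "unit", "consumption_rate")
    body = {
        ("P1", "M1", "kg", 2.5),
        ("P2", "M1", "kg", 1.0),
        ("P2", "M3", "m",  4.0),
    }

    degree = len(header)       # number of attributes, constant for the relation
    cardinality = len(body)    # number of tuples, changes over time
    print(degree, cardinality) # 4 3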

The above concepts are theoretical and are used in the development of language tools and software systems for relational DBMS. In everyday work, their informal equivalents are used instead:

relation - table;

attribute - column or field;

tuple - record or row.

Thus, the degree of the relation is the number of columns in the table, and the cardinal number is the number of rows.

Since a relation is a set, and in classical set theory a set by definition cannot contain matching elements, a relation cannot have two identical tuples. Therefore, for a given relation there is always a set of attributes that uniquely identifies a tuple. This set of attributes is called a key.

The key must meet the following requirements:

· must be unique;

· must be minimal, that is, the removal of any attribute from the key leads to a violation of uniqueness.

As a rule, the number of attributes in a key is less than the degree of the relation; however, in the extreme case, the key may contain all attributes, since the combination of all attributes satisfies the uniqueness condition. Typically, a relation has several keys. Of all the keys of a relation (also called candidate keys), one is chosen as the primary key. When choosing a primary key, preference is usually given to the key with the fewest attributes. It is also inadvisable to use keys with long string values.

In practice, a special numeric attribute is often used as a primary key: an auto-increment field whose value can be generated by a trigger (a trigger is a special procedure that is called when changes are made to the database) or by special means built into the DBMS.
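For example, SQLite (accessed here through Python's standard sqlite3 module) generates such a surrogate key automatically, without a hand-written trigger; the table and column names are illustrative:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE employee (
                       emp_id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate primary key
                       name    TEXT NOT NULL)""")
    con.execute("INSERT INTO employee (name) VALUES ('Ivanov')")
    con.execute("INSERT INTO employee (name) VALUES ('Petrov')")
    for row in con.execute("SELECT emp_id, name FROM employee"):
        print(row)            # (1, 'Ivanov'), then (2, 'Petrov')
    con.close()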

The concepts described in this chapter are not specific to any particular database implementation, but are common to all of them. Thus, these concepts are the basis of a certain general model, which is called the relational data model.

One of the leading theorists of the relational approach, Date, established that the relational model consists of three parts:

· structural;

· manipulation;

· integrity.

The structural part of the model fixes relations as the only data structure used in the relational model.

In the manipulation part, two basic mechanisms for manipulating relational databases are fixed - relational algebra and relational calculus.

The integrity part is understood as a mechanism for ensuring that data is not destroyed or corrupted. It includes two basic integrity requirements for relational databases: entity integrity and referential integrity.

The entity integrity requirement is that any tuple of any relation must be distinguishable from any other tuple of that relation; in other words, any relation must have a primary key. This requirement is satisfied if the basic properties of relations are satisfied.
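A minimal sketch of both integrity requirements, using SQLite via Python's sqlite3 module (the schema is invented): the primary key enforces entity integrity, and the foreign key enforces referential integrity.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")          # referential checks are off by default in SQLite
    con.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("""CREATE TABLE emp  (emp_id  INTEGER PRIMARY KEY,
                                      dept_id INTEGER REFERENCES dept(dept_id))""")
    con.execute("INSERT INTO dept VALUES (10, 'Sales')")
    con.execute("INSERT INTO emp  VALUES (1, 10)")   # OK: department 10 exists
    try:
        con.execute("INSERT INTO emp VALUES (2, 99)")      # violates referential integrity
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
    try:
        con.execute("INSERT INTO dept VALUES (10, 'Again')")  # duplicate primary key
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
    con.close()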

The data manipulation language and the query language rely on a mathematical apparatus called the algebra of relations, for which the following operations are defined:

1. Standard set operations: ∩ - intersection, ∪ - union, \ - difference, × - Cartesian product.

2. Specific operations: projection, selection (restriction), join, division.

a. Union.

Let R1 and R2 be two relations with the same header, for example R1(part code, material code, unit of measure, consumption rate) and R2(part code, material code, unit of measure, consumption rate).

The union R1 ∪ R2 contains every tuple that occurs in at least one of the two relations. The degree of the operands is preserved, and the cardinality of the resulting set does not exceed the sum of the cardinalities of R1 and R2.

b. Intersection.

Selects the tuples that occur in both R1 and R2.

c. Difference.

Excludes from R1 the tuples that also occur in R2.

d. Cartesian product.

Here tuples are concatenated: each tuple of one relation is concatenated with each tuple of the other.

The degree of the result is the sum of the degrees of the operands, and its cardinality is the product of their cardinalities; for example, one relation with 3 tuples of degree 2 and another with 4 tuples of degree 3 give a product with 12 rows and 5 columns.
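A minimal sketch of these operations on relations represented as Python sets of tuples (the data is invented; attribute order plays the role of the header):

    # R1 and R2 have the same header: (part_code, material_code, unit, rate)
    R1 = {("P1", "M1", "kg", 2.5), ("P2", "M1", "kg", 1.0), ("P2", "M3", "m", 4.0)}
    R2 = {("P2", "M1", "kg", 1.0), ("P4", "M2", "l", 0.5)}

    union        = R1 | R2          # all tuples from either relation
    intersection = R1 & R2          # tuples present in both
    difference   = R1 - R2          # tuples of R1 not in R2

    # Cartesian product: every tuple of S1 concatenated with every tuple of S2
    S1 = {("a", 1), ("b", 2), ("c", 3)}                      # 3 tuples, degree 2
    S2 = {("x", 10, "u"), ("y", 20, "v"),
          ("z", 30, "w"), ("q", 40, "t")}                    # 4 tuples, degree 3
    product = {t1 + t2 for t1 in S1 for t2 in S2}            # 12 tuples, degree 5

    print(len(union), len(intersection), len(difference), len(product))  # 4 1 2 12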


THEME V CORPORATE DATABASES

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet / Intranet technologies and corporate database access solutions.

V.1. DATA ORGANIZATION IN CORPORATE SYSTEMS. CORPORATE DATABASES

The corporate database is the central link of the corporate information system and makes it possible to create a single information space of the corporation. The interaction of departments with the corporation's information resources is shown in Figure 1.1.

There are various definitions of databases.

A database (DB) is understood as a set of information logically related in such a way as to constitute a single set of data stored in the storage devices of a computer. This set acts as the initial data for the tasks solved in the course of operation of automated control systems, data processing systems, and information and computing systems.

The term database can be briefly formulated as a set of logically related data intended for sharing.

A database can also be understood as a collection of data stored together with minimal redundancy such that it can be used optimally for one or more applications.

The purpose of creating databases as a form of data storage is to build a data system that does not depend on the algorithms (software) adopted, the technical means used, or the physical location of the data in the computer. The database is assumed to serve multiple purposes (several users, many forms of documents and queries of a single user).

Basic database requirements:

  • Completeness of data presentation. The data in the database must adequately represent all information about the object and must be sufficient for the data processing systems that use it.
  • Database integrity. The data must be preserved during processing and in any situations that arise in the course of work.
  • Flexibility of the data structure. The database should allow data structures to be changed without violating its integrity and completeness when external conditions change.
  • Realizability. There must be an objective representation of the various objects, their properties and relationships.
  • Availability. Differentiated access to data must be provided.
  • Minimal redundancy. The database should have minimal redundancy in representing data about any object.

Knowledge is understood as a set of facts, patterns and heuristic rules with which a problem can be solved.

A knowledge base (KB) is a collection of databases and of the rules used, obtained from decision makers. The knowledge base is an element of expert systems.

Different ways of presenting data should be distinguished.

Physical Data - This is data stored in the computer's memory.

Logical representation of data corresponds to the user's representation of physical data. The difference between a physical and a corresponding logical representation of data is that the latter reflects some important relationships between physical data.

A corporate database is understood as a database that combines, in one form or another, all the necessary data and knowledge about the organization being automated. In corporate information systems, the concept of integrated databases has become established, in which the principle of single entry and multiple use of information is implemented.

Fig. 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases are either concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored in the storage devices of one computer. Figure 1.2 shows a diagram of a server application for accessing databases on various platforms.

Fig.1.2. Diagram of a heterogeneous centralized database

The centralization of information processing made it possible to eliminate such shortcomings of traditional file systems as incoherence, inconsistency and data redundancy. However, as databases grow, and especially when used in geographically dispersed organizations, problems arise. For example, for concentrated databases located in a telecommunications network node, through which various departments of an organization access data, with an increase in the volume of information and the number of transactions, the following difficulties arise:

  • Large data exchange flow;
  • High network traffic;
  • Low reliability;
  • Low overall performance.

Although it is easier to ensure the security, integrity, and consistency of information during updates in a concentrated database, these problems create certain difficulties. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • Higher degree of processing simultaneity due to load sharing;
  • Improved use of data on site when performing remote queries;
  • lower costs;
  • Easy to manage local databases.

The costs of creating a network with workstations (small computers) at its nodes are much lower than the costs of creating a similar system using a mainframe. Figure 1.3 shows a logical diagram of a distributed database.

Fig.1.3. Distributed corporate database.

We give the following definition of a distributed database.

Distributed database - this is a collection of information, files (relations) stored in different nodes of the information network and logically linked in such a way as to constitute a single set of data (the link can be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected, but physically located on several machines that are part of the same computer network.

The most important requirements for the characteristics of a distributed database are as follows:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • portability;
  • Location transparency;
  • Autonomy of distributed database nodes (Site Autonomy);
  • Processing of distributed requests;
  • Execution of distributed transactions.
  • Support for a homogeneous security system.

Location transparency allows users to work with databases without knowing anything about their location. The autonomy of distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) that accesses objects (tables or views) of different databases. When distributed transactions are executed, concurrency control is exercised over all the databases involved. Oracle7 uses two-phase commit technology to perform distributed transactions.
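Since two-phase commit is mentioned here, the following toy sketch (Python; the Participant interface and node names are invented, and failure handling is omitted) conveys the idea of the two phases: first every node votes, then the coordinator either commits everywhere or rolls back everywhere.

    class Participant:
        """One database node taking part in a distributed transaction (toy model)."""
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit = name, can_commit
        def prepare(self) -> bool:          # phase 1: vote yes/no
            return self.can_commit
        def commit(self):   print(self.name, "committed")
        def rollback(self): print(self.name, "rolled back")

    def two_phase_commit(participants):
        # phase 1: ask every node whether it is able to commit
        if all(p.prepare() for p in participants):
            for p in participants:          # phase 2a: everyone voted yes
                p.commit()
        else:
            for p in participants:          # phase 2b: at least one "no" vote
                p.rollback()

    two_phase_commit([Participant("oracle_node"), Participant("db2_node")])
    two_phase_commit([Participant("oracle_node"), Participant("db2_node", can_commit=False)])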

The databases that make up a distributed database need not be homogeneous (i.e. run by the same DBMS) or run on the same operating system environment and/or on the same type of computers. For example, one database could be an Oracle database on a SUN computer running SUN OS(UNIX), a second database could be run by a DB2 DBMS on an IBM 3090 mainframe running an MVS operating system, and a third database could be run by a SQL/DS DBMS also on IBM mainframe, but with a VM operating system. Only one condition is obligatory - all machines with databases must be accessible over the network they are part of.

The main task of a distributed database is to distribute data over the network and provide access to it. There are the following ways to solve this problem:

  • Each node stores and uses its own set of data, which is available for remote queries. Such a distribution is called partitioned.
  • Some data that is frequently used at remote sites may be duplicated. Such a distribution is called partially duplicated.
  • All data is duplicated in each node. Such a distribution is called fully redundant.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected), and the resulting subsets are stored in different nodes along with the unsplit data. Such a distribution is called split (fragmented); a sketch of such fragmentation is given after this list.
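A minimal sketch of horizontal and vertical fragmentation (Python; the table contents and the fragmentation criteria are invented):

    employees = [
        {"emp_id": 1, "name": "Ivanov",  "dept": "Sales", "salary": 900},
        {"emp_id": 2, "name": "Petrov",  "dept": "HR",    "salary": 800},
        {"emp_id": 3, "name": "Sidorov", "dept": "Sales", "salary": 950},
    ]

    # horizontal fragmentation: a subset of records goes to each node
    node_sales = [r for r in employees if r["dept"] == "Sales"]
    node_other = [r for r in employees if r["dept"] != "Sales"]

    # vertical fragmentation: a subset of attributes per node (key kept in both)
    node_public  = [{"emp_id": r["emp_id"], "name": r["name"]}     for r in employees]
    node_payroll = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employees]

    print(len(node_sales), len(node_other), node_payroll[0])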

When creating a distributed database at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual scheme for the entire network. This will provide logical data transparency for the user, as a result of which he will be able to form a request to the entire database, being at a separate terminal (it works, as it were, with a centralized database).
  • A schema is needed to locate data on the network. This will provide transparency in data placement so that the user does not have to specify where to forward the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The problem of heterogeneity is relatively easy to solve if the distributed database is heterogeneous in terms of hardware, but homogeneous in terms of software (the same DBMS in the nodes). If different DBMS are used in the nodes of a distributed system, means of converting data structures and languages ​​are needed. This should provide transparency of the transformation in the distributed database nodes.
  • It is necessary to solve the problem of managing dictionaries. To provide all kinds of transparency in a distributed database, programs that manage numerous dictionaries and reference books are needed.
  • It is necessary to define methods for executing queries in a distributed database. Methods for executing queries in a distributed database differ from similar methods in centralized databases, since individual parts of queries must be executed at the location of the corresponding data and transfer partial results to other nodes; at the same time, coordination of all processes should be ensured.
  • It is necessary to solve the problem of parallel execution of queries. In a distributed database, a complex mechanism for managing concurrent processing is needed, which, in particular, must ensure synchronization when information is updated, which guarantees data consistency.
  • The need for a developed methodology for the distribution and placement of data, including splitting, is one of the main requirements for a distributed database.

One of the actively developing new areas of computer system architecture, and a powerful tool for non-numerical information processing, is the database machine. Database machines are used to solve non-numerical tasks such as storing, searching and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, different content is put into the concept of data in numerical and non-numerical processing. Numerical processing uses objects such as variables, vectors, matrices, multidimensional arrays, constants and so on, while non-numerical processing uses objects such as files, records, fields, hierarchies, networks, relations and so on. Non-numerical processing is concerned directly with information about objects (for example, a particular employee or group of employees), and not with the employee file itself: the point is not to index the employee file to select a particular person, but to work with the content of the desired record. Huge volumes of information are usually subjected to non-numerical processing. In various applications, operations such as the following can be performed on this data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all customers;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find the description of the desired contract in a file containing legal documents;
  • view all files containing descriptions of patents and find a patent (if any) similar to the one proposed again.

To implement database machines, parallel and associative architectures are used as an alternative to the uniprocessor von Neumann structure, making it possible to work with large amounts of information in real time.

Database engines are gaining importance due to the exploration and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Data warehouses. Today, many recognize that most companies already operate several databases and, to work successfully with information, need not just different types of databases but different generations of DBMS. According to statistics, each organization uses an average of 2.5 different DBMS. The need to "isolate" the business of companies, or rather the people involved in this business, from the technological features of databases, and to provide users with a single view of corporate information regardless of where it is physically stored, has become obvious. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is creation of a single logical representation of data contained in different types of databases, or, in other words, a single corporate data model.

A new round of DW development became possible thanks to the improvement of information technology in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, making it easy to build complex database queries. Various middleware software provided communication between different types of databases, and, finally, information storage devices fell sharply in price.

A data bank may be present in the structure of a corporation.

A data bank is a functional and organizational component of automated control systems and information-computing systems that provides centralized information support to a group of users or to a set of tasks solved in the system.

A data bank is considered as an information and reference system, the main purpose of which is:

  • in the accumulation and maintenance in working condition of a set of information constituting the information base of the entire automated system or a certain set of tasks solved in it;
  • in the issuance of the data required by the task or the user;
  • in providing collective access to stored information;
  • in ensuring the necessary management of the use of information contained in the infobase.

Thus, a modern data bank is a complex software and hardware complex, which includes technical, system and network tools, databases and DBMS, information retrieval systems for various purposes.

V.2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

DBMS - a set of software and language tools designed to create, maintain and use databases.

The database management system provides data processing systems with access to databases. As already noted, DBMS play an important role in the creation of corporate information systems, and an especially important role in the creation of information systems that use distributed information resources based on modern network computer technologies.

The main feature of modern DBMS is that modern DBMS support such technologies as:

  • client/server technology.
  • Support for database languages. These are the schema definition language (SDL - Schema Definition Language), the data manipulation language (DML - Data Manipulation Language), the integrated language SQL (Structured Query Language), QBE (Query-By-Example) and QMF (Query Management Facility), an advanced peripheral tool for query specification and report generation for DB2, etc.;
  • Direct management of data in external memory.
  • Memory buffer management.
  • Transaction management. OLTP technology (On-Line Transaction Processing), OLAP - technology (On-Line Analysis Processing) for DW.
  • Ensure data protection and integrity. The use of the system is allowed only to users who have the right to access the data. When users perform operations on data, the consistency of the stored data (integrity) is maintained. This is important in corporate multi-user information systems.
  • Journalization.

Modern DBMS must meet the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful support for the conceptual data model to display custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Minimal data redundancy. Unlike file systems, a database must be a single set of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must protect the database from integrity violations caused by users.
  • Managing concurrent work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be performed in a certain order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS should support both centralized and distributed databases and thus become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two of the most significant features that determine the types of DBMS. According to them, the DBMS can be considered from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • DBMS "desktop". These products are primarily focused on working with personal data (desktop data). They have command sets for sharing common databases, but they are small in size (small office type). First of all, it is a DBMS like Access, dBASE, Paradox, ExPro. Why Access, dBASE, Paradox, ExPro have poor access to corporate data. The fact is that there is no easy way to overcome the barrier between personal and corporate data. And the point is not even that the mechanism of a personal data DBMS (or a small office) is focused on accessing data through many gateways, gateway products, etc. The problem is that these mechanisms typically involve full file transfers and a lack of extensive index support, resulting in queues to the server that are practically a halt in large systems.
  • Specialized high-performance multi-user DBMS. Such DBMS are characterized by a multi-user system kernel, a data manipulation language and the following functions characteristic of developed multi-user DBMS:
  • organizing a buffer pool;
  • the presence of a system for processing transaction queues;
  • the presence of mechanisms for multi-user data blocking;
  • transaction logging;
  • availability of access control mechanisms.

These are DBMS such as Oracle, DB2, SQL/Server, Informix, Sybase, ADABAS, Titanium and others; they provide a wide range of services for processing corporate databases.

When working with databases, the mechanism of transactions is used.

transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements that is executed as a whole (all or nothing) and transfers the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity . The transaction is executed as an atomic operation - either the entire transaction is executed, or the entire transaction is not executed.
  • (C) Consistency. A transaction moves a database from one consistent (consistent) state to another consistent (consistent) state. Within a transaction, database consistency can be broken.
  • (I) Isolation . Transactions of different users should not interfere with each other (for example, as if they were performed strictly in turn).
  • (D) Durability. If the transaction is completed, then the results of its work should be saved in the database, even if the system crashes at the next moment.

The transaction usually starts automatically from the moment the user joins the DBMS and continues until one of the following events occurs:

  • A COMMIT WORK command has been issued (to commit a transaction).
  • ROLLBACK WORK command issued.
  • The user has disconnected from the DBMS.
  • There was a failure of the system.

For the user, a transaction usually has an atomic character. In fact, it is a complex mechanism of interaction between the user (application) and the database. Enterprise software uses online transaction processing engines (Online Transaction Processing, OLTP); in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a lot of information. These systems are designed (and appropriately optimized) to process large amounts of data, complex transactions and intensive read/write operations.
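A minimal sketch of this all-or-nothing behaviour, using SQLite through Python's sqlite3 module (the transfer scenario and schema are invented):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE account (acc_id  INTEGER PRIMARY KEY,
                                         balance INTEGER NOT NULL CHECK (balance >= 0))""")
    con.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
    con.commit()

    try:
        # a transfer is one logical unit of work: either both updates happen or neither
        con.execute("UPDATE account SET balance = balance - 70 WHERE acc_id = 1")
        con.execute("UPDATE account SET balance = balance - 70 WHERE acc_id = 2")  # would go negative
        con.commit()                                   # COMMIT WORK
    except sqlite3.IntegrityError:
        con.rollback()                                 # ROLLBACK WORK: account 1 keeps its 100

    print(list(con.execute("SELECT * FROM account")))  # [(1, 100), (2, 50)]
    con.close()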

Unfortunately, the information placed in the databases of OLTP systems is not very suitable for use by ordinary users (because of the high degree of table normalization, specific data presentation formats and other factors). Therefore, data from different information "pipelines" is sent (in the sense of being copied) to a warehouse for storage, sorting and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by data warehouses.

Delivery of information to the end user is handled by systems for on-line analytical processing (On-Line Analytical Processing, OLAP), which provide extremely easy access to data through convenient tools for generating queries and analyzing results. In OLAP systems, the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for speed of data extraction and collection of aggregated information, and they are aimed at ordinary users (they have an intuitive interface). If an OLTP system gives answers to simple questions like "what was the level of sales of product N in region M in January 199x?", then OLAP systems are ready for more complex user requests, for example: "Give an analysis of sales of product N for all regions according to the plan for the second quarter in comparison with the previous two years."
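For illustration, the two kinds of request might be expressed as follows (SQLite-flavoured SQL kept in Python strings; the sales table, its columns and the parameters are invented):

    # OLTP-style point query: one fact about one product in one region and month
    oltp_query = """
        SELECT SUM(amount)
        FROM   sales
        WHERE  product = :product AND region = :region
               AND strftime('%Y-%m', sale_date) = :month
    """

    # OLAP-style analytical query: aggregate the same facts over all regions and years
    olap_query = """
        SELECT region,
               strftime('%Y', sale_date) AS year,
               SUM(amount)               AS total
        FROM   sales
        WHERE  product = :product
        GROUP  BY region, year
        ORDER  BY region, year
    """

    print(oltp_query, olap_query)   # the strings would be executed through a DB API such as sqlite3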

Client/server architecture

In modern systems of distributed information processing, client/server technology takes center stage. In systems with client-server architecture, data processing is divided between the client computer and the server computer, which communicate over a network. This separation of data processing is based on the grouping of functions. Typically, the database server computer is dedicated to database operations, while the client computer runs application programs. Figure 2.1 shows a simple client-server architecture system that includes a computer acting as the server and another computer acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client-server architecture system: the database resides on the server computer, which is connected over the network to IBM-compatible PCs running the applications.

The main function of the client computer is to run the application (user interface and presentation logic) and communicate with the server when required by the application.

Server - This is an object (computer) that provides services to other objects at their request.

As the term implies, the main function of the server computer is to serve the needs of the client. The term "Server" is used to refer to two different groups of functions: a file server and a database server (hereinafter, these terms mean, depending on the context, either the software that implements these groups of functions, or computers with this software). File servers are not designed to perform database operations, their main function is to share files among several users, i.e. providing simultaneous access of many users to files on a computer - a file server. An example of a file server is Novell's NetWare operating system. The database server can be installed and run on a file server computer. Oracle DBMS in the form of NLM (Network Loadable Module) runs in a NetWare environment on a file server.

The local network server must have resources that correspond to its functional purpose and the needs of the network. Note that due to the orientation towards the approach open systems, it is more correct to speak of logical servers (meaning a set of resources and software tools that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is expedient to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the application programs using it.

One of the important requirements for the server is that the operating system in which the database server runs must be multitasking (and preferably, but not necessarily, multiuser). For example, the Oracle DBMS installed on a personal computer with the MS-DOS (or PC-DOS) operating system, which does not meet the multitasking requirement, cannot be used as a database server. But the same Oracle DBMS installed on a computer with the multitasking (though not multiuser) OS/2 operating system can be a database server. Many varieties of UNIX, as well as MVS, VM and some other operating systems, are both multitasking and multiuser.

Distributed Computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on several machines for end users using various means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

Client/server technology is based on such basic technologies as:

  • Technologies of operating systems, the concept of interaction of open systems, creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical user interface technologies ( GUI);
  • Etc.

Advantages of client-server technology:

  • The client/server technology allows computing on heterogeneous computing environments. Platform Independence: Access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL/DS, Oracle, Sybase.
  • Load balance between client and server.
  • Performing calculations where it happens most efficiently;
  • Provides efficient scaling capability;
  • Cross platform computing. Cross-platform computing is defined simply as the implementation of technologies in heterogeneous computing environments. The following options should be provided here:
  • The application must run on multiple platforms;
  • On all platforms, it should have the same interface and logic of work;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • It should have simple and consistent support.

Distributed Computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downscaling. Downscaling is the transfer of mainframe applications to small computer platforms.

  • Reduce infrastructure and hardware costs. Cost-Effective: The availability of low-cost computing hardware and the growing prevalence of local area networks make client-server technology more cost-effective than other data processing technologies. Equipment can be upgraded as needed.

  • Reducing the overall application execution time;
  • Reducing client memory usage;
  • Reducing network traffic.

  • Ability to work with multimedia: To date, a lot of programs for working with multimedia for PCs have been created. There are either no such programs for terminal-host configuration, or they are very expensive.
  • The ability to use more computing resources for database operations: since applications run on client computers, additional (compared to the terminal-host configuration) resources are freed up on the server computer for database operations, such as CPU and operational resources. memory.
  • Increased programmer productivity: programmer productivity is increased by using tools such as SQL*Forms and CASE, which allow applications to be developed faster than with programming languages such as C, PL/1 or COBOL.
  • Increasing end user productivity: Nowadays, many end users have embraced systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, etc.

The back-end interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. An illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

SQL language. The high-level query language SQL (Structured Query Language) is used to implement queries to databases as a data definition, data manipulation and query language, and has been adopted as a standard. SQL was originally adopted as the data language of IBM software products and as the data manipulation language of the relational DBMS System R developed by IBM. An important feature of SQL is that the same language is available through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of many built-in SQL language features provided specifically for constructing interactive applications, where an interactive application is a program written to support database access by an end user working at an interactive terminal. SQL provides functions for defining, manipulating and managing database data and is transparent to the user from the point of view of the underlying DBMS.
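A minimal sketch of SQL used through an application programming interface rather than interactively (Python's sqlite3 standing in for an arbitrary SQL DBMS; the table and data are invented), with a run-time parameter passed separately from the statement text:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE product (code TEXT PRIMARY KEY, name TEXT, price REAL)")
    con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                    [("P1", "bolt", 0.10), ("P2", "nut", 0.05)])

    # the same SQL statement a user could type interactively is issued here
    # through the programming interface, with a parameter supplied at run time
    max_price = 0.08
    for code, name in con.execute("SELECT code, name FROM product WHERE price <= ?", (max_price,)):
        print(code, name)        # P2 nut
    con.close()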

Fig. 2.3. Scheme for executing user requests to distributed databases.

The internal structure of databases is determined by the data models used. The conceptual model has more abstraction capabilities and richer semantics than external models. External models are often called syntactic or operational models, referring to the syntactic nature of the management and application as a means of user interaction with the database. In information modeling, there are various levels of abstraction, from the level of the conceptual model to the level of the physical data model, that affect the architecture of the DBMS.

The data model consists of three components:

  • A data structure that represents the user's view of the database.
  • The valid operations that can be performed on the data structure. It must be possible to work with this structure using various DDL and DML operations. A rich structure is worthless if its contents cannot be manipulated.
  • Constraints for integrity control. The data model must be provided with means to preserve its integrity and protect it. As an example, consider the following two constraints:
  • Each subtree must have a source node. Hierarchical databases cannot store child nodes without a parent node.
  • For a relational database, there cannot be identical tuples. For a file, this requirement means that all records must be unique.

One of the most important characteristics of the DBMS is the ability to link objects.

There are the following types of links between objects:

  • One-to-One (1:1). One object of one set can be associated with one object of another set.
  • One-to-Many (1:M). One object of one set can be related to many objects of another set.
  • Many-to-Many (M:N). One object of one set can be associated with many objects of another set, while one object of the other set can in turn be associated with many objects of the first set (a sketch of how such links are expressed in a schema is given after this list).
  • Branched. One object of one set can be associated with objects of many sets.
  • Recursive. One object of a given set can be associated with an object of the same set.

There are the following main data models:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET / INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the "client-server" architecture is that, in accordance with the concept of open systems, they are required to be mobile in the widest possible class of open systems hardware and software solutions. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempting to create systems that support all possible protocols leads to their overload with network details at the expense of functionality.

An even more complex aspect of this problem is related to the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, representation of numbers, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

A common solution to the problem of mobility of systems based on the "client-server" architecture is to rely on software packages that implement remote procedure call protocols (RPC - Remote Procedure Call). Using these tools, calling a service at the remote host looks like a normal procedure call. RPC tools, which, of course, contain all the information about the specifics of the local network equipment and network protocols, translates the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, RPC programs convert client data formats to intermediate machine-independent formats and then convert to server data formats. When passing response parameters, similar transformations are performed.


Industry Data Models

The main purpose of models is to facilitate orientation in the data space and help in highlighting the details that are important for business development. In today's business environment, it is absolutely essential to have a clear understanding of the relationships between the various components and a good understanding of the big picture of the organization. Identification of all the details and relationships using models allows the most efficient use of time and tools for organizing the work of the company.

Data models are abstract models that describe how data is represented and accessed. Data models define data elements and relationships between them in a given area. A data model is a navigational tool for both business and IT professionals that uses a specific set of symbols and words to accurately explain a specific class of real information. This improves communication within the organization and thus creates a more flexible and stable application environment.

The data model uniquely defines the meaning of the data, which in this case is structured data (as opposed to unstructured data such as an image, binary file, or text, where the value can be ambiguous).

As a rule, higher-level (more general in content) and lower-level (more detailed) models are distinguished. The upper level of modeling is the so-called conceptual data model, which gives the most general picture of the functioning of an enterprise or organization. The conceptual model includes the main concepts or subject areas that are critical to the functioning of the organization; usually their number does not exceed 12-15. Such a model describes the classes of entities important to the organization (business objects), their characteristics (attributes), and the associations between pairs of these classes (i.e., relationships). Since the terminology in business modeling has not yet completely settled, in various English-language sources conceptual data models may also be called subject area models or subject enterprise data models.

The next hierarchical level is logical data models. They may also be referred to as enterprise data models or business models. These models contain data structures, their attributes, and business rules, and represent the information used by an enterprise from a business perspective. In such a model, data is organized in the form of entities and relationships between them. The logical model represents the data in a way that is easily understood by business users. A logical model may include a data dictionary: a list of all entities with their exact definitions, which allows different categories of users to have a common understanding of all input and output information flows of the model. The next, lower level of modeling is the physical implementation of the logical model using specific software tools and technical platforms.

The logical model contains the detailed enterprise business solution, which usually takes the form of a normalized model. Normalization is the process that ensures that each data element in the model has only one value and is completely and uniquely dependent on the primary key. Data elements are organized into groups according to their unique identification. The business rules that govern the data elements must be fully included in the normalized model, with a preliminary check of their validity and correctness. For example, a data element such as Customer Name would most likely be split into First Name and Last Name and grouped with other relevant data elements into a Customer entity with a primary key of Customer ID.
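
As an illustration only, the fragment below sketches in SQL DDL how the normalized Customer entity from the example above might look; the table and column names are assumptions, not part of any particular enterprise or industry model.

    -- Normalized Customer entity: every attribute depends only on the primary key.
    CREATE TABLE customer (
        customer_id  INTEGER     NOT NULL,   -- primary key (Customer ID)
        first_name   VARCHAR(50) NOT NULL,   -- Customer Name split into its parts
        last_name    VARCHAR(50) NOT NULL,
        birth_date   DATE,
        CONSTRAINT pk_customer PRIMARY KEY (customer_id)
    );

    -- Repeating contact details go into a separate table instead of being stored
    -- as phone1, phone2, ... columns of customer.
    CREATE TABLE customer_phone (
        customer_id   INTEGER     NOT NULL,
        phone_number  VARCHAR(20) NOT NULL,
        CONSTRAINT pk_customer_phone PRIMARY KEY (customer_id, phone_number),
        CONSTRAINT fk_phone_customer FOREIGN KEY (customer_id)
            REFERENCES customer (customer_id)
    );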

The logical data model is independent of applied technologies, such as a database, network technologies or reporting tools, and their means of physical implementation. An organization can only have one enterprise data model. Logical models typically include thousands of entities, relationships, and attributes. For example, a data model for a financial institution or a telecommunications company may contain about 3,000 industry concepts.

It is important to distinguish between the logical and the semantic data model. The logical data model represents the corporate business solution, while the semantic data model represents an applied business solution. The same corporate logical data model can be implemented using different semantic models, i.e. semantic models can be considered the next level of modeling, approaching the physical models. Moreover, each of these models represents a separate "slice" of the corporate data model in accordance with the requirements of different applications. For example, in the corporate logical data model the Client entity will be completely normalized, while in a semantic model for a data mart it can be represented as a multidimensional structure.
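
A minimal sketch of such a multidimensional "slice" for a data mart: the fully normalized customer data is recombined into a denormalized dimension feeding a sales fact table (a star schema). All names here are illustrative assumptions.

    -- Denormalized customer dimension of a data mart (star schema).
    CREATE TABLE dim_customer (
        customer_key  INTEGER      NOT NULL,  -- surrogate key of the dimension
        customer_id   INTEGER      NOT NULL,  -- business key from the source system
        full_name     VARCHAR(101) NOT NULL,  -- first and last name recombined
        city          VARCHAR(50),
        segment       VARCHAR(30),
        CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key)
    );

    -- Fact table referencing the dimension.
    CREATE TABLE fact_sales (
        customer_key  INTEGER       NOT NULL,
        date_key      INTEGER       NOT NULL,
        product_key   INTEGER       NOT NULL,
        sales_amount  DECIMAL(12,2) NOT NULL,
        CONSTRAINT fk_sales_customer FOREIGN KEY (customer_key)
            REFERENCES dim_customer (customer_key)
    );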

A company has two ways to create an enterprise logical data model: build it itself or use a ready-made industry logical data model. Here the difference in terms reflects only different approaches to building the same logical model. If a company independently develops and implements its own logical data model, such a model is, as a rule, simply called the corporate logical model. If the organization decides to use a finished product from a professional supplier, then we can speak of an industry logical data model. The latter is a ready-made logical data model that reflects the functioning of a particular industry with a high degree of accuracy. An industry logical model is a domain-specific, integrated view of all the information that must be in an enterprise data warehouse to answer both strategic and tactical business questions. Like any other logical data model, the industry model does not depend on applied solutions. It also does not include derived data or other calculations intended to speed up data retrieval. As a rule, most of the logical structures of such a model translate well into an effective physical implementation. Such models are developed by many vendors for a wide variety of areas: finance, manufacturing, tourism, healthcare, insurance, etc.

An industry logical data model contains information that is common to the industry and therefore cannot be a complete solution for a company. Most companies have to extend the model by an average of 25%, adding data elements and expanding definitions. The finished models contain only the key data elements; the remaining elements must be added to the appropriate business objects when the model is installed in the company.

Industry logical data models contain a significant number of abstractions. Abstraction refers to the union of similar concepts under common names such as Event or Participant. This adds flexibility to industry models and makes them more unified. Thus, the concept of Event is applicable to all industries.
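
A sketch of how such an abstraction could be expressed in SQL DDL, using Party as a supertype of Person and Organization; the structure and names are illustrative assumptions rather than a fragment of any vendor's model.

    -- Supertype holding the attributes common to all participants.
    CREATE TABLE party (
        party_id    INTEGER     NOT NULL,
        party_type  VARCHAR(20) NOT NULL,    -- e.g. 'PERSON' or 'ORGANIZATION'
        CONSTRAINT pk_party PRIMARY KEY (party_id)
    );

    -- Subtypes carry only the attributes specific to them.
    CREATE TABLE person (
        party_id    INTEGER     NOT NULL,
        first_name  VARCHAR(50) NOT NULL,
        last_name   VARCHAR(50) NOT NULL,
        CONSTRAINT pk_person PRIMARY KEY (party_id),
        CONSTRAINT fk_person_party FOREIGN KEY (party_id) REFERENCES party (party_id)
    );

    CREATE TABLE organization (
        party_id    INTEGER      NOT NULL,
        legal_name  VARCHAR(100) NOT NULL,
        CONSTRAINT pk_organization PRIMARY KEY (party_id),
        CONSTRAINT fk_organization_party FOREIGN KEY (party_id) REFERENCES party (party_id)
    );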

Business Intelligence expert Steve Hoberman outlines five factors to consider when deciding whether to purchase an industry data model. The first is the time and resources needed to build the model. If an organization needs to achieve results quickly, then the industry model will give an advantage. Using an industry model may not immediately provide a picture of the entire organization, but it can save a significant amount of time. Instead of actual modeling, time will be spent linking existing structures to the industry model, as well as discussing how best to customize it to the needs of the organization (for example, which definitions should be changed and which data elements should be added).

The second factor is the time and money needed to keep the model running. If an enterprise data model is not part of a methodology that keeps it accurate and up-to-date, the model becomes outdated very quickly. The industry data model can prevent this risk as it is kept up to date by external resources. Of course, changes occurring within the organization must be reflected in the model by the company itself, but industry changes will be reproduced in the model by its supplier.

The third factor is experience in risk assessment and modeling. Creating an enterprise data model requires skilled resources from both the business and IT staff. As a rule, managers know well either the work of the organization as a whole, or the activities of a particular department. Few of them have both broad (company-wide) and deep (within departmental) knowledge of their business. Most managers usually know only one area well. Therefore, in order to get a corporate-wide picture, significant business resources are required. This also increases the requirements for IT staff. The more business resources required to create and test a model, the more experienced the analysts need to be. They must not only know how to get information from business personnel, but also be able to find common ground in controversial areas and be able to present all this information in an integrated way. The one who creates the model (in many cases, this is the same analyst) must have good modeling skills. Creating corporate logic models requires modeling “for the future” and the ability to convert complex business into literally “squares and lines”.

On the other hand, the industry model allows you to use the experience of third-party specialists. Industry-specific logic models use proven modeling methodologies and teams of experienced professionals to avoid common and costly problems that can arise when developing enterprise data models within an organization.

The fourth factor is the existing application infrastructure and vendor relationships. If an organization already uses many tools from the same vendor and has established relationships with them, then it makes sense to order the industry model from them as well. Such a model will be able to work freely with other products of the same supplier.

The fifth factor is intra-industry exchange of information. If a company needs to share data with other organizations operating in the same field, then an industry model can be very helpful in this situation. Organizations within the same industry use similar structural components and terminology. Nowadays, in most industries, companies are forced to share data to run their business successfully.

The industry models offered by professional vendors are the most effective. The high efficiency of their use is achieved due to the significant level of detail and accuracy of these models. They usually contain many data attributes. In addition, the creators of these models not only have extensive modeling experience, but also are well versed in building models for a particular industry.

Industry data models provide companies with a single, integrated view of their business information. Many companies find it difficult to integrate their data, although this is a prerequisite for most enterprise-wide projects. According to a study by The Data Warehousing Institute (TDWI), more than 69% of organizations surveyed found integration to be a significant barrier to new application adoption. Conversely, implementing data integration brings significant returns to a company.

The industry data model, in addition to linking with existing systems, provides great benefits for enterprise-wide projects such as enterprise resource planning (ERP), master data management, business intelligence, data quality improvement and employee development.

Thus, industry logical data models are an effective tool for integrating data and obtaining a holistic picture of the business. The use of logical models seems to be a necessary step towards the creation of corporate data warehouses.

Publications

  1. Steve Hoberman. Leveraging the Industry Logical Data Model as Your Enterprise Data Model
  2. Claudia Imhoff. Fast-Tracking Data Warehousing & Business Intelligence Projects via Intelligent Data Modeling


THEME V CORPORATE DATABASES

LECTURE 8

V.1. Organization of data in corporate systems. Corporate databases.

V.2. DBMS and structural solutions in corporate systems.

V.3. Internet / Intranet technologies and corporate database access solutions.

V.1. DATA ORGANIZATION IN CORPORATE SYSTEMS. CORPORATE DATABASES

The corporate database is the central link of a corporate information system and makes it possible to create a single information space for the corporation. The structure of the interaction of departments with the corporation's information resources is shown in Fig. 1.1.

There are various definitions of databases.

A database (DB) is understood as a set of information logically related in such a way as to constitute a single set of data stored in the storage devices of a computer. This set acts as the initial data for the tasks solved in the course of operation of automated control systems, data processing systems, and information and computing systems.

In short, a database can be defined as a collection of logically related data intended for shared use.

A database can also be understood as a collection of data stored together with minimal redundancy such that it can be used optimally for one or more applications.

The purpose of creating a database as a form of data storage is to build a data system that does not depend on the algorithms (software) adopted, on the technical means used, or on the physical location of the data in the computer. The database assumes multi-purpose use (several users, many forms of documents and queries of a single user).

Basic database requirements:

  • Completeness of data presentation. The data in the database must adequately represent all information about the object and must be sufficient for the data processing system.
  • Database integrity. The data must be preserved during processing by the data processing system and in any situations that arise in the course of work.
  • Flexibility of the data structure. The database must allow data structures to be changed without violating its integrity and completeness when external conditions change.
  • Realizability. There must be an objective representation of the various objects, their properties, and relationships.
  • Availability. It is necessary to provide differentiation of access to data.
  • Minimal redundancy. The database should represent data about any object with minimal redundancy.

Knowledge is understood as a set of facts, patterns, and heuristic rules with which the problem can be solved.

A knowledge base (KB) is a collection of databases and the rules used, obtained from decision makers. The knowledge base is an element of expert systems.

Different ways of presenting data should be distinguished.

Physical data is data stored in the computer's memory.

The logical representation of data corresponds to the user's view of the physical data. The difference between the physical and the corresponding logical representation of data is that the latter reflects some important relationships between the physical data.

A corporate database is understood as a database that combines, in one form or another, all the necessary data and knowledge about the organization being automated. In corporate information systems, the concept of integrated databases has become widespread, in which the principle of single entry and multiple use of information is implemented.

Fig. 1.1. The structure of the interaction of departments with the information resources of the corporation.

Corporate databases are concentrated (centralized) or distributed.

A concentrated (centralized) database is a database whose data is physically stored in the storage devices of one computer. Fig. 1.2 shows a diagram of a server application accessing databases on various platforms.

Fig.1.2. Diagram of a heterogeneous centralized database

The centralization of information processing made it possible to eliminate such shortcomings of traditional file systems as incoherence, inconsistency and data redundancy. However, as databases grow, and especially when used in geographically dispersed organizations, problems arise. For example, for concentrated databases located in a telecommunications network node, through which various departments of an organization access data, with an increase in the volume of information and the number of transactions, the following difficulties arise:

  • Large data exchange flow;
  • High network traffic;
  • Low reliability;
  • Low overall performance.

Although it is easier to ensure the security, integrity, and consistency of information during updates in a concentrated database, these problems create certain difficulties. Data decentralization is proposed as a possible solution to these problems. Decentralization achieves:

  • Higher degree of processing simultaneity due to load sharing;
  • Better use of data at local sites when performing remote queries;
  • lower costs;
  • Easy to manage local databases.

The costs of creating a network with workstations (small computers) at its nodes are much lower than the costs of creating a similar system using a mainframe. Figure 1.3 shows a logical diagram of a distributed database.

Fig.1.3. Distributed corporate database.

We give the following definition of a distributed database.

A distributed database is a collection of information and files (relations) stored in different nodes of an information network and logically linked in such a way as to constitute a single set of data (the link can be functional or through copies of the same file). Thus, it is a set of databases that are logically interconnected but physically located on several machines that are part of the same computer network.

The most important requirements for the characteristics of a distributed database are as follows:

  • Scalability;
  • Compatibility;
  • Support for various data models;
  • portability;
  • Location transparency;
  • Autonomy of distributed database nodes (Site Autonomy);
  • Processing of distributed requests;
  • Execution of distributed transactions.
  • Support for a homogeneous security system.

Location transparency allows users to work with databases without knowing anything about their location. The autonomy of distributed database nodes means that each database can be maintained independently of the others. A distributed query is a query (SQL statement) during which objects (tables or views) of different databases are accessed. When distributed transactions are executed, concurrency control is exercised over all the databases involved. Oracle7 uses two-phase commit to perform distributed transactions.
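
As a sketch of what a distributed query and a distributed transaction can look like, the fragment below uses Oracle-style database links; the link names (sales_db, branch_db) and the tables are assumptions made for illustration only.

    -- Distributed query: the local orders table is joined with a customer table
    -- that physically resides in a remote database reached through a link.
    SELECT c.customer_name, SUM(o.amount)
    FROM   orders o
           JOIN customer@sales_db c ON c.customer_id = o.customer_id
    GROUP  BY c.customer_name;

    -- Distributed transaction: both updates must commit or roll back together;
    -- the DBMS coordinates this with two-phase commit.
    UPDATE account           SET balance = balance - 100 WHERE account_id = 1;
    UPDATE account@branch_db SET balance = balance + 100 WHERE account_id = 7;
    COMMIT;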

The databases that make up a distributed database need not be homogeneous (i.e. run by the same DBMS), nor run in the same operating system environment or on the same type of computers. For example, one database could be an Oracle database on a SUN computer running SunOS (UNIX), a second database could be run by a DB2 DBMS on an IBM 3090 mainframe with the MVS operating system, and a third database could be run by a SQL/DS DBMS, also on an IBM mainframe but with the VM operating system. The only obligatory condition is that all machines with databases must be accessible over the network they are part of.

The main task of a distributed database is the distribution of data over the network and the provision of access to it. There are the following ways to solve this problem:

  • Each node stores and uses its own set of data, which is available for remote queries. Such a distribution is called partitioned (divided).
  • Some data that is frequently used at remote sites may be duplicated. Such a distribution is called partially duplicated.
  • All data is duplicated in each node. Such a distribution is called fully redundant.
  • Some files can be split horizontally (a subset of records is selected) or vertically (a subset of attribute fields is selected); the resulting subsets are stored in different nodes along with unsplit data. Such a distribution is called split (fragmented); a sketch is given below.
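
A minimal sketch of horizontal and vertical fragmentation in SQL; the table, column, and node names are assumptions introduced only to illustrate the idea.

    -- Horizontal fragmentation: each node stores the rows of its own region.
    CREATE TABLE customer_north AS
        SELECT * FROM customer WHERE region = 'NORTH';   -- kept at node A
    CREATE TABLE customer_south AS
        SELECT * FROM customer WHERE region = 'SOUTH';   -- kept at node B

    -- Vertical fragmentation: frequently used columns are separated from rarely
    -- used ones; both fragments keep the primary key so that the original
    -- relation can be reconstructed by a join.
    CREATE TABLE customer_main AS
        SELECT customer_id, first_name, last_name FROM customer;
    CREATE TABLE customer_details AS
        SELECT customer_id, birth_date, notes FROM customer;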

When creating a distributed database at the conceptual level, you have to solve the following tasks:

  • It is necessary to have a single conceptual scheme for the entire network. This will provide logical data transparency for the user, as a result of which he will be able to form a request to the entire database, being at a separate terminal (it works, as it were, with a centralized database).
  • A schema is needed to locate data on the network. This will provide transparency in data placement so that the user does not have to specify where to forward the request to get the required data.
  • It is necessary to solve the problem of heterogeneity of distributed databases. Distributed databases can be homogeneous or heterogeneous in terms of hardware and software. The heterogeneity problem is relatively easy to solve if the distributed database is heterogeneous in hardware but homogeneous in software (the same DBMS in all nodes). If different DBMSs are used in the nodes of a distributed system, means of converting data structures and languages are needed. This should provide transparency of the transformation across the distributed database nodes.
  • It is necessary to solve the problem of managing dictionaries. To provide all kinds of transparency in a distributed database, programs that manage numerous dictionaries and reference books are needed.
  • It is necessary to define methods for executing queries in a distributed database. Methods for executing queries in a distributed database differ from similar methods in centralized databases, since individual parts of queries must be executed at the location of the corresponding data and transfer partial results to other nodes; at the same time, coordination of all processes should be ensured.
  • It is necessary to solve the problem of parallel execution of queries. In a distributed database, a complex mechanism for managing concurrent processing is needed, which, in particular, must ensure synchronization when information is updated, which guarantees data consistency.
  • The need for a developed methodology for the distribution and placement of data, including splitting, is one of the main requirements for a distributed database.

One of the actively developing new areas of computer systems architecture, providing a powerful tool for non-numerical information processing, is database machines. Database machines are used to solve non-numerical tasks such as storing, searching, and transforming documents and facts, and working with objects. Following the definition of data as digital and graphic information about objects of the surrounding world, different content is put into the concept of data in numerical and non-numerical processing. Numerical processing uses objects such as variables, vectors, matrices, multidimensional arrays, constants, and so on, while non-numerical processing uses objects such as files, records, fields, hierarchies, networks, relations, and so on. Non-numerical processing deals directly with information about objects (for example, a particular employee or group of employees), rather than with the employee file itself. The employee file is not indexed here to select a particular person; what matters is the content of the desired record. Huge volumes of information are usually subjected to non-numerical processing. In various applications, operations such as the following can be performed on this data:

  • increase the salary of all employees of the company;
  • calculate the bank interest on the accounts of all customers;
  • make changes to the list of all goods in stock;
  • find the required abstract from all texts stored in the library or in the bibliographic information retrieval system;
  • find the description of the desired contract in a file containing legal documents;
  • view all files containing descriptions of patents and find a patent (if any) similar to the one proposed again.

To implement database machines, parallel and associative architectures are used as an alternative to the uniprocessor von Neumann structure, allowing work with large amounts of information in real time.

Database engines are gaining importance in connection with the exploration and application of artificial intelligence concepts such as knowledge representation, expert systems, inference, pattern recognition, etc.

Information storages. Today, many recognize that most companies already operate several databases and, in order to work successfully with information, need not just different types of databases but different generations of DBMS. According to statistics, each organization uses an average of 2.5 different DBMSs. The need to "isolate" the business of companies, or rather the people involved in this business, from the technological features of databases, and to provide users with a single view of corporate information regardless of where it is physically stored, has become obvious. This stimulated the emergence of data warehousing technology (Data Warehousing, DW).

The main goal of DW is the creation of a single logical representation of the data contained in different types of databases or, in other words, a single corporate data model.

A new round of DW development became possible thanks to the improvement of information technology in general, in particular the emergence of new types of databases based on parallel query processing, which in turn relied on advances in the field of parallel computers. Query builders with an intuitive graphical interface were created, making it easy to build complex database queries. Various middleware provided communication between different types of databases, and, finally, information storage devices fell sharply in price.

The structure of a corporation may include data banks.

A data bank is a functional and organizational component of automated control systems and information and computing systems that provides centralized information support for a group of users or for a set of tasks solved in the system.

A data bank is considered as an information and reference system, the main purpose of which is:

  • in the accumulation and maintenance in working condition of a set of information constituting the information base of the entire automated system or a certain set of tasks solved in it;
  • in the issuance of the data required by the task or the user;
  • in providing collective access to stored information;
  • in ensuring the necessary management of the use of information contained in the infobase.

Thus, a modern data bank is a complex software and hardware complex, which includes technical, system and network tools, databases and DBMS, information retrieval systems for various purposes.

V.2. DBMS AND STRUCTURAL SOLUTIONS IN CORPORATE SYSTEMS

Database and knowledge management systems

An important component of modern information systems are database management systems (DBMS).

DBMS - a set of software and language tools designed to create, maintain and use databases.

The database management system provides data processing systems with access to databases. As already noted, an important role of DBMS is acquired in the creation of corporate information systems and a particularly important role in the creation of information systems using distributed information resources based on modern network computer technologies.

The main feature of modern DBMSs is that they support technologies such as:

  • client/server technology.
  • Support for database languages. These are the database schema definition language (SDL, Schema Definition Language), the data manipulation language (DML, Data Manipulation Language), and integrated languages such as SQL (Structured Query Language), QBE (Query-By-Example), and QMF (Query Management Facility), an advanced peripheral tool for query specification and report generation for DB2, etc.;
  • Direct management of data in external memory.
  • Memory buffer management.
  • Transaction management: OLTP technology (On-Line Transaction Processing) and OLAP technology (On-Line Analytical Processing) for DW.
  • Ensure data protection and integrity. The use of the system is allowed only to users who have the right to access the data. When users perform operations on data, the consistency of the stored data (integrity) is maintained. This is important in corporate multi-user information systems.
  • Journalization.

Modern DBMS must meet the database requirements listed above. In addition, they must comply with the following principles:

  • Data independence.
  • Versatility. The DBMS must have powerful support for the conceptual data model to display custom logical views.
  • Compatibility. The DBMS must remain operational with the development of software and hardware.
  • Minimal data redundancy. Unlike file systems, a database must be a single set of integrated data.
  • Data protection. The DBMS must provide protection against unauthorized access.
  • Data integrity. The DBMS must prevent users from tampering with the database.
  • Managing concurrent work. The DBMS must protect the database from inconsistencies in the shared access mode. To ensure a consistent state of the database, all user requests (transactions) must be performed in a certain order.
  • The DBMS must be universal. It should support different data models on a single logical and physical basis.
  • The DBMS should support both centralized and distributed databases and thus become an important link in computer networks.

Considering a DBMS as a class of software products focused on maintaining databases in automated systems, we can distinguish two of the most significant features that determine the types of DBMS. According to them, the DBMS can be considered from two points of view:

  • their capabilities in relation to distributed (corporate) databases;
  • their relationship to the type of data model implemented in the DBMS.

In relation to corporate (distributed) databases, the following types of DBMS can be conventionally distinguished:

  • DBMS "desktop". These products are primarily focused on working with personal data (desktop data). They have command sets for sharing common databases, but they are small in size (small office type). First of all, it is a DBMS like Access, dBASE, Paradox, ExPro. Why Access, dBASE, Paradox, ExPro have poor access to corporate data. The fact is that there is no easy way to overcome the barrier between personal and corporate data. And the point is not even that the mechanism of a personal data DBMS (or a small office) is focused on accessing data through many gateways, gateway products, etc. The problem is that these mechanisms typically involve full file transfers and a lack of extensive index support, resulting in queues to the server that are practically a halt in large systems.
  • Specialized high-performance multi-user DBMS. Such DBMSs are characterized by the presence of a multi-user system kernel, a data manipulation language, and the following functions that are typical for developed multi-user DBMSs:
  • organizing a buffer pool;
  • the presence of a system for processing transaction queues;
  • the presence of mechanisms for multi-user data blocking;
  • transaction logging;
  • availability of access control mechanisms.

DBMSs such as Oracle, DB2, SQL Server, Informix, Sybase, ADABAS, Titanium, and others provide a wide range of services for processing corporate databases.

When working with databases, the mechanism of transactions is used.

A transaction is a logical unit of work.

A transaction is a sequence of data manipulation statements that is executed as a whole (all or nothing) and transfers the database from one consistent state to another consistent state.

A transaction has four important properties, known as the ACID properties:

  • (A) Atomicity. The transaction is executed as an atomic operation: either the entire transaction is executed, or none of it is.
  • (C) Consistency. A transaction moves the database from one consistent state to another consistent state; within the transaction, database consistency may be temporarily violated.
  • (I) Isolation. Transactions of different users must not interfere with each other (they should behave as if they were executed strictly one after another).
  • (D) Durability. If a transaction is completed, the results of its work must be preserved in the database, even if the system crashes the next moment.

A transaction usually starts automatically from the moment the user connects to the DBMS and continues until one of the following events occurs:

  • A COMMIT WORK command has been issued (to commit a transaction).
  • A ROLLBACK WORK command has been issued (to roll back the transaction).
  • The user has disconnected from the DBMS.
  • There was a failure of the system.

For the user, a transaction is usually atomic in nature. In fact, it is a complex mechanism of interaction between the user (application) and the database. Enterprise systems software uses a real-time transaction processing engine (Online Transaction Processing Systems, OLTP); in particular, accounting programs, software for receiving and processing client orders, and financial applications produce a lot of information. These systems are designed (and appropriately optimized) for processing large amounts of data, complex transactions, and intensive read/write operations.
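
A minimal SQL sketch of a transaction that is either completed as a whole or rolled back; the orders and order_line tables and the values are assumptions used only for illustration.

    -- Register an order and its line item as one logical unit of work.
    INSERT INTO orders (order_id, customer_id, order_date)
        VALUES (1001, 42, DATE '2024-01-15');
    INSERT INTO order_line (order_id, product_id, quantity)
        VALUES (1001, 7, 3);

    COMMIT WORK;      -- both rows become visible to other users together

    -- Had a failure occurred before COMMIT, the application (or the DBMS after
    -- a system crash) would undo the partial work:
    -- ROLLBACK WORK;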

Unfortunately, the information placed in the databases of OLTP systems is not very suitable for use by ordinary users (because of the high degree of table normalization, specific data presentation formats, and other factors). Therefore, data from different information pipelines is sent (that is, copied) to a warehouse for storage, sorting, and subsequent delivery to the consumer. In information technology, the role of such warehouses is played by data warehouses (information storages).

Delivery of information to the end user is handled by online analytical processing systems (On-line Analytical Processing, OLAP), which provide extremely easy access to data through convenient tools for generating queries and analyzing results. In OLAP systems, the value of the information product is increased through the use of various methods of analysis and statistical processing. In addition, these systems are optimized for data extraction speed and the collection of summarized information, and are aimed at ordinary users (they have an intuitive interface). If an OLTP system answers simple questions like "what was the level of sales of product N in region M in January 199x?", then OLAP systems are ready for more complex user requests, for example: "Give an analysis of sales of product N for all regions according to the plan for the second quarter, compared with the previous two years."
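
The contrast between the two kinds of questions can be sketched in SQL roughly as follows; the sales table and its columns are assumptions introduced for illustration.

    -- OLTP-style question: a narrow lookup over detailed records.
    SELECT SUM(amount)
    FROM   sales
    WHERE  product = 'N' AND region = 'M'
           AND sale_year = 1999 AND sale_month = 1;

    -- OLAP-style question: aggregation across all regions, compared with the
    -- plan and with the two previous years.
    SELECT region,
           sale_year,
           SUM(amount)      AS actual_sales,
           SUM(plan_amount) AS planned_sales
    FROM   sales
    WHERE  product = 'N' AND sale_quarter = 2
           AND sale_year IN (1997, 1998, 1999)
    GROUP  BY region, sale_year
    ORDER  BY region, sale_year;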

Client/server architecture

In modern distributed information processing systems, client/server technology takes center stage. In systems with a client-server architecture, data processing is divided between a client computer and a server computer, which communicate over a network. This division of data processing is based on the grouping of functions. Typically, a database server computer is dedicated to performing database operations, while client computers run application programs. Figure 2.1 shows a simple client-server system that includes one computer acting as the server and another acting as its client. Each machine performs different functions and has its own resources.

Fig. 2.1. A client-server architecture system: a database server computer connected over the network to client computers (IBM-compatible PCs) running applications.

The main function of the client computer is to run the application (user interface and presentation logic) and communicate with the server when required by the application.

A server is an object (a computer) that provides services to other objects at their request.

As the term implies, the main function of the server computer is to serve the needs of the client. The term "Server" is used to refer to two different groups of functions: a file server and a database server (hereinafter, these terms mean, depending on the context, either the software that implements these groups of functions, or computers with this software). File servers are not designed to perform database operations, their main function is to share files among several users, i.e. providing simultaneous access of many users to files on a computer - a file server. An example of a file server is Novell's NetWare operating system. The database server can be installed and run on a file server computer. Oracle DBMS in the form of NLM (Network Loadable Module) runs in a NetWare environment on a file server.

The local network server must have resources that correspond to its functional purpose and the needs of the network. Note that due to the orientation towards the open systems approach, it is more correct to speak of logical servers (meaning a set of resources and software tools that provide services over these resources), which are not necessarily located on different computers. A feature of a logical server in an open system is that if, for reasons of efficiency, it is expedient to move the server to a separate computer, then this can be done without the need for any modification, both of itself and of the application programs using it.

One of the important server requirements is that the operating system in which the database server is hosted must be multitasking (and preferably, but not necessarily, multiuser). For example, the Oracle DBMS installed on a personal computer with an MS-DOS (or PC-DOS) operating system that does not meet the requirement for multitasking cannot be used as a database server. And the same Oracle DBMS installed on a computer with a multitasking (though not multiuser) OS / 2 operating system can be a database server. Many varieties of UNIX, MVS, VM, and some other operating systems are both multitasking and multiuser.

Distributed Computing

The term "distributed computing" is often used to refer to two different, albeit complementary, concepts:

  • Distributed database;
  • Distributed data processing.

The application of these concepts makes it possible to organize access to information stored on several machines for end users using various means.

There are many types of servers:

  • Database server;
  • Print server;
  • Remote access server;
  • Fax server;
  • Web server, etc.

At the core of client/server technology are such basic technologies as:

  • Technologies of operating systems, the concept of interaction of open systems, creation of object-oriented environments for the functioning of programs;
  • Telecommunication technologies;
  • Network technologies;
  • Graphical user interface technologies ( GUI);
  • Etc.

Advantages of client-server technology:

  • The client/server technology allows computing on heterogeneous computing environments. Platform Independence: Access to heterogeneous network environments that include different types of computers with different operating systems.
  • Independence from data sources: access to information from heterogeneous databases. Examples of such systems are DB2, SQL/DS, Oracle, Sybase.
  • Load balance between client and server.
  • Performing calculations where it happens most efficiently;
  • Provides efficient scaling capability;
  • Cross platform computing. Cross-platform computing is defined simply as the implementation of technologies in heterogeneous computing environments. The following options should be provided here:
  • The application must run on multiple platforms;
  • On all platforms, it should have the same interface and logic of work;
  • The application must integrate with the native operating environment;
  • It should behave the same on all platforms;
  • It should have simple and consistent support.

Distributed Computing. Distributed computing involves the distribution of work among several computers (although distributed computing is a broader concept).

Downscaling. Downscaling is the transfer of mainframe applications to small computer platforms.

  • Reduced infrastructure and hardware costs. Cost-effectiveness: the availability of low-cost computing hardware and the growing prevalence of local area networks make client-server technology more cost-effective than other data processing technologies. Equipment can be upgraded as needed.
  • Reduced overall application execution time;
  • Reduced client memory usage;
  • Reduced network traffic.

  • Ability to work with multimedia: To date, a lot of programs for working with multimedia for PCs have been created. There are either no such programs for terminal-host configuration, or they are very expensive.
  • The ability to use more computing resources for database operations: since applications run on the client computers, additional server resources (compared to the terminal-host configuration), such as CPU and main memory, are freed up for database operations.
  • Increased programmer productivity: programmer productivity is increased by using tools such as SQL*Forms and CASE to develop applications faster than with programming languages such as C, PL/1, or COBOL.
  • Increasing end user productivity: Nowadays, many end users have embraced systems such as Lotus, Paradox, Word Perfect, Harvard Graphics, etc.

The back-end interface is defined and fixed. Therefore, it is possible to create new client parts of an existing system (an example of interoperability at the system level).

Fig. 2.2. An illustration of client access to a server share.

How to implement client-server technology

The installation of a system based on client-server technology and capable of distributed data processing is discussed below. The following computer hardware and software is required:

  • database server computer;
  • client computers;
  • communication network;
  • network software;
  • application software.

The SQL language. The high-level query language SQL (Structured Query Language) is used to implement queries to databases, serving as a language for data definition, data manipulation, and data control, and it has been adopted as a standard. SQL was originally adopted as the data language of IBM's software products and as the data manipulation language of IBM's relational DBMS SYSTEM R. An important feature of SQL is that the same language is available through two different interfaces: an interactive interface and an application programming interface (dynamic SQL). Dynamic SQL consists of a set of built-in SQL language features provided specifically for constructing interactive applications, where an interactive application is a program written to support database access by an end user working at an interactive terminal. SQL provides functions for defining, manipulating, and managing database data and is transparent to the user with respect to the underlying DBMS.
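
A small sketch of the two interfaces: the same query issued interactively and then through dynamic SQL, written here with MySQL-style PREPARE/EXECUTE purely as an illustration; the customer table is an assumption.

    -- Interactive interface: the statement is typed and executed directly.
    SELECT first_name, last_name FROM customer WHERE customer_id = 42;

    -- Dynamic SQL: the statement text is prepared and parameterized at run time,
    -- which is how an application program would typically submit it.
    PREPARE find_customer FROM
        'SELECT first_name, last_name FROM customer WHERE customer_id = ?';
    SET @id = 42;
    EXECUTE find_customer USING @id;
    DEALLOCATE PREPARE find_customer;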

Fig. 2.3. Scheme for executing user requests to distributed databases.

The internal structure of databases is determined by the data models used. The conceptual model has greater abstraction capabilities and richer semantics than the external models. External models are often called syntactic or operational models, referring to the syntactic nature of the controls and of the application as the user's means of interacting with the database. In information modeling there are various levels of abstraction, from the conceptual model down to the physical data model, and these levels affect the architecture of the DBMS.
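
One way to picture these levels in a relational DBMS: an external model is often exposed as a view over the conceptual (logical) schema, while the physical level (storage structures, indexes) stays hidden from the view's users. A minimal sketch with illustrative names:

    -- Conceptual level: the base table as the designer defined it.
    CREATE TABLE product (
        product_id    INTEGER       NOT NULL,
        product_name  VARCHAR(60)   NOT NULL,
        unit_price    DECIMAL(10,2) NOT NULL,
        unit_cost     DECIMAL(10,2) NOT NULL,
        CONSTRAINT pk_product PRIMARY KEY (product_id)
    );

    -- External level: a user-facing view that exposes prices but hides internal costs.
    CREATE VIEW product_catalog AS
        SELECT product_id, product_name, unit_price FROM product;

    -- Physical level: storage structures such as indexes, invisible to users of the view.
    CREATE INDEX idx_product_name ON product (product_name);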

The data model has three components:

  • A data structure that represents the database from the user's point of view.
  • Valid operations performed on the data structure. It must be possible to work with this structure using data definition (DDL) and data manipulation (DML) operations. A rich structure is worthless if you cannot manipulate its contents.
  • Constraints for integrity control. The data model must be provided with means to preserve its integrity and protect it. As an example, consider the following two constraints (a sketch of how they are declared follows the list):
  • Each subtree must have a source node. Hierarchical databases cannot store child nodes without a parent node.
  • For a relational database, there cannot be identical tuples. For a file, this means that all records must be unique.
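
A minimal SQL sketch of how a relational DBMS declares such integrity constraints; the department and employee tables are assumptions for illustration.

    -- "No identical tuples": the primary key guarantees that every row is unique.
    CREATE TABLE department (
        dept_id    INTEGER     NOT NULL,
        dept_name  VARCHAR(60) NOT NULL,
        CONSTRAINT pk_department PRIMARY KEY (dept_id)
    );

    -- "Every child must have a parent": a NOT NULL foreign key plays the role of
    -- the hierarchical rule that a subtree cannot exist without its source node.
    CREATE TABLE employee (
        emp_id    INTEGER     NOT NULL,
        emp_name  VARCHAR(60) NOT NULL,
        dept_id   INTEGER     NOT NULL,
        CONSTRAINT pk_employee PRIMARY KEY (emp_id),
        CONSTRAINT fk_employee_department FOREIGN KEY (dept_id)
            REFERENCES department (dept_id)
    );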

One of the most important characteristics of the DBMS is the ability to link objects.

There are the following types of links between objects (a DDL sketch follows the list):

  • One-to-One (1:1). One object of one set can be associated with one object of another set.
  • One-to-Many (1:M). One object of one set can be related to many objects of another set.
  • Many-to-Many (M:N). One object of one set can be associated with many objects of another set, but at the same time, one object of another set can be associated with many objects of the first set.
  • branched . One object of one set can be associated with objects of many sets.
  • Recursive . One object of a given set can be associated with an object of the same set.
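
A sketch of how these link types are usually represented in relational DDL: 1:M through a foreign key, M:N through a junction table, and a recursive link through a self-referencing foreign key. All names are illustrative assumptions, and the referenced customer, student, and course tables are assumed to exist already.

    -- One-to-Many (1:M): each order belongs to exactly one customer.
    CREATE TABLE orders (
        order_id     INTEGER NOT NULL PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer (customer_id)
    );

    -- Many-to-Many (M:N): a junction table links students and courses.
    CREATE TABLE enrollment (
        student_id  INTEGER NOT NULL REFERENCES student (student_id),
        course_id   INTEGER NOT NULL REFERENCES course (course_id),
        PRIMARY KEY (student_id, course_id)
    );

    -- Recursive: a staff member may refer to another row of the same table.
    CREATE TABLE staff (
        staff_id    INTEGER NOT NULL PRIMARY KEY,
        manager_id  INTEGER REFERENCES staff (staff_id)
    );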

There are the following main data models:

  • Relational data model.
  • Hierarchical data model.
  • Incomplete network data model.
  • CODASYL data model.
  • Extended network data model.

V.3. INTERNET / INTRANET TECHNOLOGIES AND CORPORATE DATABASE ACCESS SOLUTIONS

The main problem of systems based on the "client-server" architecture is that, in accordance with the concept of open systems, they are required to be mobile in the widest possible class of open systems hardware and software solutions. Even if we restrict ourselves to UNIX-based local area networks, different networks use different equipment and communication protocols. Attempting to create systems that support all possible protocols leads to their overload with network details at the expense of functionality.

An even more complex aspect of this problem is related to the possibility of using different representations of data in different nodes of a heterogeneous local network. Different computers may have different addressing, representation of numbers, character encoding, etc. This is especially important for high-level servers: telecommunications, computing, databases.

A common solution to the problem of mobility of systems based on the "client-server" architecture is to rely on software packages that implement remote procedure call (RPC, Remote Procedure Call) protocols. With these tools, calling a service on a remote host looks like a normal procedure call. The RPC tools, which naturally contain all the information about the specifics of the local network equipment and network protocols, translate the call into a sequence of network interactions. Thus, the specifics of the network environment and protocols are hidden from the application programmer.

When a remote procedure is called, RPC programs convert client data formats to intermediate machine-independent formats and then convert to server data formats. When passing response parameters, similar transformations are performed.


This article focuses on the architecture of data warehouses: what to be guided by when building one, which approaches work, and why.

"The fairy tale is a lie - but there is a hint in it ..."

Grandfather planted... a warehouse. And the warehouse grew big, very big. Only nobody really knew how it worked. So grandfather called a review. He summoned grandmother, granddaughter, the cat and the mouse to a family council and said: “Our warehouse has grown. Data flows in from every system, tables beyond counting. Users build their own reports. It would seem all is well - live and prosper. There is only one sorrow: nobody knows how it works. It devours disks beyond counting - you cannot buy enough! And users keep coming to me with complaints: either a report hangs, or the data is stale. And sometimes it is an outright disaster - we bring reports to the tsar-father, and the numbers do not agree with one another. If the tsar grows angry, neither my head nor yours will stay on its shoulders. So I have decided to gather you and ask your advice: what are we going to do?”

He cast his eyes over the assembly and asked:
- Tell me, grandmother, do you know how our warehouse is arranged?
- No, grandfather, I do not. And how would I know? Such fine fellows guard it - what moustaches! You dare not come near. I went to visit them once and baked them pies. They ate the pies, wiped their moustaches and said: “Why have you come, grandmother? What warehouse? Just tell us what report you need - we will make it for you! The main thing is to bring pies more often - they are awfully tasty.”
- And you, my beloved granddaughter, do you know how our warehouse is arranged?
- No, grandfather, I do not. They gave me some access to it. I connected, I looked - and there were tables beyond counting, hidden away in all sorts of schemas. My eyes ran wide... At first I was confused. Then I looked closer: some tables are empty, others are only half filled. And the data seems to be duplicated. No wonder we cannot keep up with disks, with such redundancy!
- Well, cat, what can you say about our warehouse? Is there anything good in it?
- How could I not, grandfather - I will tell you. At my granddaughter's request I tried to build a pilot in a separate schema - a small data mart. To understand which trade is profitable for our state - which goods the merchants do well with, pay their tribute on, and fill the treasury - and which ones do poorly. So I began to pick data out of that warehouse, collected the facts and tried to match them up by product. And what did I see, grandfather: the products seem to be the same, but look at their attributes - they are different! I combed them with my granddaughter's comb, combed and combed, and brought them to a uniformity pleasing to the eye. But I rejoiced too early: the next day I ran my scripts to refresh the wonderful data in the mart - and everything fell apart! “How so?” I thought. “Granddaughter will be upset - today we were to show our pilot to the minister. How can we go with data like this?”
- Sad tales you tell, cat. Well, little mouse, did you not try to find out about the warehouse? You are a lively, nimble, sociable creature! What will you tell us?
- How could I not try, grandfather - I may be a quiet mouse, but a nimble one. One day the granddaughter asked the cat to get hold of the data model of our state warehouse. And the cat, of course, came to me: “Mouse, you are our only hope!” Well, why not do a good deed for good people (and cats)? I went to the castle where the head of the warehouse keeps the data model in a safe, and hid. I waited for him to take that model out of the safe. As soon as he stepped out for coffee, I jumped onto the table. I looked at the model - and could not understand a thing! How so? I did not recognize our warehouse! We have thousands upon thousands of tables and data flowing in unstoppable streams - and here everything is harmonious and beautiful... I looked at that model once more, and he put it back in the safe.
- Yes, very strange things, you told us, mouse.
Grandpa thought hard.
What shall we do, my friends? After all, you won’t live long with such a repository ... Users will soon lose patience completely.

Whatever our grandfather from the fairy tale decides - to build a new warehouse or to try to revive the existing one - conclusions must be drawn before “rolling up our sleeves” again.
Let's put aside the organizational aspects - such as the danger of concentrating expertise in a narrow, closed group, the lack of control over processes, the lack of transparency in the architecture of the systems used in the enterprise, and so on.
Today I would like to focus on building the architecture of a particular system (or group of systems) - data warehouses: what should be kept in focus first of all when an organization starts building such a complex and expensive system as a warehouse.

Debriefing

None of us, when working on the creation and development of a system, wants it to end up a “temporary shack”, or a solution that will “wither away” in a year or two because it cannot meet the requirements and expectations of customers and the business. However strong the shift towards “agile methodologies” may be today, it is far more pleasant to feel like a master who makes violins than an artisan who whittles sticks for disposable drumsticks.
Our intention sounds natural: to build solid, high-quality systems that will not demand regular “night vigils with a file”, that we will not be ashamed of in front of end users, and that will not look like a “black box” to all the “uninitiated” who come after us.

First, let's list the typical problems we regularly encounter when working with data warehouses. Let's simply write down what we have, without yet trying to order or formalize it.

  1. In principle, we have a good warehouse: if you don't touch it, everything works. True, as soon as a change is required, “local collapses” begin.
  2. Data is loaded daily, by the schedule, within one large process, within 8 hours. And that suits us. But if a failure suddenly occurs, manual intervention is required. And then everything can work unpredictably for a long time, because a human has to take part in the process.
  3. A release has been rolled out - expect problems.
  4. Some single source failed to deliver its data on time - and all the processes are waiting.
  5. Data integrity is controlled by the database - so our processes crash when it is violated.
  6. We have a very large warehouse - 2000 tables in one common schema, and 3000 more in many other schemas. We already have little idea of how they are arranged and why they appeared. Therefore it can be hard for us to reuse anything, and many problems have to be solved all over again, because that is easier and faster than digging through “someone else's code”. As a result we have discrepancies and duplicated functionality.
  7. We expect the source to provide quality data. But it turns out this is not the case. As a result, we spend a lot of time reconciling our final reports. We have become very good at it - we even have a streamlined process. True, it takes time. But users are used to it...
  8. The user does not always trust our reports and demands justification for this or that figure. In some cases he is right, in others he is not. But it is very hard for us to justify the figures, because we have no means of “end-to-end analysis” (data lineage).
  9. We could bring in additional developers. But we have a problem - how do we bring them up to speed? What is the most efficient way to parallelize the work?
  10. How do we develop the system gradually, without spending a whole year on the “core of the system”?
  11. The data warehouse is associated with the corporate model. But we know for certain (we saw it at bank XYZ) that a model can be built indefinitely (at bank XYZ we went around discussing business entities for six months without any progress). Why have it at all? Or maybe we are better off without it, if it causes so many problems? Maybe we can generate it somehow?
  12. We have decided to maintain a model. But how do we develop the warehouse data model systematically? Do we need “rules of the game”, and what might they be? What will that give us? What if we get the model wrong?
  13. Should we keep the data, or the history of its changes, if “the business does not need it”? We would not like to “store garbage” and complicate the use of this data for real tasks. Should the warehouse keep history? What kind? How does a warehouse work with time?
  14. Do we need to try to unify the data in the warehouse if we already have a reference data (NSI) management system? If there is MDM, does that mean the whole master data problem is now solved?
  15. We are expected to replace key accounting systems soon. Should the data warehouse be ready for a change of source? How do we achieve that?
  16. Do we need metadata? What do we mean by that? Where exactly can it be used? How can it be implemented? Does it need to be kept “in one place”?
  17. Our customers are extremely unstable in their requirements and desires - something is constantly changing. In general, our business is very dynamic. While we are building something, it already becomes unnecessary. How do we make sure we deliver results as quickly as possible - like hot cakes?
  18. Users demand speed. But we cannot run our main load processes often, because this loads the source systems (and hurts their performance) - so we add extra, targeted data feeds that pick up just what we need. True, this results in a lot of feeds, and later we will throw some of that data away. On top of that there will be a reconciliation problem. But there is no other way...
That is already quite a lot, and the list is by no means complete - it is easy to extend and elaborate. We will not hide it away in a drawer, but hang it in a conspicuous place, keeping these issues in the focus of our attention as we work.
Our task is to arrive at a comprehensive solution as a result.

Antifragility

Looking at our list, one conclusion can be drawn. It is not difficult to create some kind of “database for reporting”, throw data there, or even build some kind of routine data update processes. The system begins to live somehow, users appear, and with them obligations and SLAs, new requirements arise, additional sources are connected, methodologies change - all this must be taken into account in the development process.

After some time, the picture is as follows:
"Here's the vault. And it works if you don't touch it. Problems arise when we have to change something.”

A change arrives whose impact we are unable to evaluate and comprehend (because we did not build such tools into the system from the start) - and so, to avoid the risk, we do not touch what exists but build one more extension on the side, and another, and another - turning our solution into slums, or, as they say in Latin America, “favelas”, where even the police are afraid to go.
There is a feeling of losing control over one's own system, of chaos. More and more hands are required to maintain the existing processes and fix problems. And changes become harder and harder to make. In other words, the system becomes unstable under stress and non-adaptive to change. On top of that, there is a strong dependence on the people who “know the fairway”, since no one has a “map”.

This property of an object - collapsing under the influence of chaos, random events and upheavals - is what Nassim Nicholas Taleb calls fragility. He also introduces the opposite concept: antifragility, when the object is not destroyed by stress and accidents but draws direct benefit from them (“Antifragile: Things That Gain from Disorder”).
In other words, it can be called adaptability, or resilience to change.

What does this mean in this context? What are the "sources of chaos" for IT systems? And what does it mean to "capitalize on chaos" in terms of IT architecture?
The first thought that comes to mind is changes that come from outside. What is the outside world for the system - for a data warehouse in particular? First of all, of course, changes coming from the warehouse's data sources:

  • changing the formats of incoming data;
  • replacement of some data source systems with others;
  • changing rules/platforms for system integration;
  • changing the interpretation of data (formats are saved, the logic of working with data changes);
  • changing the data model, if the integration is done at the data level (parsing the database transaction log files);
  • growth in data volumes - while there was little data in the source system and the load was small, you could pull the data at any time with an arbitrarily heavy query; now the data and the load have grown, and strict restrictions apply;
  • etc.
The source systems themselves, the composition of information and its structure, the type of integration interaction, as well as the very logic of working with data can change. Each system implements its own data model and approaches to working with them that meet the goals and objectives of the system. And no matter how hard they try to unify industry models and reference practices, nuances will inevitably emerge anyway. (And besides, the process of industry unification itself, for various reasons, is not moving forward much.)
A culture of working with corporate data - the presence and governance of an information architecture, a single semantic model, master data management (MDM) systems - somewhat eases the task of consolidating data in the warehouse, but does not remove the need for it.

No less critical changes are initiated by storage consumers (requirement changes):

  • previously the available data was enough to build a report - now additional fields or a new data source need to be connected;
  • previously implemented data processing methods have become outdated - the algorithms, and everything they affect, need to be reworked;
  • Previously, everyone was satisfied with the current value of the dictionary attribute on the information panel - now the value that is relevant at the time of the occurrence of the analyzed fact / event is required;
  • there was a requirement for the depth of the history of data storage, which was not there before - to store data not for 2 years, but for 10 years;
  • Previously, it was enough to have data as of “end of day/period” - now you need the state of data “intraday”, or at the time of a certain event (for example, making a decision on a loan application - for Basel II);
  • earlier we were satisfied with reporting on data for yesterday (T-1) or later, now we need T0;
  • etc.
Both integration interactions with source systems and requirements from warehouse data consumers are external factors for the data warehouse: one source system replaces another, data volumes grow, incoming data formats change, user requirements change, etc. And all these are typical external changes for which our system - our repository - must be ready. With the right architecture, they shouldn't kill the system.

But that's not all.
Speaking of variability, we first of all recall external factors. After all, it seems to us that inside we control everything, right? Yes and no. Yes, most of the factors outside our zone of influence are external ones. But there is also “internal entropy”. And it is precisely because of it that we sometimes need to return “to point zero” and start the game over.
In life we often tend to start from scratch. Why do we tend to do this? And is it really so bad?
Applied to IT: for the system itself this can be very good - a chance to reconsider individual decisions, especially when we can do it locally. Refactoring is the process of untangling the “web” that periodically forms as the system evolves. Going back “to the beginning” can be useful. But it has a price.
With proper architecture management, this price is reduced - and the process of system development itself becomes more controllable and transparent. A simple example: if the principle of modularity is observed, it is possible to rewrite a separate module without affecting external interfaces. And this cannot be done with a monolithic structure.

The antifragility of a system is determined by its architecture. And it is this property that makes the system adaptive.
When we talk about an adaptive architecture, we mean that the system is able to adapt to changes, not at all that we are constantly changing the architecture itself. On the contrary, the more stable and resilient the architecture, and the fewer requirements that force its revision, the more adaptive the system.

Decisions that require a revision of the entire architecture come at a much higher price, and very good reasons are needed to take them. Such a reason might be, for example, a requirement that cannot be implemented within the existing architecture. In that case we say an architecture-affecting requirement has appeared.
Thus, we also need to know our “antifragility limits”. Architecture is not developed "in a vacuum" - it is based on current requirements and expectations. And if the situation fundamentally changes - we must understand that we have gone beyond the current architecture - and we need to revise it, develop a different solution - and think over the transition paths.
For example, we may have assumed that the warehouse will always only need end-of-day data, and that we will collect data daily through the systems' standard interfaces (via a set of views). Then requests come from the risk management department about the need to receive data not at the end of the day, but at the moment a lending decision is made. There is no point trying to “stretch the unstretchable” - we simply need to acknowledge this fact, the sooner the better, and start working on an approach that will let us solve the problem.
There is a very fine line here - if we only consider "requirements in the moment" and do not look a few steps ahead (and several years ahead), then we increase the risk of encountering an architecture-affecting requirement too late - and the cost of our change will be very high. Looking a little ahead - within the boundaries of our horizon - has never harmed anyone.

The system from the “warehouse fairy tale” is precisely an example of a very shaky system built on fragile design approaches. And when that is the case, destruction comes rather quickly for this particular class of systems.
Why can I say so? The topic of storage is not new. The approaches and engineering practices that have been developed during this time have been aimed at precisely this - maintaining the viability of the system.
To take a simple example: one of the most common reasons warehouse projects fail to take off is an attempt to build the warehouse on top of source systems that are still under development, without agreed integration interfaces - trying to pull data straight from the tables. As a result, the sources went into further development, the source databases changed in the meantime, and the load streams in the warehouse became inoperable. It is too late to redo anything, and if you have not protected yourself by building several layers of tables inside the warehouse, you can throw everything away and start over. This is just one example, and one of the simplest.

Taleb's criterion for fragile and antifragile is simple. The chief judge is time. If the system stands the test of time and shows its "survivability" and "indestructibility" - it has the property of antifragility.
If, when designing a system, we take into account antifragility as a requirement, then this will encourage us to use such approaches to building its architecture that will make the system more adaptive to both “chaos from the outside” and “chaos from the inside”. And ultimately the system will have a longer lifespan.
None of us wants to build “throwaways”. And we should not delude ourselves that there is no other way today. Looking a few steps ahead is normal for a person at any time, especially in times of crisis.

What is a data warehouse and why do we build it

An article on warehouse architecture assumes that the reader not only knows what a data warehouse is, but also has some experience with such systems. Nevertheless, I felt it necessary to go back to the origins, to the beginning of the path, because that is where the “fulcrum” of development lies.

How did people come to the conclusion that data warehouses are needed? And how are they different from just "a very large database"?
A long time ago, when there were simply “business data processing systems” in the world, there was no division of IT systems into classes such as front-end OLTP systems, back-office DSS, text processing systems, data warehouses, and so on.
This was the time when Michael Stonebraker created Ingres, one of the first relational DBMSs.
And it was the time when the era of personal computers burst into the computer industry and forever changed the ideas of the IT community of that day.

Then it was easy to find enterprise applications written on the basis of desktop-class DBMS - such as Clipper, dBase and FoxPro. And the market for client-server applications and DBMS was only gaining momentum. One after another, database servers appeared that would occupy their niche in the IT space for a long time - Oracle, DB2, etc.
And the term "database application" was circulated. What did such an application include? Simplified - some input forms through which users could simultaneously enter information, some calculations that were launched “on a button” or “on a schedule”, as well as some reports that could be seen on the screen or saved as files and sent to seal.
"Nothing special - just a simple application, just a database," one of my early mentors remarked. "Is it nothing special?" - I thought then.

If you look closely, there are features after all. As the number of users grows and the volume of incoming information increases, the load on the system rises, and its developer-designers resort to certain “tricks” in order to keep performance at an acceptable level. The first of them is splitting the monolithic “business data processing system” into an accounting application that supports users working on-line, and a separate application for batch data processing and reporting. Each of these applications has its own database and is even hosted on a separate database server instance, with settings tuned to its different load profile - OLTP and DSS. And data flows are built between them.

Is that all? It would seem the problem is solved. What happens next?
And then companies grow, their information needs multiply. The number of interactions with the outside world is also growing. And as a result, there is not one big application that fully automates all processes, but several different ones, from different manufacturers. The number of information generating systems - data source systems in the company is increasing. And sooner or later, there will be a need to see and compare information received from different systems. This is how Data Warehousing, a new class of systems, appears in the company.
The generally accepted definition of this class of systems is as follows.

A data warehouse (Data Warehouse) is a subject-oriented information database, specially designed and prepared for producing reports and business analysis in order to support decision-making in an organization.
Thus, the consolidation of data from different systems - the ability to look at it in a certain “single” (unified) way - is one of the key properties of systems of the data warehouse class. This is the reason warehouses emerged in the course of the evolution of IT systems.

Key Features of Data Warehouses

Let's take a look in more detail. What are the key features of these systems? What makes data warehouses different from other enterprise IT systems?

First, these are large volumes. Very large. VLDB is what the leading vendors call such systems when they give recommendations on using their products. Data from all of the company's systems flows into this big database and is stored there “forever and unchanged”, as the textbooks say (in practice, life is more complicated).

Secondly, it is historical data - data warehouses are sometimes called “corporate memory”. Working with time in a warehouse is quite interesting. In accounting systems the data is current at the present moment: the user performs an operation and the data is updated, while the history of changes may or may not be preserved, depending on the accounting practice. Take a bank account balance, for example. We may be interested in the current balance “now”, at the end of the day, or at the moment of some event (for example, at the moment the score is calculated). If the first two are solved quite simply, the latter will most likely require special effort. When working with the warehouse, the user can access past periods, compare them with the current one, and so on. It is these time-related capabilities - obtaining the state of the data at various points on the time axis, to a certain depth into the past - that significantly distinguish data warehouses from accounting systems.

Thirdly, it is consolidation and unification of data. To make joint analysis possible, the data must be brought to a common view - a unified data model - and the facts must be mapped to unified reference data. There can be several aspects and difficulties here, first of all conceptual ones: different people from different departments may understand different things by the same term, and conversely may call the same thing by different names. How do we ensure a “single view” while preserving the specifics of how a particular group of users sees the data?

Fourthly, it is work with data quality. In the process of loading data into the warehouse, the data is cleaned and common transformations and conversions are performed. Common transformations must be done in one place and then used to build various reports. This avoids the discrepancies that so irritate business users - especially management, who are presented with figures from different departments that do not agree with one another. Poor data quality gives rise to errors and discrepancies in reports, and the consequence is a decline in user trust in the entire system and in the analytical service as a whole.

Architectural concept

Everyone who has dealt with a warehouse has most likely observed some kind of “layered structure”, because it is this architectural paradigm that has taken root for systems of this class. And not by chance. Warehouse layers can be viewed as separate components of the system, each with its own tasks, area of responsibility and “rules of the game”.
A layered architecture is a means of coping with the complexity of the system: each subsequent layer is abstracted from the complexities of the internal implementation of the previous one. This approach makes it possible to identify tasks of the same type and solve them in a uniform way, without reinventing the wheel from scratch every time.
A schematic conceptual architectural diagram is shown in the figure. This is a simplified diagram that reflects only the key idea - the concept, but without the "anatomical details" that will arise with a deeper study of the details.

As shown in the diagram, the following layers are distinguished conceptually: three main layers, each containing a data storage area (indicated by a filled rectangle) and data loading software (conventionally shown by arrows of the same color), plus an auxiliary service layer which nevertheless plays a very important connecting role - managing data loading and quality control.

Primary Data Layer - the primary data layer (also called staging or the operational layer) - is intended for loading data from source systems and storing the primary information without transformations, in its original quality and with support for a complete history of changes.
The task of this layer is to abstract the subsequent warehouse layers from the physical organization of the data sources, from the data collection methods, and from the methods of extracting delta changes.

Core Data Layer - the warehouse core - is the central component of the system, which distinguishes the warehouse from a mere “batch integration platform” or a “big data dump”: its main role is consolidating data from different sources and reducing it to uniform structures and keys. It is during loading into the core that the main work on data quality and the common transformations, which can be quite complex, is carried out.
The task of this layer is to abstract its consumers from the peculiarities of the logical structure of the data sources and from the need to match data across different systems, and to ensure data integrity and quality.

Data Mart Layer - analytical data marts - is the component whose main function is to convert data into structures convenient for analysis (if BI works with the marts, this is usually a dimensional model) or into the structures required by a consumer system.
As a rule, data marts take data from the core - as a reliable, verified source - i.e. they use the service of this component to bring the data to a single form. We will call such marts regular. In some cases marts can take data directly from staging, operating on primary data (in source keys). This approach is typically used for local tasks where consolidation of data from different systems is not required and where speed matters more than data quality. Such marts are called operational. Some analytical indicators have very complex calculation methods, so for such non-trivial calculations and transformations so-called secondary marts are created.
The task of the data mart layer is to prepare data according to the requirements of a specific consumer - a BI platform, a group of users, or an external system.
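As an illustration only, here is a minimal sketch of a star-schema “regular” mart built over core data. All table and column names (dim_date, dim_product, fact_sales) are hypothetical, and SQLite is used purely to keep the example self-contained.

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date (
    date_id     INTEGER PRIMARY KEY,
    calendar_dt TEXT,
    month       TEXT,
    year        INTEGER
);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    quantity    INTEGER,
    amount      REAL
);
""")

# A typical BI-style query: aggregate a measure, joining the fact table
# only to its dimensions (facts are never joined to other facts directly).
cur.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
""")
print(cur.fetchall())
```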

Each of the layers described above consists of a permanent data storage area and a software module for loading and transforming data. This division into layers and areas is logical; the physical implementation of these components can vary - you can even use different platforms for storing or transforming data on different layers, if that is more efficient.
The storage areas contain technical (buffer) tables used in the course of data transformation, and target tables that are accessed by the consuming component. It is good practice to “cover” the target tables with views: this simplifies subsequent maintenance and evolution of the system. The data in the target tables of all three layers is marked with special technical fields (meta-attributes), which support the data loading processes and make an informational audit of data flows in the warehouse possible.
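The exact set of meta-attributes is a project convention; as a purely illustrative sketch (all field names are made up), it might look like this:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MetaAttributes:
    """Technical fields stamped on every row of a target table (names are hypothetical)."""
    load_process_id: int      # ID of the load process that wrote / last updated the row
    loaded_at: datetime       # when the row was written
    source_system: str        # which source the record originally came from
    is_deleted: bool = False  # logical deletion flag

def stamp(row: dict, process_id: int, source: str) -> dict:
    """Attach meta-attributes to a business record before it is persisted."""
    meta = MetaAttributes(load_process_id=process_id,
                          loaded_at=datetime.now(timezone.utc),
                          source_system=source)
    return {**row, **meta.__dict__}

print(stamp({"account_no": "40817000001", "balance": 100.0},
            process_id=42, source="core_banking"))
```

The same marking convention is what later makes unified delta extraction and data-flow audit possible.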

A special component (or set of components) is also singled out to provide service functions for all layers. One of its key tasks - the control function - is to provide “single rules of the game” for the system as a whole, while leaving the freedom to choose different implementation options for each of the layers described above, including different technologies for loading and processing data, different storage platforms, and so on. Let's call it the service layer (Service Layer). It contains no business data, but it has its own storage structures: a metadata area and an area for working with data quality (and possibly other structures, depending on the functions assigned to it).

Such a clear division of the system into separate components significantly increases the controllability of the system development:

  • the complexity of the task assigned to the developer of a particular component's functionality is reduced (he does not have to simultaneously solve integration with external systems, think through data cleaning procedures, and think about the optimal presentation of data to consumers) - the task is easier to decompose, estimate, and deliver in small increments;
  • you can involve various performers (and even teams or contractors) in the work - because this approach allows you to effectively parallelize tasks, reducing their mutual influence on each other;
  • the presence of persistent staging allows you to quickly connect data sources without designing the entire core or showcases for the entire subject area, and then gradually build up the remaining layers according to priorities (moreover, the data will already be in the repository - available to system analysts, which will greatly facilitate the tasks of subsequent development of the repository);
  • the presence of the core allows all work with data quality (as well as possible misses and errors) to be hidden from storefronts and from the end user, and most importantly, using this component as a single data source for storefronts, you can avoid problems with data convergence due to the implementation of common algorithms in one place;
  • separating out the data marts makes it possible to account for the differences and specifics in how users of different departments understand the data, while designing them for BI requirements allows not only aggregated figures to be published but also data reliability to be ensured by providing drill-down to the underlying detail;
  • the presence of the service layer allows you to perform end-to-end data analysis (data lineage), use unified data audit tools, common approaches to highlighting the delta of changes, working with data quality, load management, error monitoring and diagnostic tools, and speeds up problem resolution.
This approach to decomposition also makes the system more resistant to change (compared to a "monolithic structure") - ensures its antifragility:
  • changes coming from the source systems are handled in staging - in the core, only the flows that depend on the affected staging tables are modified, and the effect on the marts is minimal or absent;
  • changes in consumer requirements are handled mostly in the marts (unless they require additional information that is not yet in the warehouse).
Next, we will go through each of the above components and look at them in a little more detail.

System core

Let's start "from the middle" - the core of the system or the middle layer. Not labeled as Core Layer. The core performs the role of data consolidation - reduction to single structures, directories, keys. Here the main work with data quality is carried out - cleaning, transformation, unification.

The presence of this component makes it possible to reuse the data flows that transform primary data received from source systems into a single format, following common rules and algorithms, instead of re-implementing the same functionality separately for each application mart - which, besides wasting resources, can also lead to discrepancies in the data.
The warehouse core is implemented in a data model that, in the general case, differs both from the models of the source systems and from the formats and structures of the consumers.

Warehouse core model and the enterprise data model

The main task of the middle storage layer is stability. That is why the main focus here is on the data model. It is commonly referred to as the "enterprise data model". Unfortunately, a certain halo of myths and absurdities has developed around it, which sometimes lead to the abandonment of its construction altogether, but in vain.

Myth 1. An enterprise data model is a huge model consisting of thousands of entities (tables).
Actually. In any subject area, in any business domain, in the data of any company, even the most complex, there are few basic entities - 20-30.

Myth 2. There is no need to develop any "own model" - we buy an industry reference model - and do everything according to it. We spend money - but we get a guaranteed result.
Actually. Reference models can indeed be very useful, because they contain industry experience in modeling the given domain. You can draw ideas, approaches and naming practices from them, and check the “depth of coverage” of the area so as not to miss something important. But we are unlikely to be able to use such a model “out of the box”, as is. This is the same myth as, say, buying an ERP system (or CRM) and implementing it without any “tailoring for yourself”. The value of such models is born in adapting them to the realities of this particular business, this particular company.

Myth 3. Development of the storage core model can take many months, during which time the project will actually be frozen. In addition, it requires an insane amount of meetings and the participation of many people.
Actually. The warehouse model can be developed iteratively, piece by piece, together with the warehouse itself. For areas not yet covered, “extension points” or “stubs” are put in place - i.e. certain “universal constructions” are applied. At the same time you need to know when to stop, so that you do not end up with a super-universal thing of four tables into which it is hard to “put data” and (even harder) to get it back out - and which is extremely suboptimal in terms of performance.

Developing the model will take time. But this is not time spent “drawing entities” - it is the time needed to analyze the subject area and understand how the data is structured. That is why analysts are very closely involved in this process, and various business experts are brought in as well - selectively, not by organizing meetings with an insane number of people or mailing out huge questionnaires.
Quality business and system analysis is the key to building a storage core model. You need to understand a lot of things: where (in what systems) data is generated, how it is arranged, in what business processes it circulates, etc. Qualitative analysis has never harmed any system. Rather, on the contrary, problems arise from “blank spots” in our understanding.

Developing a data model is not a process of inventing and coming up with something new. In fact, the data model in the company already exists. And the process of its design is more like "excavations". The model is gently and carefully brought to light from the "ground" of corporate data and clothed in a structured form.

Myth 4. In our company, the business is so dynamic, and everything is changing so quickly that it is useless for us to make a model - it will become outdated before we put this part of the system into operation.
Actually. Recall that the key factor in the core is stability. And above all, the topology of the model. Why? Because it is this component that is central and affects everything else. Stability is also a requirement for the kernel model. If the model becomes obsolete too quickly, then it is incorrectly designed. For its development, the wrong approaches and “rules of the game” were chosen. It is also a question of qualitative analysis. The key entities of the corporate model change extremely rarely.
But if, for a company that sells, say, confectionery, we decide to create “Sweets”, “Cakes” and “Pies” instead of a single “Products” reference entity, then when pizza appears in the product list we will indeed have to add a lot of new tables. And that is purely a matter of approach.

Myth 5. Creating a corporate model is a very serious, complex and responsible business. And it's scary to make a mistake.
Actually. The core model, although it should be stable, is still not “cast in metal”. Like any other design decision, its structure can be reviewed and modified - you just must not forget about that required quality. But this does not mean at all that you “cannot breathe” on it, nor that temporary solutions and “stubs”, planned for later rework, are unacceptable.

Myth 6. If we have a data source - for example, a reference data (NSI) system or a master data management (MDM) system - then, ideally, it should correspond to the corporate model (especially if it was designed recently and has not had time to acquire “side effects”, “traditions” and temporary hacks). Does it turn out that in this case we do not need a core model?
Actually. Yes, in this case building the warehouse core model is much easier, because we follow a ready-made top-level conceptual model. But the core model is not eliminated at all. Why? Because when building the model of a specific system, its own rules apply: what types of tables to use (for each entity), how to version data, with what granularity to keep history, what meta-attributes (technical fields) to use, and so on.

In addition, no matter how wonderful and comprehensive our NSI and MDM system is, there will as a rule be nuances associated with the existence of local directories “about the same thing” in other accounting systems. And this problem, whether we like it or not, will have to be solved in the warehouse, since that is where reporting and analytics are assembled.

Primary data layer (or historizable staging or operational layer)

In the diagram it is designated as the Primary Data Layer. The role of this component is integration with source systems, loading and storing primary data, as well as preliminary data cleaning - checking compliance with the format and logical control rules fixed in the “interface interaction agreement” with the source.
In addition, this component solves a very important task for the warehouse - extracting the “true change delta” - regardless of whether and how the source allows data changes to be tracked (by what criterion they can be “caught”). Once the data has landed in staging, the question of delta extraction is already settled for all the other layers, thanks to the marking with meta-attributes.
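For example (this is only one possible technique, not one prescribed by the article), when a source can only re-send full snapshots, the true delta can be caught by hashing each record's attributes and comparing it with the hash of the last loaded version; all names here are hypothetical:

```python
import hashlib
import json

def row_hash(row: dict, business_key: str) -> str:
    """Hash of all attributes except the business key - used to detect real changes."""
    payload = {k: v for k, v in sorted(row.items()) if k != business_key}
    return hashlib.sha256(json.dumps(payload, default=str).encode()).hexdigest()

def extract_delta(incoming: list, known_hashes: dict, key: str) -> list:
    """Return only the rows that are new or actually changed since the last load."""
    delta = []
    for row in incoming:
        h = row_hash(row, key)
        if known_hashes.get(row[key]) != h:
            delta.append(row)
            known_hashes[row[key]] = h
    return delta

known = {}
snapshot_1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
extract_delta(snapshot_1, known, "id")         # first load: both rows are "new"
snapshot_2 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Robert"}]
print(extract_delta(snapshot_2, known, "id"))  # only id=2 has really changed
```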

The data in this layer is stored in structures that are as close as possible to the source system - in order to keep the primary data as close as possible to their original form. Another name for this component is "operational layer".
Why not simply use the established term “staging”? The point is that earlier, before the “era of big data and VLDB”, disk space was very expensive, and primary data, if stored at all, was kept only for a limited period of time; the name “staging” therefore often refers to a cleanable buffer.
Now technology has moved forward, and we can afford not only to store all the primary data, but to historize it with whatever granularity is possible. This does not mean that we should not control the growth of data, and it does not remove the need to manage the information lifecycle, optimizing the cost of storage depending on the “temperature” of use - i.e. moving “cold”, less-demanded data to cheaper media and storage platforms.

What gives us the presence of "historic staging":

  • the right to make mistakes (in structures, in transformation algorithms, in the granularity of history keeping) - with fully historized primary data within reach of the warehouse, we can always reload our tables;
  • the opportunity to think - we can take our time developing a large fragment of the core in this iteration of the warehouse, because the data will in any case be in our staging, with a common time horizon (a single “starting point of history”);
  • the possibility of analysis - we keep even data that no longer exists in the source (it may have been overwritten there or moved to an archive), so it remains available for analysis with us;
  • the possibility of an information audit - thanks to the most detailed primary information, we can later figure out how the load worked and why we ended up with these particular numbers (for this we also need the meta-attribute marking and the corresponding metadata the load runs on - this is handled in the service layer).
What difficulties may arise in the construction of "historic staging":
  • it would be convenient to set requirements for the transactional integrity of this layer, but practice shows that this is difficult to achieve (this means that in this area we do not guarantee the referential integrity of parent and child tables) - integrity alignment occurs on subsequent layers;
  • this layer contains very large volumes (the largest in storage - despite all the redundancy of analytical structures) - and you need to be able to handle such volumes - both in terms of loading and in terms of queries (otherwise, you can seriously degrade the performance of the entire storage).
What else can be said about this layer.
Firstly, if we move away from the paradigm of “end-to-end loading processes”, then the rule “the caravan moves at the speed of the slowest camel” no longer applies to us; or rather, we abandon the “caravan” principle and switch to the “conveyor” principle: took data from the source - put it into our layer - ready to take the next portion. This means that
1) we do not wait for processing to happen on other layers;
2) we do not depend on the schedules by which other systems provide their data.
Simply put, we schedule a load process that takes data from one source through its specific connection method, validates it, extracts the delta, and writes the data into the staging target tables. And that's it.
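A minimal sketch of such an independent, per-source “conveyor” job (the function names and call shape are assumptions for illustration, not an API from the article):

```python
from datetime import datetime, timezone

def load_source(source_name, extract, validate, compute_delta, write_staging):
    """One conveyor step: source -> staging, and nothing more.

    The job does not wait for other sources and does not trigger core/mart
    processing; downstream layers pick up what is ready on their own schedule.
    """
    raw = extract(source_name)           # pull data over the agreed interface
    good, rejected = validate(raw)       # format / logical control checks
    delta = compute_delta(good)          # keep only real changes
    write_staging(source_name, delta)    # persist into staging target tables
    return {"source": source_name,
            "finished_at": datetime.now(timezone.utc),
            "accepted": len(delta),
            "rejected": len(rejected)}

# Trivial stand-ins, just to show the call shape.
print(load_source("crm",
                  extract=lambda s: [{"id": 1}],
                  validate=lambda rows: (rows, []),
                  compute_delta=lambda rows: rows,
                  write_staging=lambda s, rows: None))
```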

Secondly, these processes, apparently, are arranged very simply - one might say trivially, from the point of view of logic. And this means that they can be very well optimized and parameterized, reducing the load on our system and speeding up the process of connecting sources (development time).
For this to happen, you need to know very well the technological features of the platform on which this component works - and then you can make a very effective tool.

Analytical data mart layer

The data mart layer (Data Mart Layer) is responsible for preparing and delivering data to end consumers - people or systems. At this level the consumer's requirements are taken into account as much as possible, both logical (conceptual) and physical. The service should provide exactly what is needed: no more, no less.

If the consumer is an external system, then, as a rule, it dictates the data structures it needs and the rules for collecting the information. A good approach is one in which the consumer is responsible for collecting the data correctly: the warehouse prepares and populates the mart and provides the possibility of incremental collection (marking with meta-attributes for the subsequent selection of delta changes), and the consuming system then manages, and is responsible for, how it uses this mart. There are peculiarities, though: when the consuming system has no active data collection component, either an external component is needed to perform the integrating function, or the warehouse acts as an “integration platform” and ensures the correct incremental delivery of data further on, outside the warehouse. Many nuances surface here, and the rules of interface interaction must be thought through and understood by both sides (as always with integration). As a rule, routine cleaning/archiving is applied to such marts - it is rarely necessary to store this “transit data” for long.

Of greatest importance from the point of view of analytical tasks are marts “for people” - more precisely, for the BI tools they work with.
However, there is a category of “especially advanced users” - analysts and data scientists - who need neither BI tools nor routine processes for populating external specialized systems. They need some kind of “common mart” and “their own sandbox” where they can create tables and transformations at their discretion. In this case the warehouse's responsibility is to ensure that these common marts are populated according to the regulations.
Separately we can single out consumers such as data mining tools - deep data analysis. These tools have their own data preparation requirements and are also used by data scientists. For the warehouse, the task again reduces to supporting a service that loads certain marts of an agreed format.

However, let's get back to analytical marts. It is they that are of interest to warehouse designers in this data layer.
In my opinion, the best time-tested approach to designing data marts, the one almost all BI platforms are now “tuned” for, is Ralph Kimball's approach, known as dimensional modeling. There are a great many publications on the subject: the basic rules can be found, for example, in the publications of Margy Ross, and of course one can recommend the books of the gurus of dimensional modeling themselves. Another helpful resource is Kimball's Tips.
The multidimensional approach to building marts has been described and worked out so well - both by the method's evangelists and by the leading software vendors - that there is no point dwelling on it here in any detail; the original sources are always preferable.

I would like to emphasize only one point. “Reporting and analytics” comes in different kinds. There is “heavy reporting”: pre-ordered reports generated as files and delivered to users via the provided delivery channels. And there are information panels - BI dashboards - which are essentially web applications, and the response time requirements for them are the same as for any other web application. This means that the normal refresh time for a BI panel is seconds, not minutes, and it is important to keep this in mind when designing the solution. How do we achieve this? The standard optimization method: we look at what the response time consists of and what we can influence. Where is most of the time spent? On physical (disk) reads in the database and on transferring data over the network. How do we reduce the volume of data read and transferred per request? The answer is obvious and simple: either aggregate the data, or apply a filter to the large fact tables participating in the query, and exclude joins between large tables (fact tables should be referenced only through dimensions).
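For illustration, a tiny sketch of the “aggregate in advance” half of that advice: roll a detailed fact table up to the grain the dashboard actually shows, so the panel reads a small table instead of scanning the detail (the names and data are made up):

```python
from collections import defaultdict

def build_aggregate(fact_rows, dims, measure):
    """Pre-aggregate detailed fact rows to the grain a dashboard really needs."""
    totals = defaultdict(float)
    for row in fact_rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return [dict(zip(dims, key), **{measure: value}) for key, value in totals.items()]

facts = [
    {"month": "2024-01", "region": "North", "amount": 100.0},
    {"month": "2024-01", "region": "North", "amount": 50.0},
    {"month": "2024-01", "region": "South", "amount": 70.0},
]
# The dashboard queries this small aggregate instead of the detailed facts.
print(build_aggregate(facts, dims=("month", "region"), measure="amount"))
```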

What is BI for? Why is it convenient? Why is the multidimensional model effective?
BI allows the user to run so-called ad hoc queries. What does that mean? It means we do not know the exact query in advance, but we do know which indicators the user may request, and in which dimensions. The user forms such a query by choosing the appropriate BI filters. The task of the BI developer and the mart designer is to provide application logic such that the data is either filtered or aggregated, avoiding the situation where too much data is requested and the application “hangs”. Usually users start with aggregated figures and then drill down into more detailed data, setting the necessary filters along the way.

It is not always enough to simply build the “right star” to get a structure convenient for BI. Sometimes you need to apply denormalization somewhere (while keeping an eye on how it affects the load), somewhere to build secondary marts and aggregates, and somewhere to add indexes or projections (depending on the DBMS).

Thus, through “trial and error”, you can get a structure that is optimal for BI - which will take into account the features of both the DBMS and the BI platform, as well as the user's requirements for data presentation.
If we take the data from the “core”, then all this shaping of the marts is local in nature and in no way affects the complex processing of the primary data received directly from the source systems - we only “reshape” the data into a form convenient for BI. And we can afford to do this many times, in different ways, for different requirements. It is much easier and faster to do this on top of the core data than to assemble everything from the “primary” data (whose structure and rules, as we know, can also “drift”).

Service layer

The service layer (Service Layer) is responsible for implementing the common (service) functions that can be used to process data in the various warehouse layers: load management, data quality management, problem diagnosis and monitoring tools, and so on.
The presence of this level gives the data flows in the warehouse transparency and structure.

This layer includes two data storage areas:

  • metadata area - used for the data loading control mechanism;
  • data quality area - to implement off-line data quality checks (i.e. those that are not built directly into ETL processes).
The load management process can be built in different ways. One possible approach is this: we split the entire set of warehouse tables into modules; a module may include only tables of one layer. The tables included in each module are loaded by a separate process - let's call it a control process. Each control process is started on its own schedule. The control process orchestrates calls to atomic processes, each of which loads one target table, and also contains some common steps.
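A minimal sketch of what such a control process might look like (the module name, functions and logging style are assumptions, purely for illustration):

```python
def control_process(module_name, atomic_loads, log):
    """Orchestrate the atomic processes of one module (tables of a single layer).

    Each atomic load fills exactly one target table; the control process only
    sequences them, records the outcome and stops on a critical error.
    """
    log(f"module {module_name}: start")
    for load in atomic_loads:
        try:
            rows = load()                     # load one target table
            log(f"  {load.__name__}: ok, {rows} rows")
        except Exception as exc:              # critical error: stop the whole module
            log(f"  {load.__name__}: FAILED ({exc}), stopping module")
            raise
    log(f"module {module_name}: finished")

def load_dim_product():
    return 120

def load_fact_sales():
    return 43000

control_process("core_sales", [load_dim_product, load_fact_sales], log=print)
```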
It is obviously easy to divide the staging tables into modules - by source system, or rather by their connection points. For the core this is harder to do, because there we need to ensure data integrity, which means dependencies have to be taken into account; i.e. conflicts will arise that need to be resolved, and there are different ways of resolving them.

An important point in load management is working out a unified approach to error handling. Errors are classified by level of criticality. When a critical error occurs the process must stop, and as soon as possible, because its occurrence signals a significant problem that can lead to data corruption in the warehouse. Thus load management is not only about starting processes, but also about stopping them and preventing untimely (mistaken) starts.

A special metadata structure is created for the service layer to work with. This area stores information about the loading processes, the loaded data sets, the checkpoints used to maintain increments (which process has read up to which point), and other service information the system needs in order to function.
It is important to note that all target tables in all layers are marked with a special set of meta-fields, one of which is the ID of the process that updated the given row. For tables inside the warehouse, this process marking provides a unified way of subsequently extracting delta changes. When loading data into the primary data layer the situation is more complicated: the delta extraction algorithm can differ for different loaded objects. On the other hand, the logic of processing the accepted changes and applying them to the target tables of the core and the marts is much more complex than for staging, where everything is quite trivial and easy to parameterize, with reusable typical steps (procedures).
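As a sketch of how the per-row process ID and a checkpoint might be combined to read increments (the table layout and names are hypothetical, and the sketch assumes process IDs grow monotonically):

```python
def read_increment(rows, checkpoints, consumer, table):
    """Return the rows written after the consumer's last checkpoint for this table."""
    last_seen = checkpoints.get((consumer, table), 0)
    increment = [r for r in rows if r["load_process_id"] > last_seen]
    if increment:
        # Remember the highest process ID this consumer has now seen.
        checkpoints[(consumer, table)] = max(r["load_process_id"] for r in increment)
    return increment

core_customer = [
    {"load_process_id": 101, "customer_id": 1, "segment": "retail"},
    {"load_process_id": 102, "customer_id": 2, "segment": "sme"},
]
checkpoints = {("mart_loader", "core_customer"): 101}
print(read_increment(core_customer, checkpoints, "mart_loader", "core_customer"))
# -> only the row written by process 102
```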

I do not aim to cover this topic - the organization of loading - in full here; I only highlight the points worth paying attention to.
The approach described above is just one of the options. It is quite adaptive, and its “conceptual prototype” was the Toyota conveyor and the “just-in-time” system. That is, we move away from the widespread paradigm of exclusively “nightly data loading” and load in small portions during the day, as the data becomes ready in the various sources: whatever has arrived gets loaded. Many parallel processes run at the same time, and the “hot tail” of fresh data will constantly “flicker” and even out after a while. We have to take this feature into account and, where necessary, build separate mart “slices” in which everything is already consistent. In other words, it is impossible to achieve both timeliness and consistency (integrity) at the same time; we need a balance - in one place one matters more, in another the other.

It is extremely important to provide means of logging and monitoring. A good practice is to use typed events, for which you can set different parameters and build a notification system - a subscription to certain events. It is very important that, whenever the intervention of a system administrator is required, he learns about it as early as possible and receives all the necessary diagnostic information. Logs can also be used for after-the-fact analysis of problems and for investigating incidents of system malfunction, including data quality issues.
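A minimal sketch of typed events with subscriptions, assuming a simple in-process publish/subscribe mechanism; the event kinds, parameters and the print-based "notification channel" are illustrative only.

```python
"""Sketch of typed events with subscriptions: whoever needs to react to a
certain kind of event (e.g. a system administrator) subscribes to it."""
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, DefaultDict, List


@dataclass
class Event:
    kind: str                            # e.g. 'load_failed', 'quality_check_failed'
    params: dict = field(default_factory=dict)


_subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)


def subscribe(kind: str, handler: Callable[[Event], None]) -> None:
    _subscribers[kind].append(handler)


def publish(event: Event) -> None:
    for handler in _subscribers[event.kind]:
        handler(event)                   # e.g. send a notification with diagnostics


# Usage: notify the administrator about failed loads only.
subscribe("load_failed", lambda e: print("notify admin:", e.params))
publish(Event("load_failed", {"module": "core.account", "run_id": 1042}))
```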

Designing and maintaining warehouse data models

Why is it important to pay attention to data model design when developing any system that involves a database (and especially a warehouse)? Why not just throw together a set of tables anywhere, even in a text editor? Why do we need these pictures?
Oddly enough, even experienced developers raise such questions.
Actually, nothing prevents you from sketching out some tables and starting to use them - as long as the developer keeps a coherent overall picture of the structure he is building in his head (!). But what if there are several developers? What if someone else will use these tables? What if time passes, a person leaves this area and then returns to it later?

Is it possible to figure things out without a model? In principle, yes: you can work it out, sketch pictures on a piece of paper, and dig through the data by hand. But it is much easier, clearer and faster to use a ready-made artifact - a data model - and to understand the "logic of its structure", i.e. to have common rules of the game.

And the most important thing is not even that. Most importantly, when designing a model we are forced (with no way around it!) to study the subject area more closely and deeply: the peculiarities of the data structures and the way they are used in various business cases. The questions that we would easily "push aside" as complex or "blurry" if we simply threw our tables together without trying to design a model, we are forced to raise and resolve now, during analysis and design - not later, when we are building reports, trying to "combine the incompatible" and "reinventing the wheel" every time.

This approach is one of the engineering practices that make it possible to create antifragile systems. Such systems are clearly organized and transparent, they are easy to evolve, their "boundaries of fragility" are immediately visible, and when new requirements appear it is possible to assess the "scale of the disaster" and the time needed for a redesign (if one is needed) more accurately.
Thus, the data model is one of the main artifacts that must be maintained throughout the development of the system. Ideally, it should be "on the table" of every analyst, developer, and so on - everyone involved in the system development projects.

Designing data models is a separate, very extensive topic. There are two main approaches to warehouse design.
For the core, the "entity-relationship" approach works well: a normalized (3NF) model is built based on a study of the subject area, or more precisely of its selected domain. This is where the same "corporate model" discussed above comes into play.

When designing analytical data marts, the multidimensional model is a good fit. It lends itself well to the understanding of business users, because it is simple and convenient for human perception: people operate with the familiar concepts of metrics (indicators) and the dimensions ("cuts") by which they are analyzed. This makes it possible to build the requirements-gathering process simply and clearly: we draw a set of "matrices of cuts and indicators" while talking to representatives of the various departments. Then we bring it all into one structure - the "analysis model": we form the "dimension bus" and determine the facts defined on it. Along the way we work out hierarchies and aggregation rules.
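To illustrate the "matrix of cuts and indicators" and the dimension bus, here is a small Python sketch; the dimensions and indicators are invented for the example.

```python
"""Sketch of a "dimension bus": rows are indicators (facts), columns are the
conformed dimensions (cuts) by which they are analyzed."""
dimensions = ["Date", "Customer", "Product", "Branch"]

fact_dimension_matrix = {
    "Sales amount":  {"Date", "Customer", "Product", "Branch"},
    "Stock balance": {"Date", "Product", "Branch"},
    "Payments":      {"Date", "Customer", "Branch"},
}

# Print the "matrix of cuts and indicators" collected from the departments.
print("indicator".ljust(15) + "".join(d.ljust(10) for d in dimensions))
for fact, dims in fact_dimension_matrix.items():
    print(fact.ljust(15) + "".join(("+" if d in dims else "-").ljust(10) for d in dimensions))
```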

From there it is very easy to move to a physical model, adding optimization elements that take the features of the DBMS into account. For Oracle, for example, that would be partitioning, a set of indexes, and so on; for Vertica, other techniques will be used - sorting, segmentation, partitioning.
Special denormalization may also be required: we deliberately introduce redundancy into the data to improve query performance, at the cost of complicating data updates (because the redundancy has to be accounted for and maintained during loading). To improve performance we may also need to create additional aggregate tables, or use additional DBMS features such as projections in Vertica.

So, when modeling warehouse data, we actually solve several problems:

  • the task of building a conceptual (logical) model of the core - system and business analysis: studying the subject area, going into detail, and taking into account the nuances of "live" data and how it is used in the business;
  • the task of building an analysis model, and then a conceptual (logical) model of the data marts;
  • the task of building physical models - managing data redundancy and optimizing for queries and data loading with the peculiarities of the DBMS in mind.
When developing conceptual models, we do not have to take into account the features of the particular DBMS for which we are designing the database structure. Moreover, we can use one conceptual model to create several physical ones - for different DBMSs.
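As a sketch of how one conceptual entity can map to several physical models, here is an illustrative structure with DBMS-specific options; the table names and settings are assumptions, not recommendations.

```python
"""Sketch: one conceptual entity mapped to several physical models, each with
DBMS-specific optimization options."""
conceptual_entity = "Sales fact"

physical_models = {
    "Oracle": {
        "table": "f_sales",
        "partition_by": "sale_date (range, monthly)",
        "indexes": ["customer_id", "product_id"],
    },
    "Vertica": {
        "table": "f_sales",
        "order_by": ["sale_date", "customer_id"],   # sort order of the projection
        "segmented_by": "hash(customer_id)",        # data distribution across nodes
    },
}

for dbms, options in physical_models.items():
    print(f"{conceptual_entity} on {dbms}: {options}")
```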

Let's summarize.

  • A data model is not a set of "pretty pictures", and designing it is not a drawing exercise. The model reflects our understanding of the subject area, and compiling it is a process of studying and researching that area. That is where the time goes - not on "drawing and coloring".
  • A data model is a design artifact, a way to share information between team members in a structured form. To serve that purpose it must be understandable to everyone (which is ensured by the notation and accompanying explanations) and accessible (published).
  • A data model is not created once and then frozen; it is created and evolves as the system develops. We set the rules of its evolution ourselves, and we can change them if we see how to make them better, simpler, more effective.
  • The (physical) data model allows us to consolidate and reuse a set of best practices aimed at optimization - i.e. the techniques that have already proved themselves for the given DBMS.

Features of data warehouse projects


Let's dwell on the features of the projects within which a company builds and develops data warehouses, and look at them from the point of view of the architectural aspect: why it is important to build the architecture of such projects from the very beginning. It is the presence of a well-thought-out architecture that gives a data warehouse project its flexibility, allows work to be distributed effectively between performers, and makes the result easier to predict and the process more manageable.

Data Warehouse is custom software

A data warehouse is always a "custom development", not a boxed solution. Yes, there are industry-specific BI applications that include a reference data model, pre-configured ETL processes from common sources (for example, ERP systems), and a set of typical BI dashboards and reports. But in practice a warehouse is rarely implemented as a "box". I have been working with warehouses for about 10 years and have never seen such a story. There are always nuances associated with the unique features of the company - both its business and its IT landscape. Therefore it is somewhat reckless to hope that the architecture will be supplied by the "vendor" providing the solution. The architecture of such systems usually matures within the organization itself, or is shaped by the specialists of the company acting as the main contractor on the project.

The data warehouse is an integration project

The data warehouse loads and processes information from many source systems, and to maintain "friendly relations" with them you need to treat them extremely carefully. Among other things, it is necessary to minimize the load on the source systems, take their availability windows into account, choose interaction interfaces that suit their architecture, and so on. Then the warehouse will be able to collect data as early as possible and with the required frequency. Otherwise you will be "switched over" to a backup circuit, which is not refreshed as often as the operational one.
In addition, the "human factor" must be taken into account. Integration is not only the interaction of machines; it is also communication between people.

Data Warehouse is a team project


In a large company such a system can rarely be built by a single team. As a rule, several teams work on it, each solving a specific problem.

The architecture must make it possible to organize their parallel work while preserving its integrity and avoiding duplication of the same functionality in different places by different people. Apart from the unnecessary labor costs, such duplication can later lead to discrepancies in the data.

In addition, when so many people and teams, often distributed, are involved in developing the system, the question inevitably arises of how to organize communication and information exchange between them. The more standard and well-understood the approaches and practices used, the easier, more convenient and more efficient it is to set up such work. Among other things, it is worth thinking about the set of "working artifacts", and for data warehouses the number one artifact is the data model (see the previous section).

The data warehouse has a longer lifespan compared to other systems

Let me clarify: the statement is true for a "live", working warehouse that is integrated with key sources, holds historical data, and provides information and analytical services to many divisions of the company.

What grounds do I have for believing so?
Firstly, building a warehouse is a very resource-intensive process: besides the actual costs of equipment, licenses for the necessary technological software, and development, almost all the systems and divisions of the company are involved. Repeating this whole process from scratch would be a very bold undertaking.

Secondly, if the warehouse has the right architecture, it can easily survive a change of source systems, the emergence of new requirements from end users, and the growth of data volumes.
If the architecture is right and the information flows are transparent, such a system can be developed for a long time without the risk of grinding to a halt when making changes because their impact is too hard to assess.

Gradual iterative development

The last thing the Customer wants, when getting involved in a warehouse project, is to freeze his requirements for a year or two until the complete corporate data model has been designed, all the sources have been fully connected, and so on.

In the eyes of Customers, the data warehouse often looks like an absolute monster - that is how vast the tasks, goals and development horizon of the system are. And the Customer is often afraid that the IT department will solve some of "its own tasks" "at his expense". Once again we run into the issue of interaction between people and the ability to calmly state one's position and negotiate.

Competent architectural approaches allow you to develop the system iteratively, growing its functionality gradually, without disappearing into "development" for several years before it starts delivering results.

It should be noted, though, that "miracles do not happen": the start-up also takes time. For warehouses it can be quite long, since they involve large volumes of data, including historical data from old periods when the rules for processing information may have differed from the current ones. Therefore, sufficient time is required for analytical work, for interaction with the source systems, and for a series of "trials and errors", including load tests on real data.

Data Warehouses - "multi-project story"

It is difficult to single out one business customer for a data warehouse, and it is believed (not without reason) that the key success factor of a warehouse-building project is the support of the company's management - directly of its first person.
A warehouse is rarely built and developed within a single project. As a rule, there are various needs for data consolidation and analytics, with different Customers and user groups behind them. Therefore, the warehouse is often developed within the framework of several parallel projects.

Balance of innovation and proven solutions

Although the topic of warehouses is rather "ancient" (if such a word is applicable to an industry as young as IT) and quite conservative, progress does not stand still: the limitations that used to exist because of expensive and slow disks, expensive memory and so on have now been removed. At the same time, it is time to reconsider some architectural approaches, and this applies both to technology platforms and to the architecture of the applied systems built on top of them.

It is important to strike a balance here and maintain a fairly "green" approach both to resources and to the stored information. Otherwise you can very quickly turn the repository into a semi-structured "garbage dump" which, if it can be sorted out at all, will take quite a lot of effort.
Yes, we have more options now, but this does not mean that we should reject all the practices that have been developed and tested over time, whose use and purpose are well understood, and "go off the deep end" led only by the foggy ghost of "innovation".
Keeping the balance means using new methods and approaches where they open up new opportunities, while still using the old, proven ones to solve the pressing problems that no one has cancelled.
What can we do as developers and designers of applied solutions? First of all, know and understand the technological changes of the platforms we work on - their capabilities, features and limits of applicability.

Let's look at the DBMS as the most critical and important technology platform for a warehouse.
Recently there has been a clear drift of relational databases, originally created as "universal", towards specialization. For a long time the leading vendors have been releasing different editions for applications of different classes (OLTP, DSS & DWH). In addition, extra capabilities have appeared for working with text, geo-data and so on.

But it did not end there: products began to appear that were focused from the start on a certain class of tasks, i.e. specialized DBMSs. They may or may not use the relational model. What matters is that they are "sharpened" from the outset not just for storing and processing "business information" in general, but for specific tasks.

Apparently, centralization and specialization are two complementary trends that periodically replace each other, ensuring development and balance - just like evolutionary (gradual) development and cardinal change. Thus, in the 1990s Michael Stonebraker was one of the authors of the Third-Generation Database System Manifesto, which clearly voiced the idea that the world did not need another revolution in databases. However, about fifteen years later he published works announcing the prerequisites for a new era in the DBMS world - based precisely on specialization.
He pointed out that widespread universal DBMSs are built on a "one-size-fits-all" architecture that takes into account neither the changes in hardware platforms nor the division of applications into classes for which better solutions can be devised than implementing universal requirements.
He then started a number of projects in line with this idea. One of them is C-Store, a columnar DBMS designed in the shared-nothing (SN) architecture and created specifically for systems of the data warehouse class. This product was later commercialized as HP Vertica.

It seems that the topic of data warehouse development has now entered a new stage. New technologies, approaches and tools are emerging, and studying, testing and applying them sensibly lets us create really interesting and useful solutions, bring them to production, and enjoy seeing our work used in real life and bringing benefit.

Epilogue

In preparing this article I tried to focus primarily on architects, analysts and developers who work directly with data warehouses. But it turned out that I inevitably "took the topic a little wider", and other categories of readers came into view. Some points will seem controversial, some unclear, some obvious - people are different, with different experience, background and position.
For example, typical managers' questions - "when should we bring in architects?", "when should we do the architecture?", "won't the architecture be too expensive?" - sound rather strange to us (developers, designers), because for us the architecture of a system appears with its birth, whether we are aware of it or not. And even if there is no formal architect role on a project, a sane developer always "switches on his inner architect".

In the grand scheme of things, it does not really matter who exactly plays the role of the architect; what matters is that someone asks these questions and looks for the answers to them. If an architect is explicitly singled out, it only means that he bears primary responsibility for the system and its development.
Why did the topic of “antifragility” seem relevant to me in relation to this subject?

"The uniqueness of antifragility is that it allows us to work with the unknown, to do something in conditions where we do not understand what exactly we are doing - and to succeed" / Nassim N. Taleb
Therefore, a crisis and a high degree of uncertainty are not an excuse for the absence of architecture, but factors that reinforce the need for it.