Steps
Ensure proper permissions
-
Sign in as a user with at least the Metadata Management capability object role assignment on the Configuration you are in.
You will be presented with the MANAGE > Configuration menu if you are any of the following:
-
Manager of the folder the configuration is in
-
Manager of the configuration
-
Manager of the version of the configuration
-
Editor of the configuration version
-
Manager of at least one model in the configuration version
-
Security Administration global capability
Create the configuration (as needed)
One may create, harvest and analyze metadata (models) from the MANAGE > Repository tree. However, generally it is best to create a configuration to work in (and to define the scope of analysis and search) and then go to MANAGE > Configuration and create and harvest there.
Create the model
-
Go to MANAGE > Configuration in the banner.
-
Click the plus sign under Manage Configuration and select Imported Model.
-
Provide a NAME for the model and pick the IMPORT BRIDGE (source technology / format to import from).
-
Pick an IMPORT SERVER to use for the import.
The list of available bridges will be quite long. However, you may use the search to find text anywhere in the name of the bridge, not just at the beginning. In addition, you may click on Vendor or Category to filter to specific groupings of bridges.
You may always harvest using a remote harvesting server by specifying it here under IMPORT SERVER.
-
Pick the (patch) Version. This option allows you to run the bridge with the latest or with earlier patch levels of MIMB. It is mostly used for troubleshooting.
The choice of (patch) Version depends upon the software patch level installed on the IMPORT SERVER, not the application server.
-
Click OK.
-
In the Overview tab you may again enter a name (upper left) and may enter a Definition for the model.
-
You may associate Labels with the model here.
-
If you make any changes, click SAVE.
Configure the import bridge parameters
-
Click the Import Setup tab.
-
Pick an IMPORT SERVER to use for the import.
You may always harvest using a remote harvesting server by specifying it here under IMPORT SERVER.
-
Here there will be a set of parameters which are unique to each import bridge. For each Parameter, click in the Value column and enter the appropriate value.
Each bridge parameter is extensively documented in the UI. Click the caret above Help on the right-hand side of the parameter list. Click a parameter to see the tool tip for that parameter. Click the Bridge name to see the tool tip for the bridge as a whole.
-
Be sure to click SAVE if you make any changes.
It is imperative that you read these "tool tips" when using a new bridge or diagnosing an issue with the import.
You may click in the empty space below the list of bridge parameters to return to the tool tip for the bridge as a whole.
You may click the TEST button in the upper right to attempt a connection to the source metadata tool.
Configure the import options
-
Click the Import Options tab.
-
Check any Version Management import options that apply:
-
Set new versions as default: ensures that each new version when harvested will be designated as the default version.
-
Create new versions only when new import has changes: means that a new version will not be created after the import if no changes are detected in the newly imported metadata. This option applies to all bridges, independently of the incremental harvesting capability of some bridges.
-
Copy model description to model: in this case, as a model is harvested and a new version is stored, the description given internally for the model (from the source metadata) is also applied to the model version itself.
Special version management options will be presented in the multi-version editions of MetaKarta. Please see the version management section for details.
Some bridges are capable of incremental harvesting, as they can detect what has changed in the source technology and therefore only harvest what has changed (updated or new). Enabling this feature offers much faster subsequent harvesting and much less space used in the produced new versions, as it can reuse everything that has not changed. This is primarily available in bridges from repositories of business intelligence tools (e.g. IBM Cognos, SAP BusinessObjects, Tableau) and data modeling tools (erwin Mart, ER/Studio Repository), based on the repository API. This is also available in data lake crawlers (e.g. File Systems, Amazon S3, Microsoft Azure Data Lake, Hadoop HDFS) based on directory/file date changes combined with other criteria specific to those technologies.
-
Under Data Management you may check the Data sample, profile and classify after metadata import according to the Data Setup tab option. Otherwise, data profiling and sampling will not be executed automatically with the import of the model.
Specifying data sampling and profiling options does NOT cause data profiling and/or sampling on every import of the model. Instead, these settings define how the sampling and profiling should be performed.
You may use the Data sample and profile after metadata import checkbox and MQL Statement to cause the profiling and sampling to occur every time the model is imported. However, that is not the best practice, as sampling and profiling large databases could take orders of magnitude more time than the metadata import.
Instead, you may:
-
Schedule the sampling and profiling separately using the Data sampling and profiling operation. This process will also only sample and profile what is specified in the MQL STATEMENT.
-
Sample and profile on demand via the user interface at any subset of the model you wish to specify.
Data Setup
-
Click the Data Setup tab.
-
Specify data sampling and data profiling options, as desired.
Import (a Version) of the model
-
Be sure to click SAVE if you make any changes.
-
Click IMPORT.
-
Check the Yes radio button for the FULL SOURCE IMPORT INSTEAD OF INCREMENTAL option and click IMPORT if you wish to clear any cached metadata and thus force a complete re-harvest.

The Full source import instead of incremental option is a necessary choice for full re-harvesting (without any incremental harvesting reuse of the cache):
-
after deploying a new bridge (MIMB cumulative patch)
-
after changing the bridge parameters (e.g. moving the source from dev to prod)
Because of this option, all Incremental import parameters of any import bridge have been hidden, in order to simplify the user experience.
Therefore, any scheduled import always uses incremental harvesting on multi-models.
Generally, imports are scheduled and thus watchers and managers are notified on import failure. If performing the import manually, as here, it is assumed you will simply wait for the process to complete and thus would not need notification of failure. However, you may use NOTIFY ON FAILURE to send messages out to watchers and users who have the metadata management capability on the model.
-
Check the Yes radio button for the SAVE THE IMPORTED MODEL TO THE DATABASE option.

This is not a common selection, but it can be handy when testing bridges for their logs without actually loading the model into the repository.
In some cases, you will see in the resulting log that the process ran out of memory. Please be sure to set the appropriate amount of memory using the bridge Miscellaneous option (see the bridge tool tip) or the conf.properties file (more details may be found in the deployment guide). There are, potentially, two places to increase memory, though. In addition to what the bridge requires (above), the Application Server will also need to have at least the same amount of memory available, plus whatever is needed to produce the UI.
Example
Sign in as Administrator.
Create the folder and configuration (as needed).
Go to MANAGE > Configuration. Click the plus sign to create a New Model.

Select the Imported Model radio button and enter "New Model" in the NAME.
The IMPORT SERVER option allows you to specify where the import will take place, on one of the harvesting servers.
Choose the Microsoft SQL Server Database SQL DDL bridge.
The list of available bridges will be quite long. However, you may use the search to find text anywhere in the name of the bridge, not just at the beginning. In addition, you may click on Vendor or Category to filter to specific groupings of bridges.
In this case, click Vendor then scroll to Microsoft.

It is still a bit long, so click None and type "DDL" into the Search box and note that the list now only includes matching bridges.

Click OK.
Enter a Definition in the Overview tab and click SAVE.

In the Import Setup tab, use the Browse icon to navigate to C:\Temp\Models\MetadataManagement\Finance\DatabaseDDL. Select the FinanceDWStaging.SQL file. Specify "dbo" as the Default Schema. Be sure to click Save.
If you cannot find the location using the Browse function you must configure (as part of the installation) the available paths to present to users. More details may be found in the deployment guide.

You may use the text Browse box to quickly find a file or folder you are looking for.

The File parameter shows an unusual path. This is because the product hides application drives, like C:\Temp\Models..., and presents them with the browse:// prefix.
Click to edit the Miscellaneous bridge parameter and click Browse (magnifying glass icon).

You are presented with the multi-line Miscellaneous bridge parameter editor.
The tool tip with options is provided below, and you may also select the options from the pull-down.
Click Add Parameter and enter the following in the Miscellaneous bridge options:

Click CANCEL.
Enter "dbo" in the Default schema bridge parameter.
Click SAVE. Then click TEST in the upper right.

If the test connection was unsuccessful, when you click on SHOW LOG you are presented with the log at the location of the first error.
Click CLOSE.
Required bridge parameters are shown in red and have an asterisk (*) appended to the parameter name.
Click IMPORT.
Leave the Full source import instead of incremental set to No.

Click IMPORT.
Also, there is an indicator in the upper right banner showing an active operation.

Click on the active operation.


Refresh until you see the Operation succeeded check mark, which indicates a successful import. Then, select the New Model in the Configuration Manager panel and click OPEN to open the model's object page, or navigate to the model or its contents.

Incremental Harvesting
Some bridges are capable of incremental harvesting as they can detect what has changed in the source technology and therefore only import what has changed (updated or new). Enabling this feature offers much faster subsequent harvesting, and much less space utilization. Even with incremental harvesting, a new version of the imported model will be produced if there are changes or if Create new versions only when new import has changes has not been specified in the Import Options tab. Nevertheless, the produced new version simply reuses (from the preceding model version) everything that has not changed.
For those bridges with this capability, it is enabled by default. If you wish to override this default behavior and import everything, changed or not, from the source, you may either:
-
Import manually (IMPORT button in MANAGE > Configuration) and check the Yes radio button for the FULL SOURCE IMPORT INSTEAD OF INCREMENTAL option.
-
Add the "-cache.clear" option to the Miscellaneous bridge parameter and future imports (include scheduled ones) will NOT take advantage of incremental harvesting.
Exceptions
Some metadata sources do not provide enough information for the incremental harvesting algorithms to determine if there have been changes without completely harvesting the source and then comparing it with the cache of the previous imports. In particular:
MySQL JDBC import
MySQL does not store the last time a view was re-defined (Altered) in the database metadata. They recommend using a trigger to track this information. In this case, the bridge will count views in order to detect differences, but skip their last modification time.
Analyze the import log
If the Operation completed box is not checked at the end of the import, or you see Import Unsuccessful in the log, you do not have a successful import and will need to investigate the log to address any errors. Even without that indication, there may still be informational messages and warnings that you will want to investigate.
The log messages are self-documenting and should provide enough information to analyze and correct the issue. If you must report an issue to the MetaKarta support, be sure to include this log. Click Save Log to download the log file.
How to use the UI to analyze a log.
Connections and Configuration Management
Once you have imported the model into the current configuration that you are operating under, it must then be stitched to the surrounding models in the architecture of the configuration.
Configure Naming Standards
Naming standards are used to construct the Business Name of an object from its physical name, either from a defined set of naming standard name/abbreviation pairs or, if there is no matching name/abbreviation pair or no naming standard is specified, from simple fixed rules (Naming Rules) and the options you specify here.
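For illustration only, here is a minimal Python sketch of the general idea; the abbreviation pairs (CUST, ORD, DT) and the title-case fallback are hypothetical and not the product's actual naming standard or Naming Rules:

# Hypothetical illustration of naming-standard expansion; the real Naming
# Standards engine and its options are configured in the product UI.
ABBREVIATIONS = {"CUST": "Customer", "ORD": "Order", "DT": "Date"}  # assumed pairs

def business_name(physical_name: str) -> str:
    words = []
    for token in physical_name.split("_"):
        # Use a matching name/abbreviation pair when one exists,
        # otherwise fall back to a simple fixed rule (title case).
        words.append(ABBREVIATIONS.get(token.upper(), token.capitalize()))
    return " ".join(words)

print(business_name("CUST_ORD_DT"))  # -> "Customer Order Date"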
Return to MANAGE > Configuration and select the same model. Click the Naming Standards tab to define Naming Standards.

Click Yes to enable naming standards.

Select which Naming Standard to use (or build).
You may select from any Naming Standard in the repository-wide Naming Standards. You must be sure to first include the repository-wide Naming Standards object in the current configuration you are working in. You may do so using the MANAGE > Repository facility.
The other options are explained in the naming standards section.
Assign Responsibilities
Click the Responsibilities tab and associate the capability object role assignment with users or groups.

A message like the following is presented if you do not have sufficient permissions (object role assignment with the proper object capabilities):

Versions
Click the Versions tab to see current and historical versions of the model.

Set new versions as default
When importing a model, if a new version is to be created, then this new version can either be the default version or not. Initially, the new version is set to be the default version of the model.
The default version is the one that:
-
Is acted upon when you act on a model as a whole (e.g., if one opens a model, but not a specific version, in the metadata manager user interface, then it is the default version that is opened)
-
Is included in a configuration that is set to Auto Update.
Click the Versions tab and right-click on the only version in the list.

As this is the only version, it is already the default. If you import again and create a new version as part of that process, you may assign defaults.
Create new versions only when new import has changes
Checking this box means that a new version will only be created if changes are detected. This generally only applies to BI tools, DI tools and data modeling tools. Differences are detected using various methods depending upon the specific technology imported from.
Go to the Versions tab and note that no new version is created.

Now, go to the Import Options tab and uncheck Create new versions only when new import has changes. Click Save. Click IMPORT. Click IMPORT again without checking any boxes.
Go to the Versions tab and note that now a new version is created.

Propagate Documentation Checkbox (Deprecated)
MetaKarta supports the propagation of documentation (including Name and Business Definition, diagrams, join relationships, and custom attributes) from older versions of a model. This propagation of the documentation is now always enabled. Thus, when you check the Propagate documentation checkbox, MetaKarta does not change behavior, and documentation changes made to any historical version are still propagated to new versions.
If you leave this checkbox unchecked, the same behavior is true.
This feature can produce conflicts and unexpected results. For example, when you edit the same piece of documentation in different versions, the latest edit wins. This is true even when you edit the latest version first and an old version later. You should disable the feature after you finish making future-proof changes to older versions to avoid unnecessary conflicts.
Data Profiling and Sampling
You may configure the harvest to perform data sampling and/or profiling when importing the metadata.
Steps
-
Configure a model to be harvested.
-
Click the Data Setup tab.
-
Specify data sampling and data profiling options, as desired.
Specifying data sampling and profiling options does NOT cause data profiling and/or sampling on every import of the model. Instead, these settings define how the sampling and profiling should be performed.
You may use the Data sample and profile after metadata import checkbox and MQL Statement to cause the profiling and sampling to occur every time the model is imported. However, that is not the best practice, as sampling and profiling large databases could take orders of magnitude more time than the metadata import.
Instead, you may:
-
Schedule the sampling and profiling separately using the Data sampling and profiling operation. This process will also only sample and profile what is specified in the MQL STATEMENT.
-
Sample and profile on demand via the user interface at any subset of the model you wish to specify.
In a best practices environment, the application or model administrators should define the data import policy (profiling, sampling, and classification), and the users or scheduled tasks should follow that policy. Not surprisingly, companies use different frequencies for metadata and data imports, and thus running the data import (data sampling and profiling) immediately after every import of metadata is discouraged.
Thus, in the user interface, the option to import data is kept separate from the data import policy: the policy is displayed in the Data Defaults tab and the option in the Import Options tab. In this way, the option and policy are independent. The option is turned off by default, but the administrators can enable sampling and profiling according to policy. In addition, the administrators may turn off sampling and profiling by default but enable data classification and data import. Still, data sampling and profiling is required in order to perform auto-tagging via data-driven data classification.
-
Go to the Import Options tab and specify Data sample, profile and classify after metadata import according to the Data Setup tab in order to sample and/or profile just after the import. Optionally, you may then enter an MQL STATEMENT to define a data request scope, which is a subset of tables defined by a provided Metadata Query Language (MQL) query (e.g. tables from a set of schemas, or tables with/without a user defined data sampling flag).
-
Click SAVE.
-
Click IMPORT.
Example
Sign in as Administrator.
Create the folder and configuration (as needed).
Go to MANAGE > Configuration. Click the model named Data Lake that is using the File System bridge.
In the Data Setup tab, specify Data Sampling and Data Profiling with the default number of rows.

Click SAVE.
Go to the Import Options tab and specify Data sample, profile and classify after metadata import according to the Data Setup tab in order to sample and/or profile just after the import.

Click SAVE and click IMPORT.
After importing, the data profiling and sampling will still not be executed unless you checked the Data sample, profile and classify after metadata import according to the Data Setup tab option.
Specifying data sampling and profiling options does NOT cause data profiling and/or sampling on every import of the model. Instead, these settings define how the sampling and profiling should be performed.
You may use the Data sample and profile after metadata import checkbox and MQL Statement to cause the profiling and sampling to occur every time the model is imported. However, that is not the best practice, as sampling and profiling large databases could take orders of magnitude more time than the metadata import.
Instead, you may:
-
Schedule the sampling and profiling separately using the Data sampling and profiling operation. This process will also only sample and profile what is specified in the MQL STATEMENT.
-
Sample and profile on demand via the user interface at any subset of the model you wish to specify.
Email Notification
In addition, one may subscribe to or be a watcher of models in the repository and thus receive notifications of changes to those particular models (or sub-models).
In addition to the watchers of a model, the following are the recipient rules for different types of models:
-
For configurations the model change notification is sent to the users who have the metadata management capability role on the configuration model
-
For harvested models and user models, the model change notification is sent to the users who are the watchers of the models
-
The import failed and other operation failed notifications are sent to the users who have the metadata management capability on the model.
Example
Sign in as Administrator and configure the email server information.
Go to OBJECTS > Explore and navigate to Accounting > Accounting.

The object page section provides more information on how to set and manage watchers.
In earlier versions of the product, stewards were the ones notified of changes to models. This is no longer the case. Instead, watchers are notified of changes to models when enabled. The migration from stewards to watchers is performed automatically on upgrade from previous versions, and thus all stewards become watchers.
Importing Data Store Models from Metadata Excel Format
You may define a data store (database, files, etc.) by first defining it in Metadata Excel Format, which is a data modeling format using Excel for the specification.
Once imported using that bridge, the resulting model may be treated just like any other data store model (e.g., a JDBC database model) and included in lineage, data cataloging, etc.
Generally, you should consider this a method of last resort, and instead use the specific bridge for the technology type.
The Metadata Excel Format has been used in the past to specify a data mapping. This feature is deprecated, and all such Excel spreadsheets should be converted to Data Mapping Script format.
Excel Add-In
MetaKarta includes an Add-In which may be added to your locally installed Microsoft Excel software. Instructions are included in Cell B1 ("How to Use") on any Models tab in any exported Metadata Excel Format spreadsheet or the sample one. Hover the mouse over that cell in the spreadsheet and see the documentation for that cell for details on how to install the Add-In for various versions of Microsoft Excel.
Steps
-
Obtain a spreadsheet in the Metadata Excel Format by either:
-
Obtain the sample spreadsheet from conf\MIRModelBridgeTemplate\MIRMicrosoftExcel\Standard-Blank.xlsx
-
Use the Metadata Excel Format export bridge from a data store model you have already imported.
-
Open the spreadsheet in Microsoft Excel.
-
Go to the Models worksheet.
-
Hover the mouse over cell B1 in the spreadsheet and follow the documentation.
Example
Obtain the sample spreadsheet from conf\MIRModelBridgeTemplate\MIRMicrosoftExcel\Standard-Blank.xlsx.

Go to the Models worksheet and hover the mouse over the cell B1 in the spreadsheet and follow the instructions in that documentation.

The result is a new Metadata ribbon in Microsoft Excel:

The Add-In is self-documented in tool tips:
Cloud Identity
Public clouds provide identity management and access control infrastructure that enables their customers to define one security principal that can access multiple services using secret-protected or temporary credentials. For example, Azure allows you to define an identity for an Application, like MM, that can access your Storage and Database services. The MM application can get temporary credentials, like Access Tokens, that can be used to access Azure services. Public clouds support key vaults that help you to safeguard secrets used by cloud apps and services. Each secret has a unique secret identifier which is a URL to a cloud identity secret vault secret (allowing for external storage of such passwords in a cloud secret vault).

For more details see the Manage Cloud Identities section.
Open In Tool
You may open a third-party source tool (the tool a model was imported from, such as Qlik Sense). In order to do so, you must:
-
Enable and specify the Server URL signature to call the third-party tool in the Open in Tool tab in MANAGE > Configuration for that model.
-
Use the Open in Tool icon (column in a worksheet or on the object page), which is only available if the above is enabled.
Steps
-
Go to MANAGE > Configuration and select the imported model.
-
Enable and specify the Server URL signature to call the third-party tool in the Open in Tool tab.
-
Use the Open in Tool icon (column in a worksheet or on the object page)
Example
Sign in as Administrator, go to MANAGE > Configuration, select the Qlik Sense Cloud model, and go to the Open in Tool tab.

Enable and specify the Server URL signature to call the third-party tool.
Then users have access to the Open In Tool capability at various locations such as:
An individual sheet:

which opens that sheet in the app on the Qlik Sense server as follows:

Importing using Remote Harvesting Agents
In order to harvest metadata (model import) from a tool (e.g. database, DI/ETL, BI tool, etc.), the local MetaKarta (Default) Server may not be able to support that tool, and a remote harvesting server (agent) is therefore required in the following cases:
-
a bridge running on Microsoft Windows only (such as COM based tools like SAP BusinessObjects, MicroStrategy), while the MetaKarta Server is running on Linux (on Prem or on Cloud)
-
a bridge requiring the tool (often the client) to be installed in order to access its SDK (such as Oracle DI or Talend DI), while the MetaKarta server is running on the cloud (with no local file system access)
-
a bridge requiring access to local files / drivers (such as Database JDBC bridges), while the MetaKarta server is running on the cloud (with no local file system access)
-
a bridge connected to a tool running on a different network (such as a MetaKarta server on cloud harvesting from sources on prem).
Go to supported tools for more details on the above requirements:
You may define the remote server when creating a model or after one is created via the Import Options tab.
Steps
-
Sign in as a user with at least the Metadata Management capability object role assignment on the Configuration you are in.
You will be presented with the MANAGE > Configuration menu if you are any of the following:
-
Manager of the folder the configuration is in
-
Manager of the configuration
-
Manager of the version of the configuration
-
Editor of the configuration version
-
Manager of at least one model in the configuration version
-
Security Administration global capability
Define the remote agent when creating the configuration (as needed)
-
Go to MANAGE > Configuration in the banner.
-
Click the plus sign under Manage Configuration and select Imported Model.
-
Pick an IMPORT SERVER to use for the import.
The status (green means available) of the servers is displayed next to the name of that server.
Example
Sign in as Administrator and go to MANAGE > Configuration. Click the plus sign under Manage Configuration and select Imported Model. Use the IMPORT SERVER pull-down list to pick a remote agent that is available.

The choice of bridges is defined by the remote agent and thus reflects the bridges available on that machine. E.g., if you select a Microsoft Windows based remote agent, then Windows-only bridges (such as COM based tools like SAP BusinessObjects and MicroStrategy) will be listed, even though the MetaKarta Server is running on Linux.
Harvest several Models from a directory of external metadata files
It is common for an organization to have a large number of external metadata files but not use an external metadata repository. Often, such an organization would like to import the files into MetaKarta in batch in an automated fashion. MetaKarta has the ability to support this scenario with the help of a harvesting script.
In this case, the files are stored under a file directory which is accessible to the MetaKarta application server. The script scans the directory and its subdirectories for files of the particular external metadata type and finds matching Models under a particular folder in MetaKarta. The MetaKarta repository folder and model structure will match the structure of files and their directories on the file system. When the necessary Model does not exist, the script creates one and imports the file. When the Model is already present, the script will re-import it if the file's version has not been harvested yet.
One can schedule MM to run the script periodically. This allows customers to place files under the directory and be assured that MetaKarta will import them automatically. It will work for any single-model file-based bridge.
A special model named Settings must also be defined in order to control how the files will be imported (what source tool and what bridge parameters).
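For illustration only, a conceptual Python sketch of the scan-and-import logic described above; this is not the actual product script, and the folder layout, file extension, and import callbacks are assumptions:

# Conceptual sketch only -- the real harvesting script is provided by the product.
import os

def scan_and_import(root_dir, already_harvested, create_model, import_file):
    # Walk the directory tree that is accessible to the application server.
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(".sql"):   # assumed external metadata file type
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root_dir)   # mirrors the folder structure in the repository
            if rel not in already_harvested:
                create_model(rel)                   # create the matching Model if it does not exist
                import_file(rel, path)
            elif os.path.getmtime(path) > already_harvested[rel]:
                import_file(rel, path)              # re-import only newer file versions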
Ensure proper permissions
-
Sign in as a user with at least the Metadata Management capability object role assignment on the Configuration you are in.
Create the configuration (as needed)
One may create, harvest and analyze metadata (models) from the MANAGE > Repository tree. However, generally it is best to create a configuration to work in (and to define the scope of analysis and search) and then go to MANAGE > Configuration and create and harvest there.
Create the MetaKarta folder
-
Go to the MANAGE > Repository
-
Right click on a folder in the Repository Tree where you want to place the folder containing the results of the import and select New > Folder.
-
Name the folder accordingly.
Create the Settings file to control the import of Models:
-
Right click on that new folder in the Repository Tree and select New > Imported Model.
-
Select the Overview tab in the Create Model dialog.
-
Enter "Settings" in the NAME for the model.
-
Select the correct source format in the IMPORT BRIDGE pull-down.
-
Click OK.
-
Select the Import Setup tab.
-
For each of the Parameters, complete the value according to the tool tips displayed in the right-hand panel of the dialog. In particular, for the File parameter:
- Click on the Browse icon and browse for a file inside the directory structure on the file system.
If you cannot find the location using the Browse function you must configure (as part of the installation) the available paths to present to users. More details may be found in the deployment guide.
- Update the File parameter so that the path only refers to the top level of the directory structure on the file system (i.e., remove the file name and any sub-directory names, as well as the trailing slash or backslash).
-
Click SAVE.
Harvest the Models on demand:
-
Right click the new folder in the Repository Panel and select Operations > Import new model(s) from folder.
-
Click Run Operation.
-
The Log Messages dialog then appears and log messages are presented as the import process proceeds.
-
If you receive the Import Successful result, click Yes to open the Model. If instead you see the Import Failed result, inspect the log messages and correct the source Model file accordingly.
-
Be sure to include those new models in the configuration.
You may now browse the Models.
File System Like Import Considerations
There are bridges that can import file (and object) system metadata available on-premise (e.g. Linux, Windows) and in the cloud (e.g. AWS S3, Google Storage). These File System like bridges (e.g., AWS S3, File System, Azure Blob Storage, etc.) are available in order to import from collections of files (e.g., CSV, PSV, XLSX, etc.) such as in a data lake. Each bridge requires you to authenticate with the file system using its proprietary credentials. All bridges share the same file system import methodology. This document outlines the methodology and best practices for importing file system metadata.
The result represents cloud file stores using the traditional file system abstraction.
Methodology
The bridge imports metadata about subfolders and files located under the “Root Folder” parameter. Oftentimes many subsets of those files are nearly identical in their structure (metadata definitions) and only vary by variables or dimensions, either in the names of the files themselves (e.g., MM-DD-YYYY-Sales.csv) or the hierarchy of folders into which they are organized. These subsets are referred to as partitions, and these bridges will attempt to identify the partitioning automatically, or assist in directly defining partitioning.
The bridge tries to understand the internal schema (fields) of data files by using file type recognizers. The bridge has recognizers for standard formats, like CSV, JSON, and Apache Avro. When the bridge cannot recognize a file (e.g., JPEG), it imports the file with its type set to UNKNOWN. The file system can have standalone data files and partition folders of data files. The bridge imports them as files and partitioned datasets. A dataset is similar to a database table. It has fields as columns. When a file contains the names of fields, MIMB imports them. Otherwise, the bridge identifies fields by their position in the file and calls them col1, col2, etc. The bridges will try to infer the data type of each field from its data.
In the case of the partitioned dataset, a field can represent a partition. Applications, like Hive and Spark reference partitions by name only. Partition fields do not have positions.
Recognizers
A recognizer reads the data in a file. If it recognizes the format of the data, it generates a schema. The recognizer also returns a confidence percentage to indicate how certain the format recognition was. These bridges provide a set of built-in recognizers, but you can also create custom recognizers. A recognizer can be associated with file extensions. The bridges invoke recognizers with matching file extensions first. It invokes matching custom recognizers (in the order that you specify them) before predefined ones. When a recognizer returns a confidence percentage = 100 during processing, it indicates that it is 100 percent certain that it can create the correct schema. The bridge then uses the output of that recognizer.
If no recognizer returns a confidence percentage = 100, the bridge uses the output of the recognizer that has the highest confidence percentage. If no recognizer returns a confidence percentage greater than 0, the bridge classifies the file type as UNKNOWN.
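As an illustration of the selection logic just described, here is a minimal Python sketch; the recognizer interface shown (each recognizer returning a confidence and a schema) is an assumption for illustration, not the bridge's actual API:

# Hypothetical recognizer interface: each recognizer returns (confidence, schema).
def pick_schema(recognizers, file_path):
    best_confidence, best_schema = 0, None
    for recognizer in recognizers:            # custom recognizers are tried before built-in ones
        confidence, schema = recognizer(file_path)
        if confidence == 100:
            return schema                     # 100 percent certain: use this output immediately
        if confidence > best_confidence:
            best_confidence, best_schema = confidence, schema
    # No recognizer returned a confidence greater than 0: classify the file type as UNKNOWN.
    return best_schema if best_confidence > 0 else "UNKNOWN"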
Custom Recognizers
When you have custom text file formats (e.g. logs or CSV files with multi-line preamble) and would like the bridge to recognize (and profile) them automatically you can define custom recognizers for them.
NOTE: When you know all your custom formats you can define them manually in the Data Catalog.
The output of a recognizer includes a string that indicates the file's type or format (e.g. CSV) and the schema of the file. For custom recognizers, you define the logic for creating the schema based on the type of recognizer. Recognizer types include defining schemas based on GROK patterns and CSV format options.
TBD: File-based databases, like Hive and Athena, use configurable SerDe serialization libraries that help them to read hierarchical and custom flat files as tables. Data Catalog can help with configuring these libraries, importing their metadata and data profiling details at the same time.
Grok recognizer
Grok is a tool that is used to parse textual data given a matching pattern. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. The bridge uses grok patterns to infer the schema of your data. When a grok pattern matches your data, the bridge uses the pattern to determine the structure of your data and map it into fields.
MIMB provides many built-in patterns, or you can define your own. You can create a grok pattern using built-in patterns and custom patterns in your custom classifier definition. You can tailor a grok pattern to classify custom text file formats.
Patterns can be composed of other patterns. For example, you can have a pattern for a SYSLOG timestamp that is defined by patterns for the month, day of the month, and time (for example, Feb 1 06:25:43). For this data, you might define the following pattern:
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
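To illustrate how such composed patterns resolve to regular expressions, here is a small Python sketch; the sub-pattern definitions below are simplified assumptions, not the bridge's built-in grok pattern library:

import re

# Simplified stand-ins for built-in grok sub-patterns (assumptions for illustration).
PATTERNS = {
    "MONTH":    r"[A-Z][a-z]{2}",
    "MONTHDAY": r"\d{1,2}",
    "TIME":     r"\d{2}:\d{2}:\d{2}",
}
PATTERNS["SYSLOGTIMESTAMP"] = r"%{MONTH} +%{MONTHDAY} %{TIME}"

def expand(pattern):
    # Replace each %{NAME} reference with its (recursively expanded) definition.
    return re.sub(r"%\{(\w+)\}", lambda m: expand(PATTERNS[m.group(1)]), pattern)

print(re.fullmatch(expand("%{SYSLOGTIMESTAMP}"), "Feb 1 06:25:43") is not None)  # True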
CSV recognizer
You can use a custom CSV recognizer to infer the schema of various types of CSV data. The custom attributes that you can provide for your recognizer include delimiters, options about the header, and whether to perform certain validations on the data.
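A minimal Python sketch of what such delimiter- and header-driven schema inference might look like; the option names and the simple type inference are illustrative assumptions, not the recognizer's actual configuration:

import csv, io

def csv_schema(sample_text, delimiter=",", has_header=True):
    # Infer field names (or positional col1, col2, ...) and rough data types from a sample.
    rows = list(csv.reader(io.StringIO(sample_text), delimiter=delimiter))
    header = rows[0] if has_header else [f"col{i + 1}" for i in range(len(rows[0]))]
    data = rows[1:] if has_header else rows
    def infer(values):
        return "INTEGER" if all(v.isdigit() for v in values if v) else "STRING"
    return [(name, infer([r[i] for r in data])) for i, name in enumerate(header)]

print(csv_schema("id,name\n1,Alice\n2,Bob\n"))  # [('id', 'INTEGER'), ('name', 'STRING')]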
Compression
If the file that is imported is compressed, the bridge tries to stream its data. It is not reasonable to stream data of file formats, like Apache Parquet, that store metadata at the end of the file. In this case, the bridge must download and uncompress the whole file which can take a long time.
When a bridge runs, it interrogates files to determine their format and compression type and saves these properties on the file object. Some file formats (for example, Apache Parquet) enable you to compress parts of the file as it is written. For these files, the compressed data is an internal component of the file, and the bridge does not set the Compression Type property for them. In contrast, if an entire file is compressed by a compression algorithm (for example, gzip), then the bridge sets the Compression Type property.
Partition handling
Partitioning refers to the splitting of a dataset along its dimensions. Each partition contains a part of the dataset. For example, a dataset representing Customers could be partitioned by Country. A dataset can be partitioned by more than one dimension. For example, a dataset of Orders could be partitioned by Day and by the Country which generated the order.
Folder-based partitioned datasets
The folder-based partitioning is based on a file system hierarchy. For example, the following file system organization represents a dataset called Sales with Year, Month, and Day dimensions (partition key fields). It is partitioned at the Day level, with one folder per partition.
Sales/Year=2017/Month=08/Day=01/file0.csv
Sales/Year=2017/Month=08/Day=02/file1.csv
Sales/Year=2017/Month=09/Day=01/file0.csv
Sales/Year=2017/Month=09/Day=02/file1.csv
Formally, a folder-based partitioned dataset is a single-rooted hierarchy of folders with similar files at the bottom of the hierarchy, empty intermediate folders, and these files located at the same folder distance from the top folder. The actual data in the files is not used to decide which records belong to which partition. The dataset schema consists of the file fields found in the files and the partition fields represented as folders.
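For illustration, a small Python sketch of how partition key fields can be read from such folder names, assuming the standard Name=Value convention shown above:

def partition_keys(path):
    # Split "Sales/Year=2017/Month=08/Day=01/file0.csv" into the dataset name,
    # the partition key fields, and the data file at the bottom of the hierarchy.
    parts = path.split("/")
    dataset, filename = parts[0], parts[-1]
    keys = dict(p.split("=", 1) for p in parts[1:-1] if "=" in p)
    return dataset, keys, filename

print(partition_keys("Sales/Year=2017/Month=08/Day=01/file0.csv"))
# ('Sales', {'Year': '2017', 'Month': '08', 'Day': '01'}, 'file0.csv')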
Schema
A file has a type, like CSV or Parquet. The bridge depicts it in the File Type attribute. All file types share the following field characteristics that define the file schema:
- Name
- Position
- Data type
- Partitioning
- Hierarchy
When a file defines the names of fields the bridge imports them. Flat file formats, like CSV, do not need to define the names of fields (header row). ETL and BI tools address fields primarily by their positions. ETL, BI, and File system bridges identify fields by their positions. It allows Data Catalogs to stitch ETL and BI to File system metadata by position, by default.
When an ETL or BI tool addresses a field by name its bridge depicts the information. Data Catalogs should respect the information and stitch these fields by name. A field has a data type, like DATE and INTEGER. Some file formats, like Parquet, Avro and ORC define data types, and the bridge imports them. When the file format does not define data types the bridge tries to infer them from the raw data. When it cannot, it sets the data type to UNDEFINED (e.g. field does not have any values or they are set to NULL).
The partitioning information, like partition key fields, is usually defined outside a file by its parent folders. The bridge tries to discover the partition key fields and add them to the dataset schema. Some formats, like XML and JSON, are hierarchical. The bridge depicts them in the schema using parental aggregation hierarchy. JSON format supports arrays. The bridge represents them in the schema using a special aggregation level. An XML element can have a value. The bridge represents it using a dedicated attribute called $value$. Some formats, like Parquet, Avro, and ORC, can be hierarchical but are primarily used to carry flat data. The bridge depicts these files’ hierarchy when it is used.
Similarity
Two schemas are similar when they are the same or have small differences. Here are some tolerable small differences:
- schema A is a subset of schema B when schema B has all the fields of schema A with the same names, positions, and data types. Schema A should have at least two fields.
- a,b ≃ a,b,c
- a,b ≄ b,a
- a(int),b(int) ≄ a(date),b(int)
- The above applies to hierarchical schemas at every level without concern for positions. Hierarchical schemas support ordered and unordered lists of children. o/a,o/b ≃ o/b,o/a
Here are some big differences that make schema dissimilar:
- Different partition fields
- Hierarchical schemas are different when they have an object and attribute by the same name under the same parent object a@b ≄ a/b/c
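A minimal Python sketch of the flat-schema subset rule above, where a schema is a list of (name, type) fields in position order; treating the subset as a positional prefix is a simplification for illustration:

def similar(schema_a, schema_b):
    # Tolerable difference: the shorter schema is a prefix of the longer one,
    # with identical names, positions, and data types, and at least two fields.
    shorter, longer = sorted([schema_a, schema_b], key=len)
    return len(shorter) >= 2 and longer[:len(shorter)] == shorter

print(similar([("a", "int"), ("b", "int")], [("a", "int"), ("b", "int"), ("c", "str")]))  # True
print(similar([("a", "int"), ("b", "int")], [("b", "int"), ("a", "int")]))                # False
print(similar([("a", "int"), ("b", "int")], [("a", "date"), ("b", "int")]))               # False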
Partition detection
The bridge tries to automatically detect partitioned datasets and their partition fields (keys). It calls the dataset by the name of its top folder. When the majority of file schemas at a folder level are similar, the bridge creates a partition of a dataset instead of separate datasets per file.
A partition folder of a dataset can have many thousands of similar files. The bridge has the Sample size parameter that allows you to control the number of files that would represent the partition. The default value of the parameter is 5 (TBD). You can leave it empty to request the bridge to inspect all files in all folders.
The bridge skips empty files, files that start with underscores, and other popular files that major tools employ for partition housekeeping. You can remove more files from the consideration by specifying their patterns in the bridge Exclude filtering parameter.
Partition folders
When you decide to use a partitioned dataset you could be interested to know how many years back it goes or if it covers US and EU regions. This information can be derived from the partition folder names.
Sales/region=US/Year=2017/
Sales/region=EU/Year=2021/
Application vs Standard partitioning
Each dimension (partition) has a name. Standard frameworks, like Hive and Spark, follow the PartitionName=Value naming convention for partition keys. For example:
Sales/Year=2017/Month=08/Day=01/file0.csv
Proprietary applications do not have to follow the standard naming convention. At best they would use Values for partition folder names.
Sales/2017/08/01/file0.csv
At worst, they could have the dataset name anywhere in the folder hierarchy:
- Folder/2017/08/01/Sales/file0.csv
- Folder/2017/08/01/Orders/file0.csv
You can use the bridge Include and Exclude filter parameters to specify which datasets the bridge should import and which datasets to skip. In the above case, you can only specify one dataset to import.
A standard partition folder contains its partition name. An application one does not. The bridge calls them partition1, partition2, etc.
You should plan to set proper dataset and partition names in the target Data Catalog tool. For example, the following application partition structure:
Folder/2017/08/01/Sales/file0.csv
can be renamed as:
- Folder (dataset) => Sales
- partition1 (partition field) => Year
- partition2 (partition field) => Month
- partition3 (partition field) => Day
When you need to import the Sales and Orders partitioned datasets in one model, you can use the bridge Partition Folder parameter to specify the application partition pattern, like Folder/{Year}/{Month}/{Day}/[*]/*.csv
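For illustration, a rough Python sketch of how such an application partition pattern could map folder levels to named partition fields; the pattern handling is simplified and is not the bridge's actual implementation:

import re

def pattern_to_regex(pattern):
    # {Name} -> named capture of one folder segment; [*] and * -> any segment text.
    out = []
    for token in re.split(r"(\{\w+\}|\[\*\]|\*)", pattern):
        if re.fullmatch(r"\{\w+\}", token):
            out.append(f"(?P<{token[1:-1]}>[^/]+)")
        elif token == "[*]":
            out.append("[^/]+")
        elif token == "*":
            out.append("[^/]*")
        else:
            out.append(re.escape(token))
    return "".join(out)

m = re.fullmatch(pattern_to_regex("Folder/{Year}/{Month}/{Day}/[*]/*.csv"),
                 "Folder/2017/08/01/Sales/file0.csv")
print(m.groupdict())  # {'Year': '2017', 'Month': '08', 'Day': '01'}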
Similar but different neighboring datasets
To influence the bridge to create separate datasets for some partitions, add each dataset's top folder to the bridge Partition Directories parameter. For example, consider the following file system structure:
- Folder/NewSales/1/file.txt
- Folder/NewSales/2/file.txt
- Folder/OldSales/1/file.txt
- Folder/OldSales/2/file.txt
If the schemas for files under NewSales and OldSales are similar, and the bridge Root Folder parameter is set to Folder or above, the bridge creates a single dataset with two partition key fields. The first “application” partition key field contains NewSales and OldSales, and the second partition key field contains Day partitions. To create two separate NewSales and OldSales datasets, specify them in the bridge Partition Directories parameter.
In file system databases, like Amazon Athena or Apache Presto, each dataset corresponds to a file system path with all the files in it. When files have different schemas, the database does not recognize different files within the same path as separate datasets. This might lead to queries in the database that return zero results. When you are preparing datasets for file system databases, you can use the discussion in this chapter to disambiguate similar but different neighboring datasets.
Files-based partitioned dataset (ETL)
ETL tools can access files in a folder by pattern, like *.*, *.csv, or a?c.*. In this case, a partition folder represents a dataset.
- Sales/2017-08-01.csv
- Sales/2017-08-02.csv
- Sales/2017-09-01.csv
When the bridge encounters an independent folder with similar files, it imports it as a dataset. This can allow you to resolve an ETL connection with the dataset model. Technically, this folder is both a partition folder and a partitioned dataset, or a dataset without partition keys.
The bridge tries to infer the partition information from the names of folders but not files.
Incremental import
When the bridge runs more than once, perhaps on a schedule, it looks for new or changed files or partitioned datasets (datasets). The output of the bridge includes new datasets. It can include details about all partition folders optionally.
A partitioned dataset can have thousands of folders and millions of files. It can take the bridge a long time to analyze them to detect the dataset. The bridge tries to avoid re-doing the detection work by using its file cache. The cache has the previous import metadata about the folder hierarchy, imported files, and partitioned datasets. During the incremental import, the bridge can reuse the information to avoid spending time re-detecting partitioned datasets and figuring out schemas of previously imported and potentially unchanged datasets.
The bridge supports the following incremental import levels:
- New objects: Use it when schemas of (file or partition) datasets do not change over time. It could be useful for large and rarely changing file systems that take a long time to reimport. The bridge assumes that a file can grow in size but its schema doesn't change. The bridge uses its file cache to avoid re-importing objects even when their modification time changed.
- New and changed objects (default ?): Use it when schemas of files or partitioned datasets can change between re-imports. In this case, the bridge spends time looking for the latest file in each imported partitioned dataset and uses the file to represent the dataset. The bridge reimports a file dataset when it is changed.
- All objects: The bridge imports all objects under the Root Folder from scratch, ignoring its file cache. Internally, it runs with the -cache.clear option.
Particular dataset reimport
TBD: When the bridge imports a lot of datasets at once and only some of them change once in a while you can use the New Objects import mode and still request the bridge to import changes for particular datasets using the -dataset.reimport option.
The option takes the path of the dataset relative to the Root Folder. For example, to re-import the Sales dataset located under the Warehouse folder you can specify -dataset.reimport Warehouse/Sales. You can use the option to re-import the schema of a file dataset. For example, -dataset.reimport Folder/file.csv.
You can use the option to optimize the New Objects import mode to reuse stable datasets by default and import changes for manually specified datasets.
The Data Catalog can automate the option specification by allowing users to mark particular datasets as evolving. The solution should allow users to find marked datasets and un-mark them in bulk.
Time dimension
The bridge tries to detect the time dimension of a partitioned dataset. Alternatively, the user can specify the information using the -time.dimension option. When the information is available the bridge uses it to find the latest file representing the dataset. It can improve the bridge change detection performance dramatically, especially when importing the metadata remotely.
Sales/2017/08/01/file0.csv
-time.dimension Warehouse/Sales/YYYY/MM/DD
Sales/2017/Feb/01/06-25-43/file0.csv
-time.dimension Warehouse/Sales/YYYY/MONTHDAY/DD/TIME
The Data Catalog can automate the option specification by allowing users to specify the partition field time format. The solution should allow users to find marked datasets and un-mark them in bulk.
TBD: MM supports Custom data profiling SQL that helps it to find the most relevant (latest) rows to represent a dataset. We can try to do something similar for file system datasets by allowing users to specify a “time dimension” or “partition pattern” where we can find the most relevant (latest) files.
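For illustration only, a Python sketch of how a time dimension could be used to pick the latest partition of a dataset; the pattern handling is simplified and the actual bridge behavior may differ:

def latest_partition(paths, time_dimension="Sales/YYYY/MM/DD"):
    # The positions of the YYYY/MM/DD segments in the pattern identify which folder
    # levels carry the date, so partition paths can be compared chronologically.
    date_positions = [i for i, seg in enumerate(time_dimension.split("/"))
                      if seg in ("YYYY", "MM", "DD")]
    def sort_key(path):
        segs = path.split("/")
        return tuple(int(segs[i]) for i in date_positions)
    return max(paths, key=sort_key)

paths = ["Sales/2017/08/01/file0.csv", "Sales/2017/09/02/file1.csv"]
print(latest_partition(paths))  # Sales/2017/09/02/file1.csv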
Partition folder
When the -partition.folder option is specified the incremental import collects details about partition folders found since a previous import. When the schemas are compatible, new partitions will be added to existing partitioned datasets.
Including Models from an erwin Mart
The erwin Data Modeler Mart is a repository of erwin Data Modeler models. One may connect to the Mart and harvest any subset (or all) of those models as needed.
In addition, one may include separate models in a configuration (see configuration management) using MANAGE > Repository, picking those individual models, and then dragging the Mart model into the configuration.
Since these individual models appear as stand-alone models in the metadata explorer UI, scheduling, logging, etc., will refer back to the Mart model as a whole.
Picking Individual Mart Models
You may include separate models in a configuration (see configuration management) using the metadata manager UI and picking those individual models when dragging the Mart model into the configuration.
Steps
-
Sign in as a user with at least the Metadata Editing capability object role assignment on the configuration.
-
Go to the MANAGE > Repository.
-
Right click on a folder in the Repository Tree where you want to place the model containing the results of the import and select New > Model.
-
Import the model from the Mart.
-
Go to MANAGE > Configuration for the configuration you wish to add individual models to.
-
Click Add > External Model and select the Mart model in the Repository Tree.
-
Select the option to include individual models.
-
Pick the models you wish to include.
You may always repeat this process to include more models. To remove one, simply remove it from the configuration.
Add a New custom attribute
Export to a 3rd Party Tool
In addition, these comments may then be exported out of MetaKarta and opened in the external metadata tool, there to be reviewed and edited in the original external metadata model format (where supported in the external metadata tool user interface).
-
Sign in as a user with at least the Metadata Viewing capability object role assignment on the model.
-
Navigate to the object page of the report
-
Select More actions... > Export.

-
Choose the export bridge and provide the bridge parameters as needed.
-
Click Export.

Forward Engineering Export menus are only available with certain licenses.
Models imported from databases are now multi-model (one model per schema in Oracle, or per database in SQL Server); therefore the export menu (e.g. to a BI tool like Tableau) is only available on the object page for the database schema, via the More Actions menu in the upper right.
