What IS Semi-Structured Data?

Data. The term is used so frequently, by so many, yet we rarely stop to think about what we are actually talking about. Data typically means “pieces” of information that are displayed in a specific way. But it gets more complicated than that: data can mean slightly different things in different contexts. According to Webopedia, “Data can exist in a variety of forms — as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person’s mind.” Even Webster’s dictionary lists three definitions of data:

1:  factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation 

2:  information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

3:  information in numerical form that can be digitally transmitted or processed

And it gets even more complicated from there. There are different KINDS of data. When it comes to databases, different terms describe how the data is organized: structured, semi-structured, and unstructured. Today we are going to discuss semi-structured data: what it is and what benefits it has.

What is semi-structured data?

Semi-structured data is a data model that is based on trees.  I don’t mean the green stuff that you put in your living room during the holidays.  No, what I mean is that semi-structured data is hierarchical.  Here is a breakdown of semi-structured data thanks to the University of London Department of Computer Science:

Semi-structured data is:

  • organised into semantic entities
  • similar entities are grouped together
  • entities in same group may not have same attributes
  • order of attributes not necessarily important
  • not all attributes may be required
  • size of same attributes in a group may differ
  • type of same attributes in a group may differ
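
To make that list concrete, here is a minimal Python sketch. The "customer" records and their field names are made up for illustration; they are not taken from any particular system.

```python
# Three hypothetical "customer" entities that belong to the same group.
customers = [
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"email": "bob@example.com", "name": "Bob"},        # no phone, different attribute order
    {"name": "Cy", "phone": ["555-0101", "555-0102"]},  # phone is a list here, and no email
]

# Attributes that every entity has versus attributes that appear at least once.
key_sets = [set(c) for c in customers]
shared_attributes = set.intersection(*key_sets)
all_attributes = set.union(*key_sets)

# The same attribute can have a different type (and size) from one entity to the next.
phone_types = {type(c["phone"]).__name__ for c in customers if "phone" in c}

print(sorted(shared_attributes))  # attributes present in every record
print(sorted(all_attributes))     # attributes present in at least one record
print(sorted(phone_types))        # types seen for "phone"
```

All three records are still recognisably customers, even though only `name` appears in every one of them.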

Where does semi-structured data come from?

Semi-structured data comes from situations where application data does not have a rigid, predefined schema. It also comes from the need for a flexible schema that can cope with changes to data records and the relationships between them while still maintaining structure and type safety.

In these cases the schema is not given in advance; it is descriptive, partial, evolving, and usually very large. When the data cannot be modeled naturally or usefully using a standard data model, voilà, you have semi-structured data!
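
One way to picture a descriptive, evolving schema is as a running summary of whatever fields have shown up so far. The sketch below is only an illustration of that idea; the `observe` helper and the sample records are hypothetical.

```python
def observe(schema, record):
    """Fold one record into a running description of the fields seen so far."""
    schema["records"] += 1
    for field, value in record.items():
        entry = schema["fields"].setdefault(field, {"seen": 0, "types": set()})
        entry["seen"] += 1
        entry["types"].add(type(value).__name__)
    return schema

# The schema starts empty and is described by the data, not declared up front.
schema = {"records": 0, "fields": {}}
for record in [
    {"title": "Intro", "tags": ["data"]},
    {"title": "Part 2", "author": "Ada"},
    {"title": "Part 3", "tags": "data", "views": 10},
]:
    observe(schema, record)

# Fields that happened to appear in every record so far.
always_present = sorted(
    field for field, entry in schema["fields"].items()
    if entry["seen"] == schema["records"]
)
print(always_present)
```

Each new record can add a field or a type, so the description stays partial and keeps changing as more data arrives.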

In the early development of semi-structured data there were XML-style schemas, which hold the structure, type, and data in a single document. Modern web formats like JSON are better optimized for the web, using a more concise structure, and are native to JavaScript applications.
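
As a rough comparison, here is the same made-up record written both ways and read with Python's standard library. The element and field names are assumptions chosen for the example.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical "person" record as XML...
xml_doc = """
<person id="42">
  <name>Ada</name>
  <languages>
    <language>English</language>
    <language>French</language>
  </languages>
</person>
""".strip()

# ...and the same record as JSON.
json_doc = '{"id": 42, "name": "Ada", "languages": ["English", "French"]}'

root = ET.fromstring(xml_doc)
from_xml = {
    "id": int(root.get("id")),  # XML attribute values arrive as strings
    "name": root.findtext("name"),
    "languages": [el.text for el in root.iter("language")],
}
from_json = json.loads(json_doc)  # JSON maps directly onto dicts, lists, and numbers

print(from_xml == from_json)
```

Both documents carry the same tree. The JSON version is shorter and needs no hand-written mapping into native data structures.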

Real-world applications and platforms need a way to evolve with time. Using the web as an example, web forms can change and capture different data for different users. In a traditional relational database, that means changing the database schema every time a new field is needed on a form, and it creates inefficiencies when fields are left empty. In some cases a default value must be used just to make legacy code work.

Semi-structured data fills this gap by allowing a user to capture any data in any structure without changing the database schema or the code. The application simply uses what it needs at a given point and ignores the rest. Adding or removing data is easy and should not break any functionality or dependencies.
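
Here is a small sketch of that "use what you need, ignore the rest" idea. It assumes form submissions are stored as JSON strings; the `mailing_entry` helper and the field names are hypothetical.

```python
import json

# Hypothetical submissions from a form that has gained fields over time.
submissions = [
    '{"email": "ada@example.com", "name": "Ada"}',
    '{"email": "bob@example.com", "name": "Bob", "newsletter": true}',
    '{"email": "cy@example.com", "company": {"name": "Acme", "size": 12}}',
]

def mailing_entry(raw):
    """Pick out only the fields the mailing list cares about."""
    record = json.loads(raw)
    return {
        "email": record["email"],        # treated as required in this sketch
        "name": record.get("name", ""),  # optional, so fall back to an empty string
    }

entries = [mailing_entry(raw) for raw in submissions]
for entry in entries:
    print(entry)
```

The newer `newsletter` and `company` fields pass through storage untouched, and this piece of code never has to know they exist.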

What can you do with a semi-structured data model/schema? 

So what is the point of semi-structured data?  Why not just use structured data to build software? Here are some of the benefits of using semi-structured data:

  • Gives you a flexible representation of data.  Data can also evolve and change over time without a need for configuration or code changes.
  • Allows you to collect and use data from multiple sources with differences in notation and meaning.  Your data is no longer tied to a single type (table).
  • Describes relationships as references.
  • Incorporates relationships completely into parent objects (tree).
  • Supports complex queries over the structure and storage of the data, while relationships between objects and complex schemas are still expressed.
  • Spans queries and reporting over many systems and data types.
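
The two relationship styles in the list above, references and embedding into a parent object, can be sketched with plain Python dictionaries. The order and customer records, and the `resolve` helper, are invented for this example.

```python
# Relationship embedded completely into the parent object (a tree).
order_embedded = {
    "id": "o1",
    "customer": {"id": "c7", "name": "Ada"},
    "items": [{"sku": "A-1", "qty": 2}],
}

# The same relationship described as a reference to a separately stored record.
customers = {"c7": {"id": "c7", "name": "Ada"}}
order_referenced = {
    "id": "o1",
    "customer_ref": "c7",
    "items": [{"sku": "A-1", "qty": 2}],
}

def resolve(order, customers_by_id):
    """Return a copy of the order with the customer reference replaced by the record."""
    resolved = dict(order)
    customer_id = resolved.pop("customer_ref")
    resolved["customer"] = customers_by_id[customer_id]
    return resolved

print(resolve(order_referenced, customers) == order_embedded)
```

Embedding keeps everything about an order in one place, while references avoid copying the same customer into every order. Which one fits better depends on how the data is read and updated.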

In other words, semi-structured data is used to exchange data between heterogeneous data sources or to integrate them.  This is what allows our data audit log software, Observato, to track data from multiple systems all at the same time.

How do you analyze semi-structured data?

Because of the nature of semi-structured data, you are no longer looking at specific types of data (tables). Instead you take a more holistic approach, looking at data types and patterns that span all of the monitored systems.  Semi-structured data lets you take an interest in how different connected objects are changed together or in a workflow. This gives you the ability to analyze the growth or decline of patterns and to track user activity (who changed what, and when). The latter is what our product Observato can do for you.

Semi-structured data examines individual properties (columns / fields), not objects (tables).  The boundaries between objects are less strictly enforced, allowing users to track and examine events happening in many systems at the same time and the effect they have on the data. Joining data from multiple sources over many different types is an easy task with semi-structured data.
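
As a rough illustration of looking at properties rather than tables, the sketch below merges change events from two made-up systems into one "who changed what, when" timeline. The event shapes and field names are assumptions for the example, and this is not how Observato itself is implemented.

```python
from collections import defaultdict

# Hypothetical change events from two systems that use different shapes.
crm_events = [
    {"user": "ada", "at": "2024-01-05T10:00:00",
     "changes": {"email": "ada@new.example", "phone": "555-0100"}},
]
billing_events = [
    {"changed_by": "ada", "timestamp": "2024-01-05T10:02:00", "invoice": {"status": "paid"}},
    {"changed_by": "bob", "timestamp": "2024-01-05T09:30:00", "invoice": {"amount": 120}},
]

def crm_changes(events):
    """Yield one entry per changed property in the CRM event shape."""
    for event in events:
        for prop in event["changes"]:
            yield {"who": event["user"], "when": event["at"],
                   "system": "crm", "property": prop}

def billing_changes(events):
    """Yield one entry per changed property in the billing event shape."""
    for event in events:
        for prop in event["invoice"]:
            yield {"who": event["changed_by"], "when": event["timestamp"],
                   "system": "billing", "property": prop}

# ISO 8601 timestamps in the same format sort correctly as plain strings.
timeline = sorted(
    [*crm_changes(crm_events), *billing_changes(billing_events)],
    key=lambda change: change["when"],
)

# Group the merged timeline by user.
by_user = defaultdict(list)
for change in timeline:
    by_user[change["who"]].append((change["when"], change["system"], change["property"]))

for who, changes in by_user.items():
    print(who, changes)
```

Once every event is reduced to individual property changes, it no longer matters which system or object type a change came from.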

What’s the point?  Semi-structured data is everywhere, and it is closer than you think.  You will find semi-structured data built into most websites because of its flexibility, its ability to easily integrate data, and its fit with the evolving schema of a website.  At the same time, semi-structured data is also vastly misunderstood.  Datawatch calls semi-structured data the “duck-billed platypus of the data kingdom.” It is everything in between structured and unstructured data.  But as big data booms, semi-structured data gives computer scientists and organizations the flexibility to adapt, integrate, and reflect data in new ways.
