Archive for November, 2016

Ephesoft: Four Features of Practical Innovation

By Jake Karnes, ECM Consultant at Zia Consulting

With an eye to the future, Ephesoft continues to deliver practical innovation that improves the capabilities and usability of its core platform, Ephesoft Transact. Ephesoft demonstrates this commitment to current and future customers with four new features: cross-section extraction, automatic data conversion, automatic regular expression suggestion and creation, and paragraph extraction.

Ephesoft INNOVATE 2016 brought together leading minds to discuss the latest industry advances. Software companies face the persistent challenge of delivering practical innovation while staying true to their product’s role in a customer’s organization. Ephesoft tackles this problem with a two-pronged approach. Ephesoft remains on the cutting edge of document capture technology with their new big data analytics platform—Ephesoft Insight. Insight promises to extract content and meaning from documents scattered across an organization using machine learning and patented text-based analysis. In addition to pushing the envelope with Insight, Ephesoft is continuing to expand and strengthen Ephesoft Transact.

Feature 1: Cross-Section Extraction

Ephesoft Transact, formerly Ephesoft Enterprise, adds several powerful features in the upcoming 4.1 release. These features have their roots in customer feedback and provide out-of-the-box functionality that previously required customization. One such feature is cross-section extraction. This technique uses the intersection of two keys to find the correct value. In the example below, the two keys are “Services Borrower Did Not Shop For” and “Borrower-Paid,” which meet at the value “$236.55.” This triangulation using multiple keys allows for the extraction of values that are ill-suited to existing extraction methods such as table extraction.

[Image: closing cost details]
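The idea of intersecting two keys can be sketched in a few lines of Python. This is only an illustration of the concept, not how Ephesoft Transact implements it: the grid layout, the function name cross_section, and every value other than “$236.55” are assumptions made up for the example.

```python
# Hypothetical grid standing in for a recognized region of a closing
# disclosure. Only the two keys and "$236.55" come from the article;
# the other rows and amounts are invented.
GRID = [
    ["", "Borrower-Paid", "Seller-Paid"],
    ["Origination Charges", "$1,802.00", ""],
    ["Services Borrower Did Not Shop For", "$236.55", ""],
]


def cross_section(grid, row_key, column_key):
    """Return the cell where the row labelled row_key meets the column headed column_key."""
    header = grid[0]
    column = header.index(column_key)  # raises ValueError if the column key is absent
    for row in grid[1:]:
        if row[0] == row_key:
            return row[column]
    raise KeyError(row_key)


value = cross_section(GRID, "Services Borrower Did Not Shop For", "Borrower-Paid")
print(value)
```

Neither key is enough on its own: the row key alone matches a whole line of amounts, and the column key alone matches a whole column. Only the pair pins down a single cell.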

Feature 2: Automatic Data Conversion

Another feature that comes from business use cases is automatic data conversion. This feature allows extracted dates and other values to be automatically normalized to a standard format. For example, a date extracted as “MAR 21 2016” can automatically be converted to “03/21/2016” and vice versa. Other possible data conversions include predefined suffixes and prefixes, data replacement, upper or lower case conversion, and more. One novel use for this functionality would be to clean up imperfect OCR results. The extraction rules could be defined to allow for missing or erroneous characters, and the values could then be corrected during this data conversion step by removing or substituting the known, correct character(s).

[Image: date conversion]
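To make the two kinds of conversion concrete, here is a small Python sketch. It is not Ephesoft’s implementation. The function names, the default format strings, and the OCR substitution table are assumptions chosen for illustration, and the month parsing assumes an English locale.

```python
from datetime import datetime


def normalize_date(raw, source_format="%b %d %Y", target_format="%m/%d/%Y"):
    """Re-express a date string in another format."""
    # .title() turns "MAR" into "Mar" so the abbreviation matches %b.
    return datetime.strptime(raw.title(), source_format).strftime(target_format)


# Hypothetical substitutions for characters OCR commonly confuses in
# fields that are known to be numeric.
OCR_FIXES = str.maketrans({"O": "0", "l": "1", "S": "5"})


def clean_amount(raw):
    """Replace look-alike letters with the digits they most likely were."""
    return raw.translate(OCR_FIXES)


print(normalize_date("MAR 21 2016"))
print(normalize_date("03/21/2016", "%m/%d/%Y", "%b %d %Y").upper())
print(clean_amount("$2O6.5S"))
```

A substitution table like this is only safe on fields where letters can never be valid, which is why it belongs in a per-field conversion step rather than in the OCR output as a whole.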

Feature 3: Automatic Regular Expression Suggestion and Creation

Another example of Ephesoft’s dedication to improving user experience by expanding Transact’s functionality is the new, automatic regular expression suggestion and creation. Ephesoft has recognized the pain of writing regular expressions by hand, and helps minimize these efforts by suggesting regular expressions automatically. These suggestions are sourced from Ephesoft’s own library of common regular expressions, such as emails, dollar amounts, and dates. But Ephesoft can even help you create custom regular expressions based on the examples provided during extraction training. This strikes a powerful balance between the flexibility to write your own and the ease of having them automatically suggested or created for you. The usefulness of regular expressions is now unlocked without burdening the user with learning the complex regular expression notation. As an added bonus, this feature is already included in the latest release of Transact, and further information can be found at Ephesoft’s wiki page here or in a video demonstration below.
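As a rough illustration of what deriving a regular expression from a training example could look like, the Python sketch below generalizes each character of a sample value into a character class. This is a deliberately naive approach of my own and not Ephesoft’s algorithm; the function names char_class and suggest_regex are hypothetical.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to the regex fragment that generalizes it."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return "[A-Za-z]"
    if ch.isspace():
        return r"\s"
    return re.escape(ch)  # punctuation is kept literally


def suggest_regex(example):
    """Build a pattern matching values shaped like the example."""
    parts = []
    for fragment, run in groupby(char_class(c) for c in example):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)


pattern = suggest_regex("$236.55")
print(pattern)
print(bool(re.fullmatch(pattern, "$981.10")))
```

A pattern built from a single example is rigid about lengths, so a real tool would need several examples or a library of common patterns to decide when to loosen a quantifier.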

Feature 4: Paragraph Extraction

Paragraph extraction demonstrates Ephesoft Transact’s capability to mine valuable information from unstructured documents. This feature enables the user to define values to be extracted from within larger bodies of text, without specific keywords or fixed locations. As an example, consider the following sections of a mortgage note:

Paragraph extraction can be used to extract each of the highlighted values. Even values which wrap around multiple lines (e.g. “Super Mortgage Inc”) can be handled with ease. Previously, this would have required custom scripting or a complex combination of different extraction techniques. Paragraph extraction allows the user to unlock information from their documents which may have been unused before.
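The line-wrapping problem is easy to see in a small Python sketch. The note text below is a hypothetical excerpt written for this example (only the lender name “Super Mortgage Inc” comes from the article), and the patterns and the extract helper are assumptions, not Ephesoft’s paragraph extraction rules.

```python
import re

# Hypothetical excerpt of a mortgage note; the lender name wraps across lines.
NOTE = """In return for a loan that I have received, I promise to pay
U.S. $250,000.00 plus interest, to the order of the Lender. The Lender is Super Mortgage
Inc, and I will make all payments under this Note by check or money order."""


def extract(pattern, text):
    """Search text with line breaks collapsed, so wrapped values match."""
    flat = " ".join(text.split())
    match = re.search(pattern, flat)
    return match.group(1) if match else None


lender = extract(r"The Lender is (.+?),", NOTE)
principal = extract(r"U\.S\. (\$[\d,]+\.\d{2})", NOTE)
print(lender)
print(principal)
```

Collapsing whitespace before matching is what lets the lender name be captured even though it spans two lines in the source text.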

These features indicate that Ephesoft’s innovation is not limited to their groundbreaking analytics platform. They continue to implement practical innovation which is equally important for new and existing customers. These features provide straightforward solutions to common pain points. By inviting and accepting feedback from their customers and partners, Ephesoft is pushing the capture industry forward on multiple fronts.


Jake Karnes is an ECM Consultant at Zia Consulting. He extends and integrates Ephesoft and Alfresco to create complete content solutions. In addition to client integrations, Jake has helped create Zia stand-alone solutions such as mobile applications, mortgage automation, and analytic tools. He’s always eager to discuss software down to the finest details. You can find Jake on LinkedIn.

Tech Post: Extracting Metadata in Alfresco

Extracting Metadata in Alfresco

by Jeff Rosler, Solutions Architect at Zia

When files are imported, each is stored with additional information such as title, description, and text. Out of the box, Alfresco uses Apache Tika to extract metadata from the content and sets the properties that have been mapped. The TikaAutoMetadataExtracter class loads the supported mime types, so all you have to do is create a bean that references that class and set the properties you want extracted.

The following are some simple samples of how metadata can be pulled from different mime types and set on Alfresco properties. Since Apache Tika is used as a basic metadata extractor in Alfresco, you can use it to extract metadata for all the mime types that it supports. The version of Tika that Alfresco currently uses (for Alfresco 5.0.2.5 and 5.1) is essentially Tika 1.6, which supports the following file types. The TikaAutoMetadataExtracter class loads all the mime types that the embedded version of Tika supports. So, all you need to do is create a spring bean that references that class, list the embedded properties to extract, and map them to the Alfresco properties you’d like to have set. You don’t have to write any custom code.

Example 0 – Set logging to see what metadata can be extracted

Before defining your metadata extraction, it’s good to set your logging level for metadata extraction to DEBUG. When you do this, the extracted metadata for a file is shown in the log. This lets you correctly choose the embedded metadata property names to configure. You can set this by going to your log4j.properties file for the repo (alfresco) and adding the following line.

log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG

Restart Alfresco and import a file. You should see something like this in the log. You can see properties with namespaces such as dc:title (the dc stands for Dublin Core, a metadata standard) as well as other properties that don’t contain a namespace. You can use these embedded properties to map to standard or custom Alfresco properties.

2016-02-03 10:03:49,474 DEBUG [content.metadata.AbstractMappingMetadataExtracter]
 [http-bio-8080-exec-10] Extracted Metadata from ContentAccessor[ 
 contentUrl=store://2016/2/3/10/3/068b7c2b-1f7f-4b12-aa90-e78794eb8e77.bin, 
 mimetype=application/vnd.openxmlformats-officedocument.wordprocessingml.document,
 size=286436, encoding=UTF-8, locale=en_US]
 Found: {date=2016-01-22T18:59:00Z, Total-Time=1, extended-properties:AppVersion=14.0000,
 meta:paragraph-count=12, subject=beer, ipsum, meta:print-date=2016-01-22T18:59:00Z,
 Word-Count=405, meta:line-count=45, Manager=null, Template=Normal.dotm, Paragraph-Count=12,
 meta:character-count-with-spaces=2246, dc:title=Tom's Ipsum Beer, modified=2016-01-22T18:59:00Z,
 meta:author=Jeff Rosler, meta:creation-date=2015-12-31T15:49:00Z,
 Last-Printed=2016-01-22T18:59:00Z, extended-properties:Application=Microsoft Macintosh Word,
 author=Jeff Rosler, created=2015-12-31T15:49:00Z, Creation-Date=2015-12-31T15:49:00Z,
 Character-Count-With-Spaces=2246, Last-Author=Jeff Rosler, Character Count=1853, Page-Count=2,
 Application-Version=14.0000, extended-properties:Template=Normal.dotm, Author=Jeff Rosler,
 publisher=Zia Consulting, meta:page-count=2, cp:revision=4,
 Keywords=beer, ipsum, meta:word-count=405,
 dc:creator=Jeff Rosler, extended-properties:Company=Zia Consulting,
 description=beer, ipsum, dcterms:created=2015-12-31T15:49:00Z,
 Last-Modified=2016-01-22T18:59:00Z, dcterms:modified=2016-01-22T18:59:00Z,
 title=Tom's Ipsum Beer, Last-Save-Date=2016-01-22T18:59:00Z, meta:character-count=1853,
 Line-Count=45, meta:save-date=2016-01-22T18:59:00Z, Application-Name=Microsoft Macintosh Word,
 extended-properties:TotalTime=1, extended-properties:Manager=null,
 Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document,
 creator=Jeff Rosler, comments=null, dc:subject=beer, ipsum, meta:last-author=Jeff Rosler,
 xmpTPg:NPages=2, Revision-Number=4, meta:keyword=beer, ipsum, dc:publisher=Zia Consulting}
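A log entry like the one above is hard to scan by eye, so it can help to pull out just the property names. The Python sketch below is a convenience of my own, not part of Alfresco. The function name embedded_keys and the sample text are hypothetical, and the pattern assumes that values never contain an equals sign.

```python
import re


def embedded_keys(log_entry):
    """List the embedded property names from a 'Found: {...}' debug entry."""
    found = log_entry.split("Found:", 1)[-1]
    # A key starts after "{" or after a comma and runs up to "=".
    # Pieces of comma-separated values (e.g. "beer, ipsum") are skipped
    # because they are not followed by "=".
    keys = re.findall(r"(?:\{|,\s*)([^=,{}]+)=", found)
    return sorted({key.strip() for key in keys})


# Shortened sample in the same shape as the log output.
SAMPLE = """Found: {date=2016-01-22T18:59:00Z, subject=beer, ipsum,
 dc:title=Tom's Ipsum Beer, Character Count=1853}"""

for key in embedded_keys(SAMPLE):
    print(key)
```

The names this prints are the ones that go on the left side of the mappings in the properties file.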

Example 1 – Set author, title, description

Specify your spring bean. You can name the id anything you want (that is a legitimate XML id) and point to the TikaAutoMetadataExtracter class (yes I know, that isn’t the way you spell Extractor, but the code has misspelled Extractor with an “e” instead of an “o”). In the code block below, we are overriding the default mapping and pointing to a separate property file. The properties could have been listed inline here, but pointing to the property files allows for easier editing.

 

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

   <bean id="extractor.auto" class="org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter" parent="baseMetadataExtracter">
      <constructor-arg>
         <ref bean="tikaConfig"/>
      </constructor-arg>
      <property name="inheritDefaultMapping">
         <value>false</value>
      </property>
      <property name="mappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
               <value>classpath:alfresco/extension/TikaAutoMetadataExtracter.properties</value>
            </property>
         </bean>
      </property>
   </bean>

</beans>

After specifying your spring bean that points to a properties file (e.g. TikaAutoMetadataExtracter.properties), set any Alfresco namespaces you’re using for the content model within the properties file, followed by each property to be mapped. The embedded metadata property name goes on the left of the equal sign and the Alfresco property on the right. If you map to properties defined on aspects, those aspects will be applied to the content node automatically during extraction. If you are specifying an embedded property that has a namespace prefix (e.g. dc:title), remember to escape the colon with a backslash (e.g. dc\:title). You only need to do that on the key (the left side), not on the value.

 

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
# Mappings
author=cm:author
dc\:title=cm:title
description=cm:description
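The reason for escaping the colon is that, in Java properties files, an unescaped colon in the key acts as a key/value separator just like the equal sign. The Python sketch below mimics that rule in simplified form (it ignores whitespace separators, comments, and line continuations) to show what happens with and without the backslash. The function name split_property_line is hypothetical.

```python
def split_property_line(line):
    """Split a properties-style line at the first unescaped '=' or ':'.

    Simplified illustration of the Java properties rule, not a full parser.
    """
    key = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            key.append(line[i + 1])  # escaped character is taken literally
            i += 2
            continue
        if ch in "=:":
            return "".join(key).strip(), line[i + 1:].strip()
        key.append(ch)
        i += 1
    return "".join(key).strip(), ""


print(split_property_line(r"dc\:title=cm:title"))
print(split_property_line("dc:title=cm:title"))
```

With the backslash, the key is dc:title and the value is cm:title. Without it, the key is cut off at dc, so the mapping would never match the embedded property.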

Example 2 – Setting multiple Alfresco properties 

Embedded metadata can be mapped to multiple Alfresco properties by specifying those properties as comma-separated values. The example below shows setting the embedded author value to both cm:author and cm:description.

 

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
 
# Mappings
author=cm:author,cm:description
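Conceptually, a comma-separated mapping fans one embedded value out to several target properties. The Python sketch below illustrates that effect only; it is not Alfresco’s code, and the function name apply_mapping is hypothetical.

```python
def apply_mapping(extracted, mapping):
    """Copy each extracted value to every Alfresco property it is mapped to."""
    result = {}
    for embedded_name, targets in mapping.items():
        if embedded_name not in extracted:
            continue  # nothing was extracted for this mapping
        for target in targets.split(","):
            result[target.strip()] = extracted[embedded_name]
    return result


extracted = {"author": "Jeff Rosler", "Page-Count": "2"}
mapping = {"author": "cm:author,cm:description"}
print(apply_mapping(extracted, mapping))
```

Embedded properties without a mapping, such as Page-Count here, are simply ignored.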

Example 3 – Specifying when properties are extracted

The metadata extractor has a setting called OverwritePolicy, which specifies when an Alfresco property is overwritten. For example, you might not want your extractor to overwrite properties every time a new version of a file is stored, as this would overwrite any mapped property values that were updated manually via Share or automatically through actions, workflows, or other processes. Therefore, Alfresco defaults the OverwritePolicy to PRAGMATIC. This basically means a value is set only if the extracted property is not null and the Alfresco property is not set or is empty.

However, if you want to change the behavior so that the extraction happens all the time (e.g. when content is updated), then you should set the OverwritePolicy to EAGER. This can be done by passing that as a parameter within your extractor bean as can be seen below.

<bean id="extractor.auto" class="org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter" parent="baseMetadataExtracter">
   <constructor-arg>
      <ref bean="tikaConfig"/>
   </constructor-arg>
   <property name="inheritDefaultMapping">
      <value>false</value>
   </property>
   <property name="overwritePolicy">
      <value>EAGER</value>
   </property>
   <property name="mappingProperties">
      <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
         <property name="location">
            <value>classpath:alfresco/extension/TikaAutoMetadataExtracter.properties</value>
         </property>
      </bean>
   </property>
</bean>
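The difference between the two policies can be summarized as a small decision function. The Python sketch below follows the description given in this post rather than Alfresco’s source code. The function name should_overwrite is hypothetical, and the rule that a missing extracted value never overwrites under EAGER is an assumption of this sketch.

```python
def should_overwrite(policy, extracted_value, current_value):
    """Decide whether an extracted value replaces the current Alfresco value."""
    if extracted_value is None:
        return False  # assumption: a missing extracted value never overwrites
    if policy == "EAGER":
        return True  # overwrite whatever is there
    if policy == "PRAGMATIC":
        # only fill in properties that are not set or are empty
        return current_value is None or str(current_value) == ""
    raise ValueError(f"unsupported policy in this sketch: {policy}")


print(should_overwrite("PRAGMATIC", "Tom's Ipsum Beer", "Edited in Share"))
print(should_overwrite("EAGER", "Tom's Ipsum Beer", "Edited in Share"))
```

Under PRAGMATIC the title edited in Share survives a new upload, while under EAGER it is replaced by the embedded title.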

Example 4 – Setting tags

Support for mapping tags was added in Alfresco 4.2.c. Details are mentioned in this blog post. You can easily add that to your extraction mapping. It just needs to be enabled in the extract-metadata bean and then the mapping set within your properties file.

NOTE: When setting tags, don’t do this while running from the Alfresco SDK using springloaded. Tagging won’t work and as soon as you try and import some content with tags (after you’ve made the updates below), your content will fail to load.

ALSO NOTE: I noticed in Alfresco 5.0 that the embedded keywords are getting concatenated into a single comma separated tag. This has been identified as a bug and a JIRA (MNT-15497) was created for fixing it. The fix was put in 5.0.4 and 5.1.1.

The following code block can be added to your spring bean xml config file to enable tagging.

 

<!--
    Override metadata extraction bean from action-services-context.xml to turn on the taggingService and enableStringTagging
    This will allow keywords to get mapped to tags.
 -->
<bean id="extract-metadata" class="org.alfresco.repo.action.executer.ContentMetadataExtracter" parent="action-executer">
  <property name="nodeService">
    <ref bean="NodeService" />
  </property>
  <property name="contentService">
    <ref bean="ContentService" />
  </property>
  <property name="dictionaryService">
    <ref bean="dictionaryService" />
  </property>
  <property name="taggingService">
    <ref bean="TaggingService" />
  </property>
  <property name="metadataExtracterRegistry">
    <ref bean="metadataExtracterRegistry" />
  </property>
  <property name="applicableTypes">
    <list>
      <value>{http://www.alfresco.org/model/content/1.0}content</value>
    </list>
  </property>
  <property name="carryAspectProperties">
    <value>true</value>
  </property>
  <property name="enableStringTagging">
    <value>true</value>
  </property>
</bean>

After tagging is enabled, just update your property file to map the appropriate embedded property to cm:taggable. The example below uses the embedded Keywords property.

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
 
# Mappings
Keywords=cm:taggable
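For reference, the intended outcome is one tag per keyword, as opposed to the single concatenated tag described in the bug note above. The Python sketch below only illustrates that expected result; it is not how Alfresco implements tagging, and the function name keywords_to_tags is hypothetical.

```python
def keywords_to_tags(keywords):
    """Split a comma-separated keyword string into individual tags."""
    return [part.strip() for part in keywords.split(",") if part.strip()]


# "beer, ipsum" is the Keywords value from the log output earlier in this post.
print(keywords_to_tags("beer, ipsum"))
```

If your imported content shows a single tag that reads like the whole keyword string, you are most likely on a version affected by the bug.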

 

Jeff Rosler is a Solutions Architect at Zia Consulting. He has more than 15 years’ experience architecting and developing enterprise content management solutions for customers across multiple verticals to help solve different business challenges. These solutions include digital asset management, component content management using XML, business process management, and web content management utilizing Alfresco and related standards, technologies, and products.