External Dissemination Ideas Portal

Interact with Temis Luxid

Following our meeting with DKI a couple of weeks ago, we agreed to do a prototype/exploration-type story to see how easy or complicated it would be to integrate the Temis webservice into Kappa v3. Thierry V kindly provided the documentation attached to this ticket.

Limitations

The limitations and restrictions that we've set ourselves or that are inherent to Temis are:

  • Only work with summaries (as they are short texts which are well handled by the cartridge); don't use larger documents
  • Only work with English summaries, because the cartridge only works for English and French (this is not a problem at all, because the contents of the other languages should yield the same results)
  • Only keywords and countries are returned by this cartridge (no theme/topic yet)
  • The current implementation (where no "zoning" is used) will require us to extract the content parts of the document, and discard parts of the text that are recurring and of no interest. This will probably mean using the XML version and converting it to a text-only equivalent.
  • Although it might be a worthy contender, we won't use vertx or another message bus.
  • The Luxid API is only accessible within the OECD intranet (sorry Basheer!).

Use cases

The two use cases that we've identified are:

  1. Temis will analyse an English summary on import into Kappa v3 via a trigger
  2. Temis will analyse on request via the user interface

1. On import, via trigger

When an English summary is imported for the first time or when it is updated, it should be sent to Temis. When Temis returns, the keywords and countries extracted should be added to the Work.

As this is a non-interactive action, no feedback for a user is required or expected. Obviously, once imported the displayed Work section should display the concepts and countries.

When reloading a summary, existing concepts and countries should be deleted and the document should be re-analysed.

2. On request, via UI

We need to add a button so that appropriately authorized users can trigger the Temis analysis manually. While the content to be analysed will be the English expression, I would argue we should have the button on the Work level as it is there that the result will be displayed. The button should probably only be there if we're looking at a Summary that has an English expression.

The user will expect some kind of visual feedback in the user interface:

  • display of the extracted information in the Work section
  • status message regarding success or error

If there are already concepts and countries that have been extracted, it should be possible for the user to re-analyse the document. This should delete the existing concepts and countries and then trigger the same action as before. A warning may be displayed for the user.
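
Both use cases share the same "delete, then re-analyse" step, so it may help to see it as one small function. This is only a sketch: the dict-shaped Work, the key names concepts and countries, and the analyse callable standing in for the Temis call are all assumptions made for illustration, not Kappa v3 code.

```python
from typing import Callable


def reanalyse(work: dict, text: str, analyse: Callable[[str], dict]) -> dict:
    """Drop previously extracted data, then store whatever the analysis returns.

    `work` is a hypothetical dict representation of a Work; `analyse` is a
    hypothetical stand-in for the call to the Temis web service.
    """
    # Existing concepts and countries are deleted first, as described above.
    work["concepts"] = []
    work["countries"] = []

    # Same action as for a first-time analysis.
    result = analyse(text)
    work["concepts"] = list(result.get("concepts", []))
    work["countries"] = list(result.get("countries", []))
    return work


if __name__ == "__main__":
    def fake_analyse(_text: str) -> dict:
        # Canned answer so the sketch runs without the intranet-only service.
        return {"concepts": ["wages"], "countries": ["OECD countries"]}

    work = {"concepts": ["stone"], "countries": []}
    print(reanalyse(work, "some summary text", fake_analyse))
```

The same function could be called from the import trigger and from the button handler, with only the feedback to the user differing between the two.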

Data

If all goes well, Luxid returns something like this (incidentally, this is the result for In It Together: Why Less Inequality Benefits All [1]):

<enrichment>
    <annotation annotationPlanName="Document Classification (Plan_DC)" annotationDate="2016/01/27 10:14:01">
        <docID value="conv-609397273"/>
        <language value="English"/>
        <ContentFilter>No chapter to filter out for conv-609397273</ContentFilter>
        <coverage relevance="normal" uri="http://kim.oecd.org/Taxonomy/GeographicalAreas#LatinAmerica" score="0.6640791" zScore="-0.99999994">Latin America</coverage>
        <coverage relevance="high" uri="http://kim.oecd.org/Taxonomy/GeographicalAreas#OECDCountries" score="8.6069765" zScore="1.0000001">OECD countries</coverage>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T1181" score="0.60154617" zScore="-0.9954106">middle class</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T1184" score="0.6266513" zScore="-0.9891159">social mobility</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T194" score="5.785686" zScore="0.30443656">labour</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T2044" score="1.3454864" zScore="-0.80887854">stone</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T2253" score="1.0170248" zScore="-0.8912355">hand tools</subject>
        <subject relevance="high" uri="http://kim.oecd.org/Taxonomy/Topics#T2381" score="11.104547" zScore="1.638063">household</subject>
        <subject relevance="high" uri="http://kim.oecd.org/Taxonomy/Topics#T2836" score="7.2742634" zScore="0.67767555">labour market</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T2852" score="0.6359601" zScore="-0.9867819">employment security</subject>
        <subject relevance="high" uri="http://kim.oecd.org/Taxonomy/Topics#T2940" score="12.10087" zScore="1.8878765">wages</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T3208" score="0.90914965" zScore="-0.9182836">women's participation</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T3803" score="2.4825416" zScore="-0.5237786">women workers</subject>
        <subject relevance="high" uri="http://kim.oecd.org/Taxonomy/Topics#T5808" score="16.50421" zScore="2.9919496">income inequality</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T6359" score="0.72237843" zScore="-0.9651137">social partners</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T847" score="3.184035" zScore="-0.34788936">wealth</subject>
        <subject relevance="normal" uri="http://kim.oecd.org/Taxonomy/Topics#T980" score="4.278329" zScore="-0.07351119">women</subject>
    </annotation>
</enrichment>

Obviously, we need to record whether it's a subject or a coverage (I would argue for using country internally). Then, at a minimum, we need to keep the subject's and coverage's label as returned by Luxid, and the uri value. As Luxid will return a maximum of 15 subjects and 15 coverages, we will take them all. I guess we can also record the other properties (especially relevance, but also score and zScore).
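
To make the data to be recorded more concrete, here is a minimal sketch that reads a response shaped like the one above using Python's standard xml.etree.ElementTree. The Annotation class, its field names and the parse_enrichment function are hypothetical names chosen for illustration; only the element and attribute names come from the Luxid sample.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    kind: str        # "subject" or "country" (Luxid calls the latter "coverage")
    label: str       # label exactly as returned by Luxid
    uri: str
    relevance: str   # "normal" or "high" in the sample above
    score: float
    z_score: float


# Luxid element name -> internal name (using "country" is the proposal above).
KIND_BY_TAG = {"subject": "subject", "coverage": "country"}


def parse_enrichment(xml_text: str) -> list[Annotation]:
    """Collect every subject and coverage element from a Luxid response."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        kind = KIND_BY_TAG.get(element.tag)
        if kind is None:
            continue  # docID, language, ContentFilter, ...
        found.append(Annotation(
            kind=kind,
            label=(element.text or "").strip(),
            uri=element.get("uri", ""),
            relevance=element.get("relevance", ""),
            score=float(element.get("score", "nan")),
            z_score=float(element.get("zScore", "nan")),
        ))
    return found


# Shortened copy of the sample response, two entries only.
SAMPLE = """<enrichment>
  <annotation annotationPlanName="Document Classification (Plan_DC)" annotationDate="2016/01/27 10:14:01">
    <docID value="conv-609397273"/>
    <coverage relevance="high" uri="http://kim.oecd.org/Taxonomy/GeographicalAreas#OECDCountries" score="8.6069765" zScore="1.0000001">OECD countries</coverage>
    <subject relevance="high" uri="http://kim.oecd.org/Taxonomy/Topics#T2940" score="12.10087" zScore="1.8878765">wages</subject>
  </annotation>
</enrichment>"""

if __name__ == "__main__":
    for annotation in parse_enrichment(SAMPLE):
        print(annotation)
```

Keeping relevance, score and zScore next to the label and uri costs little and leaves the option of filtering on them later.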

Tasks ("Implementation, implementation, implementation")

  • Extend Work schema to accommodate concepts (i.e. keywords) and countries covered
  • Add XSLT (not necessarily 1.0!!) to discard non-essential sections and create a text-only output. Specifically, only the contents of the body element should be used, and even more specifically, the section that has a heading containing the word "Disclaimer" (and/or that has an id attribute that starts with disclaimer) must be excluded ... argh this is getting ugly but that's life ...
    This should be written in a way as to allow us to plug in other XSLT extractors for other content types later on.
  • Adapt UI to display them in the Work section (there should be some indication that they have been extracted automatically via Luxid, "no human being, only Luxid(tm) was abused when extracting these concepts and countries")
  • Adapt UI to add button to the Work section of a Summary that has an English expression
  • Adapt import procedure to add a trigger action
  • Create a trigger action that calls Temis web service
  • If called via import, integrate returned results in Pyramid before saving
  • If called via UI insert returned results (if any good) into Work section
  • If called via UI display appropriate message (success or failure)
  • If called via UI reload Work section (or, probably, more likely, reload the whole page?)
  • Creation of a role like semantic_enrichment that can be given to users for easier identification (instead of an opaque pac role or similar).
  • Bonus: authorized users may delete inappropriate countries or concepts as they see fit via the UI. Ideally, this would not require a reload of the page.
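
The extraction rule in the XSLT task above (body only, minus the disclaimer section) can be illustrated as follows. Note that this sketch deliberately uses Python's xml.etree.ElementTree instead of XSLT, purely to show the logic. The element names section and heading, the sample document and the EXTRACTORS registry with its "summary" key are assumptions; only the body element, the "Disclaimer" heading and the disclaimer id prefix come from the task description. Matching is done case-insensitively here, which is also an assumption.

```python
import xml.etree.ElementTree as ET


def is_disclaimer(section: ET.Element) -> bool:
    """True if the id starts with "disclaimer" or the heading mentions it."""
    section_id = section.get("id", "")
    heading = section.find("heading")  # hypothetical element name
    heading_text = "".join(heading.itertext()) if heading is not None else ""
    return (section_id.lower().startswith("disclaimer")
            or "disclaimer" in heading_text.lower())


def extract_summary_text(xml_text: str) -> str:
    """Return the text of the body element, skipping disclaimer sections."""
    root = ET.fromstring(xml_text)
    body = root if root.tag == "body" else root.find(".//body")
    if body is None:
        return ""
    chunks = []

    def walk(element: ET.Element) -> None:
        if element.tag == "section" and is_disclaimer(element):
            return  # drop the whole section, including nested elements
        if element.text:
            chunks.append(element.text)
        for child in element:
            walk(child)
            if child.tail:  # text following the child belongs to the parent
                chunks.append(child.tail)

    walk(body)
    return " ".join("".join(chunks).split())  # collapse whitespace


# One extractor per content type, so that others can be plugged in later.
EXTRACTORS = {"summary": extract_summary_text}

SAMPLE = """<document>
  <head><title>Not part of the body</title></head>
  <body>
    <section id="intro">
      <heading>Summary</heading>
      <p>Income inequality has risen.</p>
    </section>
    <section id="disclaimer-1">
      <heading>Disclaimer</heading>
      <p>This text recurs in every summary.</p>
    </section>
  </body>
</document>"""

if __name__ == "__main__":
    print(EXTRACTORS["summary"](SAMPLE))
```

An XSLT version would express the same two conditions as a template that matches the disclaimer section and outputs nothing.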

Miscellaneous

Errors

Luxid may return

  • an error (only 415 unsupported media type identified so far),
  • an empty result (no coverage or subject elements) or
  • a result with a stack trace (ditto).

See Thierry's document for examples.

415 "Unsupported Media Type" is triggered if

  • body of POST request is empty
  • body of POST request contains content that is not text, PDF or XML (although this might differ, because Thierry says that Luxid can handle more formats; I tested with XSL and PNG, but not PPTX, XLSX, ...)
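
The outcomes listed above could be told apart roughly as in the sketch below, which the UI could use to pick its success or failure message. The function name and the returned labels are hypothetical. The stack-trace case is not detected separately because its format is only shown in Thierry's document; depending on what it looks like, it would end up as "empty" or "unparseable" here.

```python
import xml.etree.ElementTree as ET


def classify_response(status: int, body: str) -> str:
    """Map an HTTP status and response body to a coarse outcome label."""
    if status == 415:
        # Empty request body, or content that is not text, PDF or XML.
        return "unsupported_media_type"
    if status >= 400:
        return "error"  # assumption: any other 4xx/5xx is treated as a failure
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unparseable"
    has_results = any(el.tag in ("subject", "coverage") for el in root.iter())
    return "ok" if has_results else "empty"


if __name__ == "__main__":
    print(classify_response(415, ""))
    print(classify_response(200, "<enrichment><annotation/></enrichment>"))
    print(classify_response(
        200,
        "<enrichment><annotation><subject uri='u'>wages</subject></annotation></enrichment>",
    ))
```

Since an empty POST body is known to trigger a 415, the caller could also simply avoid sending empty content in the first place.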

Zoning

Apparently, it's possible to send some Temis-specific XML that describes zones in the document which can be handled by different cartridges. This can be interesting in the future but is not part of this prototype.

[1] http://kappa.oecd.org/v3/Expression/Details/bc9f5d0b

  • Claudia Tromboni
  • May 31 2017
  • Shipped
  • Jakob Fix commented
    June 08, 2017 09:52

    This was formerly known as https://pacps01.oecd.org/redmine/issues/10983 but this ticket has mysteriously disappeared (despite the fact that it is referenced from elsewhere). :-(

  • Admin
    Claudia Tromboni commented
    July 26, 2017 11:42

    Sorry, I did not realise it was referenced, my fault. It did not disappear: I erased the old tickets created before this site and put them here. Sorry if it created an inconvenience.

  • Admin
    Claudia Tromboni commented
    July 26, 2017 11:45

    It was left here and not on the redmine because, due to an abundance of projects and a lack of resources, it was not something which was going to happen soon. It will very shortly become more "d'actualité" in this form or another.

  • Admin
    Carolina Tobon commented
    August 25, 2017 15:01

    Jakob, could you please elaborate on the benefits or the reasoning behind storing the enrichment at Work level? Is it to allow finding the same content in other languages, if the content has been enriched in more than one language? 

  • Jakob Fix commented
    August 29, 2017 11:42

    Regarding Carolina's question as to the reasons of storing semantic information on the work level:

     

    Luxid will extract the following information:

    • countries that are covered by the publication
    • subjects (aka keywords)
    • topics

     

    All of these are concepts, i.e. they are language-independent. And countries, subjects and keywords (I think they are known as "Forms" now) are already stored at the work level.  Or is there an underlying point to your question that I'm not seeing?