Everything2

Building a Concept Network to represent bibliographic references

created by parmentier

(idea) by parmentier, Mon Jul 16 2001 at 17:33:46

In order to allow BAsCET to recognize the logical structure of bibliographic references, an appropriate Concept Network has to be built. This Concept Network has to contain knowledge about the logical structure of the bibliographic references in the database, and also about their physical structure. The model can be divided according to the genericity of its nodes, or according to their logical or physical aspect.

From the genericity point of view, the model contains two main parts: the generic one and the specific one (containing terms of the database). The figure below shows these two parts and their links (REFERENCE contains AUTHOR, SEP:AU-T, TITLE, etc.; AUTHOR followed by SEP:AU-T; A:C.Y.Suen instance of A; W:ocr co-occurs with Y:1993). The generic part contains the field hierarchy and the separators between the fields. The specific part contains only logical data, since the physical data is generic (for references of one type, the field separators are the same).

From the physical/logical point of view, the nodes beginning with SEP are the separators and are the physical nodes of the Concept Network; the others are the logical ones, that is to say, they belong to the database.

Figure 1: Concept Network's structure for the references

        +--------------------REFERENCE--------------------+ 
        |            +-------+ |   |                      | 
        v            v         v   v                      v                Generic 
     AUTHOR---->SEP:AU-T-->TITLE-->SEP:T-... --->...---->YEAR
        |                     |                            |
        v                     v                            |
A <-> SEP:A-A             SEP:W-W <--> W                   |
^                                      ^                   |
|                                      |                   +----------+
+----------------+              +------+----+              |          |
|~~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~~~~~~|~~~~~~~~~~|~~~~~~~~~~~
+-+              |              |           |              |          |   
  v              v              v           v              v          v
A:C.Y.Suen <-> A:S.N.Srihari  W:ocr <-> W:document  ...  Y:1995    Y:1993
   ^              ^  ^         ^ ^         ^  ^            ^ ^        ^ 
   |              |  |         | |         |  |            | |        |
   |              |  +---------------------+  +------------+ |        |
   |              |            | |                           |        |
   |              +------------------------------------------+        |   Specific
   +---------------------------+ +------------------------------------+
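The two divisions shown in Figure 1 can be sketched as a small data structure. This is only an illustration, not BAsCET's actual representation: the `Node` and `ConceptNetwork` classes, the `generic` flag and the link labels are names assumed here. The only rule taken from the text is that nodes whose names begin with SEP are the physical ones.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    generic: bool                               # True: generic part, False: specific part
    links: dict = field(default_factory=dict)   # target node name -> link type

    @property
    def physical(self) -> bool:
        # Separator nodes (names beginning with "SEP") are the physical nodes.
        return self.name.startswith("SEP")


class ConceptNetwork:
    """Hypothetical container; the real network also carries weights."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, name, generic):
        return self.nodes.setdefault(name, Node(name, generic))

    def link(self, source, target, kind):
        self.nodes[source].links[target] = kind


net = ConceptNetwork()
for name in ("REFERENCE", "AUTHOR", "SEP:AU-T", "TITLE", "A"):
    net.add_node(name, generic=True)
for name in ("A:C.Y.Suen", "W:ocr", "Y:1993"):
    net.add_node(name, generic=False)

# The links listed in the text for Figure 1.
net.link("REFERENCE", "AUTHOR", "contains")
net.link("REFERENCE", "SEP:AU-T", "contains")
net.link("AUTHOR", "SEP:AU-T", "followed by")
net.link("A:C.Y.Suen", "A", "instance of")
net.link("W:ocr", "Y:1993", "co-occurs with")

physical = [n.name for n in net.nodes.values() if n.physical]
specific = [n.name for n in net.nodes.values() if not n.generic]
print(physical)   # ['SEP:AU-T']
print(specific)   # ['A:C.Y.Suen', 'W:ocr', 'Y:1993']
```

In this sketch every physical node is also generic, which matches the statement that the specific part contains only logical data.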

The automatic building of the Concept Network implies the use of a knowledge database and of conversion tools. The bibliographic database, which uses the BibTeX format, is converted into XML (using Dilib).

Figure 2 shows the format of a reference from the database. The first step is to recast it into XML, using a Dilib tool. Figure 3 gives the result of this operation.

Figure 2: Example of a BibTeX reference of the article type

@ARTICLE{joseph92a,
  AUTHOR = {S. H. Joseph and T. P. Pridmore},
  TITLE = {Knowledge-Directed Interpretation of Mechanical
           Engineering Drawings},
  JOURNAL = {IEEE Transactions on PAMI},
  YEAR = {1992},
  NUMBER = {9},
  VOLUME = {14},
  PAGES = {211--222},
  MONTH = {September},
  KEYWORDS = {segmentation, forms},
  ABSTRACT = {The approach is based on item extraction}
}

Figure 3: BibTeX reference, converted to XML

<doc>
     <ref>joseph92a</ref>
     <author><a>S. H. Joseph</a><a>T. P. Pridmore</a></author>
     <title><mot>Knowledge-Directed</mot><mot>Interpretation</mot><mot>of</mot>
            <mot>Mechanical</mot><mot>Engineering</mot><mot>Drawings</mot>
     </title>
     <journal>IEEE Transactions on PAMI</journal>
     <year>1992</year>
     <number>9</number>
     <volume>14</volume>
     <pages>211--222</pages>
     <month>September</month>
     <keywords><k>segmentation</k><k>forms</k></keywords>
</doc>
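The mapping from Figure 2 to Figure 3 can be approximated with a few lines of Python. This is a rough sketch and not the Dilib tool: it assumes the BibTeX fields are already available as a dictionary, and the function name `reference_to_xml` is invented here. The splitting rules (authors on " and ", title into `<mot>` words, keywords on commas) are read off the two figures.

```python
import xml.etree.ElementTree as ET


def reference_to_xml(key, fields):
    """Build a <doc> element shaped like Figure 3 from already-parsed fields."""
    doc = ET.Element("doc")
    ET.SubElement(doc, "ref").text = key
    for name, value in fields.items():
        text = " ".join(value.split())          # collapse line breaks inside a field
        element = ET.SubElement(doc, name.lower())
        if name.lower() == "author":
            parts, tag = text.split(" and "), "a"
        elif name.lower() == "title":
            parts, tag = text.split(), "mot"
        elif name.lower() == "keywords":
            parts, tag = [k.strip() for k in text.split(",")], "k"
        else:
            element.text = text                 # simple field: keep the text as is
            continue
        for part in parts:
            ET.SubElement(element, tag).text = part
    return doc


fields = {
    "AUTHOR": "S. H. Joseph and T. P. Pridmore",
    "TITLE": "Knowledge-Directed Interpretation of Mechanical\n  Engineering Drawings",
    "YEAR": "1992",
    "KEYWORDS": "segmentation, forms",
}
doc = reference_to_xml("joseph92a", fields)
print(ET.tostring(doc, encoding="unicode"))
```

Real BibTeX needs a proper parser (nested braces, string macros, "and" inside braces); the sketch deliberately skips that step.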

An advantage of using a BibTeX database is that its physical version is easy to obtain, using LaTeX and dvips. A program was written that "subtracts" the logical version from the physical one and thereby extracts the separators automatically. This makes it possible to handle many bibliographic styles: one does not need to know the printing rules of every style; an unknown style is treated like a known one.

This is what Figure 4 shows: a tool translating the logical references into PostScript and into the XML format used is sufficient to generate a Concept Network adapted to the recognition of references that use the same bibliographic style, in the format of the database.
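The "subtraction" of the logical version from the physical one can be illustrated on plain text. The sketch below assumes the printed reference has already been turned into a string (the real system starts from PostScript), and the rendered line, the function name `extract_separators` and the `SEP:X-Y` labels are made up for the example. Whatever text lies between two consecutive field values is taken to be their separator.

```python
def extract_separators(rendered, fields):
    """Return (label, text) pairs for the gaps between consecutive fields.

    `fields` is a list of (short name, value) pairs in printed order.
    """
    separators = []
    position = 0          # end of the previously matched field
    previous = None
    for name, value in fields:
        start = rendered.find(value, position)
        if start < 0:
            raise ValueError(f"field {name!r} not found in rendered text")
        if previous is not None:
            # Everything between the two fields is the separator.
            separators.append((f"SEP:{previous}-{name}", rendered[position:start]))
        position = start + len(value)
        previous = name
    return separators


# Hypothetical rendering of the reference of Figure 2 in some style.
rendered = ("S. H. Joseph and T. P. Pridmore. Knowledge-Directed Interpretation "
            "of Mechanical Engineering Drawings, 1992.")
fields = [
    ("A", "S. H. Joseph"),
    ("A", "T. P. Pridmore"),
    ("T", "Knowledge-Directed Interpretation of Mechanical Engineering Drawings"),
    ("Y", "1992"),
]
separators = extract_separators(rendered, fields)
print(separators)
# [('SEP:A-A', ' and '), ('SEP:A-T', '. '), ('SEP:T-Y', ', ')]
```

Nothing in the function depends on the style that produced `rendered`, which is the point made above: an unknown style is handled like a known one.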

Figure 4: Building a Concept Network for bibliographic references

                                 Generic

       __----- XML-------> Structure extraction  ---->+----------------+
      /                   (fields and sub-fields)     |                |
references                                            |                |
database ---PostScript---> Separators detection ----->|  Statistic     +---> Concept Network
(BibTeX)                                              | (occurrences,  |
      \                                               | co-occurrences)|
       +-------XML-------> Terms extraction --------->|                |
                          (fields instances)          +----------------+

                                 Specific
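The "Terms extraction" branch of Figure 4 can be sketched as follows. The XML is an abridged copy of Figure 3; the `A:`, `W:` and `Y:` prefixes follow the node names of Figure 1. The function name `extract_terms` and the lowercasing of title words are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML = """<doc>
     <ref>joseph92a</ref>
     <author><a>S. H. Joseph</a><a>T. P. Pridmore</a></author>
     <title><mot>Knowledge-Directed</mot><mot>Interpretation</mot></title>
     <year>1992</year>
</doc>"""


def extract_terms(doc):
    """Turn field instances into names of specific nodes."""
    terms = [f"A:{a.text}" for a in doc.iterfind("author/a")]
    # Lowercasing is an assumption, suggested by "W:ocr" in Figure 1.
    terms += [f"W:{w.text.lower()}" for w in doc.iterfind("title/mot")]
    year = doc.findtext("year")
    if year:
        terms.append(f"Y:{year}")
    return terms


terms = extract_terms(ET.fromstring(XML))

# Occurrence counters: each term counts once per reference it appears in.
occurrences = Counter()
occurrences.update(set(terms))
print(terms)
```

Running the same function over every `<doc>` of the database and updating one shared `Counter` would give the occurrence statistics that feed the specific part of the network.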

The previous nodes show the workings of a system that automatically builds a Concept Network from a BibTeX database and a bibliographic style. From the logical point of view, the Concept Network is divided into two parts: the generic one, containing the field hierarchy, and the specific one, containing the instances of the leaves of the field hierarchy and the links between these instances. The physical aspect of the network is located entirely in its generic part, because it has to be general for a bibliographic style: it consists only of "separator" nodes, each linking two fields of the same level in the field hierarchy and contained in a higher-level node. No node from the specific part gives information about the physical aspect of a reference (at least, not about its typography or its punctuation).

Since the computation of the nodes' attributes, and especially of the co-occurrence links between the specific nodes, depends only on occurrence and co-occurrence counting, one can consider building a system that "learns" when it discovers something. That is to say, it could integrate a recognized reference (possibly validated by a human operator) into the Concept Network, simply by incrementing counters in the links (co-occurrence counters) and in the nodes (occurrence counters). Apart from the fact that the Concept Network would then have to store the counters rather than the weights, some problems arise: how can one know that the recognized reference is not already in the network? It may already exist there in a slightly different form. In that case, it would be better not to skew the statistics by adding co-occurrences that already exist. This is a complex problem of data uniqueness in a database, of distance between two records... Fortunately, as soon as the number of references is sufficient, the relative importance of adding duplicates diminishes. Indeed, all the weights that are likely to change are local, that is to say, they depend on the number of occurrences of the nodes from which their links start. One can thus assert that the effect of adding duplicates to the Concept Network is negligible as soon as the number of known references is already large.
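The argument about duplicates can be checked on a toy example. The sketch below keeps raw counters, as suggested above, and derives a weight only when asked. The class `CountingNetwork` and the weight formula (co-occurrences of a link divided by the occurrences of its source node) are assumptions chosen to match the description of weights as local; the actual formula used in the Concept Network is not given here.

```python
from collections import Counter
from itertools import combinations


class CountingNetwork:
    """Hypothetical store that remembers counters rather than weights."""

    def __init__(self):
        self.occurrences = Counter()     # node -> number of references containing it
        self.cooccurrences = Counter()   # sorted (node, node) pair -> count

    def integrate(self, terms):
        """Add one recognized reference by incrementing counters only."""
        unique = sorted(set(terms))
        self.occurrences.update(unique)
        self.cooccurrences.update(combinations(unique, 2))

    def weight(self, source, target):
        """Assumed local weight: co-occurrences / occurrences of the source."""
        if self.occurrences[source] == 0:
            return 0.0
        pair = tuple(sorted((source, target)))
        return self.cooccurrences[pair] / self.occurrences[source]


def duplicate_shift(n_with, n_without):
    """Change of weight W:ocr -> Y:1993 when one reference is added twice."""
    net = CountingNetwork()
    for _ in range(n_with):
        net.integrate(["W:ocr", "Y:1993"])
    for _ in range(n_without):
        net.integrate(["W:ocr"])
    before = net.weight("W:ocr", "Y:1993")
    net.integrate(["W:ocr", "Y:1993"])   # the duplicate
    return net.weight("W:ocr", "Y:1993") - before


small = duplicate_shift(1, 1)        # 2 known references containing W:ocr
large = duplicate_shift(100, 100)    # 200 known references containing W:ocr
print(f"{small:.4f} {large:.4f}")    # 0.1667 0.0025
```

With 2 known references the duplicate moves the weight from 1/2 to 2/3; with 200 it moves from 100/200 to 101/201, which is the sense in which duplicates become negligible as the database grows.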

