US20080126080A1 - Creation of structured data from plain text - Google Patents
Creation of structured data from plain text
- Publication number
- US20080126080A1 (application Ser. No. 11/982,386)
- Authority
- US
- United States
- Prior art keywords
- maps
- map
- choosing
- dml
- nml
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99936—Pattern matching access
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99942—Manipulating data structure, e.g. compression, compaction, compilation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99943—Generating database or data structure, e.g. via user interface
Definitions
- a computer program listing appendix is included in the attached CD-R created on Dec. 12, 2000, labeled “Creation of Structured Data from Plain Text,” and including the following files: CommodityProperty.nml (13 KB), DefaultSeg14Result.xml (2 KB), ElectricalProperty.nml (16 KB), Example.txt, Grammar.txt, INML.xml (5 KB), MeasurementProperty.nml (22 KB), Output.txt (3 KB), PeriodProperty.nml (6 KB), PhysicalProperty.nml (36 KB), ReservedNameProperty.nml (6 KB), Seg14.nml (30 KB), Seg14Phrasing.nml (71 KB), UsageProperty.nml (7 KB), and Utility.nml (6 KB). These files are incorporated by reference herein.
- the present invention relates to creation of structured data from plain text, and more particularly, to creation of structured data from plain text based on attributes or parameters of a web-site's content or products.
- the search problem can be described on at least two levels: searching across multiple web-sites, and searching within a given site.
- the first level of search is often addressed by “search engines” such as Google™ or AltaVista™, or by directories such as Yahoo™.
- the second level, which is specific to the content of a site, is typically handled by combinations of search engines and databases. This approach has not been entirely successful in providing users with efficient access to a site's content.
- the problem in searching a website or other information-technology based service is composed of two subproblems: first, indexing or categorizing the corpora (body of material) to be searched (i.e., content synthesis), and second, interpreting a search request and executing it over the corpora (i.e., content retrieval).
- the corpora to be searched typically consist of unstructured information (text descriptions) of items.
- the corpora may be the catalog of the items available through that web-site.
- the catalog entry for a description might well be the sentence “aqua cashmere v-neck, available in small, medium, large, and extra large.” Such an entry cannot be retrieved by item type or attribute, since the facts that v-neck is a style of sweater, cashmere a form of wool, and aqua a shade of blue, are unknown to current catalogs or search engines.
- in order to retrieve the information that this item is available, by item type and/or attribute, this description must be converted into an attributed, categorized description.
- such an attributed, categorized description may include properly categorizing the item as a sweater, extracting the various attributes, and tagging their values. An example of such a description is illustrated in Table 1.
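By way of illustration (the patent's Table 1 is not reproduced in this excerpt), such an attributed description might look like the following sketch; the field names and values are assumptions, not the patent's actual schema.

```python
# Hypothetical attributed, categorized description of the catalog entry
# "aqua cashmere v-neck, available in small, medium, large, and extra large."
# Field names are illustrative only.

def attribute_catalog_entry():
    return {
        "category": "sweater",    # v-neck is a style of sweater
        "style": "v-neck",
        "material": "cashmere",   # a form of wool
        "color": "aqua",          # a shade of blue
        "sizes": ["small", "medium", "large", "extra large"],
    }

entry = attribute_catalog_entry()
```

Once the entry is in this form, retrieval by item type or attribute becomes a simple field lookup rather than free-text matching.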
- numeric codes are assigned to make the job of search and representation easier.
- One such code is the UN Standard Products and Services Code (UN/SPSC), which assigns a standard 8-digit code to any human product or service.
- the second problem is one of retrieving data successfully; once the data has been created and attributed, it must be accessible.
- E-commerce and parametric content sites are faced with a unique challenge, since they must offer search solutions that expose only those products, contents or services that exactly match a customer's specifications.
- as e-commerce web sites continue to offer their customers unmatched variety, category-based navigation of e-commerce sites (“virtual aisles”) has become increasingly complex and inadequate.
- users of many web-sites that offer a large catalog of products are often unable to find products with precise or highly parameterized specifications, and are instead required to review dozens of products that potentially match these specifications.
- Keyword-based search: In this method, users type a set of words or phrases describing what they want into a text box, typically on the main page of the site. A program on the site then takes each individual word entered (sometimes discarding “noise” words such as prepositions and conjunctions), and searches through all pages and product descriptions to find items containing any combination of the words. This method, when given an English sentence or phrase, either returns far too many results or too few. For example, if a customer requests, “show me men's blue wool sweaters,” the search could be unsuccessful for the following reasons: it would either return only those pages that contain all the words in this request, or return any page that contained any single word in the search.
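The two failure modes just described (AND semantics returning too few results, OR semantics returning too many) can be sketched as follows; the toy product pages and stop-word list are invented for illustration.

```python
# Minimal sketch of the two failure modes of keyword-based search.
# Pages and stop words are invented toy data.

STOP_WORDS = {"show", "me", "the", "a"}

PAGES = [
    "men's blue cotton shirt",
    "women's blue wool scarf",
    "men's green wool sweater",
]

def tokenize(text):
    return [w.strip("',.").lower() for w in text.split()]

def content_words(query):
    return [w for w in tokenize(query) if w not in STOP_WORDS]

def search_all(query, pages):
    """AND semantics: a page must contain every content word of the query."""
    words = content_words(query)
    return [p for p in pages if all(w in tokenize(p) for w in words)]

def search_any(query, pages):
    """OR semantics: a page matches if it contains any single content word."""
    words = content_words(query)
    return [p for p in pages if any(w in tokenize(p) for w in words)]

query = "show me men's blue wool sweaters"
# AND finds nothing (no page has all four words, and "sweaters" != "sweater");
# OR returns every page, since each contains at least one query word.
```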
- Keyword-based approaches are widely used in medical transcription applications, database access, voice-mail control and web search. Virtually all commercial natural-language interface products use this approach. In this approach, certain words are regarded as meaningful, and the remainder as meaningless “glue” words. Thus, for example, in the sentence “show all books written by Squigglesby” the words “show,” “book,” and “written” may be regarded as keywords, the word “by” as a meaningless glue word, and the word “Squigglesby” as an argument. The query would then be formed on the theory that a book author named Squigglesby was being requested.
- keywords are generally some of the common nouns, verbs, adverbs and adjectives, and arguments are proper nouns and numbers. There are exceptions, however. Prepositions are usually regarded as glue words, but in some circumstances and in some systems are regarded as keywords. Generally, this is due to the human tendency to omit words in sentences, known in the argot as “ellipsis.” The sentence “Show all books by Squigglesby” is an example of this, where the verb “written” is excluded. In order to cope with this, some keyword-based systems make “by” a keyword.
- Database approaches are an example of a widely used variant on keyword-based approaches.
- the database developer associates keywords or identifiers with specific database fields (columns in specific tables).
- Various words, specifically interrogative pronouns and adjectives, some verbs, and some prepositions, have fixed meanings to the database query program. All other words can be available as keywords for a template-based recognition system.
- the interface system may match the user's sentence to a template set constructed from the database developer's information about database structure and identifiers, and its built-in interpretation of its hardwired keywords.
- a Structured Query Language (SQL) statement would then be generated which encodes the meaning of the user's sentence, as interpreted by the interface system.
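Template-based SQL generation of this kind can be sketched as follows; the keyword-to-column table and the books schema are invented for illustration, since the source does not give a schema.

```python
# Hypothetical template-based SQL generation for sentences like
# "show all books written by Squigglesby". The mapping of keywords to
# database columns is invented; "by" is a hardwired keyword whose
# following word is taken as the argument.

KEYWORD_TO_COLUMN = {"written": "author", "published": "publisher"}

def sentence_to_sql(sentence):
    words = sentence.lower().rstrip(".").split()
    column = next((KEYWORD_TO_COLUMN[w] for w in words if w in KEYWORD_TO_COLUMN), None)
    arg = words[words.index("by") + 1] if "by" in words else None
    return f"SELECT * FROM books WHERE {column} = '{arg}'"

sql = sentence_to_sql("show all books written by Squigglesby")
```

Note that this encodes only the interface system's interpretation of the sentence; a different keyword table would yield a different query for the same words.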
- catalogs are databases of products and services.
- a “category” is the name of a table: the attributes of the category are some columns of the table.
- a question is first searched by a category word, and then the remainder of the question is used as keywords to search for matching items within the category. For example, “blue woolen sweater” would first search for “blue” “woolen” and “sweater” as keywords indicating a category, and then (assuming “sweater” succeeded as a category keyword and the others did not), for “blue” and “woolen” as keywords within the sweater category.
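The category-first strategy just described, and its failure on cross-category queries, can be sketched as follows; the toy catalog is invented for illustration.

```python
# Sketch of category-then-keyword catalog search: first find a category
# keyword, then use the remaining words as keywords within that category.
# The catalog contents are invented toy data.

CATALOG = {
    "sweater": ["blue woolen sweater", "green cotton sweater"],
    "scarf": ["blue silk scarf"],
}

def category_search(query):
    words = query.lower().split()
    category = next((w for w in words if w in CATALOG), None)
    if category is None:
        return []  # cross-category queries fail: no category keyword matched
    rest = [w for w in words if w != category]
    return [item for item in CATALOG[category]
            if all(w in item for w in rest)]
```

Here "blue woolen sweater" succeeds because "sweater" matches a category, but "blue woolen" alone returns nothing, illustrating the cross-category failure noted above.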
- the difficulty with this approach is that cross-category queries fail, since no individual category is available to match in such cases. Further, parameters that are not present in the product descriptions in the category are not used.
- Keywords have fixed semantics. This is a distinct departure from the use of normal language by humans. Words in natural language derive their meaning through a combination of “symbol” (the word itself) and “context” (the surrounding text and background knowledge). The most glaring example is prepositions in the presence of ellipses. For instance, “by” can indicate the subject of almost any transitive verb, as well as physical proximity or indicating an object or method to use to accomplish a particular task. Another example of meaning dependent on context is that “green” can refer to a color, a state of freshness or newness, or, disparagingly, to inexperience.
- Missed meanings attached to glue words, especially prepositions.
- An assumption behind keyword-based approaches is that glue words carry no meaning or semantic content.
- the words chosen as glue words are those whose meaning is most context-dependent, and thus their semantic content is largely missed.
- Free-form keyword search: This category replaces keywords with previously-asked questions and the “right” answers, and returns the answers to the typed-in question. Examples of such systems are described in detail in U.S. Pat. No. 5,309,359, entitled “Method and Apparatus for Generating and Utilizing Annotations to Facilitate Computer Text Retrieval,” issued on May 3, 1994 to Katz, et al., and U.S. Pat. No. 5,404,295, entitled “Method and Apparatus for Utilizing Annotations to Facilitate Computer Retrieval of Database Material,” issued on Apr. 4, 1995 to Katz, et al. In systems employing free-form keyword searching, questions and answers are stored as sets.
- the question is typically stored in a canonical form, and a rewrite engine attempts to rewrite the user question into this form. If the user question maps into a pre-determined question for which the answer is known, then the answer is returned by the system.
- Such an approach is used by http://www.AskJeeves.com for Web searching applications, and for lookups of frequently-asked questions (FAQs).
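The canonical-form rewriting described above can be sketched as follows; the rewrite rules and the stored question/answer pair are invented for illustration.

```python
# Sketch of free-form keyword search via stored question/answer pairs:
# the user question is rewritten into a canonical form and looked up.
# The canonicalization rules and the Q/A store are invented toy data.

QA = {"what be the capital of france": "Paris"}

REWRITES = {"what's": "what be", "is": "be", "are": "be"}

def canonicalize(question):
    words = question.lower().rstrip("?").split()
    return " ".join(REWRITES.get(w, w) for w in words)

def answer(question):
    # None if the canonical form matches no stored question
    return QA.get(canonicalize(question))
```

The limitation noted below falls directly out of this design: only questions whose canonical form is already stored can be answered at all.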
- the number of questions that can be answered is linearly proportional to the number of questions stored; thus, this method can only be used when it is acceptable to have a relatively small number of questions that can be answered by the system.
- Understanding-based searches: Systems incorporating understanding-based searches attempt to understand the actual meaning of a user's request, including social and background information.
- An example of such a system is Wilensky's UNIX-based Help system, UC. UC had built into it a simple understanding of a user's global goals. Wilensky explained that a consequence of not having such a deep understanding was that the system might offer advice which literally addressed the user's immediate question in a way that conflicted with the user's global goals.
- a specific example is that a request for more disk space might result in the removal of all the user's files—an action that met the immediate request, but probably not in a way that the user would find appropriate.
- Winograd's system accepted only a highly stylized form of English, and its natural-language abilities were entirely restricted to the blocks' domain. The emphasis in the system was on deducing the appropriate sequence of actions to achieve the desired goal, not on the understanding and parsing of unrestricted English.
- Database schemas are designed for rapid processing of database queries, not semantic information regarding the databases. Relationships between database tables are indicated by importing indicators from one table into another (called “foreign keys”). Using the relationships in the schema as a framework for questions ignores some vital relationships (since the relationship is not explicitly indicated by key importation).
- Prepositions and other “noise” words often carry significant semantic information, which is context-dependent.
- the preposition “by” may indicate either a publisher or an author, and may indicate the act of publishing or authoring a book.
- the present invention provides a system, method, and an architecture for receiving unstructured text, and converting it to structured data. In one embodiment, this is done by mapping the grammatical parse of a sentence into an instance tree of application domain objects. In addition, the present invention is portable across different application domains.
- a system in accordance with the present invention can be used for creating structured data from plain text, to allow for efficient storing of this structured data in a database.
- the structured data (which could be an extracted object and its attributes) can be used to create individual entries in a product database, and thus create content for an ecommerce website or web market.
- such a system can be used for creating structured data from a plain text query, and for using this structured data to retrieve relevant data from a database.
- a user's free text query can be converted to a database query that corresponds to the objects of the database and their attributes.
- the present invention recognizes that understanding natural language is neither required nor desired in generating structured data; rather, what is desired is the ability to map natural language onto program structure. Further, there is a natural relationship between the parse of the sentence as expressed in a parse tree and a component tree in a program. Thus, the natural language sentence is understood as instructions to build a component tree. A content engine takes in a natural language sentence and produces a program component tree. The component tree is then further simplified before it is passed to a program for execution.
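The idea of treating a sentence's parse as instructions for building a component tree can be sketched as follows; the phrase labels and the word-to-component bindings are invented for illustration, not taken from the patent's NML models.

```python
# Sketch of mapping a grammatical parse onto a tree of program components.
# A parse tree is a nested tuple (phrase, [children]); leaves are words.
# The word-to-component bindings are invented toy data.

class Component:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

WORD_TO_COMPONENT = {"blue": "Color", "sweater": "Sweater"}

def build_component_tree(node):
    phrase, children = node
    built = []
    for child in children:
        if isinstance(child, str):
            # a word: map it to a domain component (unknown words are dropped)
            if child in WORD_TO_COMPONENT:
                built.append(Component(WORD_TO_COMPONENT[child]))
        else:
            built.append(build_component_tree(child))
    # collapse single-child phrase nodes so only domain structure remains
    if len(built) == 1:
        return built[0]
    return Component(phrase, built)

parse = ("NounPhrase", [("Adjective", ["blue"]), ("Noun", ["sweater"])])
tree = build_component_tree(parse)
```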
- a system in accordance with the present invention can be used across various applications.
- the meaning of a word is dependent only on the application and the role of the word in the sentence.
- the definition of a word is largely the province of the application developer.
- words act as identifiers for components.
- a word in a sentence serves as an identifier for program objects.
- many words in English or other natural languages have multiple meanings with the meanings dependent upon context.
- a word may be used as an identifier for multiple objects.
- the present invention transforms an English sentence into a set of software objects that are subsequently passed to the given application for execution.
- One of the advantages of this approach is the ability to attach a natural language interface to any software application with minimal developer effort.
- the objects of the application domain are captured, in one embodiment, by using the Natural Markup Language (“NML”).
- the resulting interface is robust and intuitive, as the user now interacts with an application by entering normal English sentences, which are then executed by the program.
- an application enhanced with the present invention significantly augments the functionality available to a user.
- a parsing algorithm applies a formal context-free grammar for the natural language to derive all parses of a given sentence.
- English is used as an example of the natural language of the plain text.
- the present invention may be used for any natural language.
- all parses of the sentence are derived in the time taken to derive a single parse (e.g., concurrently).
- Preferably, all parses are stored in a single data structure whose size is dramatically smaller than the number of individual parse trees, often just a constant factor larger than the size taken to store a single parse tree.
- the correct map of a sentence is only known after all possible parses have been attempted.
- a mapping algorithm then uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest based on the application-specific NML model.
- the mapping algorithm maps each parse outputted by the parser, into an instance tree of objects. In one embodiment, this is done by generating instance trees, mapping each parse onto an instance tree, pruning the instance trees generated, and then using a best-match algorithm on the pruned trees to select the best match.
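The generate-prune-select sequence just described can be sketched as follows; the validity test and cost function here are illustrative stand-ins, since the patent's actual pruning and best-match criteria are not reproduced in this excerpt.

```python
# Sketch of the mapper pipeline: map each parse to a candidate instance
# tree, prune invalid candidates, keep the lowest-cost survivor.
# The toy "instance tree" is just the list of recognized domain words.

def map_parses(parses, to_instance, is_valid, cost):
    candidates = [to_instance(p) for p in parses]       # one tree per parse
    survivors = [t for t in candidates if is_valid(t)]  # prune invalid trees
    return min(survivors, key=cost) if survivors else None

DOMAIN_WORDS = {"blue", "wool", "sweater"}

best = map_parses(
    parses=[["blue", "sweater"], ["blue"], ["banana"]],
    to_instance=lambda p: [w for w in p if w in DOMAIN_WORDS],
    is_valid=lambda t: len(t) > 0,   # empty maps are pruned
    cost=lambda t: -len(t),          # prefer maps that cover more of the parse
)
```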
- a reduced form of the NML object description instance is created as an instance of a Domain Markup Language (“DML”). This DML is passed to the application program for execution.
- FIG. 1 is an illustration of the architecture of a system in accordance with an embodiment of the present invention.
- FIG. 2 is a block diagram of the components of the content engine.
- FIG. 3A is an example of a parse tree for “abb” using a first grammar.
- FIG. 3B is an example of two different parse trees for “abb” using a second grammar.
- FIG. 3C illustrates how various parse trees can be represented as a single parse DAG.
- FIG. 4 is a flowchart illustrating the functionality of the content engine.
- FIG. 5A illustrates one possible parse tree for the sentence “The boy helped the girl with the suitcase.”
- FIG. 5B illustrates another possible parse tree for the sentence “The boy helped the girl with the suitcase.”
- FIG. 5C illustrates how the different parse trees for the sentence “The boy helped the girl with the suitcase” can be represented as a single parse DAG.
- FIG. 6 is a flowchart illustrating the generation of instance trees by the mapper.
- FIG. 7 illustrates the pruning of invalid instance trees after all instance trees have been generated by the mapper.
- FIG. 8 illustrates a cost function employed by the mapper to pick the best map from the valid instance trees in accordance with an embodiment of the present invention.
- FIG. 9 is a flowchart illustrating DML generation in accordance with one embodiment of the present invention.
- FIG. 1 illustrates an overview of the architecture of a system in accordance with one embodiment of the present invention.
- the system comprises a content engine 110 , an online dictionary 120 , a domain dictionary 130 , a Natural Markup Language (“NML”) module 140 , a vertical domain concepts module 150 , a custom client specifications module 160 , a grammar storage 170 , and a client data module 182 .
- the content engine 110 receives as input plain text, parses it, and maps the parses into instance trees. As can be seen from FIG. 1 , in one embodiment of the present invention, the content engine 110 receives input from both the online dictionary 120 (which includes words in a natural language), and a domain dictionary 130 (which includes terms specific to a domain).
- the content engine 110 receives input from the NML module 140 , which contains an NML model specific to the application or domain for which the system is being used.
- the application-specific NML is created, in one embodiment, using a combination of automatic and manual editing from the vertical domain concepts obtained from the vertical domain concepts module 150 , and the custom client specifications obtained from the custom client specifications module 160 .
- the present invention is customized to a vertical domain 150 of application by creating an object oriented data model that represents the intended functionality of the site.
- An example of the vertical domain concepts 150 is taxonomy such as the United Nations Standard Product & Services Code (UN/SPSC).
- Another example of the vertical domain concepts 150 is the set of concepts that are pertinent to financial information for a company, such as company name, location, officers, products, competitors, annual sales, revenues, employees, etc.
- An example of custom client specifications 160 is a collection of concepts similar to the vertical domain concepts 150 , but specific to a web-site (i.e. not found on all web-sites that may be in the same domain).
- the grammar storage 170 stores a grammar for a particular language. In one embodiment, the grammar storage 170 stores a full context-free grammar for the English language.
- An example of such a grammar is included in the computer program listing appendix in file grammar.txt.
- the grammar shown in grammar.txt has its start symbol as <Paragraph>. The rules indicate that a <Paragraph> is composed of one or more <Sentence> symbols separated by <Terminator> symbols. Similarly, a <Sentence> is composed of a <Clause>, and so on. Grammars are discussed in greater detail below.
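A grammar of this shape can be sketched as a mapping from non-terminals to their possible right-hand sides; the two rules below paraphrase the description above and are not the actual contents of grammar.txt.

```python
# Representative sketch of a context-free grammar as a rule table:
# each non-terminal maps to a list of alternative right-hand sides.

GRAMMAR = {
    "Paragraph": [["Sentence"], ["Sentence", "Terminator", "Paragraph"]],
    "Sentence": [["Clause"]],
}

def is_nonterminal(symbol):
    return symbol in GRAMMAR

def expansions(symbol):
    """Right-hand sides for a non-terminal; empty for a terminal."""
    return GRAMMAR.get(symbol, [])
```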
- the content engine 110 also has access to a module containing client data 182 .
- This data is used for client-specific or dynamic vocabulary that does not transfer across client sites or applications. Examples of such vocabulary include trade or brand names (e.g. “Explorer”, “Expedition”, or “Excursion” for Ford sport utility vehicles, or the names of confections made by Hershey Foods Company).
- FIG. 2 illustrates the architecture of the content engine 110 in an embodiment of the present invention.
- the content engine 110 comprises a parser 210 , a mapper 220 , and a Domain Markup Language (“DML”) generator 230 .
- the parser 210 parses the text input by the user into all possible parses, based on the grammar stored in the grammar storage 170 .
- the parser 210 applies a formal context-free grammar for the language in which the user is working, to derive all parses of a given sentence.
- all parses are derived in the time taken to derive a single parse.
- all of the parses are stored in a single data structure of size equivalent to that taken to store a single parse tree.
- the parser 210 may generate meaningless parses, but this is acceptable because, as will be discussed below, these meaningless parses will not yield valid mappings into the NML and will be automatically discarded from consideration during the mapping process. The functionality of the parser 210 is discussed in greater detail below.
- the mapper 220 accesses all the parses of the text input by the user produced by the parser 210 .
- the mapper 220 uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest based on the application-specific NML model provided by the NML module 140 .
- the mapper 220 maps each parse outputted by the parser 210 , into an instance tree of objects.
- the functionality of the mapper 220 is discussed in detail below.
- the result of the mapper 220 is not the final result of the content engine 110 .
- the DML generator 230 reduces the structure produced by the mapper 220 to a simpler form.
- the generation of the DML is directed, in one embodiment, by DML_ELEMENT declarations contained in the NML model provided by the NML module 140 .
- the result of this process is to produce a document in the Domain Markup Language (“DML”).
- the DML description can then be passed as an input to the underlying application (not shown in the figures).
- the application takes the DML input and uses it to populate a database, using each instance tree as the description of an entity (and its attributes) in the application domain, and creating the appropriate entries in the database.
- the application takes the DML input and uses it as a query on an underlying database, to retrieve entries (e.g., products) that satisfy the query, and hence match the user's interests (to the extent that such interest is well expressed in the original text input).
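Both uses of the DML output (populating a database, and querying it) can be sketched as follows; the flat attribute-dictionary form of the "DML instance" here is a simplification of the tree structure the patent describes.

```python
# Sketch of the two downstream uses of a DML instance: as a record to
# insert into a product database, or as a query over it. A DML instance
# is simplified here to a flat dict of attribute names to values.

DATABASE = []

def populate(dml):
    """Content synthesis: store a DML instance as an entity description."""
    DATABASE.append(dict(dml))

def query(dml):
    """Content retrieval: return entries matching all attributes of the query."""
    return [e for e in DATABASE
            if all(e.get(k) == v for k, v in dml.items())]

populate({"category": "sweater", "color": "blue", "material": "wool"})
populate({"category": "sweater", "color": "green", "material": "cotton"})
matches = query({"category": "sweater", "color": "blue"})
```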
- a grammar is a series of mathematical objects called “productions,” which describe mathematically the well-formed “sentences” of the grammar.
- the symbols “S”, “A”, and “B” are called “non-terminals” or “phrases.” They represent purely abstract objects, which do not appear in any sentence in the language, but represent a group of symbols of a language sentence.
- the symbols “a” and “b” represent words in the language, and are called “terminals” or “words.”
- every grammar has a phrase “S” for “sentence”, which appears alone on the left-hand side of one production. A production is applied by replacing the left-hand side of the production with the right-hand side in a string.
- a sequence ⁇ of terminals is said to be derived from a sequence ⁇ of non-terminals and terminals if ⁇ can be transformed into ⁇ by applying a succession of productions of the grammar.
- “aabb” can be derived from “aAbB” because the rules A → a and B → b, applied to aAbB, yield aabb.
- a sequence of terminals, or a “sentence,” is said to be in the language of the grammar if it can be derived from the start symbol, S.
- the sequence “abb” is in the language of the grammar, because S → AB → aB → abB → abb.
- “abab” is not in the language, since no succession of productions can be used to derive “abab” from S.
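These derivations can be checked mechanically. The productions below (S → AB, A → a, B → bB, B → b) are reconstructed from the quoted derivation of “abb” and are an assumption, since Grammar1 itself is not reproduced in this excerpt.

```python
# Sketch of checking a derivation step-by-step against a production list.
# Productions are reconstructed from the derivation S -> AB -> aB -> abB -> abb.

PRODUCTIONS = [("S", "AB"), ("A", "a"), ("B", "bB"), ("B", "b")]

def derives_in_one_step(before, after):
    """True if `after` follows from `before` by applying one production."""
    for lhs, rhs in PRODUCTIONS:
        for i, symbol in enumerate(before):
            if symbol == lhs and before[:i] + rhs + before[i + 1:] == after:
                return True
    return False

def check_derivation(steps):
    """True if each consecutive pair of strings is one production apart."""
    return all(derives_in_one_step(a, b) for a, b in zip(steps, steps[1:]))
```

Checking ["S", "AB", "aB", "abB", "abb"] succeeds, while no single production turns "AB" into "abab", matching the claim that "abab" is not in the language.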
- the non-terminals and terminals correspond intuitively to the standard grammatical objects learned by a school child.
- the terminals are simply the words and punctuation symbols of the language; the non-terminals are the standard phrase constructs and word types learned in elementary school: noun, verb, noun phrase, verb phrase, etc.
- the set of non-terminals in human languages tends to be fairly limited; the set of terminals and the productions vary widely, and in their variance is the rich diversity of human language.
- any sequence of non-terminals and terminals may appear on either side of a grammar rule.
- grammars which exploit this freedom are computationally intractable. Thus various restrictions are often placed on the form of the left-hand side and the productions which make parsing these restricted grammars computationally tractable.
- the context-free grammar used in one embodiment by the content engine 110 provides the minimal amount of grammatical information necessary to capture the correct parse of any grammatically correct English sentence.
- the main intent of the grammar is to capture the correct parse of a sentence without attempting to understand the meaning (or semantics) of the sentence.
- the grammar is thus created to include every correct parse of every sentence in the language. Naturally, for any single sentence this results in several ambiguous parses, only one of which is the (semantically) correct parse of the given sentence.
- the grammar provided by the grammar storage 170 can be substantially compacted from a full grammar of the English language, so as to facilitate brevity of the grammar.
- the grammar shown in grammar.txt comprehensively ignores grammatical features like verb conjugations, plurality of nouns, tense, active or passive voice etc. This is acceptable because these features are irrelevant to the parse of a sentence and are only needed if the semantics of a sentence were to be analyzed in detail.
- In grammatical analysis, the particular sequence of rewrite rules used to derive the sentence is usually called the parse of the sentence. In a context-free grammar, the parse of a particular sentence can be represented mathematically as a “parse tree.”
- FIG. 3A depicts an example of a parse tree for “abb”, using the Grammar1 above.
- a parse may not be unique. For example, consider now the Grammar2.
- Such a grammar which can result in multiple parse trees for a string, is said to be “ambiguous.”
- Most grammars for human languages are ambiguous in this precise technical sense, for the excellent reason that human language is itself ambiguous. For instance, in the sentence “The boy helped the girl with the suitcase,” the modifier “with the suitcase” can either apply to the girl, or to the act of helping. In general, a modifier can modify any part of the sentence. Resolution of ambiguities is an important problem in parsing, and will be discussed below.
- all parses of a given sentence can be represented as a single parse Directed Acyclic Graph (“DAG”) 300 . This is illustrated in FIG. 3C for sentence “abb”.
- the dashed edges 310 of DAG 300 represent optional parses; selection of a consistent set of edges yields a valid parse tree.
- Comparing FIGS. 3B and 3C, it can be seen that the two trees in FIG. 3B have a total of 14 nodes and 12 edges; in contrast, the parse DAG shown in FIG. 3C has a total of only nine nodes and 11 edges.
- the space and time savings represented by using the parse DAG are dramatic when there are hundreds or thousands of parses, as is typical for English sentences.
- the space and time taken to construct the parse DAG is proportional to the number of distinct nodes in the component parse trees, whereas the space and time taken by conventional algorithms is proportional to the total number of nodes across all the parse trees.
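The sharing that makes the parse DAG compact can be illustrated by hash-consing identical subtrees, as in the following Python sketch. The sentence structure, node labels, and counts below are illustrative assumptions, not the patent's data structures:

```python
# Illustrative hash-consing sketch of the parse-DAG idea: identical
# subtrees of different parse trees are stored once, so the DAG holds
# each distinct node a single time. Trees are tuples of
# (label, child_trees...); all labels here are made-up examples.

def intern_tree(tree, cache):
    """Recursively replace a tree by a shared, interned key."""
    label, *kids = tree
    key = (label, tuple(intern_tree(k, cache) for k in kids))
    return cache.setdefault(key, key)

cache = {}
# Two hypothetical parses of one sentence: they differ only in where
# the prepositional phrase attaches, so most subtrees are shared.
np_girl = ("NP", ("the",), ("girl",))
pp = ("PP", ("with",), ("NP", ("the",), ("suitcase",)))
tree1 = ("S", ("NP", ("the",), ("boy",)),
         ("VP", ("helped",), np_girl, pp))          # PP modifies the verb
tree2 = ("S", ("NP", ("the",), ("boy",)),
         ("VP", ("helped",), ("NP", np_girl, pp)))  # PP modifies the girl
roots = {intern_tree(t, cache) for t in (tree1, tree2)}
```

The two trees contain 29 nodes in total, but the interned cache holds only the 15 distinct nodes, mirroring the node sharing of the parse DAG.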
- NML stands for Natural Markup Language.
- the approach of the present invention is based on describing the set of concepts of a specific application area or domain as a set of objects.
- Objects are grouped into two fundamental classes:
- Enumerations These are objects defined by single words or fixed phrases in English over the given domain.
- An example of an Enumeration is the object Color, which is defined by the color words (e.g., red, blue, mauve) of everyday experience.
- Composites These are objects defined as collections of sub-objects.
- the sub-objects of a composite are called its “attributes.”
- One example of a composite is the object Desk, which can have attributes PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal), Dimension (e.g., a composite giving the height, width, and depth of the desk), Use (e.g., an enumeration consisting of executive, computer, student, secretary), and various other attributes describing the material, finish, and optional features of the desk.
- NML is a language for declaring objects, enumerations, and the relations between objects.
- the NML programmer declares the composites and enumerations of the domain.
- NML is based on the Extensible Markup Language (“XML”) standard. It should be noted that the NML description of a domain describes a graph of objects, where the sinks of the graph (the nodes with no outgoing edges) are the Enumerations of the domain.
- the NML module 140 provides an application-specific NML to the content engine 110 .
- NML is a tool for describing an application's object hierarchy and the vocabulary by which the hierarchy is referenced in natural language to the content engine 110 . Because the meanings of words themselves are not relevant to the actual implementation of a system, the present invention can be used for various different applications.
- An NML document may be created for each application, and, typically, a small special-purpose markup language for the domain itself may be created. The markup language and the NML document are strongly related.
- An NML document captures the concepts of an application domain, and the markup language is designed to hold the values for those concepts for a particular query.
- NML is an extension of the eXtensible Markup Language (XML).
- XML is the core of all tag-based markup languages. It is almost never used standalone, but is configured into an application-specific tag-based markup language. Examples of such languages are the Mathematical Markup Language, MML, and Commerce One's product interchange language.
- An XML document consists of a set of “elements.”
- An element is a chunk of a document contained between an HTML-style tag and its matching closing tag.
- HTML Unlike HTML, however, XML has no built-in tags—rather, the set of tags for a specific document are defined by its Document Type Definition, or DTD.
- Program1 above is extremely simple; it just recognizes an object indexed by the string “hello, world”, and maps it to the object “HelloWorld.”
- the IDENTIFIER element within the ENUMERATION element indicates that the LITERAL argument, when it occurs in the text, creates an instance of the relevant ENUMERATION.
- the phrase “hello, world” creates an instance of the HelloWorld object, and this maps that exact phrase.
- This program, while simple, recognizes only the exact phrase "hello, world" with various capitalizations. A simple program which recognized only this exact phrase would have served as well, and been far simpler to write. However, in NML, a program which recognizes much more is almost as easy to write. This is shown in the next example in Program2.
- Program2 above declares an object HelloWorld with two sub-objects, or ATTRIBUTES: Greeting and World. Greeting is indexed by the literals “hello”, “hi”, “good morning”, and “good afternoon”; World by “everyone”, “everybody”, and “world”.
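Program2 itself is not reproduced in this text. Based on the elements described in this document (OBJECT, ATTRIBUTE, ENUMERATION, IDENTIFIER, LITERAL and the NAME/ID arguments), it plausibly resembled the following sketch; the exact spelling of the markup is an assumption:

```xml
<OBJECT NAME="HelloWorld">
  <ATTRIBUTE ID="Greeting" MIN="1" MAX="1"/>
  <ATTRIBUTE ID="World" MIN="1" MAX="1"/>
</OBJECT>
<ENUMERATION NAME="Greeting">
  <IDENTIFIER LITERAL="hello"/>
  <IDENTIFIER LITERAL="hi"/>
  <IDENTIFIER LITERAL="good morning"/>
  <IDENTIFIER LITERAL="good afternoon"/>
</ENUMERATION>
<ENUMERATION NAME="World">
  <IDENTIFIER LITERAL="everyone"/>
  <IDENTIFIER LITERAL="everybody"/>
  <IDENTIFIER LITERAL="world"/>
</ENUMERATION>
```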
- Program2, when implemented by the content engine 110, is designed to recognize the following phrases.
- Program2, however, does not quite work to recognize these phrases. In fact, Program2 recognizes nothing. Rather, Program3 below, which differs from Program2 by a single word, does in fact recognize the above phrases.
- The difference in behavior between Program2 and Program3 is due to a single factor: in Program3, all nouns and verbs in a sentence must be matched in a tree rooted in a single object, or the sentence as a whole is not considered mapped.
- NML is the means by which the application developer describes the structure of his application to the content engine 110 .
- it is equivalent to defining an Application Programming Interface (API) for the application, with a key property, in one embodiment, that the "application programmer" in this case is a user speaking a specific language (e.g., English).
- the API is very simple: it encapsulates only those objects and attributes which a user can create with a single English sentence and which would be expected to be known by users of the application.
- the NML would describe objects such as Desk, which can have attributes such as PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), and PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal).
- an NML file looks similar to a Java interface file or a C++ .h file: it is a description of the objects of an application, without their implementation.
- the object hierarchy described in the NML file is, in logical structure and function, very much like the programmer's object hierarchy for the application: a few additional objects are typically added to provide targets for English mapping. This section concerns itself with the raw structure of NML; the means by which this is deployed in an application will be seen below.
- the NML_MODEL element is the root of the NML file. This contains a set of IMPORTs, and a set of OBJECTs.
- the DOMAIN argument to the NML_MODEL element is simply an indication to the content engine 110 of the name of the particular domain or application being processed by the content engine.
- the required FILE argument contains the path of the file to import.
- a typical NML application contains a small set of custom objects and a much larger set imported from standard libraries.
- a classic example is the Date package, which recognizes common date phrasings: everything from “the last week of the second quarter before last” to “12/19/98”.
- the IMPORT element may look like:
- the COMMENT element is used to denote an NML comment (as opposed to a general XML comment), and may be attached to the model as a whole or to any single object.
- the COMMENT element may look like:
- the OBJECT element is the heart of NML. It may look like:
- An OBJECT can be thought of as a type in a programming language. Unlike types in programming languages, however, an object in NML has no real implementation. Its purpose is to provide a target for the content engine's 110 mapping of a word, a phrase, or a sentence, and a source for the Domain back end's mapping to the application's API. As such, it merely needs to provide type information: this is the type to which the phrase or sentence is mapped. The substructure of the Object element gives the explicit instructions for mapping the phrase.
- NAME The first argument is required, and gives the name of the Object. All references to the Object, specifically those in ATTRIBUTE elements, are made by the NAME of the Object.
- EXPR refers to the ability of this object to form expressions—phrases involving “and”, “or”, “;”, “/”, or “,”.
- Such expressions are always formed over homogeneous objects.
- the PEER and DML_* arguments (DML_ELEMENT, DML_ATTRIBUTE, DML_VALUE) control DML generation, described below.
- the SINGLETON argument indicates that any instance of this object can take only a single attribute. This is used when an object is, logically, an abstract superclass of several objects, only one of which can be represented.
- the MAX attribute declaration (see below) is not adequate to control this case, since the MAX attribute declaration controls the number of instances of a single attribute object: this controls the number of attribute objects.
- the ROOT argument indicates whether an instance of this object can be at the root of an instance NML tree.
- An Object contains an optional comment (see above) and a set of ATTRIBUTES. If OBJECT is analogized to a type in a programming language, ATTRIBUTE is analogous to a member of the type. Reference is by name.
- member names distinct from types are only useful when there is more than one object of a specific type as a member. If this were the case in NML, the content engine 110 would be unable to know which object to map to which attribute. In one embodiment, this problem may be solved by permitting multiple attributes of a specific type, and letting the back end sort out their roles in the sentence.
- the ATTRIBUTE element is empty, and has the following arguments:
- MIN This argument indicates the minimum number of attributes of this ID that this object must have.
- a HelloWorld object must have at least one Greeting attribute and one everybody attribute.
- the values of MIN can be 0, 1, or 2, with a default of 0.
- the set of possible values may be expanded if a need is ever found.
- Often the minimum cardinality of an object is known. For example, a book must have a title. This can be exploited in the mapping process by deleting objects which do not achieve the minimum cardinality for an attribute.
- MAX This argument indicates the maximum number of attributes of this ID that this object may have. In the example, a HelloWorld object may have at most one Greeting attribute and one everybody attribute. The values of MAX can be 1, 2, or many, with a default of many. The set of possible values may be expanded if a need is ever found. Often the maximum cardinality of an object is known. For example, a book must have only one title. This can be exploited in the mapping process by deleting objects which exceed the maximum cardinality for an attribute.
- the NML document produced by the mapper 220 can, however, be too cumbersome for easy processing.
- the mapping algorithm described in detail below creates a node in the NML instance object for each phrase successfully mapped. Some of these phrases have no semantic significance in the sentence. Moreover, many separate phrasings may be used to create the same logical object. Since the NML objects are closely tied to the phrasings used, multiple possible NML objects are used to denote the same logical object. Further semantic processing of the NML instance is required before the results can be used to populate a database or launch a search query.
- This element is an element of a Domain Markup Language designed for electrical devices. It is automatically extracted from any NML instance corresponding to a text fragment which describes the logical entity “65 amps”.
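The element itself is not reproduced in this text. Based on the surrounding description (a DML_ELEMENT named "Current" is mentioned below), it might plausibly take a shape such as the following, where the element name, attribute names, and values are all assumptions for illustration:

```xml
<Current VALUE="65" UNITS="amperes"/>
```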
- the Domain Markup Language corresponding to an NML model is specified in the NML model itself, with one specific NML Element and three attribute declarations. These are described here:
- This element directs the DML Generator 230 to begin a new DML instance with a root element whose name is the required attribute of DML_CALL, whenever an NML Element whose name corresponds to a TRIGGER is detected in the NML Instance. For example,
- the following three attributes attach to any NML OBJECT, ENUMERATION, CALLBACK, PATTERN, or ATTRIBUTE, and control the creation of DML Elements and Attributes, and (optionally) setting the values of DML Attributes. They are described below.
- DML_ELEMENT This attribute gives the name of the DML Element to create (e.g., "Current"). If absent, the name is assumed to be the NAME of the NML OBJECT, ENUMERATION, PATTERN, or CALLBACK, or the ID of the NML ATTRIBUTE. It directs the creation of a DML Element of the given name whenever the corresponding NML structure is encountered in the NML instance. This differs from DML_CALL in that the DML Element is not created as the root of a new DML structure; rather, the new element is embedded as a subobject of any containing DML Element. This will be explained in more detail below, when the DML generation algorithm is explicated.
- DML_VALUE is an optional adjunct to DML_ATTRIBUTE, and permits an NML programmer to override the default value assigned to an attribute by the DML Generation procedure. This is most often used when synonyms or multiple phrasings can appear, and a normalized value is desired.
- FIG. 4 is a flowchart illustrating the functionality of the content engine 110 in accordance with an embodiment of the present invention.
- the content engine 110 receives the input 410 and tokenizes it.
- the parser 210 then creates 420 all the parse trees based on the tokenized input and the grammar from the grammar storage 170 .
- the mapper 220 generates 430 an instance tree based on the application domain specific NML provided by the NML Model Module 140 .
- the mapper 220 then also prunes 440 the instance trees, and then chooses 450 the best map.
- the DML generator 230 uses this best map to generate 460 the appropriate DML.
- the functionality of the content engine 110 outlined in FIG. 4 can be used both for content synthesis and for retrieving data.
- the input received 410 may, for instance, be a catalog of items (and their descriptions) offered by an e-commerce site.
- the input received 410 may, for instance, be a search query by a user.
- the DML generated 460 may be used to populate a database, while in the case of data retrieval, the DML generated 460 may be used to search a database that has been previously populated.
- the input is tokenized 410 by the content engine 110 .
- tokens are simply the words in the input text.
- multiple words may sometimes be treated as a single token, for example, the two or more words that form a name such as San Francisco, or New York City.
- Multiple words that form a compound noun or other concepts such as dates, times, number patterns etc., may also be aggregated into a single token.
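A minimal sketch of such aggregation, assuming a hand-made list of multiword names (the list, normalization, and function names are illustrative, not part of the patent):

```python
# Illustrative tokenizer sketch: split on whitespace, then merge
# known multiword names into single tokens. The multiword list and
# lowercase normalization are made-up simplifications.

MULTIWORD = {("san", "francisco"), ("new", "york", "city")}

def tokenize(text):
    words = text.lower().replace(",", " ").split()
    tokens, i = [], 0
    while i < len(words):
        for size in (3, 2):  # try the longest multiword names first
            chunk = tuple(words[i:i + size])
            if len(chunk) == size and chunk in MULTIWORD:
                tokens.append(" ".join(chunk))
                i += size
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

For example, `tokenize("She moved to San Francisco")` treats "San Francisco" as one token.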
- the parser 210 generates parse trees from the tokenized input based on the grammar obtained from the grammar storage 170 . In one embodiment, the parser 210 creates all possible parse trees.
- the parser 210 creates parse trees, similar in form to the parse tree (conceptually) created by a compiler from a program.
- the leaves of this tree are the tokens (or words of the input text); the internal nodes represent phrases and subunits of the sentence, where each node represents the subunit containing all the tokens descended from that node.
- the root node represents the sentence itself.
- FIG. 5A depicts the first tree
- FIG. 5B depicts the second tree.
- the boxes mark the recognized grammar symbols such as “SVO” (for Subject-Verb-Object), “NP” (Noun Phrase), and so on.
- A comparison of FIGS. 5A and 5B reveals that the nodes of the trees are the same, and are distinguished only by the edge into the node representing the phrase "with the suitcase."
- In the first case, the edge 510 runs from the node representing the verb phrase "helped"; in the second case, the edge 520 runs from the node representing the phrase "the girl."
- the DAG is depicted in FIG. 5C .
- the DAG itself contains exactly the same number of nodes as each of the two component parse trees, and only one more edge than either of the two component parse trees.
- the parser 210 can employ any parsing algorithm.
- the parsing algorithm of Cocke-Younger-Kasami may be used. Details of the Cocke-Younger-Kasami algorithm can be found in Introduction to Formal Language Theory, Harrison, M. A., Addison-Wesley, 1978. A sample of the Cocke-Younger-Kasami algorithm is shown below in Tables 12A-E. While the algorithm shown below provides a single parse of a sentence, it may be modified to generate all parses of the sentence.
- the core of this algorithm is an (n+1)×(n+1) table, where "n" is the number of tokens in the parse.
- the tokens are here denoted a_0 . . . a_(n-1), and the table elements T_(0,0), . . . , T_(n,n).
- the upper half of the table is filled from T_(i,i+1) to T_(n,n) in the order given below.
- the items just above the diagonal are filled with the grammar nonterminals that directly derive the relevant token.
- the items in the remaining cells are filled in as follows:
- A is placed in T_(i,j) whenever there is a production A → B C with B in T_(i,k) and C in T_(k,j), for some k with i < k < j.
- T_(i,j) contains precisely the set of nonterminals which derive the phrase beginning at a_i and terminating in a_j.
- T_(0,n) then contains the set of non-terminals which derive the entire sentence.
- the initial matrix is shown below.
- the root of the parse tree is contained in the element T[0][4]—in other words, in the cell in the top-right corner of the matrix. At this point the parsing algorithm terminates and the correct parses are read from the top-right corner of the matrix.
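As a concrete illustration of the table-filling scheme described above, the following Python sketch implements CYK recognition for a tiny made-up grammar in Chomsky normal form. The grammar, rules, and sentence are illustrative assumptions, not the patent's grammar:

```python
# Illustrative CYK recognizer. T[i][j] holds the set of nonterminals
# deriving the phrase tokens[i:j]; the sentence parses iff the start
# symbol appears in the top-right corner cell T[0][n].

def cyk_parse(tokens, lexical, binary, start="S"):
    n = len(tokens)
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Cells just above the diagonal: nonterminals directly deriving
    # the corresponding token.
    for i, tok in enumerate(tokens):
        T[i][i + 1] = {A for (A, w) in lexical if w == tok}
    # Remaining cells: A is added to T[i][j] whenever A -> B C with
    # B in T[i][k] and C in T[k][j] for some i < k < j.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C) in binary:
                    if B in T[i][k] and C in T[k][j]:
                        T[i][j].add(A)
    return start in T[0][n], T

# Made-up toy grammar: S -> NP VP, NP -> Det N, VP -> V NP.
lexical = [("Det", "the"), ("N", "boy"), ("N", "girl"), ("V", "helped")]
binary = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]

ok, table = cyk_parse("the boy helped the girl".split(), lexical, binary)
```

Here `ok` is true because the start symbol S reaches the top-right cell; to enumerate all parses rather than merely recognize the sentence, back-pointers would be recorded alongside each table entry.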
- the mapper 220 generates 430 instance trees for each parse tree based on the application-specific NML provided by the NML module 140 .
- the mapper 220 then prunes 440 these instance trees to discard invalid and/or incomplete trees.
- the mapper then chooses 450 the best map.
- An object in the instance tree is said to cover a node of the parse tree (equivalently, a node is said to “map” to an object), if the mapper 220 matches the object to the node, by the rules explained below.
- the goal of the mapping algorithm is to map a single object to the root node of the tree. In one embodiment, if a single NML instance cannot be obtained for a sentence, the system switches to another mapping mechanism that tries to obtain the best set of disjoint NML instances that cover the entire sentence. There are several different methods to perform a partial map of a sentence.
- instance trees are generated by starting out at the leaf (or terminal) nodes of a parse tree.
- a terminal node is created for each token.
- all enumerated objects are indexed by the terminal word.
- An inference process is then executed to create inferred objects.
- the algorithm then moves up the parse tree, generating a new object at a parent node by composing the objects of the child nodes at the node.
- At each node there is one child node that is predetermined to be the main child of the node.
- the main child corresponds to the grammatical object that plays the central role in the grammatical structure represented by the node. For a noun phrase, this is the head noun; for a prepositional phrase, this is the prepositional complement, etc.
- Objects can be generated in several ways. Specifically, objects can be generated by enumeration from identifiers, enumeration from callbacks, and enumeration from patterns. In addition, objects can also be inferred from other objects. Let us consider each of these in turn.
- An Enumeration is an object created by the presence of a single word or phrase.
- an Enumeration is in every way identical to an object, except for the fact that an object is always inferred from an existing attribute and an Enumeration is inferred from a word or phrase.
- the IDENTIFIER element recognizes a single word that forces creation of the object. The specific word is given in the LITERAL argument.
- the IDENTIFIER element has no substructure, and can take the following arguments, listed below:
- TYPE This restricts the mapping to the particular part of speech given as the argument. Often, words can assume several different parts of speech. For example, the word “green” is a noun (denoting a patch of grassy land or a color), an adjective, or a verb. It is often desired to restrict an IDENTIFIER to only one of these roles. If Verb is given as the value of TYPE, then only verbs will map to this particular identifier. The default value, ANY, maps any part of speech to this IDENTIFIER.
- the CALLBACK element functions in a fashion similar to ENUMERATION: it is a means for mapping individual tokens in a sentence to OBJECTS. It is designed for the specific case where the set of IDENTIFIERs for a particular OBJECT is very large, changes dynamically, or both.
- the CALLBACK element defines a Java class which contains at least two methods: a method which takes a string and returns a boolean (this is named in the PARSER argument), and a method which takes a string and returns another string (this is named in the MAPPER argument). While this was specifically designed with a SQL interface in mind, there is no restriction in the code for this: any Java class having the appropriate methods will do.
- the CALLBACK element may have no structure, and have the following arguments, all of which are required:
- CLASS This is the name of the fully-qualified Java class containing the two methods referenced above.
- the Content Engine will call the method <CLASS>.<PARSER>(token); to recognize the token, and <CLASS>.<MAPPER>(token); to map it (in the example above, "ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(token);" for recognition, and "ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(token);" for mapping).
- the CLASS must be accessible to the Content Engine from the string as given here using the standard Java class loader methods.
- PARSER This is the name of the method within CLASS called to do the recognition: it should take a single String argument and return a boolean. This functions exactly as the LITERAL argument to IDENTIFIER; Content Engine will pass the root form of the token, not the token itself, to the parser. Thus, the word “Microsoft's”, appearing in a sentence, yields the call “ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(microsoft)”. When this returns true, the behavior of the compiler is exactly identical to that produced when “microsoft” had appeared in a list of IDENTIFIERs for this OBJECT.
- MAPPER This is the name of the method within CLASS called to map recognized tokens to a canonical form: it should take a String and return a String. This functions exactly as the MAP argument to IDENTIFIER. As with PARSER, Content Engine will pass the root form of the token, not the token itself, to the mapper. To obtain the default behavior of IDENTIFIER, MAPPER should simply return its argument.
- ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol returns the symbol associated with the name.
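The patent specifies the callback as a Java class; the following Python sketch shows an analogous recognizer/mapper pair. The fund table and function names are made-up illustrations of the protocol, not the patent's implementation:

```python
# Python analogue of the two-method CALLBACK protocol: a PARSER-role
# method taking a string and returning a boolean, and a MAPPER-role
# method taking a string and returning a canonical string.

FUNDS = {"microsoft": "MSFT", "intel": "INTC"}  # made-up table

def is_company_fund_index_name(token):
    """PARSER role: recognize the (root form of the) token."""
    return token.lower() in FUNDS

def find_company_fund_index_symbol(token):
    """MAPPER role: map a recognized name to its canonical symbol."""
    return FUNDS[token.lower()]
```

As described above for IDENTIFIER, the engine would pass the root form of the token ("microsoft", not "Microsoft's") to both methods.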
- CALLBACK 520 may be simplified if the Content Engine 110 adopts an interface-based protocol for its callbacks. In this case, the PARSER and MAPPER arguments to CALLBACK will disappear, and the CALLBACK CLASS will be required to implement the Content Engine 110 callback protocol.
- a pattern is the third logical equivalent to an enumeration. This is used when a large number of identifiers can be specified by a regular expression.
- regular expressions (formally, regular languages)
- the most simple example of a regular expression is a Social Security Number, which is represented by the regular expression:
- a social security number is any string which begins with a digit between 1 and 9, followed by two digits between 0 and 9, an optional dash, two digits between 0 and 9, an optional dash, and then four digits between 0 and 9.
- the content engine 110 accepts any regular expressions specified by the PERL 5 compiler (see http://www.perldoc.com/perl5.6/pod/perlre.html for the current specification).
- the regular expressions are captured in the STR argument of the contained REGEXP element. Occasionally, it is useful to specify multiple regular expressions in the same pattern, which are separated by an optional SEP character (space by default).
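The Social Security Number description above corresponds to a regular expression along the following lines. This is a reconstruction from the prose, not the patent's own expression; Python's re module accepts this PERL 5-compatible syntax:

```python
import re

# Digit 1-9, two digits, optional dash, two digits, optional dash,
# four digits -- as described in the text above.
SSN = re.compile(r"^[1-9][0-9]{2}-?[0-9]{2}-?[0-9]{4}$")

examples = ["123-45-6789", "123456789", "023-45-6789"]
matches = [bool(SSN.match(s)) for s in examples]
```

The first two examples match (with and without dashes); the third is rejected because its leading digit is zero.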
- Inference is when the presence of a modifier can imply the existence of an object, even when the object is not explicitly identified. This can occur through ellipsis, or, more commonly, because the underlying object is abstract and is not always (and perhaps never) explicitly identified.
- inference is related to type inference in an object-oriented language in a deep and non-obvious fashion.
- if a type A is a subclass of a type B in an object-oriented language, then every instance of A bears within it an instance of type B; creation of an instance of A forces the creation of an instance of B.
- the declaration of a sub-type in a program is a declaration of an inferencing attribute.
- the inferencing attribute may directly infer the object.
- the attribute can be directly recognized, and the inferred object can be built directly from it.
- the INFER element is an argument of an attribute, which, when true, instructs the content engine 110 to immediately build the OBJECT whenever an object of the type named in ID is built.
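A highly simplified sketch of this inference step, using the HelloWorld example from above. The dictionary-based mechanism and data shapes are illustrative assumptions, not the patent's implementation:

```python
# Attribute-driven inference: whenever an object of a type named in
# an INFER-marked attribute is built, the containing object is built
# immediately, with the new object attached as its attribute.

INFERS = {"Greeting": "HelloWorld", "everybody": "HelloWorld"}

def build(obj_type, token):
    obj = {"type": obj_type, "token": token, "attributes": []}
    inferred = INFERS.get(obj_type)
    if inferred:  # wrap the new object in its inferred parent
        return {"type": inferred, "token": token, "attributes": [obj]}
    return obj

hello = build("Greeting", "hello")  # infers a HelloWorld object
```

Building a Greeting from "hello" thus immediately yields a HelloWorld object, while a type with no INFER declaration is returned unchanged.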
- the “handle” of the instance tree must be adjusted. It may be helpful to define some terminology here.
- In the case of a sentence, the dominant element is the verb phrase; in the case of a noun phrase, it is the head noun; in the case of an adjectival phrase, it is the adjective. This element is referred to as the head word or head phrase of the phrase.
- As the mapper 220 progresses, it creates trees of objects centered on nodes of the parse tree. Such a tree of objects, centered on a node of the parse tree, is said to be a map of the node.
- the link between a tree of objects and the parse tree is a single object within the map, called the handle of the map.
- the handle of the map may be thought of as the root of the map of the head phrase of the mapped node in the parse tree. Its role (and how the handle moves during the mapping process) will be explained below.
- both the Greeting object and the everybody object create a HelloWorld object through the inference mechanism.
- Both of these HelloWorld objects have a missing, required attribute: once merged into a single object, the required attributes for both are complete.
- the HelloWorld objects created by the Greeting object containing “hello” and the everybody object containing “everyone” should not merge: this would imply that there was a single phrase containing both Greeting and everybody, and this is false.
- a second method that might be imagined would have an object adding as attributes only the maps of the modifiers of its head phrase.
- the semantic analysis of a sentence often contains inversions of its grammatical structure.
- a map may add as an attribute:
- Based on the application-specific NML obtained from the NML module 140, the mapper 220 starts the generation 430 of instance trees by considering one node to process 601. The mapper 220 first determines 602 whether the node it is considering is a leaf node. If the node is determined 602 to be a leaf node, the object array is initialized 604 with generated objects.
- the mapper 220 iterates 606 - 610 over all the objects in the array. For each such existing object, all objects that can be "inferred" from the existing object are added 610 to the object array. "Inference" is the only other way in which instance objects are generated, as described above. Once it is determined 606 that there are no more objects in the array, the object array is returned.
- the object array is initialized 614 to empty.
- the mapper 220 determines 616 whether all the children of the node have been processed. If all the children of the node have not been processed, the next child node is selected 618 and processed 620 .
- the maps of the child node are copied 622 to the object array, and the root of each copied object is set 624 to the child node.
- each object of the array is selected in turn as the object to which to attach attributes. This object is denoted as obj and is indexed by the variable i. Each object of the array is selected in turn using the index j initialized 630 to zero. The object indexed by j is examined 640 and is henceforth referred to as obj 1 . The goal of steps 640 - 648 is to determine whether obj 1 can be attached as an attribute of obj, and to perform the attachment if it is possible.
- obj is examined 642 to see if it has as an attribute an object whose name is the name of obj 1. If this is true, then the second test is performed 644: whether the handle of obj 1 modifies the handle of obj. If this is true, then obj 1 is attached 646 as an attribute of obj. Following this, or if either of the tests 642, 644 failed, the next item in the array is selected 648 as obj 1.
- the final step is the reassignment of obj's handle, steps 634 - 636 .
- the handle of obj is set to obj itself if obj has been inferred; if not, the handle of obj is left undisturbed.
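The two attachment tests and the attachment step described above can be sketched as follows. The data shapes and the table-based "modifies" relation are illustrative assumptions standing in for the grammatical modification check:

```python
# Sketch of steps 640-648: obj1 becomes an attribute of obj only if
# (642) obj declares an attribute of obj1's name, and (644) obj1's
# handle grammatically modifies obj's handle.

def try_attach(obj, obj1, modifies):
    declared = obj1["name"] in obj["declared"]                 # test 642
    modifies_ok = (obj1["handle"], obj["handle"]) in modifies  # test 644
    if declared and modifies_ok:
        obj["attributes"].append(obj1)                         # step 646
        return True
    return False

# Made-up example in the spirit of the Desk domain described earlier.
desk = {"name": "Desk", "handle": "desk", "declared": {"Dimension"},
        "attributes": []}
dim = {"name": "Dimension", "handle": "wide", "attributes": []}
attached = try_attach(desk, dim, modifies={("wide", "desk")})
```

Here the Dimension map attaches to the Desk map because Desk declares a Dimension attribute and "wide" modifies "desk" in the (stubbed) grammatical relation.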
- pruning 440 is performed by the mapper 220 to discard invalid/incomplete instance trees.
- a list of the tokens mapped into the instance tree is recorded; an instance tree which does not map all the verbs and nouns of the sentence is discarded.
- Pruning starts 701 at the root of an instance tree.
- An array is designated 702 as the array of objects (i.e. components of the instance tree) associated with the root of the parse DAG.
- the content engine determines 704 whether there are any more objects in the array. As long as there are more objects remaining in the array, obj is assigned 706 the next object in the array.
- the content engine determines 708 whether the obj covers all nouns and verbs in the sentence. If not, the object is deleted 710 from the array.
- the content engine determines 712 whether the MIN and MAX attributes of the object are satisfied. If they are not satisfied, the object is deleted 710 from the array. If these attributes are satisfied, the content engine loops back to determine 704 whether there are any more objects left in the array. When such determinations have been made for all the objects in the array, the array is returned 714 . Thus, only those instance trees that account for all the verbs and nouns of the given sentence, and which satisfy the MIN and MAX attributes, are retained.
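The pruning loop above can be sketched as follows. The map representation (a covered-token set plus per-attribute cardinality triples) is an illustrative assumption, not the patent's data structure:

```python
# Sketch of the pruning loop of FIG. 7: keep only those maps that
# cover every noun and verb of the sentence and that satisfy each
# attribute's MIN/MAX cardinality.

def prune(maps, content_tokens):
    kept = []
    for m in maps:
        covers_all = content_tokens <= m["covered"]         # test 708
        within = all(lo <= n <= hi
                     for (lo, hi, n) in m["cardinalities"])  # test 712
        if covers_all and within:
            kept.append(m)
    return kept

# Made-up maps: (MIN, MAX, actual count) triples per attribute.
maps = [
    {"covered": {"book", "title"}, "cardinalities": [(1, 1, 1)]},
    {"covered": {"book"}, "cardinalities": [(1, 1, 1)]},           # misses "title"
    {"covered": {"book", "title"}, "cardinalities": [(1, 1, 2)]},  # exceeds MAX
]
survivors = prune(maps, {"book", "title"})
```

Only the first map survives: the second fails the coverage test and the third exceeds a MAX cardinality.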
- a different algorithm may be used to discard instance trees.
- the step of pruning 440 need not be performed at all.
- FIG. 8 illustrates how the best map is chosen 450 in one embodiment of the present invention.
- the “best” map can be chosen 450 in several other ways.
- a cost function is used to impose a partial order on maps of a sentence.
- the maps of the sentence which are maximal under this partial order are chosen to be the best maps of the sentence, and returned as the result(s) of the mapping procedure.
- the cost function in FIG. 8 compares two maps (map A and map B), and returns which is the superior map. It consists of a set of eight comparisons 810 - 880 , run in order. The kth comparison in the sequence is used only if the preceding k ⁇ 1 comparisons have resulted in ties; thus, it is a hierarchy of tiebreakers. These are, in order:
- 830 : If the maps are equal under criteria #1 and #2, choose the map with the least distance between the tokens.
- tokens are assigned indices. The leftmost token is assigned index 0, and the token to the immediate right of the token with index i is assigned index i+1.
- the reasoning here is that compact maps are preferred over disjoined maps.
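A hierarchy of tiebreakers of this kind can be sketched as below. Only the token-distance criterion (comparison 830) is spelled out in the surrounding text, so the first two criteria in the list are placeholders for illustration, not the patent's actual comparisons 810 and 820.

```python
# Sketch of a tiebreaker-hierarchy comparator (FIG. 8).

def token_span(map_tokens):
    """Distance between the leftmost and rightmost token indices used by
    a map; tokens are indexed 0, 1, 2, ... from the left."""
    return max(map_tokens) - min(map_tokens)

def better_map(a, b, criteria):
    """Return the superior map, or None on a complete tie. The k-th
    criterion is consulted only when the first k-1 criteria tie."""
    for score in criteria:
        sa, sb = score(a), score(b)
        if sa != sb:
            return a if sa < sb else b  # lower score wins
    return None

# Illustrative hierarchy (assumed, not the patent's): fewer unmapped
# tokens, then fewer objects, then the most compact map.
criteria = [
    lambda m: m["unmapped"],
    lambda m: m["objects"],
    lambda m: token_span(m["tokens"]),
]
```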
- the data structure produced by the mapper 220 is an instance of the domain described in the NML document. In one embodiment, this data structure is then used to generate DML. DML Generation is done in a depth-first fashion over the NML Instance tree.
- FIG. 9 is a flowchart that illustrates the generation 460 of DML.
- the generateDML process is called on each root node of each tree, in turn. Once it has completed on a root node, any open DML elements are closed and output.
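As a sketch, depth-first DML generation over an instance tree might look like this; the node representation and element names are hypothetical, since the patent does not list the generateDML code.

```python
# Sketch of depth-first DML generation (FIG. 9) over NML instance trees.

def generate_dml(node, out):
    """Open a DML element, recurse into children depth-first, then
    close the element."""
    out.append(f"<{node['name']}>")
    for child in node.get("children", []):
        generate_dml(child, out)
    out.append(f"</{node['name']}>")

def generate_all(roots):
    """Run generateDML on each root node in turn; once a root is done,
    its open DML elements have all been closed and output."""
    out = []
    for root in roots:
        generate_dml(root, out)
    return "".join(out)
```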
- a natural language request for a stock chart 1230 is converted into a program invocation 1260 .
- the description is fed into the Content Engine 1220 , which produces a DML Document 1240 .
- a DML Processing System 1250 then generates the program invocation 1260 .
- the program invocation here is shown as an HTTP cgi request. It will be obvious to one skilled in the art that the DML Processing System 1250 could generate a program invocation in any scripting, web, or API environment, and is not restricted to HTTP requests.
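A minimal sketch of that last step, flattening DML fields into an HTTP cgi request, is shown below; the base URL and parameter names are invented for illustration.

```python
# Sketch of a DML Processing System (1250) emitting a cgi-style
# program invocation (1260) from DML name/value pairs.
from urllib.parse import urlencode

def dml_to_cgi(base_url, fields):
    """Flatten DML fields into an HTTP cgi request URL."""
    return base_url + "?" + urlencode(fields)
```

For example, `dml_to_cgi("http://charts.example.com/stockchart.cgi", {"symbol": "IBM", "period": "6m"})` yields a request of the same general shape as the stock-chart invocation 1260.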
- the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
- the various algorithms are illustrative, and variations are easily implemented.
- a different cost function could be used to compute the best map, or the pruning step may be left out altogether.
- the particular capitalization or naming of the modules, protocols, features, attributes, data structures, or any other aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names or formats.
- functionality which is shown or described as being provided by a single module may be provided instead by multiple different modules; likewise, functionality provided by multiple modules may instead be provided by fewer modules or a single module.
Abstract
A method and system for converting plain text into structured data. Parse trees for the plain text are generated based on the grammar of a natural language, and the parse trees are mapped onto instance trees generated based on an application-specific model. The best map is chosen, and the instance tree is passed to an application for execution. The method and system can be used both for populating a database and for retrieving data from a database based on a query.
Description
- A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
- A computer program listing appendix is included in the attached CD-R created on Dec. 12, 2000, labeled “Creation of Structured Data from Plain Text,” and including the following files: CommodityProperty.nml (13 KB), DefaultSeg14Result.xml (2 KB), ElectricalProperty.nml (16 KB), Example.txt, Grammar.txt, INML.xml (5 KB), MeasurementProperty.nml (22 KB), Output.txt (3 KB), PeriodProperty.nml (6 KB), PhysicalProperty.nml (36 KB), ReservedNameProperty.nml (6 KB), Seg14.nml (30 KB), Seg14Phrasing.nml (71 KB), UsageProperty.nml (7 KB), and Utility.nml (6 KB). These files are incorporated by reference herein.
- A. Technical Field
- The present invention relates to creation of structured data from plain text, and more particularly, to creation of structured data from plain text based on attributes or parameters of a web-site's content or products.
- B. Background of the Invention
- In recent years, the Internet has grown at an explosive pace. More and more information, goods, and services are being offered over the Internet. This increase in the data available over the Internet has made it increasingly important that users be able to search through vast amounts of material to find information that is relevant to their interests and queries.
- The search problem can be described at two levels: searching across multiple web-sites, and searching within a given site. The first level of search is often addressed by “search engines” such as Google™ or Alta Vista™, or directories such as Yahoo™. The second level, which is specific to the content of a site, is typically handled by combinations of search engines and databases. This approach has not been entirely successful in providing users with efficient access to a site's content.
- The problem in searching a website or other information-technology based service is composed of two subproblems: first, indexing or categorizing the corpora (body of material) to be searched (i.e., content synthesis), and second, interpreting a search request and executing it over the corpora (i.e., content retrieval). In general, the corpora to be searched typically consist of unstructured information (text descriptions) of items. For e-commerce web-sites, the corpora may be the catalog of the items available through that web-site. For example, the catalog entry for an item might well be the sentence “aqua cashmere v-neck, available in small, medium, large, and extra large.” Such an entry cannot be retrieved by item type or attribute, since the facts that v-neck is a style of sweater, cashmere a form of wool, and aqua a shade of blue, are unknown to current catalogs or search engines. In order to retrieve the information that this item is available, by item type and/or attribute, this description must be converted into an attributed, categorized description. In this example, such an attributed, categorized description may include properly categorizing the item as a sweater, extracting the various attributes, and tagging their values. An example of such a description is illustrated in Table 1.
- TABLE 1

  Item     Style   Color  Material  Sizes
  Sweater  v-neck  Aqua   Cashmere  S, M, L, XL

- Current technology permits such representations in databases. Further, for many standard items, numeric codes are assigned to make the job of search and representation easier. One such code is the UN Standard Products and Services Code (UN/SPSC), which assigns a standard 8-digit code to any human product or service.
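As a sketch, the attributed description of Table 1 amounts to a structured record that supports exactly the attribute-based retrieval the plain-text entry cannot:

```python
# The attributed, categorized description of Table 1, represented as a
# structured record that a database or search engine can filter on.
item = {
    "item": "Sweater",
    "style": "v-neck",
    "color": "Aqua",
    "material": "Cashmere",
    "sizes": ["S", "M", "L", "XL"],
}

def matches(record, **query):
    """Attribute-based retrieval: every queried attribute must match."""
    return all(record.get(k) == v for k, v in query.items())
```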
- However, while the taxonomies and the technology to represent the taxonomies may exist, conventional systems are unable to generate the taxonomic and attributed representation for an object from its textual description. This leads to the first of the two problems outlined above: the content synthesis problem. More specifically, that is the problem of how to convert plain text into structured objects suitable for automated search and other computational services.
- The second problem is one of retrieving data successfully; once the data has been created and attributed, it must be accessible. E-commerce and parametric content sites are faced with a unique challenge, since they must offer search solutions that expose only those products, contents or services that exactly match a customer's specifications. Today, more than 50% of visitors use search as their preferred method for finding desired goods and services. However, even as e-commerce web sites continue to offer their customers unmatched variety, category-based navigation of e-commerce sites (“virtual aisles”) has become increasingly complex and inadequate. In particular, many web-sites that offer a large catalog of products are often unable to find products with precise or highly parameterized specifications, and instead require the user to review dozens of products that potentially match these specifications.
- A few statistics help to emphasize the importance of good searching ability. An important metric that measures the conversion rate of visitors to e-commerce sites into buyers is the book-to-look ratio. The industry average is that only 27 visitors in 1000 make a purchase. The biggest contributor to this abysmal ratio is failed search. Forrester Research reports that 92% of all e-commerce searches fail. Major sites report that 80% of customers leave the site after a single failed search. Therefore, improving the search capability on a site directly increases revenue through increased customer acquisition, retention, and sales.
- While all web-sites experience these search problems to some extent, the problem is particularly acute for web-sites with a deep and rich variety of content or products. Examples are electronic procurement networks, financial sites, sporting goods stores, grocery sites, clothing sites, electronics, software, and computer sites, among many others. Another class of sites with a deep search problem comprises those carrying highly configurable products such as travel and automotive sites. Ironically, as a rule of thumb, the more a web-site has to offer, the greater the risk that customers will leave the site because of a failed search.
- When a customer physically enters a large department store, she can ask a clerk where she can find what she is looking for. The clerk's “search” is flexible in that he can understand the customer's question almost no matter how it is worded. Moreover, the clerk's “search” is generally accurate since the clerk can often specifically identify a product, or initial set of products, that the customer needs. Searches on web sites need to be equally flexible and accurate. In order for that to happen, a visitor's request must be understood not only in terms of the products, but also in terms of the request's parameters or characteristics. However, conventional information retrieval systems for web-site content have been unable to achieve this.
- Some of the conventionally used methods used to find goods and services on web sites, and some problems with these conventional methods are outlined below:
- 1. Keyword-based search: In this method, users type a set of words or phrases describing what they want into a text box, typically on the main page of the site. A program on the site then takes each individual word entered (sometimes discarding “noise” words such as prepositions and conjunctions), and searches through all pages and product descriptions to find items containing any combination of the words. This method, when given an English sentence or phrase, either returns far too many results or too few. For example, if a customer requests, “show me men's blue wool sweaters,” the search could be unsuccessful for the following reasons. It would either return only those pages that contain all the words in this request, or return any page that contained any single word in the search. In the former case, no items would be found, though there might be many products with those characteristics for sale. For instance, it is possible that an aqua cashmere cardigan would not be matched, since its description contains none of the keywords. In the latter case, a large number of items would be found, most of which would be of no interest to the customer. For example, blue wool slacks may be incorrectly matched, since the description contains the keywords “blue” and “wool.” Some keyword-based searches weight results based on how many keywords are matched.
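The two failure modes just described (AND-matching finds nothing, OR-matching finds nearly everything) can be reproduced in a few lines; the catalog entries below echo the examples in the text.

```python
# Illustrative catalog and query showing why naive keyword matching
# returns either too few or too many results.
catalog = [
    "aqua cashmere v-neck sweater",
    "blue wool slacks",
    "men's green cotton shirt",
]
query = {"men's", "blue", "wool", "sweaters"}

def and_search(items, words):
    """Require every query word: typically returns nothing."""
    return [i for i in items if words <= set(i.split())]

def or_search(items, words):
    """Accept any query word: returns mostly irrelevant items."""
    return [i for i in items if words & set(i.split())]
```

Note that neither variant finds the one relevant item: “sweaters” does not literally match “sweater,” and “aqua cashmere v-neck” shares no keyword with the request at all.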
- Keyword-based approaches are widely used in medical transcription applications, database access, voice-mail control and web search. Virtually all commercial natural-language interface products use this approach. In this approach, certain words are regarded as meaningful, and the remainder as meaningless “glue” words. Thus, for example, in the sentence “show all books written by Squigglesby” the words “show,” “book,” and “written” may be regarded as keywords, the word “by” as a meaningless glue word, and the word “Squigglesby” as an argument. The query would then be formed on the theory that a book author named Squigglesby was being requested.
- In such systems, keywords are generally some of the common nouns, verbs, adverbs and adjectives, and arguments are proper nouns and numbers. There are exceptions, however. Prepositions are usually regarded as glue words, but in some circumstances and in some systems are regarded as keywords. Generally, this is due to the human tendency to omit words in sentences, known in the argot as “ellipses.” The sentence “Show all books by Squigglesby” is an example of this, where the verb “written” is excluded. In order to cope with this, some keyword-based systems make “by” a keyword.
- There are a few specialized cases of, or variations on, keyword searches. Database approaches are an example of a widely used variant on keyword-based approaches. In these systems, the database developer associates keywords or identifiers with specific database fields (columns in specific tables). Various words, specifically interrogative pronouns and adjectives, some verbs, and some prepositions, have fixed meanings to the database query program. All other words can be available as keywords for a template-based recognition system. In response to a user's sentence, the interface system may match the user's sentence to a template set constructed from the database developer's information about database structure and identifiers, and its built-in interpretation of its hardwired keywords. A Structured Query Language (SQL) statement would then be generated which encodes the meaning of the user's sentence, as interpreted by the interface system.
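A minimal sketch of this database variant follows; the keyword-to-column bindings and table name are invented for illustration, and a real system would parameterize the query rather than interpolate strings.

```python
# Sketch of the database variant of keyword search: developer-supplied
# keyword-to-column bindings drive generation of an SQL statement.
# Bindings and table name are hypothetical.
BINDINGS = {
    "author": ("books", "author"),
    "title":  ("books", "title"),
}

def to_sql(keyword, argument):
    """Map a recognized keyword and its argument to an SQL query.
    Illustration only: real systems must use bound parameters."""
    table, column = BINDINGS[keyword]
    return f"SELECT * FROM {table} WHERE {column} = '{argument}'"
```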
- Another example of a specialization of the keyword-based approach is a catalog-based approach. Catalogs are databases of products and services. A “category” is the name of a table: the attributes of the category are some columns of the table. In this approach, a question is first searched by a category word, and then the remainder of the question is used as keywords to search for matching items within the category. For example, “blue woolen sweater” would first search for “blue” “woolen” and “sweater” as keywords indicating a category, and then (assuming “sweater” succeeded as a category keyword and the others did not), for “blue” and “woolen” as keywords within the sweater category. The difficulty with this approach is that cross-category queries fail, since no individual category is available to match in such cases. Further, parameters that are not present in the product descriptions in the category are not used.
- Some of the central limitations of keyword-based systems are described below:
- Meanings of words are fixed, independent of context. In keyword-based systems, keywords have fixed semantics. This is a distinct departure from the use of normal language by humans. Words in natural language derive their meaning through a combination of “symbol” (the word itself) and “context” (the surrounding text and background knowledge). The most glaring example is prepositions in the presence of ellipses. For instance, “by” can indicate the subject of almost any transitive verb, as well as physical proximity or indicating an object or method to use to accomplish a particular task. Another example of meaning dependent on context is that “green” can refer to a color, a state of freshness or newness, or, disparagingly, to inexperience. A quick glance at any page of any dictionary will show that most words have multiple, and often unrelated, meanings, and context is what disambiguates them. Contrary to this nuanced usage of words, in general, keyword-based approaches choose one single meaning for each word, and apply that meaning consistently in all searches. This problem is fundamentally unfixable in these systems: in order to attach a contextual semantic to a word, strong parsing technology is required and a means must be found of specifying a word in context, sufficient for a program to understand the contextual meaning.
- Strongly tied to an application. Since the meanings of words must be fixed so strongly, these systems have the interface strongly tied to (and, in general, inseparable from) the application. There is no toolkit comparable to the popular Graphical User Interface (“GUI”) toolkits to form a keyword-based natural-language interface to an arbitrary application.
- Missed meanings attached to glue words, especially prepositions. An assumption behind keyword-based approaches is that glue words carry no meaning or semantic content. Unfortunately, in practice there are very few words whose meanings are always unimportant. The words chosen as glue words are those whose meaning is most context-dependent, and thus their semantic content is largely missed.
- High error rates, non-robust. Since meanings are attached to words independent of context, meanings can often be guessed wrong. For example, one vendor in this space, Linguistic Technology Corporation, distributes a product (“EnglishWizard”) that permits database users to ask questions of a database. A demonstration is given with a database of purchasers, employees, sales, and products. In this example database, numbers always refer to the number of employees. This produces a sequence where, when a user asks “who purchased exactly two items,” the answer is “no one.” However, when a user asks how many items a particular individual purchased, the answer is “two.” The reason for the discrepancy could be that EnglishWizard did not really understand the question. Instead, the first user question was mapped to a question about employees since it included a number in it.
- 2. Free-form keyword search: This category replaces keywords with previously-asked questions and the “right” answers, and returns the answers to the typed-in question. Examples of such systems are described in detail in U.S. Pat. No. 5,309,359, entitled “Method and Apparatus for Generating and Utilizing Annotations to Facilitate Computer Text Retrieval,” issued on May 3, 1994 to Katz, et al., and U.S. Pat. No. 5,404,295, entitled “Method and Apparatus for Utilizing Annotations to Facilitate Computer Retrieval of Database Material,” issued on Apr. 4, 1995 to Katz, et al. In systems employing free-form keyword searching, questions and answers are stored as sets. The question is typically stored in a canonical form, and a rewrite engine attempts to rewrite the user question into this form. If the user question maps into a pre-determined question for which the answer is known, then the answer is returned by the system. Such an approach is used by http://www.AskJeeves.com for Web searching applications, and for lookups of frequently-asked questions (FAQs).
- Such systems have several limitations, including the following:
- A relatively small number of questions can be answered: The number of questions that can be answered is linearly proportional to the number of questions stored—thus, this method can only be used when it is acceptable to have a relatively small number of questions that can be answered by the system.
- Cannot directly answer a user's question: Since such a system processes a user question in toto, and does not attempt to parse it or extract information from the parts, it cannot be used where the solution to the user question requires the use of a parameter value that can be extracted from the question. In sum, the system can merely point the user at a page where his question can be answered—it cannot directly answer the user question.
- 3. Understanding-based searches: Systems incorporating understanding-based searches attempt to understand the actual meaning of a user's request, including social and background information. An example of such a system is Wilensky's UNIX-based Help system, UC. UC had built into it a simple understanding of a user's global goals. Wilensky explained that a consequence of not having such a deep understanding was that the system might offer advice which literally addressed the user's immediate question in a way that conflicted with the user's global goals. A specific example is that a request for more disk space might result in the removal of all the user's files—an action that met the immediate request, but probably not in a way that the user would find appropriate.
- Understanding-based systems are generally confined to conversational partners, help systems, and simple translation programs. In general, it should be noted that the underlying application is quite trivial; in fact, the interface is the application. Various specialized systems have also been built, to parse specific classes of documents. A good example is Junglee's resume-parser. Researchers in this area have now largely abandoned this approach. Indeed, the academic consensus is that full understanding is “AI-complete”: a problem that requires a human's full contextual and social understanding.
- There have been multiple previous attempts to use natural language as a tool for controlling search and computer programs. One example of these is Terry Winograd's “Planner” system, which was described in his 1972 doctoral thesis. Winograd developed an abstract domain for his program, called the “Blocks World.” The domain consisted of a set of abstract three-dimensional solids, called “blocks,” and a set of “places” on which the blocks could rest. Various blocks could also rest on top of other blocks. Planner would accept a variety of natural language commands corresponding to the desired states of the system (e.g., “Put the pyramid on top of the small cube”), and would then execute the appropriate actions to achieve the desired state of the system. Winograd's system accepted only a highly stylized form of English, and its natural-language abilities were entirely restricted to the blocks' domain. The emphasis in the system was on deducing the appropriate sequence of actions to achieve the desired goal, not on the understanding and parsing of unrestricted English.
- A variety of programs emerged in the 1980's to permit English-language queries over databases. EasyAsk offers a representative program. In this system, the organization or schema of the database is used as a framework for the questions to be asked. The tables of the database are regarded as the objects of the application, the columns their attributes, and the vocabulary for each attribute the words within the column. Words that do not appear within the columns, including particularly prepositions, are regarded as “noise” words and discarded in query processing.
- Such understanding-based systems have a variety of problems, including the following:
- Ignored vital relationships: Database schemas are designed for rapid processing of database queries, not semantic information regarding the databases. Relationships between database tables are indicated by importing indicators from one table into another (called “foreign keys”). Using the relationships in the schema as a framework for questions ignores some vital relationships (since the relationship is not explicitly indicated by key importation).
- Lost semantic information: Prepositions and other “noise” words often carry significant semantic information, which is context-dependent. For example, in a database for books, authors, and publishers, the preposition “by” may indicate either a publisher or an author, and may indicate the act of publishing or authoring a book.
- In addition to the problems described above with respect to some of the different approaches that currently exist for retrieving data, all of the above approaches share the limitation that the Natural Language (“NL”) interface for each application must be handcrafted; there is no separation between the NL parser and interface, and the application itself. Further, development of the interface often consumes more effort than that devoted to the application itself. None of the currently existing approaches to NL interfaces is portable across applications and platforms. There is no NL toolkit analogous to the Windows API/Java AWT for GUIs, nor a concrete method for mapping constructs in NL to constructs in software programs.
- Thus, there exists a need for a system and method for creating structured parametric data from plain text, both for purposes of content synthesis and for purposes of data retrieval. Further, such a system should be portable across applications and platforms. In addition, such a system should be able to support searches on any relevant criteria which may be of interest to a web-site's visitors, and by any arbitrary range of values on any parameter. Further, there exists a need for a system which updates seamlessly, invisibly, and rapidly when a website adds or modifies the products it offers.
- The present invention provides a system, method, and an architecture for receiving unstructured text, and converting it to structured data. In one embodiment, this is done by mapping the grammatical parse of a sentence into an instance tree of application domain objects. In addition, the present invention is portable across different application domains.
- A system in accordance with the present invention can be used for creating structured data from plain text, to allow for the efficient storing of this structured data in a database. For example, from the free text description of a number of products, the structured data (which could be an extracted object and its attributes) can be used to create individual entries in a product database, and thus create content for an ecommerce website or web market. Alternately, or in addition, such a system can be used for creating structured data from a plain text query, and for using this structured data to retrieve relevant data from a database. For example, a user's free text query can be converted to a database query that corresponds to the objects of the database and their attributes. Such a system overcomes the limitations of conventional search engines by accepting free form text, and mapping it accurately into a structured search query.
- The present invention recognizes that understanding natural language is neither required nor desired in generating structured data; rather, what is desired is the ability to map natural language onto program structure. Further, there is a natural relationship between the parse of the sentence as expressed in a parse tree and a component tree in a program. Thus, the natural language sentence is understood as instructions to build a component tree. A content engine takes in a natural language sentence and produces a program component tree. The component tree is then further simplified before it is passed to a program for execution.
- As mentioned above, a system in accordance with the present invention can be used across various applications. In the various embodiments of the present invention, the meaning of a word is dependent only on the application and the role of the word in the sentence. Thus, the definition of a word is largely the province of the application developer. Briefly, words act as identifiers for components. A word in a sentence serves as an identifier for program objects. As discussed above, many words in English or other natural languages have multiple meanings with the meanings dependent upon context. Similarly, for the present invention, a word may be used as an identifier for multiple objects.
- In one embodiment, the present invention transforms an English sentence into a set of software objects that are subsequently passed to the given application for execution. One of the advantages of this approach is the ability to attach a natural language interface to any software application with minimal developer effort. The objects of the application domain are captured, in one embodiment, by using the Natural Markup Language (“NML”). The resulting interface is robust and intuitive, as the user now interacts with an application by entering normal English sentences, which are then executed by the program. In addition, an application enhanced with the present invention significantly augments the functionality available to a user.
- When given a plain text sentence in a natural language, a system in accordance with one embodiment of the present invention performs the following steps:
- (i) A parsing algorithm applies a formal context-free grammar for the natural language to derive all parses of a given sentence. For purposes of discussion, English is used as an example of the natural language of the plain text. However, it is to be noted that the present invention may be used for any natural language. In one embodiment, all parses of the sentence are derived in the time taken to derive a single parse (e.g., concurrently). Preferably, all parses are stored in a single data structure whose size is dramatically smaller than the number of individual parse trees, often just a constant factor larger than the size taken to store a single parse tree. It is to be noted that, in one embodiment, the correct map of a sentence is only known after all possible parses have been attempted.
(ii) A mapping algorithm then uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest based on the application-specific NML model. In other words, the mapping algorithm maps each parse outputted by the parser, into an instance tree of objects. In one embodiment, this is done by generating instance trees, mapping each parse onto an instance tree, pruning the instance trees generated, and then using a best-match algorithm on the pruned trees to select the best match.
(iii) A reduced form of the NML object description instance is created as an instance of a Domain Markup Language (“DML”). This DML is passed to the application program for execution. - The features and advantages described in this summary and the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
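The claim in step (i), that all parses of a sentence can share a single compact structure, can be illustrated with a toy chart parser. The ambiguous grammar S → S S | 'a' (an assumption for illustration, not the English grammar of the patent) admits exponentially many parse trees of the string “aa…a,” yet one O(n²) table of derivation counts summarizes them all without enumerating any tree.

```python
# CYK-style chart for the ambiguous grammar S -> S S | 'a'.
# table[i][j] counts the derivations of the span i..j; the chart has
# O(n^2) cells even though the number of parse trees grows as the
# Catalan numbers.

def count_parses(n):
    """Number of parse trees of 'a' * n under S -> S S | 'a'."""
    table = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        table[i][i + 1] = 1                # S -> 'a'
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):      # S -> S S, split at k
                table[i][j] += table[i][k] * table[k][j]
    return table[0][n]
```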
- FIG. 1 is an illustration of the architecture of a system in accordance with an embodiment of the present invention.
- FIG. 2 is a block diagram of the components of the content engine.
- FIG. 3A is an example of a parse tree for "abb" using a first grammar.
- FIG. 3B is an example of two different parse trees for "abb" using a second grammar.
- FIG. 3C illustrates how various parse trees can be represented as a single parse DAG.
- FIG. 4 is a flowchart illustrating the functionality of the content engine.
- FIG. 5A illustrates one possible parse tree for the sentence "The boy helped the girl with the suitcase."
- FIG. 5B illustrates another possible parse tree for the sentence "The boy helped the girl with the suitcase."
- FIG. 5C illustrates how the different parse trees for the sentence "The boy helped the girl with the suitcase" can be represented as a single parse DAG.
- FIG. 6 is a flowchart illustrating the generation of instance trees by the mapper.
- FIG. 7 illustrates the pruning of invalid instance trees after all instance trees have been generated by the mapper.
- FIG. 8 illustrates a cost function employed by the mapper to pick the best map from the valid instance trees in accordance with an embodiment of the present invention.
- FIG. 9 is a flowchart illustrating DML generation in accordance with one embodiment of the present invention.
- The figures depict a preferred embodiment of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
-
FIG. 1 illustrates an overview of the architecture of a system in accordance with one embodiment of the present invention. The system comprises a content engine 110, an online dictionary 120, a domain dictionary 130, a Natural Markup Language (“NML”) module 140, a vertical domain concepts module 150, a custom client specifications module 160, a grammar storage 170, and a client data module 182. - The
content engine 110 receives as input plain text, parses it, and maps the parses into instance trees. As can be seen from FIG. 1, in one embodiment of the present invention, the content engine 110 receives input from both the online dictionary 120 (which includes words in a natural language) and a domain dictionary 130 (which includes terms specific to a domain). - In addition, the
content engine 110 receives input from the NML module 140, which contains an NML model specific to the application or domain for which the system is being used. The application-specific NML is created, in one embodiment, using a combination of automatic and manual editing from the vertical domain concepts obtained from the vertical domain concepts module 150, and the custom client specifications obtained from the custom client specifications module 160. The present invention is customized to a vertical domain 150 of application by creating an object-oriented data model that represents the intended functionality of the site. An example of the vertical domain concepts 150 is a taxonomy such as the United Nations Standard Product & Services Code (UN/SPSC). Another example of the vertical domain concepts 150 is the set of concepts that are pertinent to financial information for a company, such as company name, location, officers, products, competitors, annual sales, revenues, employees, etc. An example of custom client specifications 160 is a collection of concepts similar to the vertical domain concepts 150, but specific to a web site (i.e., not found on all web sites that may be in the same domain). - In addition, an input to the
content engine 110 is also provided by the grammar storage 170. The grammar storage 170 stores a grammar for a particular language. In one embodiment, the grammar storage 170 stores a full context-free grammar for the English language. An example of such a grammar is included in the computer program listing appendix in file grammar.txt. The grammar shown in grammar.txt has its start symbol as <Paragraph>. The rules indicate that a <Paragraph> is composed of one or more <Sentence> symbols separated by <Terminator>. Similarly, a <Sentence> is composed of a <Clause>, and so on. Grammars are discussed in greater detail below. - The
content engine 110 also has access to a module containing client data 182. This data is used for client-specific or dynamic vocabulary that does not transfer across client sites or applications. Examples of such vocabulary include trade or brand names (e.g., “Explorer”, “Expedition”, or “Excursion” for Ford sport utility vehicles, or the names of confections made by Hershey Foods Company). -
FIG. 2 illustrates the architecture of the content engine 110 in an embodiment of the present invention. As can be seen from FIG. 2, the content engine 110 comprises a parser 210, a mapper 220, and a Domain Markup Language (“DML”) generator 230. - The
parser 210 parses the text input by the user into all possible parses, based on the grammar stored in the grammar storage 170. In one embodiment, the parser 210 applies a formal context-free grammar for the language in which the user is working to derive all parses of a given sentence. In one embodiment, all parses are derived in the time taken to derive a single parse. In a preferred embodiment, all of the parses are stored in a single data structure of size equivalent to that taken to store a single parse tree. The parser 210 may generate meaningless parses, but this is acceptable because, as will be discussed below, these meaningless parses will not yield valid mappings into the NML and will be automatically discarded from consideration during the mapping process. The functionality of the parser 210 is discussed in greater detail below. - The
mapper 220 accesses all the parses of the user's input text produced by the parser 210. The mapper 220, in turn, uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest, based on the application-specific NML model provided by the NML module 140. In other words, the mapper 220 maps each parse outputted by the parser 210 into an instance tree of objects. The functionality of the mapper 220 is discussed in detail below. - In one embodiment, the result of the
mapper 220 is not the final result of the content engine 110. One more step remains: the DML generator 230 reduces the structure produced by the mapper 220 to a simpler form. The generation of the DML is directed, in one embodiment, by DML_ELEMENT declarations contained in the NML model provided by the NML module 140. The result of this process, described in detail below, is to produce a document in the Domain Markup Language (“DML”). The DML description can then be passed as an input to the underlying application (not shown in the figures). In one embodiment, the application takes the DML input and uses it to populate a database, using each instance tree as the description of an entity (and its attributes) in the application domain, and creating the appropriate entries in the database. In another embodiment, the application takes the DML input and uses it as a query on an underlying database, to retrieve entries (e.g., products) that satisfy the query, and hence match the user's interests (to the extent that such interest is well expressed in the original text input). - Before discussing the functionality of an embodiment of a system in accordance with the present invention, it will be helpful to discuss what a grammar is, what NML is, and what DML is.
- 1. Grammar
- Languages, both natural and computer, are described by means of a “grammar.” A grammar is a series of mathematical objects called “productions,” which describe mathematically the well-formed “sentences” of the grammar.
- A simple example of a grammar, “Grammar1” is as follows:
-
- S → AB
- A → aA
- A → a
- B → bB
- B → b
- The symbols “S”, “A”, and “B” are called “non-terminals” or “phrases.” They represent purely abstract objects, which do not appear in any sentence in the language, but represent a group of symbols of a language sentence. The symbols “a” and “b” represent words in the language, and are called “terminals” or “words.” By convention, every grammar has a phrase “S” for “sentence”, which appears alone on the left-hand side of one production. A production is applied by replacing the left-hand side of the production with the right-hand side in a string.
- A sequence α of terminals is said to be derived from a sequence γ of non-terminals and terminals if γ can be transformed into α by applying a succession of productions of the grammar. For example, for Grammar1, “aabb” can be derived from “aAbB” because the rules A → a and B → b, applied to aAbB, yield aabb. A sequence of terminals, or a “sentence,” is said to be in the language of the grammar if it can be derived from the start symbol, S. For example, for Grammar1, the sequence “abb” is in the language of the grammar, because S → AB → aB → abB → abb. Conversely, “abab” is not in the language, since no succession of productions can be used to derive “abab” from S.
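The derivation mechanics above can be sketched in a few lines of Python (an illustration, not part of the patent; the names GRAMMAR1 and derives are hypothetical, and the depth bound is a crude device to keep the toy search finite):

```python
# Grammar1 with explicit arrows: uppercase symbols are non-terminals,
# lowercase symbols are terminals.
GRAMMAR1 = {
    "S": [["A", "B"]],
    "A": [["a", "A"], ["a"]],
    "B": [["b", "B"], ["b"]],
}

def derives(symbols, target, depth=0):
    """Return True if the sequence of symbols can derive the target string."""
    if depth > len(target) + 4:          # crude bound to stop runaway recursion
        return False
    if all(s.islower() for s in symbols):
        return "".join(symbols) == target
    # Expand the leftmost non-terminal with each of its productions.
    i = next(k for k, s in enumerate(symbols) if s.isupper())
    return any(
        derives(symbols[:i] + rhs + symbols[i + 1:], target, depth + 1)
        for rhs in GRAMMAR1[symbols[i]]
    )

print(derives(["S"], "abb"))    # True: S → AB → aB → abB → abb
print(derives(["S"], "abab"))   # False: Grammar1 only generates a+b+
```
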
- In English and other natural languages, the non-terminals and terminals correspond intuitively to the standard grammatical objects learned by a school child. The terminals are simply the words and punctuation symbols of the language; the non-terminals are the standard phrase constructs and word types learned in elementary school: noun, verb, noun phrase, verb phrase, etc. The set of non-terminals in human languages tends to be fairly limited; the set of terminals and the productions vary widely, and in their variance is the rich diversity of human language. In general, any sequence of non-terminals and terminals may appear on either side of a grammar rule. However, grammars which exploit this freedom are computationally intractable. Thus, various restrictions are often placed on the form of the left-hand side and of the productions; these restrictions make parsing the resulting grammars computationally tractable.
- Of particular interest are “context-free grammars,” which are distinguished in that the left-hand side of each production is restricted to be a single non-terminal. Grammar1 given above is context-free. In fact, it is of a slightly more restricted type: “regular”.
- As will be explained in more detail below, the context-free grammar used in one embodiment by the
content engine 110 provides the minimal amount of grammatical information necessary to capture the correct parse of any grammatically correct English sentence. The main intent of the grammar is to capture the correct parse of a sentence without attempting to understand the meaning (or semantics) of the sentence. The grammar is thus created to include every correct parse of every sentence in the language. Naturally, for any single sentence this results in several ambiguous parses, only one of which is the (semantically) correct parse of the given sentence. - One skilled in the art will note that the grammar provided by the
grammar storage 170, in one embodiment, can be substantially compacted from a full grammar of the English language, for brevity. For example, the grammar shown in grammar.txt deliberately ignores grammatical features such as verb conjugation, noun plurality, tense, and active or passive voice. This is acceptable because these features are irrelevant to the parse of a sentence and are only needed if the semantics of a sentence were to be analyzed in detail.
-
FIG. 3A depicts an example of a parse tree for “abb”, using Grammar1 above. For an arbitrary grammar, a parse may not be unique. For example, consider now Grammar2. -
- S → AB
- S → CB
- C → aB
- A → aA
- A → a
- B → bB
- B → b
- Based on Grammar2, the string “abb” would have two distinct parses as depicted by the two separate parse trees shown in
FIG. 3B . - Such a grammar, which can result in multiple parse trees for a string, is said to be “ambiguous.” Most grammars for human languages are ambiguous in this precise technical sense, for the excellent reason that human language is itself ambiguous. For instance, in the sentence “The boy helped the girl with the suitcase,” the modifier “with the suitcase” can either apply to the girl, or to the act of helping. In general, a modifier can modify any part of the sentence. Resolution of ambiguities is an important problem in parsing, and will be discussed below.
- Referring again to
FIG. 3B, it can be noted that, conventionally, different parses result in different parse trees. However, in accordance with an embodiment of the present invention, all parses of a given sentence can be represented as a single parse Directed Acyclic Graph (“DAG”) 300. This is illustrated in FIG. 3C for the sentence “abb”. - The dashed edges 310 of
DAG 300 represent optional parses; selecting a consistent set of them yields a valid parse tree. By examining FIGS. 3B and 3C, it can be seen that the two trees in FIG. 3B have a total of 14 nodes and 12 edges; in contrast, the parse DAG shown in FIG. 3C has a total of only nine nodes and 11 edges. The space and time savings from using the parse DAG are dramatic when there are hundreds or thousands of parses, as is typical for English sentences. The space and time taken to construct the parse DAG are proportional to the number of distinct nodes in the component parse trees, whereas the space and time taken by conventional algorithms are proportional to the total number of nodes of the parse trees.
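The node-sharing idea behind the parse DAG can be illustrated with a small Python sketch (an assumption, not the patent's algorithm): identical subtrees, represented as hashable tuples, are collected once in a set. The sharing here is maximal hashing after the fact, so the toy count below (8) differs slightly from the nine nodes of the DAG in FIG. 3C, which is built directly by the parser:

```python
def count_nodes(tree):
    """Total node count of one parse tree, where a node is (label, children)."""
    label, children = tree
    return 1 + sum(count_nodes(c) for c in children)

def distinct_subtrees(trees):
    """Number of nodes a fully shared parse DAG would need for these trees."""
    seen = set()
    def walk(t):
        if t not in seen:
            seen.add(t)
            for child in t[1]:
                walk(child)
    for t in trees:
        walk(t)
    return len(seen)

# The two Grammar2 parses of "abb" from FIG. 3B, as nested (label, children).
a = ("a", ())
b = ("b", ())
B_b = ("B", (b,))                              # B → b
t1 = ("S", (("A", (a,)), ("B", (b, B_b))))     # S → AB, A → a, B → bB, B → b
t2 = ("S", (("C", (a, B_b)), B_b))             # S → CB, C → aB, B → b

print(count_nodes(t1) + count_nodes(t2))       # 14 nodes in the separate trees
print(distinct_subtrees([t1, t2]))             # 8 nodes after maximal sharing
```
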
- The approach of the present invention is based on describing the set of concepts of a specific application area or domain as a set of objects. Objects are grouped into two fundamental classes:
- (i) Enumerations: These are objects defined by single words or fixed phrases in English over the given domain. A simple example of an Enumeration is the object Color, which is defined by the color words (e.g., red, blue, mauve) of everyday experience.
- (ii) Composites: These are objects defined as collections of sub-objects. The sub-objects of a composite are called its “attributes.” One example of a composite is the object Desk, which can have the attributes PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal), Dimension (e.g., a composite giving the height, width, and depth of the desk), Use (e.g., an enumeration consisting of executive, computer, student, secretary), and various other attributes describing the material, finish, and optional features of the desk.
- NML is a language for declaring objects, enumerations, and the relations between objects. In one embodiment, the NML programmer declares the composites and enumerations of the domain. In one embodiment, NML is based on the Extensible Markup Language (“XML”) standard. It should be noted that the NML description of a domain describes a graph of objects, where the sinks of the graph (the nodes with no outgoing edges) are the Enumerations of the domain.
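The two fundamental object classes can be rendered in Python for illustration (the class names and word lists here are hypothetical, echoing the Color and Desk examples above):

```python
class Enumeration:
    """An object defined by single words or fixed phrases of the domain."""
    def __init__(self, name, words):
        self.name = name
        self.words = {w.lower() for w in words}

    def matches(self, token):
        return token.lower() in self.words

class Composite:
    """An object defined as a collection of sub-objects (its attributes)."""
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes   # list of Enumeration/Composite objects

color = Enumeration("Color", {"red", "blue", "mauve"})
desk_word = Enumeration("PrimitiveDeskWord", {"desk"})
desk = Composite("Desk", [desk_word, color])

print(color.matches("Red"))        # True: "red" is a color word
print(desk.attributes[0].name)     # PrimitiveDeskWord
```

Note that the sinks of the object graph, the Enumerations, are the only objects that touch raw words; Composites only group other objects.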
- As discussed above with reference to
FIG. 1, the NML module 140 provides an application-specific NML to the content engine 110. NML is a tool for describing, to the content engine 110, an application's object hierarchy and the vocabulary by which the hierarchy is referenced in natural language. Because the meanings of words themselves are not relevant to the actual implementation of a system, the present invention can be used for various different applications. An NML document may be created for each application, and, typically, a small special-purpose markup language for the domain itself may be created. The markup language and the NML document are strongly related. An NML document captures the concepts of an application domain, and the markup language is designed to hold the values for those concepts for a particular query.
-
<COMPANY_PROFILE_API>
  <API_COMPANY_PERSON>
    <PERSON_FULL_NAME GET_OPERATOR="value"/>
    <COMPANY_NAME SET_VALUE="microsoft"/>
    <LOCATION>
      <COUNTRY SET_VALUE="uk"/>
    </LOCATION>
    <PERSON_TITLE SET_VALUE="boss"/>
    <DIVISION SET_VALUE="human resource"/>
  </API_COMPANY_PERSON>
</COMPANY_PROFILE_API>
- In this example, it will be seen that the morphology and, in some cases, the actual words of the query have been eliminated; rather, the concepts and values have been inserted in the document, along with an indication of whether the user query requested or set each specific value. In this case, the person's full name was requested, and the identifying information given was the company he worked for, the country he worked in, his conceptual title (“boss”), and his division (“human resources”). This is sufficient information to run a query to satisfy the user's request, but all knowledge of the actual English he used in stating his query (and all requirements to parse it) have been eliminated.
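Because the markup document above is ordinary XML, it can be consumed with any standard XML parser; the following Python sketch (illustrative only, not part of the patent) recovers the set values and the requested concept:

```python
import xml.etree.ElementTree as ET

# The CompanyProfileAPI document from the example above.
DML = """<COMPANY_PROFILE_API>
  <API_COMPANY_PERSON>
    <PERSON_FULL_NAME GET_OPERATOR="value"/>
    <COMPANY_NAME SET_VALUE="microsoft"/>
    <LOCATION><COUNTRY SET_VALUE="uk"/></LOCATION>
    <PERSON_TITLE SET_VALUE="boss"/>
    <DIVISION SET_VALUE="human resource"/>
  </API_COMPANY_PERSON>
</COMPANY_PROFILE_API>"""

root = ET.fromstring(DML)
# Concepts whose values the query supplied (SET_VALUE)...
set_values = {el.tag: el.get("SET_VALUE")
              for el in root.iter() if el.get("SET_VALUE") is not None}
# ...and concepts whose values the query requested (GET_OPERATOR).
requested = [el.tag for el in root.iter() if el.get("GET_OPERATOR") == "value"]

print(set_values["COMPANY_NAME"])   # microsoft
print(requested)                    # ['PERSON_FULL_NAME']
```
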
- As mentioned above, in one embodiment of the present invention, NML is an extension of the eXtensible Markup Language (XML). Briefly, XML is the core of all tag-based markup languages. It is almost never used standalone, but is configured into an application-specific tag-based markup language. Examples of such languages are the Mathematical Markup Language, MML, and Commerce One's product interchange language.
- An XML document consists of a set of “elements.” An element is a chunk of a document contained between an HTML-style tag and its matching closing tag. Unlike HTML, however, XML has no built-in tags; rather, the set of tags for a specific document is defined by its Document Type Definition, or DTD. The distinction between two separate XML extension languages is, simply, their DTDs.
- Let us introduce NML with a “Hello, world” program. Unlike most programming languages, however, NML isn't good for printing “hello, world”; rather, it's good for recognizing “hello, world”. The program which recognizes “hello, world” appears below in Program1.
-
<?xml version="1.0"?>
<!DOCTYPE NML_MODEL >
<NML_MODEL DOMAIN="HelloWorld1">
  <COMMENT> This file shows the simplest Hello, World example </COMMENT>
  <ENUMERATION NAME="HelloWorld">
    <IDENTIFIER LITERAL="Hello, World"/>
  </ENUMERATION>
</NML_MODEL>
- Program1 above is extremely simple; it just recognizes an object indexed by the string “hello, world”, and maps it to the object “HelloWorld.” The IDENTIFIER element within the ENUMERATION element indicates that the LITERAL argument, when it occurs in the text, creates an instance of the relevant ENUMERATION. Thus, the phrase “hello, world” creates an instance of the HelloWorld object, and that exact phrase is mapped. This program, while simple, recognizes only the exact phrase “hello, world” with various capitalizations. A simple program which recognized only this exact phrase would have served as well, and been far simpler to write. However, in NML, a program which recognizes much more is almost as easy to write. This is shown in the next example, Program2.
-
<?xml version="1.0"?>
<!DOCTYPE NML_MODEL >
<NML_MODEL DOMAIN="HelloWorld2">
  <COMMENT> This file shows a non-working Hello, World example </COMMENT>
  <OBJECT NAME="HelloWorld">
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="Greeting"/>
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="World"/>
  </OBJECT>
  <ENUMERATION NAME="Greeting">
    <IDENTIFIER LITERAL="hello"/>
    <IDENTIFIER LITERAL="hi"/>
    <IDENTIFIER LITERAL="greeting"/>
    <IDENTIFIER LITERAL="good morning"/>
    <IDENTIFIER LITERAL="good afternoon"/>
  </ENUMERATION>
  <ENUMERATION NAME="World">
    <IDENTIFIER LITERAL="world"/>
    <IDENTIFIER LITERAL="everyone"/>
    <IDENTIFIER LITERAL="everybody"/>
  </ENUMERATION>
</NML_MODEL>
- Program2 above declares an object HelloWorld with two sub-objects, or ATTRIBUTEs: Greeting and World. Greeting is indexed by the literals “hello”, “hi”, “greeting”, “good morning”, and “good afternoon”; World by “world”, “everyone”, and “everybody”. The MIN=1 argument to both ATTRIBUTEs indicates that any object of type HelloWorld must have both a Greeting and a World ATTRIBUTE. The sentence “Hello”, for example, will not match, because the World ATTRIBUTE would be missing. Similarly, MAX=1 indicates that only one ATTRIBUTE of each type can be present: “Hello everyone good afternoon” would be unmapped, since two Greeting objects would be created to be sub-objects of HelloWorld.
- Program2, when implemented by the
content engine 110, is designed to recognize the following phrases. -
Hello, world; Hi, world; Good morning, world; Good afternoon, world
Hello, everyone; Hi, everyone; Good morning, everyone; Good afternoon, everyone
Hello, everybody; Hi, everybody; Good morning, everybody; Good afternoon, everybody
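The MIN/MAX matching behavior described for these phrases can be sketched as a toy recognizer (a hypothetical illustration, not the content engine's mapper; the greedy handling of two-word greetings is an assumption):

```python
GREETING = {"hello", "hi", "greeting", "good morning", "good afternoon"}
WORLD = {"world", "everyone", "everybody"}

def map_hello_world(phrase):
    """Return the (Greeting, World) counts if the phrase maps, else None."""
    text = phrase.lower().replace(",", "")
    greetings = worlds = 0
    # Consume the two-word greetings first (a simplifying assumption).
    for two_word in ("good morning", "good afternoon"):
        while two_word in text:
            greetings += 1
            text = text.replace(two_word, "", 1)
    for tok in text.split():
        if tok in GREETING:
            greetings += 1
        elif tok in WORLD:
            worlds += 1
        else:
            return None                    # an unmapped word blocks the map
    # MIN=1 and MAX=1 on both attributes: exactly one of each.
    return (greetings, worlds) if (greetings, worlds) == (1, 1) else None

print(map_hello_world("Good morning, everyone"))         # (1, 1): mapped
print(map_hello_world("Hello"))                          # None: World missing
print(map_hello_world("Hello everyone good afternoon"))  # None: two Greetings
```
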
-
<NML_MODEL DOMAIN="HelloWorld2">
  <COMMENT> This file shows a working Hello, World example </COMMENT>
  <OBJECT NAME="HelloWorld">
    <ATTRIBUTE MIN="1" MAX="1" INFER="false" ID="Greeting"/>
    <ATTRIBUTE MIN="1" MAX="1" INFER="true" ID="World"/>
  </OBJECT>
  <ENUMERATION NAME="Greeting">
    <IDENTIFIER LITERAL="hello"/>
    <IDENTIFIER LITERAL="hi"/>
    <IDENTIFIER LITERAL="greeting"/>
    <IDENTIFIER LITERAL="good morning"/>
    <IDENTIFIER LITERAL="good afternoon"/>
  </ENUMERATION>
  <ENUMERATION NAME="World">
    <IDENTIFIER LITERAL="world"/>
    <IDENTIFIER LITERAL="everyone"/>
    <IDENTIFIER LITERAL="everybody"/>
  </ENUMERATION>
</NML_MODEL>
- The difference in behavior between Program2 and Program3 is due to one other factor: in Program3, all nouns and verbs in a sentence must be matched in a tree rooted in a single object, or the sentence as a whole is not considered mapped.
- As mentioned above, NML is the means by which the application developer describes the structure of his application to the
content engine 110. In many ways, it is equivalent to defining an Application Programming Interface (API) for the application, with a key property, in one embodiment, that the “application programmer” in this case is a user speaking a specific language (e.g., English). Thus, the API is very simple: it encapsulates only those objects and attributes which a user can create with a single English sentence and which would be expected to be known by users of the application. For example, in a furniture catalog, the NML would describe objects such as Desk, which can have attributes such as PrimitiveDeskWord (e.g., the enumerated object consisting of the word desk and its synonyms), and PedestalType (e.g., a composite describing whether this desk has a right, left, or double pedestal). -
- The easiest way to look at NML is to start with its document type definition (DTD) given below.
-
<!DOCTYPE NML_MODEL [
<!ELEMENT NML_MODEL (COMMENT?,IMPORT*,(OBJECT|ENUMERATION|CALLBACK|PATTERN|COMMENT|DML_CALL)*)>
<!ATTLIST NML_MODEL
    DOMAIN CDATA #REQUIRED
    GENERATE_PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT IMPORT EMPTY>
<!ATTLIST IMPORT FILE CDATA #REQUIRED>
<!ELEMENT OBJECT (COMMENT?,ATTRIBUTE*)>
<!ATTLIST OBJECT
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) "true"
    SINGLETON (true | false | TRUE | FALSE | True | False) "false"
    ROOT (true | false | TRUE | FALSE | True | False) "false"
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT ENUMERATION (COMMENT?,IDENTIFIER*)>
<!ATTLIST ENUMERATION
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) "true"
    ROOT (true | false | TRUE | FALSE | True | False) "false"
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT COMMENT ANY>
<!ELEMENT IDENTIFIER EMPTY>
<!ATTLIST IDENTIFIER
    MAP CDATA #IMPLIED
    LITERAL CDATA #REQUIRED
    UNKNOWN (true | false | TRUE | FALSE | True | False) "false"
    TYPE (Interrogative | Adjective | Verb | Noun | Adverb | Pronoun | Preposition | Literal) #REQUIRED>
<!-- An ATTRIBUTE can be an OBJECT, ENUMERATION, OR CALLBACK -->
<!ELEMENT ATTRIBUTE EMPTY>
<!ATTLIST ATTRIBUTE
    INFER (true | false | TRUE | FALSE | True | False) "false"
    MIN (0 | 1 | 2) "0"
    MAX (1 | 2 | many) "many"
    ID CDATA #REQUIRED
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT CALLBACK EMPTY>
<!ATTLIST CALLBACK
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) "true"
    ROOT (true | false | TRUE | FALSE | True | False) "false"
    CLASS CDATA #REQUIRED
    TOKENIZER CDATA #REQUIRED
    MAPPER CDATA #REQUIRED
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT PATTERN (REGEXP+)>
<!ATTLIST PATTERN
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) "true"
    ROOT (true | false | TRUE | FALSE | True | False) "false"
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
<!ELEMENT REGEXP EMPTY>
<!ATTLIST REGEXP
    STR CDATA #REQUIRED
    SEP CDATA #IMPLIED>
<!ELEMENT DML_CALL (TRIGGER+)>
<!ATTLIST DML_CALL NAME CDATA #REQUIRED>
<!ELEMENT TRIGGER EMPTY>
<!ATTLIST TRIGGER NAME CDATA #REQUIRED>
]>
- The NML_MODEL element is the root of the NML file. This contains a set of IMPORTs and a set of OBJECTs. The DOMAIN argument to the NML_MODEL element is simply an indication to the
content engine 110 of the name of the particular domain or application being processed by the content engine. - Some elements that can be used in NML are discussed below.
- FILE
- The required FILE argument contains the path of the file to import. A typical NML application contains a small set of custom objects and a much larger set imported from standard libraries. A classic example is the Date package, which recognizes common date phrasings: everything from “the last week of the second quarter before last” to “12/19/98”. In one embodiment, the IMPORT element directs a compiler to import a library from its FILE argument. For example, <IMPORT FILE=“Utils/Date.nml”/> imports the date package. The IMPORT element may look like:
-
<!ELEMENT IMPORT EMPTY>
<!ATTLIST IMPORT FILE CDATA #REQUIRED>
- In an embodiment of the present invention, the COMMENT element is used to denote an NML comment (as opposed to a general XML comment), and may be attached to the model as a whole or to any single object. The COMMENT element may look like:
-
- <!ELEMENT COMMENT ANY>
- OBJECT
- The OBJECT element is the heart of NML. It may look like:
-
<!ELEMENT OBJECT (COMMENT?,ATTRIBUTE*)>
<!ATTLIST OBJECT
    NAME CDATA #REQUIRED
    EXPR (true | false | TRUE | FALSE | True | False) "true"
    SINGLETON (true | false | TRUE | FALSE | True | False) "false"
    ROOT (true | false | TRUE | FALSE | True | False) "false"
    DML_ELEMENT CDATA #IMPLIED
    DML_ATTRIBUTE CDATA #IMPLIED
    DML_VALUE CDATA #IMPLIED
    PEER (true | false | TRUE | FALSE | True | False) "true">
- There are eight arguments to the Object element itself. The first argument, NAME, is required, and gives the name of the Object. All references to the Object, specifically those in ATTRIBUTE elements, are done by the NAME of the Object.
- The second argument, EXPR, refers to the ability of this object to form expressions—phrases involving “and”, “or”, “;”, “/”, or “,”. “Monday or Tuesday”, for example, forms an expression over the Weekday object. Such expressions are always formed over homogenous objects. Thus “Monday or December 23”, for example, would not form an expression over the Weekday object, though they would form an expression over a somewhat more abstract object.
- The PEER argument and the DML_ELEMENT, DML_ATTRIBUTE, and DML_VALUE arguments control DML generation, described below.
- The SINGLETON argument indicates that any instance of this object can take only a single attribute. This is used when an object is, logically, an abstract superclass of several objects, only one of which can be represented. The MAX attribute declaration (see below) is not adequate to control this case, since the MAX attribute declaration controls the number of instances of a single attribute object: this controls the number of attribute objects.
- The ROOT argument indicates whether an instance of this object can be at the root of an instance NML tree. An Object contains an optional comment (see above) and a set of ATTRIBUTES. If OBJECT is analogized to a type in a programming language, ATTRIBUTE is analogous to a member of the type. Reference is by name. The declaration:
-
<OBJECT NAME="HelloWorld">
  <ATTRIBUTE INFER="false" MIN="1" MAX="1" ID="Greeting"/>
indicates that the HelloWorld object has a member of type (object name) Greeting. Note that there is no distinction between attribute name, type name, and member name—all refer simply to the object name of the attribute. -
<!ELEMENT ATTRIBUTE EMPTY>
<!ATTLIST ATTRIBUTE
    INFER (true | false | TRUE | FALSE | True | False) "false"
    MIN (0 | 1 | 2) "0"
    MAX (1 | 2 | many) "many"
    ID CDATA #REQUIRED>
-
public class HelloWorld {
    public Greeting greeting;
    public Everyone everybody;
}
-
<OBJECT NAME="HelloWorld">
  <ATTRIBUTE INFER="false" MIN="1" MAX="1" ID="Greeting"/>
  <ATTRIBUTE INFER="true" MIN="1" MAX="1" ID="Everyone"/>
</OBJECT>
would correspond to: -
public class HelloWorld {
    public Greeting;
    public Everyone;
}
- To see why this is true, consider that the NML Object provides a target for mapping, and that member names distinct from types are only useful when there is more than one object of a specific type as a member. If this were the case in NML, the
content engine 110 would be unable to know which object to map to which attribute. In one embodiment, this problem may be solved by permitting multiple attributes of a specific type, and letting the back end sort out their roles in the sentence. - ATTRIBUTE
- The ATTRIBUTE element is empty, and has the following arguments:
- ID: This argument refers to the object name of the attribute, and must be present. If the name is simple (a single word) it refers to an object in the current NML_MODEL. If it is qualified, it refers to an object from an imported model. Thus, for example, ID=“Date.Date” refers to the Date object of the (imported) Date NML_MODEL. In one embodiment, objects referenced from imported files must use the qualified name, even if there are no conflicts. Thus, for example, even if there were no “Date” objects except in the “Date” NML_MODEL, attribute IDs in any file that imported “Utils/Date.nml” must reference the Date object as “Date.Date”. Qualifications of this form do not reference the directory structure at all: even if “Utils/Date.nml” appeared in the IMPORT declaration, “Date.Date”, not “Utils/Date.Date” would be the attribute ID of the Date object. Finally, qualifications are always single-level: “Utils.Date.Date” is not a valid attribute ID.
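The qualification rules for the ID argument can be sketched as a small resolver (an illustrative assumption, not the compiler's actual code):

```python
def resolve_attribute_id(attr_id, current_model):
    """Resolve an ATTRIBUTE ID to a (model, object) pair.

    A simple name refers to an object in the current NML_MODEL; a qualified
    name such as "Date.Date" refers to an object of an imported model, and
    qualifications are always single-level.
    """
    parts = attr_id.split(".")
    if len(parts) == 1:
        return (current_model, parts[0])
    if len(parts) == 2:
        return (parts[0], parts[1])
    raise ValueError("qualifications are always single-level: " + attr_id)

print(resolve_attribute_id("Greeting", "HelloWorld2"))   # ('HelloWorld2', 'Greeting')
print(resolve_attribute_id("Date.Date", "HelloWorld2"))  # ('Date', 'Date')
```

Note that, as the text states, the qualifier names the imported model ("Date"), not its file path ("Utils/Date.nml").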
- INFER: This argument, when true, instructs the
content engine 110 to immediately build this OBJECT whenever an object of the type named in ID is built. In the example: -
<OBJECT NAME=“HelloWorld”> <ATTRIBUTE INFER=“false” MIN=“1” MAX=“1” ID=“Greeting”/> <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Everyone”/> </OBJECT>
whenever an Everyone object is built, a HelloWorld object containing it as an attribute is also built. By contrast, the creation of a Greeting object does not infer the creation of the HelloWorld object. The default value for INFER is false. - MIN: This argument indicates the minimum number of attributes of this ID that this object must have. In the example, a HelloWorld object must have at least one Greeting attribute and one Everyone attribute. The values of MIN can be 0, 1, or 2, with a default of 0. The set of possible values may be expanded if a need is ever found. Often the minimum cardinality of an object is known. For example, a book must have a title. This can be exploited in the mapping process by deleting objects which do not achieve the minimum cardinality for an attribute.
- MAX: This argument indicates the maximum number of attributes of this ID that this object may have. In the example, a HelloWorld object must have at most one Greeting attribute and one Everyone attribute. The values of MAX can be 1, 2, or many, with a default of many. The set of possible values may be expanded if a need is ever found. Often the maximum cardinality of an object is known. For example, a book must have only one title. This can be exploited in the mapping process by deleting objects which exceed the maximum cardinality for an attribute.
- An extended example using NML is included in the attached appendix on the CD, which is hereby incorporated by reference herein.
- 3. DML
- The NML document produced by the
mapper 220 can, however, be too cumbersome for easy processing. In one embodiment, the mapping algorithm described in detail below creates a node in the NML instance object for each successfully mapped phrase. Some of these phrases have no semantic significance in the sentence. Moreover, many separate phrasings may be used to create the same logical object. Because the NML objects are closely tied to the phrasings used, multiple possible NML objects may denote the same logical object. Further semantic processing of the NML instance is therefore required before the results can be used to populate a database or launch a search query. - Consider an NML model that recognizes an “ElectricalCurrent” object. There are many ways in English to specify a device's electrical current. One can refer to current or amperage; give the value as an English string (“forty-five” or “one hundred and seventy five”) or as a number (45 or 175); attach the units implicitly (“amperage 65”), explicitly (“current 65 amps”), or directly to the value (“65A”); and so on. Each of these variations is captured in an NML model as a separate object; however, an application depends only upon the fact that a current is specified, the units in which it is specified, and its value. In the ideal case, this is captured as an XML element in a document:
-
- <CURRENT UNIT=“Amp” VALUE=“65”/>
- This element is an element of a Domain Markup Language designed for electrical devices. It is automatically extracted from any NML instance corresponding to a text fragment which describes the logical entity “65 amps”.
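For illustration only, the reduction of these phrasings to a single logical form can be sketched with a simple pattern match. The class below is hypothetical (it is not the DML Generator, which works over NML instances rather than raw strings) and covers only a few of the phrasings listed above; it merely shows the normalization target that all the variants share:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: normalize a few current phrasings ("65A",
// "current 65 amps", "amperage 65") into one logical CURRENT element.
public class CurrentNormalizer {
    private static final Pattern P = Pattern.compile(
        "(?:current|amperage)?\\s*(\\d+)\\s*(?:a|amp|amps)?\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns an element like <CURRENT UNIT="Amp" VALUE="65"/>, or null
    // when the text does not describe a current at all.
    public static String toDml(String text) {
        Matcher m = P.matcher(text.trim());
        return m.matches()
            ? "<CURRENT UNIT=\"Amp\" VALUE=\"" + m.group(1) + "\"/>"
            : null;
    }
}
```

All three phrasings above normalize to the same element, which is the point of the DML layer: downstream code sees one logical object regardless of surface form.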
- The Domain Markup Language corresponding to an NML model is specified in the NML model itself, with one specific NML Element and three attribute declarations. These are described here:
-
DML_CALL <!ELEMENT DML_CALL (TRIGGER+)> <!ATTLIST DML_CALL NAME CDATA #REQUIRED> <!ELEMENT TRIGGER EMPTY> <!ATTLIST TRIGGER NAME CDATA #REQUIRED> - This element directs the
DML Generator 230 to begin a new DML instance with a root element whose name is the required attribute of DML_CALL, whenever an NML Element whose name corresponds to a TRIGGER is detected in the NML Instance. For example, -
<DML_CALL NAME=”CURRENT”> <TRIGGER NAME=”SimpleAmperageObject”/> <TRIGGER NAME=”SimpleCurrentObject”/> </DML_CALL> - directs the DML Generator to begin a new DML Instance with root element CURRENT whenever an instance of either a SimpleAmperageObject or a SimpleCurrentObject is detected in the NML Instance.
- The following three attributes attach to any NML OBJECT, ENUMERATION, CALLBACK, PATTERN, or ATTRIBUTE, and control the creation of DML Elements and Attributes and, optionally, the setting of DML Attribute values. They are described below.
- DML_ELEMENT
- This attribute optionally appears with a name (e.g., DML_ELEMENT=“Current”). If absent, the name is assumed to be the NAME of the NML OBJECT, ENUMERATION, PATTERN, or CALLBACK, or the ID of the NML ATTRIBUTE. It directs the creation of a DML Element of type name, whenever the corresponding NML structure is encountered in the NML instance. This differs from DML_CALL in that the DML Element is not created as the root of a new DML structure; rather, the new element is embedded as a subobject of any containing DML Element. This will be explained in more detail, below, when the DML generation algorithm is explicated.
- Examples:
- <OBJECT NAME=“Current” DML_ELEMENT=“CURRENT”>
- Directs the creation of a DML Element named CURRENT whenever an NML Object named Current is encountered in the NML Instance tree. Exactly the same declarations would apply for ENUMERATION, CALLBACK, or PATTERN, with exactly the same effect.
-
<OBJECT NAME=”Current” DML_ELEMENT=”CURRENT”> <ATTRIBUTE ID=”AmpDeclaration” DML_ELEMENT= ”Amperage”.../> - This declaration directs the creation of a DML Element named CURRENT whenever an NML Object named Current is encountered in the NML Instance tree. In addition, if the Current object had an AmpDeclaration subobject, then an Amperage DML_ELEMENT would be created as a sub-element of CURRENT, as can be seen in the following:
-
NML Instance:
<OBJECT NAME=”Current”...>
  <OBJECT NAME=”AmpDeclaration”>
    ...
  </OBJECT>
</OBJECT>
DML Instance:
<CURRENT ...>
  <Amperage ...>
    ...
  </Amperage>
</CURRENT> - DML_ATTRIBUTE
- This attribute optionally appears with a name (e.g., DML_ATTRIBUTE=“Current”). If absent, the name is assumed to be the NAME of the NML OBJECT, ENUMERATION, PATTERN, or CALLBACK, or the ID of the NML ATTRIBUTE. It directs the creation of a DML Attribute of type name, whenever the corresponding NML structure is encountered in the NML instance. The new attribute is attached as an attribute of the nearest containing DML Element, generated either from a DML_CALL or DML_ELEMENT declaration. This will be explained in more detail, below, when the DML generation algorithm is explicated.
- Examples:
-
<ENUMERATION NAME=“VoltWord” DML_ATTRIBUTE=“VoltUnit” > <IDENTIFIER TYPE=“Noun” LITERAL=“gigavolt” UNKNOWN=“false” /> <IDENTIFIER TYPE=“Noun” LITERAL=“kilovolt” UNKNOWN=“false” /> <IDENTIFIER TYPE=“Noun” LITERAL=“megavolt” UNKNOWN=“false” /> <IDENTIFIER TYPE=“Noun” LITERAL=“millivolt” UNKNOWN=“false” /> <IDENTIFIER TYPE=“Noun” LITERAL=“volt” UNKNOWN=“false” /> </ENUMERATION> - The above code directs the creation of a DML Attribute named VoltUnit whenever an NML Object named VoltWord is encountered in the NML Instance tree. The value of the attribute, unless directly specified by a DML_VALUE declaration (see below), is taken to be the literal which generated the VoltWord object, and thus:
-
<ENUMERATION NAME=”VoltWord”> <IDENTIFIER LITERAL=”gigavolt”/> </ENUMERATION>
generates the DML Attribute and value VoltUnit=“gigavolt”. This is attached to the containing DML_ELEMENT, e.g. -
<OBJECT NAME=“Voltage” DML_ELEMENT=“Voltage” > <ATTRIBUTE INFER=“true” MIN=“0” MAX=“1” ID=“VoltWord” /> ... </OBJECT> - Coupled with the VoltWord declaration above, this gives the following NML Instance and DML Instance for the word “gigavolt”:
-
NML Instance:
<OBJECT NAME=”Voltage”...>
  <ENUMERATION NAME=”VoltWord”>
    <IDENTIFIER LITERAL=”gigavolt”/>
  </ENUMERATION>
</OBJECT>
DML Instance:
<Voltage VoltUnit=”gigavolt”...>
  ...
</Voltage> - DML_VALUE
- DML_VALUE is an optional adjunct to DML_ATTRIBUTE, and permits an NML programmer to override the default value assigned to an attribute by the DML Generation procedure. This is most often used when synonyms or multiple phrasings can appear, and a normalized value is desired.
- B. Functionality of the Content Engine
-
FIG. 4 is a flowchart illustrating the functionality of the content engine 110 in accordance with an embodiment of the present invention. As can be seen from FIG. 4, the content engine 110 receives the input 410 and tokenizes it. The parser 210 then creates 420 all the parse trees based on the tokenized input and the grammar from the grammar storage 170. Next, for each parse tree, the mapper 220 generates 430 an instance tree based on the application-domain-specific NML provided by the NML Model Module 140. The mapper 220 then prunes 440 the instance trees and chooses 450 the best map. Finally, the DML generator 230 uses this best map to generate 460 the appropriate DML. These steps are discussed in detail below. - The functionality of the
content engine 110 outlined in FIG. 4 can be used both for content synthesis and for retrieving data. For content synthesis, the input received 410 may, for instance, be a catalog of items (and their descriptions) offered by an e-commerce site. For retrieving data, the input received 410 may, for instance, be a search query by a user. In the case of content synthesis, the DML generated 460 may be used to populate a database, while in the case of data retrieval, the DML generated 460 may be used to search a database that has been previously populated. - The input is tokenized 410 by the
content engine 110. In one embodiment of the present invention, tokens are simply the words in the input text. However, multiple words may sometimes be treated as a single token, for example, the two or more words that form a name such as San Francisco, or New York City. Multiple words that form a compound noun or other concepts such as dates, times, number patterns etc., may also be aggregated into a single token. - 1. Parsing
- Once the input is tokenized 410, the
parser 210 generates parse trees from the tokenized input based on the grammar obtained from the grammar storage 170. In one embodiment, the parser 210 creates all possible parse trees. - The
parser 210 creates parse trees, similar in form to the parse tree (conceptually) created by a compiler from a program. The leaves of this tree are the tokens (or words of the input text); the internal nodes represent phrases and subunits of the sentence, where each node represents the subunit containing all the tokens descended from that node. The root node represents the sentence itself. - To see in detail how this is done, consider the ambiguous sentence “The boy helped the girl with the suitcase.” This sentence leads to two parse trees, which are distinguished by the placement of the prepositional phrase “with the suitcase.” In the first tree, the phrase “with the suitcase” modifies the verb “help.” In the second tree, the phrase modifies the noun “girl.”
FIG. 5A depicts the first tree, while FIG. 5B depicts the second tree. In these descriptions, the boxes mark the recognized grammar symbols such as “SVO” (for Subject-Verb-Object), “NP” (Noun Phrase), and so on. The generating tokens are beneath the lowest-level boxes in the figure. - Consideration of
FIGS. 5A and 5B reveals that the nodes of the trees are the same, and are distinguished only by the edge into the node representing the phrase “with the suitcase.” In the first case, the edge 510 runs from the node representing the verb phrase “helped”; in the second case, the edge 520 runs from the node representing the phrase “the girl.” This aspect leads one to the conclusion that both parse trees can be represented in a single parse Directed Acyclic Graph (“DAG”). The DAG is depicted in FIG. 5C. As can be seen from FIG. 5C, the DAG itself contains exactly the same number of nodes as each of the two component parse trees, and only one more edge than either of the two component parse trees. - The
parser 210 can employ any parsing algorithm. In one embodiment, the parsing algorithm of Cocke-Younger-Kasami may be used. Details of the Cocke-Younger-Kasami algorithm can be found in Introduction to Formal Language Theory, Harrison, M. A., Addison-Wesley, 1978. A sample of the Cocke-Younger-Kasami algorithm is shown below in Tables 12 A-E. While the algorithm shown below provides a single parse of a sentence, it may be modified to generate all parses of the sentence. - The core of this algorithm is an (n+1)×(n+1) table, where “n” is the number of tokens in the parse. The tokens are here denoted a0 . . . an-1, and the table elements T0,0, . . . , Tn,n. The upper half of the table is filled from i,i+1 to n,n in the order given below. The cells just above the diagonal are filled with the grammar nonterminals that directly derive the relevant token. The remaining cells are filled in as follows: Ti,j = Ti,j ∪ { A | A => BC, B ∈ Ti,k, C ∈ Tk,j }, for each k with i < k < j.
- The result of these equations is that, at the completion of the algorithm, Ti,j contains precisely the set of nonterminals which derive the phrase beginning at ai and terminating in aj−1. T0,n then contains the set of nonterminals which derive the entire sentence.
-
for (i = 0; i < n; i++) { t[i][i+1] = {A | A=>ai} } for (d = 2; d <= n; d++) { for (i = 0; i <= n − d; i++) { j = d + i; for (k = i+1; k < j; k++) { t[i][j] = t[i][j] ∪ {A | A=>BC, B ∈ t[i][k], C ∈ t[k][j]}; } } } - It can be seen from the above pseudocode that the running time of this parsing algorithm is proportional to PN^3, where N is the number of words in the sentence and P is the number of distinct parses. The algorithm is shown running on the string aabb, given the grammar below.
-
- S=>AB
- S=>PB
- P=>AS
- A=>a
- B=>b.
- The initial matrix is shown below.
-
T0,1 = {A}    T1,2 = {A}    T2,3 = {B}    T3,4 = {B}    (all other cells are empty) - After the first iteration of the loop with loop variable d, the matrix is:
-
T0,1 = {A}    T1,2 = {A}    T1,3 = {S}    T2,3 = {B}    T3,4 = {B}    (T1,3 = {S} is the only cell added when d = 2) - After the final iteration, the matrix is:
-
T0,1 = {A}    T0,3 = {P}    T0,4 = {S}    T1,2 = {A}    T1,3 = {S}    T2,3 = {B}    T3,4 = {B} - The root of the parse tree is contained in the element T[0][4]—in other words, in the cell in the top-right corner of the matrix, where the start symbol S appears. At this point the parsing algorithm terminates and the correct parses are read from that cell.
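The pseudocode and tables above can be collected into a small runnable recognizer. The following sketch (the class and method names are the editor's, not part of the disclosed embodiment) encodes the example grammar in Chomsky Normal Form and fills the same (n+1)×(n+1) table:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Cocke-Younger-Kasami recognizer for the example grammar:
// S=>AB, S=>PB, P=>AS, A=>a, B=>b.
public class CYK {
    // Binary rules LHS => B C, plus terminal rules encoded as a map.
    static final String[][] BINARY = {
        {"S", "A", "B"}, {"S", "P", "B"}, {"P", "A", "S"}
    };
    static final Map<Character, String> TERMINAL = Map.of('a', "A", 'b', "B");

    // Returns the table t, where t[i][j] is the set of nonterminals deriving
    // tokens i .. j-1; the whole input is recognized when t[0][n] contains S.
    static Set<String>[][] parse(String tokens) {
        int n = tokens.length();
        @SuppressWarnings("unchecked")
        Set<String>[][] t = new Set[n + 1][n + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= n; j++)
                t[i][j] = new HashSet<>();
        for (int i = 0; i < n; i++)              // cells just above the diagonal
            t[i][i + 1].add(TERMINAL.get(tokens.charAt(i)));
        for (int d = 2; d <= n; d++)             // span length
            for (int i = 0; i + d <= n; i++) {
                int j = i + d;
                for (int k = i + 1; k < j; k++)  // split point
                    for (String[] r : BINARY)
                        if (t[i][k].contains(r[1]) && t[k][j].contains(r[2]))
                            t[i][j].add(r[0]);
            }
        return t;
    }

    public static void main(String[] args) {
        // S derives "aabb" via S => PB => ASB => ... => aabb
        System.out.println(parse("aabb")[0][4].contains("S")); // prints "true"
    }
}
```

Running it on the string aabb fills the top-right cell with S, matching the final matrix above.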
- 2. Mapping
- As discussed above, the
mapper 220 generates 430 instance trees for each parse tree based on the application-specific NML provided by the NML module 140. In one embodiment, the mapper 220 then prunes 440 these instance trees to discard invalid and/or incomplete trees. The mapper then chooses 450 the best map. Each of these steps is discussed in detail below. - An object in the instance tree is said to cover a node of the parse tree (equivalently, a node is said to “map” to an object), if the
mapper 220 matches the object to the node, by the rules explained below. The goal of the mapping algorithm is to map a single object to the root node of the tree. In one embodiment, if a single NML instance cannot be obtained for a sentence, the system switches to another mapping mechanism that tries to obtain the best set of disjoint NML instances that cover the entire sentence. There are several different methods to perform a partial map of a sentence. - a) Generation of Instance Trees
- In one embodiment, instance trees are generated by starting out at the leaf (or terminal) nodes of a parse tree. In brief, a terminal node is created for each token. At each terminal node of a parse tree, all enumerated objects are indexed by the terminal word. An inference process is then executed to create inferred objects. The algorithm then moves up the parse tree, generating a new object at a parent node by composing the objects of the child nodes at the node. At each node there is one child node that is predetermined to be the main child of the node. The main child corresponds to the grammatical object that plays the central role in the grammatical structure represented by the node. For a noun phrase, this is the head noun; for a prepositional phrase, it is the prepositional complement; and so on.
- Objects can be generated in several ways. Specifically, objects can be generated by enumeration from identifiers, enumeration from callbacks, and enumeration from patterns. In addition, objects can also be inferred from other objects. Let us consider each of these in turn.
- Enumeration from Identifiers:
- An Enumeration is an object created by the presence of a single word or phrase.
-
<!ELEMENT ENUMERATION (COMMENT?, IDENTIFIER*)> <!ATTLIST ENUMERATION NAME CDATA #REQUIRED EXPR (true | false | TRUE | FALSE | True | False) “true”>
-
<ENUMERATION NAME=“Greeting”> <IDENTIFIER LITERAL=“hello”/> </ENUMERATION>
- The IDENTIFIER element recognizes a single word that forces creation of the object. The specific word is given in the LITERAL argument.
-
<!ELEMENT IDENTIFIER EMPTY> <!ATTLIST IDENTIFIER MAP CDATA #IMPLIED LITERAL CDATA #REQUIRED UNKNOWN (true | false | TRUE | FALSE | True | False) “false” TYPE (Any | Adjective | Verb | Noun | Adverb | Pronoun | Preposition) “Any”> - The IDENTIFIER element has no substructure, and can take the following arguments, listed below:
- LITERAL: This argument gives the literal string that maps to the object. In general, only the root of a specific verb or noun should appear in the literal argument; the Content Engine will recognize and map tenses, declensions, and all derivative forms of verbs and nouns. For example, <IDENTIFIER LITERAL=“have”> will map “has”, “had”, “having”, “has had”, and so on, and <IDENTIFIER LITERAL=“woman”> will map “women”, “women's”, “womanly”, and so on. LITERAL is the only required argument of IDENTIFIER, and will often be the only argument.
- MAP: Occasionally, synonyms are used to indicate a single object, and the semantic processing of the object is independent of which synonym is used. A good example is “stock” and “security”. In this case, the back-end code can be simplified if the synonyms are reduced to a single canonical case. MAP does this. If MAP appears, then the recognized literal will be mapped to the string that is given as the argument to MAP. The default value for MAP is the value of the LITERAL argument.
- TYPE: This restricts the mapping to the particular part of speech given as the argument. Often, words can assume several different parts of speech. For example, the word “green” is a noun (denoting a patch of grassy land or a color), an adjective, or a verb. It is often desired to restrict an IDENTIFIER to only one of these roles. If Verb is given as the value of TYPE, then only verbs will map to this particular identifier. The default value, Any, maps any part of speech to this IDENTIFIER.
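For illustration, the combined effect of the LITERAL and MAP arguments can be sketched as a lookup table. The class below is hypothetical (it is not the Content Engine's implementation) and assumes that root-form reduction of tenses and plurals has already been applied to the token:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of IDENTIFIER matching with MAP canonicalization.
public class IdentifierTable {
    private final Map<String, String> canonical = new HashMap<>();

    // literal: the LITERAL argument; map: the MAP argument, or null for the
    // default behavior (MAP defaults to the value of LITERAL).
    public void declare(String literal, String map) {
        canonical.put(literal, map == null ? literal : map);
    }

    // Returns the canonical string for a root-form token, or null when no
    // IDENTIFIER recognizes the token.
    public String map(String rootForm) {
        return canonical.get(rootForm);
    }
}
```

With declarations for the "stock"/"security" synonym pair above, both tokens reduce to the single canonical form, which is what simplifies the back-end code.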
- Enumeration from Callbacks:
- Another way in which objects can be created is from Callbacks. The CALLBACK element functions in a fashion similar to ENUMERATION: it is a means for mapping individual tokens in a sentence to OBJECTS. It is designed for the specific case where the set of IDENTIFIERs for a particular OBJECT is very large, changes dynamically, or both.
-
<!ELEMENT CALLBACK EMPTY> <!ATTLIST CALLBACK NAME CDATA #REQUIRED EXPR (true | false | TRUE | FALSE | True | False) “true” CLASS CDATA #REQUIRED PARSER CDATA #REQUIRED MAPPER CDATA #REQUIRED>
-
<CALLBACK NAME=“CompanyFundIndexDbName” EXPR=“False” CLASS=“ecCallback.CompanyFundIndexNameDatabase” PARSER=“isCompanyFundIndexName” MAPPER=“findCompanyFundIndexSymbol”> <COMMENT> Each company, fund, and index name or symbol is obtained via a callback to method that matches the names in a database. </COMMENT> </CALLBACK> - Formally, the CALLBACK element defines a Java class which contains at least two methods: a method which takes a string and returns a boolean (this is named in the PARSER argument), and a method which takes a string and returns another string (this is named in the MAPPER argument). While this was specifically designed with a SQL interface in mind, there is no restriction in the code for this: any Java class having the appropriate methods will do.
- In one embodiment, the CALLBACK element may have no structure, and have the following arguments, all of which are required:
- CLASS This is the name of the fully-qualified Java class containing the two methods referenced above. The Content Engine will call the method <CLASS>.<PARSER>(token); to recognize the token, and <CLASS>.<MAPPER>(token); to map it (in the example above, “ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(token);” for recognition, and “ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(token);” for mapping). Thus, the CLASS must be accessible to the Content Engine from the string as given here using the standard Java class loader methods.
- PARSER This is the name of the method within CLASS called to do the recognition: it should take a single String argument and return a boolean. This functions exactly as the LITERAL argument to IDENTIFIER; Content Engine will pass the root form of the token, not the token itself, to the parser. Thus, the word “Microsoft's”, appearing in a sentence, yields the call “ecCallback.CompanyFundIndexNameDatabase.isCompanyFundIndexName(microsoft)”. When this returns true, the behavior of the compiler is exactly identical to that produced when “microsoft” had appeared in a list of IDENTIFIERs for this OBJECT.
- MAPPER This is the name of the method within CLASS called to map recognized tokens to a canonical form: it should take a String and return a String. This functions exactly as the MAP argument to IDENTIFIER. As with PARSER, Content Engine will pass the root form of the token, not the token itself, to the mapper. To obtain the default behavior of IDENTIFIER, MAPPER should simply return its argument. A richer example is the one cited: ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol returns the symbol associated with the name. So, for example, ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(microsoft) returns “msft”, as does ecCallback.CompanyFundIndexNameDatabase.findCompanyFundIndexSymbol(msft).
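A minimal sketch of a conforming callback class may make the convention concrete. The class and method names below are hypothetical (they are not the ecCallback classes cited above), and an in-memory table stands in for the relational-database lookup:

```java
import java.util.Map;

// Hypothetical callback class following the PARSER/MAPPER convention:
// one boolean-returning recognizer and one String-returning mapper.
public class TickerCallback {
    private static final Map<String, String> SYMBOLS = Map.of(
        "microsoft", "msft",
        "msft", "msft");

    // PARSER-style method: takes a root-form token, returns a boolean.
    public static boolean isCompanyName(String token) {
        return SYMBOLS.containsKey(token);
    }

    // MAPPER-style method: takes a token, returns its canonical symbol.
    public static String findCompanySymbol(String token) {
        return SYMBOLS.get(token);
    }
}
```

Here isCompanyName plays the PARSER role and findCompanySymbol the MAPPER role; both “microsoft” and “msft” reduce to the canonical symbol “msft”, mirroring the behavior described above.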
- In an alternate embodiment,
CALLBACK 520 may be simplified if the Content Engine 110 adopts an interface-based protocol for its callbacks. In this case, the PARSER and MAPPER arguments to CALLBACK will disappear, and the CALLBACK CLASS will be required to implement the Content Engine 110 callback protocol. - Enumeration from Patterns
- A pattern is the third mechanism that is logically equivalent to an enumeration. It is used when a large number of identifiers can be specified by a regular expression. A full description of regular expressions (formally, regular languages) can be found in Aho, Hopcroft, and Ullman, Introduction to Automata and Language Theory, Addison-Wesley, 1979. A simple example is the regular expression for a Social Security Number:
-
[1-9][0-9][0-9]-?[0-9][0-9]-?[0-9][0-9][0-9][0-9] - which indicates that a social security number is any string which begins with a digit between one and 9, followed by two digits between 0 and 9, an optional dash, two digits between 0 and 9, and optional dash, and then four digits between 0 and 9.
- In one embodiment, the
content engine 110 accepts any regular expressions specified by the PERL 5 compiler (see http://www.perldoc.com/perl5.6/pod/perlre.html for the current specification). The regular expressions are captured in the STR argument of the contained REGEXP element. Occasionally, it is useful to specify multiple regular expressions in the same pattern, which are separated by an optional SEP character (space by default). -
<!ELEMENT PATTERN (REGEXP+)> <!ATTLIST PATTERN NAME CDATA #REQUIRED EXPR (true | false | TRUE | FALSE | True | False) “true” ROOT (true | false | TRUE | FALSE | True | False) “false” DML_ELEMENT CDATA #IMPLIED DML_ATTRIBUTE CDATA #IMPLIED DML_VALUE CDATA #IMPLIED PEER (true | false | TRUE | FALSE | True | False) “true”> <!ELEMENT REGEXP EMPTY> <!ATTLIST REGEXP STR CDATA #REQUIRED SEP CDATA #IMPLIED> - Inference:
- Apart from the enumeration techniques discussed above, one more way in which an instance object can be created is by inference. Inference is when the presence of a modifier can imply the existence of an object, even when the object is not explicitly identified. This can occur through ellipsis, or, more commonly, because the underlying object is abstract and is not always (and perhaps never) explicitly identified.
- Consider, for example, the generic object “Weather,” which has attributes “Temperature,” “Precipitation,” “Outlook,” and “Location.” Though such an object may be explicitly identified (as, for example, by the keyword “weather”) it will often not be, as in the question “What is the temperature in San Francisco?” In this case, the request for the “Weather” object is inferred from the request for its attribute “Temperature.”
- Not all attributes infer the presence of a modified object. In the example above, the city San Francisco is a “Location” for “Weather,” but does not infer a “Weather” object. “Temperature,” however, does. A developer declares that a particular attribute infers the existence of the object. In the map, inferred objects are created immediately along with the inferring attribute, along with an “inferred” tag.
- In one embodiment of the present invention, inference is related to type inference in an object-oriented language in a deep and non-obvious fashion. In general, if a type A is a subclass of a type B in an object-oriented language, then every instance of A bears within it an instance of type B. Put better, one can think of A as B with additional properties. Thus, creation of an instance of A forces the creation of an instance of B. In some sense, then, the declaration of a sub-type in a program is a declaration of an inferencing attribute.
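The analogy can be made concrete in Java, the language used for the callback examples above: constructing a subclass instance necessarily constructs its superclass part, just as building an inferring attribute builds the inferred object. The classes below are purely illustrative:

```java
// Illustrative only: a counter shows that building the "inferring"
// subclass forces construction of the "inferred" superclass part.
public class InferenceAnalogy {
    static int weatherBuilt = 0;

    static class Weather {                     // the abstract, inferred object
        Weather() { weatherBuilt++; }
    }

    static class Temperature extends Weather { // the inferring attribute
        Temperature() { /* implicitly invokes Weather() first */ }
    }

    public static void main(String[] args) {
        new Temperature();
        // A Weather instance was built as a side effect of the subclass.
        System.out.println(weatherBuilt);
    }
}
```

Constructing a Temperature increments the Weather counter exactly once, matching the claim that the sub-type declaration acts as an inferencing attribute.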
- In an alternate embodiment, rather than encapsulating the inferencing attribute in a sub-type declaration, the inferencing attribute may directly infer the object. In this embodiment, the attribute can be directly recognized, and the inferred object can be built directly from it.
- As discussed above, the INFER element is an argument of an attribute, which, when true, instructs the
content engine 110 to immediately build the OBJECT whenever an object of the type named in ID is built. In the example: -
<OBJECT NAME=“HelloWorld”> <ATTRIBUTE INFER=“false” MIN=“1” MAX=“1” ID=“Greeting”/> <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Everyone”/> </OBJECT>
whenever an Everyone object is built, a HelloWorld object containing it as an attribute is also built. The default value for INFER is false. - As the objects are created, the “handle” of the instance tree must be adjusted. It may be helpful to define some terminology here. When an English phrase or sentence is parsed, there is always a dominant element. In the case of a subject-verb-object sentence, for example, the dominant element is the verb phrase; in the case of a noun phrase, it is the head noun; in the case of an adjectival phrase, it is the adjective. This element is referred to as the head word or head phrase of the phrase.
- As the
mapper 220 progresses, it creates trees of objects centered on nodes of the parse tree. Such a tree of objects, centered on a node of the parse tree, is said to be a map of the node. The link between a tree of objects and the parse tree is a single object within the map, called the handle of the map. The handle of the map may be thought of as the root of the map of the head phrase of the mapped node in the parse tree. Its role (and how the handle moves during the mapping process) will be explained below. - There is a fundamental equivalence between the object attribute tree in a program and the modifier hierarchy in a parse tree of a sentence. In the parse of a sentence, various words are the anchors of their phrase. For example, in any noun phrase, the noun is the anchor. The other sub-phrases are the modifiers. The anchor of the phrase defines the object in the component tree; the modifiers are attributes of the object. If an object Girl had been declared with identifier “girl” and attribute Carrying with identifier “with”, then the sentence “the boy helped the girl with the suitcase” would have its Object mapped to a component Girl with attribute Carrying. However, if Girl did not have an attribute Carrying then the object would have been mapped to a component Girl.
- The easiest way to see how an object grows by accumulating attributes is to imagine two objects of the same type as composing into a single object by merging their attributes. Consider the following snippet from the HelloWorld programs:
-
<OBJECT NAME=“HelloWorld”> <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Greeting”/> <ATTRIBUTE INFER=“true” MIN=“1” MAX=“1” ID=“Everyone”/> </OBJECT>
- Two objects that are unrelated in the sentence, however, should not compose: unless there is some overlying grammatical link between them, they refer to different semantic entities within the sentence. Consider the sentence “hello, dolly and thanks, everyone.” The HelloWorld objects created by the Greeting object containing “hello” and the Everyone object containing “everyone” should not merge: this would imply that there was a single phrase containing both Greeting and Everyone, and this is false. A second method that might be imagined would have an object adding as attributes only the maps of the modifiers of its head phrase. However, in English the semantic analysis of a sentence often contains inversions of its grammatical structure. For example, in the sentence “Show me the price of Microsoft,” the main semantic object is “the price of Microsoft,” and the verb phrase “Show” is, semantically, a modifier. Nonetheless, in the parse the head phrase is “Show,” which is modified by “the price of Microsoft.”
- The rule used by the
Content Engine 110 is very simple. A map may add as an attribute: - (1) The map of a modifier of its handle; or
- (2) The map of a phrase modified by its handle.
- In case (1), the handle remains unchanged. In case (2), the handle moves to the attribute, so that the handle remains at the map of the head phrase of the parse. Thus, in our example, assume that a Stock object had been created for the phrase “the price of Microsoft”. The handle of this map is the Stock object. “the price of Microsoft” modified the verb “show”, and so under rule (2) the Stock object can add a Show attribute. When it does, the handle of the map moves to the Show attribute of the Stock object. In other words, the root of the map is no longer the handle.
- Occasionally, it's helpful to force the handle to move to the root of the map. This happens when the programmer can guarantee that no further attributes can be added to this map from the modifiers of the head phrase. A good example occurs in the case considered in the previous section, where it is clear that no further modifiers of the verb “show” will become attributes of the root Stock object. In order to permit this, inference moves the handle of the map to the root of the map. An inferred object's handle is always the root of the map.
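For illustration, rules (1) and (2) and the handle movement they imply can be sketched as follows (a hypothetical class, not the Content Engine's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the two attachment rules: under rule (1) the handle
// stays put; under rule (2) it moves to the newly attached attribute, so
// it keeps tracking the map of the head phrase of the parse.
public class MapNode {
    final String name;
    final List<MapNode> attributes = new ArrayList<>();
    MapNode handle = this;                 // a fresh map is its own handle

    MapNode(String name) { this.name = name; }

    // Rule (1): attach the map of a modifier of the handle.
    void attachModifier(MapNode attr) {
        attributes.add(attr);              // handle unchanged
    }

    // Rule (2): attach the map of a phrase modified by the handle.
    void attachModified(MapNode attr) {
        attributes.add(attr);
        handle = attr;                     // handle moves to the attribute
    }

    // Inference: the handle returns to the root of the map.
    void inferAtRoot() { handle = this; }
}
```

In the “Show me the price of Microsoft” example, attachModified adds the Show attribute to the Stock map and moves the handle to it, so the root is no longer the handle; a subsequent inference returns the handle to the root.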
- Details of the Mapping Algorithm
- Further details regarding the
generation 410 of instance trees are outlined in the flowchart depicted in FIG. 6. Based on the application-specific NML obtained from the NML module 140, the mapper 220 starts the generation 410 of instance trees by considering one process node 601. The mapper 220 first determines 602 whether the node it is considering is a leaf node. If the node is determined 602 to be a leaf node, the object array is initialized 604 with generated objects. - Once the object array is initialized 604 by objects generated by enumeration, the
mapper 220 iterates 606-610 over all the objects in the array. For each such existing object, all objects that can be "inferred" from the existing object are added 610 to the object array. "Inference" is the only other way in which instance objects are generated, as described above. Once it is determined 606 that there are no more objects in the array, the object array is returned.
- Referring back to the
determination 602 of whether the node being processed is a leaf node, if the node is not a leaf node, the object array is initialized 614 to empty. The mapper 220 then determines 616 whether all the children of the node have been processed. If all the children of the node have not been processed, the next child node is selected 618 and processed 620. The maps of the child node are copied 622 to the object array, and the root of each copied object is set 624 to the child node. - If all the children of the node have been processed, then the attachment of attributes to objects is performed 626-648. Each object of the array is selected in turn as the object to which to attach attributes. This object is denoted as obj and is indexed by the variable i. Each object of the array is selected in turn using the index j initialized 630 to zero. The object indexed by j is examined 640 and is henceforth referred to as obj1. The goal of steps 640-648 is to determine whether obj1 can be attached as an attribute of obj, and to perform the attachment if it is possible. First, obj is examined 642 to see if it has as an attribute an object whose name is the name of obj1. If this is true, then the second test is performed 644: whether the handle of obj1 modifies the handle of obj. If this is true, then obj1 is attached 646 as an attribute of obj. Following this, or if either of the
tests fails, the index j is incremented and the next object obj1 is examined 648.
- Once the attributes have been attached to obj, the final step is the reassignment of obj's handle, steps 634-636. The handle of obj is set to obj itself if obj has been inferred; if not, the handle of obj is left undisturbed.
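The attachment loop of steps 626-648 and the handle reassignment of steps 634-636 can be sketched as follows. This is a hypothetical Python rendering, not the patented code: the Obj class, the slots field, and the toy modifies relation are invented for illustration.

```python
class Obj:
    def __init__(self, name, slots=(), inferred=False):
        self.name = name
        self.slots = set(slots)   # attribute names this object may accept
        self.attributes = {}
        self.handle = name        # simplified: the handle is a token name
        self.inferred = inferred

def attach_attributes(objects, modifies):
    """modifies(h1, h2) is True when handle h1 grammatically modifies h2."""
    for obj in objects:                        # index i in the flowchart
        for obj1 in objects:                   # index j in the flowchart
            if obj1 is obj:
                continue
            # tests 642 and 644: slot with obj1's name exists, and the
            # handle of obj1 modifies the handle of obj
            if obj1.name in obj.slots and modifies(obj1.handle, obj.handle):
                obj.attributes[obj1.name] = obj1   # attachment, step 646
        if obj.inferred:                       # steps 634-636
            obj.handle = obj                   # inferred handle is the root
    return objects

# Toy grammar relation in which "Show" modifies "Stock"
stock = Obj("Stock", slots=["Show"], inferred=True)
show = Obj("Show")
attach_attributes([stock, show], lambda h1, h2: (h1, h2) == ("Show", "Stock"))
```

After the pass, Stock has gained a Show attribute, and because Stock was marked as inferred, its handle has been reset to its root.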
- b) Pruning of Instance Trees
- In one embodiment, once the instance trees are generated 430, pruning 440 is performed by the
mapper 220 to discard invalid/incomplete instance trees. In one embodiment, for each map, a list of the tokens mapped into the instance tree is recorded; an instance tree which does not map all the verbs and nouns of the sentence is discarded. - An algorithm employed for pruning in one embodiment of the present invention is demonstrated in the flowchart in
FIG. 7. Pruning starts 701 at the root of an instance tree. An array is designated 702 as the array of objects (i.e., components of the instance tree) associated with the root of the parse DAG. The content engine determines 704 whether there are any more objects in the array. As long as there are more objects remaining in the array, obj is assigned 706 the next object in the array. The content engine then determines 708 whether obj covers all nouns and verbs in the sentence. If not, the object is deleted 710 from the array. If obj does cover all nouns and verbs in the sentence, the content engine determines 712 whether the MIN and MAX attributes of the object are satisfied. If they are not satisfied, the object is deleted 710 from the array. If these attributes are satisfied, the content engine loops back to determine 704 whether there are any more objects left in the array. When such determinations have been made for all the objects in the array, the array is returned 714. Thus, only those instance trees that account for all the verbs and nouns of the given sentence, and which satisfy the MIN and MAX attributes, are retained. - In other embodiments, a different algorithm may be used to discard instance trees. In still another embodiment, the step of pruning 440 need not be performed at all.
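The pruning pass of FIG. 7 amounts to a simple filter. The following is a minimal sketch under assumed data shapes (each map represented as a dict with a token set and precomputed MIN/MAX flags); it is not the patented implementation.

```python
def prune(maps, nouns_and_verbs):
    """Keep only maps covering every noun and verb and passing MIN/MAX."""
    kept = []
    for m in maps:                                  # loop 704-712
        if not nouns_and_verbs <= m["tokens"]:      # step 708: coverage test
            continue                                # step 710: discard
        if not (m["min_ok"] and m["max_ok"]):       # step 712: MIN/MAX test
            continue                                # step 710: discard
        kept.append(m)
    return kept                                     # step 714

full = {"tokens": {"show", "price", "Microsoft"}, "min_ok": True, "max_ok": True}
partial = {"tokens": {"price"}, "min_ok": True, "max_ok": True}
kept = prune([full, partial], {"show", "price", "Microsoft"})
```

Only the map covering all nouns and verbs of the sentence survives.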
- c) Choosing the Best Map
- Finally, the instance tree which reflects the best map within the specified domain is chosen 450.
FIG. 8 illustrates how the best map is chosen 450 in one embodiment of the present invention. One skilled in the art will note that the “best” map can be chosen 450 in several other ways. - In the embodiment illustrated in
FIG. 8, a cost function is used to impose a partial order on maps of a sentence. The maps of the sentence which are maximal under this partial order are chosen to be the best maps of the sentence, and returned as the result(s) of the mapping procedure. - The cost function in
FIG. 8 compares two maps (map A and map B), and returns which is the superior map. It consists of a set of eight comparisons 810-880, run in order. The kth comparison in the sequence is used only if the preceding k−1 comparisons have resulted in ties; thus, it is a hierarchy of tiebreakers. These are, in order: - 810: If the number of tokens covered by the two maps is not identical, the superior map is the map covering the most tokens. The reasoning here is straightforward: a better map covers more tokens.
- 820: If #1 does not indicate the better map, choose the map whose topmost expression (maps joined by the words "and" or "or", or the punctuation symbol ",") is furthest from the root of the map. The reasoning here is that a conjunction can bind two phrases of arbitrary size. Consider, for example, the phrase "red feather and gold sheath pen". This phrase is ambiguous: it could refer either to two objects (a red feather and a gold sheath pen) or to a single object (a pen with a red feather and a gold sheath). The two maps would be distinct: the first, two-object map has its expression at the root; the second has it one level down, joining attributes of a single object. This rule resolves in favor of binding phrases at the lower of the possible levels, i.e., conjoining the smaller possible units. In this example, it prefers the second map (a pen with a red feather and a gold sheath) over the first. When a map has no expressions, the distance of an expression from the root is taken to be infinite.
- 830: If the maps are equal under
criteria #1 and #2, choose the map with the least distance between the tokens. In an n-token text fragment, tokens are assigned indices. The leftmost token is assigned index 0, and the token to the immediate right of the token with index i is assigned index i+1. This rule chooses the map with the smallest difference in index between the leftmost and rightmost tokens covered by the map. So, for example, given the phrase "red felt pen tip", with indices red=0, felt=1, pen=2, tip=3, and map A covering "red felt tip" and map B covering "felt pen tip", map B would be chosen as it has the least distance between its covered tokens (3−1=2 compared to 3−0=3). The reasoning here is that compact maps are preferred over disjoined maps.
- 850: If the maps are equal under criteria #1-#4, choose the map with the fewer unused primitives—these are words and phrases in the text fragment unused by the relevant map.
- 860: If the maps are equal under criteria #1-#5, choose the map with the fewer objects created by database lookup.
- 870: If the maps are equal under criteria #1-#6, choose the map with the fewer NML objects.
- 880: If the maps are equal under criteria #1-#7, choose the map with the fewer inferred objects.
- If the maps are equal under all eight criteria, then they are incomparable (and thus equal) under the partial order, and are regarded as equally valid maps.
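Because each criterion is consulted only when all earlier ones tie, the hierarchy of tiebreakers 810-880 can be read as a single lexicographic comparison. The following Python sketch is an interpretation, not the patented code: the metric field names are invented, and per rule 820 a map with no expressions is given an infinite expression distance.

```python
INF = float("inf")

def cost_key(m):
    """Lower tuples are better; each field is one tiebreaker, in order."""
    return (
        -m["tokens_covered"],          # 810: more covered tokens wins
        -m.get("expr_depth", INF),     # 820: expression furthest from root wins
        m["token_span"],               # 830: smaller leftmost-rightmost spread wins
        m["enum_objects"],             # 840: fewer enumeration-created objects
        m["unused_primitives"],        # 850: fewer unused words and phrases
        m["lookup_objects"],           # 860: fewer database-lookup objects
        m["nml_objects"],              # 870: fewer NML objects
        m["inferred_objects"],         # 880: fewer inferred objects
    )

def best_maps(maps):
    """Return every map tied for the minimal key: ties are equally valid."""
    k = min(cost_key(m) for m in maps)
    return [m for m in maps if cost_key(m) == k]

map_a = {"tokens_covered": 4, "token_span": 3, "enum_objects": 1,
         "unused_primitives": 0, "lookup_objects": 0, "nml_objects": 2,
         "inferred_objects": 0}
map_b = dict(map_a, tokens_covered=3)   # covers fewer tokens, loses at 810
```

Returning all maps tied for the minimal key reflects the text's treatment of incomparable maps as equally valid.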
- The different criteria of the cost function illustrated in
FIG. 8 break into three distinct groups. The first group, comprising rules 1, 3, and 5, is based on the structure of the sentence. Maps which use the most tokens, contained in a compact group, are preferred over maps which use fewer tokens spread further over the text segment. Rule 2, as mentioned above, resolves ambiguities with respect to expression phrases in favor of the tightest possible binding. Rules 4 and 6-8 comprise another major group, and act together to prefer maps which have fewer objects. Together, they can be read as preferring maps with less structure over maps with more created structure. - 3. DML Generation
- As discussed above, the data structure produced by the
mapper 220 is an instance of the domain described in the NML document. In one embodiment, this data structure is then used to generate DML. DML Generation is done in a depth-first fashion over the NML Instance tree. FIG. 9 is a flowchart that illustrates the generation 460 of DML.
mapper 220, described above, is a tree of NML object instances with enumerations in the leaves (actually, in general, it is a collection of such trees, since some maps can "tie" for the best map). Each tree is first pruned by removal of nodes that have no peers and whose descendants have no peers: such nodes cannot generate DML_ELEMENTS, DML_ATTRIBUTES, or DML_VALUES. In one embodiment, at each node in the resulting pruned NML instance tree, the following algorithm is performed: -
    proc generateDML(NMLInstanceNode node) {
        set savedElement = current DML_ELEMENT
        set savedAttribute = current DML_ATTRIBUTE
        if (node is a trigger for a DML_CALL) {
            close & output all open DML_ELEMENTS
            set the current DML_ELEMENT to the DML_CALL
        } else if (node has a DML_ELEMENT) {
            set newElement = new DML_ELEMENT with name in declaration
            attach newElement to current DML_ELEMENT
            set current DML_ELEMENT to newElement
        } else if (node has a DML_ATTRIBUTE) {
            set newAttribute = named attribute in declaration
            set current Attribute = newAttribute
        }
        if (node is a leaf) {
            set the value of the current Attribute to the identifying token
        } else if (node has a DML_VALUE) {
            set the value of the current Attribute to the named value
        }
        foreach child of node {
            generateDML(child)
        }
        close any DML_ELEMENT or ATTRIBUTE created by this node
        set current DML_ELEMENT = savedElement
        set current DML_ATTRIBUTE = savedAttribute
        return;
    }

- The generateDML process is called on each root node of each tree, in turn. Once it has completed on a root node, any open DML elements are closed and output.
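The depth-first walk above can be illustrated with a much-simplified, runnable Python rendering. This sketch handles only elements and attribute-bearing leaves; DML_CALL triggers and DML_VALUE handling are omitted, and the dictionary node shape and the chair example are assumptions for demonstration, not the patent's structures.

```python
def generate_dml(node):
    """Depth-first walk: elements wrap their children, leaves emit values."""
    if node.get("element"):
        # a node declaring a DML element opens it, renders its children
        # inside, and closes it on the way back up
        inner = "".join(generate_dml(c) for c in node.get("children", []))
        return f'<{node["element"]}>{inner}</{node["element"]}>'
    if node.get("attribute"):
        # a leaf sets the attribute's value to its identifying token
        return f'<{node["attribute"]}>{node.get("token", "")}</{node["attribute"]}>'
    return "".join(generate_dml(c) for c in node.get("children", []))

tree = {
    "element": "chair",
    "children": [
        {"attribute": "color", "token": "black"},
        {"attribute": "material", "token": "vinyl"},
    ],
}
dml = generate_dml(tree)
```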
- 4. DML Used To Populate DBMSs, Retrieve Data, and Invoke Programs
- Once the DML has been generated, it can be used in a variety of different ways, including populating a database system, retrieving data from a database system or other data store, or invoking a program using the parameters stored in the DML document as parameters to invoke the program. These various applications are illustrated in
FIGS. 10-12. In FIG. 10, a description of a "black vinyl chair" 1030 is converted into a structured description 1060. The description is input into the Content Engine 1020, which produces a DML Document 1040. A DML Processing System 1050 then generates the structured description 1060. It will be obvious to one skilled in the art that the tabular form 1060 is suitable for insertion into any database management system, including but not limited to a relational database management system. - In
FIG. 11, a natural language request for a "black vinyl chair" 1130 is converted into a structured query 1160. The description is fed into the Content Engine 1120, which produces a DML Document 1140. A DML Processing System 1150 then generates the structured query 1160. The structured query here is shown in the database query language SQL. It will be obvious to one skilled in the art that the DML Processing System 1150 could generate a query in any of a number of database languages, and is not restricted to SQL. - It is noted that here the
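A processing step like that of system 1150 can be sketched as a flattening of attribute/value pairs into a parameterized SQL query. The table and column names below are invented for illustration and are not taken from the figures.

```python
def to_sql(table, attrs):
    """Render attribute/value pairs as a parameterized SELECT statement."""
    where = " AND ".join(f"{col} = ?" for col in attrs)
    return f"SELECT * FROM {table} WHERE {where}", list(attrs.values())

# structured description extracted from "black vinyl chair"
query, params = to_sql("chairs", {"color": "black", "material": "vinyl"})
```

Using placeholders rather than interpolated values keeps the generated query safe to execute against any database driver that accepts qmark-style parameters.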
NML model 1010 and the NML model 1110 are identical: the same model is used for both content creation and content query. This illustrates the flexibility and robustness of the present invention. - In
FIG. 12, a natural language request for a stock chart 1230 is converted into a program invocation 1260. The description is fed into the Content Engine 1220, which produces a DML Document 1240. A DML Processing System 1250 then generates the program invocation 1260. The program invocation here is shown as an HTTP CGI request. It will be obvious to one skilled in the art that the DML Processing System 1250 could generate a program invocation in any scripting, web, or API environment, and is not restricted to HTTP requests. - Construction of a DML processing system such as 1050, 1150, or 1250 is site- and application-specific. The major task is traversing the
structured DML document.
- As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. For example, note that the various algorithms are illustrative, and variations are easily implemented. For example, a different cost function could be used to compute the best map, or the pruning step may be left out altogether. Likewise, the particular capitalization or naming of the modules, protocols, features, attributes, data structures, or any other aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names or formats. Further, functionality which is shown or described as being provided by a single module may be provided instead by multiple different modules; likewise, functionality provided by multiple modules may be provided instead by fewer modules or a single module. Further, while a software-based embodiment has been described, the functionality of the invention may be embodied in whole or in part in various hardware elements, such as application specific integrated circuits (ASICs) or the like. The particular examples of NML and DML are illustrative, and not limiting. Indeed, given the flexibility of the invention, it is understood that the NML and DML are not limited to the example domains and applications discussed, but may be applied in numerous other domains and embodiments. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
Claims (21)
1-22. (canceled)
23. A system, comprising:
a processor configured to:
tokenize a plain text description;
create parse trees from the tokenized plain text description based on grammar from a grammar storage area;
generate an instance tree from each parse tree based upon an application domain specific natural markup language provided by a natural markup language model module;
discard each invalid or incomplete instance tree;
choose an instance tree from remaining instance trees representing a best map based upon a cost function; and
process the best map with a domain markup language generator to generate a structured data representation; and
a memory coupled to the processor and configured to provide the processor with instructions.
24. The system of claim 23 wherein the processor is further configured to use the structured data representation to populate a database.
25. The system of claim 23 wherein the processor is further configured to use the structured data representation to query a database.
26. The system of claim 23 wherein the processor is further configured to use the structured data representation to invoke an application.
27. The system of claim 23, wherein the cost function comprises choosing maps with less structure over maps with more created structure.
28. The system of claim 23, wherein the cost function comprises:
choosing maps that use the most tokens contained in compact groups over maps using fewer tokens spread further over text segments;
choosing maps with the tightest possible bindings; and
choosing maps that have fewer objects.
29. The system of claim 23, wherein the cost function comprises:
(a) choosing a map with the most tokens;
(b) if maps are equal under (a), then choosing a map having a topmost expression farthest from a root of the map;
(c) if maps are equal under (a) and (b), then choosing a map with a least distance between tokens;
(d) if maps are equal under (a) through (c), then choosing a map with fewer objects created by enumerations;
(e) if maps are equal under (a) through (d), then choosing a map with fewer unused primitives;
(f) if maps are equal under (a) through (e), then choosing a map with fewer objects created by database lookup;
(g) if maps are equal under (a) through (f), then choosing a map with fewer natural markup language objects;
(h) if maps are equal under (a) through (g), then choosing a map with fewer inferred objects.
30. The system of claim 29, wherein the cost function further comprises:
(i) if maps are equal under (a) through (h), then regarding all maps as equally valid.
31. The system of claim 23, wherein all possible parse trees from the tokenized plain text are created.
32. The system of claim 23, wherein the processor is further configured to represent all of the parse trees in a single directed acyclic graph.
33. The system of claim 23, wherein the grammar from the grammar storage area is context free.
34. A computer program product embodied in a computer readable medium and comprising computer instructions for:
tokenizing a plain text description;
creating parse trees from the tokenized plain text description based on grammar from a grammar storage area;
generating an instance tree from each parse tree based upon an application domain specific natural markup language provided by a natural markup language model module;
discarding each invalid or incomplete instance tree;
choosing an instance tree from remaining instance trees representing a best map based upon a cost function; and
processing the best map with a domain markup language generator to generate a structured data representation.
35. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to populate a database.
36. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to query a database.
37. The computer program product of claim 34 further comprising computer instructions for using the structured data representation to invoke an application.
38. The computer program product of claim 34, wherein the cost function comprises choosing maps with less structure over maps with more created structure.
39. The computer program product of claim 34, wherein the cost function comprises:
choosing maps that use the most tokens contained in compact groups over maps using fewer tokens spread further over text segments;
choosing maps with the tightest possible bindings; and
choosing maps that have fewer objects.
40. The computer program product of claim 34, wherein the cost function comprises:
(a) choosing a map with the most tokens;
(b) if maps are equal under (a), then choosing a map having a topmost expression farthest from a root of the map;
(c) if maps are equal under (a) and (b), then choosing a map with a least distance between tokens;
(d) if maps are equal under (a) through (c), then choosing a map with fewer objects created by enumerations;
(e) if maps are equal under (a) through (d), then choosing a map with fewer unused primitives;
(f) if maps are equal under (a) through (e), then choosing a map with fewer objects created by database lookup;
(g) if maps are equal under (a) through (f), then choosing a map with fewer natural markup language objects;
(h) if maps are equal under (a) through (g), then choosing a map with fewer inferred objects.
41. The computer program product of claim 40, wherein the cost function further comprises:
(i) if maps are equal under (a) through (h), then regarding all maps as equally valid.
42. The computer program product of claim 34, wherein the grammar from the grammar storage area is context free.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/982,386 US20080126080A1 (en) | 2001-01-08 | 2007-10-31 | Creation of structured data from plain text |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/757,075 US6714939B2 (en) | 2001-01-08 | 2001-01-08 | Creation of structured data from plain text |
US10/794,335 US7324936B2 (en) | 2001-01-08 | 2004-03-05 | Creation of structured data from plain text |
US11/982,386 US20080126080A1 (en) | 2001-01-08 | 2007-10-31 | Creation of structured data from plain text |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/794,335 Continuation US7324936B2 (en) | 2001-01-08 | 2004-03-05 | Creation of structured data from plain text |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080126080A1 true US20080126080A1 (en) | 2008-05-29 |
Family
ID=25046249
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/757,075 Expired - Lifetime US6714939B2 (en) | 2001-01-08 | 2001-01-08 | Creation of structured data from plain text |
US10/794,335 Expired - Lifetime US7324936B2 (en) | 2001-01-08 | 2004-03-05 | Creation of structured data from plain text |
US11/982,386 Abandoned US20080126080A1 (en) | 2001-01-08 | 2007-10-31 | Creation of structured data from plain text |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/757,075 Expired - Lifetime US6714939B2 (en) | 2001-01-08 | 2001-01-08 | Creation of structured data from plain text |
US10/794,335 Expired - Lifetime US7324936B2 (en) | 2001-01-08 | 2004-03-05 | Creation of structured data from plain text |
Country Status (6)
Country | Link |
---|---|
US (3) | US6714939B2 (en) |
EP (1) | EP1399842B1 (en) |
AT (1) | ATE334449T1 (en) |
AU (1) | AU2002246981A1 (en) |
DE (1) | DE60213409T2 (en) |
WO (1) | WO2002056196A2 (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070245308A1 (en) * | 2005-12-31 | 2007-10-18 | Hill John E | Flexible XML tagging |
US20070250762A1 (en) * | 2006-04-19 | 2007-10-25 | Apple Computer, Inc. | Context-aware content conversion and interpretation-specific views |
US20090144295A1 (en) * | 2007-11-30 | 2009-06-04 | Business Objects S.A. | Apparatus and method for associating unstructured text with structured data |
US20100088674A1 (en) * | 2008-10-06 | 2010-04-08 | Microsoft Corporation | System and method for recognizing structure in text |
US20100106802A1 (en) * | 2007-02-16 | 2010-04-29 | Alexander Zink | Apparatus and method for generating a data stream and apparatus and method for reading a data stream |
US20120016661A1 (en) * | 2010-07-19 | 2012-01-19 | Eyal Pinkas | System, method and device for intelligent textual conversation system |
US8595182B1 (en) * | 2007-11-07 | 2013-11-26 | Google Inc. | Network file association |
US20140019122A1 (en) * | 2012-07-10 | 2014-01-16 | Robert D. New | Method for Parsing Natural Language Text |
US8671341B1 (en) | 2007-01-05 | 2014-03-11 | Linguastat, Inc. | Systems and methods for identifying claims associated with electronic text |
WO2014074346A1 (en) * | 2012-11-12 | 2014-05-15 | Facebook, Inc. | Grammar model for structured search queries |
US8732036B2 (en) | 2010-05-07 | 2014-05-20 | Ariba, Inc. | Supplier/buyer network that provides catalog updates |
US20140164899A1 (en) * | 2012-12-10 | 2014-06-12 | International Business Machines Corporation | Utilizing classification and text analytics for annotating documents to allow quick scanning |
US8977953B1 (en) * | 2006-01-27 | 2015-03-10 | Linguastat, Inc. | Customizing information by combining pair of annotations from at least two different documents |
US9430571B1 (en) * | 2012-10-24 | 2016-08-30 | Google Inc. | Generating travel queries in response to free text queries |
US9684690B2 (en) | 2011-01-12 | 2017-06-20 | Google Inc. | Flights search |
EP3142028A3 (en) * | 2015-09-11 | 2017-07-12 | Google, Inc. | Handling failures in processing natural language queries through user interactions |
US9779076B2 (en) | 2013-09-04 | 2017-10-03 | International Business Machines Corporation | Utilizing classification and text analytics for optimizing processes in documents |
US10013505B1 (en) * | 2010-01-13 | 2018-07-03 | The Boeing Company | Method, apparatus and computer program product for identifying a target part name within a data record |
US10282444B2 (en) * | 2015-09-11 | 2019-05-07 | Google Llc | Disambiguating join paths for natural language queries |
US10540444B2 (en) | 2017-06-20 | 2020-01-21 | The Boeing Company | Text mining a dataset of electronic documents to discover terms of interest |
US10691584B2 (en) * | 2018-09-28 | 2020-06-23 | Sap Se | Behavior driven development integration with test tool |
US10810368B2 (en) | 2012-07-10 | 2020-10-20 | Robert D. New | Method for parsing natural language text with constituent construction links |
US11068477B1 (en) * | 2018-06-06 | 2021-07-20 | Gbt Travel Servces Uk Limited | Natural language processing with pre-specified SQL queries |
US11249960B2 (en) * | 2018-06-11 | 2022-02-15 | International Business Machines Corporation | Transforming data for a target schema |
EP3968149A1 (en) * | 2020-09-11 | 2022-03-16 | SemaLogic UG | Generation of control rules from schematic control systems |
US20220292143A1 (en) * | 2021-03-11 | 2022-09-15 | Jatin V. Mehta | Dynamic Website Characterization For Search Optimization |
US11651001B2 (en) | 2018-03-14 | 2023-05-16 | The Boeing Company | Unifying terms of interest from a dataset of electronic documents |
Families Citing this family (356)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7062456B1 (en) | 1999-02-09 | 2006-06-13 | The Chase Manhattan Bank | System and method for back office processing of banking transactions using electronic files |
US7451107B1 (en) * | 2000-01-28 | 2008-11-11 | Supply Chain Connect, Llc | Business-to-business electronic commerce clearinghouse |
US7124144B2 (en) * | 2000-03-02 | 2006-10-17 | Actuate Corporation | Method and apparatus for storing semi-structured data in a structured manner |
US7707159B2 (en) * | 2000-03-02 | 2010-04-27 | Actuate Corporation | Method and apparatus for storing semi-structured data in a structured manner |
US7152062B1 (en) | 2000-11-21 | 2006-12-19 | Actuate Corporation | Technique for encapsulating a query definition |
AU2001281023A1 (en) * | 2000-08-01 | 2002-04-08 | Nimble Technology, Inc. | Nested conditional relations (ncr) model and algebra |
US7013308B1 (en) | 2000-11-28 | 2006-03-14 | Semscript Ltd. | Knowledge storage and retrieval system and method |
US6714939B2 (en) * | 2001-01-08 | 2004-03-30 | Softface, Inc. | Creation of structured data from plain text |
WO2002082318A2 (en) * | 2001-02-22 | 2002-10-17 | Volantia Holdings Limited | System and method for extracting information |
JP2002288201A (en) * | 2001-03-23 | 2002-10-04 | Fujitsu Ltd | Question-answer processing method, question-answer processing program, recording medium for the question- answer processing program, and question-answer processor |
EP1246077A1 (en) * | 2001-03-26 | 2002-10-02 | LION Bioscience AG | Method and apparatus for structuring and searching sets of signals |
JP2002288602A (en) * | 2001-03-28 | 2002-10-04 | Nec Corp | Interpretation system using id tag |
US7593920B2 (en) * | 2001-04-04 | 2009-09-22 | West Services, Inc. | System, method, and software for identifying historically related legal opinions |
US7500017B2 (en) * | 2001-04-19 | 2009-03-03 | Microsoft Corporation | Method and system for providing an XML binary format |
US7272594B1 (en) | 2001-05-31 | 2007-09-18 | Autonomy Corporation Ltd. | Method and apparatus to link to a related document |
US7225126B2 (en) | 2001-06-12 | 2007-05-29 | At&T Corp. | System and method for processing speech files |
AUPR645701A0 (en) * | 2001-07-18 | 2001-08-09 | Tralee Investments Ltd | Database adapter |
US20020010715A1 (en) * | 2001-07-26 | 2002-01-24 | Garry Chinn | System and method for browsing using a limited display device |
US6990442B1 (en) * | 2001-07-27 | 2006-01-24 | Nortel Networks Limited | Parsing with controlled tokenization |
US20080300856A1 (en) * | 2001-09-21 | 2008-12-04 | Talkflow Systems, Llc | System and method for structuring information |
US7594172B2 (en) * | 2001-10-10 | 2009-09-22 | Fish Robert D | Data storage using spreadsheet and metatags |
US7013263B1 (en) | 2001-10-25 | 2006-03-14 | Mindfabric, Inc. | Online interaction processing |
US6859810B2 (en) * | 2001-12-10 | 2005-02-22 | Bea Systems, Inc. | Declarative specification and engine for non-isomorphic data mapping |
US7295966B2 (en) * | 2002-01-14 | 2007-11-13 | Microsoft Corporation | System for normalizing a discourse representation structure and normalized data structure |
US7177799B2 (en) * | 2002-01-14 | 2007-02-13 | Microsoft Corporation | Semantic analysis system for interpreting linguistic structures output by a natural language linguistic analysis system |
US20030154208A1 (en) * | 2002-02-14 | 2003-08-14 | Meddak Ltd | Medical data storage system and method |
EP1493118A1 (en) * | 2002-04-10 | 2005-01-05 | Accenture Global Services GmbH | Determination of attributes based on product descriptions |
US7805302B2 (en) * | 2002-05-20 | 2010-09-28 | Microsoft Corporation | Applying a structured language model to information extraction |
US7987246B2 (en) | 2002-05-23 | 2011-07-26 | Jpmorgan Chase Bank | Method and system for client browser update |
US7437664B2 (en) * | 2002-06-18 | 2008-10-14 | Microsoft Corporation | Comparing hierarchically-structured documents |
US7266553B1 (en) | 2002-07-01 | 2007-09-04 | Microsoft Corporation | Content data indexing |
US7082433B2 (en) * | 2002-07-20 | 2006-07-25 | Microsoft Corporation | Translation of object queries involving inheritence |
US20040103199A1 (en) * | 2002-11-22 | 2004-05-27 | Anthony Chao | Method and system for client browser update from a lite cache |
CA2508791A1 (en) * | 2002-12-06 | 2004-06-24 | Attensity Corporation | Systems and methods for providing a mixed data integration service |
US8818793B1 (en) | 2002-12-24 | 2014-08-26 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
US8849648B1 (en) | 2002-12-24 | 2014-09-30 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
CN1512406A (en) * | 2002-12-30 | 2004-07-14 | 国际商业机器公司 | Electronic dictionary facing user, electronic dictionary system and its forming method |
US7305612B2 (en) * | 2003-03-31 | 2007-12-04 | Siemens Corporate Research, Inc. | Systems and methods for automatic form segmentation for raster-based passive electronic documents |
US7668888B2 (en) * | 2003-06-05 | 2010-02-23 | Sap Ag | Converting object structures for search engines |
US7296223B2 (en) * | 2003-06-27 | 2007-11-13 | Xerox Corporation | System and method for structured document authoring |
US8775443B2 (en) * | 2003-08-07 | 2014-07-08 | Sap Ag | Ranking of business objects for search engines |
US7069278B2 (en) | 2003-08-08 | 2006-06-27 | Jpmorgan Chase Bank, N.A. | System for archive integrity management and related methods |
US20050050456A1 (en) * | 2003-08-29 | 2005-03-03 | Dehamer Brian James | Method and apparatus for supporting XML-based service consumption in a web presentation architecture |
US7516139B2 (en) * | 2003-09-19 | 2009-04-07 | Jp Morgan Chase Bank | Processing of tree data structures |
US20050108630A1 (en) * | 2003-11-19 | 2005-05-19 | Wasson Mark D. | Extraction of facts from text |
US20050149538A1 (en) * | 2003-11-20 | 2005-07-07 | Sadanand Singh | Systems and methods for creating and publishing relational data bases |
US8156444B1 (en) | 2003-12-31 | 2012-04-10 | Google Inc. | Systems and methods for determining a user interface attribute |
US7437353B2 (en) * | 2003-12-31 | 2008-10-14 | Google Inc. | Systems and methods for unification of search results |
US20050149498A1 (en) * | 2003-12-31 | 2005-07-07 | Stephen Lawrence | Methods and systems for improving a search ranking using article information |
US7707573B1 (en) | 2003-12-31 | 2010-04-27 | Google Inc. | Systems and methods for providing and installing software |
US8321858B1 (en) | 2003-12-31 | 2012-11-27 | Google Inc. | Systems and methods for providing software updates |
US7281008B1 (en) | 2003-12-31 | 2007-10-09 | Google Inc. | Systems and methods for constructing a query result set |
US8954420B1 (en) | 2003-12-31 | 2015-02-10 | Google Inc. | Methods and systems for improving a search ranking using article information |
US8271651B1 (en) | 2003-12-31 | 2012-09-18 | Google Inc. | Methods and systems for regulating resource usage |
US20050165599A1 (en) * | 2004-01-23 | 2005-07-28 | Dale Russell | Method and apparatus for generating a translation table |
US8037102B2 (en) | 2004-02-09 | 2011-10-11 | Robert T. and Virginia T. Jenkins | Manipulating sets of hierarchical data |
US7958132B2 (en) * | 2004-02-10 | 2011-06-07 | Microsoft Corporation | Voting based scheme for electronic document node reuse |
GB2411014A (en) * | 2004-02-11 | 2005-08-17 | Autonomy Corp Ltd | Automatic searching for relevant information |
US8055553B1 (en) | 2006-01-19 | 2011-11-08 | Verizon Laboratories Inc. | Dynamic comparison text functionality |
WO2005096173A1 (en) * | 2004-03-30 | 2005-10-13 | Victor Company Of Japan, Limited | Digitization service manual generation method and additional data generation method |
US9009153B2 (en) * | 2004-03-31 | 2015-04-14 | Google Inc. | Systems and methods for identifying a named entity |
US8386728B1 (en) | 2004-03-31 | 2013-02-26 | Google Inc. | Methods and systems for prioritizing a crawl |
US8631076B1 (en) | 2004-03-31 | 2014-01-14 | Google Inc. | Methods and systems for associating instant messenger events |
US7333976B1 (en) | 2004-03-31 | 2008-02-19 | Google Inc. | Methods and systems for processing contact information |
US7310633B1 (en) | 2004-03-31 | 2007-12-18 | Google Inc. | Methods and systems for generating textual information |
US7664734B2 (en) * | 2004-03-31 | 2010-02-16 | Google Inc. | Systems and methods for generating multiple implicit search queries |
US8041713B2 (en) | 2004-03-31 | 2011-10-18 | Google Inc. | Systems and methods for analyzing boilerplate |
US8161053B1 (en) | 2004-03-31 | 2012-04-17 | Google Inc. | Methods and systems for eliminating duplicate events |
US7680888B1 (en) | 2004-03-31 | 2010-03-16 | Google Inc. | Methods and systems for processing instant messenger messages |
US20070282797A1 (en) * | 2004-03-31 | 2007-12-06 | Niniane Wang | Systems and methods for refreshing a content display |
US20050223027A1 (en) * | 2004-03-31 | 2005-10-06 | Lawrence Stephen R | Methods and systems for structuring event data in a database for location and retrieval |
US7581227B1 (en) | 2004-03-31 | 2009-08-25 | Google Inc. | Systems and methods of synchronizing indexes |
US7693825B2 (en) | 2004-03-31 | 2010-04-06 | Google Inc. | Systems and methods for ranking implicit search results |
US7707142B1 (en) | 2004-03-31 | 2010-04-27 | Google Inc. | Methods and systems for performing an offline search |
US7580568B1 (en) | 2004-03-31 | 2009-08-25 | Google Inc. | Methods and systems for identifying an image as a representative image for an article |
US8275839B2 (en) | 2004-03-31 | 2012-09-25 | Google Inc. | Methods and systems for processing email messages |
US8099407B2 (en) | 2004-03-31 | 2012-01-17 | Google Inc. | Methods and systems for processing media files |
US20080040315A1 (en) * | 2004-03-31 | 2008-02-14 | Auerbach David B | Systems and methods for generating a user interface |
US8631001B2 (en) | 2004-03-31 | 2014-01-14 | Google Inc. | Systems and methods for weighting a search query result |
US7499958B1 (en) | 2004-03-31 | 2009-03-03 | Google Inc. | Systems and methods of replicating all or part of a data store |
US8595214B1 (en) * | 2004-03-31 | 2013-11-26 | Google Inc. | Systems and methods for article location and retrieval |
US20070276801A1 (en) * | 2004-03-31 | 2007-11-29 | Lawrence Stephen R | Systems and methods for constructing and using a user profile |
US7412708B1 (en) | 2004-03-31 | 2008-08-12 | Google Inc. | Methods and systems for capturing information |
US20080059419A1 (en) * | 2004-03-31 | 2008-03-06 | David Benjamin Auerbach | Systems and methods for providing search results |
US7725508B2 (en) * | 2004-03-31 | 2010-05-25 | Google Inc. | Methods and systems for information capture and retrieval |
US7272601B1 (en) | 2004-03-31 | 2007-09-18 | Google Inc. | Systems and methods for associating a keyword with a user interface area |
US8346777B1 (en) | 2004-03-31 | 2013-01-01 | Google Inc. | Systems and methods for selectively storing event data |
US7941439B1 (en) * | 2004-03-31 | 2011-05-10 | Google Inc. | Methods and systems for information capture |
US9646107B2 (en) | 2004-05-28 | 2017-05-09 | Robert T. and Virginia T. Jenkins as Trustee of the Jenkins Family Trust | Method and/or system for simplifying tree expressions such as for query reduction |
US7788274B1 (en) | 2004-06-30 | 2010-08-31 | Google Inc. | Systems and methods for category-based search |
US8131754B1 (en) | 2004-06-30 | 2012-03-06 | Google Inc. | Systems and methods for determining an article association measure |
US7761439B1 (en) | 2004-06-30 | 2010-07-20 | Google Inc. | Systems and methods for performing a directory search |
US7620632B2 (en) * | 2004-06-30 | 2009-11-17 | Skyler Technology, Inc. | Method and/or system for performing tree matching |
US7882147B2 (en) * | 2004-06-30 | 2011-02-01 | Robert T. and Virginia T. Jenkins | File location naming hierarchy |
US7584103B2 (en) * | 2004-08-20 | 2009-09-01 | Multimodal Technologies, Inc. | Automated extraction of semantic content and generation of a structured document from speech |
US20060095900A1 (en) * | 2004-08-26 | 2006-05-04 | Calpont Corporation | Semantic processor for a hardware database management system |
US7552420B1 (en) * | 2004-09-01 | 2009-06-23 | Intuit Inc. | Externally defined application configuration |
US7624009B2 (en) * | 2004-09-02 | 2009-11-24 | Hewlett-Packard Development Company, L.P. | Method and system for producing variable length context models |
US8782200B2 (en) * | 2004-09-14 | 2014-07-15 | Sitespect, Inc. | System and method for optimizing website visitor actions |
US7853606B1 (en) | 2004-09-14 | 2010-12-14 | Google, Inc. | Alternate methods of displaying search results |
US20060074980A1 (en) * | 2004-09-29 | 2006-04-06 | Sarkar Pte. Ltd. | System for semantically disambiguating text information |
US8051096B1 (en) | 2004-09-30 | 2011-11-01 | Google Inc. | Methods and systems for augmenting a token lexicon |
US7996208B2 (en) * | 2004-09-30 | 2011-08-09 | Google Inc. | Methods and systems for selecting a language for text segmentation |
US7680648B2 (en) * | 2004-09-30 | 2010-03-16 | Google Inc. | Methods and systems for improving text segmentation |
US20060074897A1 (en) * | 2004-10-04 | 2006-04-06 | Fergusson Iain W | System and method for dynamic data masking |
US7627591B2 (en) * | 2004-10-29 | 2009-12-01 | Skyler Technology, Inc. | Method and/or system for manipulating tree expressions |
US7801923B2 (en) * | 2004-10-29 | 2010-09-21 | Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust | Method and/or system for tagging trees |
US7970600B2 (en) * | 2004-11-03 | 2011-06-28 | Microsoft Corporation | Using a first natural language parser to train a second parser |
US7630995B2 (en) | 2004-11-30 | 2009-12-08 | Skyler Technology, Inc. | Method and/or system for transmitting and/or receiving data |
US7636727B2 (en) * | 2004-12-06 | 2009-12-22 | Skyler Technology, Inc. | Enumeration of trees from finite number of nodes |
US8195693B2 (en) | 2004-12-16 | 2012-06-05 | International Business Machines Corporation | Automatic composition of services through semantic attribute matching |
US8316059B1 (en) | 2004-12-30 | 2012-11-20 | Robert T. and Virginia T. Jenkins | Enumeration of rooted partial subtrees |
US8843536B1 (en) | 2004-12-31 | 2014-09-23 | Google Inc. | Methods and systems for providing relevant advertisements or other content for inactive uniform resource locators using search queries |
US8615530B1 (en) | 2005-01-31 | 2013-12-24 | Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust | Method and/or system for tree transformation |
US7681177B2 (en) * | 2005-02-28 | 2010-03-16 | Skyler Technology, Inc. | Method and/or system for transforming between trees and strings |
US8356040B2 (en) | 2005-03-31 | 2013-01-15 | Robert T. and Virginia T. Jenkins | Method and/or system for transforming between trees and arrays |
US7899821B1 (en) | 2005-04-29 | 2011-03-01 | Karl Schiffmann | Manipulation and/or analysis of hierarchical data |
US20060253476A1 (en) * | 2005-05-09 | 2006-11-09 | Roth Mary A | Technique for relationship discovery in schemas using semantic name indexing |
US20060277028A1 (en) * | 2005-06-01 | 2006-12-07 | Microsoft Corporation | Training a statistical parser on noisy data by filtering |
EP1742152B1 (en) * | 2005-07-07 | 2012-09-12 | Texas Instruments Inc. | Method and system for a multi-sharing memory access control |
US7523077B2 (en) * | 2005-07-08 | 2009-04-21 | Sap Aktiengesellschaft | Knowledge repository using configuration and document templates |
AU2006272442B2 (en) * | 2005-07-15 | 2012-05-31 | Think Software Pty Ltd | Method and apparatus for providing structured data for free text messages |
US8666928B2 (en) | 2005-08-01 | 2014-03-04 | Evi Technologies Limited | Knowledge repository |
US8065606B1 (en) | 2005-09-16 | 2011-11-22 | Jpmorgan Chase Bank, N.A. | System and method for automating document generation |
US8688673B2 (en) * | 2005-09-27 | 2014-04-01 | Sarkar Pte Ltd | System for communication and collaboration |
WO2007051227A1 (en) * | 2005-10-31 | 2007-05-10 | Think Software Pty Ltd. | Interacting with a computer-based management system |
US20070157073A1 (en) * | 2005-12-29 | 2007-07-05 | International Business Machines Corporation | Software weaving and merging |
US9262446B1 (en) | 2005-12-29 | 2016-02-16 | Google Inc. | Dynamically ranking entries in a personal data book |
FR2896603B1 (en) * | 2006-01-20 | 2008-05-02 | Thales Sa | Method and device for extracting information from a textual document and transforming it into qualitative data |
US7958164B2 (en) * | 2006-02-16 | 2011-06-07 | Microsoft Corporation | Visual design of annotated regular expression |
US7860881B2 (en) * | 2006-03-09 | 2010-12-28 | Microsoft Corporation | Data parsing with annotated patterns |
US7962328B2 (en) * | 2006-03-13 | 2011-06-14 | Lexikos Corporation | Method and apparatus for generating a compact data structure to identify the meaning of a symbol |
US7765097B1 (en) * | 2006-03-20 | 2010-07-27 | Intuit Inc. | Automatic code generation via natural language processing |
EP1847923A1 (en) * | 2006-04-21 | 2007-10-24 | Microsoft Corporation | Localising unstructured resources |
US20070250613A1 (en) * | 2006-04-25 | 2007-10-25 | Sbc Knowledge Ventures, L.P. | Method and apparatus for configuring a workflow |
US20070250505A1 (en) * | 2006-04-25 | 2007-10-25 | Sbc Knowledge Ventures, L.P. | Method and apparatus for defining a workflow |
US20070250822A1 (en) * | 2006-04-25 | 2007-10-25 | Sbc Knowledge Ventures, L.P. | Method and apparatus for importing content in a user-defined workflow |
US20070260450A1 (en) * | 2006-05-05 | 2007-11-08 | Yudong Sun | Indexing parsed natural language texts for advanced search |
US7970767B2 (en) | 2006-06-05 | 2011-06-28 | Accenture Global Services Limited | Extraction of attributes and values from natural language documents |
US7996440B2 (en) * | 2006-06-05 | 2011-08-09 | Accenture Global Services Limited | Extraction of attributes and values from natural language documents |
US7970746B2 (en) * | 2006-06-13 | 2011-06-28 | Microsoft Corporation | Declarative management framework |
JP5385134B2 (en) * | 2006-06-22 | 2014-01-08 | マルチモーダル・テクノロジーズ・エルエルシー | Computer-implemented method |
US20080016047A1 (en) * | 2006-07-12 | 2008-01-17 | Dettinger Richard D | System and method for creating and populating dynamic, just in time, database tables |
US20080016048A1 (en) * | 2006-07-12 | 2008-01-17 | Dettinger Richard D | Intelligent condition pruning for size minimization of dynamic, just in time tables |
US20080033967A1 (en) * | 2006-07-18 | 2008-02-07 | Ravi Murthy | Semantic aware processing of XML documents |
US20080065370A1 (en) * | 2006-09-11 | 2008-03-13 | Takashi Kimoto | Support apparatus for object-oriented analysis and design |
US8515733B2 (en) * | 2006-10-18 | 2013-08-20 | Calculemus B.V. | Method, device, computer program and computer program product for processing linguistic data in accordance with a formalized natural language |
US8037179B2 (en) * | 2006-11-02 | 2011-10-11 | Storz Endoskop Produktions Gmbh | Device control system employing extensible markup language for defining information resources |
US8104076B1 (en) | 2006-11-13 | 2012-01-24 | Jpmorgan Chase Bank, N.A. | Application access control system |
US20080154853A1 (en) * | 2006-12-22 | 2008-06-26 | International Business Machines Corporation | English-language translation of exact interpretations of keyword queries |
US8285697B1 (en) * | 2007-01-23 | 2012-10-09 | Google Inc. | Feedback enhanced attribute extraction |
US20080250394A1 (en) * | 2007-04-04 | 2008-10-09 | Microsoft Corporation | Synchronizing external documentation with code development |
US7720814B2 (en) * | 2007-04-04 | 2010-05-18 | Microsoft Corporation | Repopulating a database with document content |
US7720885B2 (en) * | 2007-04-04 | 2010-05-18 | Microsoft Corporation | Generating a word-processing document from database content |
JP5256654B2 (en) * | 2007-06-29 | 2013-08-07 | 富士通株式会社 | Sentence division program, sentence division apparatus, and sentence division method |
US8260619B1 (en) | 2008-08-22 | 2012-09-04 | Convergys Cmg Utah, Inc. | Method and system for creating natural language understanding grammars |
US8027941B2 (en) * | 2007-09-14 | 2011-09-27 | Accenture Global Services Limited | Automated classification algorithm comprising at least one input-invariant part |
US8024177B2 (en) * | 2007-09-28 | 2011-09-20 | Cycorp, Inc. | Method of transforming natural language expression into formal language representation |
US8838659B2 (en) | 2007-10-04 | 2014-09-16 | Amazon Technologies, Inc. | Enhanced knowledge repository |
US8364469B2 (en) * | 2007-11-26 | 2013-01-29 | Red Hat, Inc. | Natural language enhanced user interface in a business rule management system |
US20090198488A1 (en) * | 2008-02-05 | 2009-08-06 | Eric Arno Vigen | System and method for analyzing communications using multi-placement hierarchical structures |
US8706477B1 (en) | 2008-04-25 | 2014-04-22 | Softwin Srl Romania | Systems and methods for lexical correspondence linguistic knowledge base creation comprising dependency trees with procedural nodes denoting execute code |
WO2010046902A1 (en) * | 2008-10-22 | 2010-04-29 | Sankhya Technologies Private Limited | System and method for automatically generating sentences of a language |
US8326809B2 (en) * | 2008-10-27 | 2012-12-04 | Sas Institute Inc. | Systems and methods for defining and processing text segmentation rules |
US20100169359A1 (en) * | 2008-12-30 | 2010-07-01 | Barrett Leslie A | System, Method, and Apparatus for Information Extraction of Textual Documents |
US7937386B2 (en) * | 2008-12-30 | 2011-05-03 | Complyon Inc. | System, method, and apparatus for information extraction of textual documents |
US9805089B2 (en) | 2009-02-10 | 2017-10-31 | Amazon Technologies, Inc. | Local business and product search system and method |
US20100211894A1 (en) * | 2009-02-18 | 2010-08-19 | Google Inc. | Identifying Object Using Generative Model |
US8719004B2 (en) * | 2009-03-19 | 2014-05-06 | Ditech Networks, Inc. | Systems and methods for punctuating voicemail transcriptions |
US8762131B1 (en) | 2009-06-17 | 2014-06-24 | Softwin Srl Romania | Systems and methods for managing a complex lexicon comprising multiword expressions and multiword inflection templates |
US8762130B1 (en) | 2009-06-17 | 2014-06-24 | Softwin Srl Romania | Systems and methods for natural language processing including morphological analysis, lemmatizing, spell checking and grammar checking |
US9189475B2 (en) * | 2009-06-22 | 2015-11-17 | Ca, Inc. | Indexing mechanism (nth phrasal index) for advanced leveraging for translation |
US20100325227A1 (en) * | 2009-06-23 | 2010-12-23 | Alon Novy | Systems and methods for composite data message |
US20110202398A1 (en) * | 2010-02-15 | 2011-08-18 | Sarah Photowat | Personal planner with targeted advertising |
US9633121B2 (en) | 2010-04-19 | 2017-04-25 | Facebook, Inc. | Personalizing default search queries on online social networks |
US9110882B2 (en) | 2010-05-14 | 2015-08-18 | Amazon Technologies, Inc. | Extracting structured knowledge from unstructured text |
US9082140B2 (en) * | 2010-06-09 | 2015-07-14 | Ebay Inc. | Systems and methods to extract and utilize textual semantics |
US8655901B1 (en) | 2010-06-23 | 2014-02-18 | Google Inc. | Translation-based query pattern mining |
JP5431261B2 (en) * | 2010-07-23 | 2014-03-05 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information management system, method and program |
US8959102B2 (en) | 2010-10-08 | 2015-02-17 | Mmodal Ip Llc | Structured searching of dynamic structured document corpuses |
CN102467497B (en) * | 2010-10-29 | 2014-11-05 | 国际商业机器公司 | Method and system for text translation in verification program |
US8606564B2 (en) * | 2010-11-01 | 2013-12-10 | Yahoo! Inc. | Extracting rich temporal context for business entities and events |
WO2012061252A2 (en) | 2010-11-04 | 2012-05-10 | Dw Associates, Llc. | Methods and systems for identifying, quantifying, analyzing, and optimizing the level of engagement of components within a defined ecosystem or context |
US9038177B1 (en) | 2010-11-30 | 2015-05-19 | Jpmorgan Chase Bank, N.A. | Method and system for implementing multi-level data fusion |
US9317595B2 (en) | 2010-12-06 | 2016-04-19 | Yahoo! Inc. | Fast title/summary extraction from long descriptions |
US9720899B1 (en) | 2011-01-07 | 2017-08-01 | Narrative Science, Inc. | Automatic generation of narratives from data using communication goals and narrative analytics |
US10185477B1 (en) | 2013-03-15 | 2019-01-22 | Narrative Science Inc. | Method and system for configuring automatic generation of narratives from data |
US10657201B1 (en) | 2011-01-07 | 2020-05-19 | Narrative Science Inc. | Configurable and portable system for generating narratives |
US8620836B2 (en) | 2011-01-10 | 2013-12-31 | Accenture Global Services Limited | Preprocessing of text |
US8504492B2 (en) | 2011-01-10 | 2013-08-06 | Accenture Global Services Limited | Identification of attributes and values using multiple classifiers |
US8996359B2 (en) | 2011-05-18 | 2015-03-31 | Dw Associates, Llc | Taxonomy and application of language analysis and processing |
US8407165B2 (en) * | 2011-06-15 | 2013-03-26 | Ceresis, Llc | Method for parsing, searching and formatting of text input for visual mapping of knowledge information |
US9037529B2 (en) | 2011-06-15 | 2015-05-19 | Ceresis, Llc | Method for generating visual mapping of knowledge information from parsing of text inputs for subjects and predicates |
US8952796B1 (en) | 2011-06-28 | 2015-02-10 | Dw Associates, Llc | Enactive perception device |
US8965882B1 (en) | 2011-07-13 | 2015-02-24 | Google Inc. | Click or skip evaluation of synonym rules |
US9292588B1 (en) | 2011-07-20 | 2016-03-22 | Jpmorgan Chase Bank, N.A. | Safe storing data for disaster recovery |
EP2748711A4 (en) * | 2011-08-26 | 2016-03-02 | Aaron Gerard Franco | Data infrastructure for providing interconnectivity between platforms, devices, and operating systems |
US8909627B1 (en) | 2011-11-30 | 2014-12-09 | Google Inc. | Fake skip evaluation of synonym rules |
US9269353B1 (en) | 2011-12-07 | 2016-02-23 | Manu Rehani | Methods and systems for measuring semantics in communications |
US9152698B1 (en) | 2012-01-03 | 2015-10-06 | Google Inc. | Substitute term identification based on over-represented terms identification |
US8965875B1 (en) | 2012-01-03 | 2015-02-24 | Google Inc. | Removing substitution rules based on user interactions |
US8600973B1 (en) * | 2012-01-03 | 2013-12-03 | Google Inc. | Removing substitution rules |
US9020807B2 (en) | 2012-01-18 | 2015-04-28 | Dw Associates, Llc | Format for displaying text analytics results |
US9667513B1 (en) | 2012-01-24 | 2017-05-30 | Dw Associates, Llc | Real-time autonomous organization |
US9141672B1 (en) | 2012-01-25 | 2015-09-22 | Google Inc. | Click or skip evaluation of query term optionalization rule |
US8935277B2 (en) * | 2012-03-30 | 2015-01-13 | Sap Se | Context-aware question answering system |
US8959103B1 (en) | 2012-05-25 | 2015-02-17 | Google Inc. | Click or skip evaluation of reordering rules |
US20130339399A1 (en) * | 2012-06-18 | 2013-12-19 | Dexter A. Dorris | Dynamic Schema |
US9104720B2 (en) * | 2012-06-28 | 2015-08-11 | International Business Machines Corporation | Generation of technical description of report from functional description of report |
US8935255B2 (en) | 2012-07-27 | 2015-01-13 | Facebook, Inc. | Social static ranking for search |
US20140059051A1 (en) * | 2012-08-22 | 2014-02-27 | Mark William Graves, Jr. | Apparatus and system for an integrated research library |
US8965820B2 (en) | 2012-09-04 | 2015-02-24 | Sap Se | Multivariate transaction classification |
US9323767B2 (en) | 2012-10-01 | 2016-04-26 | Longsand Limited | Performance and scalability in an intelligent data operating layer system |
US9146966B1 (en) | 2012-10-04 | 2015-09-29 | Google Inc. | Click or skip evaluation of proximity rules |
US9201936B2 (en) * | 2012-11-13 | 2015-12-01 | International Business Machines Corporation | Rapid provisioning of information for business analytics |
US9398104B2 (en) | 2012-12-20 | 2016-07-19 | Facebook, Inc. | Ranking test framework for search results on an online social network |
US9703844B2 (en) | 2012-12-31 | 2017-07-11 | Facebook, Inc. | Search result snippets for structured search queries |
US9361363B2 (en) | 2012-12-31 | 2016-06-07 | Facebook, Inc. | Modifying structured search queries on online social networks |
US9367607B2 (en) * | 2012-12-31 | 2016-06-14 | Facebook, Inc. | Natural-language rendering of structured search queries |
DE102013003055A1 (en) * | 2013-02-18 | 2014-08-21 | Nadine Sina Kurz | Method and apparatus for performing natural language searches |
US9223826B2 (en) | 2013-02-25 | 2015-12-29 | Facebook, Inc. | Pushing suggested search queries to mobile devices |
US10540373B1 (en) | 2013-03-04 | 2020-01-21 | Jpmorgan Chase Bank, N.A. | Clause library manager |
US9218568B2 (en) | 2013-03-15 | 2015-12-22 | Business Objects Software Ltd. | Disambiguating data using contextual and historical information |
US9299041B2 (en) | 2013-03-15 | 2016-03-29 | Business Objects Software Ltd. | Obtaining data from unstructured data for a structured data collection |
US9104780B2 (en) | 2013-03-15 | 2015-08-11 | Kamazooie Development Corporation | System and method for natural language processing |
US9262550B2 (en) | 2013-03-15 | 2016-02-16 | Business Objects Software Ltd. | Processing semi-structured data |
US9910887B2 (en) | 2013-04-25 | 2018-03-06 | Facebook, Inc. | Variable search query vertical access |
US9727619B1 (en) | 2013-05-02 | 2017-08-08 | Intelligent Language, LLC | Automated search |
US9330183B2 (en) | 2013-05-08 | 2016-05-03 | Facebook, Inc. | Approximate privacy indexing for search queries on online social networks |
US9223898B2 (en) * | 2013-05-08 | 2015-12-29 | Facebook, Inc. | Filtering suggested structured queries on online social networks |
US10956433B2 (en) | 2013-07-15 | 2021-03-23 | Microsoft Technology Licensing, Llc | Performing an operation relative to tabular data based upon voice input |
US9305322B2 (en) | 2013-07-23 | 2016-04-05 | Facebook, Inc. | Native application testing |
US9336300B2 (en) * | 2014-01-17 | 2016-05-10 | Facebook, Inc. | Client-side search templates for online social networks |
WO2015120603A1 (en) | 2014-02-13 | 2015-08-20 | Sap Ag | Database calculation using parallel-computation in directed acyclic graph |
US9794359B1 (en) | 2014-03-31 | 2017-10-17 | Facebook, Inc. | Implicit contacts in an online social network |
US9798832B1 (en) | 2014-03-31 | 2017-10-24 | Facebook, Inc. | Dynamic ranking of user cards |
US9646055B2 (en) | 2014-04-03 | 2017-05-09 | Facebook, Inc. | Blending search results on online social networks |
US9251139B2 (en) * | 2014-04-08 | 2016-02-02 | TitleFlow LLC | Natural language processing for extracting conveyance graphs |
US11334720B2 (en) * | 2019-04-17 | 2022-05-17 | International Business Machines Corporation | Machine learned sentence span inclusion judgments |
US9606985B2 (en) | 2014-06-13 | 2017-03-28 | Nuance Communications, Inc. | Structured natural language representations |
US10083227B2 (en) * | 2014-08-13 | 2018-09-25 | Sap Se | On-the-fly determination of search areas and queries for database searches |
US10275458B2 (en) * | 2014-08-14 | 2019-04-30 | International Business Machines Corporation | Systematic tuning of text analytic annotators with specialized information |
US10474702B1 (en) | 2014-08-18 | 2019-11-12 | Street Diligence, Inc. | Computer-implemented apparatus and method for providing information concerning a financial instrument |
US11144994B1 (en) | 2014-08-18 | 2021-10-12 | Street Diligence, Inc. | Computer-implemented apparatus and method for providing information concerning a financial instrument |
US10740412B2 (en) | 2014-09-05 | 2020-08-11 | Facebook, Inc. | Pivoting search results on online social networks |
US9507876B2 (en) | 2014-10-06 | 2016-11-29 | Facebook, Inc. | Constructing queries using query filters on online social networks |
US10747823B1 (en) | 2014-10-22 | 2020-08-18 | Narrative Science Inc. | Interactive and conversational data exploration |
US11922344B2 (en) | 2014-10-22 | 2024-03-05 | Narrative Science Llc | Automatic generation of narratives from data using communication goals and narrative analytics |
US11288328B2 (en) | 2014-10-22 | 2022-03-29 | Narrative Science Inc. | Interactive and conversational data exploration |
US11238090B1 (en) | 2015-11-02 | 2022-02-01 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from visualization data |
US9703870B2 (en) | 2014-11-05 | 2017-07-11 | Facebook, Inc. | Social-based optimization of web crawling for online social networks |
US10123846B2 (en) * | 2014-11-13 | 2018-11-13 | Intuitive Surgical Operations, Inc. | User-interface control using master controller |
US10409873B2 (en) | 2014-11-26 | 2019-09-10 | Facebook, Inc. | Searching for content by key-authors on online social networks |
US9679024B2 (en) | 2014-12-01 | 2017-06-13 | Facebook, Inc. | Social-based spelling correction for online social networks |
US10552759B2 (en) | 2014-12-01 | 2020-02-04 | Facebook, Inc. | Iterative classifier training on online social networks |
US9990441B2 (en) | 2014-12-05 | 2018-06-05 | Facebook, Inc. | Suggested keywords for searching content on online social networks |
US10102273B2 (en) | 2014-12-30 | 2018-10-16 | Facebook, Inc. | Suggested queries for locating posts on online social networks |
US9589305B2 (en) * | 2014-12-30 | 2017-03-07 | Facebook, Inc. | Techniques for graph based natural language processing |
US10333696B2 (en) | 2015-01-12 | 2019-06-25 | X-Prime, Inc. | Systems and methods for implementing an efficient, scalable homomorphic transformation of encrypted data with minimal data expansion and improved processing efficiency |
US10061856B2 (en) * | 2015-01-29 | 2018-08-28 | Facebook, Inc. | Multimedia search using reshare text on online social networks |
US10997257B2 (en) | 2015-02-06 | 2021-05-04 | Facebook, Inc. | Aggregating news events on online social networks |
US10846486B2 (en) * | 2015-04-08 | 2020-11-24 | Lisuto Kk | Data transformation system and method |
US10049099B2 (en) | 2015-04-10 | 2018-08-14 | Facebook, Inc. | Spell correction with hidden markov models on online social networks |
US10095683B2 (en) | 2015-04-10 | 2018-10-09 | Facebook, Inc. | Contextual speller models on online social networks |
US10628636B2 (en) | 2015-04-24 | 2020-04-21 | Facebook, Inc. | Live-conversation modules on online social networks |
US10298535B2 (en) | 2015-05-19 | 2019-05-21 | Facebook, Inc. | Civic issues platforms on online social networks |
US10102275B2 (en) | 2015-05-27 | 2018-10-16 | International Business Machines Corporation | User interface for a query answering system |
US9678724B2 (en) * | 2015-05-29 | 2017-06-13 | Intentional Software Corporation | System and method for combining text editing and tree encoding for computer programs |
US10108603B2 (en) * | 2015-06-01 | 2018-10-23 | Nuance Communications, Inc. | Processing natural language text with context-specific linguistic model |
US10496749B2 (en) * | 2015-06-12 | 2019-12-03 | Satyanarayana Krishnamurthy | Unified semantics-focused language processing and zero base knowledge building system |
US10397167B2 (en) | 2015-06-19 | 2019-08-27 | Facebook, Inc. | Live social modules on online social networks |
US10509832B2 (en) | 2015-07-13 | 2019-12-17 | Facebook, Inc. | Generating snippet modules on online social networks |
US10268664B2 (en) | 2015-08-25 | 2019-04-23 | Facebook, Inc. | Embedding links in user-created content on online social networks |
US10810217B2 (en) | 2015-10-07 | 2020-10-20 | Facebook, Inc. | Optionalization and fuzzy search on online social networks |
US11170038B1 (en) | 2015-11-02 | 2021-11-09 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from multiple visualizations |
US11222184B1 (en) | 2015-11-02 | 2022-01-11 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from bar charts |
US11232268B1 (en) | 2015-11-02 | 2022-01-25 | Narrative Science Inc. | Applied artificial intelligence technology for using narrative analytics to automatically generate narratives from line charts |
US10270868B2 (en) | 2015-11-06 | 2019-04-23 | Facebook, Inc. | Ranking of place-entities on online social networks |
US10795936B2 (en) | 2015-11-06 | 2020-10-06 | Facebook, Inc. | Suppressing entity suggestions on online social networks |
US9602965B1 (en) | 2015-11-06 | 2017-03-21 | Facebook, Inc. | Location-based place determination using online social networks |
US10534814B2 (en) | 2015-11-11 | 2020-01-14 | Facebook, Inc. | Generating snippets on online social networks |
US10387511B2 (en) | 2015-11-25 | 2019-08-20 | Facebook, Inc. | Text-to-media indexes on online social networks |
US10146858B2 (en) | 2015-12-11 | 2018-12-04 | International Business Machines Corporation | Discrepancy handler for document ingestion into a corpus for a cognitive computing system |
US10740368B2 (en) | 2015-12-29 | 2020-08-11 | Facebook, Inc. | Query-composition platforms on online social networks |
US10282434B2 (en) | 2016-01-11 | 2019-05-07 | Facebook, Inc. | Suppression and deduplication of place-entities on online social networks |
US10176250B2 (en) | 2016-01-12 | 2019-01-08 | International Business Machines Corporation | Automated curation of documents in a corpus for a cognitive computing system |
US9842161B2 (en) | 2016-01-12 | 2017-12-12 | International Business Machines Corporation | Discrepancy curator for documents in a corpus of a cognitive computing system |
US10162899B2 (en) | 2016-01-15 | 2018-12-25 | Facebook, Inc. | Typeahead intent icons and snippets on online social networks |
US10262039B1 (en) | 2016-01-15 | 2019-04-16 | Facebook, Inc. | Proximity-based searching on online social networks |
US10740375B2 (en) | 2016-01-20 | 2020-08-11 | Facebook, Inc. | Generating answers to questions using information posted by users on online social networks |
EP3203384A1 (en) * | 2016-02-02 | 2017-08-09 | Theo Hoffenberg | Method, device, and computer program for providing a definition or a translation of a word belonging to a sentence as a function of neighbouring words and of databases |
KR20170092409A (en) * | 2016-02-03 | 2017-08-11 | 엘지전자 주식회사 | Mobile terminal and method for controlling the same |
US10242074B2 (en) | 2016-02-03 | 2019-03-26 | Facebook, Inc. | Search-results interfaces for content-item-specific modules on online social networks |
US10157224B2 (en) | 2016-02-03 | 2018-12-18 | Facebook, Inc. | Quotations-modules on online social networks |
US10270882B2 (en) | 2016-02-03 | 2019-04-23 | Facebook, Inc. | Mentions-modules on online social networks |
US10216850B2 (en) | 2016-02-03 | 2019-02-26 | Facebook, Inc. | Sentiment-modules on online social networks |
US11727391B2 (en) | 2016-04-11 | 2023-08-15 | Nchain Licensing Ag | Computer-implemented methods and systems for validating tokens for blockchain-based cryptocurrencies |
US10452671B2 (en) | 2016-04-26 | 2019-10-22 | Facebook, Inc. | Recommendations from comments on online social networks |
US10635661B2 (en) | 2016-07-11 | 2020-04-28 | Facebook, Inc. | Keyboard-based corrections for search queries on online social networks |
US10223464B2 (en) | 2016-08-04 | 2019-03-05 | Facebook, Inc. | Suggesting filters for search on online social networks |
US10282483B2 (en) | 2016-08-04 | 2019-05-07 | Facebook, Inc. | Client-side caching of search keywords for online social networks |
US10726022B2 (en) | 2016-08-26 | 2020-07-28 | Facebook, Inc. | Classifying search queries on online social networks |
US10534815B2 (en) | 2016-08-30 | 2020-01-14 | Facebook, Inc. | Customized keyword query suggestions on online social networks |
US10853583B1 (en) | 2016-08-31 | 2020-12-01 | Narrative Science Inc. | Applied artificial intelligence technology for selective control over narrative generation from visualizations of data |
US10102255B2 (en) | 2016-09-08 | 2018-10-16 | Facebook, Inc. | Categorizing objects for queries on online social networks |
US10645142B2 (en) | 2016-09-20 | 2020-05-05 | Facebook, Inc. | Video keyframes display on online social networks |
US10026021B2 (en) | 2016-09-27 | 2018-07-17 | Facebook, Inc. | Training image-recognition systems using a joint embedding model on online social networks |
US10083379B2 (en) | 2016-09-27 | 2018-09-25 | Facebook, Inc. | Training image-recognition systems based on search queries on online social networks |
US10579688B2 (en) | 2016-10-05 | 2020-03-03 | Facebook, Inc. | Search ranking and recommendations for online social networks based on reconstructed embeddings |
US10360301B2 (en) | 2016-10-10 | 2019-07-23 | International Business Machines Corporation | Personalized approach to handling hypotheticals in text |
US10311117B2 (en) | 2016-11-18 | 2019-06-04 | Facebook, Inc. | Entity linking to query terms on online social networks |
US10650009B2 (en) | 2016-11-22 | 2020-05-12 | Facebook, Inc. | Generating news headlines on online social networks |
US10235469B2 (en) | 2016-11-30 | 2019-03-19 | Facebook, Inc. | Searching for posts by related entities on online social networks |
US10313456B2 (en) | 2016-11-30 | 2019-06-04 | Facebook, Inc. | Multi-stage filtering for recommended user connections on online social networks |
US10185763B2 (en) | 2016-11-30 | 2019-01-22 | Facebook, Inc. | Syntactic models for parsing search queries on online social networks |
US10162886B2 (en) | 2016-11-30 | 2018-12-25 | Facebook, Inc. | Embedding-based parsing of search queries on online social networks |
US11223699B1 (en) | 2016-12-21 | 2022-01-11 | Facebook, Inc. | Multiple user recognition with voiceprints on online social networks |
US10607148B1 (en) | 2016-12-21 | 2020-03-31 | Facebook, Inc. | User identification with voiceprints on online social networks |
US10535106B2 (en) | 2016-12-28 | 2020-01-14 | Facebook, Inc. | Selecting user posts related to trending topics on online social networks |
WO2018131048A1 (en) * | 2017-01-11 | 2018-07-19 | Satyanarayana Krishnamurthy | System and method for natural language generation |
US10489472B2 (en) | 2017-02-13 | 2019-11-26 | Facebook, Inc. | Context-based search suggestions on online social networks |
US10713442B1 (en) | 2017-02-17 | 2020-07-14 | Narrative Science Inc. | Applied artificial intelligence technology for interactive story editing to support natural language generation (NLG) |
US10614141B2 (en) | 2017-03-15 | 2020-04-07 | Facebook, Inc. | Vital author snippets on online social networks |
US10769222B2 (en) | 2017-03-20 | 2020-09-08 | Facebook, Inc. | Search result ranking based on post classifiers on online social networks |
US10592706B2 (en) * | 2017-03-29 | 2020-03-17 | Valyant AI, Inc. | Artificially intelligent order processing system |
US11379861B2 (en) | 2017-05-16 | 2022-07-05 | Meta Platforms, Inc. | Classifying post types on online social networks |
US10248645B2 (en) | 2017-05-30 | 2019-04-02 | Facebook, Inc. | Measuring phrase association on online social networks |
US10268646B2 (en) | 2017-06-06 | 2019-04-23 | Facebook, Inc. | Tensor-based deep relevance model for search on online social networks |
US10489502B2 (en) * | 2017-06-30 | 2019-11-26 | Accenture Global Solutions Limited | Document processing |
US11562143B2 (en) | 2017-06-30 | 2023-01-24 | Accenture Global Solutions Limited | Artificial intelligence (AI) based document processor |
US11003796B2 (en) | 2017-06-30 | 2021-05-11 | Accenture Global Solutions Limited | Artificial intelligence based document processor |
US11128680B2 (en) * | 2017-07-21 | 2021-09-21 | Fiduciary.ai Inc. | AI mediated conference monitoring and document generation |
US11494395B2 (en) | 2017-07-31 | 2022-11-08 | Splunk Inc. | Creating dashboards for viewing data in a data storage system based on natural language requests |
US10901811B2 (en) | 2017-07-31 | 2021-01-26 | Splunk Inc. | Creating alerts associated with a data storage system based on natural language requests |
US10489468B2 (en) | 2017-08-22 | 2019-11-26 | Facebook, Inc. | Similarity search using progressive inner products and bounds |
US10776437B2 (en) | 2017-09-12 | 2020-09-15 | Facebook, Inc. | Time-window counters for search results on online social networks |
US10678786B2 (en) | 2017-10-09 | 2020-06-09 | Facebook, Inc. | Translating search queries on online social networks |
US10810214B2 (en) | 2017-11-22 | 2020-10-20 | Facebook, Inc. | Determining related query terms through query-post associations on online social networks |
US10963514B2 (en) | 2017-11-30 | 2021-03-30 | Facebook, Inc. | Using related mentions to enhance link probability on online social networks |
US11604968B2 (en) | 2017-12-11 | 2023-03-14 | Meta Platforms, Inc. | Prediction of next place visits on online social networks |
US10129705B1 (en) | 2017-12-11 | 2018-11-13 | Facebook, Inc. | Location prediction using wireless signals on online social networks |
FI20176151A1 (en) | 2017-12-22 | 2019-06-23 | Vuolearning Ltd | A heuristic method for analyzing content of an electronic document |
US11042709B1 (en) | 2018-01-02 | 2021-06-22 | Narrative Science Inc. | Context saliency-based deictic parser for natural language processing |
US11003866B1 (en) | 2018-01-17 | 2021-05-11 | Narrative Science Inc. | Applied artificial intelligence technology for narrative generation using an invocable analysis service and data re-organization |
US11062239B2 (en) * | 2018-02-17 | 2021-07-13 | Bank Of America Corporation | Structuring computer-mediated communication and determining relevant case type |
US11151631B2 (en) | 2018-04-17 | 2021-10-19 | Oracle International Corporation | Quick learning recommendation method, non-transitory computer readable medium and system for baskets of goods |
US11042713B1 (en) | 2018-06-28 | 2021-06-22 | Narrative Science Inc. | Applied artificial intelligence technology for using natural language processing to train a natural language generation system |
US20200065825A1 (en) * | 2018-08-24 | 2020-02-27 | Capital One Services, Llc | Systems and methods for customer service prediction |
US11227114B1 (en) * | 2018-11-28 | 2022-01-18 | Kensho Technologies, Llc | Natural language interface with real-time feedback |
US11341330B1 (en) | 2019-01-28 | 2022-05-24 | Narrative Science Inc. | Applied artificial intelligence technology for adaptive natural language understanding with term discovery |
US11113469B2 (en) | 2019-03-27 | 2021-09-07 | International Business Machines Corporation | Natural language processing matrices |
US11386902B2 (en) * | 2020-04-28 | 2022-07-12 | Bank Of America Corporation | System for generation and maintenance of verified data records |
CN112231212B (en) * | 2020-10-16 | 2023-05-09 | 湖南皖湘科技有限公司 | Method for detecting grammar error of program code |
US12031228B2 (en) | 2021-07-21 | 2024-07-09 | Meta Platforms Technologies, Llc | Organic solid crystal—method and structure |
US11811626B1 (en) * | 2022-06-06 | 2023-11-07 | International Business Machines Corporation | Ticket knowledge graph enhancement |
CN115965017B (en) * | 2023-01-04 | 2023-11-10 | 北京三维天地科技股份有限公司 | Multi-language input and analysis system and method based on development platform |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6636831B1 (en) * | 1999-04-09 | 2003-10-21 | Inroad, Inc. | System and process for voice-controlled information retrieval |
US6714939B2 (en) * | 2001-01-08 | 2004-03-30 | Softface, Inc. | Creation of structured data from plain text |
US6766320B1 (en) * | 2000-08-24 | 2004-07-20 | Microsoft Corporation | Search engine with natural language-based robust parsing for user query and relevance feedback learning |
US7027975B1 (en) * | 2000-08-08 | 2006-04-11 | Object Services And Consulting, Inc. | Guided natural language interface system and method |
US7216073B2 (en) * | 2001-03-13 | 2007-05-08 | Intelligate, Ltd. | Dynamic natural language understanding |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
NL8900247A (en) * | 1989-02-01 | 1990-09-03 | Bso Buro Voor Systeemontwikkel | METHOD AND SYSTEM FOR DISPLAYING MULTIPLE ANALYSES IN A DEPENDENCY GRAMMAR, AND A PARSING DEVICE FOR GENERATING SUCH A DISPLAY. |
US5309359A (en) | 1990-08-16 | 1994-05-03 | Boris Katz | Method and apparatus for generating and utilizing annotations to facilitate computer text retrieval |
US5404295A (en) | 1990-08-16 | 1995-04-04 | Katz; Boris | Method and apparatus for utilizing annotations to facilitate computer retrieval of database material |
US5528491A (en) * | 1992-08-31 | 1996-06-18 | Language Engineering Corporation | Apparatus and method for automated natural language translation |
US5680619A (en) | 1995-04-03 | 1997-10-21 | Mfactory, Inc. | Hierarchical encapsulation of instantiated objects in a multimedia authoring system |
US5855020A (en) | 1996-02-21 | 1998-12-29 | Infoseek Corporation | Web scan process |
SG49804A1 (en) * | 1996-03-20 | 1998-06-15 | Government Of Singapore Repres | Parsing and translating natural language sentences automatically |
US5920854A (en) | 1996-08-14 | 1999-07-06 | Infoseek Corporation | Real-time document collection search engine with phrase indexing |
US6182029B1 (en) * | 1996-10-28 | 2001-01-30 | The Trustees Of Columbia University In The City Of New York | System and method for language extraction and encoding utilizing the parsing of text data in accordance with domain parameters |
CN1204515C (en) * | 1997-04-22 | 2005-06-01 | 格雷格·赫瑟林顿 | Method and apparatus for processing free-format data |
US5940821A (en) | 1997-05-21 | 1999-08-17 | Oracle Corporation | Information presentation in a knowledge base search and retrieval system |
US6138098A (en) * | 1997-06-30 | 2000-10-24 | Lernout & Hauspie Speech Products N.V. | Command parsing and rewrite system |
JP2000207397A (en) | 1999-01-01 | 2000-07-28 | Ishikura Hiroshi | System and method for analyzing language |
US6782505B1 (en) * | 1999-04-19 | 2004-08-24 | Daniel P. Miranker | Method and system for generating structured data from semi-structured data sources |
- 2001
  - 2001-01-08 US US09/757,075 patent/US6714939B2/en not_active Expired - Lifetime
- 2002
  - 2002-01-07 DE DE60213409T patent/DE60213409T2/en not_active Expired - Lifetime
  - 2002-01-07 EP EP02714731A patent/EP1399842B1/en not_active Expired - Lifetime
  - 2002-01-07 WO PCT/US2002/000757 patent/WO2002056196A2/en active IP Right Grant
  - 2002-01-07 AT AT02714731T patent/ATE334449T1/en not_active IP Right Cessation
  - 2002-01-07 AU AU2002246981A patent/AU2002246981A1/en not_active Abandoned
- 2004
  - 2004-03-05 US US10/794,335 patent/US7324936B2/en not_active Expired - Lifetime
- 2007
  - 2007-10-31 US US11/982,386 patent/US20080126080A1/en not_active Abandoned
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070245308A1 (en) * | 2005-12-31 | 2007-10-18 | Hill John E | Flexible XML tagging |
US8977953B1 (en) * | 2006-01-27 | 2015-03-10 | Linguastat, Inc. | Customizing information by combining pair of annotations from at least two different documents |
US8407585B2 (en) * | 2006-04-19 | 2013-03-26 | Apple Inc. | Context-aware content conversion and interpretation-specific views |
US20070250762A1 (en) * | 2006-04-19 | 2007-10-25 | Apple Computer, Inc. | Context-aware content conversion and interpretation-specific views |
US8671341B1 (en) | 2007-01-05 | 2014-03-11 | Linguastat, Inc. | Systems and methods for identifying claims associated with electronic text |
US20100106802A1 (en) * | 2007-02-16 | 2010-04-29 | Alexander Zink | Apparatus and method for generating a data stream and apparatus and method for reading a data stream |
US20120275541A1 (en) * | 2007-02-16 | 2012-11-01 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for generating a data stream and apparatus and method for reading a data stream |
US8782273B2 (en) * | 2007-02-16 | 2014-07-15 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for generating a data stream and apparatus and method for reading a data stream |
US8788693B2 (en) * | 2007-02-16 | 2014-07-22 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for generating a data stream and apparatus and method for reading a data stream |
US8595182B1 (en) * | 2007-11-07 | 2013-11-26 | Google Inc. | Network file association |
US8086592B2 (en) * | 2007-11-30 | 2011-12-27 | SAP France S.A. | Apparatus and method for associating unstructured text with structured data |
US20090144295A1 (en) * | 2007-11-30 | 2009-06-04 | Business Objects S.A. | Apparatus and method for associating unstructured text with structured data |
US20100088674A1 (en) * | 2008-10-06 | 2010-04-08 | Microsoft Corporation | System and method for recognizing structure in text |
US10013505B1 (en) * | 2010-01-13 | 2018-07-03 | The Boeing Company | Method, apparatus and computer program product for identifying a target part name within a data record |
US8732036B2 (en) | 2010-05-07 | 2014-05-20 | Ariba, Inc. | Supplier/buyer network that provides catalog updates |
US20120016661A1 (en) * | 2010-07-19 | 2012-01-19 | Eyal Pinkas | System, method and device for intelligent textual conversation system |
US9684690B2 (en) | 2011-01-12 | 2017-06-20 | Google Inc. | Flights search |
US10810368B2 (en) | 2012-07-10 | 2020-10-20 | Robert D. New | Method for parsing natural language text with constituent construction links |
US9720903B2 (en) * | 2012-07-10 | 2017-08-01 | Robert D. New | Method for parsing natural language text with simple links |
US20140019122A1 (en) * | 2012-07-10 | 2014-01-16 | Robert D. New | Method for Parsing Natural Language Text |
US9430571B1 (en) * | 2012-10-24 | 2016-08-30 | Google Inc. | Generating travel queries in response to free text queries |
US11361041B2 (en) | 2012-10-24 | 2022-06-14 | Google Llc | Generating travel queries in response to free-text search queries |
US10423684B2 (en) | 2012-10-24 | 2019-09-24 | Google Llc | Generating travel queries in response to free-text search queries |
US9679080B2 (en) | 2012-11-12 | 2017-06-13 | Facebook, Inc. | Grammar model for structured search queries |
US9105068B2 (en) | 2012-11-12 | 2015-08-11 | Facebook, Inc. | Grammar model for structured search queries |
KR101858157B1 (en) | 2012-11-12 | 2018-05-15 | 페이스북, 인크. | Grammar model for structured search queries |
WO2014074346A1 (en) * | 2012-11-12 | 2014-05-15 | Facebook, Inc. | Grammar model for structured search queries |
US10509852B2 (en) | 2012-12-10 | 2019-12-17 | International Business Machines Corporation | Utilizing classification and text analytics for annotating documents to allow quick scanning |
US20140164899A1 (en) * | 2012-12-10 | 2014-06-12 | International Business Machines Corporation | Utilizing classification and text analytics for annotating documents to allow quick scanning |
US10430506B2 (en) * | 2012-12-10 | 2019-10-01 | International Business Machines Corporation | Utilizing classification and text analytics for annotating documents to allow quick scanning |
US9779076B2 (en) | 2013-09-04 | 2017-10-03 | International Business Machines Corporation | Utilizing classification and text analytics for optimizing processes in documents |
EP3142028A3 (en) * | 2015-09-11 | 2017-07-12 | Google, Inc. | Handling failures in processing natural language queries through user interactions |
US10282444B2 (en) * | 2015-09-11 | 2019-05-07 | Google Llc | Disambiguating join paths for natural language queries |
US10997167B2 (en) | 2015-09-11 | 2021-05-04 | Google Llc | Disambiguating join paths for natural language queries |
US10540444B2 (en) | 2017-06-20 | 2020-01-21 | The Boeing Company | Text mining a dataset of electronic documents to discover terms of interest |
US11651001B2 (en) | 2018-03-14 | 2023-05-16 | The Boeing Company | Unifying terms of interest from a dataset of electronic documents |
US11068477B1 (en) * | 2018-06-06 | 2021-07-20 | Gbt Travel Services Uk Limited | Natural language processing with pre-specified SQL queries |
US11249960B2 (en) * | 2018-06-11 | 2022-02-15 | International Business Machines Corporation | Transforming data for a target schema |
US10691584B2 (en) * | 2018-09-28 | 2020-06-23 | Sap Se | Behavior driven development integration with test tool |
EP3968149A1 (en) * | 2020-09-11 | 2022-03-16 | SemaLogic UG | Generation of control rules from schematic control systems |
US20220083740A1 (en) * | 2020-09-11 | 2022-03-17 | SemaLogic UG | Generating control commands from schematized rule sets |
US20220292143A1 (en) * | 2021-03-11 | 2022-09-15 | Jatin V. Mehta | Dynamic Website Characterization For Search Optimization |
US11907311B2 (en) * | 2021-03-11 | 2024-02-20 | Jatin V. Mehta | Dynamic website characterization for search optimization |
US20240311433A1 (en) * | 2021-03-11 | 2024-09-19 | Jatin V. Mehta | Dynamic Website Characterization For Search Optimization |
Also Published As
Publication number | Publication date |
---|---|
ATE334449T1 (en) | 2006-08-15 |
WO2002056196A3 (en) | 2003-12-31 |
US7324936B2 (en) | 2008-01-29 |
DE60213409D1 (en) | 2006-09-07 |
AU2002246981A1 (en) | 2002-07-24 |
US20040172237A1 (en) | 2004-09-02 |
US6714939B2 (en) | 2004-03-30 |
WO2002056196A2 (en) | 2002-07-18 |
US20030167266A1 (en) | 2003-09-04 |
DE60213409T2 (en) | 2007-07-19 |
EP1399842B1 (en) | 2006-07-26 |
EP1399842A2 (en) | 2004-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7324936B2 (en) | Creation of structured data from plain text | |
Abiteboul et al. | Querying documents in object databases | |
Arocena et al. | WebOQL: Restructuring documents, databases, and webs | |
Mena et al. | Ontology-based query processing for global information systems | |
Burkowski | Retrieval activities in a database consisting of heterogeneous collections of structured text | |
US6792418B1 (en) | File or database manager systems based on a fractal hierarchical index structure | |
Reynaud et al. | Semantic integration of XML heterogeneous data sources | |
Kay | XPath 2.0 programmer's reference | |
Stratica et al. | Using semantic templates for a natural language interface to the CINDI virtual library | |
Beeri et al. | Websuite—a tool suite for harnessing web data | |
Gao et al. | Semi-structured data extraction from heterogeneous sources | |
Pluempitiwiriyawej et al. | Element matching across data-oriented XML sources using a multi-strategy clustering model | |
JPH09190453A (en) | Database device | |
Aggarwal et al. | WIRE-a WWW-based information retrieval and extraction system | |
Chartrand | Ontology-based extraction of RDF data from the world wide web | |
Bruder et al. | GETESS: Constructing a linguistic search index for an Internet search engine | |
Raghavan et al. | A system architecture for database mining applications | |
Lawrence et al. | Unity-a database integration tool | |
Becker | Effective databases for text & document management | |
Laender et al. | The Debye environment for Web data management | |
Chen et al. | Query rewriting for extracting data behind HTML forms | |
Chobe et al. | Extraction of Meaningful Information from the Web: a Brief Survey | |
Wessman | A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System | |
Abulaish et al. | Using Part-of-Speech Patterns and Domain Ontology to Mine Imprecise Concepts from Text Documents. | |
Pluempitiwiriyawej | A new hierarchical clustering model for speeding up the reconciliation of XML-based, semistructured data in mediation systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |