From suresh@hume Thu Sep 15 16:45 EDT 1994
Date: Thu, 15 Sep 1994 16:47:03 +0500
From: suresh@hume (Suresh Srinivasan)
To: rodgers@hume
Subject: Sourcerer
Content-Type: text
Content-Length: 6922

0 UMLS
------

Introduce the different knowledge sources in the UMLS

? Query Formulation
-------------------

Components - atomic and composite linked by Boolean operators.

1 Technical Approach
--------------------

Some services are implemented in Perl; the rest are in C.

? Sourcerer CGI Application
---------------------------

1.0 State Maintenance
---------------------

State is maintained on the server side in two files: a session DB file
and a component file. The former contains information about the
client, the time of last update, the session number, and pointers to
each search component already filled for that search. The C structure
is declared as follows:

    struct session {
        u_long clientIP;            /* client's IP address */
        u_long access;              /* time last accessed */
        int    num;                 /* session number */
        int    n_comps;             /* number of active components */
        long   comp_ofs[MAXCOMP];   /* offsets to each component in
                                       the component file */
        long   comp_lines[MAXCOMP]; /* number of lines for each
                                       component */
    };

These session DB records are of fixed length and hence allow for quick
random access. They are also locked for exclusive access.

Each HTML page returned to the client contains the session number.
When the page is submitted back to the server, this allows Sourcerer
to check for expired sessions by comparing the current time with the
last access time for that session, and to obtain all the state
information for that session via the session number.

On all documents that contain a session UI, the client's IP is checked
against the one stored in the session DB file. The last access time is
then checked. When new sessions are allocated from the available pool,
those that have expired are explicitly retired and reallocated.

The search component file contains all the search information for all
the clients.
This file will need to be rewritten periodically to delete retired
information. It contains the component number, the user's query
string, the matching Metathesaurus terms that the user chose as
relevant, their synonyms, their semantic types, the applicable concept
definitions, etc. The contents of this file are newline-delimited
ASCII text containing slots and values separated by ':'. The
recognized slots are:

    Component:        # component number
    Component Type:   # atomic or composite

    # for atomic components
    Boolean:
    Original Query:   # user's original query
    Modified Query:   # list of all matching Meta terms selected by user
    Matching CUI:     CUI/syn#syn#syn..|CUI/syn#syn#syn|...
    Matching STY:     STY/Tnum|STY/Tnum|...
    Matching MH:      # MH/TN:TN|MH/TN:TN|...

    # for composite components
    Original Query Topic 1:
    Original Query Topic 2:
    Modified Query Topic 1:
    Modified Query Topic 2:
    Matching CUI Topic 1:
    Matching CUI Topic 2:
    Matching SYN Topic 1:
    Matching SYN Topic 2:
    Matching STY Topic 1:
    Matching STY Topic 2:
    Matching MH Topic 1:
    Matching MH Topic 2:

Every Nth (N=1000?) session, the search component file can be
rewritten and the corresponding offsets in the session file updated.

1.1 Searching the UMLS Metathesaurus
------------------------------------

The knowledge server is the entry point to searching the UMLS for
suitable terms given a user's query. It currently uses the "meta"
server developed for the UMLS project (See Reference to Alexa's work).
The knowledge server acts as a client to "meta" in performing its
searches. It accepts a Sourcerer query string and applies the
following search algorithm:

  - look up the query string in the Metathesaurus (ignoring case).
  - "normalize" the query string and look that up in the Metathesaurus
    [the "normalize" function is a series of operations on a string
    that attempts to canonicalize the string.
    It currently involves the following lexical operations:
    lowercasing, reducing each "word" in the term to its base form,
    and sorting the resulting words in ascending order. (Give
    examples)] (see lexical methods reference).
  - look up the normalized string in the Metathesaurus.
  - break up the string into individual words.
  - look up each word in the Metathesaurus word index and extract the
    top 'N' matches sorted in reverse order of weight. May use Ed's
    weighting scheme; if so, provide reference.
  - normalize each word in the query, look it up in the Metathesaurus
    normalized word index, and extract the top 'N' matches sorted in
    reverse order of weight.
  - the result is a list of matching Metathesaurus terms (anchored to
    pages that provide their definitions, synonyms, semantic types,
    etc.).
  - these are presented to the user, who is prompted to select one or
    more.

1.2 Searching the UMLS Information Sources Map
----------------------------------------------

The resource server attempts to search the Information Sources Map
(ISM) using the information gleaned from searching the UMLS
Metathesaurus with the user's original query. The ISM specifies a
variety of fields, many suitable for indexing. These include the
semantic relation field (SRL), the semantic type field (STY), and the
MeSH heading field (MH_). In addition, the text words from the user's
query can be searched against the text fields of the ISM, such as the
definition and general description fields. In the current
implementation, the resource server searches on the SRL, STY and MH_
fields. The search returns the URNs of the matching sources.

When the user invokes the "Search the ISM" option after posing a
search in the Metathesaurus, Sourcerer reads the state of the search
from the component file and, if there is data present, submits a
search to the ISM server with appropriate transformations. For
example, the STY fields are searched directly.
For any composite components, all relevant SRL queries are constructed
and used in the search (all possible permutations). The relevant MH
fields are searched using the tree numbers and inheritance.
Eventually, the text in the user's original query, and perhaps the
matching terms (and synonyms), can be used to search the text fields
in the ISM.

The result is a list of appropriate sources sorted by decreasing
relevance. This HTML page should contain anchors to all relevant
information about that source from the ISM, including definition,
name, other names, type, etc.

1.3 Location Server
-------------------

The URNs from the resource server are mapped to URLs using a
simplified Whois++ protocol (see CNIDR reference).

1.4 Terminal Information Server
-------------------------------

The URLs point to terminal information servers. The user's query,
along with additional information obtained from the knowledge server,
is passed along to the TIS (via hidden fields) for an effective search
in the database. Currently this functionality is only available for
the MEDLARS family of databases, and the URLs point to the services of
NetCoach, which provides the search function (See NetCoach reference).
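
Sketch: the "normalize" step of section 1.1
-------------------------------------------

Section 1.1 asks for examples of the "normalize" operation. The
following is a minimal sketch only, not the lexical code Sourcerer
actually uses. It covers two of the three operations named there,
lowercasing and sorting the words in ascending order. The base-form
reduction needs a lexicon and is left out. The names normalize_sketch,
cmp_words and MAX_WORDS are hypothetical, and splitting on whitespace
is an assumption about what counts as a "word".

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 64   /* hypothetical limit, not from the source */

/* qsort comparator for an array of char pointers (ascending order) */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/*
 * Write a partially normalized copy of `in` into `out` (capacity
 * `outsz`).  Returns 0 on success, -1 if memory runs out, there are
 * more than MAX_WORDS words, or the result does not fit in `out`.
 */
int normalize_sketch(const char *in, char *out, size_t outsz)
{
    char *words[MAX_WORDS];
    size_t n = 0, used = 0, i;
    char *buf, *tok;

    if (outsz == 0)
        return -1;
    buf = malloc(strlen(in) + 1);
    if (buf == NULL)
        return -1;
    strcpy(buf, in);

    /* operation 1: lowercase the whole string */
    for (i = 0; buf[i] != '\0'; i++)
        buf[i] = (char)tolower((unsigned char)buf[i]);

    /* operation 2 (base-form reduction) is omitted in this sketch */

    /* split into words on whitespace (an assumption) */
    for (tok = strtok(buf, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (n == MAX_WORDS) {
            free(buf);
            return -1;
        }
        words[n++] = tok;
    }

    /* operation 3: sort the words in ascending order */
    qsort(words, n, sizeof words[0], cmp_words);

    /* join the sorted words with single spaces */
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        size_t len = strlen(words[i]);
        size_t need = len + (i > 0 ? 1 : 0);

        if (used + need + 1 > outsz) {
            free(buf);
            return -1;
        }
        if (i > 0)
            out[used++] = ' ';
        memcpy(out + used, words[i], len);
        used += len;
        out[used] = '\0';
    }

    free(buf);
    return 0;
}
```

Under these assumptions, the query "Heart Attack Acute" becomes
"acute attack heart", so differently ordered or differently cased
forms of the same words map to one lookup key.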