Relation Extraction from Surgical Pathology Reports

Yanna Shen

There has been little work toward developing NLP methods for Information Extraction from medical reports, although such methods would be of substantial value to the medical research community. Parsing medical reports poses challenges for existing domain-independent parsers. In particular, medical reports show considerable variability in word and phrase order, include numerous medical terms, contain many sentence fragments, and can be quite long and complex. For example, the average sentence length in the reports I investigated was 14 words, with some sentences exceeding 60 words. Under these conditions, traditional parsing rules may fail. For my ISSP 2030 project, I investigated the potential use of an existing partial parser (Leroy, Chen and Martinez, 2003) to capture relations from pathology reports, one important type of medical report. The existing parser uses cascaded Finite State Automata (FSAs) and was developed for use with biomedical research abstracts, which are quite different from medical reports. For this pilot study, I randomly selected 50 reports from a corpus of 400,000 de-identified pathology reports at the University of Pittsburgh Medical Center. I manually processed 106 sentences from these reports using the cascaded FSAs described by Leroy, Chen and Martinez (2003), and compared the results to those they reported for biomedical research abstracts. Although some prepositional relations in the pathology reports are well captured, the overall results are not very encouraging. Future work will focus on investigating other methods for parsing reports and on re-training a part-of-speech tagger using the existing corpus.