Decoding the Language of Biomedical Research
Computer programming, and specifically natural language processing, has the potential to decode sentence structure and organize immense quantities of information. This summer, Richard Klockowski ’12 is working with Associate Professor of Computer Science Alistair Campbell with the aim of automatically extracting information from PubMed’s database of medical research papers.
PubMed is an online database listing more than 20.8 million records, with approximately 500,000 new records added each year. It is extremely time-consuming for one or more humans to read, let alone understand, the information in all of these resources. Klockowski’s task is to automate the process of scanning abstracts and extracting information from them in order to formulate a logical representation of the relationships involved and perhaps reveal contradictions or inferences that a human reader may miss.
Klockowski’s research entails surveying the work that has been done in the past and creating a custom information extraction system. The final goal is to create a logical database that is easily referenced and understood. For example, if one abstract discusses the location of a specific protein in the body, the system will add an entry to the database signifying that the protein is found in that location.
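The protein example above can be illustrated with a minimal Python sketch. It is not Klockowski's system and does not use GATE or UMLS; the sentence pattern, the function name, the "found_in" label and the sample abstract are all hypothetical, chosen only to show how a statement in an abstract might become an entry in a logical database.

```python
import re

# Hypothetical pattern (an assumption for illustration only):
# "<protein> is found/located/expressed in the <location>"
LOCATION_PATTERN = re.compile(
    r"\b(?P<protein>[A-Za-z0-9-]+) is (?:found|located|expressed) in the "
    r"(?P<location>[a-z ]+?)(?=[.,;]|$)"
)


def extract_location_facts(abstract):
    """Return a set of (protein, "found_in", location) triples from the text."""
    facts = set()
    for match in LOCATION_PATTERN.finditer(abstract):
        protein = match.group("protein")
        location = match.group("location").strip()
        facts.add((protein, "found_in", location))
    return facts


# Made-up sample text, not taken from any real abstract.
abstract = (
    "Our results show that BRCA1 is found in the nucleus. "
    "Insulin is expressed in the pancreas, as expected."
)

# The "database" here is just a set of triples.
database = extract_location_facts(abstract)
for fact in sorted(database):
    print(fact)
```

A real biomedical system would need far more than one hand-written pattern, since the same relationship can be phrased in many ways, which is part of what makes the task difficult.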
To carry out this project, Klockowski, a mathematics and computer science double major, is employing several pre-existing technologies, including the General Architecture for Text Engineering (GATE) and the Unified Medical Language System (UMLS). With some customization, these tools allow Klockowski to parse the complicated language found in biomedical research papers.
Information extraction is a relatively new field that is gaining popularity among computer scientists. Klockowski’s goal is to reproduce research that has been done with similar resources, and perhaps to extend it further. He has not set a specific goal for the end of the summer because his current work is very open-ended. Instead, Klockowski is taking a careful approach to pushing the boundaries of what has been done already, and his research is evolving as he learns of the different techniques and resources available. Klockowski hopes that his program will eventually be able to find contradictions and make accurate inferences about related topics in texts.
Over the course of his project, Klockowski simply looks forward to making progress. He explains that when working with this kind of system-wide programming, there is little room for error and a lot can go wrong. Klockowski realizes that it may not be possible to complete all of his goals for this summer in only 10 weeks. He has considered using this project as the basis of his senior thesis, and so he feels highly invested in its development.
Klockowski enjoys programming, an interest that stemmed from his childhood love of playing video games. He is a computer science TA and a mathematics grader, and he also enjoys bike riding and playing guitar.
Computer programming fosters critical thinking and efficient problem solving. Klockowski sees the potential in computer programming to benefit humankind in a significant and logical way. He hopes to help information extraction become a commonplace technique for handling data containing natural language.
Richard Klockowski is a graduate of the Rome Free Academy in Rome, N.Y.