I am struggling with a rather complex issue, to which I can see no easy solution.
I have a set of data that contains paragraphs in which a certain word is cited, along with the document containing them. This was produced by a script (courtesy of macropod) that extracts paragraphs from documents according to a keyword, in the format:
Code:
Document | KeyWord | Paragraph
The problem I have is that oftentimes words are cited in an ambiguous way and are thus not an accurate representation of what is being cited. The main problem is that sometimes the citation is correct, but sometimes it is not. Accordingly, it is necessary to search for both a "lowest common denominator" and a more complex form.
For example, let's assume I am looking for a court case, let's call it "Case1212". This would generally be cited as "Court of Appeals, Case 1212", but may also be cited as "Case 1212". In the latter case, however, the citation would also match the district court case. (This is an odd jurisdiction with terrible citation systems, if you need to know!)
Consider the following example, which leads to identical citing paragraphs.
Code:
Document | Word | Paragraph
Doc 1 | Court of Appeals, Case 1212 | The Court affirms that A > B because Court of Appeals, Case 1212 says that
Doc 1 | Case 1212 | The Court affirms that A > B because Court of Appeals, Case 1212 says that
The way I see to resolve the problem is to search for two keywords (e.g. "Court of Appeals, Case 1212" and "Case 1212"). By comparing the texts in the "Paragraph" column, the occurrence of the more complex form should reasonably exclude the other. The rest could then easily be parsed manually.
Now, my problem is how to automatically remove the rows containing "Case 1212" in the "Keyword" column when there exists another row containing "Court of Appeals, Case 1212" in that column and the two rows have the same value in the "Paragraph" column.
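For what it's worth, the removal rule described above can be sketched in a few lines of Python, assuming the data can be read as (Document, Keyword, Paragraph) rows; the function name and the two sample keyword forms are just placeholders for illustration:

```python
# Sketch: drop rows whose Keyword is the short form whenever another row
# with the longer form has the same Paragraph text. Names are hypothetical.

def drop_short_form_rows(rows, long_form, short_form):
    """rows: list of (document, keyword, paragraph) tuples."""
    # Paragraphs already covered by the long-form keyword.
    covered = {para for _doc, kw, para in rows if kw == long_form}
    # Keep every row except short-form rows whose paragraph is covered.
    return [
        (doc, kw, para) for doc, kw, para in rows
        if not (kw == short_form and para in covered)
    ]

rows = [
    ("Doc 1", "Court of Appeals, Case 1212",
     "The Court affirms that A > B because Court of Appeals, Case 1212 says that"),
    ("Doc 1", "Case 1212",
     "The Court affirms that A > B because Court of Appeals, Case 1212 says that"),
]
kept = drop_short_form_rows(rows, "Court of Appeals, Case 1212", "Case 1212")
# Only the long-form row survives; short-form rows with no matching
# long-form paragraph would be kept for manual review.
```

The same logic could be expressed in Excel with a helper column (e.g. COUNTIFS over the Paragraph and Keyword columns) and a filter, if staying in the spreadsheet is preferable.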
Before you mention it: I thought about showing duplicate paragraphs/values and handling them manually. However, you must understand that it's 9,000 rows we are talking about. If you have any suggestions, I'll be forever grateful!