Need to improve speed of Excel macro that searches for whole word matches

Marrick13 · #1 02-16-2016, 10:02 AM

I’ve constructed macros that search a user range on a worksheet and looks for whole word matches (one provides case sensitive and the other insensitive searches with the option compare statements). It extracts the matches it finds to a new workbook. It determines the whole word match by testing whether the search text in the range is preceded or followed by a non-letter or non-number.

It runs very well if there are a small number of hits, even if the search range is several thousand rows, but slows down if it finds a lot of hits, probably because it loops through the ASC characters for the front and back of the search hit. For example, with a range of about 5800 rows of data, it took just over 2 minutes to extract 455 hits. But searching the same range for a different criterion that had only 14 hits took around 15 seconds.

Does anyone have a suggestion for speeding this code up when the hit count is high? Macro is attached in text file.

NoSparks · #2 02-16-2016, 10:31 AM

I'd look into the use of range.Find and .FindNext methods.

Marrick13 · #3 02-16-2016, 12:47 PM

I'd be happy to do that but am not VBA fluent enough to change the code to apply those methods. As it is, I had to test and try to rewrite other match whole word modules, only to find they didn't quite do what I wanted them to do. The one I have now is the only one that actually worked (without doing a find and replace).

Can you suggest what the code should look like using range.find and .findnext?

NoSparks · #4 02-16-2016, 02:22 PM

As we don't know what you're actually working with, how about you supply a sample workbook that what you have now works on and can be used to make sure anything suggested will work on.

Marrick13 · #5 02-17-2016, 01:01 PM

That data is private but I have created a similar database of about 1070 records with the 4 code modules (two for case sensitive and two for case insensitive; the other two are for finding all occurrences and for whole word matches).

The code ran pretty fast on the 1070-record database, so I copied most of it 6 times to increase it to about 6300 records. Results running match whole word, case insensitive on a search for 'the':

searching 1070 records: found 462 in 25 seconds
searching 6319 records: found 2717 in 49 seconds

These were faster times than my original tests but I would still like to know if the searching can be faster, whether using find.next or another method.

Sample database attached.

NoSparks · #6 02-17-2016, 06:24 PM

The find-findnext was only a suggestion of something to look into.
Don't know if this is of any use or not.

Here is what I did:
added module2, altered the forms' Search and Extract procedure a little... it's commented.
put a timer in your Find_Whole_Word_CIS1(x) macro.

In it's original state, on my computer, that macro ran in 48 seconds (last input box to first msgbox).

Adding Application.ScreenUpdating=False reduced the time to 9 seconds.

Using .Find and .FindNext (in module2) the time to run was 3 seconds, but the macro is questionable, as the results didn't jive.

The original macro found 2717 records.
60 of those weren't in the results of find/findnext.
I can account for that being 'the' at the beginning of a sentence or first word inside brackets and not having a preceding space.

The find/findnext macro found 3018 records.
361 of these aren't in the original results and I don't know what gives with that.
The attached Comparison workbook contains only the results of the 2 macros, the discrepancies are highlighted.

Marrick13 · #7 02-17-2016, 07:15 PM

Thanks for checking this. Just after sending that sample file, I ran the match whole word macro against my real database on one word. It found 89 matches. When I ran the original all occurrences code, it found 260. The search text would have been a single word in all cases, so there must be a flaw in the match whole word code. It is not due to the search text being the only text in any cell because some of the hits were such cells. I had an earlier version of the match whole word code that allowed for that by testing only the start of the word to see if it had any spaces or punctuation. But when i tested it, it made no difference. I just tested that earlier version and also got 89 hits.

I like your 'Identify_Missing' macro - that's really helpful. But I'm at a loss as to why the match whole word misses so many rows. The speed is less of an issue than the accuracy; it's not worthwhile keeping if it misses qualified rows.

Do you have ideas as to why, or suggestions for different code that finds all whole word matches?

NoSparks · #8 02-17-2016, 08:56 PM

No question, accuracy must come first.

The reason so many rows are being skipped is only the first occurrence of 'the' is being looked at:
seq 23: Antikythera
seq29: no space between writes and the
seq 108 and 109: Southern Ohio
etc.
etc.

Just so this can be looked at with a totally open (maybe that should be blank) mind, can you state what the original request and requirements were regarding this ?

Marrick13 · #9 02-17-2016, 09:18 PM

The original request was my own doing. I had code that extracted text I searched regardless of case and matching whole words (I adapted it from http://stackoverflow.com/questions/2...n-one-workbook). I usually got rows that I didn't want just because the code wasn't matching, so I wondered about changing it to match the whole word. Not being a VBA guru, I had to look for examples that I could maybe adapt for my purpose.

I did just comment out two lines of code: the 'If b < 48 Or a > 57 Then' and the 'If a < 48 Or b > 57 Then' which I had added so the code would include numbers adjacent to the search text when it was looking to determine if the whole word was a match. When I ran that code, it found the same number of hits as the original code above searching for a word that I knew was a whole word in all 260 occurrences. With that b < 48 Or a > 57 code uncommented, I got 89 hits on the same search term. That is great - but if I added a number to the search term in one of the rows, the code doesn't recognize it, whether or not the b < 48 Or a > 57 lines are there. So the way it is now, I think the only reason lines will be skipped is if the search term is either preceded or followed by a number (without a space). I guess I can live with that. In the example you mention (seq 23: Antikythera, seq29: no space between writes and the seq 108 and 109: Southern Ohio), 'the' is not a whole word, so on a whole word match run, I don't want those rows. Only if I choose "all occurrences" would I want to see 'the'.

If you can suggest a way to have the code see numbers without compromising the other (whole word match) searching, I''d be happy to hear it!

NoSparks · #10 02-17-2016, 10:15 PM

Quote:

seq 108 and 109: Southern Ohio), 'the' is not a whole word, so on a whole word match run, I don't want those rows.

Yes you do... further on, each line has the stand alone whole word you're looking for.

Is your search always confined to a single column ?

NoSparks · #11 02-18-2016, 08:20 AM

Here's a different approach.

Started with the sample workbook you posted.
Kept your sheet of data and userform.
For each cell in the TestRng, split the cell contents into an array of individual words as separated by spaces.
Checks for array elements being same as SearchText.

Using the same 'the' as what to check for
takes about 10 seconds and returns 3090 records case insensitive and 2814 records case sensitive.

Appears to me your challenge will be determining how many records should be returned in order to know if anything works correctly.

Marrick13 · #12 02-18-2016, 10:51 AM

This looks really good - thank you. The only issue I have with it is that some search words might be preceded or followed by punctuation such as a dash, question mark, comma, etc.

I am a little familiar with the regex method in "[^A-Z0-9 ]", but when I tried adding the punctuation to it, the code ignored it. So the code does find and extract all whole word matches unless the search text has anything but a letter or number immediately in front of or behind it. How do I add the punctuation so the code recognizes the hits as whole word matches?

Regarding the range extending beyond one column, yes, I did not put a restriction on that because I wanted the search to look in all selected columns and rows (and that works fine, as far as I've seen). And you're right about there being more than one instance of the searchtext in any row, so that if one does not match the whole word and another one does, the code should extract it. I was responding only to the text you posted for illustration.

NoSparks · #13 02-18-2016, 11:37 AM

I'm sure if you play around with the stuff in the UDF it will remove anything and everything you could possibly think of.
Google VBA to remove punctuation special characters and numbers, you'll be able to adapt something.

Marrick13 · #14 02-18-2016, 02:00 PM

Well, I've just spent several hours trying various combinations of punctuation and special characters and none of them has had any effect. The latest I've tried are:

.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!" & _
"#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:" & _
"[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:" & _
"[a-z0-9-]*[a-z0-9])?"

and

.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~- ]"

I don't know how else to tell the code to identify hits as whole word matches if they adjoin these characters. Am I getting warm or still cold here?

NoSparks · #15 02-18-2016, 05:58 PM

I believe the .Pattern thing works opposite to your line of thought.

.Pattern = "[^characters you want to keep]"

.Pattern = "[^A-Za-z0-9\ ]"
keeps upper case A thru Z
keeps lower case a thru z
keeps numerals 0 thru 9
don't know what the back slash does as back slashes are removed
keeps spaces

I add a row to the top of your data and put in this

.Pattern = "[A-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!" & "#$%&'*+/=\?^_`{|}~-]+)*@(?:[a-z0-9](?:" & _"[a-z0-9-]*[a-\z0-9])?\.)+[a-z0-9](?:" & _"[a-z0-9-]*[a-z0-9])?"

use debug.print and this is what it gets reduced to using .Pattern = "[^A-Za-z0-9\ ]"

Pattern AZ09az09 az09 az09az09az09 az09az09

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
how to speed up macro	AC PORTA VIA	Excel	3	10-30-2015 08:58 PM
Excel formula required to speed up process of creating species rarefaction curves. Ecology!	Tim_F	Excel	2	07-18-2014 04:56 AM
word and keyboard speed	ptz	Word	2	09-20-2012 06:37 AM
Help -can i improve this way of working?	lucky67	Office	1	01-26-2012 09:20 PM
Change how windows searches	cadams	Office	1	12-07-2010 10:01 AM