Identify matches between sheet columns

gmaxey · #1 01-29-2025, 06:26 AM

This may sound ridiculously complicated, but here a stab.

Looking for a practical/proficient process to identify matches between data in two sheet columns. For example lets say I have Sheet 1 with columns A and B and 300,000 rows.

Ref# Folder Path

1 C:\Folder 1\Test.png
...

243 C:\Folder 2\Subfolder 1\Testing.jpg

... .....

300000 E:\Folder1\SubFolder2\Test.doc

I have Sheet 2 with 5 columns and 100 rows

SN Type Title Date Name

1-23 A Dogs 1//1/25 Test.doc
....

35-456 A Cats 1/2/24 Test.png

...
100-D-3 B Pigs 1/4/20 Testing.jpg

I need to create a third sheet that gets the data from Sheet 1 Column A where the data in Sheet 2 Column 5 "Name" is "found" in Sheet 1 Column B.

So in the example. Sheet 3 would look like this:

A
1 35-456 Cats 1/2/24 Test.png
243 100-D-3 Pigs 1/4/20 Testing.jpg
300000 1-23 Dogs 1/1/25 Test.doc

My initial process was like this
Loop through each row in Sheet 1 (300,000)
Loop through each row in Sheet 2 (100)
Check InStr(Sheet1.Cell, Sheet2,Cell) > 1
Yes we have a match, develop Sheet3 row
This process takes about a minute to run.

I revised the process to:
Create a collection

Loop through each row in Sheet 2 (100)
Check if Sheet2.Column 5 is Found in Sheet1
Yes
For Each find
Add row index to collection
For Each Index in Collection
Develop Sheet3 row
This process takes about 12 seconds.

Either method is working, but the potential exists for the row counts in sheet 1 or sheet 2 to grow considerably and the only result is longer processing time.

Wondering if there is a better way. I know nothing about Power Query and very little about Excel. Hoping some expert knows a more efficient process to minimize looping to identify all row indexes in sheet 1 that requires processing?

Thanks

cross posted at Identify Matches between Sheet Columns

batman1 · #2 01-29-2025, 08:41 AM

Code:

Sub demo()
Dim lastRow As Long, pos As Long, i As Long, count As Long, path As String, key As String, data(), result(), dic As Object
    With Worksheets("Sheet1")
        lastRow = .Cells(Rows.count, "A").End(xlUp).Row
        If lastRow = 1 Then Exit Sub    ' there is no data in Sheet1
        data = .Range("A2").Resize(lastRow - 1, 2).Value    ' data from sheet1 to data()
    End With
    With Worksheets("Sheet2")
        lastRow = .Cells(Rows.count, "A").End(xlUp).Row
        If lastRow = 1 Then Exit Sub    ' there is no data in Sheet2
    End With
    Set dic = CreateObject("Scripting.Dictionary")
    dic.CompareMode = vbTextCompare
    For i = 1 To UBound(data, 1)
        path = data(i, 2)   ' column B sheet1
        pos = InStrRev(path, "\")
        If pos > 0 Then
            key = Mid(path, pos + 1, Len(path))
            If Not dic.exists(key) Then dic.Add key, data(i, 1) ' item = column A
        End If
    Next i
    data = Worksheets("Sheet2").Range("A2").Resize(lastRow - 1, 5).Value    ' data from sheet2 to data()
    ReDim result(1 To UBound(data, 1), 1 To 1)  ' result in one column: <column A sheet1)><sheet2: <SN><Title>< Date><Name>>
    For i = 1 To UBound(data, 1)
        key = data(i, 5)
        If dic.exists(key) Then
            count = count + 1
            result(count, 1) = dic.Item(key) & " " & data(i, 1) & " " & data(i, 3) & " " & data(i, 4) & " " & data(i, 5)
        End If
    Next i
    If count Then
        Worksheets("Sheet3").Range("A2").Resize(count).Value = result
    End If
    
    Set dic = Nothing
End Sub

gmaxey · #3 01-29-2025, 11:07 AM

Thanks. I'll look at this and see if I can make it work.

batman1 · #4 01-29-2025, 11:20 AM

If the results are to be in 5 columns: Ref#, SN, Title, Date, Name then change it to

Code:

ReDim result(1 To UBound(data, 1), 1 To 5) 
…
If dic.exists(key) Then
            count = count + 1
            result(count, 1) = dic.Item(key)  
            result(count, 2) = data(i, 1)
            result(count, 3) = data(i, 3)
            result(count, 4) = data(i, 4)
            result(count, 5) = data(i, 5)
End If
…
Worksheets("Sheet3").Range("A2").Resize(count, 5).Value = result

gmaxey · #5 01-29-2025, 11:58 AM

batman1,

Interesting approach and while I am close to making it work, the actual file is much larger and more complicated than my simple example.

Two things:
1. There is not always a match. So when I run your code, Count and Ubound(result) are not the same.
2. So, I think that is why I get error about subscript out of range with Resize(count, 5).

I can't upload the file here. It is too big. You can get it at the cross post link if your want to look.

batman1 · #6 01-29-2025, 03:37 PM

Quote:

Originally Posted by gmaxey

batman1,

Interesting approach and while I am close to making it work, the actual file is much larger and more complicated than my simple example.

Two things:
1. There is not always a match. So when I run your code, Count and Ubound(result) are not the same.

They don't have to be the same. I understand that for each row in Sheet2 you have to look for Name in Sheet1. At the very beginning the code doesn't know how many matches there are, so I declare result with the maximum row - as many as the row in Sheet2

Code:

ReDim result(1 To UBound(data, 1), 1 To 5)

So for example Sheet2 has 100 rows then Ubound(result, 1) = UBound(data, 1) = 100, and there are 5 matches then count = 5 - Ubound(result, 1) <> count.
But since there are UBound(data, 1) rows in Sheet2 then there are maximum count = UBound(data, 1) matches:
Always count <= UBound(data, 1) = UBound(result, 1)

Quote:

2. So, I think that is why I get error about subscript out of range with Resize(count, 5).

I don't understand. count <= UBound(result, 1), so it can't be "subscript out of range error"
Since you didn't attach a file, I created the file myself and I'm testing for data in such structures. For the future, attach the file.

gmaxey · #7 01-29-2025, 03:47 PM

I tried to attach the file previously but it is too big. I was able to get your code to run but the result when run on my test file only returns 22 rows when it should return 43. I was able to post the test file here:

Identify Matches between Sheet Columns (see post #6)

It is a pretty complicated process (for me at least) and the example I asked about was a feeler to see if there was a way to do it without loop through all of the rows (speed enhancement). I believe with some slight mods that your approach might work, but it stills does a lot of looping and I'm not sure if in the end there would be that much improvement in speed.

Thank you for sharing your ideas. I have used a dictionary before but clearly not as proficient with them as you. Well done!!

batman1 · #8 01-29-2025, 04:22 PM

Quote:

Originally Posted by gmaxey

I tried to attach the file previously but it is too big. I was able to get your code to run but the result when run on my test file only returns 22 rows when it should return 43. I was able to post the test file here:

Identify Matches between Sheet Columns (see post #6)

It is a pretty complicated process (for me at least) and the example I asked about was a feeler to see if there was a way to do it without loop through all of the rows (speed enhancement). I believe with some slight mods that your approach might work, but it stills does a lot of looping and I'm not sure if in the end there would be that much improvement in speed.

Thank you for sharing your ideas. I have used a dictionary before but clearly not as proficient with them as you. Well done!!

1. I understand that the value to search is in Sheet NDS_SHEET, but exactly in which column? Maybe it's just part of the cell value?
For example, let's take row 5 in NDS_SHEET, the value to search for = ???
2. We already have the value to search for, but where do we search in sheet REF_SHEET?

It is best to attach a file:
1. NDS_SHEET with 20 rows of data
2. REF_SHEET with 100 rows of data
3. Manually enter all matches in the RESULT sheet

gmaxey · #9 01-29-2025, 05:00 PM

I'll have to create something for you tomorrow. Thank you.

gmaxey · #10 01-29-2025, 05:17 PM

The attached should let you see the current process and results.

batman1 · #11 01-29-2025, 06:58 PM

Quote:

Originally Posted by gmaxey

The attached should let you see the current process and results.

Sub demo in Module1

gmaxey · #12 01-30-2025, 02:27 AM

Quote:

Originally Posted by batman1

Sub demo in Module1

Batman,
While you matched the output of the last test file, your process when run on the actual data file returned 21 rows when I should have returned 39.

I think the problem is your output is limited to the number of unique file names you find in NDS_SHEET. That is not the intended process.

The intended process is to look at every row in the RS_SHEET
For Every row look at Every row in the NDS_SHEET
If the row NDS_SHEET File Name (column 8) is InStr(RS_SHEET column 2)

Then create record in Overlay.

I'm currently doing that process just in a different way and I get the result I need. It is faster than looping through every row and every row but still takes time.

Your process seems faster but it isn't returning all the required results.

I've attached another file which I think will help you see what I am talking about if you still have interest. Thank you.

batman1 · #13 01-30-2025, 03:18 AM

Quote:

Originally Posted by gmaxey

Your process seems faster but it isn't returning all the required results.

Well, because you didn't write many things. I can't guess your intention. You have to say everything.
1. The case of duplicates in column H in NDS_SHEET In NSD_SHEET H24 = H25 = DemoData_0000015_Import.zip You didn't write how to search, so I did this: When the code finds H24 = "DemoData_0000015_Import.zip" in B229 in REF_SHEET, it no longer searches for H25 in REF_SHEET (that's why I use dictionary DIC) So how do you want it now? Should there be 2 results (for H24 and H25)? <#266912.1><line 24> <#266912.1><line 25> ??? Or maybe something else? If you don't say anything, I won't know anything.

2. The issue of duplicates in column B in REF_SHEET. For example, let's say that "DemoData_0000015_Import.zip" occurs 3 times in column B of REF_SHEET - in B299, B367 and B489. How many results should there be? At the moment, the code will find "DemoData_0000015_Import.zip" in B299 and stop searching, it will not search any further to find in B367 and B489 (Exit For in If pos Then … End If). So what do you want now? The code should search further to B367 and B489?

3. Are we looking for "File Name" from column H of sheet NDS_SHEET in column B of sheet REF_SHEET, or vice versa: will we find for each "File Path" from column B of sheet REF_SHEET its "companion" in column H of sheet NDS_SHEET?

4. Or maybe something else that I don't know about? If you don't say anything, I won't know anything.

gmaxey · #14 01-30-2025, 04:06 AM

Quote:

Originally Posted by batman1

Well, because you didn't write many things. I can't guess your intention. You have to say everything.
1. The case of duplicates in column H in NDS_SHEET In NSD_SHEET H24 = H25 = DemoData_0000015_Import.zip You didn't write how to search, so I did this: When the code finds H24 = "DemoData_0000015_Import.zip" in B229 in REF_SHEET, it no longer searches for H25 in REF_SHEET (that's why I use dictionary DIC) So how do you want it now? Should there be 2 results (for H24 and H25)? <#266912.1><line 24> <#266912.1><line 25> ??? Or maybe something else? If you don't say anything, I won't know anything.

2. The issue of duplicates in column B in REF_SHEET. For example, let's say that "DemoData_0000015_Import.zip" occurs 3 times in column B of REF_SHEET - in B299, B367 and B489. How many results should there be? At the moment, the code will find "DemoData_0000015_Import.zip" in B299 and stop searching, it will not search any further to find in B367 and B489 (Exit For in If pos Then … End If). So what do you want now? The code should search further to B367 and B489?

3. Are we looking for "File Name" from column H of sheet NDS_SHEET in column B of sheet REF_SHEET, or vice versa: will we find for each "File Path" from column B of sheet REF_SHEET its "companion" in column H of sheet NDS_SHEET?

4. Or maybe something else that I don't know about? If you don't say anything, I won't know anything.

Batman,
I understand your frustration. As I tried to explain at the very start, it is a complicated process. Complicated and difficult to explain.

The REF_SHEET can contain hundreds of thousands of records. The one I am testing with has 440,000. The NDS_SHEET can have tens of thousands, but for testing I have only 21.

The code I provided in modMain launches a userform interface. With that interface you can create all of the possible outcomes. It works. I am just looking for something that might work faster.

DemoData_0000015_Import.zip You didn't write how to search, so I did this: When the code finds H24 = "DemoData_0000015_Import.zip" in B229 in REF_SHEET, it no longer searches for H25 in REF_SHEET (that's why I use dictionary DIC) So how do you want it now? Should there be 2 results (for H24 and H25)? <#266912.1><line 24> <#266912.1><line 25> ??? Or maybe something else? If you don't say anything, I won't know anything.

That depends on if you select "First Match Only" or "Duplicate Reference Row" in the first case then there should be only one record in the Overlay for #266912.1" in the second case there should be two records.

2. The issue of duplicates in column B in REF_SHEET. For example, let's say that "DemoData_0000015_Import.zip" occurs 3 times in column B of REF_SHEET - in B299, B367 and B489. How many results should there be? At the moment, the code will find "DemoData_0000015_Import.zip" in B299 and stop searching, it will not search any further to find in B367 and B489 (Exit For in If pos Then … End If). So what do you want now? The code should search further to B367 and B489?

Yes. Every record of the REF_SHEET that might have a match in the NDS_SHEET must be processed. That was my initial approach. Loop through every REF_SHEET record ... with my actual data this was taking about a minute.

I changed that initial approach to first loop through the smaller NDS_SHEET to find and create a collection of REF_SHEET Rows that "DO" have a match. I then looped through the Rows in that collection. With the actual data the process now takes about 12 seconds. The options for the REF_SHEET records are:
If No Match Found in NDS_SHEET
1. Discard (don't include) RS_SHEET REF/Control# records in Overlay
2. Copy control ID from RS_SHEET as new record in Overlay
3. Cancel processing - don't create overlay

3. Are we looking for "File Name" from column H of sheet NDS_SHEET in column B of sheet REF_SHEET. Yes. NDS_SHEET column 8 list single file name e.g., Test.png.
REF_SHEET column 2 can list single or mulitple files names e.g., Testing.png//Test.png//This is a test.png. If InStr(REF_SHEET(2),NDS_SHEET(8)> 0 Then
If First Match Option
Record and get out
Else
Record either as new row, or delimited data in existing row.
End If

... will we find for each "File Path" from column B of sheet REF_SHEET its "companion" in column H of sheet NDS_SHEET? It is not really a file path. It can be a single file name or multiple files name delimited with // / or \. No. There are three options if a "companion" is not found in the NDS_SHEET.

All the looping takes time. As I've said, I've gotten down to 12 seconds with the 440,000 REF_SHEET I have. I like your thought process and its possible it might shave off a a few seconds. I thought a power query might get it done in the blink of and eye?? Thank you.

ArviLaanemets · #15 01-30-2025, 12:22 AM

No need for coding for this.

As start, I'd advice to split the 2nd column to path and file columns respectively (a golden database law - don't keep several types of information in same column/field). This will help not only with current task, but will be useful in future too. In my excample, on sheet Files I added formulas how to do this easily.

Don't use special characters in column names (I replaced Ref# with RefNo).

On Info sheet, I added a RefNo column with formula which returns matching RefNo from Files sheet. Now you can simply set the filter for this column >0 to display all entries where the matching RefNo was found - or set the filer = 0 to display all entries where matching RefNo was not found.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
How to Remove the subtotal columns in a sheet	Marcia	Excel	3	12-01-2023 05:48 AM
Relating cells in a row in one sheet to cells in columns on another sheet.	mbesspiata3	Excel	2	01-06-2017 05:42 AM
Create a New Sheet from Existing Sheet with Specific Columns	malam	Excel Programming	1	10-17-2014 10:01 PM
From an XL sheet ,how to keep the group of columns which match with other XL sheet	Zubairkhan	Excel	2	03-04-2014 10:57 PM
Removing columns within sheet	shabbaranks	Excel	2	09-11-2012 05:03 AM