![]() |
|
#1
|
|||
|
|||
![]()
This may sound ridiculously complicated, but here a stab.
Looking for a practical/proficient process to identify matches between data in two sheet columns. For example lets say I have Sheet 1 with columns A and B and 300,000 rows. Ref# Folder Path 1 C:\Folder 1\Test.png ... 243 C:\Folder 2\Subfolder 1\Testing.jpg ... ..... 300000 E:\Folder1\SubFolder2\Test.doc I have Sheet 2 with 5 columns and 100 rows SN Type Title Date Name 1-23 A Dogs 1//1/25 Test.doc .... 35-456 A Cats 1/2/24 Test.png ... 100-D-3 B Pigs 1/4/20 Testing.jpg I need to create a third sheet that gets the data from Sheet 1 Column A where the data in Sheet 2 Column 5 "Name" is "found" in Sheet 1 Column B. So in the example. Sheet 3 would look like this: A 1 35-456 Cats 1/2/24 Test.png 243 100-D-3 Pigs 1/4/20 Testing.jpg 300000 1-23 Dogs 1/1/25 Test.doc My initial process was like this Loop through each row in Sheet 1 (300,000) Loop through each row in Sheet 2 (100) Check InStr(Sheet1.Cell, Sheet2,Cell) > 1 Yes we have a match, develop Sheet3 row This process takes about a minute to run. I revised the process to: Create a collection Loop through each row in Sheet 2 (100) Check if Sheet2.Column 5 is Found in Sheet1 Yes For Each find Add row index to collection For Each Index in Collection Develop Sheet3 row This process takes about 12 seconds. Either method is working, but the potential exists for the row counts in sheet 1 or sheet 2 to grow considerably and the only result is longer processing time. Wondering if there is a better way. I know nothing about Power Query and very little about Excel. Hoping some expert knows a more efficient process to minimize looping to identify all row indexes in sheet 1 that requires processing? Thanks cross posted at Identify Matches between Sheet Columns |
#2
|
|||
|
|||
![]() Code:
Sub demo() Dim lastRow As Long, pos As Long, i As Long, count As Long, path As String, key As String, data(), result(), dic As Object With Worksheets("Sheet1") lastRow = .Cells(Rows.count, "A").End(xlUp).Row If lastRow = 1 Then Exit Sub ' there is no data in Sheet1 data = .Range("A2").Resize(lastRow - 1, 2).Value ' data from sheet1 to data() End With With Worksheets("Sheet2") lastRow = .Cells(Rows.count, "A").End(xlUp).Row If lastRow = 1 Then Exit Sub ' there is no data in Sheet2 End With Set dic = CreateObject("Scripting.Dictionary") dic.CompareMode = vbTextCompare For i = 1 To UBound(data, 1) path = data(i, 2) ' column B sheet1 pos = InStrRev(path, "\") If pos > 0 Then key = Mid(path, pos + 1, Len(path)) If Not dic.exists(key) Then dic.Add key, data(i, 1) ' item = column A End If Next i data = Worksheets("Sheet2").Range("A2").Resize(lastRow - 1, 5).Value ' data from sheet2 to data() ReDim result(1 To UBound(data, 1), 1 To 1) ' result in one column: <column A sheet1)><sheet2: <SN><Title>< Date><Name>> For i = 1 To UBound(data, 1) key = data(i, 5) If dic.exists(key) Then count = count + 1 result(count, 1) = dic.Item(key) & " " & data(i, 1) & " " & data(i, 3) & " " & data(i, 4) & " " & data(i, 5) End If Next i If count Then Worksheets("Sheet3").Range("A2").Resize(count).Value = result End If Set dic = Nothing End Sub |
#3
|
|||
|
|||
![]()
Thanks. I'll look at this and see if I can make it work.
|
#4
|
|||
|
|||
![]()
If the results are to be in 5 columns: Ref#, SN, Title, Date, Name then change it to
Code:
ReDim result(1 To UBound(data, 1), 1 To 5) … If dic.exists(key) Then count = count + 1 result(count, 1) = dic.Item(key) result(count, 2) = data(i, 1) result(count, 3) = data(i, 3) result(count, 4) = data(i, 4) result(count, 5) = data(i, 5) End If … Worksheets("Sheet3").Range("A2").Resize(count, 5).Value = result |
#5
|
|||
|
|||
![]()
batman1,
Interesting approach and while I am close to making it work, the actual file is much larger and more complicated than my simple example. Two things: 1. There is not always a match. So when I run your code, Count and Ubound(result) are not the same. 2. So, I think that is why I get error about subscript out of range with Resize(count, 5). I can't upload the file here. It is too big. You can get it at the cross post link if your want to look. |
#6
|
|||
|
|||
![]() Quote:
Code:
ReDim result(1 To UBound(data, 1), 1 To 5) But since there are UBound(data, 1) rows in Sheet2 then there are maximum count = UBound(data, 1) matches: Always count <= UBound(data, 1) = UBound(result, 1) Quote:
Since you didn't attach a file, I created the file myself and I'm testing for data in such structures. For the future, attach the file. |
#7
|
|||
|
|||
![]()
I tried to attach the file previously but it is too big. I was able to get your code to run but the result when run on my test file only returns 22 rows when it should return 43. I was able to post the test file here:
Identify Matches between Sheet Columns (see post #6) It is a pretty complicated process (for me at least) and the example I asked about was a feeler to see if there was a way to do it without loop through all of the rows (speed enhancement). I believe with some slight mods that your approach might work, but it stills does a lot of looping and I'm not sure if in the end there would be that much improvement in speed. Thank you for sharing your ideas. I have used a dictionary before but clearly not as proficient with them as you. Well done!! |
#8
|
|||
|
|||
![]() Quote:
For example, let's take row 5 in NDS_SHEET, the value to search for = ??? 2. We already have the value to search for, but where do we search in sheet REF_SHEET? It is best to attach a file: 1. NDS_SHEET with 20 rows of data 2. REF_SHEET with 100 rows of data 3. Manually enter all matches in the RESULT sheet |
#9
|
|||
|
|||
![]()
I'll have to create something for you tomorrow. Thank you.
|
#10
|
|||
|
|||
![]()
The attached should let you see the current process and results.
|
#11
|
|||
|
|||
![]() Sub demo in Module1 |
#12
|
|||
|
|||
![]() Batman, While you matched the output of the last test file, your process when run on the actual data file returned 21 rows when I should have returned 39. I think the problem is your output is limited to the number of unique file names you find in NDS_SHEET. That is not the intended process. The intended process is to look at every row in the RS_SHEET For Every row look at Every row in the NDS_SHEET If the row NDS_SHEET File Name (column 8) is InStr(RS_SHEET column 2) Then create record in Overlay. I'm currently doing that process just in a different way and I get the result I need. It is faster than looping through every row and every row but still takes time. Your process seems faster but it isn't returning all the required results. I've attached another file which I think will help you see what I am talking about if you still have interest. Thank you. |
#13
|
|||
|
|||
![]() Quote:
1. The case of duplicates in column H in NDS_SHEET In NSD_SHEET H24 = H25 = DemoData_0000015_Import.zip You didn't write how to search, so I did this: When the code finds H24 = "DemoData_0000015_Import.zip" in B229 in REF_SHEET, it no longer searches for H25 in REF_SHEET (that's why I use dictionary DIC) So how do you want it now? Should there be 2 results (for H24 and H25)? <#266912.1><line 24> <#266912.1><line 25> ??? Or maybe something else? If you don't say anything, I won't know anything. 2. The issue of duplicates in column B in REF_SHEET. For example, let's say that "DemoData_0000015_Import.zip" occurs 3 times in column B of REF_SHEET - in B299, B367 and B489. How many results should there be? At the moment, the code will find "DemoData_0000015_Import.zip" in B299 and stop searching, it will not search any further to find in B367 and B489 (Exit For in If pos Then … End If). So what do you want now? The code should search further to B367 and B489? 3. Are we looking for "File Name" from column H of sheet NDS_SHEET in column B of sheet REF_SHEET, or vice versa: will we find for each "File Path" from column B of sheet REF_SHEET its "companion" in column H of sheet NDS_SHEET? 4. Or maybe something else that I don't know about? If you don't say anything, I won't know anything. |
#14
|
|||
|
|||
![]() Quote:
Batman, I understand your frustration. As I tried to explain at the very start, it is a complicated process. Complicated and difficult to explain. The REF_SHEET can contain hundreds of thousands of records. The one I am testing with has 440,000. The NDS_SHEET can have tens of thousands, but for testing I have only 21. The code I provided in modMain launches a userform interface. With that interface you can create all of the possible outcomes. It works. I am just looking for something that might work faster. DemoData_0000015_Import.zip You didn't write how to search, so I did this: When the code finds H24 = "DemoData_0000015_Import.zip" in B229 in REF_SHEET, it no longer searches for H25 in REF_SHEET (that's why I use dictionary DIC) So how do you want it now? Should there be 2 results (for H24 and H25)? <#266912.1><line 24> <#266912.1><line 25> ??? Or maybe something else? If you don't say anything, I won't know anything. That depends on if you select "First Match Only" or "Duplicate Reference Row" in the first case then there should be only one record in the Overlay for #266912.1" in the second case there should be two records. 2. The issue of duplicates in column B in REF_SHEET. For example, let's say that "DemoData_0000015_Import.zip" occurs 3 times in column B of REF_SHEET - in B299, B367 and B489. How many results should there be? At the moment, the code will find "DemoData_0000015_Import.zip" in B299 and stop searching, it will not search any further to find in B367 and B489 (Exit For in If pos Then … End If). So what do you want now? The code should search further to B367 and B489? Yes. Every record of the REF_SHEET that might have a match in the NDS_SHEET must be processed. That was my initial approach. Loop through every REF_SHEET record ... with my actual data this was taking about a minute. I changed that initial approach to first loop through the smaller NDS_SHEET to find and create a collection of REF_SHEET Rows that "DO" have a match. I then looped through the Rows in that collection. With the actual data the process now takes about 12 seconds. The options for the REF_SHEET records are: If No Match Found in NDS_SHEET 1. Discard (don't include) RS_SHEET REF/Control# records in Overlay 2. Copy control ID from RS_SHEET as new record in Overlay 3. Cancel processing - don't create overlay 3. Are we looking for "File Name" from column H of sheet NDS_SHEET in column B of sheet REF_SHEET. Yes. NDS_SHEET column 8 list single file name e.g., Test.png. REF_SHEET column 2 can list single or mulitple files names e.g., Testing.png//Test.png//This is a test.png. If InStr(REF_SHEET(2),NDS_SHEET(8)> 0 Then If First Match Option Record and get out Else Record either as new row, or delimited data in existing row. End If ... will we find for each "File Path" from column B of sheet REF_SHEET its "companion" in column H of sheet NDS_SHEET? It is not really a file path. It can be a single file name or multiple files name delimited with // / or \. No. There are three options if a "companion" is not found in the NDS_SHEET. All the looping takes time. As I've said, I've gotten down to 12 seconds with the 440,000 REF_SHEET I have. I like your thought process and its possible it might shave off a a few seconds. I thought a power query might get it done in the blink of and eye?? Thank you. |
#15
|
|||
|
|||
![]()
No need for coding for this.
As start, I'd advice to split the 2nd column to path and file columns respectively (a golden database law - don't keep several types of information in same column/field). This will help not only with current task, but will be useful in future too. In my excample, on sheet Files I added formulas how to do this easily. Don't use special characters in column names (I replaced Ref# with RefNo). On Info sheet, I added a RefNo column with formula which returns matching RefNo from Files sheet. Now you can simply set the filter for this column >0 to display all entries where the matching RefNo was found - or set the filer = 0 to display all entries where matching RefNo was not found. |
![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
How to Remove the subtotal columns in a sheet | Marcia | Excel | 3 | 12-01-2023 05:48 AM |
![]() |
mbesspiata3 | Excel | 2 | 01-06-2017 05:42 AM |
![]() |
malam | Excel Programming | 1 | 10-17-2014 10:01 PM |
![]() |
Zubairkhan | Excel | 2 | 03-04-2014 10:57 PM |
Removing columns within sheet | shabbaranks | Excel | 2 | 09-11-2012 05:03 AM |