Question 1

What's the actual trick to this problem?

Accepted Answer

Group files by their content hash using a hash table, then return only the groups with more than one file. The parsing is the real work: read the input format carefully, extract the path and the hash separately, and bucket by hash value. A bucket becomes a duplicate group as soon as it receives its second file.
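
The grouping step can be sketched in a few lines of Python. This is a minimal sketch, not a reference solution: it assumes the commonly seen input format where each entry is a directory followed by space-separated `name(content)` tokens, for example `root/a 1.txt(abcd) 2.txt(efgh)`. The function name `find_duplicates` is hypothetical.

```python
from collections import defaultdict


def find_duplicates(paths):
    """Group full file paths by content identifier; keep groups of 2+ files."""
    groups = defaultdict(list)
    for entry in paths:
        # Assumed format: the first token is the directory, the rest are files.
        directory, *files = entry.split(" ")
        for token in files:
            # Split "name(content)" at the first "(" and drop the trailing ")".
            name, _, rest = token.partition("(")
            content = rest[:-1]
            groups[content].append(f"{directory}/{name}")
    # A bucket with a single file has no duplicate, so it is filtered out.
    return [group for group in groups.values() if len(group) > 1]
```

Calling `find_duplicates(["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"])` would return one group containing `root/a/1.txt` and `root/c/3.txt`. The file with content `efgh` is dropped because nothing else shares it.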

Question 2

Is this still asked at Dropbox and Applied Intuition?

Accepted Answer

Yes. Dropbox, Applied Intuition, and Turing all report asking it. At 67 percent acceptance, it's a medium that interviewers don't treat as trivial. The problem maps directly to real file-comparison work at these companies, so expect variations on the input format or on the definition of a duplicate.

Question 3

What's the most common mistake candidates make?

Accepted Answer

Misreading the input format or trying to parse file content when you're already given a content identifier. Others waste time with nested loops or unnecessary sorting. The hash table approach is linear once you've parsed correctly. Off-by-one errors in extracting the hash from each file path also burn time.
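
To make the off-by-one risk concrete, here is a small sketch of extracting the name and content from a single file token using explicit indices. It assumes tokens shaped like `1.txt(abcd)`, with the content enclosed in one pair of parentheses at the end. The helper name `split_file_token` is hypothetical.

```python
def split_file_token(token):
    """Split 'name(content)' into (name, content)."""
    open_paren = token.index("(")
    name = token[:open_paren]
    # The content sits strictly between "(" and the final ")".
    # Starting the slice at open_paren instead of open_paren + 1 keeps the "(";
    # ending it at len(token) instead of -1 keeps the ")".
    content = token[open_paren + 1 : -1]
    return name, content
```

Either slip leaves a stray parenthesis in every key. Grouping can still appear to work, which makes the bug easy to miss until the output is compared against the expected paths.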

Question 4

Do I need to sort the output or handle edge cases?

Accepted Answer

Check the problem statement for output order. Most variants don't care about sort order, but some require lexicographic ordering of file paths within each duplicate group. Handle empty input and single-file cases, both of which should produce no groups. The content hash identifier is given directly for each file, so two files are duplicates exactly when their identifiers match, and collisions aren't a concern.
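
If a variant does require ordered output, one possible normalization step is sketched below. It assumes the result is a list of groups, each a list of path strings. The helper name `normalize_groups` is hypothetical, and sorting the groups themselves is an extra assumption that only some variants would need.

```python
def normalize_groups(groups):
    """Sort paths within each group, then sort the groups themselves."""
    # Inner sort: lexicographic order of file paths within a group.
    # Outer sort: groups compared as lists, i.e. by their first path, then the next.
    return sorted(sorted(group) for group in groups)
```

Empty input falls out naturally: `normalize_groups([])` returns an empty list, so no special case is needed for it.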

Question 5

How does this relate to the Hash Table and Array topics?

Accepted Answer

Hash Table groups files by content. Array is your output container for the list of duplicate lists. String handling comes in parsing the file path format. All three are equally weighted. If you nail the hash table design, the rest is straightforward iteration and formatting.

Find Duplicate File in System
