MATCH FILES is an SPSS command mostly used for merging data holding similar cases but different variables. For different cases but similar variables, use ADD FILES. MATCH FILES is also the way to go for a table lookup similar to VLOOKUP in Excel.

Merging two datasets by id, which is a unique case identifier.

SPSS Match Files - Basic Use

The most common scenario for MATCH FILES are two data files or datasets holding different variables on similar cases.
Each case has a unique id (identifier) in each data source. This id tells SPSS which case from one data source corresponds to which case from the other. Corresponding cases become a single case in the merged data.
The syntax below demonstrates a very basic MATCH FILES command. If you're not comfortable working with multiple datasets, have a look at SPSS Datasets Tutorial 1 - Basics.

SPSS Match Files Syntax Example 1

*1. Create test data 1.

data list free/id test_1.
begin data
3 8 4 5 6 6
end data.
dataset name test_1.

*2. Create test data 2.

data list free/id test_2.
begin data
1 4 3 9 4 8
end data.
dataset name test_2.

*3. Match test_1 and test_2.

match files file = test_1 / file = test_2
/by id.
execute.

*4. Close all but merged dataset.

dataset close test_1.
dataset close test_2.

SPSS Match Files - Table

A second common scenario is having a file with respondents and their zip codes. Note that there are probably duplicate zip codes in the respondents file.
If we also have a table with the city (or region) indicated by each zip code, we can merge these into the respondent data. In this case we can use MATCH FILES with one FILE (with duplicates) and one TABLE (without duplicates).
The syntax below demonstrates how to do this. Note that * refers to the active dataset.

SPSS Match Files Syntax Example 2

*1. Table holding zip codes and cities.

data list free/zip_code (f3.0) city(a20).
begin data
123 'Amsterdam' 456 'Haarlem' 789 "'s Hertogenbosch"
end data.
dataset name cities.

*2. Mini data holding respondents and their zip codes.

data list free /id zip_code.
begin data
1 123 2 123 3 123 4 456 5 456 6 456 7 789 8 789 9 789
end data.

*3. Add cities to active dataset using zip_code.

match files file * / table cities
/by zip_code.
execute.

*4. Close all but merged data.

dataset close cities.

SPSS Match Files - One Data Source

Match files can also be used with a single data source. This is often used for reordering variables and/or dropping variables..
One option here is using the KEEP subcommand. It basically means “drop all variables except ...”.
Alternatively, the DROP subcommand means “keep all variables except ...”.Note that these subcommands can be used in a similar way in a GET FILE, SAVE and ADD FILES command.
The TO and ALL keywords are convenient here. However, in this case ALL means “all variables that haven't been addressed yet” rather than simply all variables.

SPSS Match Files Syntax Example 3

*1. Single case test data with wrong variable order.

data list free / v1 to v3 v5 v6 v7 v8 v4.
begin data
0 0 0 0 0 0 0 0
end data.

* 2. Reorder variables. Note the TO and ALL keywords here.

match files file * / keep v1 to v3 v4 all.
execute.

SPSS Match Files - Rules

Instead of merging two data sources, you may specify up to 50 data sources in one MATCH FILES command.
More than one variable may be used to uniquely identify cases. We'll hereafter refer to these as the BY variables since they're used on the BY subcommand. An common example are respondents having a household_id and a member_id indicating the nth member of each household. Both variables will probably have many duplicates but their combination should uniquely identify each respondent.
All data must be sorted on the BY variable(s) ascendingly. In case of doubt, run SORT CASES before proceeding.
The order of the merged variables is the order in which they're encountered. This implies that the order in which data sources are specified matters for the end result. For a demo, run the first syntax example once with file = test_1 / file = test_2 and then again with file = test_2 / file = test_1.
Make sure there's no duplicate variable names across data sources. In this case, values on duplicate variables that are first encountered overwrite those that are encountered later. Annoyingly, SPSS does not throw a warning if this happens.

THIS TUTORIAL HAS 18 COMMENTS:

By Ruben Geert van den Berg on May 14th, 2016

Hi Sarah!

Sorry for my late response, my workload's been pretty insane the last 2 days.

Anyway, there may be up to 15 different response patterns in your data. You indicate that these patterns are present in your data for each participant separately. However, for the big picture you could create a nice overview table by using AGGREGATE in a handy way.

Take a look at SPSS - Find Response Patterns with AGGREGATE and let me know whether that works for you, ok?
By Sharon Walhout on November 2nd, 2016

Hi Ruben,
I have two different sets of data, one including alcohol intoxicated persons (healthy subjects), the other including alcohol dependent patients. I need to compare these in order to assess the effects of alcohol, on the short- and long term. Can I use your method of matching to do so? I need the two data files to be matched on age, gender and education, so a comparison is possible.

Thank you!
By Ruben Geert van den Berg on November 2nd, 2016

Hi Sharon! For your scenario, ADD FILES seems the way to go instead of MATCH FILES. Next, see if the alco and nonalco respondents actually differ on age, gender and education with an independent samples t test, z-test for 2 independent proportions and a chi-square independence test.

If they do differ, you may try and correct for it by including these factors as covariates in your model. However, do keep in mind that you can't really establish any causal effects with such a cross-sectional design. Even if the alco and nonalco cases are similar in age, gender and education, they may differ on a million other -possibly unmeasured- characteristics and these may be the true cause for any differences on any outcome variables.

This exact same problem renders a comparison of -for instance- hospitals almost impossible: different hospitals have different patient populations and these -instead of the quality of medical care- may cause different medical outcomes. Or it could in fact be differences in medical quality after all. It's pretty much impossible to distinguish these unless patients can be randomly assigned to hospitals.

Hope this doesn't ruin your evening.
By Linda on January 6th, 2018

Hi all,

I have a question related to adding variables to imputed datasets. I have an imputed dataset, which means that each case appears 10 times in the dataset (we performed multiple imputation and did 10 imputations). Now I want to add an extra variable to this imputed dataset. Is that possible?

I tried multiple times by using the key variable ID (each case has a unique ID), but then, for each case, a value for the new variable was only added to the first imputed dataset. No value was added to the other 9 datasets of that case. So it seems not to work.

I hope anyone can help! Thanks!
By Amy on January 10th, 2019

Hi Ruben - I am looking at the marker variable created by using the syntax [/in] for tracing a case and whether they appear in one of the data sets. How is this made?

So, for below, does the variable [in_DS2] have a value of 1 for any cases that are in dataset 2 or only those with at least one non-missing value?

MATCH FILES file = 'Dataset 1'
/file = 'Dataset 2' /in = in_DS2
/by uniqueidentifier.
EXECUTE.