MATCH FILES is an SPSS command mostly used for merging data holding similar cases but different variables. For different cases but similar variables, use ADD FILES.
MATCH FILES is also the way to go for a table lookup similar to VLOOKUP in Excel.
SPSS Match Files - Basic Use
- The most common scenario for MATCH FILES is two data files or datasets holding different variables on similar cases.
- Each case has a unique id (identifier) in each data source. This id tells SPSS which case from one data source corresponds to which case from the other. Corresponding cases become a single case in the merged data.
- The syntax below demonstrates a very basic MATCH FILES command. If you're not comfortable working with multiple datasets, have a look at SPSS Datasets Tutorial 1 - Basics.
SPSS Match Files Syntax Example 1
*1. Create test data 1.
data list free/id test_1.
begin data
3 8 4 5 6 6
end data.
dataset name test_1.
*2. Create test data 2.
data list free/id test_2.
begin data
1 4 3 9 4 8
end data.
dataset name test_2.
*3. Match test_1 and test_2.
match files file = test_1 / file = test_2
/by id.
execute.
*4. Close all but merged dataset.
dataset close test_1.
dataset close test_2.
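Cases that occur in only one of the two data sources end up in the merged data too, with system missing values on the variables from the other source. As a minimal sketch of how to see where each case came from, the syntax below reruns the example with the IN subcommand. The flag names in_test_1 and in_test_2 are made up for illustration.

```spss
* Sketch only: the flag names in_test_1 and in_test_2 are made up.
data list free / id test_1.
begin data
3 8 4 5 6 6
end data.
dataset name test_1.
data list free / id test_2.
begin data
1 4 3 9 4 8
end data.
dataset name test_2.
* IN adds a flag per data source (1 = source contributed to this case, 0 = it did not).
match files file = test_1 / in = in_test_1
/ file = test_2 / in = in_test_2
/ by id.
execute.
dataset close test_1.
dataset close test_2.
```

With these test data, ids 3 and 4 should be flagged for both sources, id 1 only for test_2 and id 6 only for test_1.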
SPSS Match Files - Table
- A second common scenario is having a file with respondents and their zip codes. Note that there are probably duplicate zip codes in the respondents file.
- If we also have a table with the city (or region) indicated by each zip code, we can merge these into the respondent data. In this case we can use MATCH FILES with one FILE (with duplicates) and one TABLE (without duplicates).
- The syntax below demonstrates how to do this. Note that * refers to the active dataset.
SPSS Match Files Syntax Example 2
*1. Table holding zip codes and cities.
data list free/zip_code (f3.0) city(a20).
begin data
123 'Amsterdam' 456 'Haarlem' 789 "'s Hertogenbosch"
end data.
dataset name cities.
*2. Mini data holding respondents and their zip codes.
data list free /id zip_code.
begin data
1 123 2 123 3 123 4 456 5 456 6 456 7 789 8 789 9 789
end data.
*3. Add cities to active dataset using zip_code.
match files file * / table cities
/by zip_code.
execute.
*4. Close all but merged data.
dataset close cities.
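The test data above happen to be sorted on zip_code already. Real respondent files are more often sorted on id, so a lookup typically needs a sort before the merge and another one afterwards to restore the original order. The syntax below is a rough sketch of that workflow. All values are made up for illustration.

```spss
* Sketch only: all values below are made up for illustration.
* 1. Lookup table, deliberately entered in the wrong order and then sorted.
data list free / zip_code (f3.0) city (a20).
begin data
789 'Utrecht' 123 'Amsterdam' 456 'Haarlem'
end data.
sort cases by zip_code.
dataset name cities.
* 2. Respondents sorted on id rather than zip_code, so sort them on zip_code first.
data list free / id zip_code.
begin data
1 456 2 123 3 789 4 123
end data.
sort cases by zip_code.
* 3. Look up the city for each respondent.
match files file = * / table = cities
/ by zip_code.
execute.
* 4. Restore the original respondent order and close the lookup table.
sort cases by id.
dataset close cities.
```

The lookup table is sorted and named before the respondents are read because an unnamed dataset is lost as soon as another dataset is activated.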
SPSS Match Files - One Data Source
- MATCH FILES can also be used with a single data source. This is often used for reordering variables and/or dropping variables.
- One option here is using the KEEP subcommand. It basically means “drop all variables except ...”.
- Alternatively, the DROP subcommand means “keep all variables except ...”. Note that these subcommands can be used in a similar way in GET FILE, SAVE and ADD FILES commands.
- The TO and ALL keywords are convenient here. However, in this case ALL means “all variables that haven't been addressed yet” rather than simply all variables.
SPSS Match Files Syntax Example 3
*1. Single case test data with wrong variable order.
data list free / v1 to v3 v5 v6 v7 v8 v4.
begin data
0 0 0 0 0 0 0 0
end data.
* 2. Reorder variables. Note the TO and ALL keywords here.
match files file * / keep v1 to v3 v4 all.
execute.
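The example above only uses KEEP. As a minimal sketch of the DROP alternative, the syntax below recreates the same test data and removes v5 through v8 instead of reordering them.

```spss
* Sketch only: same test data as the KEEP example.
data list free / v1 to v3 v5 v6 v7 v8 v4.
begin data
0 0 0 0 0 0 0 0
end data.
* DROP removes the listed variables and leaves the others in their original order.
match files file = * / drop = v5 to v8.
execute.
```

Note that TO follows the variable order in the data, so v5 to v8 covers v5, v6, v7 and v8 here but not v4, which comes last. The result should hold just v1, v2, v3 and v4.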
SPSS Match Files - Rules
- Instead of merging two data sources, you may specify up to 50 data sources in one MATCH FILES command.
- More than one variable may be used to uniquely identify cases. We'll hereafter refer to these as the BY variables since they're used on the BY subcommand. A common example is respondents having a household_id and a member_id indicating the nth member of each household. Both variables will probably have many duplicates but their combination should uniquely identify each respondent.
- All data must be sorted on the BY variable(s) in ascending order. In case of doubt, run SORT CASES before proceeding.
- The order of the merged variables is the order in which they're encountered. This implies that the order in which data sources are specified matters for the end result. For a demo, run the first syntax example once with file = test_1 / file = test_2 and then again with file = test_2 / file = test_1.
- Make sure there are no duplicate variable names across data sources. If there are, values on duplicate variables that are first encountered overwrite those that are encountered later. Annoyingly, SPSS does not throw a warning if this happens.
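Several of these rules can be combined in one sketch: two BY variables, sorting both data sources first, and renaming a duplicate variable so that neither version gets overwritten. The dataset names round_1 and round_2, the variable names and all values are made up for illustration. The RENAME subcommand applies to the data source specified just before it.

```spss
* Sketch only: all dataset names, variable names and values are made up.
* 1. First data source, entered unsorted and then sorted on both BY variables.
data list free / household_id member_id score.
begin data
2 1 9 1 2 5 1 1 7
end data.
sort cases by household_id member_id.
dataset name round_1.
* 2. Second data source holding a variable with the same name.
data list free / household_id member_id score.
begin data
2 1 8 1 1 6
end data.
sort cases by household_id member_id.
dataset name round_2.
* 3. Rename the duplicate variable in each source before merging.
match files file = round_1 / rename = (score = score_1)
/ file = round_2 / rename = (score = score_2)
/ by household_id member_id.
execute.
* 4. Close all but merged dataset.
dataset close round_1.
dataset close round_2.
```

The merged data should hold three cases. The second member of household 1 only occurs in round_1 and therefore gets a system missing value on score_2.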
THIS TUTORIAL HAS 18 COMMENTS:
By Ruben Geert van den Berg on March 15th, 2016
Hi Hadi!
First of all, with similar variables and different cases, you'll need ADD FILES for merging the data, not MATCH FILES. Depending on your data, you can perhaps first merge the data and then perform all recodes and so on on the merged data. Otherwise, simply rerun the syntax you used for the first file -step by step- on the second file and carefully inspect intermediate results.
If you're not working from syntax yet, then we recommend you start doing so right away. Also see reasons for using SPSS syntax and our SPSS syntax beginners tutorial.
By Hadi on March 15th, 2016
Thank you Ruben.
Yes! Syntax, I had totally forgotten that its commands would work on any active file no matter where they were produced first. Excellent reminder!
By Sarah Maass on May 11th, 2016
Hi! The "Match" component appears to be what I need. However, I have every variable already in one dataset. Here is my situation. I have a secondary data set where data was collected in 4 waves. I am wanting to see how many individuals participated in all 4 waves, but am not 100% certain what command/analysis to run. I ran across the "Match Files" tutorial after I did some google searching, but from what I can tell it just discusses two or more data files while all of mine has already been combined. I would welcome any thoughts or suggestions. Thanks!
By Ruben Geert van den Berg on May 11th, 2016
Hi Sarah!
I'm not sure I'm entirely getting it: you've 4 waves of data and they're already in one file, right? I'll make some likely assumptions: every respondent may have between 1 and 4 waves of data (people with 0 waves of data will be absent from the data altogether). Every respondent is a single row of data in your file.
If somebody missed a wave of data, they'll probably have system missing values on all variables representing that wave. You can create 4 new variables that hold the number of missing values per wave by combining COMPUTE with the NMISS or NVALID functions as explained in SPSS Missing Values Functions.
For each variable, if the number of missings is equal to the number of variables in the corresponding wave, then the respondent missed the entire wave (you could create 4 new dichotomous variables indicating whether this is the case).
With 4 waves, there are 15 possible patterns: 2 * 2 * 2 * 2 = 16, minus 1 for respondents who missed all waves. You can quickly get an overview of these 15 patterns with AGGREGATE.
Hope that helps!
By Sarah Maass on May 12th, 2016
Let me try to explain this a little better as I'm not sure what you recommended is what I really need...
There are four waves of data in one data set. There are already variables created that indicate the waves the subjects participated in. I am just uncertain of what analysis to run to know that Susie participated in waves 1, 3, and 4 and Johnny participated in waves 1, 2, 3, and 4. Does this make sense?
Thanks in advance for your assistance.