General principles

Organising your data

In order for you and others to analyse and provide meaning to your data, it’s important that you adopt good practices right from the start. Some of the things you need to consider are listed below.

Consistency

When entering data, ensure that all data elements having the same value are entered in exactly the same way each time. For example, if “CSIRO” is entered as a value, it’s important that “CSIRO” is used consistently across your entire dataset, and not “CSIRO” in one place, “C.S.I.R.O.” in another and “Commonwealth Scientific and Industrial Research Organisation” in yet another. Inconsistently entered data makes it difficult for a computer to match equivalent values and can lead to errors in your data analysis and presentation.
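As a minimal sketch of how inconsistent entries might be cleaned up after the fact, the Python snippet below maps known variants onto a single canonical value. The CANONICAL table and the normalise function are hypothetical names chosen for illustration; a real dataset would need its own list of variants.

```python
# Hypothetical lookup table: known variants (lower-cased) -> canonical value.
CANONICAL = {
    "csiro": "CSIRO",
    "c.s.i.r.o.": "CSIRO",
    "commonwealth scientific and industrial research organisation": "CSIRO",
}


def normalise(value: str) -> str:
    """Return the canonical form of a value, or the trimmed value if unknown."""
    # Collapse repeated whitespace and ignore case before looking the value up.
    key = " ".join(value.split()).lower()
    return CANONICAL.get(key, value.strip())


entries = [
    "CSIRO",
    "C.S.I.R.O.",
    " csiro ",
    "Commonwealth Scientific and Industrial Research Organisation",
]
cleaned = [normalise(entry) for entry in entries]
print(cleaned)
```

Entering the value consistently in the first place is still preferable, because a lookup table like this can only correct the variants someone has already noticed.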

Labels given to data elements

Labels given to each data element need to make sense, not just to you but also your intended audience.

Abbreviations

If you use abbreviated labels, include the full name (and description where appropriate) of the data element in your documentation. For example, the label “ID” isn’t very informative, but “OrganisationID” immediately tells the reader that this data element identifies the organisation represented in your data, while “StaffID” identifies an individual staff member.

Spaces

Avoid using spaces in labels, as a computer can interpret a space as marking the end of a field.
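The short Python sketch below illustrates the problem and one possible remedy. It assumes a whitespace-delimited header line, and to_label is a hypothetical helper that joins the words of a label in the same style as “OrganisationID”.

```python
import re

# A whitespace-delimited header with a space inside one label.
header = "Staff ID Name"
fields = header.split()
print(fields)  # the label "Staff ID" has been split into two fields


def to_label(text: str) -> str:
    """Join the words of a label, capitalising the first letter of each word."""
    words = re.split(r"[\s_\-]+", text.strip())
    return "".join(word[:1].upper() + word[1:] for word in words if word)


print(to_label("organisation ID"))
print(to_label("date of birth"))
```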

Use of identifiers

An identifier is a reference number or string of characters that can be used to uniquely identify a particular data element within your dataset. Identifiers should be both unique (within your own system) and persistent (they should never change over time). Examples of identifiers you may already be familiar with are:

the DOI (Digital Object Identifier) for electronically published documents
your La Trobe University staff or student number.

Identifiers are particularly useful for matching and linking data across two or more datasets and can help overcome problems caused by inconsistent data entry resulting from natural (uncontrolled) language. For example, if a person’s name is spelt or entered slightly differently in several places within your dataset, a unique identifier for that person can be used to link the items associated with that person instead of their name.
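The following Python sketch shows the idea with made-up data. The staff numbers, names and titles are all hypothetical; the point is only that a lookup by identifier finds records that a lookup by name misses.

```python
# Hypothetical records: the same person appears under two spellings.
publications = [
    {"title": "Paper A", "author_name": "Jane Citizen", "staff_id": "S001"},
    {"title": "Paper B", "author_name": "J. Citizen", "staff_id": "S001"},
    {"title": "Paper C", "author_name": "Smith, John", "staff_id": "S002"},
]


def titles_by_name(name: str) -> list[str]:
    """Match on the free-text name; misses alternative spellings."""
    return [p["title"] for p in publications if p["author_name"] == name]


def titles_by_id(staff_id: str) -> list[str]:
    """Match on the identifier; unaffected by how the name was typed."""
    return [p["title"] for p in publications if p["staff_id"] == staff_id]


print(titles_by_name("Jane Citizen"))
print(titles_by_id("S001"))
```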

Controlled vocabularies

A controlled vocabulary is one where the language used is restricted to an authoritative list of terms. Common examples of controlled vocabularies are thesauri, glossaries, gazetteers, code lists and discipline-specific taxonomies. Controlled vocabularies are useful for several reasons:

discipline-specific vocabularies are generally well-understood by researchers in that field
similar terms can be grouped together more easily when analysing the data, which would be very difficult to do if natural language were used
like identifiers, terms from a controlled vocabulary can be used to link or compare data in other datasets that use the same vocabulary.
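As an illustration, the Python sketch below checks entered values against a small vocabulary and then groups the valid ones. The VOCABULARY set and the sample observations are hypothetical; in practice the list of terms would come from an authoritative source for your discipline.

```python
from collections import Counter

# Hypothetical controlled vocabulary of permitted terms.
VOCABULARY = {"mammal", "bird", "reptile", "amphibian", "fish"}


def invalid_terms(values: list[str]) -> list[str]:
    """Return the distinct values that are not in the vocabulary, sorted."""
    return sorted({value for value in values if value not in VOCABULARY})


observations = ["bird", "mammal", "birds", "bird", "Reptile"]

# Values that would need correcting before analysis.
rejected = invalid_terms(observations)
print(rejected)

# Valid terms group cleanly because each concept has exactly one spelling.
counts = Counter(value for value in observations if value in VOCABULARY)
print(counts)
```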

Encoding and character sets

Consider the encoding and character set in which to save your data. Every data file is, of necessity, stored in a particular encoding and character set. The choice is usually made behind the scenes by the software used to create the data. That is, if the encoding is not set explicitly, a default is applied when the document or data file is saved.

Unicode/UTF-8 is the preferred encoding/character-set combination for ensuring durability of data, especially when dealing with multiple languages. The Microsoft Office applications (Word, Excel) deployed and available at La Trobe University will default to saving files in the preferred Unicode/UTF-8 encoding. For other applications, the encoding/character set can often be checked (and if necessary changed) via the File > Save As menu option.
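When data is written by a script rather than a desktop application, the encoding can be stated explicitly instead of relying on a default. The Python sketch below writes and re-reads a small CSV file as UTF-8, then shows what happens when UTF-8 bytes are read with the wrong encoding. The file name and sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data containing non-ASCII characters.
rows = [["name", "city"], ["Zoë", "Zürich"], ["北京", "Beijing"]]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.csv"

    # State the encoding explicitly when writing...
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)

    # ...and again when reading, so the result does not depend on defaults.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        restored = list(csv.reader(handle))

print(restored == rows)

# Reading UTF-8 bytes with a different encoding garbles the text.
garbled = "Zoë".encode("utf-8").decode("latin-1")
print(garbled)
```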

Completeness of the data

When collecting and storing data, always keep all of your data and never replace it with a summarised version that only includes averages or totals. While raw data may result in a file that is large and difficult for humans to read and interpret in its original form, there are programs that can display your data in a summarised form that shows averages and trends across your data elements. To maximise the power, effectiveness and flexibility of such an analysis, it’s important that the imported data is complete and has the finest granularity possible.
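A small Python sketch, using made-up measurements, shows why the raw values matter: any summary can be computed from them later, but a stored average cannot be turned back into the median or used to spot an unusual reading.

```python
from statistics import mean, median

# Hypothetical raw measurements, including one unusually high reading.
readings = [12.1, 11.8, 12.4, 30.2, 12.0]

# With the raw data, different summaries can be produced on demand.
summary = {
    "mean": mean(readings),
    "median": median(readings),
    "max": max(readings),
}
print(summary)

# The mean is pulled upwards by the single high reading; the median is not.
# If only the mean had been stored, that difference could not be detected.
outliers = [r for r in readings if r > 2 * summary["median"]]
print(outliers)
```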

Useful links

Encoding and character sets