Secondary Data Collection
definition
Secondary data is information or data that has already been collected and recorded by someone else, usually for purposes other than the current research. Secondary data can be classified as written or electronic, and as internal or external.
Internal sources are built up and maintained by the organisation or institution for which the researcher is working. They are available only to members of that organisation.
External sources are outside the organisation or institution.
Data mining describes the process of uncovering knowledge from databases stored in data warehouses, with the aim of identifying valid, novel, useful and ultimately understandable patterns or trends in the data and inferring rules about them.
external written sources
publishers of books, reports from governments and supranational institutions (IMF, OECD, EU, World Bank, UN), professional and trade associations, newspapers, magazines and organisational reports
external electronic sources
government websites and statistical offices (e.g. the Australian Bureau of Statistics (ABS)), as well as other forms of online databases
internal electronic sources
business information systems, as well as accounting, sales and CRM records.
advantages
Saves time and money; the data is often of high quality and easily accessible, so analysis can start immediately. Well-respected institutions often have better access to information providers, large budgets for data collection and many experts involved in the process.
Disadvantages
The data may not be recent, may not provide the exact requirements and answers needed for the research question, might not cover the same population and might not be reliable or detailed enough. The researcher has to question the secondary data's purpose, scope, authority, audience and format to determine how suitable it is for the research question.
Data mining
Sample: The researcher must decide whether to use the entire dataset or a sample of the data. If the database in question is not large, if the processing power is high, or if it is important to understand patterns for every record in the database, sampling should not be done.
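A minimal sketch of this sampling decision in Python with pandas; the file name, the 50,000-record cut-off and the column layout are illustrative assumptions, not from the source:

```python
import pandas as pd

# Load the full warehouse extract (file name is a placeholder assumption).
df = pd.read_csv("customer_records.csv")

# Illustrative rule of thumb: only sample when the dataset is so large that
# working with every record is impractical for the available processing power.
LARGE_DATASET_THRESHOLD = 50_000  # assumed cut-off

if len(df) > LARGE_DATASET_THRESHOLD:
    # Draw a reproducible random sample for the remaining data mining steps.
    working_set = df.sample(n=LARGE_DATASET_THRESHOLD, random_state=42)
else:
    # Small database / ample processing power: keep every record.
    working_set = df
```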
Explore: Both visual and statistical exploration can be used to identify trends. Look for outliers to decide whether the data needs to be cleaned, cases need to be dropped or a larger sample needs to be drawn.
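A possible exploration step, continuing the sketch above: summary statistics plus histograms, and a simple z-score rule to flag outliers. The amount_spent column and the 3-standard-deviation cut-off are assumptions for illustration:

```python
import matplotlib.pyplot as plt

# Statistical exploration: summary statistics for every numeric column.
print(working_set.describe())

# Visual exploration: distribution of each numeric column.
working_set.hist(figsize=(10, 6))
plt.show()

# Flag potential outliers: values more than 3 standard deviations from the mean.
col = "amount_spent"  # assumed numeric column
z_scores = (working_set[col] - working_set[col].mean()) / working_set[col].std()
outliers = working_set[z_scores.abs() > 3]
print(f"{len(outliers)} potential outliers to inspect, clean or drop")
```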
Modify: If important constructs are discovered, new factors may be introduced to categorise the data into these groups. In addition, variables based on combinations of existing variables may be added, recoded, transformed or dropped.
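A sketch of the modify step: deriving a new variable from a combination of existing ones, recoding a continuous variable into categories, and dropping a redundant variable. The column names and band boundaries are illustrative assumptions:

```python
import pandas as pd

# New variable derived from a combination of existing variables
# (assumed columns: total_spend, n_purchases).
working_set["avg_basket"] = working_set["total_spend"] / working_set["n_purchases"]

# Recode a continuous variable into discovered groups (assumed cut points).
working_set["spend_band"] = pd.cut(
    working_set["total_spend"],
    bins=[0, 100, 500, float("inf")],
    labels=["low", "medium", "high"],
)

# Drop a variable that adds nothing to the analysis (assumed redundant column).
working_set = working_set.drop(columns=["legacy_id"])
```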
Model: A model is created to test the pattern or trend discovered.
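One way to model a discovered pattern, here with a scikit-learn decision tree chosen purely for illustration; the feature and target columns are assumptions, and part of the data is held back for the assessment step:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed feature and target columns from the modified dataset.
X = working_set[["avg_basket", "n_purchases"]]
y = working_set["responded_to_campaign"]

# Hold back 30% of the records so the model can be assessed on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)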
Assess: When assessing the data, we have to look at accuracy and reliability and do a reality check.
Accuracy entails avoiding assumptions about what the data measures.
Reliability concerns the extent to which the information obtained is independent of the particular setting, i.e. whether the assumptions made hold in different circumstances.
A reality check is done to prevent overfitting, where the model is fitted more and more closely to the data by making certain assumptions until the data fit perfectly, but the model no longer offers any generalisable results.
Capitalisation on chance occurs when researchers fit thousands of models and then select the one that offers the best results. We also need to be clear that a statistically significant result is not necessarily practically relevant to the research question; e.g. a difference in expenditure of $0.50 doesn't justify a targeted marketing campaign.
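A hedged sketch of this assessment, continuing the modelling example: compare training and holdout performance to catch overfitting, and check whether a detected difference is large enough to matter practically. The $0.50 threshold echoes the note above; the column names remain assumptions:

```python
from sklearn.metrics import accuracy_score

# Reality check: a model that fits the training data far better than unseen
# data is overfitted and will not generalise.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, holdout accuracy: {test_acc:.2f}")

# Practical relevance check: a statistically significant difference in mean
# expenditure may still be too small to justify a targeted campaign.
diff = (
    working_set.groupby("responded_to_campaign")["total_spend"]
    .mean()
    .diff()
    .iloc[-1]
)
MIN_MEANINGFUL_DIFF = 0.50  # assumed dollar threshold for practical relevance
if abs(diff) < MIN_MEANINGFUL_DIFF:
    print("Difference is too small to be practically relevant")
```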