BUSI652 Data Analysis Assignment
BUSI652 – Assignment 1
Weightage: 15% of the final grade
Submission deadline: Friday, May 09, 2023 @ 11:59PM (PST)
PROJECT #1: DATA ANALYSIS
This project should be done independently.
We’ve identified the following sources of data that we recommend using for your project.
1. Poverty Statistics
- Download link: Poverty Data
- Source: World Bank Data
- Description: For countries with an active poverty monitoring program, the World Bank, in collaboration with national institutions, other development agencies, and civil society, regularly conducts analytical work to assess the extent and causes of poverty and inequality, examine the impact of growth and public policy, and review household survey data and measurement methods. Data here includes poverty and inequality measures generated from analytical reports, from national poverty monitoring programs, and from the World Bank’s Development Research Group, which has been producing internationally comparable and global poverty estimates and lines since 1990.
2. Consumer Complaints
- Download link: Consumer Complaints (zipped csv file)
- Source: Consumer Complaint Database
- Description: A database of complaints the Consumer Financial Protection Bureau has received about financial products and services.
3. USA’s Consumer Price Index
- Download link: historicalcpi.xls
- Source: United States Department of Agriculture Economic Research Service
- Description: The Consumer Price Index (CPI) for food is a component of the all-items CPI. The CPI measures the average change over time in the prices paid by urban consumers for a representative market basket of consumer goods and services. While the all-items CPI measures the price changes for all consumer goods and services, including food, the CPI for food measures the changes in the retail prices of food items only.
4. Indicators on Women and Men
- Download links:
- Women legislators and managers
- Source: United Nations Statistics Division (UNSD)
- Description: The Indicators on Women and Men provides the latest statistics and indicators on women and men in six specific fields of concern: population, women and men in families, health, education, work, and political decision making. The statistics and indicators refer to the latest year for which sex-disaggregated data are available. The data have been compiled from official national sources as well as international sources.
5. Startups: Funding and Acquisitions
- Download link: Data on Startup Companies, Investments, and Acquisitions (zipped folder with many csv files included)
- Description: Crunchbase data contains crowdsourced information on a large number of startups, including who invested in them and how much. Data includes Companies across the world that have raised money, Investors (individual and institutional) that have invested in those companies, Funding rounds of investment, and records of all acquisitions of these startups. Other information about the companies (e.g., category, location) is also included.
6. Crime and Socioeconomic Indicators
- Download links:
- Crimes – 2001 to present (Chicago) (Press “Export”)
- Census Data – Selected socioeconomic indicators in Chicago, 2008
- 2012 Crime in the United States (USA)
- Source: City of Chicago, census.gov (via Big Data for Social Good Challenge)
- Description:
- Crimes – Reported incidents of crime (except murders) in the city of Chicago from 2001 to present, minus the most recent seven days.
- Small Area Income and Poverty Estimates – The files in the data directory contain estimates of poverty and income for 2013. There is one data file for each state and for the US, with data for all the 2013 statistics. Additionally, there is one file that includes data for the US and each state and county.
- Census Data – A selection of six socioeconomic indicators of public health significance, and a hardship index.
(Example area to explore: Is there a relationship between the number of crimes in the city of Chicago and socioeconomic indicators such as median income, poverty, and education?)
7. New York City
- Download links:
- New York City Open Data
- New York City Restaurant Inspection Results
- Description: New York City Open Data contains data on a wide variety of NYC aspects (e.g., education, safety, recreation, and many more). New York City Restaurant Inspection Results captures restaurant inspections, violations, grades, and adjudication information in NYC.
8. Walmart Sales
- Download links:
- stores.csv, features.csv, train.csv (download all button)
- Source: Kaggle Competition
- Description: Historical sales data for 45 Walmart stores in different regions. (Example area to explore: the effect on sales of weather, temperature, fuel consumption, the holiday season, and other factors.)
9. Health Indicators
- Download link: Indicators – we recommend any of the datasets in the Health section
- Source: World Bank Data
- Description: Worldwide health data covering factors such as fertility rates, HIV, immunization, population, life expectancy, birth rates, death rates, and many more.
The above datasets are all relatively easy to access and the data is consistent and complete. Two additional pages you might take a look at have links to many other datasets and include descriptive and motivational material:
- 100+ Interesting Data Sets for Statistics
- 19 Free Public Data Sets For Your First Data Science Projects
Be warned: In data analysis projects, it’s common for more than 90% of the overall effort to be in obtaining the data and getting it ready for analysis, with less than 10% going into the analysis itself. We’re trying to alleviate this imbalance and the attendant frustration by providing a menu of datasets we know are not difficult to work with.
Data formats: Most of the data listed above is provided in .xlsx (Excel) or .csv format. We’ve created a set of instructions for converting data from .xlsx to .csv format (for use in Google Sheets or Python programs), and from .csv to .db format (for use with SQL): Data Format Conversion.
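As an illustration of the .csv to .db step, the sketch below loads a CSV file into a SQLite database using only the Python standard library. It is not the course's Data Format Conversion procedure; the function names `csv_to_sqlite` and `quote_identifier` are hypothetical, and the sketch assumes a UTF-8 file with a header row and stores every column as text.

```python
import csv
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so spaces and punctuation are accepted by SQLite."""
    return '"' + name.replace('"', '""') + '"'


def csv_to_sqlite(csv_path, db_path, table_name):
    """Load a CSV file with a header row into a SQLite table and return the row count.

    Assumption: every column is stored as TEXT; cast values in SQL when needed.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)   # first row holds the column names
        rows = list(reader)     # remaining rows are the data

    table = quote_identifier(table_name)
    columns = ", ".join(f"{quote_identifier(c)} TEXT" for c in header)
    placeholders = ", ".join("?" for _ in header)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"DROP TABLE IF EXISTS {table}")
            conn.execute(f"CREATE TABLE {table} ({columns})")
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    finally:
        conn.close()
    return len(rows)
```

A call such as `csv_to_sqlite("poverty.csv", "poverty.db", "poverty")` (file and table names are placeholders) would create a `poverty` table that can then be queried with SQL.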
Tools and techniques: You are welcome to use any of the tools and techniques learned in class and practiced on the assignments, or you may use other tools and techniques that you’re familiar with.
The project is intentionally a bit vague and open-ended; we’re looking for you to show initiative and inventiveness. Try to find something in the data that other students are not likely to find!
1. Project Proposal
The main purpose of the proposal is for us to check whether the scope of the project is in the range of what we’re expecting, whether your plans are crisp enough, and, in cases where you plan to use a dataset other than one from the list above, whether it looks suitable and promising. On average we expect proposals to be about half a page long, though we know the lengths will vary. Please create a document containing the following two parts.
- State what data you plan to use
- Describe the data. As part of this, please include the total size of the dataset (e.g., number of rows) and a small sample of the data.
- Include a link to the source of the data, and discuss any difficulties you anticipate getting the data ready for analysis.
- Formulate a specific set of questions you want to answer, points you want to make, or issues you wish to explore through the data. Be as concrete as possible.
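For the dataset description, the row count and a small sample can be gathered with a few lines of Python. The sketch below is one possible approach using only the standard library; the function name `describe_csv` and the `sample_size` parameter are illustrative, and it assumes a UTF-8 CSV file whose first row is a header.

```python
import csv
from itertools import islice


def describe_csv(csv_path, sample_size=5):
    """Return (column names, number of data rows, first few rows) for a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Keep the first few rows as the sample to paste into the proposal.
        sample = list(islice(reader, sample_size))
        # Count the remaining rows without holding them all in memory.
        row_count = len(sample) + sum(1 for _ in reader)
        columns = reader.fieldnames
    return columns, row_count, sample
```

Calling `describe_csv("train.csv")` (a placeholder path) returns the column names, the total number of data rows, and the first five rows as dictionaries.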
What To Turn In
Your proposal should be in a PDF document named project1_proposal.pdf. Include clearly at the top of the document the name(s) and SUID(s) for the student submitting the proposal, then include the two parts of the proposal specified above. Upload the PDF document along with the complete project.
2. Complete Project
Use techniques and tools such as (but not limited to) those covered in class to manipulate, analyze, and possibly visualize the data in order to achieve your objectives. Here are a few tips and techniques:
1. How to implement data mining in Excel
2. How to treat missing values
3. How to import data from a website to Excel
It is likely you will end up developing a data processing pipeline, where in each step you transform or otherwise manipulate some or all of your data to get it into a form that’s suitable for the next step. In the final step your data should be in the best form to answer your questions or otherwise achieve your objectives.
In many cases the early steps in a pipeline are more about preparing the data — correcting mistakes, filling in missing values, creating consistent representations, mapping corresponding values — while the later steps are more focused on summarization and analysis. If you use one of the recommended datasets, your preparation steps may be minimal.
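The pipeline idea described above can be sketched as a chain of small functions, each taking the output of the previous one. This is only an illustration under stated assumptions: the `country` and `rate` field names are hypothetical, rows are plain dictionaries, and filling missing values with the mean is just one of several reasonable choices.

```python
from statistics import mean


def clean_rows(rows):
    """Preparation: trim whitespace and give country names one consistent casing."""
    return [{**row, "country": row["country"].strip().title()} for row in rows]


def parse_rates(rows):
    """Preparation: convert rate strings to floats; blank or invalid values become None."""
    parsed = []
    for row in rows:
        try:
            rate = float(row["rate"])
        except (TypeError, ValueError):
            rate = None
        parsed.append({**row, "rate": rate})
    return parsed


def fill_missing(rows):
    """Preparation: replace missing rates with the mean of the observed rates."""
    observed = [row["rate"] for row in rows if row["rate"] is not None]
    fallback = mean(observed) if observed else None
    return [
        {**row, "rate": fallback if row["rate"] is None else row["rate"]}
        for row in rows
    ]


def summarize(rows):
    """Analysis: average rate per country, skipping rows that still lack a rate."""
    by_country = {}
    for row in rows:
        if row["rate"] is not None:
            by_country.setdefault(row["country"], []).append(row["rate"])
    return {country: mean(rates) for country, rates in by_country.items()}


def run_pipeline(rows, steps):
    """Apply each step in order, feeding its output to the next step."""
    data = rows
    for step in steps:
        data = step(data)
    return data
```

The steps are passed in order, for example `run_pipeline(raw_rows, [clean_rows, parse_rates, fill_missing, summarize])`. Keeping each step separate makes it easier to inspect intermediate results and to describe the steps in the writeup.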
Jupyter notebooks can be a convenient method for constructing and maintaining data processing pipelines, which may include Python and/or SQL processing, but we are not requiring Jupyter for the project. If you plan to include spreadsheet manipulations then you will need to work outside of Jupyter regardless.
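Whether in a notebook or a plain script, a SQL step can be run from Python through the standard `sqlite3` module. The sketch below is an assumption-laden example, not a required approach: the `poverty` table, its columns, and the function name `average_rate_by_country` are hypothetical.

```python
import sqlite3


def average_rate_by_country(rows):
    """Load (country, year, rate) tuples into an in-memory table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE poverty (country TEXT, year INTEGER, rate REAL)")
        conn.executemany("INSERT INTO poverty VALUES (?, ?, ?)", rows)
        query = """
            SELECT country, AVG(rate) AS avg_rate, COUNT(*) AS n
            FROM poverty
            GROUP BY country
            ORDER BY country
        """
        # Each result row is a tuple: (country, average rate, number of rows).
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The returned list of tuples can be passed directly to later Python steps, which is one way to mix SQL and Python processing in a single pipeline.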
What To Turn In
You will be turning in a single PDF writeup to Gradescope.
The writeup should include parts 1 and 2 from the project proposal, discuss in reasonable detail how you went about your analysis, and finally (and most importantly) discuss the conclusions drawn from your data-driven study. On average we expect the writeups to be about 3-5 pages long, though we know the lengths will vary. Data visualizations can be pasted into the writeup, but it is likely you will need to include other artifacts such as spreadsheets, scripts, or Jupyter notebooks to document your analysis. At the end of your writeup, include a section titled Description of Files Used that lists all the artifacts that you used to generate the analysis and visualizations, with a clear description of what each one contains. For example:
- poverty_data_processing.py - This Python script performs the initial data cleaning and processing
- poverty_analysis.ipynb - This Jupyter notebook performs the main data analyses, using both SQL queries and Python data manipulation
- poverty_visualizations.xlsx - This spreadsheet performs additional data manipulations and contains the final visualizations
Here is a guideline for the sections in the main writeup:
- Include clearly at the top of the document the name(s) and SUID(s) for the student or student-pair submitting the project.
- Dataset: as in project proposal (possibly modified based on feedback)
- Goals: as in project proposal (possibly modified based on feedback)
- Data processing: Description of steps that were taken from raw data to final results
- Visualizations: when relevant
- Conclusions: resolution of questions, issues, or points from part 2, based on your study
- Description of Files Used
Upload the PDF document under the Assignment 1 link provided.