I am working on a problem in which I have a very large dataset in the form of a CSV file. This CSV file has various columns, one of which is a state code. The dataset is significantly larger than I need it to be; I only need the rows from one state.
A solution I was thinking of would be to read the CSV file in Python and then write a new file with just the rows I need, removing 49 of the 50 U.S. states.
The CSV file has 3 million+ rows. I am new to Python and I am not sure how I can do this effectively. What are the best ways to complete this task?
Thank you for your help, and I apologize if this seems like a simple question.
Deleting all the other rows is different from extracting just the one state. Assuming what you want is to extract just the one state, here's one way you could do it:
    with open("file.csv", "r") as file:
        for line in file:
            sline = line.split(",")
            if sline[n] == statecode:
                # n is the position of the state-code column, zero-indexed,
                # and statecode is your target state code
                pass  # this is where you do something with the data
With this pattern you can implement pretty much any solution that involves stepping through the CSV line by line: writing the matching rows to a new CSV, working with the data in memory, feeding a dask pipeline, and so on.
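For the writing-to-a-new-CSV variant, here is a sketch using the standard-library csv module instead of a plain split (so quoted fields containing commas are handled correctly). The function name, file names, column index, and state code below are all placeholders you would swap for your own:

```python
import csv

def filter_state(src_path, dst_path, state_col, target):
    """Copy only the rows whose state-code column equals target."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))   # pass the header row through unchanged
        for row in reader:
            if row[state_col] == target:
                writer.writerow(row)

# e.g. filter_state("file.csv", "ny_only.csv", state_col=2, target="NY")
```

Because this streams one row at a time, it never holds the whole 3-million-row file in memory.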
Your system may very well be able to handle directly importing the file into pandas or dask, if that's what you want. These are popular Python data-munging tools with built-in CSV readers and filtering.