Bulk download sas7bdat, convert to csv and store in local #406
-
Hi all, Hope I am raising my query at a right place. I use SAS for some data cleansing and then generate respective outputs in SAS work library. All these are done entirely in SAS. Now I am trying to place this data cleansing step as a part of my pythonic workflow. Ideally, in python, I will use saspy to clean data and then download these clean data into local as csv so I can read them into python later without needing to run saspy everytime. So far I managed to read my data cleansing sas code in .sas format into python and execute this .sas via saspy.submit(). By specifying specific cleansed sas dataset, I am able to use saspy.sasdata(), saspy.to_df() and then saspy.to_csv() in that sequence to download cleanse data sets one-by-one to my local. I currently have about 20 cleanse data and I would like to "bulk" download them to local and store as csv. Instead of doing it file-by-file, would like to know if there's a more efficient way such as specifying multiply sas datasets in one single saspy.sasdata() or even use saspy to download only the .sas7bdat in saspy.dirlist(saspy.workpath)? Appreciate any insights. |
Beta Was this translation helpful? Give feedback.
Replies: 4 comments 8 replies
-
Glad you're finding this useful, and you have the process working. I do have one question about the flow you described. I'm assuming the following; your SAS server is remote, and your cleansing code is SAS code, so the cleansing is happening in SAS. I think that's all correct from what you are describing. Sounds like your clean data sets end up in WORK. The one question is about where you are saying you use to_df() and to_csv(). SASData.to_df() will pull the dataset over from SAS to your client side in Python as a Pandas dataframe while SASData.to_csv() will use proc export on SAS to write out a .csv version, but that is on the server side. You could then use SASSession.download() to download that csv file to your client machine; directly to disk, not pulling it into Python. Pandas.write_csv(), would be a way to have Pandas write the dataframe from Python to your local filesystem as a .csv file, after you created the dataframe with sd2df() (to_df()). So, I'm thinking you're using Pandas to write out the csv file, not SAS, since you didn't mention using the download() method? As for doing any of this 'in bulk' for multiple files, whether it be sd2df, to_csv, or download, there's no a method that somehow works on some group of data sets or files. Given this is Python, it's just too easy to do that with on or two lines of code, specific to your use case. You can get the list of datasets you want to either pull over as dataframes, or files to download and simply iterate over that list calling the method in question. For instance, these 2 statements would give you the list of sas datasets in WORK, for either sd2df() methods to use, or the download() method to use: [x[0] for x in sas.list_tables('work') if x[1] == 'DATA'] [x for x in sas.dirlist(sas.workpath) if x.endswith('.sas7bdat')] If you used SASData.to_csv() then you would be downloading the .csv files, so get those filenames: You could simply iterate over those lists calling sd2df() or download(), depending upon which path you were trying to use (Having SAS write out csv on the server side, or pandas write it out on the client side). You could try both and see which one is faster, or gets you the .csv file that works the best.
Tom |
Beta Was this translation helpful? Give feedback.
-
Thanks, seeing the code helps. And, I'm afraid I left off a parameter on the to_csv() I hand typed in above; you do have to provide the file, as you said in number 2. log = sd.to_csv(sas.workpath+sd.table+'.csv')
>>> [x for x in sas.dirlist(sas.workpath) if x.endswith('.csv')]
['_df.csv'] I think there may be some confusion on the various csv methods. There is a to_csv() method on both the Pandas DataFrame object, as well as the SASData object. There's also a _CSV version of sasdata2dataframe, which isn't really solving anything for you in this case. That might just confuse things more. In your code above: data_in_server = sas.list_tables('work')
for row in data_in_server:
globals()[row[0]] = sas.sd2df_CSV(row[0],'work')
globals()[row[0]].to_csv(Path(csv_export_path,row[0]+".csv")) You don't have any SASData objects, you are using sasdata2dataframe (the _CSV version, but I would just use the regular version) to create Pandas DataFrames from the SAS data set and then using Pandas to_csv() method to write the DataFrame to a csv file (that is obviously happening on the client, as it's Python). The other case I was mentioning w/ download(), which is in the API doc, along with all the other saspy methods (https://sassoftware.github.io/saspy/api.html# scroll down, the methods are alphabetical) would use the SASPy methods to write a .csv file out on the SAS server, using proc export, and then the download() method to copy that server side csv file to your client filesystem. The different being who created the .csv file: Pandas or SAS. So, here's the two cases, based upon your code. They will be 'equivalent', though possibly not identical. data_in_server = sas.list_tables('work')
for row in data_in_server:
globals()[row[0]] = sas.sd2df(row[0],'work') # I would use the default case, MEM, as CSV isn't buying you anything here
globals()[row[0]].to_csv(Path(csv_export_path,row[0]+".csv"))
# and here's the download version using SAS to create the csv file on the SAS side, and downloading it to the client
data_in_server = sas.list_tables('work')
for row in data_in_server:
log = sas.write_csv(sas.workpath+row[0]+'.csv', row[0], 'work')
# if you had a SASData object it would use to_csv(), but that's just more work, so the above line is better in this case:
# sd = sas.sasdata(row[0],'work'); sd.to_csv(sas.workpath+sd.table+'.csv')
log2 = sas.download(Path(csv_export_path,row[0]+".csv"), sas.workpath+row[0]+'.csv') # this downloads the csv file I haven't run that code, just hand typed it in, so I hope I didn't make any mistakes like before. And, I don't know which case will be better for you, or if there's no significant difference. Just showing the possibilities. Hope this is more clear, |
Beta Was this translation helpful? Give feedback.
-
What access method are you using? And, what version of saspy? And, which method did you use; sd2df/pandas.to_scv, or SAS.to_csv/download? |
Beta Was this translation helpful? Give feedback.
-
You're using the COM access method. Yes, can you try using the IOM access method instead. It's much faster and more reliable. |
Beta Was this translation helpful? Give feedback.
You're using the COM access method. Yes, can you try using the IOM access method instead. It's much faster and more reliable.