In the course of my work, I am often times dealing with large CSV files that need to be broken up into smaller files, for whatever reason. If the CSV file has a header, I can’t simply use something that breaks up the file by number of lines because I need to retain the header for use in each file. So, I wrote up a script that I call Split-Csv.
Split-Csv takes a file path and a batch size as parameters, Path and BatchSize, respectively:
Split-Csv <InputFilePath> -BatchSize <int>
Script contents:
Param(
[Parameter(Mandatory=$True,Position=1,HelpMessage="File path")]
[string]$Path,
[Parameter(Mandatory=$True,HelpMessage="Batch size")]
[int]$BatchSize
)
If(!(Test-Path $Path -ErrorAction SilentlyContinue)) {
Write-Error "$Path not found."
Return
}
$inputFile = Get-Item $Path
$ParentDirectory = Split-Path $Path
$strFilename = dir $Path | Select-Object BaseName,Extension
$strBaseName = $strFilename.BaseName
$strExtension = $strFilename.Extension
$objFile = Import-Csv $Path
$Count = [math]::ceiling($objFile.Count/$BatchSize)
$Length = "D" + ([math]::floor([math]::Log10($Count) + 1)).ToString()
1 .. $Count | ForEach-Object {
$i = $_.ToString($Length)
$Offset = $BatchSize * ($_ - 1)
$outputFile = $ParentDirectory + "\" + $strBaseName + "-" + $i + $strExtension
If($_ -eq 1) {
$objFile | Select-Object -First $BatchSize | Export-Csv $outputFile -NoTypeInformation
} Else {
$objFile | Select-Object -First $BatchSize -Skip $Offset | Export-Csv $outputFile -NoTypeInformation
}
}
Easy enough.
There are a couple of things of note.
1) I just drop these back into the same directory as the original; and,
2) I name the output files the same as the original appended with the number represented by their batch, zero-padded.
That’s it.
Nice script, worked as expected, took me a little while to figure out what the “$Length = “D” + ([math]::floor([math]::Log10($Count) + 1)).ToString()” was doing, but after debugging it became clear
LikeLike
@Steve V, the problem that I tend to have is that filenames are sorted alphabetically. So, I wanted it to be zero padded so that the number of digits was a consistent length and would sort properly. For those that read this later, the PowerShell formatting options offer the ability to pad digits by using D followed by the number of digits to pad; so if you need 5 digits in total and you have a number like 27, D5 will cause it to be formatted as 00027. Using Log10 with the “floor” average let’s me figure out the maximum number of digits that will be necessary.
LikeLike